

---
title: "TabularPandas AttributeError: classes"
execute: 
  enabled: false
  freeze: true
date: "6/13/2024"
categories: [fastai, TabularPandas, error, python]

engine: jupyter
---


> Pass `procs=[Categorify]` to `TabularPandas(...)` whenever you have categorical columns.

:::{.callout-note}
This post was written using:
<br>- `pandas`: 2.2.2
<br>- `fastai`: 2.7.15
:::

In [1]:
import pandas as pd
from fastai.tabular.all import *

In [6]:
# look at the unique values in each column to separate categorical from continuous features
for k in df.keys():
    print(f"Column {k}:\n{Counter(df[k])}")
    print()

Column a:
Counter({2.0: 8435, 1.0: 2580, 3.0: 2011})

Column b:
Counter({12.0: 9853, 32.0: 2911, 80.0: 87, 16.0: 72, 11.0: 47, 8.0: 28, 10.0: 25, 40.0: 3})

Column c:
Counter({0: 1579, 5: 1517, 6: 1271, 7: 1213, 4: 1211, 8: 1081, 3: 898, 9: 831, 10: 682, 2: 608, 1: 342, 11: 278, 14: 216, 12: 206, 16: 184, 13: 165, 15: 145, 17: 131, 18: 94, 19: 88, 20: 66, 21: 57, 23: 54, 29: 24, 26: 18, 24: 16, 27: 16, 22: 12, 30: 12, 25: 7, 28: 4})

Column d:
Counter({15.0: 7521, -1.0: 1433, 8.0: 374, 0.0: 372, 7.0: 365, 9.0: 334, 10.0: 319, 6.0: 316, 5.0: 272, 11.0: 264, 4.0: 241, 12.0: 220, 3.0: 217, 13.0: 205, 14.0: 176, 2.0: 170, 1.0: 164, -3.0: 14, -2.0: 4, -4.0: 3, -27.0: 3, -5.0: 3, -22.0: 2, -26.0: 2, -30.0: 2, -16.0: 2, -7.0: 2, -17.0: 2, -21.0: 1, -23.0: 1, -25.0: 1, -28.0: 1, -31.0: 1, -32.0: 1, -53.0: 1, -56.0: 1, -57.0: 1, -58.0: 1, -59.0: 1, -88.0: 1, -93.0: 1, -98.0: 1, -38.0: 1, -8.0: 1, -11.0: 1, -14.0: 1, -18.0: 1, -9.0: 1, -10.0: 1, -43.0: 1, -49.0: 1, -6.0: 1})

Column e:
Counter({

In [7]:
# define categorical and continuous features
cat_names = ['a', 'b']
y_names = 'label'
cont_names = [c for c in df.keys() if c not in cat_names+[y_names]]


print('cat_names:',cat_names)
print('cont_names:',cont_names)
print('y_names:',y_names)


cat_names: ['a', 'b']
cont_names: ['c', 'd', 'e', 'f']
y_names: label


In [8]:
# split into train and validation sets
val_index = list(df.sample(frac=0.2, random_state=0).index) # 20% from total df
train_index = list(df[~df.index.isin(val_index)].index)

assert (len([i for i in train_index if i in set(val_index)])==0 
        and len([i for i in val_index if i in set(train_index)])==0), 'train and val set are overlapping!'

print('train set len', len(train_index))
print('val set len', len(val_index))

train set len 10421
val set len 2605
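
For readers who want to see the split logic in isolation, here is a minimal pure-Python sketch of the same idea: shuffle the row positions with a fixed seed, hold out 20% for validation, and confirm the two sets are disjoint. The helper `split_indices` is hypothetical (it is not part of `fastai` or `pandas`), and it assumes the DataFrame has the default `0..n-1` index so that positions and index labels coincide. It will not pick the same rows as `df.sample`, only the same proportions.

```python
import random

def split_indices(n_rows, valid_frac=0.2, seed=0):
    """Shuffle row positions and split them into disjoint train / validation lists."""
    rng = random.Random(seed)              # fixed seed keeps the split reproducible
    idx = list(range(n_rows))
    rng.shuffle(idx)
    n_valid = round(n_rows * valid_frac)   # size of the validation set
    valid_idx = sorted(idx[:n_valid])
    train_idx = sorted(idx[n_valid:])
    return train_idx, valid_idx

# 13026 rows is the total implied by the lengths printed above (10421 + 2605)
train_idx, valid_idx = split_indices(13026)

# a set-based check does the same job as the nested list comprehensions
assert set(train_idx).isdisjoint(valid_idx), "train and val set are overlapping!"

print("train set len", len(train_idx))
print("val set len", len(valid_idx))
```

`set.isdisjoint` walks each list once, so it stays fast even on large frames.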


### Error Example

In [43]:
# oh no, can't train!

dl = TabularPandas(df, 
                   cat_names=cat_names, 
                   cont_names=cont_names, 
                   y_names=y_names,
                   y_block = CategoryBlock(vocab=df[y_names]), 
                   splits=(train_index, val_index))

dls = dl.dataloaders(bs=64)
print(dls.show_batch())
learn = tabular_learner(dls, metrics=[accuracy])
learn.fit_one_cycle(3)

Unnamed: 0,a,b,c,d,e,f,label
0,1.0,12.0,7.0,9.0,3.9,0.6,0.0
1,2.0,12.0,3.0,15.0,4.1,0.3,0.0
2,2.0,12.0,4.0,-1.0,4.0,0.7,0.0
3,2.0,12.0,11.0,15.0,4.1,1.4,0.0
4,2.0,12.0,4.0,12.0,4.2,0.6,0.0
5,2.0,32.0,14.0,6.0,5.2,0.2,0.0
6,1.0,12.0,4.0,9.0,3.2,0.3,1.0
7,3.0,32.0,5.0,15.0,3.5,0.7,0.0
8,2.0,12.0,3.0,14.0,2.6,0.5,0.0
9,3.0,12.0,0.0,-2.0,4.1,-0.0,0.0


None


AttributeError: classes

How do we fix this?

I initially went down a rabbit hole and supplied `emb_szs` manually, as suggested by the source code that the error message points to. There is an easier way: just add `procs=[Categorify]` when initializing `TabularPandas`.

* In the source code, `emb_szs` is expected to be a dictionary of the form `{'class_name': size, ...}`, where `size` overrides the embedding size that `emb_sz_rule` would otherwise pick. For example, if column `a` is a categorical column in our df, I used `emb_szs = {'a': <number of unique values in column 'a'>}`.

### Working Example

In [44]:
# now we can train

dl = TabularPandas(df, 
                   cat_names=cat_names, 
                   cont_names=cont_names, 
                   y_names=y_names,
                   y_block = CategoryBlock(vocab=df[y_names]), 
                   splits=(train_index, val_index),
                   procs=[Categorify])  # <------ add procs!

dls = dl.dataloaders(bs=64)
print(dls.show_batch())
learn = tabular_learner(dls, metrics=[accuracy])
learn.fit_one_cycle(3)

Unnamed: 0,a,b,c,d,e,f,label
0,2.0,32.0,15.0,15.0,3.6,0.8,0.0
1,3.0,32.0,9.0,15.0,4.5,0.7,0.0
2,2.0,32.0,9.0,14.0,3.9,0.5,1.0
3,2.0,12.0,6.0,-1.0,4.0,1.0,0.0
4,2.0,12.0,8.0,15.0,3.7,0.9,0.0
5,2.0,12.0,4.0,15.0,3.7,0.7,1.0
6,1.0,12.0,0.0,15.0,4.0,0.7,0.0
7,3.0,12.0,0.0,15.0,3.3,0.5,0.0
8,2.0,12.0,5.0,15.0,4.3,0.9,0.0
9,2.0,12.0,9.0,15.0,4.3,0.9,0.0


None


epoch,train_loss,valid_loss,accuracy,time
0,0.48211,0.418015,0.832246,00:09
1,0.349364,0.333562,0.852591,00:07
2,0.322106,0.323014,0.854511,00:07


----
### Why did we get this error?

In [11]:
get_emb_sz??

Signature: get_emb_sz(to: 'Tabular | TabularPandas', sz_dict: 'dict' = None) -> 'list'
Source:   
def get_emb_sz(
    to:Tabular|TabularPandas, 
    sz_dict:dict=None # Dictionary of {'class_name' : size, ...} to override default `emb_sz_rule` 
) -> list: # List of embedding sizes for each category
    "Get embedding size for each cat_name in `Tabular` or `TabularPandas`, or populate embedding size manually using sz_dict"
    return [_one_emb_sz(

In [3]:
from fastai.tabular.model import _one_emb_sz
_one_emb_sz??

Signature: _one_emb_sz(classes, n, sz_dict=None)
Source:   
def _one_emb_sz(classes, n, sz_dict=None):
    "Pick an embedding size for `n` depending on `classes` if not given in `sz_dict`."
    sz_dict = ifnone(sz_dict, {})
    n_cat = len(classes[n])
    sz = sz_dict.get(n, int(emb_sz_rule(n_cat)))

The error comes from the line `get_emb_sz(dls.train_ds, {} if emb_szs is None else emb_szs)` in `tabular_learner`. The `get_emb_sz` function tries to return `[_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]`, and the lookup of `to.classes` fails because our dataloaders have no `classes` attribute.
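
To make the failure concrete, here is a minimal pure-Python sketch of that lookup. Everything with a `_sketch` suffix is a hypothetical stand-in, not the `fastai` implementation. The formula in `emb_sz_rule_sketch` and the `(n_cat, sz)` return value are my assumptions about what `emb_sz_rule` and the truncated part of `_one_emb_sz` do. The two `SimpleNamespace` objects stand in for a `TabularPandas` built with and without `Categorify`.

```python
from types import SimpleNamespace

def emb_sz_rule_sketch(n_cat):
    "Rule-of-thumb embedding width for `n_cat` classes (assumed formula)."
    return min(600, round(1.6 * n_cat ** 0.56))

def one_emb_sz_sketch(classes, n, sz_dict=None):
    "Pick an embedding size for column `n`, unless `sz_dict` overrides it."
    sz_dict = sz_dict or {}
    n_cat = len(classes[n])                       # needs the per-column class list
    sz = sz_dict.get(n, int(emb_sz_rule_sketch(n_cat)))
    return n_cat, sz                              # assumed return value

def get_emb_sz_sketch(to, sz_dict=None):
    "Mirror of the list comprehension quoted above."
    return [one_emb_sz_sketch(to.classes, n, sz_dict) for n in to.cat_names]

# stand-in for a TabularPandas built with procs=[Categorify]
with_categorify = SimpleNamespace(
    cat_names=["a", "b"],
    classes={"a": ["#na#", 1.0, 2.0, 3.0],
             "b": ["#na#", 8.0, 10.0, 11.0, 12.0, 16.0, 32.0, 40.0, 80.0]},
)
# stand-in for a TabularPandas built without any procs: no `classes` at all
without_procs = SimpleNamespace(cat_names=["a", "b"])

print(get_emb_sz_sketch(with_categorify))
print(get_emb_sz_sketch(with_categorify, {"a": 10}))   # manual override for column a

try:
    get_emb_sz_sketch(without_procs)
except AttributeError as e:
    print("AttributeError:", e)
```

Note that in this sketch the override dictionary is only consulted after `to.classes` has been read, so the missing attribute is hit first.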

Here, the `classes` attribute records which categories appear in each of our categorical columns. In `simple_df` below, we declare the `aa` column as a categorical feature with 3 distinct classes, `[1, 2, 3]`. The learner doesn't know this because we did not tell `TabularPandas` to `Categorify` the categorical columns when building our dataloaders.

In [None]:
simple_df = pd.DataFrame({'aa': [1, 2, 3, 1], 'bb':[1.1, 2.2, 3.3, 5.0], 'label':[1, 0, 1, 1]})
simple_df

Unnamed: 0,aa,bb,label
0,1,1.1,1
1,2,2.2,0
2,3,3.3,1
3,1,5.0,1


In [None]:
# without procs there is no `classes` attribute

TabularPandas(simple_df, 
              cat_names = ['aa'], 
              cont_names = ['bb'],
              y_names = ['label'],
              y_block = CategoryBlock(vocab=simple_df[y_names]), 
              splits = ([0,1,2], [3]),
             ).dataloaders(bs=64).classes


AttributeError: classes

In [None]:
# with Categorify we now have a `classes` attribute

TabularPandas(simple_df, 
              cat_names = ['aa'], 
              cont_names = ['bb'],
              y_names = ['label'],
              y_block = CategoryBlock(vocab=simple_df[y_names]), 
              splits = ([0,1,2], [3]),
              procs = [Categorify]
             ).dataloaders(bs=64).classes


{'aa': ['#na#', 1, 2, 3]}
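
As a rough mental model of where that dictionary comes from, the sketch below rebuilds it in plain Python: collect the sorted unique values of each categorical column from the training rows, put a missing-value token first, then map each value to its position. `build_classes` and `encode` are hypothetical helpers written for illustration. That the vocabulary is built from the training rows only, and that unseen values map to index 0, are assumptions of this sketch rather than statements about how `Categorify` is implemented.

```python
def build_classes(rows, cat_names, na_token="#na#"):
    """Collect sorted unique values per categorical column, missing-value token first."""
    classes = {}
    for name in cat_names:
        uniques = sorted({row[name] for row in rows if row[name] is not None})
        classes[name] = [na_token] + uniques
    return classes

def encode(rows, classes, name):
    """Replace each value with its position in the class list (0 for missing/unseen)."""
    lookup = {value: i for i, value in enumerate(classes[name])}
    return [lookup.get(row[name], 0) for row in rows]

# same `aa` values and split as simple_df: rows 0-2 for training, row 3 for validation
train_rows = [{"aa": 1}, {"aa": 2}, {"aa": 3}]
valid_rows = [{"aa": 1}]

classes = build_classes(train_rows, ["aa"])
print(classes)                            # {'aa': ['#na#', 1, 2, 3]}
print(encode(valid_rows, classes, "aa"))  # [1]
```

The length of each class list is what the embedding-size lookup needs, which is why the learner cannot be built without it.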

That's all for now, bye!