# Handling Categorical Variables


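Types of categorical variables:
- **Nominal**: no inherent order, e.g. gender (male, female) → encode with a Label Encoder or a mapping dictionary
- **Ordinal**: the categories have a meaningful order
- **Cyclical**: ordered values that wrap around, e.g. Monday → Tuesday → ... → Sunday → Monday
- **Binary**: only two categories
- **Rare Category**: a category that appears very seldom (see section 5)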

## Rule of thumb:
- Fill NaN values with a placeholder string → convert all values to strings
    - `data[feat].fillna("NONE").astype(str)`
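
As a minimal sketch of this rule of thumb, the loop below applies it to every column of a small frame. The toy data and the column names `color` and `shape` are made up for illustration and are not part of the dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (hypothetical data)
data = pd.DataFrame({
    "color": ["Red", np.nan, "Blue"],
    "shape": ["Star", "Circle", np.nan],
})

for feat in data.columns:
    # Missing values become the explicit category "NONE",
    # then every value is cast to a string
    data[feat] = data[feat].fillna("NONE").astype(str)

print(data)
```

After this step, missing values are an ordinary category that any encoder can handle.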

In [1]:
import pandas as pd
from sklearn import preprocessing

In [2]:
df = pd.read_csv("../input/dataset/cat_train.csv")
df_backup = df.copy()

In [3]:
df.head()

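| | id | bin_0 | bin_1 | bin_2 | bin_3 | bin_4 | nom_0 | nom_1 | nom_2 | nom_3 | ... | nom_9 | ord_0 | ord_1 | ord_2 | ord_3 | ord_4 | ord_5 | day | month | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0.0 | 0.0 | F | N | Red | Trapezoid | Hamster | Russia | ... | 02e7c8990 | 3.0 | Contributor | Hot | c | U | Pw | 6.0 | 3.0 | 0 |
| 1 | 1 | 1.0 | 1.0 | 0.0 | F | Y | Red | Star | Axolotl | NaN | ... | f37df64af | 3.0 | Grandmaster | Warm | e | X | pE | 7.0 | 7.0 | 0 |
| 2 | 2 | 0.0 | 1.0 | 0.0 | F | N | Red | NaN | Hamster | Canada | ... | NaN | 3.0 | NaN | Freezing | n | P | eN | 5.0 | 9.0 | 0 |
| 3 | 3 | NaN | 0.0 | 0.0 | F | N | Red | Circle | Hamster | Finland | ... | f9d456e57 | 1.0 | Novice | Lava Hot | a | C | NaN | 3.0 | 3.0 | 0 |
| 4 | 4 | 0.0 | NaN | 0.0 | T | N | Red | Triangle | Hamster | Costa Rica | ... | c5361037c | 3.0 | Grandmaster | Cold | h | C | OZ | 5.0 | 12.0 | 0 |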


In [4]:
df.shape

(600000, 25)

## 1. Label Encoding
- Step 1: Fill NaN values since Label Encoding does not handle NaN values
- Step 2: Fit & Transform

### Note:
- Label Encoding 
    - CAN: be used directly with tree-based models (Decision Trees, Random Forest, Extra Trees) and any kind of boosted tree model (XGBoost, GBM, or LightGBM)
    - CANNOT: be used directly with linear models, SVMs, or neural networks, as they treat the integer codes as magnitudes and expect the data to be normalized (or standardized)
        - For these types of models, we can binarize the data: convert the categories to numbers, then convert each number to its binary representation. One feature is thus split into three features (columns) in this case, since the six categories fit in three bits

        ```Python
        #"Freezing": 0 => 0 0 0
        #"Warm"    : 1 => 0 0 1
        #"Cold"    : 2 => 0 1 0
        #"Boiling Hot" ...
        ```
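
The binarization described above can be sketched with plain NumPy bit operations. This is only an illustration under the assumption that the labels are already non-negative integers (for example, the output of a label encoder); the names `labels`, `n_bits` and `binary` are hypothetical.

```python
import numpy as np

# Integer labels as a label encoder might produce them (hypothetical values)
labels = np.array([0, 1, 2, 3, 4, 5])

# Number of bits needed to represent the largest label (3 bits for 0..5)
n_bits = max(int(labels.max()).bit_length(), 1)

# Extract each bit, most significant first: result has shape (n_samples, n_bits)
shifts = np.arange(n_bits - 1, -1, -1)
binary = (labels[:, None] >> shifts) & 1

print(binary)
```

Each row of `binary` is the bit pattern of one label, so label 2 becomes `0 1 0`, matching the comment block above.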

In [4]:
# Let's look at `ord_2`: a model cannot work with raw text, so we convert the categories to numbers
df.ord_2.value_counts()

Freezing       142726
Warm           124239
Cold            97822
Boiling Hot     84790
Hot             67508
Lava Hot        64840
Name: ord_2, dtype: int64

In [5]:
mapping = {
    "Freezing": 0,
    "Warm": 1,
    "Cold": 2,
    "Boiling Hot": 3,
    "Hot": 4,
    "Lava Hot": 5
}

df.loc[:, "ord_2"] = df.ord_2.map(mapping) #This is equivalent to Label Encoding
df.ord_2.value_counts()

0.0    142726
1.0    124239
2.0     97822
3.0     84790
4.0     67508
5.0     64840
Name: ord_2, dtype: int64

In [6]:
df = df_backup.copy()  # restore the original data

In [7]:
#Step 1: Fill NaN values since Label Encoding does not handle NaN values
#df.ord_2.isna().sum()
#df.ord_2[df.ord_2.isna()==True]
df.loc[:, "ord_2"] = df.ord_2.fillna("None")
#df.ord_2.value_counts()

lbl_enc = preprocessing.LabelEncoder()
lbl_enc.fit(df.ord_2.values)

df.loc[:,"ord_2"] = lbl_enc.transform(df.ord_2.values)

In [8]:
df.ord_2.value_counts()

2    142726
6    124239
1     97822
0     84790
3     67508
4     64840
5     18075
Name: ord_2, dtype: int64

In [13]:
lbl_mapping = dict(zip(lbl_enc.classes_, lbl_enc.transform(lbl_enc.classes_)))
print(lbl_mapping)

{'Boiling Hot': 0, 'Cold': 1, 'Freezing': 2, 'Hot': 3, 'Lava Hot': 4, 'None': 5, 'Warm': 6}


## 2. One Hot Encoding

In [10]:
df = df_backup.copy()

In [14]:
from sklearn import preprocessing

In [19]:
# initialize OneHotEncoder from scikit-learn
# keep sparse=False to get a dense array
# (scikit-learn 1.2 renamed this parameter to `sparse_output`; `sparse` was removed in 1.4)
ohe = preprocessing.OneHotEncoder(sparse=False)
ohe_example_dense = ohe.fit_transform(df['ord_2'].values.reshape(-1, 1))
# print size of the dense array in KiB (nbytes / 1024)
print(f"Size of dense array: {ohe_example_dense.nbytes/1024}")

Size of dense array: 32812.5


In [22]:
# initialize OneHotEncoder from scikit-learn
# keep sparse=True to get a sparse matrix
# (scikit-learn 1.2 renamed this parameter to `sparse_output`; `sparse` was removed in 1.4)
ohe = preprocessing.OneHotEncoder(sparse=True)
ohe_example_sparse = ohe.fit_transform(df['ord_2'].values.reshape(-1, 1))
# print size of the stored non-zero values only, in KiB
print(f"Size of sparse array: {ohe_example_sparse.data.nbytes/1024}")
full_size = (
    ohe_example_sparse.data.nbytes +
    ohe_example_sparse.indptr.nbytes + ohe_example_sparse.indices.nbytes
)
# print full size of this sparse matrix
print(f"Full size of sparse array: {full_size/1024}")

Size of sparse array: 4687.5
Full size of sparse array: 9375.00390625
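
The three printed sizes can be reproduced by hand, which shows where the savings come from. The sketch below assumes 7 one-hot columns (as implied by the dense size; presumably the six temperature levels plus a missing-value category), `float64` values, and a CSR matrix with `int32` indices. None of these are stated in the notebook, but together they match the printed numbers exactly.

```python
n_rows, n_categories = 600_000, 7   # assumed: 6 levels + 1 missing-value category
float_bytes, int_bytes = 8, 4       # assumed: float64 values, int32 CSR indices

# Dense: every cell is stored, zeros included
dense_kib = n_rows * n_categories * float_bytes / 1024

# Sparse `data`: exactly one stored 1.0 per row
sparse_data_kib = n_rows * float_bytes / 1024

# Full CSR footprint: values + column index per value + row pointers
sparse_full_kib = (
    n_rows * float_bytes
    + n_rows * int_bytes
    + (n_rows + 1) * int_bytes
) / 1024

print(dense_kib, sparse_data_kib, sparse_full_kib)
```

The dense array grows with the number of categories, while the sparse footprint depends only on the number of rows, so the gap widens for high-cardinality columns.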


## 3. Converting categorical variables to numerical variables

In [28]:
df = df_backup.copy()
df.shape

(600000, 25)

In [25]:
df[df.ord_2 == "Boiling Hot"].shape

(84790, 25)

In [26]:
df.groupby(["ord_2"])["id"].count()

ord_2
Boiling Hot     84790
Cold            97822
Freezing       142726
Hot             67508
Lava Hot        64840
Warm           124239
Name: id, dtype: int64

In [27]:
# Map each row's category to the number of rows sharing that category (count encoding)
# Boiling Hot  -> 84790
# Hot          -> 67508
df.groupby(["ord_2"])["id"].transform("count")

0          67508.0
1         124239.0
2         142726.0
3          64840.0
4          97822.0
            ...   
599995    142726.0
599996     84790.0
599997    142726.0
599998    124239.0
599999     84790.0
Name: id, Length: 600000, dtype: float64

- We can also group by multiple columns and count each combination
    - For example: counts obtained by grouping on `ord_1` and `ord_2`

In [31]:
df.groupby(["ord_1", "ord_2"])["id"].count().reset_index(name='count').head()

| | ord_1 | ord_2 | count |
|---|---|---|---|
| 0 | Contributor | Boiling Hot | 15634 |
| 1 | Contributor | Cold | 17734 |
| 2 | Contributor | Freezing | 26082 |
| 3 | Contributor | Hot | 12428 |
| 4 | Contributor | Lava Hot | 11919 |


In [32]:
df.groupby(["ord_1", "ord_2"])["id"].transform("count")

0         12428.0
1         19899.0
2             NaN
3         17373.0
4         15464.0
           ...   
599995    38233.0
599996    22718.0
599997    26082.0
599998    15734.0
599999    15634.0
Name: id, Length: 600000, dtype: float64
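
Note the `NaN` in the output above: `groupby` drops rows whose key contains a missing value, so `transform("count")` returns `NaN` for them. One possible way around this, sketched here on made-up toy data, is to fill missing values with a placeholder before grouping. The frame `toy` and the column `pair_count` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data with missing values in both grouping columns (hypothetical)
toy = pd.DataFrame({
    "id": range(6),
    "ord_1": ["Novice", "Novice", np.nan, "Expert", "Expert", "Novice"],
    "ord_2": ["Hot", "Hot", "Cold", "Cold", np.nan, "Cold"],
})

# Default behaviour: rows with NaN in any grouping column get NaN
raw_counts = toy.groupby(["ord_1", "ord_2"])["id"].transform("count")

# Filling first turns missing values into a regular category with its own count
filled = toy.fillna({"ord_1": "NONE", "ord_2": "NONE"})
toy["pair_count"] = filled.groupby(["ord_1", "ord_2"])["id"].transform("count")

print(raw_counts)
print(toy)
```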

## 4. Creating new features from these categorical variables

- We can combine categorical columns into a new feature
- Domain knowledge helps when choosing such combinations. If memory and CPU usage are not a concern, you can instead take a greedy approach: create many such combinations and let a model decide which features are useful enough to keep.
- `astype(str)` also converts NaN to the string `"nan"`, so missing values become part of the combined category (e.g. `nan_Freezing`)

In [20]:
df["new_feature"] = df.ord_1.astype(str) + "_" + df.ord_2.astype(str)
df["new_feature"].head()

0     Contributor_Hot
1    Grandmaster_Warm
2        nan_Freezing
3     Novice_Lava Hot
4    Grandmaster_Cold
Name: new_feature, dtype: object
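
The greedy approach mentioned above could look roughly like the following sketch, which builds every pairwise combination with `itertools.combinations`. The toy data, the `"NONE"` placeholder and the `a_b` column naming scheme are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "ord_1": ["Contributor", np.nan, "Novice"],
    "ord_2": ["Hot", "Freezing", np.nan],
    "nom_0": ["Red", "Blue", "Red"],
})
cat_cols = ["ord_1", "ord_2", "nom_0"]

# Fill missing values explicitly so they read "NONE" rather than "nan"
filled = toy[cat_cols].fillna("NONE").astype(str)

# One new string feature per pair of categorical columns
for a, b in combinations(cat_cols, 2):
    toy[f"{a}_{b}"] = filled[a] + "_" + filled[b]

print(toy.filter(like="_ord_2"))
```

The number of pairs grows quadratically with the number of columns, which is why memory and CPU usage matter for this approach.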

## 5. Rare category

- Suppose you have deployed a model that uses this column, and once the project is live you receive a category that was not present in the training data. Your model pipeline will then throw an error, and at that point there is nothing you can do about it.
- A **rare category** is a category which is not seen very often, or a new category that is not present in the training data

For example, for `ord_4`, we see that some values appear only a couple thousand times, and some appear almost 40K times. NaNs are also seen a lot.

In [33]:
df["ord_4"] = df.ord_4.fillna("NONE").astype(str)

- Define the criterion for calling a value “rare”. Let’s say a value in this column is rare if its count is less than 2000
- Wherever the value count for a category is less than 2000, replace the category with “RARE”. When it comes to test data, all new, unseen categories can then be mapped to “RARE”, and all missing values to “NONE”.

In [50]:
df.loc[df["ord_4"].map(df["ord_4"].value_counts()) < 2000, "ord_4"] = "RARE"

In [51]:
df["ord_4"].value_counts()

N       39978
P       37890
Y       36657
A       36633
R       33045
U       32897
M       32504
X       32347
C       32112
H       31189
Q       30145
T       29723
O       25610
B       25212
E       21871
K       21676
I       19805
NONE    17930
D       17284
F       16721
W        8268
Z        5790
S        4595
RARE     3607
G        3404
V        3107
Name: ord_4, dtype: int64
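
The cells above only transform the training data. A minimal sketch of how the same rule might be applied at test time is shown below: the set of frequent categories is learned from train, and everything outside it (including categories never seen before) becomes “RARE”. The toy series, the threshold `min_count = 2` and the variable names are hypothetical; the text above uses a threshold of 2000.

```python
import pandas as pd

# Toy train/test columns (hypothetical values); "Q" never appears in train
train = pd.Series(["A"] * 5 + ["B"] * 4 + ["C"] + [None] * 2, name="ord_4")
test = pd.Series(["A", "C", "Q", None], name="ord_4")
min_count = 2  # hypothetical threshold for this tiny example

# Same rule of thumb as before: missing values become the category "NONE"
train = train.fillna("NONE").astype(str)
test = test.fillna("NONE").astype(str)

# Learn the frequent categories from the training data only
counts = train.value_counts()
frequent = set(counts[counts >= min_count].index)

# Keep frequent categories, replace everything else with "RARE"
train_enc = train.where(train.isin(frequent), "RARE")
test_enc = test.where(test.isin(frequent), "RARE")

print(test_enc.tolist())
```

Because “RARE” already exists as a category in the encoded training data, a model trained on `train_enc` has seen it and can score such rows without failing.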