# Categorical data

**LabelEncoder** - encodes each category as an integer from 0 to the number of categories minus 1.

pros:
* easy to use
* number of features does not increase
* good for tree-based models

cons:
* imposes an order on the categories, which can introduce false information into the model

In [3]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder

In [2]:
data = pd.read_csv('../Data/student-mat.csv')

data['famsize'].unique()

array(['GT3', 'LE3'], dtype=object)

In [3]:
encoder = LabelEncoder()

data['famsize'] = encoder.fit_transform(data['famsize'])

data['famsize'].unique()

array([0, 1])
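
To make the ordering issue concrete, here is a minimal pure-Python sketch of what label encoding amounts to: sort the distinct categories and use each one's position as its code. `label_encode` is a hypothetical helper written for illustration and the job values are made up; in scikit-learn the sorted categories are stored in `LabelEncoder.classes_`.

```python
def label_encode(values):
    """Map each category to its index in the sorted list of distinct categories."""
    classes = sorted(set(values))
    mapping = {category: code for code, category in enumerate(classes)}
    return [mapping[v] for v in values], classes

# Made-up categorical values, used only for illustration.
codes, classes = label_encode(['teacher', 'health', 'other', 'health'])
print(classes)  # ['health', 'other', 'teacher']
print(codes)    # [2, 0, 1, 0]
```

The codes simply follow alphabetical order, so a model that treats them as numbers sees 'teacher' > 'other' > 'health' although no such ranking was intended.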

**OneHotEncoder** - creates one binary feature per category, so the number of new features equals the number of categories.

pros:
* easy to use
* good for linear models

cons:
* the new features are linearly dependent, but this can be fixed by dropping one of them
* number of features can grow significantly

In [4]:
from sklearn.preprocessing import OneHotEncoder

1st way: `OneHotEncoder` from scikit-learn

In [23]:
data['Medu'].unique()

array([4, 1, 3, 2, 0])

In [16]:
hot_encoder = OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2; older versions use sparse=False
new_features = hot_encoder.fit_transform(data['age'].values.reshape(-1,1))

tmp = pd.DataFrame(new_features, columns=['age' + str(i) for i in range(new_features.shape[1])])
data = pd.concat([data,tmp], axis=1)

In [17]:
data.head()

Unnamed: 0,sex,age,address,famsize,Pstatus,Medu,Fedu,traveltime,studytime,failures,...,G2,G3,age0,age1,age2,age3,age4,age5,age6,age7
0,F,18,U,0,A,4,4,2,2,0,...,6,6,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,F,17,U,0,T,1,1,1,2,0,...,5,6,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,F,15,U,1,T,1,1,1,2,3,...,8,10,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,F,15,U,0,T,4,2,1,3,0,...,14,15,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,F,16,U,0,T,3,3,1,2,0,...,10,10,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


2nd way: `pd.get_dummies` from pandas

In [28]:
pd.get_dummies(data['Medu'], prefix='Medu').head()

Unnamed: 0,Medu_0,Medu_1,Medu_2,Medu_3,Medu_4
0,0,0,0,0,1
1,0,1,0,0,0
2,0,1,0,0,0
3,0,0,0,0,1
4,0,0,0,1,0


In [29]:
pd.get_dummies(data['Medu'], prefix='Medu', drop_first=True).head()

Unnamed: 0,Medu_1,Medu_2,Medu_3,Medu_4
0,0,0,0,1
1,1,0,0,0
2,1,0,0,0
3,0,0,0,1
4,0,0,1,0
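
Why dropping one column loses no information: in a full one-hot encoding every row sums to 1, so any single column equals 1 minus the sum of the others. The sketch below checks this in pure Python; `one_hot` is a hypothetical helper for illustration, not the pandas or scikit-learn implementation.

```python
def one_hot(values, drop_first=False):
    """Return (categories, rows) of 0/1 indicators, one column per category."""
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows

medu = [4, 1, 1, 4, 3]  # first five Medu values from the tables above

_, full = one_hot(medu)
_, reduced = one_hot(medu, drop_first=True)

# Every full row sums to 1: this is exactly the linear dependence.
print([sum(row) for row in full])  # [1, 1, 1, 1, 1]

# The dropped first column can be recovered from the remaining ones.
recovered = [1 - sum(row) for row in reduced]
print(recovered == [row[0] for row in full])  # True
```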


**Hashing**

* Define a hash function $h:U\to \{1,2, \dots , B\}$, where $U$ is the set of categories and $B$ is the number of buckets.
* Calculate the hash of each category.
* Binarize the resulting values: $g_j(x)=[h(f(x))=j], j=1,\dots, B$


In [5]:
hash_space = 25
hashing_example = pd.DataFrame([{i: 0.0 for i in range(hash_space)}])

for s in ('job=student', 'marital=single', 'day_of_week=mon'):
    print(s, '->', hash(s) % hash_space)
    hashing_example.loc[0, hash(s) % hash_space] = 1
    
hashing_example

job=student -> 15
marital=single -> 4
day_of_week=mon -> 24


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


Pros: reduces the dimensionality of the feature space by mapping all categories into a fixed number of buckets.
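
One caveat about the example above: Python salts the built-in `hash()` for strings per process (unless `PYTHONHASHSEED` is fixed), so the bucket indices generally change between runs. Below is a minimal sketch of a reproducible alternative, assuming an MD5 digest is an acceptable hash; `stable_bucket` is a hypothetical helper name.

```python
import hashlib

def stable_bucket(s, n_buckets):
    """Deterministic bucket index in [0, n_buckets) derived from an MD5 digest."""
    digest = hashlib.md5(s.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

hash_space = 25
row = [0.0] * hash_space
for s in ('job=student', 'marital=single', 'day_of_week=mon'):
    row[stable_bucket(s, hash_space)] = 1.0

# Number of occupied buckets; fewer than 3 would mean two strings collided.
print(sum(row))
```

Collisions are the price of the fixed dimensionality: two different categories can land in the same bucket. scikit-learn's `FeatureHasher` provides a ready-made implementation of the same idea.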

**Encode category with a meaningful number** - encode each category by the number of samples it contains or by the mean value of another feature within this category.

Pros and Cons:
* works well if the number you choose is meaningful :)

In [53]:
data['address_new'] = data['address'].map(data.groupby('address').size())

In [55]:
data.head()

Unnamed: 0,sex,age,address,famsize,Pstatus,Medu,Fedu,traveltime,studytime,failures,...,G3,age0,age1,age2,age3,age4,age5,age6,age7,address_new
0,F,18,U,0,A,4,4,2,2,0,...,6,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,307
1,F,17,U,0,T,1,1,1,2,0,...,6,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,307
2,F,15,U,1,T,1,1,1,2,3,...,10,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,307
3,F,15,U,0,T,4,2,1,3,0,...,15,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,307
4,F,16,U,0,T,3,3,1,2,0,...,10,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,307
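
The cell above covers the count variant. The other variant mentioned, the mean of another feature within each category, can be sketched the same way. The toy values below are made up; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the dataset above.
toy = pd.DataFrame({
    'address': ['U', 'U', 'R', 'U', 'R'],
    'age':     [18, 17, 15, 16, 17],
})

# Count encoding: number of rows in each category.
toy['address_count'] = toy['address'].map(toy['address'].value_counts())

# Mean encoding by another feature: average age within each category.
toy['address_mean_age'] = toy.groupby('address')['age'].transform('mean')

print(toy)
```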


**Conjunction of categorical features** - instead of encoding a single categorical feature, you can concatenate several of them and encode the resulting feature.

Pros and Cons:
* can add additional information to the model

In [33]:
data['Pstatus' + '+' + 'Medu'] = data['Pstatus'].astype(str) + '+' + data['Medu'].astype(str)

In [35]:
data['Pstatus+Medu'].unique()

array(['A+4', 'T+1', 'T+4', 'T+3', 'T+2', 'A+3', 'A+2', 'T+0', 'A+1'],
      dtype=object)

**Conjunction of rare categories** - if several categories within one feature have very few samples, you can combine them into a single category. The fact that they are rare also carries additional information.
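
A minimal sketch of merging rare categories with pandas. The threshold `min_count`, the label `'other'`, the helper name `merge_rare` and the sample values are all assumptions chosen for illustration.

```python
import pandas as pd

def merge_rare(series, min_count, other_label='other'):
    """Replace categories seen fewer than min_count times with a single label."""
    counts = series.map(series.value_counts())
    return series.where(counts >= min_count, other_label)

# Made-up categorical values.
jobs = pd.Series(['teacher', 'teacher', 'health', 'services', 'teacher', 'farmer'])
print(merge_rare(jobs, min_count=2).tolist())
# ['teacher', 'teacher', 'other', 'other', 'teacher', 'other']
```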

**Counters**

Replace each value of a categorical feature with the average value of the target variable over all objects of the same category.

$$
g_j(x, X) = \frac{\sum_{i=1}^{\ell} [f_j(x) = f_j(x_i)][y_i = +1]}{\sum_{i=1}^{\ell} [f_j(x) = f_j(x_i)]}
$$

However, noise must be added to the counters to prevent leakage of the target variable into the training data.

Additionally, this function can be smoothed like this:

$$
g_j(x, X) = \frac{\sum_{i=1}^{\ell} [f_j(x) = f_j(x_i)][y_i = +1] + C \cdot \text{global\_mean}}{\sum_{i=1}^{\ell} [f_j(x) = f_j(x_i)] + C}
$$
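
A small worked example of the smoothed formula above, with made-up numbers, shows what $C$ does: it pulls counters of rare categories towards the global mean and leaves frequent categories almost unchanged. `smoothed_counter` is a hypothetical helper.

```python
def smoothed_counter(n_positive, n_total, global_mean, c):
    """(positives + C * global_mean) / (count + C), as in the smoothed formula."""
    return (n_positive + c * global_mean) / (n_total + c)

global_mean = 0.3  # made-up share of positive objects in the whole sample

# Rare category: 2 objects, both positive.
print(smoothed_counter(2, 2, global_mean, c=0))             # 1.0, the raw counter
print(round(smoothed_counter(2, 2, global_mean, c=10), 4))  # 0.4167

# Frequent category: 2000 objects, all positive; smoothing barely matters.
print(round(smoothed_counter(2000, 2000, global_mean, c=10), 4))  # 0.9965
```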

In [1]:
import numpy as np

def categorization(df_categorical, y, alpha=0, c=0):
    # Replace every column with more than two categories by its smoothed counter.
    # alpha is a constant offset added to each counter; c is the smoothing strength C.
    global_mean = np.mean(y)
    for column in df_categorical.columns:
        if df_categorical[column].nunique() > 2:
            cat_dic = dict()
            for category in df_categorical[column].unique():
                mask = df_categorical[column] == category
                # (sum of y in the category + C * global_mean) / (category size + C)
                counter = (np.sum(y[mask]) + c * global_mean) / (np.sum(mask) + c)
                cat_dic[category] = np.round(counter + alpha, 2)

            df_categorical[column] = df_categorical[column].map(cat_dic)
    return df_categorical

To improve the model, the following tricks can be used:
* generate paired categorical features and calculate counters for them
* calculate several counters for different C
* combine together rare categories and calculate counters for new categories.