# Handling categorical data

In this project I am going to work with the following dataset from Kaggle:

https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists

## Dataset description

Suppose that a company working with Big Data and Data Science wants to hire data scientists among people who have successfully passed some courses conducted by the company. Many people sign up for their training. The company wants to focus on the candidates who really want to work for them after training. Information related to demographics, education, and experience is provided by the candidates via a sign-up form.

This dataset is designed to help understand the factors that lead a person to leave their current job, which is useful for HR research. Based on the provided data, you are going to predict whether a candidate is looking for a job change.

The data is divided into train and test parts. It contains several categorical features, which need to be encoded.

## Feature description

- `enrollee_id`: Unique ID for a candidate
- `city`: City code
- `city_development_index`: Development index of the city (scaled)
- `gender`: Gender of a candidate
- `relevent_experience`: Relevant experience of a candidate
- `enrolled_university`: Type of University course enrolled in if any
- `education_level`: Education level of a candidate
- `major_discipline`: Education major discipline of a candidate
- `experience`: Candidate total experience in years
- `company_size`: Number of employees in a current employer's company
- `company_type`: Type of a current employer
- `last_new_job`: Difference in years between previous and current jobs
- `training_hours`: Training hours completed
- `target`: 0 – Not looking for job change, 1 – Looking for a job change


In [1]:
import numpy as np
import pandas as pd

In [2]:
train = pd.read_csv('data/aug_train.csv')
test = pd.read_csv('data/aug_test.csv')

Before modifying the features, let's study our dataframe a bit.

In [6]:
train.head()

| | enrollee_id | city | city_development_index | gender | relevent_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28120 | city_16 | 0.91 | NaN | Has relevent experience | no_enrollment | High School | NaN | 19 | 100-500 | Pvt Ltd | 2 | 20 | 0.0 |
| 1 | 31820 | city_21 | 0.624 | NaN | No relevent experience | no_enrollment | Masters | STEM | 2 | 50-99 | Early Stage Startup | 1 | 10 | 1.0 |
| 2 | 4277 | city_71 | 0.884 | Male | Has relevent experience | Part time course | Graduate | STEM | >20 | NaN | NaN | 2 | 6 | 1.0 |
| 3 | 3379 | city_159 | 0.843 | Male | Has relevent experience | no_enrollment | Masters | STEM | >20 | <10 | Pvt Ltd | >4 | 56 | 0.0 |
| 4 | 10821 | city_16 | 0.91 | Male | Has relevent experience | no_enrollment | Masters | STEM | 6 | NaN | NaN | 4 | 15 | 0.0 |


In [3]:
train.dtypes.value_counts()

object     10
int64       2
float64     2
Name: count, dtype: int64

In [8]:
train.isnull().sum()

enrollee_id                  0
city                         0
city_development_index       0
gender                    4508
relevent_experience          0
enrolled_university        386
education_level            460
major_discipline          2813
experience                  65
company_size              5938
company_type              6140
last_new_job               423
training_hours               0
target                       0
dtype: int64

Some features have a relatively small number of missing values. For instance, `'experience'` has only 65 missing values in the train data. Replacing these missing values with a special category might introduce some bias, but since so few observations are affected, the effect should be small.

So, let's replace the missing values in the `'experience'` feature with a special category, -1.

In [9]:
train.experience = train.experience.fillna(-1)

`'education_level'` is an ordinal feature. Let's apply the following mapping for the values of `'education_level'` feature:

* `np.nan` -> -1
* `'Primary School'` -> 0
* `'High School'` -> 1
* `'Graduate'` -> 2
* `'Masters'` -> 3
* `'Phd'` -> 4

So, at the same time I impute missing values in `'education_level'` with a new category -1.

In [10]:
map_education = {
    np.nan: -1,
    'Primary School': 0,
    'High School': 1,
    'Graduate': 2,
    'Masters': 3,
    'Phd': 4
}

train.education_level = train.education_level.map(map_education)

Feature `'relevent_experience'` is a binary feature. It has no missing values, which makes its encoding pretty easy.

Let's encode this feature with the following mapping:

*   `"No relevent experience"` -> 0
*   `"Has relevent experience"` -> 1

In [11]:
map_relevent_experience = {
    "No relevent experience": 0,
    "Has relevent experience": 1
}

train.relevent_experience = train.relevent_experience.map(map_relevent_experience)

In our case, `'gender'` is a nominal feature (notice that it is not a binary feature because it contains three categories plus NaNs). I will use One-Hot encoding to encode it.

In [12]:
df_gender_tr = pd.get_dummies(train.gender)
df_gender_tr.columns = ['gender-' + column for column in df_gender_tr.columns]

train = pd.concat([train, df_gender_tr], axis=1)
train.drop('gender', axis=1, inplace=True)
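
As a small illustration on made-up data (the `toy` series below is hypothetical and not taken from the dataset), here is how `pd.get_dummies` treats missing values by default: no indicator column is created for NaN, so such a row gets zeros in every indicator column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: three categories and one missing value
toy = pd.Series(['Male', 'Female', np.nan, 'Other'], name='gender')

# dtype=int produces 0/1 columns; NaN gets no column of its own by default
dummies = pd.get_dummies(toy, prefix='gender', prefix_sep='-', dtype=int)
print(dummies)

# The row that had the missing value contains only zeros
missing_row_sum = int(dummies.iloc[2].sum())
print(missing_row_sum)
```

If an explicit indicator for missing values is preferred, `pd.get_dummies` also accepts `dummy_na=True`.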

Let's also perform One-Hot encoding for the feature `'enrolled_university'`.

In [13]:
df_enrolled_university_tr = pd.get_dummies(train.enrolled_university)
df_enrolled_university_tr.columns = ['enrolled_university-' + column for column in df_enrolled_university_tr.columns]

train = pd.concat([train, df_enrolled_university_tr], axis=1)
train.drop('enrolled_university', axis=1, inplace=True)

Let's encode feature `'city'` using frequency encoding.

In [14]:
map_city = train.city.value_counts()

train.city = train.city.map(map_city)
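
One thing to keep in mind with frequency encoding is what happens to categories that never occur in train. The sketch below uses hypothetical toy series (`toy_train`, `toy_test`) to show that such a category is mapped to NaN. Filling it with 0 is just one possible choice and an assumption of this sketch, not something done in this notebook.

```python
import pandas as pd

# Hypothetical toy data: 'city_3' appears only in the test part
toy_train = pd.Series(['city_1', 'city_1', 'city_2'])
toy_test = pd.Series(['city_1', 'city_3'])

# Frequency mapping is computed on the train part only
freq = toy_train.value_counts()

# Unseen categories have no entry in the mapping and become NaN
encoded_test = toy_test.map(freq)
print(encoded_test)

# One possible treatment (an assumption): an unseen category has frequency 0
encoded_test_filled = encoded_test.fillna(0).astype(int)
print(encoded_test_filled)
```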

Let's encode feature `'last_new_job'` with target encoding with no modifications.

In [16]:
train.last_new_job = train.last_new_job.fillna(-1)

map_last_new_job = train.groupby('last_new_job').target.mean()
train.last_new_job = train.last_new_job.map(map_last_new_job)
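
To see what plain target encoding does, here is a minimal sketch on a hypothetical toy dataframe (the values are made up). Note that a category observed only once is encoded with exactly its own target value.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    'last_new_job': ['1', '1', '1', 'never', '>4'],
    'target':       [0,   1,   1,   1,       0],
})

# Mean target per category
means = toy.groupby('last_new_job')['target'].mean()

# Each observation is replaced with the mean target of its category
encoded = toy['last_new_job'].map(means)
print(encoded)
```

The two single-occurrence categories (`'never'` and `'>4'`) are mapped to 1.0 and 0.0, i.e. directly to their own targets, which is why unmodified target encoding can leak the target into the features.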

Let's encode feature `'experience'` with M-estimate encoding. In this case each category of `'experience'` will be mapped according to the following formula:

$$
\hat{x_{ij}} = \frac{\text{target}\left(j, x_{ij}\right) + m\times y_{\text{mean}}}{\text{count}\left(j, x_{ij}\right) + m}\quad,
$$

where

* $x_{ij}$ is the category being encoded,
* $\hat{x_{ij}}$ is its corresponding M-estimate encoding value,
* $\text{count}\left(j, x_{ij}\right)$ is the total number of times $x_{ij}$ appears in `train`,
* $\text{target}\left(j, x_{ij}\right)$ is the sum of the target values of the observations with the corresponding category,
* $y_{\text{mean}}$ is the mean target value over the whole of `train`,
* $m$ is a smoothing parameter.

Let's set $m = 0.5$. 

In [17]:
target_exp = train.groupby('experience').target.sum()
count_exp = train.groupby('experience').target.count()
y_mean = train.target.mean()
m = 0.5

map_experience = (target_exp + m * y_mean) / (count_exp + m)
train.experience = train.experience.map(map_experience)
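
A worked example may help to check the formula by hand. The toy data below is hypothetical: category `'a'` has targets 0, 1, 1 and category `'b'` has a single target 1, so $y_{\text{mean}} = 0.75$.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    'experience': ['a', 'a', 'a', 'b'],
    'target':     [0,   1,   1,   1],
})

toy_sum = toy.groupby('experience')['target'].sum()      # a: 2, b: 1
toy_count = toy.groupby('experience')['target'].count()  # a: 3, b: 1
toy_y_mean = toy['target'].mean()                        # 0.75
toy_m = 0.5

# M-estimate: (sum + m * y_mean) / (count + m)
toy_map = (toy_sum + toy_m * toy_y_mean) / (toy_count + toy_m)
print(toy_map)
```

For `'a'` this gives $(2 + 0.375) / 3.5 \approx 0.679$ and for `'b'` it gives $(1 + 0.375) / 1.5 \approx 0.917$: the rare category is pulled from 1.0 towards the global mean more strongly than the frequent one.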

Let's encode feature `'major_discipline'` with Leave-One-Out encoding. This technique is similar to target encoding, but here while computing the encoding for a particular observation, we exclude it from the target encoding formula. Therefore a category $x_{ij}$ for the $i$-th observation will be encoded according to the following formula:

$$
\hat{x_{ij}} = \frac{\text{target}\left(j, x_{ij}\right) - y_i}{\text{count}\left(j, x_{ij}\right) - 1}\quad,
$$

where

* $\hat{x_{ij}}$ is the corresponding Leave-One-Out encoding value,
* $\text{count}\left(j, x_{ij}\right)$ is the total number of times $x_{ij}$ appears in `train`,
* $\text{target}\left(j, x_{ij}\right)$ is the sum of the target values of the observations with the corresponding category,
* $y_i$ is the target value of the $i$-th observation.

In this method the same category can be encoded differently for different observations. Thus, after encoding the train part of data, we should create a mapping which will help to encode the test data.

In [19]:
target_md = train.groupby('major_discipline').target.sum()
count_md = train.groupby('major_discipline').target.count()

original_columns = train.columns
train['major_discipline_loo_encoded'] = (train.major_discipline.map(target_md) - train.target) / (train.major_discipline.map(count_md) - 1)
map_major_discipline = train.groupby('major_discipline').major_discipline_loo_encoded.mean()

train.drop('major_discipline', axis=1, inplace=True)
train.rename(columns={'major_discipline_loo_encoded': 'major_discipline'}, inplace=True)
train = train[original_columns]
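
The same steps can be traced on a hypothetical toy dataframe (the column name `major` and all values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    'major':  ['a', 'a', 'a', 'b', 'b'],
    'target': [0,   1,   1,   1,   0],
})

sums = toy.groupby('major')['target'].sum()      # a: 2, b: 1
counts = toy.groupby('major')['target'].count()  # a: 3, b: 2

# Leave-One-Out: remove the observation's own target from the sum and the count
loo = (toy['major'].map(sums) - toy['target']) / (toy['major'].map(counts) - 1)
print(loo.tolist())

# Mapping for the test data: mean of the LOO values per category
test_map = loo.groupby(toy['major']).mean()
print(test_map)
```

Within category `'a'` the observation with target 0 is encoded as 1.0, while the observations with target 1 are encoded as 0.5, so the same category indeed gets different values. Averaging the Leave-One-Out values over a category gives back the plain mean target of that category, so the test mapping coincides with ordinary target encoding.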

Let's encode feature `'company_size'` with CatBoost encoding using the implementation from the `category_encoders` library.

In [20]:
from category_encoders.cat_boost import CatBoostEncoder


cbe_encoder = CatBoostEncoder(handle_missing='return_nan')
cbe_encoder.fit(train.company_size, train.target)

train.company_size = cbe_encoder.transform(train.company_size, train.target)
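
To give an intuition for what this kind of encoding does on the train data, here is a simplified pure-pandas sketch of an ordered target statistic: each row is encoded using only the targets of the earlier rows with the same category, smoothed towards a prior. The toy data, the choice of the global mean as the prior, and the smoothing weight `a = 1.0` are assumptions of this sketch. It is not meant to reproduce the library's output exactly.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    'size':   ['<10', '<10', '50-99', '<10'],
    'target': [1,     0,     1,       1],
})

prior = toy['target'].mean()  # global mean target (0.75), used as the prior
a = 1.0                       # smoothing weight (an assumption of this sketch)

grouped = toy.groupby('size')['target']
prev_sum = grouped.cumsum() - toy['target']  # sum of targets in earlier rows of the same category
prev_count = grouped.cumcount()              # number of earlier rows of the same category

# Ordered target statistic: the row's own target is never used
encoded = (prev_sum + prior * a) / (prev_count + a)
print(encoded.tolist())
```

The first occurrence of every category is encoded with the prior alone, and later occurrences gradually move towards the category's own target mean.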

Let's encode feature `'company_type'` with Weight of Evidence (WoE) encoding. A category $x_{ij}$ will be encoded according to the following formula:

$$
\hat{x_{ij}} = \ln\left(\frac{\mathbb{P}\left(x_{ij}\mid y=1\right)}{\mathbb{P}\left(x_{ij}\mid y=0\right)}\right)\quad.
$$

Here:

$$
\mathbb{P}\left(x_{ij}\mid y=1\right) = \frac{\text{count}\left(y=1\mid x_{ij}\right)}{\text{count}\left(y=1\right)}
$$
$$
\mathbb{P}\left(x_{ij}\mid y=0\right) = \frac{\text{count}\left(y=0\mid x_{ij}\right)}{\text{count}\left(y=0\right)}
$$

The notation means the following:

* $\text{count}\left(y=1\mid x_{ij}\right)$ denotes the number of observations with the category $x_{ij}$ where the target value is equal to $1$;
* $\text{count}\left(y=0\mid x_{ij}\right)$ denotes the same but for the target value $0$;
* $\text{count}\left(y=1\right)$ denotes the number of observations with the target value equal to $1$;
* $\text{count}\left(y=0\right)$ denotes the same but for the target value $0$.

In [21]:
P_1 = train[train.target == 1].groupby('company_type').target.count() / train[train.target == 1].target.count()
P_0 = train[train.target == 0].groupby('company_type').target.count() / train[train.target == 0].target.count()

map_company_type = np.log(P_1 / P_0)
train.company_type = train.company_type.map(map_company_type)
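
Here is a small worked example of the WoE formula on hypothetical toy data. Category `'C'` is deliberately observed only with target 1, to show what this pandas formulation does in that case.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    'company_type': ['A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'target':       [1,   1,   0,   1,   0,   0,   1],
})

pos = toy[toy['target'] == 1]
neg = toy[toy['target'] == 0]

# Share of each category among positive and among negative observations
p1 = pos['company_type'].value_counts() / len(pos)
p0 = neg['company_type'].value_counts() / len(neg)

# 'C' is absent from p0, so index alignment produces NaN for it
woe = np.log(p1 / p0)
print(woe)
```

For `'A'` the result is $\ln(0.5 / (1/3)) = \ln 1.5 \approx 0.405$, and for `'B'` it is $\ln(0.25 / (2/3)) \approx -0.981$. Category `'C'` becomes NaN because it has no observations with target 0, so such categories need separate treatment (for example, imputation).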

Now all the categorical features are encoded. Next, I drop `'enrollee_id'` because it is only a unique identifier and not an informative feature. After this, I split the train part of the data into a dataframe which contains only features (without the target) and the target array.

Before training the models, we should impute the remaining missing values. You might have noticed that I didn't impute missing values for the features `'major_discipline'`, `'company_size'` and `'company_type'`. This is because the number of missing values in these features is relatively large. Let's perform the imputation using the KNN approach, where missing values are imputed by looking at similar observations.

In [22]:
X_train = train.drop(['enrollee_id', 'target'], axis=1)
y_train = train['target']

X_train.shape, y_train.shape

((19158, 16), (19158,))

In [23]:
from sklearn.impute import KNNImputer


knn_imputer = KNNImputer(n_neighbors=3)
knn_imputer.fit(X_train, y_train)

X_train = pd.DataFrame(knn_imputer.transform(X_train), columns=X_train.columns)
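
To make the idea of KNN imputation concrete, here is a tiny sketch on a hypothetical array with one missing value (the numbers are made up, and `n_neighbors=2` is chosen only to keep the example easy to follow).

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: the third row is missing its second feature
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [8.0, 80.0],
])

# The two rows closest to [3.0, nan] on the observed feature are [2.0, 20.0] and [1.0, 10.0]
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The missing value is replaced with the mean of the neighbours' values
print(X_filled[2, 1])
```

The missing entry is filled with the average of 20 and 10, i.e. 15. The distant row `[8.0, 80.0]` does not influence the result.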

Finally, let's train a Random Forest classifier from `sklearn` on the train data and check feature importances.

In [24]:
from sklearn.ensemble import RandomForestClassifier


clf = RandomForestClassifier(n_estimators=500, max_depth=8, random_state=13)
clf.fit(X_train, y_train)

importance_list = clf.feature_importances_
index_max = int(np.argmax(importance_list))  # position of the most important feature
X_train.columns[index_max]

'major_discipline'
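
Instead of looking only at the top feature, it can be convenient to view the whole ranking. The sketch below does this on synthetic data generated with `make_classification`. The data, the feature names `f0`..`f4` and the model settings are assumptions made only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to have something to fit
X_arr, y = make_classification(n_samples=200, n_features=5, n_informative=2, random_state=13)
X = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(5)])

model = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=13)
model.fit(X, y)

# Attach feature names to the importances and sort them in descending order
ranking = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranking)
```

The first index of `ranking` is the same feature that the `argmax`-style lookup returns, and the remaining entries show how the importance is distributed among the other features.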

Now let's process the test data, applying the same operations as for the train data and reusing the mappings computed on train.

In [25]:
test.experience = test.experience.fillna(-1)
test.education_level = test.education_level.map(map_education)
test.relevent_experience = test.relevent_experience.map(map_relevent_experience)

df_gender_te = pd.get_dummies(test.gender)
df_gender_te.columns = ['gender-' + column for column in df_gender_te.columns]
test = pd.concat([test, df_gender_te], axis=1)
test.drop('gender', axis=1, inplace=True)

df_enrolled_university_te = pd.get_dummies(test.enrolled_university)
df_enrolled_university_te.columns = ['enrolled_university-' + column for column in df_enrolled_university_te.columns]
test = pd.concat([test, df_enrolled_university_te], axis=1)
test.drop('enrolled_university', axis=1, inplace=True)

test.city = test.city.map(map_city)
test.last_new_job = test.last_new_job.fillna(-1)
test.last_new_job = test.last_new_job.map(map_last_new_job)
test.experience = test.experience.map(map_experience)
test.major_discipline = test.major_discipline.map(map_major_discipline)
test.company_size = cbe_encoder.transform(test.company_size)  # encode from train statistics only; test labels are not passed
test.company_type = test.company_type.map(map_company_type)

X_test = test.drop(['enrollee_id', 'target'], axis=1)
y_test = test['target']
X_test = pd.DataFrame(knn_imputer.transform(X_test), columns=X_test.columns)
X_test.shape

(2129, 16)

As a result of the operations above, I obtained `X_test` with shape (2129, 16). Let's compute the predictions of the trained Random Forest model on it and check their accuracy on the test data. Then let's compute the predictions of the same model on `X_train` and check the accuracy there.

In [21]:
y_pred_te = clf.predict(X_test)
y_pred_tr = clf.predict(X_train)

In [22]:
from sklearn.metrics import accuracy_score


round(accuracy_score(y_test, y_pred_te), 2)

0.73

In [23]:
round(accuracy_score(y_train, y_pred_tr), 2)

0.98