# Practice assignment: Handling categorical data

In this programming assignment, you are going to work with the following dataset from Kaggle:

https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists

This assignment is graded by your `submission.json`.

The cell below creates a valid `submission.json` file, fill your answers in there. 

You can press "Submit Assignment" at any time to submit partial progress.

In [74]:
%%file submission.json
{
    "q1": 10,
    "q2": 123,
    "q3": 8,
    "q4": 23,
    "q5": 2.06,
    "q6": 0.72,
    "q7": 42824,
    "q8": 38702,
    "q9": 1724.51112,
    "q10": 0.36407,
    "q11": 0.45383,
    "q12": 0.26772,
    "q13": 0.24935,
    "q14": -0.40878,
    "q15": 0.18216,
    "q16": "major_discipline",
    "q17": 0.73
}

Overwriting submission.json


## Dataset description

Suppose that a company working with Big Data and Data Science wants to hire data scientists among people who have successfully passed some courses conducted by the company. Many people sign up for their training. The company wants to focus of the candidates who really want to work for them after training. Information related to demographics, education, experience is provided by the candidates via a sign-up form.

This dataset is designed to understand the factors that lead a person to leave current job which is useful for HR researches. Based on the provided data, you are going to predict whether a candidate is looking for a job change.

The whole data is divided into train and test parts. Data contains several categorical features – they need to be encoded.

## Feature description

- `enrollee_id`: Unique ID for a candidate
- `city`: City code
- `city_development_index`: Developement index of the city (scaled)
- `gender`: Gender of a candidate
- `relevent_experience`: Relevant experience of a candidate
- `enrolled_university`: Type of University course enrolled in if any
- `education_level`: Education level of a candidate
- `major_discipline`: Education major discipline of a candidate
- `experience`: Candidate total experience in years
- `company_size`: Number of employees in a current employer's company
- `company_type`: Type of a current employer
- `last_new_job`: Difference in years between previous and current jobs
- `training_hours`: training hours completed
- `target`: 0 – Not looking for job change, 1 – Looking for a job change


In [2]:
import numpy as np
import pandas as pd

In [3]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

## 1

Before modifying the features, let's study our dataframe a bit. First, call `.dtypes` for `train` dataframe. 

**q1:** How many columns in `train` contain data types different from numbers (ints and floats)?

In [4]:
# your code here
cat_cols = train.columns[train.dtypes == "object"].tolist()
len(cat_cols)

10

## 2

**q2:** How many unique caterogies does a feature `'city'` contain in the train data?

In [5]:
# your code here
train['city'].describe()

count        19158
unique         123
top       city_103
freq          4355
Name: city, dtype: object

## 3

**q3:** How many features in train data contain missing values (NaN)?

In [6]:
# your code here
train.isnull().sum()

enrollee_id                  0
city                         0
city_development_index       0
gender                    4508
relevent_experience          0
enrolled_university        386
education_level            460
major_discipline          2813
experience                  65
company_size              5938
company_type              6140
last_new_job               423
training_hours               0
target                       0
dtype: int64

## 4

Some features have a relatively small number of missing values. For instance, `'experience'` has only 65 missing values in the train data. Replacing these missing values with a special category might introduce bias to the data. However, since the number of missing values is not so big, we might be OK with it.

Replace missing values in `'experience'` feature with a category -1.

**q4:** How many categories does this feature contain in the train data now?

_(hint: you might want to use `fillna` function from `pandas` library in this task, but remember that it's not an in-place function by default. Or you can use `SimpleImputer` from `sklearn`)_

In [7]:
# your code here
from sklearn.impute import SimpleImputer

experience_imp = SimpleImputer(missing_values = np.nan, strategy = 'constant', fill_value = -1)
experience_imp = experience_imp.fit(train[['experience']])
train['experience'] = experience_imp.transform(train[['experience']])
train.experience.describe()

count     19158
unique       23
top         >20
freq       3286
Name: experience, dtype: object

## 5

`'education_level'` is an example of an ordinal feature. Sure, we can define an order on it. For instance, `'High School'` is "bigger" than `'Primary School'`, and `'Phd'` is "bigger" than '`Graduate'`. We can encode this feature with integer numbers in a correct order.

In this task, apply a correct mapping for the values of `'education_level'` feature. The mapping should be the following:

* `'Primary School'` -> 0
* `'High School'` -> 1
* `'Graduate'` -> 2
* `'Masters'` -> 3
* `'Phd'` -> 4

At the same time, impute missing values in `'education_level'` with a new category -1. So another part of the mapping would be:

* `np.nan` -> -1

**q5:** What will be the mean value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest TWO decimal places (e.g. 12.3456789 -> 12.35).

_(hint: you might want to use `map` function from `pandas` in this task)_

In [8]:
map_education = train['education_level'].map({'Primary School': 0, 'High School': 1, 'Graduate': 2, 'Masters': 3, 'Phd': 4, np.nan: -1})
# your code here
# your code here
train['education_level'] = map_education
train['education_level'].mean()

2.061384278108362

## 6

Feature `'relevent_experience'` is an example of a binary feature. You can also check that it has no missing values, which makes its encoding pretty easy.

Encode this feature with the following mapping:

*   `"No relevent experience"` -> 0
*   `"Has relevent experience"` -> 1

**q6:** What will be the mean value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest TWO decimal places (e.g. 12.3456789 -> 12.35).

In [9]:
map_relevent_experience = train['relevent_experience'].map({'No relevent experience': 0, 'Has relevent experience': 1})
# your code here
# your code here
train['relevent_experience'] = map_relevent_experience
train['relevent_experience'].mean()

0.7199081323728991

## 7

In our case, `'gender'` is an example of a nominal feature (notice that it is not a binary feature cause it contains three categories + NaNs). We will use One-Hot encoding to encode it. There are several options how to do it in practice: for example, to use `get_dummies` function from `pandas` or `OneHotEncoder` from `sklearn.preprocessing`. Here, we will go with the first option.

If you want to encode a whole dataset with One-Hot encoding, you can directly pass it into the discussed methods. But here we want to encode only one feature. We suggest to do it in four steps:

1. Obtain a One-Hot encoding dataframe for the feature `'gender'`: apply `pd.get_dummies` to this feature. As a result, you should obtain a dataframe with three features (three different categories for gender). Don't include `'NaN'` feature - it will already be included as the encoding (0, 0, 0).
2. Change the column names of this dataframe, so that a feature `'<category_name>'` will become `'gender-<category_name>'`. Of course, it is not a necessary step in general. However, it probably would be more convenient to work with a full dataframe if we remember that these One-Hot encoded features originally came from the feature `'gender'`. On a technical side, you can perform it by changing `df_gender.columns` list.
3. Concatenate original and new dataframes. You can do it by calling `pd.concat` function. Don't forget to set `'axis'` parameter, so that you concatenate dataframes by columns, not by rows. As a result of this step, you should obtain a new `train` dataframe with three new columns named like `'gender-<category_name>'`.
4. Finally, drop `'gender'` feature because we have already encoded it.

As a result of these four steps, you should obtain a new version of `train` dataframe, with dropped `'gender'` feature and three new features named like `'gender-<category_name>'`. A total number of columns by this time should be equal to 16.

**q7:** What is the total number of zero values which appear in the columns starting with `'gender-'`?

_Example: suppose that you obtain the following dataframe:_

| gender-category1 | gender-category2   | gender-category3|
|------|------|------|
|   0  | 1| 0|
|0|0|0|

_Then the answer for the question above should be 5._



In [10]:
# your code here
df_gender = pd.get_dummies(train['gender'], dummy_na=False)
df_gender.rename(columns = {'Female': 'gender_category1', 'Male': 'gender_category2', 'Other': 'gender_category3'}, inplace = True)
train = pd.concat([train, df_gender], axis = 1)

In [11]:
train.drop(columns = ['gender'], inplace = True)

In [12]:
print(train['gender_category1'].value_counts())
print(train['gender_category2'].value_counts())
print(train['gender_category3'].value_counts())

0    17920
1     1238
Name: gender_category1, dtype: int64
1    13221
0     5937
Name: gender_category2, dtype: int64
0    18967
1      191
Name: gender_category3, dtype: int64


In [13]:
17920 + 5937  +18967

42824

In [14]:
# len(train.columns.tolist())

In [15]:
from sklearn.preprocessing import OneHotEncoder

# gender_encoder = OneHotEncoder(handle_unknown = 'ignore')
# gender_encoder.fit(train[['gender']])
# df_gender1 = gender_encoder.transform(train[['gender']]).toarray()[:, :3]
# df_gender1 = pd.DataFrame(df_gender1, columns = ['gender-category1', 'gender-category2', 'gender-category3'])
# train = pd.concat([train, df_gender1], axis = 1)

In [16]:
# print(train['gender-category1'].value_counts()[1] + train['gender-category2'].value_counts()[1] + train['gender-category3'].value_counts()[1])
# print(train['gender-category1'].value_counts())
# print(train['gender-category2'].value_counts())
# print(train['gender-category3'].value_counts())

## 8

Perform One-Hot encoding for the feature `'enrolled_university'`, using the similar procedure as in the previous task:

1. Obtain a One-Hot encoding dataframe for `'enrolled_university'`.
2. Rename its columns.
3. Concatenate original and One-Hot encoding dataframes.
4. Drop `'enrolled_university'` column.

A total number of columns by this time should be equal to 18.

**q8:** What is the total number of zero values which appear in the columns starting with `'enrolled_university-'`?

In [17]:
# your code here
university_encoder = OneHotEncoder(handle_unknown = 'ignore')
university_encoder.fit(train[['enrolled_university']])
df_university = university_encoder.transform(train[['enrolled_university']]).toarray()[:, :3].astype(int)
df_university = pd.DataFrame(df_university, columns = ['enrolled_university-category1', 'enrolled_university-category2', 'enrolled_university-category3'])
train = pd.concat([train, df_university], axis = 1)

In [18]:
train.head()

Unnamed: 0,enrollee_id,city,city_development_index,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target,gender_category1,gender_category2,gender_category3,enrolled_university-category1,enrolled_university-category2,enrolled_university-category3
0,28120,city_16,0.91,1,no_enrollment,1,,19,100-500,Pvt Ltd,2,20,0.0,0,0,0,0,0,1
1,31820,city_21,0.624,0,no_enrollment,3,STEM,2,50-99,Early Stage Startup,1,10,1.0,0,0,0,0,0,1
2,4277,city_71,0.884,1,Part time course,2,STEM,>20,,,2,6,1.0,0,1,0,0,1,0
3,3379,city_159,0.843,1,no_enrollment,3,STEM,>20,<10,Pvt Ltd,>4,56,0.0,0,1,0,0,0,1
4,10821,city_16,0.91,1,no_enrollment,3,STEM,6,,,4,15,0.0,0,1,0,0,0,1


In [19]:
train.drop(columns = ['enrolled_university'], inplace = True)

In [20]:
print(train['enrolled_university-category1'].value_counts())
print(train['enrolled_university-category2'].value_counts())
print(train['enrolled_university-category3'].value_counts())

0    15401
1     3757
Name: enrolled_university-category1, dtype: int64
0    17960
1     1198
Name: enrolled_university-category2, dtype: int64
1    13817
0     5341
Name: enrolled_university-category3, dtype: int64


In [21]:
15401 + 17960 + 5341

38702

In [22]:
len(train.columns)

18

## 9

Encode feature `'city'` using frequency encoding. You should map each category `'city_i'` to its count (a total number of observations in `train` with `city == 'city_i'`). Save this mapping, since later you would apply the same mapping to the test data.

As a result of this task, feature `'city'` should be encoded with category counts in `train`. 

**q9:** What will be the mean value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).


In [23]:
city_data = train['city'].value_counts()

In [24]:
map_city = train['city'].map(train['city'].value_counts(),na_action = None)
train['city'] = map_city

# your code here
# your code here
train['city'].mean()

1709.7964296899468

## 10

Encode feature `'last_new_job'` with target encoding with no modifications. First, impute missing values in this feature with a new category `'-1'`. Then, map each category of `'last_new_job'` to the mean target value of the observations with the corresponding category. Save this mapping, since later you would apply the same mapping to the test data.

**q10:** What will be the maximum value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.35).

_(hint: you might want to use `groupby` function from `pandas` in this task)_

In [25]:
train['last_new_job'].fillna(-1, inplace = True)
count_data = train.groupby('last_new_job').target.mean()
count_data

last_new_job
-1       0.364066
1        0.264303
2        0.241379
3        0.225586
4        0.221574
>4       0.182371
never    0.301387
Name: target, dtype: float64

In [26]:
map_last_new_job = train['last_new_job'].map(count_data)
train['last_new_job'] = map_last_new_job
train['last_new_job'].max()

0.3640661938534279

In [27]:
# your code here
# train['last_new_job'].fillna(-1, inplace = True)
# map_last_new_job = train['last_new_job'].map(count_data)
# train['last_new_job'] = map_last_new_job
# train['last_new_job'].max()
# your code here
# your code here

## 11

Encode feature `'experience'` with M-estimate encoding. Map each category of `'experience'` according to the following formula:

$$
\hat{x_{ij}} = \frac{\text{target}\left(j, x_{ij}\right) + m\times y_{\text{mean}}}{\text{count}\left(j, x_{ij}\right) + m}\quad,
$$

where

* $x_{ij}$ is a category being encoded,
* $\hat{x_{ij}}$ is its corresponding M-estimate encoding value,
* $\text{count}\left(j, x_{ij}\right)$ is a total number of times $x_{ij}$ appeared in `train`,
* $\text{target}\left(j, x_{ij}\right)$ is a mean target value of the observations with the corresponding category,
* $m$ is a parameter.

In this task, set $m = 0.5$. 

**q11:** What will be the maximum value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [28]:
y_mean = train['target'].mean()
experience_data = train.groupby('experience').target.agg(['sum', 'count'])
m = 0.5

In [29]:
map_experience = train['experience'].map((experience_data['sum'] + m*y_mean)/(experience_data['count']+m))
# your code here
# your code here
train['experience'] = map_experience
train['experience'].max()

0.4538271268239785

## 12

Encode feature `'major_discipline'` with Leave-One-Out encoding. Remember that this technique is similar to target encoding, but here while computing the encoding for a particular observation, we exclude it from the target encoding formula. Therefore a category $x_{ij}$ for the $i$-th observation will be encoded according to the following formula:

$$
\hat{x_{ij}} = \frac{\text{target}\left(j, x_{ij}\right) - y_i}{\text{count}\left(j, x_{ij}\right) - 1}\quad,
$$

where

* $\hat{x_{ij}}$ is its corresponding M-estimate encoding value,
* $\text{count}\left(j, x_{ij}\right)$ is a total number of times $x_{ij}$ appeared in `train`,
* $\text{target}\left(j, x_{ij}\right)$ is a mean target value of the observations with the corresponding category,
* $y_i$ is a target value of the $i$-th observation

For example, suppose that you have the following train data:

|feature|target|
|-|-|
|A|1|
|A|0|
|B|1|
|B|0|
|A|0|

Then you obtain the following Leave-One-Out encoding:

|feature|feature_loo_encoded|
|-|-|
|A|0.0|
|A|0.5|
|B|0.0|
|B|1.0|
|A|0.5|

It is very important to notice that in this method the same category can be encoded differently for different observations. Thus, after encoding the train part of data, you should create a mapping which will help to encode the test data. In order to do this, simply average train encoding values within each category to obtain the final encoding. For instance, suppose that you obtain the following train dataframe:

|feature|feature_loo_encoded|
|-|-|
|A|0.2|
|A|0.6|
|B|0.3|
|B|0.7|
|A|0.4|

Then, for the test data, you should obtain the following mapping:

* A -> 0.4 (because 0.4 is a mean value of the encoded values for the category A)
* B -> 0.5 (because 0.5 is a mean value of the encoded values for the category B)

Don't impute any missing values in this task. After completing this task, drop the feature `'major_discipline'` and rename `'major_discipline_loo_encoded'` to `'major_discipline'`.

**q12:** What will be the maximum value of the encoding values for `'major_discipline'` for the TEST data in the mapping described above? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [30]:
discipline_data = train.groupby('major_discipline').target.agg(['sum', 'count'])

In [31]:
(discipline_data['sum']['STEM']  - 1)/(discipline_data['count']['STEM'] - 1)

0.26154164653923123

In [32]:
train.index

RangeIndex(start=0, stop=19158, step=1)

In [33]:
discipline_data['sum'][train['major_discipline'][1]]

3791.0

In [34]:
# your code here
map_major_discipline = train.index.map(lambda x: (discipline_data['sum'][train['major_discipline'][x]] - train.target[x])/
                                       (discipline_data['count'][train['major_discipline'][x]] - 1)
                                       if pd.isna(train['major_discipline'][x]) == False else np.nan)
# your code here
# your code here
train['major_discipline_loo_encoded'] = map_major_discipline

In [35]:
test_discipline_data = train.groupby('major_discipline')['major_discipline_loo_encoded'].mean()

In [36]:
train.drop(columns = ['major_discipline'], inplace = True)
train.rename(columns = {'major_discipline_loo_encoded': 'major_discipline'}, inplace = True)

In [37]:
test_discipline_data

major_discipline
Arts               0.209486
Business Degree    0.262997
Humanities         0.210762
No Major           0.246637
Other              0.267717
STEM               0.261593
Name: major_discipline_loo_encoded, dtype: float64

In [38]:
map_test_data = test['major_discipline'].map(lambda x: test_discipline_data[x] if pd.isna(x) == False else np.nan)
test['major_discipline'] = map_test_data
test['major_discipline'].max()

0.2677165354330713

## 13

Encode feature `'company_size'` with Catboost encoding. The technique was described in the lecture. Here, for the sake of simplicity, let's use the implementation from `category_encoders` library.

As you may remember, Catboost encoding depends on how the data is ordered. Normally, you should shuffle the data one time or even several times. In this task, we will assume that the data has already been shuffled, so you don't need to shuffle it again.

Take `CatBoostEncoder` and use the default values for all its parameters, except `handle_missing` - set it to `'return_nan'` so that your encoder don't do anything with missing values. Fit it on the `'company_size'` (train data) and `'target'` and transform this feature. Save the encoder - it will be used later to transform this feature in test data.

Don't impute any missing values in this task.

**q13:** What will be the most popular value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [39]:
from category_encoders.cat_boost import CatBoostEncoder
# your code here
company_size_encoder = CatBoostEncoder(handle_missing = 'return_nan')
train['company_size'] = company_size_encoder.fit_transform(train['company_size'], train['target'])

In [40]:
train['company_size'].value_counts()

0.249348    8
0.124674    6
0.224935    5
0.178478    5
0.249928    5
           ..
0.159303    1
0.147630    1
0.160123    1
0.146642    1
0.180865    1
Name: company_size, Length: 12567, dtype: int64

## 14

Encode feature `'company_type'` with Weight of Evidence (WoE) encoding. A category $x_{ij}$ will be encoded according to the following formula:

$$
\hat{x_{ij}} = \ln\left(\frac{\mathbb{P}\left(x_{ij}\mid y=1\right)}{\mathbb{P}\left(x_{ij}\mid y=0\right)}\right)\quad.
$$

Here:

$$
\mathbb{P}\left(x_{ij}\mid y=1\right) = \frac{\text{count}\left(y=1\mid x_{ij}\right)}{\text{count}\left(y=1\right)}
$$
$$
\mathbb{P}\left(x_{ij}\mid y=0\right) = \frac{\text{count}\left(y=0\mid x_{ij}\right)}{\text{count}\left(y=0\right)}
$$

The notation means the following:

* $\text{count}\left(y=1\mid x_{ij}\right)$ denotes the number of observations with the category $x_{ij}$ where the target value is equal to $1$;
* $\text{count}\left(y=0\mid x_{ij}\right)$ denotes the same but for the target value $0$;
* $\text{count}\left(y=1\right)$ denotes the number of observations with the target value equal to $1$;
* $\text{count}\left(y=0\right)$ denotes the same but for the target value $0$.


For example, suppose that you have the following train data:

|feature|target|
|-|-|
|A|1|
|A|0|
|B|1|
|B|0|
|A|0|

Then you obtain the following WoE encoding mapping:

* A -> $\ln\left(\frac{\frac{1}{2}}{\frac{2}{3}}\right) \approx -0.288$ 
* B -> $\ln\left(\frac{\frac{1}{2}}{\frac{1}{3}}\right) \approx 0.405$ 

Don't impute any missing values in this task.

**q14:** What will be the most popular value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [41]:
company_type_data = train.groupby(['company_type', 'target']).enrollee_id.count()
company_type_data

company_type         target
Early Stage Startup  0.0        461
                     1.0        142
Funded Startup       0.0        861
                     1.0        140
NGO                  0.0        424
                     1.0         97
Other                0.0         92
                     1.0         29
Public Sector        0.0        745
                     1.0        210
Pvt Ltd              0.0       8042
                     1.0       1775
Name: enrollee_id, dtype: int64

In [42]:
target_0_count = train.target.value_counts()[0]
target_1_count = train.target.value_counts()[1]

In [43]:
map_company_type = train['company_type'].map(lambda x: np.log((company_type_data.xs([x, 1])*target_0_count)
                                                              /(company_type_data.xs([x,0])*target_1_count))
                                            if pd.isna(x) == False else x)
# your code here
# your code here
map_company_type.value_counts()
train['company_type'] = map_company_type

## 15

We have encoded all categorical features. Next, we drop `'enrollee_id'` because it is not a representative feature. After this, we split train part of data into the dataframe which contains only features (without target) and the target array.

Before training the models, we should impute the remaining missing values. You might have noticed that we didn't impute missing values for the features `'major_discipline'`, `'company_size'` and `'company_type'`. This is because the number of missing values in these features is relatively big (you can check it yourself). In practice, you might just create a special category (like `'-1'`) for each of these features before the encoding. However, in this task, let's perform the imputation using KNN approach - where we impute missing values by looking at the similar observations.

Import `KNNImputer` from `sklearn`. It works only with the dataset with numbers - this is why we didn't run it before the encoding. Set `n_neighbors=3`, and let other parameters have the default values. Fit it on the train dataframe with features, and then transform it. Notice that after the transformation we will obtain `numpy.array` - make `pandas.DataFrame` out of it with the same columns as before.

Save the KNN imputer - you will need it to process the test data. Check that there are no missing values in the train data anymore.

**q15:** What is the mean value of the `'company_size'` feature in the train data after the imputation? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [44]:
X_train = train.drop(['enrollee_id', 'target'], axis=1)
y_train = train['target']

X_train.shape, y_train.shape

((19158, 16), (19158,))

In [45]:
from sklearn.impute import KNNImputer
# your code here
knn_imputer = KNNImputer(n_neighbors = 3)
X_train= pd.DataFrame(knn_imputer.fit_transform(X_train),columns = X_train.columns)

In [46]:
X_train['company_size'].mean()

0.18215611873304263

## 16

Finally, train a Random Forest classifier from `sklearn` on the train data. Set `n_estimators=500`, `max_depth=8`, `random_state=13`, and let other parameters have the default values. Check feature importances. 

**q16:** What is the name of the most important feature for this model? Provide the name of the feature.

In [47]:
from sklearn.ensemble import RandomForestClassifier
# your code here

clf = RandomForestClassifier(n_estimators = 500, max_depth = 8, random_state = 13)
clf.fit(X_train, y_train)
feature_importances = pd.DataFrame(clf.feature_importances_,
                                   index = X_train.columns,
                                    columns=['importance']).sort_values('importance',ascending=False)

In [48]:
feature_importances

Unnamed: 0,importance
major_discipline,0.762782
city_development_index,0.095917
education_level,0.040324
city,0.028299
experience,0.014585
company_size,0.00968
company_type,0.009142
training_hours,0.007989
relevent_experience,0.007781
enrolled_university-category1,0.007628


## 17

In this last task, process the test data so that it is possible to make Random Forest predictions on for it. Perform the similar operations as for the train data, but remember that now you work with test observations and therefore all operations are in the inference mode.

1. (task 4) Feature `'experience'`: impute missing values by -1. 
2. (task 5) Feature `'education_level'`: perform ordinal encoding mapping.
3. (task 6) Feature `'relevent_experience'`: perform binary mapping.
4. (task 7) Feature `'gender'`: perform One-Hot encoding and obtain three new features starting with `'gender-'`. Drop `'gender'` feature.
5. (task 8) Feature `'enrolled_university'`: perform One-Hot encoding and obtain three new features starting with `'enrolled_university-'`. Drop `'enrolled_university'` feature.
6. (task 9) Feature `'city'`: perform frequency encoding mapping.
7. (task 10) Feature `'last_new_job'`: impute missing values by -1 and perform target encoding mapping.
8. (task 11) Feature `'experience'`: perform M-estimate encoding mapping.
9. (task 12) Feature `'major_discipline'`: perform Leave-One-Out encoding mapping.
10. (task 13) Feature `'company_size'`: perform Catboost encoding mapping.
11. (task 14) Feature `'company_type'`: perform WoE encoding mapping.
12. (task 15) Split `test` into `X_test` (a dataframe with no `'enrollee_id'` and `'target'`) and `y_test` (an array with target values). Impute missing values by using KNN imputer which you used before (now only in a transform mode).

As a result of the operations described above, you should obtain `X_test` which is a `pandas.DataFrame` with a shape (2129, 16). Calculate the predictions of the trained Random Forest model on it. Check the accuracy of the predictions on test data.

Then calculate the predictions of the same model on `X_train` and check the accuracy there. Compare the accuracies on train and test data. Do you notice something? What, in your opinion, caused such difference in the accuracies?

**q17:** As a result of this task, provide the accuracy for the test data, rounded to the nearest TWO decimal places (e.g. 12.3456789 -> 12.35).

In [49]:
test.columns

Index(['enrollee_id', 'city', 'city_development_index', 'gender',
       'relevent_experience', 'enrolled_university', 'education_level',
       'major_discipline', 'experience', 'company_size', 'company_type',
       'last_new_job', 'training_hours', 'target'],
      dtype='object')

In [50]:
train.columns

Index(['enrollee_id', 'city', 'city_development_index', 'relevent_experience',
       'education_level', 'experience', 'company_size', 'company_type',
       'last_new_job', 'training_hours', 'target', 'gender_category1',
       'gender_category2', 'gender_category3', 'enrolled_university-category1',
       'enrolled_university-category2', 'enrolled_university-category3',
       'major_discipline'],
      dtype='object')

In [51]:
#exprience
train['experience'].fillna(-1, inplace = True)
train['experience'].isnull().sum()

0

In [52]:
#education level
test['education_level'] = test['education_level'].map({'Primary School': 0, 'High School': 1, 'Graduate': 2, 'Masters': 3, 'Phd': 4, np.nan: -1})
test['education_level'].isnull().sum()

0

In [53]:
#relevent experience
test['relevent_experience'] = test['relevent_experience'].map({'No relevent experience': 0, 'Has relevent experience': 1})
test['relevent_experience'].isnull().sum()

0

In [54]:
#gender
df_gender_test = pd.get_dummies(test['gender'], dummy_na=False)
df_gender_test.rename(columns = {'Female': 'gender_category1', 'Male': 'gender_category2', 'Other': 'gender_category3'}, inplace = True)
test = pd.concat([test, df_gender_test], axis = 1)

In [55]:
test.drop(columns = ['gender'], inplace = True)

In [56]:
#enrolled university
df_university_test = university_encoder.transform(test[['enrolled_university']]).toarray()[:, :3].astype(int)
df_university_test = pd.DataFrame(df_university_test, columns = ['enrolled_university-category1', 'enrolled_university-category2', 'enrolled_university-category3'])
test = pd.concat([test, df_university_test], axis = 1)

In [57]:
test.drop(columns = ['enrolled_university'], inplace = True)

In [58]:
#city
test['city'] = test['city'].map(city_data)
test['city'].isnull().sum()

0

In [59]:
#last_new_job
test['last_new_job'] = test['last_new_job'].map(count_data)

In [60]:
test['last_new_job']

0       0.264303
1       0.264303
2       0.301387
3       0.264303
4       0.182371
          ...   
2124    0.221574
2125    0.241379
2126    0.301387
2127    0.264303
2128    0.241379
Name: last_new_job, Length: 2129, dtype: float64

In [61]:
#experience
y_mean_test = test['target'].mean()
test['experience'] = test['experience'].map((experience_data['sum'] + m*y_mean_test)/(experience_data['count']+m))

In [62]:
#company size
test['company_size'] = company_size_encoder.transform(test['company_size'])

In [63]:
#company type
test['company_type'] = test['company_type'].map(lambda x: np.log((company_type_data.xs([x, 1])*target_0_count)
                                                              /(company_type_data.xs([x,0])*target_1_count))
                                            if pd.isna(x) == False else x)

In [64]:
X_test = test.drop(['enrollee_id', 'target'], axis=1)
y_test = test['target']

X_test.shape, y_test.shape

((2129, 16), (2129,))

In [65]:
X_test

Unnamed: 0,city,city_development_index,relevent_experience,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,gender_category1,gender_category2,gender_category3,enrolled_university-category1,enrolled_university-category2,enrolled_university-category3
0,89,0.827,1,2,0.261593,0.217374,0.171313,,0.264303,21,0,1,0,1,0,0
1,4355,0.920,1,2,0.261593,0.288106,,-0.408782,0.264303,98,1,0,0,0,0,1
2,2702,0.624,0,1,,0.453847,,-0.408782,0.301387,15,0,1,0,0,0,1
3,48,0.827,1,3,0.261593,0.227442,0.233865,-0.408782,0.264303,39,0,1,0,0,0,1
4,4355,0.920,1,2,0.261593,0.153092,0.190717,-0.408782,0.182371,72,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2124,4355,0.920,0,2,0.210762,0.141859,,-0.164182,0.221574,15,0,1,0,0,0,1
2125,586,0.897,1,3,0.261593,0.153780,,,0.241379,30,0,1,0,0,0,1
2126,275,0.887,0,0,,0.352998,,-0.408782,0.301387,18,0,1,0,0,0,1
2127,304,0.804,1,1,,0.294735,0.161450,-0.164182,0.264303,84,0,1,0,1,0,0


In [66]:
X_train.columns

Index(['city', 'city_development_index', 'relevent_experience',
       'education_level', 'experience', 'company_size', 'company_type',
       'last_new_job', 'training_hours', 'gender_category1',
       'gender_category2', 'gender_category3', 'enrolled_university-category1',
       'enrolled_university-category2', 'enrolled_university-category3',
       'major_discipline'],
      dtype='object')

In [67]:
X_test = X_test.reindex(columns = X_train.columns)

In [70]:
X_test.columns

Index(['city', 'city_development_index', 'relevent_experience',
       'education_level', 'experience', 'company_size', 'company_type',
       'last_new_job', 'training_hours', 'gender_category1',
       'gender_category2', 'gender_category3', 'enrolled_university-category1',
       'enrolled_university-category2', 'enrolled_university-category3',
       'major_discipline'],
      dtype='object')

In [68]:
#impute missing values
X_test= pd.DataFrame(knn_imputer.fit_transform(X_test),columns = X_test.columns)

In [71]:
X_train.columns

Index(['city', 'city_development_index', 'relevent_experience',
       'education_level', 'experience', 'company_size', 'company_type',
       'last_new_job', 'training_hours', 'gender_category1',
       'gender_category2', 'gender_category3', 'enrolled_university-category1',
       'enrolled_university-category2', 'enrolled_university-category3',
       'major_discipline'],
      dtype='object')

In [73]:
from sklearn.metrics import accuracy_score
# your code here

y_train_predict = clf.predict(X_train)
train_score = accuracy_score(y_train, y_train_predict)
y_test_predict = clf.predict(X_test)
test_score = accuracy_score(y_test, y_test_predict)

print(train_score)
print(test_score)

0.9754671677628145
0.7299201503053077
