# 1) What Is Feature Engineering?

Feature engineering is the __transformation__ of raw data into features that are suitable for modeling and that improve the accuracy of the algorithm.


# 2) Why Feature Engineering?
In practice, data rarely comes in the form of ready-to-use matrices. That's why every task begins with feature engineering.



In [None]:
import pandas as pd
import numpy as np
from functools import reduce 
import os
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# for regression problems
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# for classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# to split and standardize the datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# to evaluate regression models
from sklearn.metrics import mean_squared_error

# to evaluate classification models
from sklearn.metrics import roc_auc_score

import warnings
warnings.filterwarnings('ignore')

## Text
Text is a type of data that can come in different formats; there are so many text processing methods that they cannot fit in a single article. Nevertheless, we will review the most popular ones.

<img src=https://habrastorage.org/webt/r7/sq/my/r7sqmyj1nmqmzltaftt40zi7-gw.png width=50%>

In [None]:
texts = [['i', 'have', 'a', 'cat'], 
        ['he', 'have', 'a', 'dog'], 
        ['he', 'and', 'i', 'have', 'a', 'cat', 'and', 'a', 'dog']]

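# build the vocabulary: every distinct word gets an index
# (sorted so that the index assignment is reproducible between runs)
dictionary = list(enumerate(sorted(set(reduce(lambda x, y: x + y, texts)))))

print(dictionary)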

def vectorize(text): 
    vector = np.zeros(len(dictionary)) 
    for i, word in dictionary: 
        num = 0 
        for w in text: 
            if w == word: 
                num += 1 
        if num: 
            vector[i] = num 
    return vector

for t in texts: 
    print(vectorize(t))

## Tabular
Tabular data is data that is structured into rows, each of which contains information about some thing. Each row contains the same number of cells (although some of these cells may be empty), which provide values of properties of the thing described by the row. In tabular data, cells within the same column provide values for the same property of the things described by each row.

In [None]:
print(os.listdir("../input"))

In [None]:
data = pd.read_csv('../input/home-credit-default-risk/application_train.csv')
data.head()

# 3) How to Engineer Features

#### 1. Imputation
The act of replacing missing data with statistical estimates of the missing values. The goal of any imputation technique is to produce a **complete dataset** that can then be used for machine learning.
#### 2. Encoding categorical variables 
Transform the strings of categorical variables into numbers, so that we can feed these variables into machine learning algorithms.
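The two steps above can be sketched on a toy frame. This is only an illustration: the column names `income` and `contract` are hypothetical and do not come from the dataset used in this notebook.

```python
import pandas as pd

# hypothetical toy data with one numerical and one categorical variable
toy = pd.DataFrame({
    "income": [50.0, None, 70.0, 65.0],
    "contract": ["cash", "revolving", "cash", None],
})

# 1. imputation: fill the missing income with the median of the observed values
toy["income"] = toy["income"].fillna(toy["income"].median())

# 2. encoding: map each category to an integer code
#    (pandas assigns -1 to missing categories)
toy["contract_code"] = toy["contract"].astype("category").cat.codes

# alternative encoding: one binary column per category
one_hot = pd.get_dummies(toy["contract"], prefix="contract")

print(toy)
print(one_hot)
```

Integer codes impose an arbitrary order on the categories, so one binary column per category is often the safer choice for linear models.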
#### 3. Normalisation, engineering mixed variables, handling rare values, removing outliers, ...

# 4) Missing values
Missing data occur when __no data__ / __no value__ is stored for a certain observation within a variable. 

## Why is data missing?
There are 3 mechanisms that lead to missing data: 2 of them involve data missing randomly or almost randomly, and the third one involves a systematic loss of data.

### Missing Completely at Random, MCAR:

A variable is missing completely at random (MCAR) if the probability of being missing is the same for all the observations. 
When data is MCAR, there is absolutely no relationship between the data missing and any other values, observed or missing, within the dataset. In other words, those missing data points are a random subset of the data. There is nothing systematic going on that makes some data more likely to be missing than others.

If values for observations are missing completely at random, then disregarding those cases would not bias the inferences made.


### Missing at Random, MAR: 

MAR occurs when there is a systematic relationship between the propensity of missing values and the observed data. In other words, the probability of an observation being missing depends only on available information (other variables in the dataset). For example, if men are more likely to disclose their weight than women, weight is MAR. The weight information will be missing at random for those men and women who decided not to disclose their weight, but as men are more prone to disclose it, there will be more missing values for women than for men.

In a situation like the above, if we decide to proceed with the variable with missing values (in this case weight), we might benefit from including gender to control the bias in weight for the missing observations.
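The difference between MCAR and MAR can be illustrated with a small simulation of the weight example. This is a sketch on synthetic data: the weights, the disclosure probabilities and all variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# synthetic population: men are heavier on average than women
gender = rng.choice(["man", "woman"], size=n)
true_weight = np.where(gender == "man",
                       rng.normal(80, 10, n),
                       rng.normal(65, 10, n))

# MCAR: every observation has the same 30% chance of being missing
mcar_mask = rng.random(n) < 0.3
# MAR: the chance of being missing depends on an observed variable (gender)
mar_mask = rng.random(n) < np.where(gender == "man", 0.1, 0.5)

people = pd.DataFrame({
    "gender": gender,
    "weight_mcar": np.where(mcar_mask, np.nan, true_weight),
    "weight_mar": np.where(mar_mask, np.nan, true_weight),
})

# under MAR the fraction of missing values differs between the groups
missing_rate = people["weight_mar"].isnull().groupby(people["gender"]).mean()
print(missing_rate)

# dropping MCAR cases barely moves the mean; dropping MAR cases biases it
print("true mean:         ", true_weight.mean())
print("observed mean MCAR:", people["weight_mcar"].mean())
print("observed mean MAR: ", people["weight_mar"].mean())

# within each gender the observed MAR mean is still close to the truth,
# which is why including gender helps control the bias
print(people.groupby("gender")["weight_mar"].mean())
```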

### Missing Not at Random, MNAR: 

Values are missing not at random (MNAR) if their being missing depends on information not recorded in the dataset. In other words, there is a mechanism or a reason why missing values are introduced in the dataset.

Example:

When a financial company asks for bank and identity documents from customers in order to prevent identity fraud, typically, fraudsters impersonating someone else will not upload documents, because they don't have them, precisely because they are fraudsters. Therefore, there is a systematic relationship between the missing documents and the target we want to predict: fraud.

## Real Life example: 

### Predicting Survival on the Titanic: understanding society behaviour and beliefs

Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. Interestingly, by analysing the probability of survival based on a few attributes like gender, age, and social status, we can make very accurate predictions about which passengers would survive. Some groups of people were more likely to survive than others, such as women, children, and the upper class. Therefore, we can learn about society's priorities and privileges at the time.

In [None]:
data = pd.read_csv('../input/home-credit-default-risk/application_train.csv')
data.head()

In [None]:
# you can determine the total number of missing values using
# the isnull method plus the sum method on the dataframe
data.isnull().sum()

In [None]:
# alternatively, you can call the mean method after isnull
# to visualise the percentage of the dataset that 
# contains missing values for each variable

data.isnull().mean()

We can see that there are missing data in the variables Age, Cabin (in which the passenger was travelling) and Embarked, which is the port from which the passenger got into the Titanic.

### Missing data Not At Random (MNAR): Systematic missing values
In this dataset, the missing values of both the variables Cabin and Age were introduced systematically. For many of the people who did not survive, the age they had or the cabin they were staying in could not be established. The people who survived could be asked for that information.

Can we infer this by looking at the data?

In a situation like this, we could expect a greater number of missing values for people who did not survive.

In [None]:
# we create a dummy variable that indicates whether the value
# of the variable is missing (Cabin in the Titanic example above;
# AMT_REQ_CREDIT_BUREAU_WEEK in the dataset loaded here)

data['AMT_REQ_CREDIT_BUREAU_WEEK_null'] = np.where(data.AMT_REQ_CREDIT_BUREAU_WEEK.isnull(), 1, 0)

# find the fraction of null values
data.AMT_REQ_CREDIT_BUREAU_WEEK_null.mean()

In [None]:
# and then we evaluate the mean of the missingness indicator
# within each group of the target (survivors vs non-survivors
# in the Titanic example; TARGET in the dataset loaded here)

# group data by TARGET and find the fraction of nulls in each group
data.groupby(['TARGET'])['AMT_REQ_CREDIT_BUREAU_WEEK_null'].mean()

We observe that the percentage of missing values is higher for people who did not survive (0.87) than for people who survived (0.60).
This finding is aligned with our hypothesis that the data is missing because after the people died, the information could not be retrieved.

Having said this, to truly establish whether the data is missing not at random, we would need to get extremely familiar with the way the data was collected. Analysing datasets can only point us in the right direction or help us build assumptions.

In [None]:
# we repeat the exercise (for Age in the Titanic example):
# first we create a dummy variable that indicates
# whether the value of the variable is missing

data['AMT_REQ_CREDIT_BUREAU_WEEK_null'] = np.where(data.AMT_REQ_CREDIT_BUREAU_WEEK.isnull(), 1, 0)

# and then look at the mean in the different target groups
# (in the Titanic example there are more NaN for the people who did not survive)
data.groupby(['TARGET'])['AMT_REQ_CREDIT_BUREAU_WEEK_null'].mean()

Again, we observe an increase in missing data for the people who did not survive the tragedy. The analysis therefore suggests: 

**There is a systematic loss of data: people who did not survive tend to have more information missing. Presumably, the method chosen to gather the information contributes to the generation of these missing data.**

### Missing data Completely At Random (MCAR)

In the Titanic dataset, there were also missing values for the variable Embarked. Let's have a look.

In [None]:
# slice the dataframe to show only those observations
# with a missing value for the variable
# (Embarked in the Titanic example; OWN_CAR_AGE in the dataset loaded here)

data[data.OWN_CAR_AGE.isnull()]

These 2 women were travelling together; Miss Icard was the maid of Mrs Stone.

A priori, there does not seem to be an indication that the missing information in the variable Embarked depends on any other variable, and the fact that these women survived means that they could have been asked for this information.

Very likely this missingness was generated at the time of building the dataset and therefore we could assume that it is completely random. We can assume that the probability of data being missing for these 2 women is the same as the probability for this variable to be missing for any other person. Of course this will be hard, if possible at all, to prove.

# 5) Imputation Methods

## Mean and median imputation
Mean/median imputation consists of replacing all occurrences of missing values (NA) within a variable by the mean (if the variable has a Gaussian distribution) or median (if the variable has a skewed distribution).

### Assumptions

Mean/median imputation assumes that the data are missing completely at random (MCAR). If this is the case, we can think of replacing the NA with a value representative of the centre of the distribution: the mean if the variable has a Gaussian distribution, or the median otherwise.

The rationale is to replace the population of missing values with the most typical value of the variable, since this is the most likely occurrence.

### Advantages

- Easy to implement
- Fast way of obtaining complete datasets

### Limitations

- Distortion of original variance
- Distortion of covariance with remaining variables within the dataset

When replacing NA with the mean or median, the variance of the variable will be distorted if the number of NA is large relative to the total number of observations (since the imputed values do not differ from the mean or from each other). This leads to an underestimation of the variance.

In addition, estimates of covariance and correlations with other variables in the dataset may also be affected.  This is because we may be destroying intrinsic correlations since the mean/median that now replace NA will not preserve the relation with the remaining variables.
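Both limitations can be checked on synthetic data. The sketch below is illustrative only: the variables `x` and `y`, the 30% missing rate and the noise levels are assumptions, not values taken from the dataset used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5_000

# two correlated synthetic variables
x = rng.normal(50, 10, n)
y = 2 * x + rng.normal(0, 5, n)
df = pd.DataFrame({"x": x, "y": y})

# remove 30% of x completely at random, then fill with the mean
df["x_missing"] = df["x"].mask(rng.random(n) < 0.3)
df["x_mean"] = df["x_missing"].fillna(df["x_missing"].mean())

# the imputed values sit exactly on the mean, so the variance shrinks
var_before = df["x_missing"].var()
var_after = df["x_mean"].var()

# the imputed values carry no information about y, so the correlation drops
corr_before = df["x_missing"].corr(df["y"])
corr_after = df["x_mean"].corr(df["y"])

print("variance before / after:   ", var_before, var_after)
print("correlation before / after:", corr_before, corr_after)
```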

### Final note
Replacement of NA with the mean/median is widely used in the data science community and in various data science competitions. Typically, it is done together with adding a binary variable that flags the observations where data was missing, thus covering 2 angles: if the data was missing completely at random, this would be contemplated by the mean imputation, and if it wasn't, this would be captured by the additional variable.

In addition, both methods are extremely straightforward to implement, and are therefore a top choice in data science competitions.

In [None]:
# let's look at the percentage of NA

data.isnull().mean()

### Important: fit the imputation on the training set

Imputation should be done over the training set, and then propagated to the test set. This means that the mean/median used to fill missing values in both the train and test sets should be extracted from the train set only. This avoids leaking information from the test set into the model, and therefore overfitting.

In [None]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(data, data.TARGET, test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

In [None]:
X_train.AMT_REQ_CREDIT_BUREAU_YEAR.median()

In [None]:
# let's make a function to create 2 variables from the chosen variable:
# one filling NA with the median, and another one filling NA with zeroes

def impute_na(df, variable, median):
    df[variable+'_median'] = df[variable].fillna(median)
    df[variable+'_zero'] = df[variable].fillna(0) 

In [None]:
impute_na(X_train, 'AMT_REQ_CREDIT_BUREAU_YEAR', X_train.AMT_REQ_CREDIT_BUREAU_YEAR.median())
X_train.head(15)

In [None]:
impute_na(X_test, 'AMT_REQ_CREDIT_BUREAU_YEAR', X_train.AMT_REQ_CREDIT_BUREAU_YEAR.median())

#### Mean/median imputation alters the variance of the original distribution of the variable


In [None]:
# we can see that the distribution has changed slightly with now more values accumulating towards the median
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['AMT_REQ_CREDIT_BUREAU_YEAR'].plot(kind='kde', ax=ax)
X_train.AMT_REQ_CREDIT_BUREAU_YEAR_median.plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

As mentioned above, the median imputation distorts the original distribution of the variable. The transformed variable shows more values around the median value.

In [None]:
# filling NA with zeroes creates a peak of population around 0, as expected
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['AMT_REQ_CREDIT_BUREAU_YEAR'].plot(kind='kde', ax=ax)
X_train.AMT_REQ_CREDIT_BUREAU_YEAR_zero.plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')


Filling NA with 0s also distorts the distribution of the original variable, generating an accumulation of values around 0. We will see below, under random sample imputation, a method of NA imputation that preserves the variable distribution.

### Machine learning model performance on different imputation methods

#### Logistic Regression

In [None]:
# Let's compare the performance of logistic regression using the variable
# filled with zeroes or, alternatively, with the median

# model on NA imputed with zeroes
logit = LogisticRegression(random_state=44, C=1000) # C big to avoid regularization
logit.fit(X_train[['AMT_REQ_CREDIT_BUREAU_YEAR_zero']], y_train)
print('Train set zero imputation')
pred = logit.predict_proba(X_train[['AMT_REQ_CREDIT_BUREAU_YEAR_zero']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set zero imputation')
pred = logit.predict_proba(X_test[['AMT_REQ_CREDIT_BUREAU_YEAR_zero']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))
print()

# model on NA imputed with median
logit = LogisticRegression(random_state=44, C=1000) # C big to avoid regularization
logit.fit(X_train[['AMT_REQ_CREDIT_BUREAU_YEAR_median']], y_train)
print('Train set median imputation')
pred = logit.predict_proba(X_train[['AMT_REQ_CREDIT_BUREAU_YEAR_median']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set median imputation')
pred = logit.predict_proba(X_test[['AMT_REQ_CREDIT_BUREAU_YEAR_median']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

We see that median imputation leads to better performance of the logistic regression. Why?

Children were more likely to survive the catastrophe (0.57 for children vs 0.38 for the entire Titanic). Thus, smaller values of Age are a good indicator of survival.

When we replace NA with zeroes, we are masking the predictive power of Age. After zero imputation it looks like children did not have a greater chance of survival, and therefore the model loses predictive power.

On the other hand, replacing NA with the median preserves the predictive power of the variable Age, as smaller Age values will favour survival.

## Random sample imputation

Random sampling imputation is in principle similar to mean/median imputation, in the sense that it aims to preserve the statistical parameters of the original variable, for which data is missing.

Random sampling consists of taking a random observation from the pool of available observations of the variable, and using that randomly extracted value to fill the NA. In random sampling one takes as many random observations as there are missing values in the variable.

By randomly sampling observations of the variable from those instances where data is available, we approximately preserve the mean and standard deviation of the variable.


### Assumptions

Random sample imputation assumes that the data are missing completely at random (MCAR). If this is the case, it makes sense to substitute the missing values, by values extracted from the original variable distribution. 

From a probabilistic  point of view, values that are more frequent (like the mean or the median) will be selected more often (because there are more of them to select from), but other less frequent values will be selected as well. Thus, the variance of the variable is preserved. 

The rationale is to replace the population of missing values with a population of values with the same distribution of the variable.
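A minimal sketch of this rationale on synthetic data, compared against median imputation. The skewed distribution, the 30% missing rate and the variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# skewed synthetic variable with 30% of the values removed at random
s = pd.Series(rng.exponential(scale=2.0, size=10_000))
s_missing = s.mask(rng.random(len(s)) < 0.3)

# draw one observed value per missing value
n_missing = int(s_missing.isnull().sum())
sample = s_missing.dropna().sample(n_missing, random_state=0)
# align the sample with the positions of the missing values
sample.index = s_missing[s_missing.isnull()].index

s_random = s_missing.copy()
s_random.loc[s_missing.isnull()] = sample

# median imputation for comparison
s_median = s_missing.fillna(s_missing.median())

print("std observed values:  ", s_missing.std())
print("std random imputation:", s_random.std())
print("std median imputation:", s_median.std())
```

The standard deviation after random sample imputation stays close to that of the observed values, while median imputation shrinks it.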


### Advantages

- Easy to implement
- Fast way of obtaining complete datasets
- Preserves the variance of the variable

### Limitations

- Randomness

Randomness may not seem much of a concern when replacing missing values for data competitions, where the whole batch of missing values is replaced once and then the dataset is scored and that is the end of the problem. However, in business scenarios the situation is very different. 

Imagine, for example, that Mercedes-Benz is trying to predict how long a certain car will be in the garage before it passes all the security tests. Today, they receive a car with missing data in some of the variables. They run the machine learning model to predict how long this car will stay in the garage; the model replaces missing values with a random sample of the variable and then produces an estimate of time. Tomorrow, when they run the same model on the same car, the model will again randomly assign values to the missing data, and these may or may not be the same as the ones it selected today. Therefore, the final estimate of time in the garage may or may not be the same as the one obtained the day before.

In addition, imagine also that Mercedes-Benz evaluates 2 different cars that have exactly the same values for all of the variables, and missing values in exactly the same subset of variables. They run the machine learning model for each car, and because the missing data is randomly filled with values, the 2 cars, that are exactly the same, may end up with different estimates of time in the garage. 

This may sound completely trivial and unimportant; however, businesses must follow a variety of regulations, and some of them require that the same treatment be provided to the same situation. So if instead of cars, these were people applying for a loan, or people seeking some disease treatment, the machine learning model would end up providing different solutions to candidates that are otherwise in the same conditions. And this is not fair or acceptable.

It is still possible to replace missing data by random sample, but this randomness needs to be controlled, so that individuals in the same situation end up with the same scores and therefore the same solutions.
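One possible way to control the randomness is to derive the seed from the observation's own values, so that identical observations always receive the same imputed value. The sketch below is an assumption about how this could be done, not a prescribed method; the columns `age` and `fare` and the helper `impute_random_seeded` are hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical training data: the pool of observed values to sample from
train = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0],
    "fare": [7.25, 71.28, 7.92, 53.1, 51.86, 21.07, 11.13, 30.07],
})
pool = train["age"].dropna()

def impute_random_seeded(row, pool, variable="age", seed_from="fare"):
    """Fill a missing value with a random draw seeded by another column."""
    if pd.isnull(row[variable]):
        # same value in `seed_from` -> same seed -> same imputed value
        seed = int(row[seed_from])
        return pool.sample(1, random_state=seed).iloc[0]
    return row[variable]

# two identical new observations with missing age, plus a complete one
new = pd.DataFrame({"age": [np.nan, np.nan, 40.0],
                    "fare": [15.5, 15.5, 80.0]})
new["age_imputed"] = new.apply(impute_random_seeded, axis=1, pool=pool)
print(new)
```

Running the cell again, today or tomorrow, gives the two identical observations the same imputed value, because the seed depends only on their own data.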

Finally, another potential limitation of random sampling, similar to replacing with the mean or median, is that estimates of covariance and correlations with other variables in the dataset may also be washed out by the randomness.

### Final note

Replacement of missing values by random sample, although similar in concept to replacement by the median or mean, is not as widely used in the data science community as the mean/median imputation, presumably because of the element of randomness.

However, it is a valid approach, with advantages over mean/median imputation as it preserves the distribution of the variable. And if you are mindful of the element of randomness and account for it somehow, this may as well be your method of choice.

In [None]:
# load the dataset with a few variables for demonstration

data = pd.read_csv('../input/home-credit-default-risk/application_train.csv', usecols = ['AMT_REQ_CREDIT_BUREAU_YEAR', 'TARGET'])
data.head()

In [None]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(data, data.TARGET, test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

In [None]:
def impute_na(df, variable, median):
    df[variable+'_median'] = df[variable].fillna(median)
    df[variable+'_zero'] = df[variable].fillna(0)
    
    # random sampling
    df[variable+'_random'] = df[variable]
    # extract the random sample to fill the na
    random_sample = X_train[variable].dropna().sample(df[variable].isnull().sum(), random_state=0)
    # pandas needs to have the same index in order to merge datasets
    random_sample.index = df[df[variable].isnull()].index
    df.loc[df[variable].isnull(), variable+'_random'] = random_sample

In [None]:
impute_na(X_train, 'AMT_REQ_CREDIT_BUREAU_YEAR', X_train.AMT_REQ_CREDIT_BUREAU_YEAR.median())
X_train.head(20)

In [None]:
impute_na(X_test, 'AMT_REQ_CREDIT_BUREAU_YEAR', X_train.AMT_REQ_CREDIT_BUREAU_YEAR.median())

#### Random sampling preserves the original distribution of the variable

In [None]:
# we can see that the distribution of the variable after filling NA is almost exactly the same as the one before filling NA
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['AMT_REQ_CREDIT_BUREAU_YEAR'].plot(kind='kde', ax=ax)
X_train.AMT_REQ_CREDIT_BUREAU_YEAR_random.plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

In [None]:
# let's compare the performance of logistic regression on the variable with NA imputed by zeroes, the median or random sampling

# model on NA imputed with zeroes
logit = LogisticRegression(random_state=44, C=1000) # C big to avoid regularization
logit.fit(X_train[['AMT_REQ_CREDIT_BUREAU_YEAR_zero']], y_train)
print('Train set zero imputation')
pred = logit.predict_proba(X_train[['AMT_REQ_CREDIT_BUREAU_YEAR_zero']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set zero imputation')
pred = logit.predict_proba(X_test[['AMT_REQ_CREDIT_BUREAU_YEAR_zero']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))
print()

# model on NA imputed with median
logit = LogisticRegression(random_state=44, C=1000) # C big to avoid regularization
logit.fit(X_train[['AMT_REQ_CREDIT_BUREAU_YEAR_median']], y_train)
print('Train set median imputation')
pred = logit.predict_proba(X_train[['AMT_REQ_CREDIT_BUREAU_YEAR_median']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set median imputation')
pred = logit.predict_proba(X_test[['AMT_REQ_CREDIT_BUREAU_YEAR_median']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))
print()

# model on NA imputed with a random sample
logit = LogisticRegression(random_state=44, C=1000) # C big to avoid regularization
logit.fit(X_train[['AMT_REQ_CREDIT_BUREAU_YEAR_random']], y_train)
print('Train set random sample imputation')
pred = logit.predict_proba(X_train[['AMT_REQ_CREDIT_BUREAU_YEAR_random']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set random sample imputation')
pred = logit.predict_proba(X_test[['AMT_REQ_CREDIT_BUREAU_YEAR_random']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

We can see that replacing the NA with a random sample of the dataset does not perform as well as replacing with the median. However, this is entirely due to randomness. I invite you to change the seed (random_state) in the impute_na function, then recreate X_train and X_test, and you will see how the performance of logistic regression varies. In some cases, the performance will be better.

So if the performance of median imputation and random sample imputation is similar, which method should I use?

Choosing which imputation method to use will depend on various things:
- are NA missing completely at random?
- do you want to preserve the distribution of the variable?
- are you willing to accept an element of randomness in your imputation method?
- are you aiming to win a data competition? or to make business driven decisions?

There is no 'correct' answer to which imputation method you should use; it rather depends on what you are trying to achieve.

## Adding a variable to capture NA

In the previous sections we studied how to replace missing values by mean/median imputation or by extracting a random sample of the variable for those instances where data is available, and using those values to replace the missing values. We also discussed that these 2 methods assume that the missing data are missing completely at random (MCAR).

So what if the data are not missing completely at random? By using this procedure, we would be missing important, predictive information.

How can we prevent that?

We can capture the importance of missingness by creating an additional variable indicating whether the data was missing for that observation (1) or not (0). The additional variable is a binary variable: it takes only the values 0 and 1, 0 indicating that a value was present for that observation, and 1 indicating that the value was missing for that observation.


### Advantages

- Easy to implement
- Captures the importance of missingness if there is one

### Disadvantages

- Expands the feature space

This method of imputation will add 1 variable per variable in the dataset with missing values. So if a dataset contains 10 features, and all of them have missing values, we will end up with a dataset with 20 features: the original 10 features, where we replaced the missing values by the mean/median (or random sampling), and 10 additional features indicating, for each of the variables, whether the value was missing or not.

This may not be a problem in datasets with tens to a few hundreds of variables, but if your original dataset contains thousands of variables, by creating an additional variable to indicate NA, you will end up with very big datasets.

In addition, data tends to be missing for the same observation on multiple variables, so it may also be the case, that many of your added variables will be actually similar to each other.
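The overlap between indicators can be checked directly. The sketch below, on a hypothetical toy frame, builds one indicator per variable with missing values and then keeps only the indicators that are not exact copies of one another; the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical data: a and b are missing on exactly the same rows
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0],
    "b": [10.0, np.nan, 30.0, np.nan, 50.0],
    "c": [np.nan, 2.0, 3.0, 4.0, 5.0],
    "d": [1, 2, 3, 4, 5],  # no missing values, so no indicator is needed
})

# one binary indicator per variable that contains missing values
cols_with_na = [col for col in df.columns if df[col].isnull().any()]
indicators = pd.DataFrame(
    {col + "_NA": df[col].isnull().astype(int) for col in cols_with_na}
)

# drop indicators that are identical to an earlier one
# (transpose so that duplicated columns become duplicated rows)
unique_indicators = indicators.T.drop_duplicates().T

print(list(indicators.columns))
print(list(unique_indicators.columns))
```

Here `b_NA` is dropped because it carries exactly the same information as `a_NA`.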


### Final note

Typically, mean/median imputation is done together with adding a variable to capture those observations where the data was missing (see the section "Mean and median imputation"), thus covering 2 angles: if the data was missing completely at random, this would be contemplated by the mean imputation, and if it wasn't, this would be captured by the additional variable.


In [None]:
# load the dataset with a few variables for demonstration

data = pd.read_csv('../input/home-credit-default-risk/application_train.csv', usecols = ['OWN_CAR_AGE','TARGET'])
data.head()

In [None]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(data, data.TARGET, test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

In [None]:
# create variable indicating missingness

X_train['OWN_CAR_AGE_NA'] = np.where(X_train['OWN_CAR_AGE'].isnull(), 1, 0)
X_test['OWN_CAR_AGE_NA'] = np.where(X_test['OWN_CAR_AGE'].isnull(), 1, 0)

X_train.head()

In [None]:
# let's replace the NA with the median value of the training set
# (computed once, and assigned back instead of filling in place)
median = X_train.OWN_CAR_AGE.median()
X_train['OWN_CAR_AGE'] = X_train['OWN_CAR_AGE'].fillna(median)
X_test['OWN_CAR_AGE'] = X_test['OWN_CAR_AGE'].fillna(median)

X_train.head(20)

In [None]:
scaler = StandardScaler()
# keep the column names and the original index after scaling
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

In [None]:
# we compare the models built using OWN_CAR_AGE filled with the median, vs OWN_CAR_AGE
# filled with the median + additional variable indicating missingness

logit = LogisticRegression(random_state=44, C=1000) # C big to avoid regularization
logit.fit(X_train[['OWN_CAR_AGE']], y_train)
print('Train set')
pred = logit.predict_proba(X_train[['OWN_CAR_AGE']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = logit.predict_proba(X_test[['OWN_CAR_AGE']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))
print()

logit = LogisticRegression(random_state=44, C=1000) # C big to avoid regularization
logit.fit(X_train[['OWN_CAR_AGE','OWN_CAR_AGE_NA']], y_train)
print('Train set')
pred = logit.predict_proba(X_train[['OWN_CAR_AGE','OWN_CAR_AGE_NA']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = logit.predict_proba(X_test[['OWN_CAR_AGE','OWN_CAR_AGE_NA']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

## End of the distribution imputation

On occasion, one has reasons to suspect that missing values are not missing at random. If the value is missing, there has to be a reason for it, and we would like to capture this information.

Adding an additional variable indicating missingness may help with this task (as we discussed in the previous section). However, the values are still missing in the original variable, and they need to be replaced if we plan to use the variable in machine learning.

Sometimes, we may also not want to increase the feature space by adding a variable to capture missingness.

So what can we do instead?

We can replace the NA, by values that are at the far end of the distribution of the variable.

The rationale is that if the value is missing, it has to be for a reason; therefore, we would not like to replace missing values with the mean and make that observation look like the majority of our observations. Instead, we want to flag that observation as different, and therefore we assign a value that is at the tail of the distribution, where observations are rarely represented in the population.

### Advantages

- Easy to implement
- Captures the importance of missingness if there is one

### Disadvantages

- Distorts the original distribution of the variable
- If missingness is not important, it may mask the predictive power of the original variable by distorting its distribution
- If the number of NA is large, it will mask true outliers in the distribution
- If the number of NA is small, the replaced NA may be considered an outlier and pre-processed in a subsequent step of feature engineering
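The masking of true outliers can be seen on a small hypothetical series. In this sketch, outliers are counted with the usual boxplot rule (values above Q3 + 1.5 × IQR); the data and the helper `count_upper_outliers` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical variable: 20 regular values, 2 true outliers, 18 missing values
s = pd.Series(list(range(1, 21)) + [40, 45] + [np.nan] * 18, dtype=float)

def count_upper_outliers(x):
    """Count values above the upper boxplot fence, Q3 + 1.5 * IQR."""
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    upper_fence = q3 + 1.5 * (q3 - q1)
    return int((x > upper_fence).sum())

# far end of the distribution, computed on the observed values
far_end = s.mean() + 3 * s.std()
s_far_end = s.fillna(far_end)

outliers_before = count_upper_outliers(s.dropna())
outliers_after = count_upper_outliers(s_far_end)

print("far end value:  ", far_end)
print("outliers before:", outliers_before)  # 40 and 45 are flagged
print("outliers after: ", outliers_after)   # the fence moved up, nothing is flagged
```

With this many NA, the imputed values pull the third quartile up to the far end value itself, so the two true outliers are no longer flagged.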


### Final note

This method is used in finance companies. When capturing the financial history of customers, if some of the variables are missing, the company does not like to assume that missingness is random. Therefore, a different treatment is provided to replace them, by placing them at the end of the distribution.

In [None]:
# load the dataset with a few variables for demonstration

data = pd.read_csv('../input/home-credit-default-risk/application_train.csv', usecols = ['OWN_CAR_AGE', 'TARGET'])
data.head()

In [None]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(data, data.TARGET, test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

In [None]:
X_train.OWN_CAR_AGE.hist(bins=50)

In [None]:
# far end of the distribution
X_train.OWN_CAR_AGE.mean()+3*X_train.OWN_CAR_AGE.std()

In [None]:
# we see that there are a few outliers for OWN_CAR_AGE, according to its distribution
# these outliers will be masked when we replace NA by values at the far end 
# see below

sns.boxplot(x='OWN_CAR_AGE', data=data)

In [None]:
def impute_na(df, variable, median, extreme):
    # far-end imputation goes into a new column, median imputation overwrites the original
    df[variable+'_far_end'] = df[variable].fillna(extreme)
    df[variable] = df[variable].fillna(median)

In [None]:
# compute both replacement values on the training set before any imputation,
# then apply the same numbers to the training and testing sets
median = X_train.OWN_CAR_AGE.median()
extreme = X_train.OWN_CAR_AGE.mean()+3*X_train.OWN_CAR_AGE.std()

impute_na(X_train, 'OWN_CAR_AGE', median, extreme)
impute_na(X_test, 'OWN_CAR_AGE', median, extreme)

X_train.head(20)

In [None]:
# we see an accumulation of values around the median for the median imputation
X_train.OWN_CAR_AGE.hist(bins=50)

In [None]:
# we see an accumulation of values at the far end for the far end imputation

X_train.OWN_CAR_AGE_far_end.hist(bins=50)

In [None]:
# indeed, far end imputation now indicates that there are no outliers in the variable
sns.boxplot(x='OWN_CAR_AGE_far_end', data=X_train)

In [None]:
# on the other hand, replacing values by the median, now generates the impression of a higher
# amount of outliers

sns.boxplot(x='OWN_CAR_AGE', data=X_train)

In [None]:
# we compare the models built using OWN_CAR_AGE filled with the median
# vs OWN_CAR_AGE filled with values at the far end of the distribution

rf = RandomForestClassifier(random_state=44)
rf.fit(X_train[['OWN_CAR_AGE']], y_train)
print('Train set')
pred = rf.predict_proba(X_train[['OWN_CAR_AGE']])
print('Random Forest roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = rf.predict_proba(X_test[['OWN_CAR_AGE']])
print('Random Forest roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

rf = RandomForestClassifier(random_state=44)
rf.fit(X_train[['OWN_CAR_AGE_far_end']], y_train)
print('Train set')
pred = rf.predict_proba(X_train[['OWN_CAR_AGE_far_end']])
print('Random Forest roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = rf.predict_proba(X_test[['OWN_CAR_AGE_far_end']])
print('Random Forest roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

## Arbitrary value imputation

Replacing the NA by arbitrary values should be used when there are reasons to believe that the NA are not missing at random. In situations like this, we do not want to replace them with the median or the mean, which would make the NA look like the majority of our observations.

Instead, we want to flag them. We want to capture the missingness somehow.

In previous lectures we saw 2 methods to do this:

1) adding an additional binary variable to indicate whether the value is missing (1) or not (0)

2) replacing the NA by a value at a far end of the distribution
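
As a quick reminder of option 1, here is a minimal sketch on toy data. The column names `x` and `x_NA` are hypothetical and chosen only for this illustration:

```python
import numpy as np
import pandas as pd

# toy variable with missing values (hypothetical data)
df = pd.DataFrame({'x': [1.5, np.nan, 3.0, np.nan]})

# binary indicator: 1 if the value is missing, 0 otherwise
df['x_NA'] = df['x'].isna().astype(int)

print(df)
```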

Here, I suggest an alternative to option 2, which I have seen in several Kaggle competitions. It consists of replacing the NA by an arbitrary value. The value can be anything you like, but ideally it differs from the median/mean/mode and lies outside the normal values of the variable.

The difficulty lies in deciding which arbitrary value to choose.

### Advantages

- Easy to implement
- Captures the importance of missingness if there is one

### Disadvantages

- Distorts the original distribution of the variable
- If missingness is not important, it may mask the predictive power of the original variable by distorting its distribution
- Hard to decide which value to use
- If the value is outside the distribution it may mask or create outliers

### Final note

When variables are captured by third parties, like credit agencies, they already place arbitrary numbers to signal this missingness. So even if it is not common practice in data competitions, it is common practice in real-life data collection.

In [None]:
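# load the Home Credit dataset with a few variables for demonstration
# (the cells below use the OWN_CAR_AGE and TARGET columns)

data = pd.read_csv('../input/home-credit-default-risk/application_train.csv',
                   usecols=['OWN_CAR_AGE', 'TARGET'])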
data.head()

In [None]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(data, data.TARGET, test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

In [None]:
def impute_na(df, variable):
    df[variable+'_zero'] = df[variable].fillna(0)
    df[variable+'_hundred']= df[variable].fillna(100)

In [None]:
# let's replace the NA with 0 and with 100, in both the training and testing sets
impute_na(X_train, 'OWN_CAR_AGE')
impute_na(X_test, 'OWN_CAR_AGE')

X_train.head(20)

In [None]:
# we compare the models built using OWN_CAR_AGE filled with zero, vs OWN_CAR_AGE filled with 100

rf = RandomForestClassifier(random_state=44)
rf.fit(X_train[['OWN_CAR_AGE_zero']], y_train)
print('Train set')
pred = rf.predict_proba(X_train[['OWN_CAR_AGE_zero']])
print('Random Forest roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = rf.predict_proba(X_test[['OWN_CAR_AGE_zero']])
print('Random Forest roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

rf = RandomForestClassifier(random_state=44)
rf.fit(X_train[['OWN_CAR_AGE_hundred']], y_train)
print('Train set')
pred = rf.predict_proba(X_train[['OWN_CAR_AGE_hundred']])
print('Random Forest roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = rf.predict_proba(X_test[['OWN_CAR_AGE_hundred']])
print('Random Forest roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Compare the two test-set ROC-AUC values above to see which arbitrary value works better for this variable. In the Titanic version of this exercise, replacing NA in Age with 100 made the models perform better than replacing NA with 0. As you may remember from the lecture "Replacing NA by mean or median", this is because children were more likely to survive than adults, so filling NA with zeroes distorts this relation and makes the models lose predictive power.

### Final notes

The arbitrary value has to be determined for each variable specifically. For example, for the Age variable of the Titanic dataset, replacing NA by 0 or 100 is valid, because neither value is frequent in the original distribution of the variable, and both lie at the tails of the distribution.

However, if we were to replace NA in Fare, those values would no longer be good choices, because Fare can take values of up to 500. So we might want to consider using 500 or 1000 to replace NA instead of 100.

As you can see this is totally arbitrary. And yet, it is used in the industry.

Typical values chosen by companies are -9999 or 9999, or similar.
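
A minimal sketch of how one might sanity-check an arbitrary value before using it. The toy `fare` values and the choice of -9999 are assumptions made for this illustration only:

```python
import numpy as np
import pandas as pd

# toy variable with missing values (hypothetical data)
fare = pd.Series([7.25, 71.28, np.nan, 512.33, 8.05, np.nan])

ARBITRARY = -9999  # candidate value, one of the typical industry choices

# the arbitrary value should not already occur among the observed values...
already_present = (fare == ARBITRARY).any()
# ...and should fall outside the observed range of the variable
outside_range = ARBITRARY < fare.min() or ARBITRARY > fare.max()

fare_imputed = fare.fillna(ARBITRARY)

print(already_present, outside_range)
print(fare_imputed.tolist())
```

If either check fails, a different value has to be picked for that variable, which is why the choice cannot be made once for the whole dataset.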

# 6) Engineer labels of categorical variables

In this section, I will describe a variety of methods to transform the strings of categorical variables into numbers, so that we can feed these variables into machine learning algorithms.

## One Hot Encoding

One hot encoding consists of replacing the categorical variable by a set of boolean variables, which take the value 0 or 1 to indicate whether or not a certain category / label of the variable was present for that observation.

Each of these boolean variables is also known as a **dummy variable** or binary variable.

For example, from the categorical variable "Gender", with labels 'female' and 'male', we can generate the boolean variable "female", which takes 1 if the person is female or 0 otherwise. We can also generate the variable male, which takes 1 if the person is "male" and 0 otherwise. 
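
The Gender example can be sketched on a toy dataframe as follows. The four rows are made up, and `dtype=int` is passed only so that the dummies print as 0/1 rather than booleans:

```python
import pandas as pd

# toy data (hypothetical)
people = pd.DataFrame({'Gender': ['female', 'male', 'male', 'female']})

# one dummy column per label of the variable
dummies = pd.get_dummies(people['Gender'], dtype=int)

print(pd.concat([people, dummies], axis=1))
```

Each row has exactly one dummy set to 1, the one matching its original label.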

### Advantages

- Straightforward to implement
- Makes no assumption
- Keeps all the information of the categorical variable

### Disadvantages

- Does not add any information that may make the variable more predictive
- If the variable has loads of categories, then OHE increases the feature space dramatically

### Notes

If our datasets have a few variables with many labels, we will very soon end up with datasets with thousands of columns or more. This may make the training of our algorithms slow.

In addition, many of these dummy variables may be similar to each other, since it is not unusual for 2 or more variables to share the same combinations of 1 and 0s.
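
Both points can be checked before committing to OHE. The sketch below uses a made-up dataframe with hypothetical `city` and `country` columns: it counts how many dummy columns would be created, and then looks for dummy columns that carry exactly the same pattern of 1s and 0s:

```python
import pandas as pd

# toy data (hypothetical): each city belongs to one country
df = pd.DataFrame({
    'city':    ['A', 'B', 'C', 'A', 'D'],
    'country': ['X', 'Y', 'Y', 'X', 'Z'],
})

# number of dummy columns OHE would create: one per distinct label
n_dummies = df.nunique().sum()

dummies = pd.get_dummies(df, dtype=int)

# dummy columns whose 0/1 pattern repeats that of an earlier column
duplicated = dummies.T.duplicated()

print(n_dummies, dummies.shape)
print(duplicated[duplicated].index.tolist())
```

Here `country_X` repeats `city_A` and `country_Z` repeats `city_D`, so two of the seven dummies add nothing new.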

In [None]:
import pandas as pd

data = pd.read_csv('../input/home-credit-default-risk/application_train.csv', usecols=['CODE_GENDER'])
data.head()

In [None]:
# one hot encoding

pd.get_dummies(data).head()

In [None]:
# for better visualisation
pd.concat([data, pd.get_dummies(data)], axis=1).head()

In [None]:
data = pd.read_csv('../input/home-credit-default-risk/application_train.csv')
data.head()

In [None]:
# let's make a copy of the dataset, in which we encode the categorical variables using OHE

data_OHE = pd.concat([data[['TARGET', 'OWN_CAR_AGE', 'CNT_CHILDREN']], # numerical variables 
                      pd.get_dummies(data.CODE_GENDER),   # categorical variable with labels F, M, XNA
                      pd.get_dummies(data.FLAG_OWN_REALTY)],  # binary categorical variable (N, Y)
                    axis=1)

data_OHE.head()

In [None]:
# and now let's separate into train and test set
# (NA in OWN_CAR_AGE are kept here and imputed in the next cell)

X_train, X_test, y_train, y_test = train_test_split(data_OHE[['OWN_CAR_AGE', 'CNT_CHILDREN', 'F', 'M', 'XNA', 'N', 'Y']],
                                                    data_OHE.TARGET,
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

In [None]:
def impute_na(df, variable, extreme):
    df[variable] = df[variable].fillna(extreme)

# far-end value computed once on the training set, before any imputation
extreme = X_train.OWN_CAR_AGE.mean()+3*X_train.OWN_CAR_AGE.std()
impute_na(X_train, 'OWN_CAR_AGE', extreme)
impute_na(X_test, 'OWN_CAR_AGE', extreme)

In [None]:
# and finally a random forest

rf = RandomForestClassifier(random_state=44)
rf.fit(X_train, y_train)
print('Train set')
pred = rf.predict_proba(X_train)
print('Random Forest roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = rf.predict_proba(X_test)
print('Random Forest roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

## Count or frequency encoding

Variables that have a multitude of categories are also said to have **high cardinality**.

We observed in the previous lecture that if a categorical variable contains multiple labels, then re-encoding them using one hot encoding expands the feature space dramatically.

One approach that is heavily used in Kaggle competitions is to replace each label of the categorical variable by its count, that is, the number of times the label appears in the dataset, or by its frequency, that is, the fraction of observations within that category. The two carry the same information, since the frequency is the count divided by the total number of observations.

There is no rationale behind this transformation other than its simplicity.
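
A minimal sketch of both variants on a toy variable (the `colour` values are made up for illustration):

```python
import pandas as pd

# toy categorical variable (hypothetical data)
colour = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

# label -> number of times it appears
count_map = colour.value_counts().to_dict()
# label -> fraction of observations with that label
freq_map = colour.value_counts(normalize=True).to_dict()

colour_count = colour.map(count_map)
colour_freq = colour.map(freq_map)

print(colour_count.tolist())
print(colour_freq.tolist())
```

The frequency column is the count column divided by the number of observations, so a model sees the same ordering of labels either way.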

### Advantages

- Simple
- Does not expand the feature space

### Disadvantages

- If 2 labels appear the same number of times in the dataset, that is, contain the same number of observations, they will be merged, and we may lose valuable information
- Adds somewhat arbitrary numbers, and therefore weights, to the different labels, which may not be related to their predictive power

In [None]:
data = pd.read_csv('../input/home-credit-default-risk/application_train.csv')
data.head()

### Important

When doing count transformation of categorical variables, it is important to calculate the count (or frequency = count/total observations) **over the training set**, and then use those numbers to replace the labels in the test set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data[['OWN_CAR_AGE', 'CNT_CHILDREN']].fillna(0),
                                                    data.TARGET,
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

In [None]:
# And now let's replace each label in OWN_CAR_AGE by its count

# first we make a dictionary that maps each label to the counts
X_frequency_map = X_train.OWN_CAR_AGE.value_counts().to_dict()

# and now we replace the labels both in train and test set with the same map
# (labels never seen in the training set get a count of 0)
X_train['OWN_CAR_AGE'] = X_train.OWN_CAR_AGE.map(X_frequency_map)
X_test['OWN_CAR_AGE'] = X_test.OWN_CAR_AGE.map(X_frequency_map).fillna(0)

X_train.head()

In [None]:
# And now let's replace each label in CNT_CHILDREN by its count

# first we make a dictionary that maps each label to the counts
X_frequency_map = X_train.CNT_CHILDREN.value_counts().to_dict()

# and now we replace the labels both in train and test set with the same map
# (labels never seen in the training set get a count of 0)
X_train['CNT_CHILDREN'] = X_train.CNT_CHILDREN.map(X_frequency_map)
X_test['CNT_CHILDREN'] = X_test.CNT_CHILDREN.map(X_frequency_map).fillna(0)

X_train.head()

In [None]:
# and finally a random forest

rf = RandomForestClassifier(random_state=44)
rf.fit(X_train, y_train)
print('Train set')
pred = rf.predict_proba(X_train)
print('Random Forest roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = rf.predict_proba(X_test)
print('Random Forest roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

### Note

I want you to keep in mind something important:

If a category is present in the test set but was not present in the train set, this method will generate missing data in the test set. This is why it is extremely important to handle rare categories.

Then we can combine rare label replacement plus categorical encoding with counts like this: we may choose to replace the 10 most frequent labels by their count, and then group all the other labels under one label (for example "Rare"), and replace "Rare" by its count.
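
A minimal sketch of this combination on toy data. The label values are made up, and the 2 most frequent labels are kept instead of 10 to keep the example short:

```python
import pandas as pd

# toy train and test variables (hypothetical data)
train = pd.Series(['a'] * 4 + ['b'] * 3 + ['c', 'd'])
test = pd.Series(['a', 'b', 'e'])   # 'e' never appears in train

# a plain count map leaves the unseen label as NaN in the test set
count_map = train.value_counts().to_dict()
naive = test.map(count_map)

# keep the 2 most frequent training labels, group all others under 'Rare'
top = train.value_counts().index[:2]
train_grouped = train.where(train.isin(top), 'Rare')
test_grouped = test.where(test.isin(top), 'Rare')

# count map built on the grouped training labels, applied to both sets
grouped_map = train_grouped.value_counts().to_dict()
train_encoded = train_grouped.map(grouped_map)
test_encoded = test_grouped.map(grouped_map)

print(naive.tolist())
print(test_encoded.tolist())
```

Because the unseen label `e` falls into the "Rare" group, it receives the count of that group instead of a missing value.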

## Target guided ordinal encoding

In the previous lectures in this section on how to engineer the labels of categorical variables, we learnt how to convert a label into a number, by using one hot encoding or replacing by frequency or counts. These methods are simple, make no assumptions and generally work well in different scenarios.

There are however methods that allow us to capture information while pre-processing the labels of categorical variables. These methods include:

- Ordering the labels according to the target
- Replacing labels by the risk (of the target)
- Replacing the labels by the joint probability of the target being 1 or 0
- Weight of evidence.

### Advantages

- Capture information within the label, therefore rendering more predictive features
- Create a monotonic relationship between the variable and the target
- Do not expand the feature space

### Disadvantage

- Prone to cause over-fitting


### Ordering labels according to the target

Ordering the labels according to the target means assigning a number to each label, where the numbering is informed by the mean of the target within the label.

Briefly, we calculate the mean of the target for each label/category, then we order the labels according to these means from smallest to biggest, and we number them accordingly.
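
On toy data, the three steps can be sketched as follows (the `label` and `target` columns are hypothetical):

```python
import pandas as pd

# toy training data (hypothetical)
train = pd.DataFrame({
    'label':  ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'c'],
    'target': [1,   1,   0,   1,   0,   0,   0,   1],
})

# 1) mean of the target per label, 2) sorted from smallest to biggest
means = train.groupby('label')['target'].mean().sort_values()

# 3) number the labels in that order
ordinal_map = {label: rank for rank, label in enumerate(means.index)}
train['label_ordered'] = train['label'].map(ordinal_map)

print(means)
print(ordinal_map)
```

The label with the lowest target mean (`c`, 0.25) gets 0 and the one with the highest (`a`, 1.0) gets 2.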

See the example below:

In [None]:
# let's load the Home Credit dataset again

data = pd.read_csv('../input/home-credit-default-risk/application_train.csv', usecols=['OWN_CAR_AGE', 'TARGET'])
data.head()

In [None]:
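# let's first fill NA values with an additional label ('Missing'),
# then keep only the first character of each value to obtain a small set of labels

# cast to object first so the string label can be stored next to the numbers
data['OWN_CAR_AGE'] = data['OWN_CAR_AGE'].astype(object).fillna('Missing')
data['OWN_CAR_AGE'] = data['OWN_CAR_AGE'].astype(str).str[0]
data.OWN_CAR_AGE.unique()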

In [None]:
data.head()

In [None]:
# Let's separate into train and test set

X_train, X_test, y_train, y_test = train_test_split(
    data[['OWN_CAR_AGE', 'TARGET']], data.TARGET, test_size=0.3, random_state=0)

X_train.shape, X_test.shape

In [None]:
# now we order the labels according to the mean target value

X_train.groupby(['OWN_CAR_AGE'])['TARGET'].mean().sort_values()

In [None]:
# and now we create a dictionary that maps each label to the number
ordered_labels = X_train.groupby(['OWN_CAR_AGE'])['TARGET'].mean().sort_values().index
ordinal_label = {k:i for i, k in enumerate(ordered_labels, 0)} 
ordinal_label

This method assigns the number 0 to the category with the lowest target mean, and the largest number to the category with the highest target mean.

In [None]:
# replace the labels with the ordered numbers
# both in train and test set (note that we created the dictionary only using the training set)

X_train['OWN_CAR_AGE_ordered'] = X_train.OWN_CAR_AGE.map(ordinal_label)
X_test['OWN_CAR_AGE_ordered'] = X_test.OWN_CAR_AGE.map(ordinal_label)

In [None]:
# check the results

X_train.head()

In [None]:
# let's inspect the newly created monotonic relationship with the target

#first we plot the original variable for comparison, there is no monotonic relationship

fig = plt.figure()
fig = X_train.groupby(['OWN_CAR_AGE'])['TARGET'].mean().plot()
fig.set_title('Normal relationship between variable and target')
fig.set_ylabel('TARGET')

In [None]:
# plot the transformed result: the monotonic variable

fig = plt.figure()
fig = X_train.groupby(['OWN_CAR_AGE_ordered'])['TARGET'].mean().plot()
fig.set_title('Monotonic relationship between variable and target')
fig.set_ylabel('TARGET')

### Replacing labels by the risk (of the target)

Here each label is replaced by the mean of the target within that label, calculated on the training set.

In [None]:
# let's load the Home Credit dataset again

data = pd.read_csv('../input/home-credit-default-risk/application_train.csv', usecols=['OWN_CAR_AGE', 'TARGET'])
data.head()

In [None]:
# let's first fill NA values with an additional label

data['OWN_CAR_AGE'] = data['OWN_CAR_AGE'].astype(object).fillna('Missing')
data['OWN_CAR_AGE'] = data['OWN_CAR_AGE'].astype(str).str[0]

In [None]:
# Let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(data[['OWN_CAR_AGE', 'TARGET']], data.TARGET, test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

In [None]:
# let's calculate the target frequency for each label

X_train.groupby(['OWN_CAR_AGE'])['TARGET'].mean()
ordered_labels = X_train.groupby(['OWN_CAR_AGE'])['TARGET'].mean().to_dict()
X_train['OWN_CAR_AGE_ordered'] = X_train.OWN_CAR_AGE.map(ordered_labels)
X_test['OWN_CAR_AGE_ordered'] = X_test.OWN_CAR_AGE.map(ordered_labels)
X_train.head()

In [None]:
# plot the original variable

fig = plt.figure()
fig = X_train.groupby(['OWN_CAR_AGE'])['TARGET'].mean().plot()
fig.set_title('Normal relationship between variable and target')
fig.set_ylabel('TARGET')

In [None]:
# plot the transformed result: the monotonic variable

fig = plt.figure()
fig = X_train.groupby(['OWN_CAR_AGE_ordered'])['TARGET'].mean().plot()
fig.set_title('Monotonic relationship between variable and target')
fig.set_ylabel('TARGET')

## Probability ratio encoding
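
Judging from the cells in this section, each label is replaced by the ratio between the probability of the target being 1 and the probability of the target being 0 within that label. A minimal sketch on toy data, with hypothetical `label` and `target` columns:

```python
import pandas as pd

# toy training data (hypothetical)
train = pd.DataFrame({
    'label':  ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'target': [1,   0,   0,   0,   1,   1,   1,   0],
})

# probability of target = 1 and of target = 0 within each label
p1 = train.groupby('label')['target'].mean()
p0 = 1 - p1

# ratio of the two probabilities, used as the new value of the label
ratio = p1 / p0
train['label_ratio'] = train['label'].map(ratio.to_dict())

print(ratio)
```

Note that the ratio is not defined when a label contains no observations with target = 0, since the denominator is then zero.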


In [None]:
# let's load the Home Credit dataset again

data = pd.read_csv('../input/home-credit-default-risk/application_train.csv', usecols=['OWN_CAR_AGE', 'TARGET'])
data.head()

In [None]:
# let's first fill NA values with an additional label

data['OWN_CAR_AGE'] = data['OWN_CAR_AGE'].astype(object).fillna('Missing')
data['OWN_CAR_AGE'] = data['OWN_CAR_AGE'].astype(str).str[0]

In [None]:
# Let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(data[['OWN_CAR_AGE', 'TARGET']],
                                                    data.TARGET, test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

In [None]:
# now let's calculate the probability of target = 1 and of target = 0 for each label,
# and the ratio between the two
prob_df = X_train.groupby(['OWN_CAR_AGE'])['TARGET'].mean()
prob_df = pd.DataFrame(prob_df)
prob_df['Difficult'] = 1-prob_df.TARGET  # probability of target = 0
prob_df['ratio'] = prob_df.TARGET/prob_df.Difficult  # infinite if a label has no target = 0
ordered_labels = prob_df['ratio'].to_dict()
X_train['OWN_CAR_AGE_ordered'] = X_train.OWN_CAR_AGE.map(ordered_labels)
X_test['OWN_CAR_AGE_ordered'] = X_test.OWN_CAR_AGE.map(ordered_labels)
X_train.head()

In [None]:
# plot the original variable

fig = plt.figure()
fig = X_train.groupby(['OWN_CAR_AGE'])['TARGET'].mean().plot()
fig.set_title('Normal relationship between variable and target')
fig.set_ylabel('TARGET')

In [None]:
# plot the transformed result: the monotonic variable

fig = plt.figure()
fig = X_train.groupby(['OWN_CAR_AGE_ordered'])['TARGET'].mean().plot()
fig.set_title('Monotonic relationship between variable and target')
fig.set_ylabel('TARGET')

## Weight  of evidence

Weight of Evidence (WoE) was developed primarily for the credit and financial industries to help build more predictive models to evaluate the risk of loan default. That is, to predict how likely the money lent to a person or institution is to be lost. Thus, Weight of Evidence is a measure of the "strength" of a grouping technique to separate good and bad risk (default). 

It is computed from the basic odds ratio: ln((Proportion of Good Credit Outcomes) / (Proportion of Bad Credit Outcomes))

WoE will be 0 if P(Goods) / P(Bads) = 1, that is, if the outcome is random for that group. If P(Bads) > P(Goods), the odds ratio will be < 1 and the WoE will be < 0; if, on the other hand, P(Goods) > P(Bads) in a group, then WoE > 0.
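
The three cases can be verified with a few lines of arithmetic. The helper `woe` below is a hypothetical function written for this illustration, and the probabilities are made up:

```python
import numpy as np

def woe(p_good, p_bad):
    """Weight of evidence: natural log of P(Goods) / P(Bads)."""
    return np.log(p_good / p_bad)

balanced = woe(0.5, 0.5)      # outcome is random for the group
mostly_bad = woe(0.2, 0.8)    # P(Bads) > P(Goods)
mostly_good = woe(0.8, 0.2)   # P(Goods) > P(Bads)

print(balanced, mostly_bad, mostly_good)
```

The last two values have the same magnitude and opposite signs, which reflects the symmetry of the log of the odds around 0.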

WoE is well suited for Logistic Regression, because the Logit transformation is simply the log of the odds, i.e., ln(P(Goods)/P(Bads)). Therefore, by using WoE-coded predictors in logistic regression, the predictors are all prepared and coded to the same scale, and the parameters in the linear logistic regression equation can be directly compared.

The WoE transformation has three advantages:

- It establishes a monotonic relationship to the dependent variable.
- It orders the categories on a "logistic" scale, which is natural for logistic regression.
- The transformed variables can then be compared because they are on the same scale. Therefore, it is possible to determine which one is more predictive.

The WoE also has three drawbacks:

- May lose information (variation) due to binning into few categories (we will discuss this further in the discretisation section)
- It does not take into account correlation between independent variables
- Prone to cause over-fitting

In [None]:
# let's load the Home Credit dataset again

data = pd.read_csv('../input/home-credit-default-risk/application_train.csv', usecols=['OWN_CAR_AGE', 'TARGET'])
data.head()

In [None]:
# let's first fill NA values with an additional label

data['OWN_CAR_AGE'] = data['OWN_CAR_AGE'].astype(object).fillna('Missing')
data['OWN_CAR_AGE'] = data['OWN_CAR_AGE'].astype(str).str[0]

In [None]:
# Let's divide into train and test set

X_train, X_test, y_train, y_test = train_test_split(data[['OWN_CAR_AGE', 'TARGET']], data.TARGET, test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

In [None]:
# and now the probability of target = 1 (TARGET) and of target = 0 (Difficult) for each label
# and we add them to the dataframe

prob_df = X_train.groupby(['OWN_CAR_AGE'])['TARGET'].mean()
prob_df = pd.DataFrame(prob_df)
prob_df['Difficult'] = 1-prob_df.TARGET
# since the log of zero and division by zero are not defined,
# let's set either probability to something small and non-zero when it is 0

prob_df.loc[prob_df.TARGET == 0, 'TARGET'] = 0.00001
prob_df.loc[prob_df.Difficult == 0, 'Difficult'] = 0.00001
# note: the numerator here is the probability of target = 1
prob_df['WoE'] = np.log(prob_df.TARGET/prob_df.Difficult)
ordered_labels = prob_df['WoE'].to_dict()

X_train['OWN_CAR_AGE_ordered'] = X_train.OWN_CAR_AGE.map(ordered_labels)
X_test['OWN_CAR_AGE_ordered'] = X_test.OWN_CAR_AGE.map(ordered_labels)

X_train.head()

In [None]:
# plot the original variable

fig = plt.figure()
fig = X_train.groupby(['OWN_CAR_AGE'])['TARGET'].mean().plot()
fig.set_title('Normal relationship between variable and target')
fig.set_ylabel('TARGET')

In [None]:
# plot the transformed result: the monotonic variable

fig = plt.figure()
fig = X_train.groupby(['OWN_CAR_AGE_ordered'])['TARGET'].mean().plot()
fig.set_title('Monotonic relationship between variable and target')
fig.set_ylabel('TARGET')