# Python for Data Science

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI-Core/Python-for-Data-Science/blob/main/Part%203%20-%20Feature%20Engineering.ipynb)


## The Problem

__Loan default prediction__ is one of the most critical problems faced by financial institutions, as it has a significant effect on their profitability. In recent years, there has been a large increase in the volume of _non-performing_ loans, which jeopardizes the growth of these institutions.

Therefore, to maintain a healthy portfolio, banks put stringent monitoring and evaluation measures in place to ensure the timely repayment of loans by borrowers. Despite these measures, a major proportion of loans become delinquent. _Delinquency_ occurs when a borrower misses a payment on their loan.

Given information such as mortgage details, borrower details, and payment details, your objective is to build a system that can predict the defaulter status of loans for the next month, given the defaulter status for each of the previous 12 months (each recorded as a number of months).



In [4]:
import pandas as pd
df = pd.read_csv(
    "https://raw.githubusercontent.com/AI-Core/Python-for-Data-Science/main/part-2-output.csv")


# Part 3 - Feature Engineering

After performing EDA and gaining an understanding of your data, it's up to you to create features based on that understanding.
This process of creating new features is known as _feature engineering_.

Feature engineering is one of the most crucial steps in machine learning.
Creating the right features by applying __data understanding__ and __business knowledge__ can substantially improve a model's performance metrics.

## Computing new features

The typical approach is to generate a large number of different features using a variety of strategies.
- Some of these features can be crafted from your intuition.
- Some of these features can be informed by your interpretation of your EDA.
- Some of these features can be generated by trying transformations and combinations of existing features, which can sometimes turn out to be very valuable to the model. So you can choose to be creative here.

### Features that make sense intuitively

One idea is to calculate the number of months between the start of the loan and the first payment.

In [5]:
# number of months between the origination date and the first payment

# convert date columns to datetime format
df['origination_date'] = pd.to_datetime(df['origination_date'])
df['first_payment_date'] = pd.to_datetime(df['first_payment_date'])

# approximate the number of months by integer-dividing the gap by 29 days
# (29 rather than 30 to cover the corner case of February)
df['months_until_first_payment'] = (df['first_payment_date'] - df['origination_date']) // pd.Timedelta('29 days')
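Dividing by 29 days is an approximation. As an alternative sketch (not what this notebook uses), the number of calendar months between two dates can be computed directly from their year and month components. The `toy` DataFrame and its dates are made up for illustration.

```python
import pandas as pd

# toy data, made up for illustration
toy = pd.DataFrame({
    "origination_date": pd.to_datetime(["2012-03-01", "2012-01-15", "2011-12-01"]),
    "first_payment_date": pd.to_datetime(["2012-05-01", "2012-03-01", "2012-02-01"]),
})

start = toy["origination_date"].dt
end = toy["first_payment_date"].dt

# whole calendar months between the two dates, ignoring the day of the month
toy["months_until_first_payment"] = (end.year - start.year) * 12 + (end.month - start.month)
print(toy["months_until_first_payment"].tolist())  # [2, 2, 2]
```

This version ignores the day of the month, so whether it suits your data depends on how you want partial months to be counted.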


Intuitively, there's probably not much more information in the exact number of days given to pay back the loan than there is in the number of months. Perhaps simplifying this value will make it easier for a model to learn some useful relationships.

Note: There is no point in keeping both the loan term in days and the loan term in months as features, as they are completely dependent on one another. Here, we assume that this simplification will be helpful and discard the less helpful feature based on days.

In [6]:
# convert the loan term from days into months, dividing by 29 to cover the corner case of February
df['loan_term'] =  df['loan_term'] // 29 

Another piece of business logic that makes sense if you know how loans work is to determine whether the individual has taken out insurance for the loan. In unforeseen circumstances, the insurance policy provides coverage for a certain amount of time and makes the monthly loan payments on the individual's behalf.
Instead of knowing the exact percentage of insurance, it may be enough to know whether a __person has insurance or not__.

In [7]:
# loan insurance covered: 1 = yes, 0 = no
df['has_insurance'] = df['insurance_percent'].apply(lambda x: 0 if x == 0 else 1)
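The same flag can also be computed without `apply`, by comparing the whole column at once. This is a minimal sketch on made-up values, not the notebook's own cell.

```python
import pandas as pd

# toy values, made up for illustration
toy = pd.DataFrame({"insurance_percent": [0.0, 30.0, 0.0, 12.5]})

# True wherever there is any insurance, then cast the booleans to 0/1
toy["has_insurance"] = (toy["insurance_percent"] != 0).astype(int)
print(toy["has_insurance"].tolist())  # [0, 1, 0, 1]
```

Vectorised comparisons like this tend to be faster than a Python-level `lambda` on large DataFrames.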

Take a look at these features in your dataset now that you've created them.

In [8]:
df.head()

Unnamed: 0.1,Unnamed: 0,loan_id,source,financial_institution,interest_rate,unpaid_principal_bal,loan_term,origination_date,first_payment_date,loan_to_value,...,m6,m7,m8,m9,m10,m11,m12,m13,months_until_first_payment,has_insurance
0,0,268055008619,Z,"Turner, Baldwin and Rhodes",4.25,214000,12,2012-03-01,2012-05-01,95,...,0,1,0,0,0,0,0,1,2,1
1,1,672831657627,Y,"Swanson, Newton and Miller",4.875,144000,12,2012-01-01,2012-03-01,72,...,0,0,0,0,0,1,0,1,2,0
2,2,742515242108,Z,Thornton-Davis,3.25,366000,6,2012-01-01,2012-03-01,49,...,0,0,0,0,0,0,0,1,2,0
3,3,601385667462,X,OTHER,4.75,135000,12,2012-02-01,2012-04-01,46,...,0,0,0,1,1,1,1,1,2,0
4,4,273870029961,X,OTHER,4.75,124000,12,2012-02-01,2012-04-01,80,...,5,6,7,8,9,10,11,1,2,0


## Feature Selection

Now that you've created a bunch of new features, you need to determine which ones to use.
There are different approaches for different types of variables. 


The most common approaches are:
1. Correlation Coefficient: a metric used to measure the [correlation](https://en.wikipedia.org/wiki/Correlation) between continuous variables.
2. Chi-Squared test: test of independence for categorical columns.

### 1. Using Correlation Coefficients to Remove Highly Correlated Features
As we mentioned, we typically use this for continuous variables, but sometimes interesting patterns can be seen for categorical variables as well.

So let's visualise the correlation between all the columns.

In [9]:
# correlation for continuous columns

continuous_columns = [
    "interest_rate",
    "unpaid_principal_bal",
    "loan_term",
    "loan_to_value",
    "debt_to_income_ratio",
    "borrower_credit_score",
    "insurance_percent",
    "months_until_first_payment"
]

categorical_columns = [
    "source",
    "financial_institution",
    "loan_purpose",
    "has_insurance",
    "m1",
    "m2",
    "m3",
    "m4",
    "m5",
    "m6",
    "m7",
    "m8",
    "m9",
    "m10",
    "m11",
    "m12"
]

import plotly.express as px

# numeric_only=True (pandas >= 1.5) skips the text and date columns
corr = df.corr(numeric_only=True)

fig = px.imshow(corr, text_auto=True, aspect="auto")
fig.show()

What we're looking for here are the more yellow cells (other than those along the diagonal), which indicate a high correlation between two features.

- `interest_rate` and `loan_term` are highly correlated: the longer the loan term, the higher the interest rate.
- `has_insurance` and `insurance_percent` are also very highly correlated because `has_insurance` is derived from `insurance_percent`.

We should remove one column from each of these pairs.
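Rather than reading the heatmap by eye, highly correlated pairs can also be picked out programmatically. The sketch below uses made-up data and an assumed threshold of 0.9; both are illustrative choices rather than values taken from this dataset.

```python
import numpy as np
import pandas as pd

# toy data, made up for illustration: b is almost a copy of a, c is only loosely related
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [1.1, 2.0, 2.9, 4.2, 5.0, 6.1],
    "c": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
})

threshold = 0.9  # assumed cut-off, tune for your data

corr = toy.corr().abs()
# keep only the upper triangle so each pair appears once and the diagonal is ignored
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# columns that are highly correlated with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)  # ['b']
```

Which column of a correlated pair to drop is still a judgment call; this sketch simply drops the one that appears later.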


### 2. Using a Chi-Squared Test to Remove Highly Correlated Categorical Features

A [chi-squared](https://en.wikipedia.org/wiki/Chi-squared_test) test is a [statistical hypothesis](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing) test primarily used to examine whether two categorical variables are independent of each other. Here, we test each categorical feature against the target column `m13`: a small p-value suggests that the feature and the target are not independent.
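To see what the test does on a single pair of columns, here is a minimal sketch on made-up data. The `toy` DataFrame and its `defaulted` column are hypothetical stand-ins for a categorical feature and the target.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# toy data, made up for illustration
toy = pd.DataFrame({
    "source": ["X", "X", "X", "X", "Y", "Y", "Y", "Y"] * 10,
    "defaulted": [1, 1, 1, 0, 0, 0, 0, 1] * 10,
})

# contingency table: counts of each (source, defaulted) combination
table = pd.crosstab(toy["source"], toy["defaulted"])

# chi2_contingency returns the statistic, the p-value,
# the degrees of freedom and the expected counts under independence
chi2, p, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
```

In this toy table, source X defaults far more often than source Y, so the p-value comes out well below 0.05 and we would reject independence.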

In [None]:

from scipy.stats import chi2_contingency

In [None]:
results = pd.DataFrame(columns=['col','p-value'])

for col in categorical_columns:
    _, p, _, _ = chi2_contingency( pd.crosstab(df[col], df['m13']))
    s = results.shape[0]
    results.loc[s,'col'] = col
    results.loc[s,'p-value'] = p

results

Unnamed: 0,col,p-value
0,source,0.000795
1,financial_institution,0.061209
2,loan_purpose,0.0
3,insurance_covered,0.064511
4,m1,0.0
5,m2,0.0
6,m3,0.0
7,m4,0.0
8,m5,0.0
9,m6,0.0


Typically, a p-value < 0.05 is considered significant. Hence, the insignificant columns are:
1. `financial_institution`
2. `has_insurance` (listed as `insurance_covered` in the output above)

Now let's drop the categorical columns which we found to be insignificant. The cell below also drops `insurance_percent`, the column that `has_insurance` was derived from.

In [None]:
df = df.drop(labels=['financial_institution', 'has_insurance', 'insurance_percent'], axis=1)

We will also drop the datetime columns because they don't provide any value unless we extract some information from them, as we did with `months_until_first_payment`. You can try to extract as much information from them as you want, such as the `month_name` in which the loan was issued.
For now, we will drop these columns as well.
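If you do want to keep some of that date information, the `.dt` accessor is one way to extract it before dropping the columns. A small sketch on made-up dates; the new column names are hypothetical.

```python
import pandas as pd

# toy dates, made up for illustration
toy = pd.DataFrame({
    "origination_date": pd.to_datetime(["2012-03-01", "2012-01-01", "2012-02-01"]),
})

# month as a number (1-12) and as a name
toy["origination_month"] = toy["origination_date"].dt.month
toy["origination_month_name"] = toy["origination_date"].dt.month_name()
print(toy)
```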

In [None]:
df = df.drop(labels=["origination_date","first_payment_date"], axis=1)

# Data Preprocessing

There are a few data preprocessing techniques that need to be applied specifically if you intend to use your data for machine learning. They include:

1. Feature Encoding
2. Feature Scaling


## Feature and Label Encoding
There are generally two types of encoding techniques:
1. Label Encoding: used for ordinal variables
2. One-Hot Encoding: used for nominal variables
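The difference between the two can be sketched on a toy example. The `size` and `colour` columns below are made up for illustration and are not part of the loan dataset.

```python
import pandas as pd

# toy data, made up for illustration
toy = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],   # ordinal: has a natural order
    "colour": ["red", "blue", "red", "green"],       # nominal: no natural order
})

# label encoding with an explicit order, so the integers respect small < medium < large
size_order = {"small": 0, "medium": 1, "large": 2}
toy["size_encoded"] = toy["size"].map(size_order)

# one-hot encoding: one 0/1 column per category
toy = pd.get_dummies(toy, columns=["colour"], dtype=int)
print(toy)
```

Label-encoding a nominal variable would impose an order that does not exist, which is why one-hot encoding is preferred for columns like `source` and `loan_purpose`.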


In [None]:
# separate the features (X) from the target column (y)
X = df.drop(labels=['m13'], axis=1)
y = df['m13']


In [None]:
# apply one-hot encoding to nominal variables
X = pd.get_dummies(data=X, columns=['source', 'loan_purpose'])

## Feature Scaling

*Feature scaling* is the process of bringing all features to the same order of magnitude, rather than having some that are, for example, 1000x larger than others.

Feature scaling is one of the most important data preprocessing steps in machine learning. Algorithms that compute distances between data points are biased significantly towards numerically larger features if the data is not scaled.

Tree-based algorithms are fairly insensitive to the scale of the features. However, feature scaling helps machine learning models trained using gradient-based optimisation (including deep learning algorithms) to converge, and to converge faster.

Normalisation and standardisation are the most popular scaling techniques.

### Normalisation

Normalisation, also known as min-max scaling, is calculated as:

$$X_{new} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

This scales the range to [0, 1]. Geometrically speaking, the transformation squishes the n-dimensional data into an n-dimensional unit hypercube.
Normalisation is useful when there are no outliers, as it cannot cope with them: a single extreme value stretches the range and compresses all the other values together.
For example, we would usually normalise ages but not incomes, because only a few people have very high incomes whereas ages are spread closer to uniformly.
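The age versus income point can be checked with a few lines of NumPy. The helper `min_max_scale` and the numbers are made up for illustration.

```python
import numpy as np

def min_max_scale(x):
    """Scale values to [0, 1] using (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# toy values, made up for illustration
ages = [20, 30, 40, 50, 60]
incomes = [20_000, 30_000, 40_000, 50_000, 1_000_000]

print(min_max_scale(ages))     # evenly spread over [0, 1]
print(min_max_scale(incomes))  # the outlier maps to 1, the rest are squeezed near 0
```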

### Standardisation

Standardisation, also known as Z-Score Normalization, is the transformation of features by subtracting the mean of the data and dividing by its standard deviation.
The resulting value for each feature is often called a Z-score.

$$X_{new} = \frac{X - \mu}{\sigma}$$

Standardisation can be helpful in cases where the data follows a Gaussian distribution.
However, this does not necessarily have to be the case.
Geometrically speaking, it translates the data so that its mean sits at the origin and squishes or expands the points around that mean so that they have unit standard deviation along every feature.
We are just changing the mean and standard deviation to those of a standard normal distribution (0 and 1), so the shape of the original distribution is not affected.

Standardisation copes with outliers better than normalisation because the range of the new data is not completely determined by the max and min values.
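As a quick worked example of the Z-score formula, the sketch below standardises a handful of made-up values by hand with NumPy.

```python
import numpy as np

# toy values, made up for illustration
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = values.mean()     # 5.0
sigma = values.std()   # population standard deviation: 2.0

z = (values - mu) / sigma
print(z)  # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
```

After the transformation the values have mean 0 and standard deviation 1, but their relative spacing is unchanged.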


In [None]:
X

loan_id,interest_rate,unpaid_principal_bal,loan_term,loan_to_value,number_of_borrowers,debt_to_income_ratio,borrower_credit_score,m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11,m12,months_until_first_payment,source_X,source_Y,source_Z,loan_purpose_A23,loan_purpose_B12,loan_purpose_C86
268055008619,4.250,214000,12,95,1,22.0,694.0,0,0,0,0,0,0,1,0,0,0,0,0,2,0,0,1,0,0,1
672831657627,4.875,144000,12,72,1,44.0,697.0,0,0,0,0,0,0,0,0,0,0,1,0,2,0,1,0,0,1,0
742515242108,3.250,366000,6,49,1,33.0,780.0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,0,1,0
601385667462,4.750,135000,12,46,2,44.0,633.0,0,0,0,0,0,0,0,0,1,1,1,1,2,1,0,0,0,1,0
273870029961,4.750,124000,12,80,1,43.0,681.0,0,1,2,3,4,5,6,7,8,9,10,11,2,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
382119962287,4.125,153000,12,88,2,22.0,801.0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,1,0,1,0,0
582803915466,3.000,150000,4,35,1,37.0,796.0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,0,1,0
837922316947,3.875,166000,12,58,2,49.0,724.0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,0,0,0,1,0
477343182138,4.250,169000,12,74,2,13.0,755.0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,0,0,1,0,0


In [None]:
# apply normalisation using sklearn
from sklearn.preprocessing import MinMaxScaler
normalisation = MinMaxScaler()
X = normalisation.fit_transform(X)
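Note that `fit_transform` returns a NumPy array, so the column names and index are lost. If you want to keep working with a DataFrame, one option is to wrap the result again, as in this sketch on made-up data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# toy data, made up for illustration
toy = pd.DataFrame(
    {"interest_rate": [3.0, 4.0, 5.0], "loan_term": [6, 12, 12]},
    index=pd.Index([101, 102, 103], name="loan_id"),
)

scaler = MinMaxScaler()
# rebuild a DataFrame around the scaled array, reusing the original labels
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)
print(scaled)
```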

## Save the output data

In [10]:
df.to_csv('part-3-output.csv')

## Key Takeaways

- Feature engineering is about coming up with new features that might contain useful information
- Feature scaling is important because some machine learning models will struggle to find the patterns in your data without it
- Normalisation and standardisation are two of the most popular approaches to feature scaling