<a href="https://colab.research.google.com/github/Nagarjunasagar/Loan_repayment_prediction/blob/main/Loan_Repayment_Prediction_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Loan Repayment Prediction**

## **1. Introduction**
The objective of this work is to predict an applicant's ability to repay a loan, based on historical loan application data.

## This is a standard supervised classification problem
- Supervised: training data that includes labels is used to train the model to predict those labels.
- Classification: the label is a binary variable where
  - 0 represents repayment of the loan on time
  - 1 represents difficulty in repaying the loan





## **2. Data**
The data is from Kaggle, provided by 'Home Credit', a service dedicated to providing lines of credit (loans) to the unbanked population, as part of the 'Home Credit Default Risk' competition.
The full dataset on Kaggle comes from 7 different sources, all in the form of .csv files. This work uses only two files, 'application_train.csv' and 'application_test.csv', both of which come from 'Home Credit' itself (some of the other data comes from a credit bureau and is not included in this work).

## Understanding the data
- Training data:
 - 'application_train.csv' is the training data, which has information about each loan application at 'Home Credit'. Every loan has its own row and is identified by the feature `SK_ID_CURR`.
 - 'application_train.csv' comes with `TARGET`, a binary variable where 0 means the loan was repaid and 1 means the loan was not repaid.

- Testing data:
 - 'application_test.csv' is the testing data. It has all the features/columns of the training dataset except `TARGET`.
- More details about the columns are provided in the 'columns_description.xlsx' file.
 


## **3. Libraries**
We use a typical data science stack: `numpy`, `pandas`, `sklearn`, `matplotlib`, `seaborn`

In [70]:
# numpy and pandas for data manipulation
import numpy as np
import pandas as pd

# matplotlib and seaborn for plotting
import seaborn as sns
import matplotlib.pyplot as plt

# To suppress warnings
import warnings
warnings.filterwarnings('ignore')

# sklearn preprocessing for dealing with categorical variables
from sklearn.preprocessing import LabelEncoder

import os


## **4. Read the data from datasets**
- List all the files
- Read train data
- Read test data

In [71]:
# List files available
print(os.listdir("/content/drive/MyDrive/Loan_repayment_prediction"))

['application_test.csv', 'application_train.csv', 'columns_description.xlsx']


In [None]:
# Training data
app_train = pd.read_csv('/content/drive/MyDrive/Loan_repayment_prediction/application_train.csv')
print('Training data shape: ', app_train.shape)
app_train.head()

In [None]:
# Testing data features
app_test = pd.read_csv('/content/drive/MyDrive/Loan_repayment_prediction/application_test.csv')
print('Testing data shape: ', app_test.shape)
app_test.head()


- Training data has 307511 rows and 122 columns.
- Testing data has 48744 rows and 121 columns.
- The testing set is considerably smaller and lacks the `TARGET` column.


# Exploratory Data Analysis
- The goal of Exploratory Data Analysis (EDA) is to learn what our data can tell us. It is an open-ended process where we calculate statistics and make figures to find trends, patterns, outliers and relationships within the data.
- It generally starts with a high-level overview, then narrows to specific areas as we find intriguing parts of the data. The findings may be interesting in their own right, or they can be used to inform our modeling choices, such as by helping us decide which features to use.




### **1. Examining the `TARGET` column**
`TARGET` column is a binary variable, which has 0 and 1 as its values,
where: 
- 0 Represents Repayment of Loan on time
- 1 Represents Having difficulty in repayment of loan

In [None]:
app_train["TARGET"].value_counts()

In [None]:
app_train["TARGET"].astype(int).plot.hist()

This is an imbalanced class problem: the number of examples in one class (positive, `TARGET` = 1) is far smaller than in the other (negative, `TARGET` = 0). Far more loans were repaid on time than were not repaid.
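To put a number on the imbalance, the class shares can be computed directly. The following is a minimal sketch on a hypothetical toy series (`toy_target` is made up for illustration and stands in for `app_train["TARGET"]`); the values are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy labels standing in for app_train["TARGET"]
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="TARGET")

# Share of each class as a fraction of all rows
class_share = toy_target.value_counts(normalize=True)

# Ratio of majority to minority class counts
counts = toy_target.value_counts()
imbalance_ratio = counts.max() / counts.min()

print(class_share)
print("Majority/minority ratio:", imbalance_ratio)
```

Running the same two calls on the real `TARGET` column would show how skewed the actual data is.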

## **2. Examining the missing values in the data**
Let's look at the columns that have missing values and how many values are missing in each. Raw counts are hard to compare, so we convert them to percentages of the total rows and list the columns with the most missing data.

In [None]:
# Function to calculate missing values by column
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("Dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns

In [None]:
# Missing values statistics
missing_values = missing_values_table(app_train)
missing_values.head(20)

## **3. Columns and their types**
Since the number of columns/features is high, let's count the columns by data type. `int64` and `float64` columns are numeric variables, and `object` columns contain strings, which are categorical features.

In [None]:
app_train.dtypes.value_counts()

Let's now look at the number of unique entries in each of the `object` (categorical) columns.

In [None]:
# Number of unique classes in each object column
app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

- Most of the categorical variables have a small number of unique entries.
- Machine learning models cannot learn directly from these text values, so we need a way to convert the categorical values into numbers.

### **Encoding Categorical variables**
Before we go any further, we need to deal with the categorical variables. Most machine learning models cannot handle categorical variables directly (some, such as LightGBM, can). Therefore, we have to encode (represent) these variables as numbers before handing them off to the model. There are two main ways to carry out this process:
1. **Label Encoding** : 
- Label encoding is a good fit for variables with 2 unique categories.
- It assigns an integer to each unique category in a categorical variable. No new columns are created.
- For label encoding, we use the Scikit-Learn `LabelEncoder`.

- The problem with label encoding is that it gives the categories an arbitrary ordering. The integer assigned to each category does not reflect any inherent aspect of the category, yet a model may treat it as a meaningful magnitude.

2. **One-hot Encoding** : 
- One-hot encoding is the safe option for categorical variables with many classes, because it does not impose arbitrary values on the categories.
- It creates a new column for each unique category in a categorical variable. Each observation receives a 1 in the column for its corresponding category and a 0 in all other new columns.
- For one-hot encoding we use the pandas `get_dummies(df)` function.

- The main downside of one-hot encoding is that the number of features (dimensions of the data) can explode for categorical variables with many categories. To deal with this, we can perform one-hot encoding followed by PCA or other dimensionality reduction methods to reduce the number of dimensions (while still trying to preserve information).
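The difference between the two encodings is easiest to see on a tiny table. This is a minimal sketch on a hypothetical toy frame (the column names `OWNS_CAR` and `HOUSING` and their values are made up for illustration); it assumes a reasonably recent pandas, where `get_dummies` accepts a `dtype` argument.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; column names and values are made up for illustration
toy = pd.DataFrame({
    "OWNS_CAR": ["Y", "N", "N", "Y"],
    "HOUSING": ["Rented", "Own", "With parents", "Own"],
})

# Label encoding: one integer per category, no new columns
le = LabelEncoder()
toy["OWNS_CAR"] = le.fit_transform(toy["OWNS_CAR"])

# One-hot encoding: one new 0/1 column per category of the remaining text column
encoded = pd.get_dummies(toy, dtype=int)

print(list(le.classes_))
print(encoded.columns.tolist())
```

The two-category column stays a single column of 0s and 1s, while the three-category column is expanded into three indicator columns.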



Using label encoding for variables/columns with 2 or fewer unique categories

In [None]:
# Create LabelEncoder
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in app_train:
  if app_train[col].dtype == 'object':
    # for 2 or fewer unique categories
    if len(list(app_train[col].unique())) <= 2:
      # Train on training data
      le.fit(app_train[col])
      # Transform both training and testing data
      app_train[col] = le.transform(app_train[col])
      app_test[col] = le.transform(app_test[col])

      # Keep track of how many columns were label encoded
      le_count += 1

print("%d columns were label encoded" % le_count)





Using One-hot encoding for columns/variables having more than 2 unique categories 

In [None]:
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

print("Training features shape :", app_train.shape)
print("Testing features shape :",app_test.shape)

### **Aligning Training and Testing data**

The training and testing data must have the same features/columns, but one-hot encoding created more columns in the training data because some categorical variables have categories that are not represented in the testing data.
To remove the columns in the training data that are not present in the testing data, we `align` the dataframes.
- First, the `TARGET` column must be extracted from the training data, because it is not part of the testing data but we need it later.
- When we align, we must set `axis=1` so that the dataframes are aligned on columns rather than rows.
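The effect of an inner column alignment can be checked on two tiny frames first. This is a minimal sketch with hypothetical toy data (the column names `A`, `B_x` and `B_y` are made up for illustration).

```python
import pandas as pd

# Hypothetical toy frames: the training frame has one extra dummy column (B_y)
train = pd.DataFrame({"A": [1, 2], "B_x": [0, 1], "B_y": [1, 0], "TARGET": [0, 1]})
test = pd.DataFrame({"A": [3], "B_x": [1]})

# Keep the labels aside, since TARGET is not in the testing frame
labels = train["TARGET"]

# Inner join on columns keeps only the columns present in both frames
train_aligned, test_aligned = train.align(test, join="inner", axis=1)

# Put the labels back on the training side
train_aligned["TARGET"] = labels

print(train_aligned.columns.tolist())
print(test_aligned.columns.tolist())
```

Note that `align` returns a pair of new dataframes, so both results need to be unpacked.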

In [None]:
train_labels = app_train['TARGET']

#Align Training and Testing data keep the columns present in both dataframes
app_train, app_test = app_train.align(app_test, join = 'inner', axis=1)

#Add the target back in Training data
app_train["TARGET"] = train_labels

print("Training features shape :", app_train.shape)
print("Testing features shape : ", app_test.shape)

- Now both the training and testing data have the same columns, so they can be fed to a machine learning model.
- The number of features has grown significantly due to one-hot encoding.
- At some point we will probably want to try dimensionality reduction (removing features that are not relevant) to reduce the size of the datasets.

## **4. Anomalies/Outliers** 
**Exploratory Data Analysis Continued :**
While doing EDA, it is important to look for anomalies within the data. Anomalies may be due to:
- Mistyped numbers
- Errors in measuring equipment
- Valid but extreme measurements

One way to check for anomalies quantitatively is to look at the statistics of a column using the `describe` method.

In [None]:
app_train['DAYS_BIRTH'].describe()

The numbers in the `DAYS_BIRTH` column are negative because they are recorded relative to the current loan application. To see these stats in years, we can multiply by -1 and divide by the number of days in a year:

In [None]:
(app_train['DAYS_BIRTH'] / -365).describe()

After converting `DAYS_BIRTH` to years, the ages look reasonable. There are no outliers at either the high or the low end.

Now let's examine `DAYS_EMPLOYED`, the number of days employed.

In [None]:
(app_train['DAYS_EMPLOYED']/365).describe()

The maximum is about 1000 years, which is clearly not plausible. That is an outlier.

Examining `DAYS_EMPLOYED` shows that it has outliers.

In [None]:
import matplotlib.pyplot as plt
app_train['DAYS_EMPLOYED'].plot.hist(title = 'Days Employed');
plt.xlabel("Days Employed")


In [None]:
# Let's count the number of outliers in the column
anom = app_train[app_train['DAYS_EMPLOYED'] == 365243]
# other values
non_anom = app_train[app_train['DAYS_EMPLOYED'] != 365243]

print("The number of outliers with employed days = 365243, are :", len(anom))
print("The non_anomalies default on %0.2f%% of loans"  % (100 * non_anom['TARGET'].mean()))
print("The anomalies default on %0.2f%% of loans"  % (100 * anom['TARGET'].mean()))


There are 55374 outlier entries, all with the same value.
Interestingly, the anomalies turn out to have a lower rate of default.

Handling the anomalies depends on the exact situation, with no set rules. One of the safest approaches is just to set the anomalies to a missing value and then have them filled in (using Imputation) before machine learning. In this case, since all the anomalies have the exact same value, we want to fill them in with the same value in case all of these loans share something in common. The anomalous values seem to have some importance, so we want to tell the machine learning model if we did in fact fill in these values. As a solution, we will fill in the anomalous values with not a number (np.nan) and then create a new boolean column indicating whether or not the value was anomalous.

In [None]:
# create anomalous flag column
app_train['DAYS_EMPLOYED_ANOM'] = app_train['DAYS_EMPLOYED'] == 365243

In [None]:
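# Convert anomalous values to nan values
# (assign the result back; an in-place replace on a selected column may not update the dataframe in recent pandas)
app_train['DAYS_EMPLOYED'] = app_train['DAYS_EMPLOYED'].replace({365243 : np.nan})
app_train['DAYS_EMPLOYED'].plot.hist(title = 'Days Employed anom')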

The distribution looks much more in line with what we would expect, and we have also created a new column to tell the model that these values were originally anomalous (because we will have to fill in the nans with some value, probably the median of the column). The other columns with DAYS in the dataframe look about as expected, with no obvious outliers.

**As an extremely important note, anything we do to the training data we also have to do to the testing data. Let's make sure to create the new column and fill in the existing column with np.nan in the testing data.**
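One way to keep the two dataframes consistent is to route both through a single helper, so the flag column gets the same name and the same rule on each side. The following is a minimal sketch under that assumption; the function name `flag_and_clear_sentinel` and the toy frames are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

SENTINEL = 365243  # placeholder value observed in DAYS_EMPLOYED

def flag_and_clear_sentinel(df, col="DAYS_EMPLOYED", sentinel=SENTINEL):
    """Return a copy with a boolean flag column and the sentinel set to NaN."""
    out = df.copy()
    # Record which rows held the placeholder before it is removed
    out[col + "_ANOM"] = out[col] == sentinel
    # Replace the placeholder with a missing value for later imputation
    out[col] = out[col].replace({sentinel: np.nan})
    return out

# Hypothetical toy frames standing in for app_train and app_test
toy_train = pd.DataFrame({"DAYS_EMPLOYED": [-1200, 365243, -300]})
toy_test = pd.DataFrame({"DAYS_EMPLOYED": [365243, -50]})

toy_train = flag_and_clear_sentinel(toy_train)
toy_test = flag_and_clear_sentinel(toy_test)

print(toy_train.columns.tolist() == toy_test.columns.tolist())
```

Because the same function produces both frames, their column names cannot drift apart.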

In [None]:
# Let's count the number of outliers with value 365243 in the 'DAYS_EMPLOYED' column of the test dataframe
test_anom = app_test[app_test['DAYS_EMPLOYED'] == 365243] 

In [None]:
len(test_anom)

There are 9274 entries in the column with value 365243.

Let's make a new column to flag the anomalous entries, because in the original column we need to replace the anomalous values with `nan`.

In [None]:
# Create the flag column, using the same name as in the training data so the columns match
app_test['DAYS_EMPLOYED_ANOM'] = app_test['DAYS_EMPLOYED'] == 365243

# Replace the anomalous values in app_test['DAYS_EMPLOYED'] with nan values
app_test['DAYS_EMPLOYED'] = app_test['DAYS_EMPLOYED'].replace({365243 : np.nan})
app_test['DAYS_EMPLOYED'].plot.hist(title = 'app_test DAYS_EMPLOYED')

In [None]:
# Confirming the number of columns in both train and test set after all the above modifications
print(app_train.shape)
print(app_test.shape)

## **5. Correlation**

Now that we have dealt with the categorical variables and the outliers, let's continue with the EDA. One way to try to understand the data is to look for correlations between the features and the target. We can calculate the Pearson correlation coefficient between every variable and `TARGET` using the `.corr` dataframe method.

The correlation coefficient is not the best way to represent the "relevance" of a feature, but it does give us an idea of possible relationships within the data. Some general interpretations of the absolute value of the correlation coefficient are:

- .00 - .19 “very weak”
- .20 - .39 “weak”
- .40 - .59 “moderate”
- .60 - .79 “strong”
- .80 - 1.0 “very strong”
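The bands above can be turned into a small lookup, which is handy when scanning a long list of coefficients. This is a minimal sketch; the function name `describe_correlation` is hypothetical, and values that fall between two listed bands (for example 0.195) are assigned to the lower band as an assumption.

```python
def describe_correlation(r):
    """Map a correlation coefficient to the verbal bands listed above."""
    strength = abs(r)  # the bands apply to the absolute value
    if strength < 0.20:
        return "very weak"
    if strength < 0.40:
        return "weak"
    if strength < 0.60:
        return "moderate"
    if strength < 0.80:
        return "strong"
    return "very strong"

for value in (-0.07, 0.25, 0.6, -0.85):
    print(value, describe_correlation(value))
```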

In [None]:
# Find the correlation with the target and sort
correlations = app_train.corr()['TARGET'].sort_values()

In [None]:
# Print correlations
print("Most positive correlations are:\n", correlations.tail(15))
print("\nMost negative correlations are:\n", correlations.head(15))

Let's take a look at some of the more significant correlations: 
- `DAYS_BIRTH` has the most positive correlation (apart from `TARGET` itself, because the correlation of a variable with itself is always 1). According to the documentation, `DAYS_BIRTH` is the client's age in days at the time of the loan, recorded as a negative number.
- The correlation is positive, but the values of this feature are negative, meaning that as the client gets older, they are less likely to default on their loan (i.e. the target == 0).
- That's a little confusing, so we will take the absolute value of the feature, and then the correlation will be negative.
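Taking the absolute value of an all-negative column simply negates it, and negating one variable flips the sign of its Pearson correlation while leaving the magnitude unchanged. The following is a minimal sketch on hypothetical toy numbers (the arrays are made up for illustration, not taken from the dataset).

```python
import numpy as np

# Hypothetical toy data: ages stored as negative day counts, and a binary target
days_negative = np.array([-25000, -20000, -15000, -10000, -8000])
target = np.array([0, 0, 0, 1, 1])

# Correlation with the raw (negative) values
r_negative = np.corrcoef(days_negative, target)[0, 1]

# Correlation after taking the absolute value
r_absolute = np.corrcoef(np.abs(days_negative), target)[0, 1]

print(round(r_negative, 3), round(r_absolute, 3))
```

In this toy example the younger clients (values closer to zero) are the ones with target 1, so the raw correlation is positive and the one on absolute age is negative.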

### Effect of Age on  Repayment

In [None]:
# Find the correlation of the positive days since birth and target
app_train['DAYS_BIRTH'] = abs(app_train['DAYS_BIRTH'])
app_train['DAYS_BIRTH'].corr(app_train['TARGET'])

There is a negative linear relationship between age and the target, meaning that as clients get older, they tend to repay their loans on time more often.

Let's start looking at this variable. First, we can make a histogram of the age. We will put the x axis in years to make the plot a little more understandable.

In [None]:
(app_train['DAYS_BIRTH']/365).plot.hist(title='Client Age in years')

In [None]:
# Set the style of plots
plt.style.use('fivethirtyeight')

# Plot the distribution of ages in Years
plt.hist(app_train['DAYS_BIRTH']/365, edgecolor = 'k', bins = 25)
plt.title('Age of Client')
plt.xlabel('Age in Years')
plt.ylabel('Count')

By itself, the distribution of age does not tell us much other than that there are no outliers as all the ages are reasonable. 
- To visualize the effect of the age on the target, we will next make a kernel density estimation plot (KDE) colored by the value of the target. 
- A kernel density estimate plot shows the distribution of a single variable and can be thought of as a smoothed histogram (it is created by computing a kernel, usually a Gaussian, at each data point and then averaging all the individual kernels to develop a single smooth curve). 
- We will use the seaborn kdeplot for this graph.
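To make the "smoothed histogram" idea concrete, here is a minimal sketch of a Gaussian kernel density estimate written directly in numpy. The function name `gaussian_kde_1d`, the toy ages and the bandwidth of 5 years are all assumptions for illustration; seaborn chooses its bandwidth automatically and its result will differ in detail.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # Gaussian kernel centered on each sample
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    # The density estimate is the mean of the individual kernels
    return kernels.mean(axis=1)

# Hypothetical toy ages in years
ages = np.array([23.0, 25.0, 31.0, 40.0, 41.0, 58.0])
grid = np.linspace(0, 100, 1001)
density = gaussian_kde_1d(ages, grid, bandwidth=5.0)

# A density should integrate to roughly 1 over the grid
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, while a smaller one follows the individual data points more closely.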

In [None]:
plt.figure(figsize = (10,8))

#KDE plot of loans that repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, 'DAYS_BIRTH']/365, label = 'TARGET == 0')

#KDE plot of loans which were not repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, 'DAYS_BIRTH']/365, label = 'TARGET == 1')

# Labeling of plot
plt.title('Distribution of Ages'); plt.xlabel('Age in Years'); plt.ylabel('Density')

- The target == 1 curve skews towards the younger end of the range. 
- The correlation is weak (a correlation coefficient of about -0.07), but the difference between the two curves is visible.
- This variable is likely to be useful in a machine learning model because it is related to the target. 


- Let's look at this relationship in another way: average failure to repay loans by age bracket.
- To make this graph, first we cut the age category into bins of 5 years each. 
- Then, for each bin, we calculate the average value of the target, 
- This tells us the ratio of loans that were not repaid in each age category.

In [None]:
# Age information into a separate dataframe (copy to avoid modifying a view of app_train)
age_data = app_train[['TARGET', 'DAYS_BIRTH']].copy()
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / 365

# Bin the age data
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins = np.linspace(20, 70, num = 11))
age_data.head(10)

In [None]:
# Group by the bin and calculate averages
age_groups  = age_data.groupby('YEARS_BINNED').mean()
age_groups

In [None]:
plt.figure(figsize = (8, 8))

# Graph the age bins and the average of the target as a bar plot
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])

# Plot labeling
plt.xticks(rotation = 75); plt.xlabel('Age Group (years)'); plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Age Group');

**There is a clear trend:**  
- Younger applicants are more likely to not repay the loan.
- The rate of failure to repay is above 10% for the youngest three age groups
- and below 5% for the oldest age group.

This is information that could be directly used by the bank:
- Because younger clients are less likely to repay the loan, maybe they should be provided with more guidance or financial planning tips. This does not mean the bank should discriminate against younger clients, but it would be smart to take precautionary measures to help younger clients pay on time

### Exterior Sources

- The 3 variables with the strongest negative correlations with the target are 
 - EXT_SOURCE_1, 
 - EXT_SOURCE_2, and 
 - EXT_SOURCE_3. 
- According to the documentation, these features represent a "normalized score from external data source". 
- It is not clear exactly what this means, but it may be a cumulative sort of credit rating made using numerous sources of data.

Let's take a look at these variables.

First, we can show the correlations of the EXT_SOURCE features with the target and with each other.

In [None]:
# Extract the EXT_SOURCE variables and show correlations
ext_data = app_train[['TARGET', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
ext_data_corrs = ext_data.corr()
ext_data_corrs

### Plotting Heatmap 
Correlation for External sources, TARGET and DAYS_BIRTH

In [None]:
plt.figure(figsize = (8, 6))

# Heatmap of correlations
sns.heatmap(ext_data_corrs, cmap = plt.cm.RdYlBu_r, vmin = -0.25, annot = True, vmax = 0.6)
plt.title('Correlation Heatmap');

- All three `EXT_SOURCE` features have negative correlations with the target,
- indicating that as the value of `EXT_SOURCE` increases, the client is more likely to repay the loan. 
- We can also see that `DAYS_BIRTH` is positively correlated with `EXT_SOURCE_1`, indicating that one of the factors in this score may be the client's age.



### Effect of External sources on TARGET
Next we can look at the distribution of each of these features colored by the value of the target. This will let us visualize the effect of this variable on the target.

In [None]:
plt.figure(figsize = (10, 12))

# iterate through the sources
for i, source in enumerate(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']):
    
    # create a new subplot for each source
    plt.subplot(3, 1, i + 1)
    # plot repaid loans
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, source], label = 'target == 0')
    # plot loans that were not repaid
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, source], label = 'target == 1')
    
    # Label the plots
    plt.title('Distribution of %s by Target Value' % source)
    plt.xlabel('%s' % source); plt.ylabel('Density');
    
plt.tight_layout(h_pad = 2.5)

`EXT_SOURCE_3` displays the greatest difference between the values of the target. 
- We can clearly see that this feature has some relationship to the likelihood of an applicant to repay a loan. 
- The relationship is not very strong, in fact they are all considered very weak, but these variables will still be useful for a machine learning model to predict whether or not an applicant will repay a loan on time.

**Pairs Plot :**

- As a final exploratory plot, we can make a pairs plot of the `EXT_SOURCE` variables and the `DAYS_BIRTH` variable.
- The pairs plot is a great exploration tool because it lets us see relationships between multiple pairs of variables as well as distributions of single variables.
- Here we use the seaborn visualization library and its `PairGrid` class to create a pairs plot with
  - scatterplots on the upper triangle, 
  - kernel density plots on the diagonal, and 
  - 2D kernel density plots on the lower triangle.
- The code also defines a `corr_func` helper for annotating correlation coefficients, but it is not mapped onto the grid.



In [None]:
# Copy the data for plotting
plot_data = ext_data.drop(columns = ['DAYS_BIRTH']).copy()

# Add in the age of the client in years
plot_data['YEARS_BIRTH'] = age_data['YEARS_BIRTH']

# Drop na values and keep rows whose index label is at most 100000
plot_data = plot_data.dropna().loc[:100000, :]

# Function to calculate correlation coefficient between two columns
def corr_func(x, y, **kwargs):
    r = np.corrcoef(x, y)[0][1]
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),
                xy=(.2, .8), xycoords=ax.transAxes,
                size = 20)

# Create the pairgrid object (`height` replaces the older `size` argument in current seaborn)
grid = sns.PairGrid(data = plot_data, height = 3, diag_sharey=False,
                    hue = 'TARGET', 
                    vars = [x for x in list(plot_data.columns) if x != 'TARGET'])

# Upper is a scatter plot
grid.map_upper(plt.scatter, alpha = 0.2)

# Diagonal is a kernel density plot
grid.map_diag(sns.kdeplot)

# Bottom is density plot
grid.map_lower(sns.kdeplot, cmap = plt.cm.OrRd_r);

plt.suptitle('Ext Source and Age Features Pairs Plot', size = 32, y = 1.05);

In this plot
- Red indicates loans that were not repaid and blue indicates loans that were repaid.
- We can see the different relationships within the data. There appears to be a moderate positive linear relationship between `EXT_SOURCE_1` and `DAYS_BIRTH` (or equivalently `YEARS_BIRTH`), 
- indicating that this feature may take into account the age of the client.

# Feature Engineering

"Applied machine learning is basically feature engineering." - Andrew Ng


Feature engineering is important in solving any machine learning problem: how useful the features we create from the data are largely determines how successfully we address the problem.

Feature engineering often has a greater return on investment than model building and hyperparameter tuning.

While choosing the right model and optimal settings are important, the model can only learn from the data it is given. Making sure this data is as relevant to the task as possible is the job of the data scientist.

Feature engineering refers to a general process and can involve both 
- **Feature construction:** adding new features from the existing data, 
- **Feature selection:** choosing only the most important features, or other methods of dimensionality reduction. 

There are many techniques we can use to both create features and select features.

We will do a lot of feature engineering when we start using the other data sources, but in this notebook we will try only two simple feature construction methods:

- Polynomial features
- Domain knowledge features

## **1. Polynomial Features**

- One simple feature construction method is called polynomial features. In this method, we make features that are powers of existing features as well as interaction terms between existing features.
- For example, we can create variables EXT_SOURCE_1^2 and EXT_SOURCE_2^2 and also variables such as EXT_SOURCE_1 x EXT_SOURCE_2, EXT_SOURCE_1 x EXT_SOURCE_2^2, EXT_SOURCE_1^2 x EXT_SOURCE_2^2, and so on.
- These features that are a combination of multiple individual variables are called interaction terms because they capture the interactions between variables.
- In other words, while two variables by themselves may not have a strong influence on the target, combining them together into a single interaction variable might show a relationship with the target.
- Interaction terms are commonly used in statistical models to capture the effects of multiple variables, but I do not see them used as often in machine learning. Nonetheless, we can try out a few to see if they might help our model to predict whether or not a client will repay a loan.
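The claim that two variables can be uninformative alone but informative together can be illustrated with a tiny constructed case. This is a minimal sketch on hypothetical toy arrays (made up for illustration): the target is 1 exactly when the two inputs share a sign.

```python
import numpy as np

# Hypothetical toy inputs and a target that depends only on their product
x1 = np.array([-1.0, -1.0, 1.0, 1.0])
x2 = np.array([-1.0, 1.0, -1.0, 1.0])
target = np.array([1.0, 0.0, 0.0, 1.0])

# Each variable on its own shows no linear relationship with the target
r_x1 = np.corrcoef(x1, target)[0, 1]
r_x2 = np.corrcoef(x2, target)[0, 1]

# The interaction term x1 * x2 lines up with the target
r_interaction = np.corrcoef(x1 * x2, target)[0, 1]

print(r_x1, r_x2, r_interaction)
```

Real data is rarely this clean, but the same mechanism is why interaction terms are worth trying.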

### Creating Polynomial Features 
- In the following code, we create polynomial features using the `EXT_SOURCE` variables and the `DAYS_BIRTH` variable.
- **Scikit-Learn** has a useful class called `PolynomialFeatures` that creates the polynomials and the interaction terms up to a specified degree. 
- We can use a **degree of 3** to see the results (when creating polynomial features, we want to avoid too high a degree, both because the number of features grows very quickly with the degree and because we can run into problems with overfitting).
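How many columns to expect can be worked out before running the transformer. With n input features and maximum degree d, the number of distinct monomials of total degree at most d (including the constant term, which `PolynomialFeatures` adds by default) is the binomial coefficient C(n + d, d). The following is a minimal sketch using only the standard library; the enumeration is a cross-check, not how scikit-learn builds the features internally.

```python
from itertools import combinations_with_replacement
from math import comb

features = ["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3", "DAYS_BIRTH"]
degree = 3

# Closed form: number of monomials of total degree <= degree in len(features) variables
n_terms = comb(len(features) + degree, degree)

# Explicit enumeration as a cross-check (the empty tuple is the constant term)
terms = [
    combo
    for d in range(degree + 1)
    for combo in combinations_with_replacement(features, d)
]

print(n_terms, len(terms))
```

For 4 inputs and degree 3 both counts come to 35; raising the degree to 5 with the same inputs would already give C(9, 5) = 126 columns.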

In [None]:
# Make a new dataframe for polynomial features
poly_features_train = app_train[['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3','DAYS_BIRTH','TARGET']]
poly_features_test = app_test[['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3','DAYS_BIRTH']]

In [None]:
!pip install scikit-learn

In [None]:
# imputer for handling missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'median')

poly_target = poly_features_train['TARGET']

poly_features_train = poly_features_train.drop(columns = ['TARGET'])

# Need to impute missing values: fit on the training data, then apply the same medians to the testing data
poly_features_train = imputer.fit_transform(poly_features_train)
poly_features_test = imputer.transform(poly_features_test)

from sklearn.preprocessing import PolynomialFeatures

# Create a polynomial object with specified degree 3
poly_transformer = PolynomialFeatures(degree=3)

In [None]:
# Train the polynomial features
poly_transformer.fit(poly_features_train)

# Transform the features
poly_features_train = poly_transformer.transform(poly_features_train)
poly_features_test = poly_transformer.transform(poly_features_test)

In [None]:
print("Polynomials features Shape", poly_features_train.shape)

This creates a considerable number of new features. To get their names we use the `get_feature_names_out` method of the polynomial transformer (called `get_feature_names` in older scikit-learn versions).

In [None]:
poly_transformer.get_feature_names_out(input_features = ['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3','DAYS_BIRTH'])[:15]

- There are 35 features with individual features raised to powers up to degree 3 and interaction terms. 
- Now, we can see whether any of these new features are correlated with the target.

In [None]:
# Create a dataframe of the features
poly_features = pd.DataFrame(poly_features_train, columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3','DAYS_BIRTH']))

# Add in the target
poly_features['TARGET'] = poly_target

#find correlations with the target
poly_corrs = poly_features.corr()['TARGET'].sort_values()

# Display most positive correlations
print('most positive correlations are :\n',poly_corrs.tail())

# Display most negative correlations
print('\nmost negative correlations are :\n',poly_corrs.head())



- Several of the new variables have a greater (in terms of absolute magnitude) correlation with the target than the original features. 
- When we build machine learning models, we can try with and without these features to determine if they actually help the model learn.

- We will add these features to a copy of the training and testing data and then evaluate models with and without the features. 
- **Many times in machine learning, the only way to know if an approach will work is to try it out.**

In [None]:
# Put test features into a dataframe
poly_features_test = pd.DataFrame(poly_features_test, columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3','DAYS_BIRTH']))

# Merge polynomial features into the training dataframe
poly_features['SK_ID_CURR'] = app_train['SK_ID_CURR']
app_train_poly = app_train.merge(poly_features, on = 'SK_ID_CURR', how = 'left' )

# Merge polynomial features into the testing dataframe
poly_features_test['SK_ID_CURR'] = app_test['SK_ID_CURR']
app_test_poly = app_test.merge(poly_features_test, on = 'SK_ID_CURR', how = 'left' )


# Align the dataframes (align returns a pair of dataframes)
app_train_poly, app_test_poly = app_train_poly.align(app_test_poly, join = 'inner', axis = 1)

# Print the new shapes (shape is an attribute, not a method)
print('Shape of New training Dataframe\n', app_train_poly.shape)
print('Shape of New testing Dataframe\n', app_test_poly.shape)