In [None]:
import jovian
#jovian.commit(project="zerotogbms-a1")

# Home Credit Default Risk Prediction

## Problem statement:
Build a model that predicts how capable each applicant is of repaying a loan, so that loans are sanctioned only for applicants who are likely to repay.
### application_train/application_test: 
The main training data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET indicating 0: the loan was repaid or 1: the loan was not repaid. We train and validate the models on the training data and use the test data only for the final predictions.


## Exploratory Data Analysis

In [None]:
import pandas as pd
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import os
import warnings
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import SGDClassifier
import plotly.express as px
import plotly.offline as py
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
from sklearn.model_selection import train_test_split
init_notebook_mode(connected=True)
import cufflinks as cf
cf.go_offline()
import pickle
import gc
warnings.filterwarnings('ignore')
%matplotlib inline


In [None]:
import opendatasets as od

In [None]:
dataset_url='https://www.kaggle.com/c/home-credit-default-risk/data'
od.download(dataset_url,force=True)

In [None]:
import os
data_dir = './home-credit-default-risk'

In [None]:
os.listdir(data_dir)

In [None]:
import pandas as pd 
application = pd.read_csv('home-credit-default-risk/application_train.csv')

In [None]:
application

The data contains 307,511 rows and 122 columns. The target column is `TARGET`.

In [None]:
application.dtypes

In [None]:
application['TARGET']=application['TARGET'].astype('category')

In [None]:
count = application.isnull().sum().sort_values(ascending=False)
percentage = ((application.isnull().sum()/len(application)*100)).sort_values(ascending=False)
percentage[percentage > 50]

There are many missing values; we will handle them later.

#### Target column Distribution

In [None]:
cf.set_config_file(theme='polar')
contract_val = application['TARGET'].value_counts()
contract_df = pd.DataFrame({'labels': contract_val.index,
                   'values': contract_val.values
                  })
contract_df.iplot(kind='pie',labels='labels',values='values', title='Target Distribution', hole = 0.6)


The data is highly imbalanced. The chart shows that about 92% of applicants had no difficulty repaying, while only about 8% did.
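To see why this imbalance matters for later evaluation, here is a minimal sketch on hypothetical toy labels (92 repaid, 8 defaulted; not the real data). A baseline that always predicts the majority class scores a high accuracy while catching no defaulters at all.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical toy labels: 92 repaid (0) and 8 defaulted (1), mimicking the imbalance.
y = np.array([0] * 92 + [1] * 8)
X = np.zeros((len(y), 1))  # features are irrelevant for this baseline

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
preds = baseline.predict(X)

baseline_accuracy = accuracy_score(y, preds)  # high, despite learning nothing
baseline_recall = recall_score(y, preds)      # share of defaulters caught
print(baseline_accuracy, baseline_recall)
```

Any model trained on this kind of data should therefore be compared against the majority-class baseline, not against 50%.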

#### Target column with Contract Type

In [None]:
(pd.crosstab(application.TARGET,application.NAME_CONTRACT_TYPE,dropna=False))/len(application)*100

In [None]:
px.histogram(application,x='NAME_CONTRACT_TYPE',color='TARGET',title='Distribution of Contract Type')

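Most applications are for cash loans; revolving loans are far less common. Since TARGET records repayment rather than approval, this shows which contract type dominates the data, not who is more likely to be granted a loan.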

#### Target column with Gender

In [None]:
(pd.crosstab(application.TARGET,application.CODE_GENDER,dropna=False))/len(application)*100

In [None]:
px.histogram(application,x='CODE_GENDER',color='TARGET',title='Distribution of Gender Type')

Surprisingly, there are more female applicants than male applicants.

#### Target column with Suit Type

In [None]:
(pd.crosstab(application.TARGET,application.NAME_TYPE_SUITE,dropna=False))/len(application)*100

In [None]:
px.histogram(application,x='NAME_TYPE_SUITE',color='TARGET',title='Distribution of Suite Type')

Most applicants were unaccompanied when they applied.

#### Target column with Income Type

In [None]:
(pd.crosstab(application.TARGET,application.NAME_INCOME_TYPE,dropna=False))/len(application)*100

In [None]:
px.histogram(application,x='NAME_INCOME_TYPE',color='TARGET',title='Distribution of Income Type')

Surprisingly, businessmen are almost absent from the data, while applicants with a fixed monthly income account for most of the loans.

#### Target column with Education_type

In [None]:
(pd.crosstab(application.TARGET,application.NAME_EDUCATION_TYPE,dropna=False))/len(application)*100

In [None]:
px.histogram(application,x='NAME_EDUCATION_TYPE',color='TARGET',title='Distribution of Education Type')

Applicants with secondary or secondary special education far outnumber those with higher education.

#### Target with Occupation Type

In [None]:
(pd.crosstab(application.TARGET,application.OCCUPATION_TYPE,dropna=False))/len(application)*100

In [None]:
px.histogram(application,x='OCCUPATION_TYPE',color='TARGET',title='Distribution of OCCUPATION_TYPE')

Applicants working as IT staff, HR staff, secretaries, waiters, cleaners, or in private services are among the least represented occupations in the data.

### Income Distribution

In [None]:
application[application['AMT_INCOME_TOTAL'] < 2000000]['AMT_INCOME_TOTAL'].iplot(kind='histogram', bins=100,
   xTitle = 'Total Income', yTitle ='Count of applicants',
             title='Distribution of AMT_INCOME_TOTAL')
print(application['AMT_INCOME_TOTAL'].mean())

In [None]:
(application[application['AMT_INCOME_TOTAL'] > 1000000]['TARGET'].value_counts())/len(application[application['AMT_INCOME_TOTAL'] > 1000000])*100


In [None]:
a=application[application['AMT_INCOME_TOTAL'] < 2000000]
px.scatter(a,x= 'AMT_INCOME_TOTAL',y='AMT_CREDIT')

In [None]:
px.scatter(application, x='AMT_CREDIT',y='AMT_GOODS_PRICE',color='TARGET')

The income distribution is right-skewed. Most applicants have an income below 2,000,000, and the relationship between income and credit amount is weak. In contrast, the credit amount and the goods price are highly correlated.

#### Amount of Credit Distribution

In [None]:
application['AMT_CREDIT'].iplot(kind='histogram', bins=100,
            xTitle = 'Credit Amount',yTitle ='Count of applicants',
            title='Distribution of AMT_CREDIT')


In [None]:
np.log(application['AMT_CREDIT']).iplot(kind='histogram', bins=100,
        xTitle = 'log(Credit Amount)',yTitle ='Count of applicants',
        title='Distribution of log(AMT_CREDIT)')


The distribution of AMT_CREDIT is positively (right) skewed, but after a log transform it is approximately normal.
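The effect of the log transform can also be checked numerically rather than by eye. The sketch below uses hypothetical log-normal values as a stand-in for AMT_CREDIT (the parameters are assumptions chosen only for illustration) and compares the sample skewness before and after the transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical stand-in for AMT_CREDIT: log-normal values are right-skewed by construction.
credit = pd.Series(rng.lognormal(mean=13.0, sigma=0.7, size=10_000))

skew_raw = credit.skew()          # clearly positive for a right-skewed variable
skew_log = np.log(credit).skew()  # close to zero if the log values are symmetric
print(f"skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

On the real data, the same two calls on `application['AMT_CREDIT']` would give a quick check of how close the transformed column is to symmetric.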


#### Age Distribution

In [None]:
cf.set_config_file(theme='pearl')
(application['DAYS_BIRTH']/(-365)).iplot(kind='histogram', 
             xTitle = 'Age', bins=50,
             yTitle='Count of applicants',
             title='Distribution of Clients Age')


Most applicants are between 30 and 65 years old.

# Data Preprocessing 

The SK_ID_CURR column is only an identifier and is not useful for modeling, so we drop it. We then separate the numerical and categorical columns.

In [None]:
application=application.drop(['SK_ID_CURR'],axis=1)

In [None]:
numeric_cols = application.select_dtypes(include=np.number).columns.tolist()
print('TARGET' in numeric_cols)
categorical_cols = application.select_dtypes('object').columns.tolist()
print('TARGET' in categorical_cols)

### Encoding categorical variable
One-hot encode each categorical variable: 1 indicates that a category is present and 0 that it is absent.

In [None]:
from sklearn.preprocessing import OneHotEncoder
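# scikit-learn >= 1.2: `sparse` was renamed to `sparse_output`,
# and `get_feature_names` was replaced by `get_feature_names_out`
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore').fit(application[categorical_cols])
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))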
application[encoded_cols] = encoder.transform(application[categorical_cols])

### Missing Values 
Fill missing numerical values with the column mean.

In [None]:
imputer = SimpleImputer(strategy = 'mean')
imputer.fit(application[numeric_cols])
application[numeric_cols] = imputer.transform(application[numeric_cols])

### Scaling Features
Scale the numerical values with MinMaxScaler, which maps each column to the range 0 to 1.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(application[numeric_cols])
application[numeric_cols] = scaler.transform(application[numeric_cols])

### Splitting Data
Split the data into a training set and a validation set, so that the model can be evaluated on data it was not trained on.

In [None]:
train_df, val_df = train_test_split(application, test_size=0.25, random_state=42)
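In this notebook the encoder, imputer, and scaler are fitted on the full table before the split, so the validation rows influence the preprocessing statistics. One common alternative, sketched here on a hypothetical miniature table (the column values are made up, and this is not the notebook's own approach), is to split first and fit the preprocessing on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical miniature of the application table.
toy = pd.DataFrame({
    "AMT_CREDIT": [100.0, 250.0, np.nan, 400.0, 150.0, 300.0, 50.0, 500.0],
    "NAME_CONTRACT_TYPE": ["Cash", "Cash", "Revolving", "Cash",
                           "Revolving", "Cash", "Cash", "Revolving"],
})
toy_train, toy_val = train_test_split(toy, test_size=0.25, random_state=42)

numeric_steps = Pipeline([("impute", SimpleImputer(strategy="mean")),
                          ("scale", MinMaxScaler())])
preprocess = ColumnTransformer(
    [("num", numeric_steps, ["AMT_CREDIT"]),
     ("cat", OneHotEncoder(handle_unknown="ignore"), ["NAME_CONTRACT_TYPE"])],
    sparse_threshold=0,  # always return a dense array
)

# Statistics (mean, min, max, categories) come from the training rows only.
train_X = preprocess.fit_transform(toy_train)
val_X = preprocess.transform(toy_val)
print(train_X.shape, val_X.shape)
```

With this layout, validation values may fall slightly outside the 0 to 1 range, which is expected because the scaler never saw them.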


In [None]:
target=train_df['TARGET']

In [None]:
jovian.commit()

### Handling Imbalanced Data with SMOTE
Our data is highly imbalanced, so we use SMOTE to oversample the minority class: applicants who had difficulty repaying a loan.

In [None]:
from imblearn.over_sampling import SMOTE
from collections import Counter
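smote = SMOTE(random_state=42)  # fixed seed so the resampling is reproducible

# Resample the training predictors and target only; the validation set is left untouched
x_smote, y_smote = smote.fit_resample(train_df[encoded_cols+numeric_cols], target)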

print('Original dataset shape', Counter(target))
print('Resample dataset shape', Counter(y_smote))
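The core idea behind SMOTE is to create new minority points on the line segment between an existing minority point and one of its nearest minority neighbours. The following is a simplified sketch of that interpolation step on hypothetical 2-D points; it is not the `imblearn` implementation, and `make_synthetic` is a helper invented for this illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Hypothetical minority-class points in two dimensions.
minority = rng.normal(loc=0.0, scale=1.0, size=(20, 2))

# For each minority point, find its k nearest minority neighbours
# (the first neighbour returned is the point itself, so ask for k + 1).
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
_, neighbor_idx = nn.kneighbors(minority)

def make_synthetic(n_samples):
    """Create points on segments between a minority point and one of its neighbours."""
    base = rng.integers(0, len(minority), size=n_samples)
    chosen = neighbor_idx[base, rng.integers(1, k + 1, size=n_samples)]
    gap = rng.random((n_samples, 1))  # interpolation weight in [0, 1)
    return minority[base] + gap * (minority[chosen] - minority[base])

synthetic = make_synthetic(30)
print(synthetic.shape)
```

Because every synthetic point is an interpolation between two real minority points, the new samples stay inside the region the minority class already occupies.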


In [None]:
train_df=train_df.drop(['TARGET'],axis=1)
train_df

In [None]:
jovian.commit()

# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression(solver='liblinear')

In [None]:
model.fit(x_smote[numeric_cols + encoded_cols], y_smote)

### Making Prediction

In [None]:
train_preds = model.predict(x_smote[numeric_cols + encoded_cols])

In [None]:
model.classes_

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_smote,train_preds )

In [None]:
val_target=val_df['TARGET']
val_df1=val_df.drop(['TARGET'],axis=1)
val_preds = model.predict(val_df1[numeric_cols + encoded_cols])
accuracy_score(val_target, val_preds)  # arguments are (y_true, y_pred)

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
def predict_and_plot(inputs, targets, name=''):
    """Print accuracy and plot a row-normalized confusion matrix for the global `model`."""
    preds = model.predict(inputs)
    
    accuracy = accuracy_score(targets, preds)
    print("Accuracy: {:.2f}%".format(accuracy * 100))
    
    # Named `cm` so the cufflinks alias `cf` is not shadowed inside the function
    cm = confusion_matrix(targets, preds, normalize='true')
    plt.figure()
    sns.heatmap(cm, annot=True)
    plt.xlabel('Prediction')
    plt.ylabel('Target')
    plt.title('{} Confusion Matrix'.format(name));
    
    return preds

In [None]:
train_preds = predict_and_plot(x_smote[numeric_cols + encoded_cols], y_smote, 'Training')

In [None]:
val_preds = predict_and_plot(val_df[numeric_cols + encoded_cols], val_target, 'Validation')

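Validation accuracy is around 70%, which is fair, and it suggests that SMOTE helped the model generalize to the minority class. With classes this imbalanced, accuracy alone is a weak measure, so the normalized confusion matrix is the better guide.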
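Since `roc_auc_score` is already imported, a threshold-independent metric can be reported alongside accuracy. The sketch below uses hypothetical synthetic data from `make_classification` with a similar class ratio (not the Home Credit data) to show how accuracy, recall, and ROC AUC are computed together.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: roughly 92% negatives, 8% positives.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.92],
                           random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

clf = LogisticRegression(solver="liblinear").fit(X_train, y_train)

val_pred = clf.predict(X_val)
val_prob = clf.predict_proba(X_val)[:, 1]  # probability of class 1 (default)

val_accuracy = accuracy_score(y_val, val_pred)
val_recall = recall_score(y_val, val_pred)  # share of positives caught
val_auc = roc_auc_score(y_val, val_prob)    # ranking quality across thresholds
print(val_accuracy, val_recall, val_auc)
```

In the notebook, the equivalent call would pass `val_target` and the positive-class column of `model.predict_proba(...)` to `roc_auc_score`.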

In [None]:
jovian.commit()

## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
model = DecisionTreeClassifier(random_state=42)

In [None]:
from sklearn.tree import plot_tree, export_text

In [None]:
model.fit(x_smote[numeric_cols + encoded_cols], y_smote)

In [None]:
train_preds = model.predict(x_smote[numeric_cols + encoded_cols])

In [None]:
accuracy_score(y_smote, train_preds)

In [None]:
model.score(val_df[numeric_cols + encoded_cols], val_target)

The training accuracy is 100%, which means the model has memorized every training example. This is a case of overfitting; we will reduce it by limiting the depth of the tree.
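The same memorization effect can be reproduced on a small scale. This sketch uses hypothetical noisy synthetic data (the `flip_y` setting randomly reassigns a share of the labels) and compares an unrestricted tree with a depth-limited one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy data: flip_y randomly reassigns a share of the labels.
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.2,
                           random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                  random_state=42)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)

# (training accuracy, validation accuracy) for each tree
full_scores = (full_tree.score(X_train, y_train), full_tree.score(X_val, y_val))
shallow_scores = (shallow_tree.score(X_train, y_train), shallow_tree.score(X_val, y_val))
print("unrestricted (train, val):", full_scores)
print("max_depth=4  (train, val):", shallow_scores)
```

The unrestricted tree fits the label noise perfectly on the training set, which is exactly the gap between training and validation scores that signals overfitting.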

### Visualization

In [None]:
plt.figure(figsize=(80,20))
# Pass names in the column order the tree was fitted with, as a plain list
plot_tree(model, feature_names=numeric_cols + encoded_cols, max_depth=2, filled=True);

### Feature Importance

In [None]:
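# Use the column order the tree was fitted with (numeric_cols + encoded_cols),
# which differs from the order of x_smote.columns
importance_df = pd.DataFrame({
    'feature': numeric_cols + encoded_cols,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)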

In [None]:
plt.title('Feature Importance')
sns.barplot(data=importance_df.head(10), x='importance', y='feature');

According to the decision tree, these are the 10 most important features for predicting the target.

##  Hyperparameter Tuning and Overfitting

As we saw in the previous section, our decision tree classifier memorized all training examples, leading to a 100% training accuracy, while the validation accuracy was only marginally better than a dumb baseline model. This phenomenon is called overfitting, and in this section, we'll look at some strategies for reducing overfitting. The process of reducing overfitting is known as _regularization_.


The `DecisionTreeClassifier` accepts several arguments, some of which can be modified to reduce overfitting.

In [None]:
def max_depth_error(md):
    """Train a tree with the given max_depth and return its training and validation error."""
    model = DecisionTreeClassifier(max_depth=md, random_state=42)
    model.fit(x_smote[numeric_cols + encoded_cols], y_smote)
    # Error is 1 - accuracy
    train_error = 1 - model.score(x_smote[numeric_cols + encoded_cols], y_smote)
    val_error = 1 - model.score(val_df[numeric_cols + encoded_cols], val_target)
    return {'Max Depth': md, 'Training Error': train_error, 'Validation Error': val_error}

In [None]:
errors_df = pd.DataFrame([max_depth_error(md) for md in range(1, 21)])
errors_df

In [None]:
plt.figure()
plt.plot(errors_df['Max Depth'], errors_df['Training Error'])
plt.plot(errors_df['Max Depth'], errors_df['Validation Error'])
plt.title('Training vs. Validation Error')
plt.xticks(range(0,21, 2))
plt.xlabel('Max. Depth')
plt.ylabel('Prediction Error (1 - Accuracy)')
plt.legend(['Training', 'Validation'])

In [None]:
model = DecisionTreeClassifier(max_depth=11, random_state=42).fit(x_smote[numeric_cols + encoded_cols], y_smote)
model.score(val_df[numeric_cols + encoded_cols], val_target)

In [None]:
print(model.max_leaf_nodes)

In [None]:
model1 = DecisionTreeClassifier(max_leaf_nodes=80, random_state=42)
model1.fit(x_smote, y_smote)
model1.score(x_smote, y_smote)

In [None]:
model1.score(val_df[encoded_cols+numeric_cols], val_target)

In [None]:
model1.tree_.max_depth

## Predict Test Data

In [None]:
application1 = pd.read_csv('home-credit-default-risk/application_test.csv')

In [None]:
application1[encoded_cols] = encoder.transform(application1[categorical_cols])
application1[numeric_cols] = imputer.transform(application1[numeric_cols]) 
application1[numeric_cols] = scaler.transform(application1[numeric_cols])

In [None]:
# `model` was fitted on numeric_cols + encoded_cols, so predict with the same column order
a=model.predict(application1[numeric_cols + encoded_cols])

In [None]:
import collections
collections.Counter(a)

In [None]:
jovian.commit()

## Training a Random Forest

While tuning the hyperparameters of a single decision tree may lead to some improvements, a much more effective strategy is to combine the results of several decision trees, each trained on a different random sample of the data and considering a random subset of features at each split. This is called a random forest model. 

The key idea here is that each decision tree in the forest will make different kinds of errors, and upon averaging, many of their errors will cancel out. This idea is also commonly known as the "wisdom of the crowd".
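The error-cancelling argument can be made concrete with a small calculation. The sketch below assumes an idealized setting in which every voter is correct independently with the same probability (real trees in a forest are correlated, so the gain is smaller in practice); `majority_vote_accuracy` is a helper written for this illustration.

```python
from math import comb

def majority_vote_accuracy(n_voters, p_correct):
    """Probability that more than half of n independent voters are correct."""
    needed = n_voters // 2 + 1
    return sum(comb(n_voters, k) * p_correct**k * (1 - p_correct)**(n_voters - k)
               for k in range(needed, n_voters + 1))

single = 0.6  # assumed accuracy of each individual voter
ensemble = majority_vote_accuracy(101, single)
print(f"single voter: {single:.3f}, majority of 101: {ensemble:.3f}")
```

Under these assumptions, a majority vote over many weak but independent voters is far more accurate than any single one, which is the intuition behind averaging many trees.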



In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
model = RandomForestClassifier(n_jobs=-1, random_state=42)

In [None]:
%%time
model.fit(x_smote, y_smote)
model.score(x_smote, y_smote)

In [None]:
importance_df = pd.DataFrame({
    'feature': x_smote.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

In [None]:
plt.title('Feature Importance')
sns.barplot(data=importance_df.head(10), x='importance', y='feature');

In [None]:
jovian.commit()

## Summary and References

The following topics were covered in this tutorial:

- Downloading a real-world dataset
- Exploratory Data Analysis
- Preparing a dataset for training
- Training and interpreting Logistic Regression
- Training and interpreting decision trees
- Overfitting, hyperparameter tuning & regularization
- Making predictions on test data


In [None]:
jovian.commit()

In [None]:
import jovian
jovian.submit(assignment="zerotogbms-project")