# Intro
* This notebook addresses the task of classifying potential loan defaulters.
* It covers a brief exploratory data analysis, feature engineering, and several methods for classifying imbalanced datasets.
* Model quality is measured with `roc_auc_score` on a held-out 20% of the training dataset.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
data=pd.read_csv('/kaggle/input/loan-prediction-based-on-customer-behavior/Training Data.csv')
data.info()

In [None]:
data.head()

# EDA

## 📊 Ratio of risky loans

In [None]:
import matplotlib.pyplot as plt

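counts=data.Risk_Flag.value_counts()
# look the counts up by class label so the result does not depend on the sort order of value_counts()
Flag0=counts.loc[0]
Flag1=counts.loc[1]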

plt.figure(figsize=(8,8))
plt.pie([Flag0, Flag1], labels=['Non-Risk:\n%d total' %Flag0,'Risk:\n%d total' %Flag1], autopct='%1.2f%%')

## 📊 Number of loans by state

In [None]:
import seaborn as sns
g=sns.catplot(x='STATE', data=data, height=12, aspect=1.5, kind='count', palette='deep')
g.set_xticklabels(rotation=60)

## 📊 Proportion of risky loans by state

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,12))
plt.xticks(rotation=60)
sns.barplot(x='STATE', y='Risk_Flag', data=data, palette='deep')

## 📊 Income vs Experience

In [None]:
sns.catplot(x='Experience', y='Income', data=data, kind="violin", height=8, aspect=1.6, palette='deep')

## 📊 Count of loans by Ages

In [None]:
sns.displot(x='Age', data=data, height=8, aspect=1.5, hue='Risk_Flag', bins=20)

## 📊 Portion of risky loans by age, marital status and house ownership

In [None]:
#binning ages into five quantile-based groups
data['Age_group']=pd.qcut(data.Age,5)

g = sns.FacetGrid(data=data, row='House_Ownership', col='Married/Single', height=5, aspect=1.5)
g.map_dataframe(sns.barplot,x='Age_group', y='Risk_Flag', ci=None)
g.set_xticklabels(rotation=60)

# Feature Engineering

In [None]:
#encode categorical
data['Married/Single']=data['Married/Single'].map({'single':0, 'married':1})
data['House_Ownership']=data['House_Ownership'].map({'norent_noown':0, 'rented':1, 'owned':2})
data['Car_Ownership']=data['Car_Ownership'].map({'no':0, 'yes':1})

In [None]:
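#Each of the variables below is negatively correlated with Risk_Flag
#(only numeric columns are used, since string and interval columns have no Pearson correlation)
data.select_dtypes('number').corr().Risk_Flag.drop(['Risk_Flag','Id']).plot.bar()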

In [None]:
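#one-hot encode STATE and Profession, then drop one dummy from each group
#(the first STATE column and the last Profession column) to avoid multicollinearity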
dummies=pd.get_dummies(data[['STATE', 'Profession']])
dummies.drop(dummies.columns[[0, -1]], axis=1, inplace=True)

In [None]:
#Selected features
features=['Income', 'Age', 'Experience', 
          'CURRENT_JOB_YRS', 'CURRENT_HOUSE_YRS', 'Married/Single','House_Ownership', 'Car_Ownership']
X=pd.concat([data[features], dummies], axis=1)
y=data['Risk_Flag']

X.head()

# Models overview
Since the data is imbalanced, let's try some methods of resampling the dataset such as:
* Random under-sampling using Balanced Random Forest Classifier;
* Over-sampling using Adaptive Synthetic (ADASYN) algorithm;
* Over-sampling using SMOTE and under-sampling with Tomek links (SMOTETomek). 

In all cases, the Random Forest Classifier will be used.

For more information about handling imbalanced datasets, visit [imbalanced-learn.org](https://imbalanced-learn.org/stable/introduction.html).
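
The over-sampling methods listed above (ADASYN and SMOTE) create new minority samples by interpolating between an existing minority sample and one of its minority-class neighbours. The following is a minimal pure-Python sketch of that single step, not the imbalanced-learn implementation; the helper name `interpolate` and the toy points are made up for illustration.

```python
import random

def interpolate(sample, neighbour, rng):
    """Return a synthetic point on the segment between two minority samples."""
    lam = rng.random()  # random position along the segment, in [0, 1)
    return tuple(a + lam * (b - a) for a, b in zip(sample, neighbour))

rng = random.Random(0)
sample = (1.0, 2.0)     # an existing minority-class point (toy data)
neighbour = (3.0, 6.0)  # one of its minority-class nearest neighbours (toy data)
synthetic = interpolate(sample, neighbour, rng)
print(synthetic)
```

Because the synthetic point always lies between two real minority samples, the new data stays inside the region the minority class already occupies.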

In [None]:
#splitting the dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Random Forest with random undersampling
A balanced random forest classifier (BalancedRandomForestClassifier) randomly under-samples each bootstrap sample to balance it.
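
To make the idea concrete, here is a simplified pure-Python sketch of drawing one class-balanced bootstrap sample: every class contributes as many rows (drawn with replacement) as the smallest class has. This only illustrates the principle and is an assumption about the general scheme, not the exact sampling logic inside `BalancedRandomForestClassifier`; the function name `balanced_bootstrap` and the toy labels are hypothetical.

```python
import random
from collections import Counter

def balanced_bootstrap(labels, rng):
    """Return row indices of a bootstrap sample with equal class counts."""
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())  # size of the smallest class
    sample = []
    for idx in by_class.values():
        # draw n_min rows with replacement from every class
        sample.extend(rng.choices(idx, k=n_min))
    return sample

rng = random.Random(0)
labels = [0] * 90 + [1] * 10           # toy data: 90 non-risk rows, 10 risk rows
idx = balanced_bootstrap(labels, rng)
class_counts = Counter(labels[i] for i in idx)
print(sorted(class_counts.items()))    # [(0, 10), (1, 10)]
```

Each tree in the forest would be trained on a different sample of this kind, so the majority class is seen in full across the ensemble even though every single tree sees a balanced subset.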

In [None]:
from imblearn.ensemble import BalancedRandomForestClassifier
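# Note: plot_roc_curve and plot_confusion_matrix were removed in scikit-learn 1.2;
# newer versions provide RocCurveDisplay.from_estimator and ConfusionMatrixDisplay.from_estimator instead
from sklearn.metrics import f1_score, accuracy_score, roc_auc_score, plot_roc_curve, plot_confusion_matrix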

brf=BalancedRandomForestClassifier().fit(X_train, y_train)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (16,6))
ax1.set_title('Confusion matrix (Balanced RF)')
ax2.set_title('ROC curve (Balanced RF)')
ax2.plot([0,1], [0,1], 'g--', alpha=0.25)
    
plot_confusion_matrix(brf, X_test, y_test, cmap=plt.cm.Blues, normalize='true', ax=ax1)
plot_roc_curve(brf, X_test, y_test, ax=ax2)

y_pred = brf.predict(X_test)

acc_brf=accuracy_score(y_test, y_pred)
f1_brf=f1_score(y_test, y_pred)
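# Note: roc_auc_score receives hard 0/1 predictions here, so for this binary problem the value
# equals balanced accuracy and can differ from the AUC shown on the ROC curve, which uses probabilities
roc_brf=roc_auc_score(y_test, y_pred)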
print('Roc_Auc score: %.3f' %roc_brf)    

# Over-sampling with ADASYN
The essential idea of Adaptive Synthetic (ADASYN) sampling is to weight minority class examples according to how difficult they are to learn: more synthetic data is generated for minority examples that are harder to learn than for those that are easier to learn. [[1]](https://ieeexplore.ieee.org/document/4633969)
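
The weighting can be sketched in a few lines of pure Python. In the cited paper, the difficulty of a minority example is the share of majority-class points among its k nearest neighbours, and these shares are normalised into a distribution. The code below is a minimal illustration under that reading, not the imbalanced-learn implementation; `adasyn_weights`, `k=2`, and the toy points are assumptions chosen for the example.

```python
import math

def adasyn_weights(X, y, minority=1, k=2):
    """Share of synthetic samples assigned to each minority point.

    A minority point whose k nearest neighbours are mostly majority-class
    points is treated as harder to learn and gets a larger share.
    """
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    ratios = []
    for i in minority_idx:
        others = [j for j in range(len(X)) if j != i]
        neighbours = sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]
        ratios.append(sum(y[j] != minority for j in neighbours) / k)
    total = sum(ratios)
    # Normalise so the weights sum to 1 (fall back to uniform if all ratios are 0).
    if total == 0:
        return [1 / len(ratios)] * len(ratios)
    return [r / total for r in ratios]

# Toy data: a 3x3 grid of majority points (label 0) and six minority points (label 1).
majority_points = [(float(a), float(b)) for a in range(3) for b in range(3)]
minority_points = [(0.5, 0.5),                                # inside the majority region
                   (3.0, 3.0), (3.0, 4.0),                    # on its border
                   (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]  # far away from it
X = majority_points + minority_points
y = [0] * len(majority_points) + [1] * len(minority_points)

weights = adasyn_weights(X, y)
print(weights)  # [0.5, 0.25, 0.25, 0.0, 0.0, 0.0]
```

The minority point buried among majority points receives half of the synthetic samples, the two border points share the rest, and the well-separated cluster receives none.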

In [None]:
from imblearn.over_sampling import ADASYN 

print ('Initial size:', X_train.shape)

ada = ADASYN(random_state=42)
X_ada, y_ada = ada.fit_resample(X_train, y_train)
        
print ('Resampled size:', X_ada.shape)

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_ada=RandomForestClassifier().fit(X_ada, y_ada)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (16,6))  
ax1.set_title('Confusion matrix (RF and ADASYN)')
ax2.set_title('ROC curve (RF and ADASYN)')
ax2.plot([0,1], [0,1], 'g--', alpha=0.25)
    
plot_confusion_matrix(rf_ada,X_test, y_test, cmap=plt.cm.Blues, normalize='true', ax=ax1)
plot_roc_curve(rf_ada, X_test, y_test, ax=ax2)

y_pred = rf_ada.predict(X_test)

acc_ada=accuracy_score(y_test, y_pred)
f1_ada=f1_score(y_test, y_pred)
roc_ada=roc_auc_score(y_test, y_pred)
print('Roc_Auc score: %.3f' %roc_ada)    

# Combination of over- and under-sampling using SMOTE and Tomek links
The SMOTE-Tomek Links method combines SMOTE's ability to generate synthetic data for the minority class with the ability of Tomek links to clean the majority class: majority-class samples that form Tomek links (that is, majority samples that are closest to minority-class data) are removed. [[2]](https://towardsdatascience.com/imbalanced-classification-in-python-smote-tomek-links-method-6e48dfe69bbc)
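
A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbour. The pure-Python sketch below finds such pairs on toy one-dimensional data; it is only meant to illustrate the definition and is not how imbalanced-learn implements `TomekLinks`. The function names and the data are hypothetical.

```python
import math

def nearest(i, X):
    """Index of the point closest to X[i], excluding X[i] itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def tomek_links(X, y):
    """Pairs (i, j), i < j, of opposite-class mutual nearest neighbours."""
    links = []
    for i in range(len(X)):
        j = nearest(i, X)
        if y[i] != y[j] and nearest(j, X) == i and i < j:
            links.append((i, j))
    return links

# Toy data: majority class 0 on the left, minority class 1 on the right.
X = [(0.0,), (1.0,), (2.0,), (2.4,), (5.0,), (6.0,)]
y = [0, 0, 0, 1, 1, 1]

links = tomek_links(X, y)
print(links)  # [(2, 3)]

# With sampling_strategy='majority', only the majority member of each link is dropped.
to_remove = [i for pair in links for i in pair if y[i] == 0]
print(to_remove)  # [2]
```

Removing the majority point at 2.0 widens the gap between the classes, which is the cleaning effect that follows the SMOTE step.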

In [None]:
from imblearn.combine import SMOTETomek
from imblearn.under_sampling import TomekLinks

print ('Initial size:', X_train.shape)

smt=SMOTETomek(tomek=TomekLinks(sampling_strategy='majority'))
X_smt, y_smt = smt.fit_resample(X_train, y_train)
        
print ('Resampled size:', X_smt.shape)

In [None]:
rf_smt=RandomForestClassifier().fit(X_smt, y_smt)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (16,6))
    
ax1.set_title('Confusion matrix (RF and SMOTETomek)')
ax2.set_title('ROC curve (RF and SMOTETomek)')
ax2.plot([0,1], [0,1], 'g--', alpha=0.25)
    
plot_confusion_matrix(rf_smt,X_test, y_test, cmap=plt.cm.Blues, normalize='true', ax=ax1)
plot_roc_curve(rf_smt, X_test, y_test, ax=ax2)

y_pred = rf_smt.predict(X_test)

acc_smt=accuracy_score(y_test, y_pred)
f1_smt=f1_score(y_test, y_pred)
roc_smt=roc_auc_score(y_test, y_pred)
print('Roc_Auc score: %.3f' %roc_smt)    

# Threshold moving
What if we need to identify all risky loans as accurately as possible? In that case we can change the probability threshold for assigning a loan to the risky class. This increases the number of False-Positive predictions and decreases the number of False-Negatives. <br>
The choice of threshold depends on our strategy: do we want a larger number of potential clients, or do we need to minimize risk, even at the cost of losing some reliable borrowers?
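
The trade-off can be seen on a handful of made-up probabilities. The sketch below uses the same rule as this notebook (predict non-risk only when the non-risk probability exceeds the threshold) and counts the confusion-matrix cells at two thresholds; the helper name `confusion_counts` and all numbers are illustrative assumptions, not results from the dataset.

```python
def confusion_counts(y_true, p_non_risk, threshold):
    """Count TN, FP, FN, TP when class 0 is predicted only if p_non_risk > threshold."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for truth, p0 in zip(y_true, p_non_risk):
        pred = 0 if p0 > threshold else 1
        if truth == 0:
            counts["TN" if pred == 0 else "FP"] += 1
        else:
            counts["FN" if pred == 0 else "TP"] += 1
    return counts

# Toy data: six reliable borrowers (0) and four risky ones (1).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
p_non_risk = [0.95, 0.90, 0.85, 0.70, 0.60, 0.55, 0.80, 0.65, 0.40, 0.20]

standard = confusion_counts(y_true, p_non_risk, 0.50)
strict = confusion_counts(y_true, p_non_risk, 0.75)
print(standard)  # {'TN': 6, 'FP': 0, 'FN': 2, 'TP': 2}
print(strict)    # {'TN': 3, 'FP': 3, 'FN': 1, 'TP': 3}
```

Raising the threshold catches one more risky borrower in this toy example, but three reliable borrowers are now flagged as risky.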

In [None]:
#RandomForestClassifier.predict() has no threshold parameter, so we take the class probabilities and apply the threshold ourselves
y_prob = rf_smt.predict_proba(X_test)
threshold=[x for x in np.linspace(0.5, 0.95, 10)]
roc=[]
acc=[]
for t in threshold:
    y_t=[0 if x[0]>t else 1 for x in y_prob]
    roc.append(roc_auc_score(y_test, y_t))
    acc.append(accuracy_score(y_test, y_t))

In [None]:
plt.figure(figsize=(12,8))
plt.title('ROC AUC and Accuracy vs. Threshold')
plt.plot(threshold, roc, label='ROC AUC Score')
plt.plot(threshold, acc, label='Accuracy Score')
plt.xlabel("Probability threshold for non-risk class")
plt.ylabel("Score")
plt.legend(loc='lower left')

In [None]:
from sklearn.metrics import confusion_matrix

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (16,6))

y_t=[0 if x[0]>0.75 else 1 for x in y_prob]

acc_tr=accuracy_score(y_test, y_t)
f1_tr=f1_score(y_test, y_t)
roc_tr=roc_auc_score(y_test, y_t)

ax1.set_title('Confusion matrix of the model (RF and SMOTETomek) \nwith 0.75 probability threshold')
ax2.set_title('Confusion matrix of the standard model (RF and SMOTETomek) \nwith 0.5 probability threshold')

sns.heatmap(confusion_matrix(y_test, y_t), annot=True, fmt='d',cmap=plt.cm.Blues, ax=ax1)
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d',cmap=plt.cm.Blues, ax=ax2)
ax1.set_xlabel('Predicted Risk Flag')
ax1.set_ylabel('True Risk Flag')
ax2.set_xlabel('Predicted Risk Flag')
ax2.set_ylabel('True Risk Flag')

As seen above, a threshold of 75% makes it possible to detect the majority of risky clients at the cost of almost doubling the False-Positive predictions.
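
A note on reading the ROC AUC values in this notebook: they are computed from hard 0/1 predictions, and for a binary problem that number coincides with balanced accuracy (the mean of the true-positive and true-negative rates), which is why it moves with the threshold. An AUC computed from probabilities does not depend on the threshold. The pure-Python sketch below checks this on made-up data; the helper names and numbers are assumptions for illustration only.

```python
def auc(y_true, scores):
    """ROC AUC as the share of (risky, non-risky) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def balanced_accuracy(y_true, y_pred):
    """Mean of the true-positive rate and the true-negative rate."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return (tp / sum(t == 1 for t in y_true) + tn / sum(t == 0 for t in y_true)) / 2

# Toy data: predicted probability of the risky class for ten loans.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
p_risk = [0.05, 0.10, 0.15, 0.30, 0.40, 0.45, 0.20, 0.35, 0.60, 0.80]
y_hard = [1 if p >= 0.5 else 0 for p in p_risk]

auc_hard = auc(y_true, y_hard)
bal_acc = balanced_accuracy(y_true, y_hard)
auc_prob = auc(y_true, p_risk)
print(auc_hard, bal_acc)  # 0.75 0.75
print(round(auc_prob, 4))  # 0.7917
```

With scikit-learn, the probability-based value would be obtained by passing `predict_proba(X_test)[:, 1]` rather than `predict(X_test)` to `roc_auc_score`.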

# Conclusion
The applied methods achieve comparable classification results, which can probably be improved by hyperparameter tuning.<br>
Changing the probability threshold can improve some model metrics, but at the expense of others.

In [None]:
results=pd.DataFrame.from_dict({'Balanced RF': [roc_brf,acc_brf,f1_brf], 'RF and ADASYN': [roc_ada,acc_ada,f1_ada], 
              'RF and SMOTETomek': [roc_smt,acc_smt,f1_smt], 'RF and SMOTETomek (0.75 Tr.)':[roc_tr,acc_tr,f1_tr]}, 
                       orient='index', columns=['ROC AUC', 'Accuracy', 'F1 score'])
print(results)