# **Team B Submission**:

# Project Summary : **"RUN This BLock"**
In this project we model credit risk for the peer-to-peer (P2P) lending platform Bondora. The data comes from a publicly available data set published by Bondora, a leading European P2P lending platform. It is a pool of both defaulted and non-defaulted loans from the period between 1st March 2009 and 27th January 2020, and it comprises demographic and financial information about borrowers as well as loan transactions.

In P2P lending, loans are typically uncollateralized, and lenders seek higher returns as compensation for the financial risk they take. In addition, they need to make decisions under information asymmetry that works in favor of the borrowers. To make rational decisions, lenders want to minimize the risk of default on each lending decision and realize a return that compensates for the risk.

In this notebook we preprocess the raw dataset and create a new preprocessed CSV file that can be used for building credit risk models.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # for visualization
import plotly.express as px # for visualization
import matplotlib.pyplot as plt # for visualization
%matplotlib inline

# To display all the columns of dataframe
pd.set_option('display.max_columns', 500)
import warnings
warnings.filterwarnings("ignore")

In [2]:
df=pd.read_csv('Bondora_preprocessed.csv', low_memory=False)

## Feature Engineering:

### 1. Handling Outliers:

In [3]:
# Let's compute IQR for each numerical feature
df_IQR = df[df.select_dtypes([float, int]).columns].quantile(.75) - df[df.select_dtypes([float, int]).columns].quantile(.25)

In [4]:
# Let's compute maximum and minimum limits
df_Max =  df[df.select_dtypes([float, int]).columns].quantile(.75) + (1.5*df_IQR)
df_Min =  df[df.select_dtypes([float, int]).columns].quantile(.25) - (1.5*df_IQR)

Now we will replace outliers of each column with Lower and Upper bounds of each column:

In [5]:
# Loop for replacing outliers above upper bound with the upper bound value:
for column in df.select_dtypes([float, int]).columns :
    col_IQR = df[column].quantile(.75) - df[column].quantile(.25)
    col_Max =  df[column].quantile(.75) + (1.5*col_IQR)
    # .loc avoids chained assignment, which may silently write to a copy
    df.loc[df[column] > col_Max, column] = col_Max

In [6]:
# Loop for replacing outliers under lower bound with the lower bound value:
for column in df.select_dtypes([float, int]).columns :
    col_IQR = df[column].quantile(.75) - df[column].quantile(.25)
    col_Min =  df[column].quantile(.25) - (1.5*col_IQR)
    # .loc avoids chained assignment, which may silently write to a copy
    df.loc[df[column] < col_Min, column] = col_Min
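
The two capping loops above can also be written as a single pass. The sketch below is a minimal illustration on a made-up frame: the helper name `cap_outliers_iqr` and the columns `a` and `b` are hypothetical and not part of the notebook. It assumes the intent is to clip every numeric column to the fences Q1 - 1.5 × IQR and Q3 + 1.5 × IQR.

```python
import pandas as pd

def cap_outliers_iqr(frame, k=1.5):
    """Return a copy of `frame` with numeric columns clipped to the IQR fences."""
    out = frame.copy()
    num_cols = out.select_dtypes(include="number").columns
    q1 = out[num_cols].quantile(0.25)
    q3 = out[num_cols].quantile(0.75)
    iqr = q3 - q1
    # With Series bounds and axis=1, each column is clipped to its own fences
    out[num_cols] = out[num_cols].clip(lower=q1 - k * iqr, upper=q3 + k * iqr, axis=1)
    return out

# Toy data (invented): one extreme high value in "a", one extreme low value in "b"
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 100.0],
                    "b": [10.0, 11.0, 12.0, 13.0, -50.0]})
capped = cap_outliers_iqr(toy)
```

Returning a copy rather than modifying the input keeps the raw values available for comparison, which can help when checking how many rows were actually capped.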

### 3. Feature Selection

In [7]:
# A function to select highly correlated features.
def Correlation(dataset, threshold): 
    correlated_features = set() # as a container of highly correlated features
    # Correlation is only defined for numeric/boolean columns
    correlation_matrix = dataset.select_dtypes(include=['number', 'bool']).corr()
    for i in range(len(correlation_matrix.columns)):
        for j in range(i):
            if abs(correlation_matrix.iloc[i, j]) > threshold:
                column_name = correlation_matrix.columns[i]
                correlated_features.add(column_name)
    return correlated_features

In [8]:
# Let's select features whose absolute correlation with an earlier feature is > 0.8
Correlation(df, 0.8)

{'Amount', 'AmountOfPreviousLoansBeforeLoan', 'NoOfPreviousLoansBeforeLoan'}

In [9]:
# Now we can drop these features from our dataset
df.drop(columns= ['Amount', 'AmountOfPreviousLoansBeforeLoan', 'NoOfPreviousLoansBeforeLoan'], inplace = True )
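
To see what this selection does, here is a small sketch on synthetic data. The frame and its column names (`x`, `x_copy`, `noise`) are invented for illustration, and `correlated_columns` is a hypothetical restatement of the function above, not the notebook's code. It flags the later column of each pair whose absolute correlation exceeds the threshold, so one column of every highly correlated pair survives.

```python
import numpy as np
import pandas as pd

def correlated_columns(frame, threshold):
    """Columns whose absolute correlation with any earlier column exceeds threshold."""
    corr = frame.corr().abs()
    cols = corr.columns
    flagged = set()
    for i in range(len(cols)):
        for j in range(i):  # only look at columns that come before column i
            if corr.iloc[i, j] > threshold:
                flagged.add(cols[i])
    return flagged

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "x_copy": 2 * x + 0.01 * rng.normal(size=200),  # almost a linear copy of x
    "noise": rng.normal(size=200),                  # unrelated column
})

to_drop = correlated_columns(toy, 0.8)
reduced = toy.drop(columns=sorted(to_drop))
```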

### 4. Feature Encoding

###### Let's divide our features into the "Target" feature and the "Independent" features:

In [10]:
Target_feature = df.LoanStatus
Ind_features   = df.drop(columns = ['LoanStatus'])

In [11]:
# Target_feature Encoding:
Target_feature = np.where(Target_feature == 'NoDefault', 1, 0)

In [12]:
# Ind_features Encoding:
Ind_features = pd.get_dummies(Ind_features)

In [13]:
Ind_features.shape

(77394, 136)
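
As a small illustration of why the column count grows after encoding, the sketch below applies `pd.get_dummies` to a made-up frame. The values are invented; only the column names `Age` and `Gender` echo the dataset. Numeric columns pass through unchanged, and each categorical column expands into one indicator column per distinct value.

```python
import pandas as pd

# Toy frame (invented values): one numeric and one categorical column
toy = pd.DataFrame({"Age": [25, 40, 31],
                    "Gender": ["Male", "Female", "Male"]})

encoded = pd.get_dummies(toy)
encoded_columns = list(encoded.columns)  # Age plus one column per Gender value
```

Passing `drop_first=True` would drop one indicator per categorical column, which removes the exact linear dependence between the indicators of the same feature. Whether that is desirable here depends on the downstream model.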

### 5. Feature Scaling

In [14]:
from sklearn.preprocessing import StandardScaler 

Scalar = StandardScaler()

Ind_features = Scalar.fit_transform(Ind_features)
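
For reference, standardization subtracts each column's mean and divides by its standard deviation. The sketch below reproduces that by hand with NumPy on a tiny invented matrix; it is an illustration of the transformation, not a replacement for the cell above.

```python
import numpy as np

# Toy feature matrix (invented): two columns on very different scales
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0)   # population standard deviation (ddof=0), as StandardScaler uses
X_toy_scaled = (X_toy - col_mean) / col_std
```

In a train/test workflow, the mean and standard deviation are usually estimated on the training rows only and then reused for the test rows, so that no information from the test set leaks into the fit.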

### 6. Feature Extraction and Dimensionality-reduction using (PCA) 

In [15]:
# importing PCA class
from sklearn.decomposition import PCA

# Create a PCA object with number of components = 110
pca = PCA(n_components = 110)

# Let's fit our data using PCA
Ind_features_pca = pca.fit_transform(Ind_features)


# Percentage of variance retained after applying PCA with 110 components
sum(pca.explained_variance_ratio_) * 100

99.75561670015288
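
The number of components can also be derived from a target share of variance instead of being fixed by hand. The sketch below shows the idea with plain NumPy on synthetic data; the data, the 99% target, and all variable names are assumptions for illustration. The explained variance ratio of each principal component is its squared singular value divided by the sum of all squared singular values of the centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 3 informative directions spread over 10 columns, plus small noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_toy = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

X_centered = X_toy - X_toy.mean(axis=0)            # PCA operates on centered data
singular_values = np.linalg.svd(X_centered, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative ratio reaches 99%
k = int(np.searchsorted(cumulative, 0.99) + 1)
```

scikit-learn's `PCA` accepts a float between 0 and 1 as `n_components` for the same purpose (with the full SVD solver), which may be more convenient than picking 110 manually.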

## PD Modeling (Classification)


### 7. Splitting Data into training and testing sets

In [16]:
X = Ind_features_pca
y = Target_feature

In [17]:
# Let's use Train Test Split 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size = .75, stratify=y)

In [18]:
X_train.shape, y_train.shape

((58045, 110), (58045,))

In [19]:
X_test.shape, y_test.shape

((19349, 110), (19349,))

### I. Gradient Boosting Classifier:

**Base Model:**

In [29]:
from sklearn.ensemble import GradientBoostingClassifier

#Training Base model
base_gb = GradientBoostingClassifier(n_estimators=500,learning_rate=0.05,random_state=100,max_features=5)

base_gb.fit(X_train, y_train)

base_gb_pred = base_gb.predict(X_test)

In [30]:
# Build a Metric Report 
from sklearn.metrics import classification_report, confusion_matrix , accuracy_score

print("Confusion Matrix:")
print(confusion_matrix(y_test, base_gb_pred))

print("Classification Report")
print(classification_report(y_test, base_gb_pred))

print("GBC accuracy is %2.2f" % accuracy_score(y_test, base_gb_pred))

Confusion Matrix:
[[8801 1898]
 [1602 7048]]
Classification Report
              precision    recall  f1-score   support

           0       0.85      0.82      0.83     10699
           1       0.79      0.81      0.80      8650

    accuracy                           0.82     19349
   macro avg       0.82      0.82      0.82     19349
weighted avg       0.82      0.82      0.82     19349

GBC accuracy is 0.82


In [31]:
# Hyperparameter Tuning
from sklearn.model_selection import RandomizedSearchCV

gb_model = GradientBoostingClassifier(n_estimators = 50, max_depth = 5)

# First create a list of learning rates
param_dist = {'learning_rate': [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]}

random_model = RandomizedSearchCV(gb_model, param_distributions=param_dist)

random_model.fit(X_train, y_train)

gb_preds = random_model.best_estimator_.predict(X_test)

In [32]:
print("Best Estimator:")
print(random_model.best_estimator_)

print("Confusion Matrix:")
print(confusion_matrix(y_test, gb_preds))

print("Classification Report")
print(classification_report(y_test, gb_preds))

print("GBC accuracy is %2.2f" % accuracy_score(y_test, gb_preds))

Best Estimator:
GradientBoostingClassifier(learning_rate=0.5, max_depth=5, n_estimators=50)
Confusion Matrix:
[[8976 1723]
 [1390 7260]]
Classification Report
              precision    recall  f1-score   support

           0       0.87      0.84      0.85     10699
           1       0.81      0.84      0.82      8650

    accuracy                           0.84     19349
   macro avg       0.84      0.84      0.84     19349
weighted avg       0.84      0.84      0.84     19349

GBC accuracy is 0.84



**Final Takeaway - GBC on PCA Dataset**

Accuracy: ~ 84%

Achieved using:
+ learning rate of 0.5
+ n_estimators 50
+ max_depth 5
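
The figures in the classification report can be checked directly against the tuned model's confusion matrix printed above. The sketch below recomputes them with NumPy; the matrix is copied from that output, and the variable names are illustrative.

```python
import numpy as np

# Confusion matrix of the tuned model, copied from the output above
# (rows = true class, columns = predicted class)
cm = np.array([[8976, 1723],
               [1390, 7260]])

accuracy = np.trace(cm) / cm.sum()            # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)      # per predicted class
recall = np.diag(cm) / cm.sum(axis=1)         # per true class
f1 = 2 * precision * recall / (precision + recall)
```

Rounded to two decimals, these values agree with the report: accuracy 0.84, precision 0.87 and 0.81, recall 0.84 for both classes.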

### II. Support Vector Machine Classifier:

**Base-model**

In [20]:
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

In [22]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("Classification Report")
print(classification_report(y_test, y_pred))

print("SVC accuracy is %2.2f" % accuracy_score(y_test, y_pred))

Confusion Matrix:
[[8687 2012]
 [ 451 8199]]
Classification Report
              precision    recall  f1-score   support

           0       0.95      0.81      0.88     10699
           1       0.80      0.95      0.87      8650

    accuracy                           0.87     19349
   macro avg       0.88      0.88      0.87     19349
weighted avg       0.88      0.87      0.87     19349

SVC accuracy is 0.87


In [24]:
from sklearn.model_selection import RandomizedSearchCV

# Choosing ranges of hyperparameters
param_dist = {"C": [0.1, 0.5, 1, 3, 5],
              "kernel": ['linear','rbf']}

# Run a randomized search over the hyperparameters
svc_search = RandomizedSearchCV(svm.SVC(), param_distributions=param_dist)

# Fit the model on the training data
svc_search.fit(X_train, y_train)

# Make predictions on the test data
svc_preds = svc_search.best_estimator_.predict(X_test)

In [None]:
print("Best Estimator:")
print(svc_search.best_estimator_)

print("Confusion Matrix:")
print(confusion_matrix(y_test, svc_preds))

print("Classification Report")
print(classification_report(y_test, svc_preds))

print("SVC accuracy is %2.2f" % accuracy_score(y_test, svc_preds))

# Modeling (Regression) **Current Task**

## Preprocessing (Creating **Target Variables**)

In [None]:
loan_data = df.copy()

In [None]:
loan_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77394 entries, 0 to 77393
Data columns (total 39 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   BidsPortfolioManager                    77394 non-null  int64  
 1   BidsApi                                 77394 non-null  int64  
 2   BidsManual                              77394 non-null  float64
 3   NewCreditCustomer                       77394 non-null  bool   
 4   VerificationType                        77394 non-null  object 
 5   LanguageCode                            77394 non-null  object 
 6   Age                                     77394 non-null  int64  
 7   Gender                                  77394 non-null  object 
 8   Country                                 77394 non-null  object 
 9   AppliedAmount                           77394 non-null  float64
 10  Amount                                  77394 non-null  fl

### The **EMI is calculated based on the following mathematical formula**:

* **EMI = P × r × (1 + r) ^ n / ((1 + r) ^ n – 1)**

> Where,
> 
> **P** = Loan amount. **"Amount"**
>
> **r** = Rate of interest, which is calculated on a monthly basis. **Interest**
>
> **n** = Loan tenure (in months). **LoanDuration**


* **Loan tenure:** is the amount of time you are given to repay your loan

* **Amount** is no longer present in this dataset (it was dropped during feature engineering), so I'll load it from the original dataset.
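
As a worked illustration of the formula above, the sketch below computes the EMI for a single loan. It assumes the quoted rate is an annual percentage, so it is converted to a monthly fraction before use. That conversion, the helper name `emi`, and the example figures are assumptions for illustration, not values taken from the dataset.

```python
def emi(principal, annual_rate_pct, months):
    """Equated monthly instalment for a fixed-rate loan."""
    r = annual_rate_pct / 100 / 12      # monthly rate as a fraction (assumed convention)
    if r == 0:
        return principal / months       # interest-free loan: equal principal parts
    growth = (1 + r) ** months          # (1 + r) ^ n
    return principal * r * growth / (growth - 1)

# Example: 1000 borrowed at 12% a year, repaid over 12 months
payment = emi(1000.0, 12.0, 12)
total_repaid = payment * 12
```

With these inputs the instalment comes to about 88.85 per month. If the rate were passed as a raw percentage (for example 12 instead of 0.01), the term (1 + r) ^ n would dominate and the result would be close to P × r, far above any realistic instalment.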

In [None]:
main_data = pd.read_csv('/content/drive/MyDrive/Technocolabs_Team/Bondora_EDA.csv')
print(main_data.shape)
main_data['Amount'].head()

(77394, 39)


0    115.0408
1    140.6057
2    319.5409
3     57.5205
4    319.5436
Name: Amount, dtype: float64

In [None]:
main_data_amt = main_data.loc[loan_data.index, 'Amount']
main_data_amt.head()

0    115.0408
1    140.6057
2    319.5409
3     57.5205
4    319.5436
Name: Amount, dtype: float64

In [None]:
loan_data['Amount'] = main_data_amt.values

In [None]:
loan_data_temp = loan_data[['LoanDuration', 'Interest', 'Amount']].copy()  # copy, so new columns can be added safely
loan_data_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77394 entries, 0 to 77393
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   LoanDuration  77394 non-null  int64  
 1   Interest      77394 non-null  float64
 2   Amount        77394 non-null  float64
dtypes: float64(2), int64(1)
memory usage: 1.8 MB


In [None]:
loan_data_temp.isnull().sum()

LoanDuration    0
Interest        0
Amount          0
dtype: int64

In [None]:
def cal_EMI(P, r, n):
  # EMI = P × r × (1 + r) ^ n / ((1 + r) ^ n – 1), evaluated for all rows at once.
  # Note: r is used exactly as passed in. The formula expects a monthly rate
  # expressed as a fraction, so a percentage column must be converted first.
  P = P.values
  r = r.values
  n = n.values
  growth = np.power(1 + r, n)  # (1 + r) ^ n
  return P * r * growth / (growth - 1)

In [None]:
loan_data_temp['EMI'] = cal_EMI(loan_data_temp['Amount'], loan_data_temp['Interest'], loan_data_temp['LoanDuration'])

In [None]:
loan_data['EMI'] = loan_data_temp['EMI']

In [None]:
loan_data['EMI'].head()

0    3451.2240
1    3655.7482
2    7988.5225
3    2588.4225
4    9586.3080
Name: EMI, dtype: float64

### **Eligible Loan Amount DONE**

**Under Consideration (Research)**

**ELA** = Assets (Income) - Liabilities of the borrower

* Assets:
> **FreeCash** = ELA
>
> **TotalIncome** - **LiabilitiesTotal** = ELA
>
> Let's Check both of them out in the data.


Eligible Loan Amount means, with respect to a **Mortgage Loan** that is an Eligible Loan, **the lesser of**:

(i) the Principal Balance of such Eligible Loan, **AppliedAmount**

(ii) the Market Value of such Eligible Loan **PurchasePrice** | **BidPrinciple**

**My Approach**

1. Calculate AppliedAmount + AppliedAmount * Interest / 100 = Total Liabilities Amount
2. Divide by the loan tenure (months)
3. If the result is less than (IncomeTotal - LiabilitiesTotal) * 30/100
>Then allow the Applied Amount. If not, allow only the result of the previous calculation.

In [None]:
loan_data_temp = loan_data[['AppliedAmount', 'Interest', 'IncomeTotal', 'LiabilitiesTotal', 'LoanDuration']].copy()
loan_data_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77394 entries, 0 to 77393
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   AppliedAmount     77394 non-null  float64
 1   Interest          77394 non-null  float64
 2   IncomeTotal       77394 non-null  float64
 3   LiabilitiesTotal  77394 non-null  float64
 4   LoanDuration      77394 non-null  int64  
dtypes: float64(4), int64(1)
memory usage: 3.0 MB


In [None]:
loan_data_temp[loan_data_temp['IncomeTotal']==3665].shape

(0, 5)

In [None]:
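# Step 1
# Available income: 30% of income net of existing liabilities
loan_data_temp['Ava_Inc'] = ((loan_data_temp['IncomeTotal']-loan_data_temp['LiabilitiesTotal'])*0.3)
# Total loan amount: principal plus interest, multiplied by the loan duration
loan_data_temp['Total_Loan_Amnt'] = np.round((loan_data_temp['AppliedAmount'] + (loan_data_temp['AppliedAmount'] * loan_data_temp['Interest']) /100)*loan_data_temp['LoanDuration'])
loan_data_temp.head()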

| | AppliedAmount | Interest | IncomeTotal | LiabilitiesTotal | LoanDuration | Ava_Inc | Total_Loan_Amnt |
|---|---|---|---|---|---|---|---|
| 0 | 319.5582 | 30.0 | 10500.0 | 0.0 | 12 | 3150.0 | 4985.0 |
| 1 | 191.7349 | 25.0 | 10800.0 | 0.0 | 1 | 3240.0 | 240.0 |
| 2 | 319.5582 | 25.0 | 7000.0 | 0.0 | 20 | 2100.0 | 7989.0 |
| 3 | 127.8233 | 45.0 | 11600.0 | 0.0 | 15 | 3480.0 | 2780.0 |
| 4 | 319.5582 | 30.0 | 6800.0 | 0.0 | 12 | 2040.0 | 4985.0 |


In [None]:
# Step 2
def eligible_loan_amnt(df):
  # ELA is the total loan amount when it fits within the available income,
  # otherwise it is capped at the available income.
  Ava_Inc = df['Ava_Inc'].values
  Total_Loan_Amnt = df['Total_Loan_Amnt'].values
  return np.where(Total_Loan_Amnt <= Ava_Inc, Total_Loan_Amnt, Ava_Inc)

In [None]:
loan_data_temp['ELA'] = eligible_loan_amnt(loan_data_temp)

In [None]:
loan_data_temp.head()

| | AppliedAmount | Interest | IncomeTotal | LiabilitiesTotal | LoanDuration | Ava_Inc | Total_Loan_Amnt | ELA |
|---|---|---|---|---|---|---|---|---|
| 0 | 319.5582 | 30.0 | 10500.0 | 0.0 | 12 | 3150.0 | 4985.0 | 3150.0 |
| 1 | 191.7349 | 25.0 | 10800.0 | 0.0 | 1 | 3240.0 | 240.0 | 240.0 |
| 2 | 319.5582 | 25.0 | 7000.0 | 0.0 | 20 | 2100.0 | 7989.0 | 2100.0 |
| 3 | 127.8233 | 45.0 | 11600.0 | 0.0 | 15 | 3480.0 | 2780.0 | 2780.0 |
| 4 | 319.5582 | 30.0 | 6800.0 | 0.0 | 12 | 2040.0 | 4985.0 | 2040.0 |


In [None]:
loan_data['ELA'] = loan_data_temp['ELA']
loan_data.columns

Index(['BidsPortfolioManager', 'BidsApi', 'BidsManual', 'NewCreditCustomer',
       'VerificationType', 'LanguageCode', 'Age', 'Gender', 'Country',
       'AppliedAmount', 'Amount', 'Interest', 'LoanDuration', 'MonthlyPayment',
       'UseOfLoan', 'Education', 'MaritalStatus', 'EmploymentStatus',
       'EmploymentDurationCurrentEmployer', 'OccupationArea',
       'HomeOwnershipType', 'IncomeTotal', 'ExistingLiabilities',
       'LiabilitiesTotal', 'RefinanceLiabilities', 'DebtToIncome', 'FreeCash',
       'Rating', 'Restructured', 'CreditScoreEsMicroL',
       'PrincipalPaymentsMade', 'InterestAndPenaltyPaymentsMade',
       'PrincipalBalance', 'InterestAndPenaltyBalance',
       'NoOfPreviousLoansBeforeLoan', 'AmountOfPreviousLoansBeforeLoan',
       'PreviousRepaymentsBeforeLoan',
       'PreviousEarlyRepaymentsCountBeforeLoan', 'LoanStatus', 'EMI', 'ELA'],
      dtype='object')

**ROI**

In [None]:
loan_data_temp = loan_data[['LoanDuration', 'Interest', 'Amount']]

In [None]:
loan_data_temp.head()

| | LoanDuration | Interest | Amount |
|---|---|---|---|
| 0 | 12 | 30.0 | 115.0408 |
| 1 | 1 | 25.0 | 140.6057 |
| 2 | 20 | 25.0 | 319.5409 |
| 3 | 15 | 45.0 | 57.5205 |
| 4 | 12 | 30.0 | 319.5436 |


### **Preferred ROI**

* We weren't able to determine how loan-related risk should be handled in order to derive the **Preferred ROI**.

* To complete the task at hand, we'll calculate **ROI** instead.
>**ROI** = Investment Gain / Investment Base
>
> Investment Gain = Amount lent * Interest/100

* **InterestAndPenaltyDebtServicingCost**: Service cost related to the recovery of the debt based on the interest and penalties of the investment

* **InterestAndPenaltyWriteOffs**: Interest that was written off on the investment

* **PrincipalDebtServicingCost**: Service cost related to the recovery of the debt based on the principal of the investment

* **PrincipalWriteOffs**: Principal that was written off on the investment

* **PurchasePrice**: Investment amount or secondary market purchase price

In [None]:
loan_data_temp = loan_data[['Amount', 'Interest']].copy()  # copy, so new columns can be added safely
loan_data_temp.head()

| | Amount | Interest |
|---|---|---|
| 0 | 115.0408 | 30.0 |
| 1 | 140.6057 | 25.0 |
| 2 | 319.5409 | 25.0 |
| 3 | 57.5205 | 45.0 |
| 4 | 319.5436 | 30.0 |


In [None]:
loan_data_temp['InterestAmount'] = (loan_data_temp['Amount']*(loan_data_temp['Interest']/100))
loan_data_temp['TotalAmount'] = (loan_data_temp['InterestAmount'] + loan_data_temp['Amount'])
loan_data_temp['ROI'] = (loan_data_temp['InterestAmount'] / loan_data_temp['TotalAmount'])*100
loan_data['ROI'] = loan_data_temp['ROI']

In [None]:
loan_data_temp.head()

| | Amount | Interest | InterestAmount | TotalAmount | ROI |
|---|---|---|---|---|---|
| 0 | 115.0408 | 30.0 | 34.51224 | 149.55304 | 23.076923 |
| 1 | 140.6057 | 25.0 | 35.151425 | 175.757125 | 20.0 |
| 2 | 319.5409 | 25.0 | 79.885225 | 399.426125 | 20.0 |
| 3 | 57.5205 | 45.0 | 25.884225 | 83.404725 | 31.034483 |
| 4 | 319.5436 | 30.0 | 95.86308 | 415.40668 | 23.076923 |
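
The first row of this output can be reproduced by hand, which also makes the choice of investment base explicit. The sketch below uses the first row's `Amount` and `Interest`; the variable names are illustrative. The notebook divides the interest by the total amount (principal plus interest). Dividing by the amount lent is another common choice of base, and which one is intended is not stated here.

```python
amount = 115.0408       # Amount in the first row of the output above
interest_pct = 30.0     # Interest in the first row, in percent

interest_amount = amount * interest_pct / 100
total_amount = amount + interest_amount

# Base = principal plus interest (the calculation used in the cell above)
roi_on_total = interest_amount / total_amount * 100
# Base = amount lent (alternative convention, shown for comparison)
roi_on_principal = interest_amount / amount * 100
```

With the total amount as the base, the ROI depends only on the interest rate (30 / 130 × 100 ≈ 23.08), which is why rows with the same `Interest` share the same ROI regardless of `Amount`.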


## **Multi-Linear Regression Modeling:**

### Preprocessing:

**Separating Features and Targets**

In [None]:
X = loan_data.drop(["ROI","EMI","ELA"], axis=1)
y = loan_data[["ROI","EMI","ELA"]]

**Dummy variables**

In [None]:
# Let's perform categorical features encoding:
X = pd.get_dummies(X)

**Scaling**

In [None]:
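from sklearn.preprocessing import StandardScaler 

# Separate scalers for features and targets, so each can be inverted on its own
x_scaler = StandardScaler()
y_scaler = StandardScaler()

X = x_scaler.fit_transform(X)

y = y_scaler.fit_transform(y)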

**Splitting Test and Train**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.8,random_state=0)

### **LinearRegression Model**

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(X_train,y_train)

predictions = model.predict(X_test)

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

print('mean_squared_error : ', mean_squared_error(y_test, predictions))
print('R2_score : ', r2_score(y_test, predictions))

mean_squared_error :  1.1814217675500299e+17
R2_score :  -4.002559983603161e+17
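
A strongly negative R² means the predictions are far worse than always predicting the mean of the target. The sketch below computes R² by hand on invented numbers to show how the score behaves; the arrays and the helper `r2` are illustrative only and unrelated to the notebook's data.

```python
import numpy as np

def r2(y, y_hat):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
good_pred = np.array([1.1, 1.9, 3.2, 3.8])        # close to the targets
bad_pred = np.array([10.0, -10.0, 10.0, -10.0])   # wildly off

r2_good = r2(y_true_toy, good_pred)
r2_bad = r2(y_true_toy, bad_pred)
```

Here `r2_good` is 0.98 and `r2_bad` is -93. With several targets, as in this notebook, scikit-learn's `r2_score` averages the per-target scores by default, so a single badly predicted target can dominate the reported value.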


### **Lasso Regression**

**Base Model**







In [None]:
from sklearn.linear_model import Lasso 

lasso = Lasso(alpha=0.1)

lasso.fit(X_train,y_train)

y_pred = lasso.predict(X_test)

In [None]:
print('Lasso Base Model:')
print("mean_squared_error: ", mean_squared_error(y_test,y_pred))
print("r2_score: ", r2_score(y_test, y_pred))

Lasso Base Model:
mean_squared_error:  0.09195002180567276
r2_score:  0.824406014356566


In [None]:
np.sqrt(mean_squared_error(y_test,y_pred))

0.3032326199564828

**Hyperparameter Tuning**

In [None]:
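# Tuning the Lasso regressor with a randomized hyperparameter search
from sklearn.model_selection import RandomizedSearchCV

lasso_reg = Lasso()

# Note: the "normalize" option was removed from Lasso in scikit-learn 1.2;
# drop that key when running on a newer version.
param_dist = {"alpha": list(np.array(range(1,9))*0.25),
              "normalize": [True, False]}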

random_search = RandomizedSearchCV(lasso_reg, param_distributions=param_dist)

random_search.fit(X_train, y_train)

lasso_preds = random_search.best_estimator_.predict(X_test)

In [None]:
print("Lasso Best Estimator: ")
print("Best Estimator: \n", random_search.best_estimator_)
print("mean_squared_error: ", mean_squared_error(y_test,lasso_preds))
print("r2_score: ", r2_score(y_test, lasso_preds))

Lasso Best Estimator: 
Best Estimator: 
 Lasso(alpha=0.25, normalize=False)
mean_squared_error:  0.16677337709504667
r2_score:  0.639150685473945


In [None]:
np.sqrt(mean_squared_error(y_test,lasso_preds))

171.61456474309256