# <b> Problem Statement </b>


You work as a Business Analytics Consultant at the Bank of Corporate. The bank saw slower-than-usual growth in its book of business in the most recent quarter of 2018. It provides financial services and products such as savings accounts, current accounts and debit cards to its customers. The data suggested that the home loan business was the one hit by a major loss. Loans are the core business of a bank, and most of its profit comes directly from loan interest.

The head of the Home Loan business asked the heads of the Sales, Operations, Risk and Analytics teams to investigate, identify the root causes of the slowing growth, and solve the problem. A business such as selling home loans can grow or shrink because of demand-side, supply-side or competitive factors. Some possible reasons are:

- <b> Demand Side: </b> Are interest rates high?
- <b> Demand Side: </b> Are there any macroeconomic reasons, such as recession, low salary growth or inflation?
- <b> Supply Side: </b> Are new and attractive housing projects not available in the markets being served?
- <b> Supply Side: </b> Have real estate prices shot up making homes unaffordable, relatively speaking?
- <b> Competitor Side: </b> Are we losing customers to our competition? Is our competition also facing lower growth?



The team found that credit risk was abnormally high and that the loan default rate was elevated. So what do default loans and credit risk mean?

<b> Default loans : </b> <br>
Default is the failure to repay a loan according to the terms agreed to in the promissory note. For example, for most federal student loans, you will default if you have not made a payment in more than 270 days.

<b> Credit risk : </b> <br>
Credit risk is the risk a bank takes when lending money to borrowers. Borrowers might default and fail to repay their dues on time, which results in losses to the bank.

## <b> So, what do banks do then? </b> <br>
They need to manage their credit risks. The goal of credit risk management in banks is to maintain credit risk exposure within proper and acceptable parameters. It is the practice of mitigating losses by understanding the adequacy of a bank’s capital and loan loss reserves at any given time. For this, banks not only need to manage the entire portfolio but also individual credits.

## <b> Measures taken </b> <br>
In 2019, the bank came up with a project to build a "Credit risk estimate model" for its home loan branch. A loan should be granted only after an intensive process of verification and validation. The dataset (provided below) contains information about all the customers who were contacted during this year and were provided loans based on various parameters. The "Credit risk estimate model" needs to be cost-efficient so that the bank not only decreases its credit risk but also increases its total profit.


## <b> Business objective </b> <br>

Your aim is to build the "Credit risk estimate model" to classify new loans as "Low Risk", "Medium Risk" or "High Risk". This will help the bank sanction loans to "Low Risk" customers, follow up with the latest information/data for "Medium Risk" customers, and reject loan applications from "High Risk" customers.

## <b> Read the dataset

In [3]:
#Import the libraries
import pandas as pd, numpy as np

#Load the loan dataset
loan_data = pd.read_csv("loan2.csv") 

#Details of the dataset
loan_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38642 entries, 0 to 38641
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           38642 non-null  int64  
 1   loan_amnt    38642 non-null  int64  
 2   funded_amnt  38642 non-null  int64  
 3   int_rate     38642 non-null  float64
 4   installment  38642 non-null  float64
 5   emp_length   38642 non-null  int64  
 6   annual_inc   38642 non-null  float64
 7   loan_status  38642 non-null  object 
dtypes: float64(3), int64(4), object(1)
memory usage: 2.4+ MB


The dataset has the following columns: </b>

<b> id : </b>Transaction ID used to identify each transaction uniquely <br>
<b> loan_amnt : </b> Loan amount that was requested by the customer <br>
<b> funded_amnt : </b>Amount that was sanctioned by the bank <br>
<b> int_rate : </b> Interest rate offered on the loan amount <br>
<b> installment : </b>Amount of money paid during each installment <br>
<b> emp_length :  </b> Work experience (employment length of the customer)  <br>
<b> annual_inc : </b>Annual income of the customer<br>
<b> loan_status : </b> Risk class of the loan: "High Risk", "Low Risk" or "Medium Risk" <br>

In [4]:
#Preview the first five rows of the dataset
loan_data.head()


Unnamed: 0,id,loan_amnt,funded_amnt,int_rate,installment,emp_length,annual_inc,loan_status
0,1077501,5000,5000,10.65,162.87,10,24000.0,Low Risk
1,1077430,2500,2500,15.27,59.83,1,30000.0,High Risk
2,1077175,2400,2400,15.96,84.33,10,12252.0,Low Risk
3,1076863,10000,10000,13.49,339.31,10,49200.0,Low Risk
4,1075358,3000,3000,12.69,67.79,1,80000.0,Medium Risk


In [12]:
# Distribution of the target variable
loan_data.loan_status.value_counts()

Low Risk       32145
High Risk       5399
Medium Risk     1098
Name: loan_status, dtype: int64
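
The counts above show a heavily imbalanced target. As a rough illustration, the sketch below uses only the three counts printed above to compute each class's share and the accuracy a trivial "always predict the majority class" model would reach on the full dataset. The variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"Low Risk": 32145, "High Risk": 5399, "Medium Risk": 1098})

# Share of each class in the dataset
shares = counts / counts.sum()
print(shares.round(4))

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = shares.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2%}")
```

Any model trained later is worth comparing against this baseline, not only against 0%.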

## <b> Data Cleaning </b>

In [13]:
#Check for null values
loan_data.isnull().sum()

id             0
loan_amnt      0
funded_amnt    0
int_rate       0
installment    0
emp_length     0
annual_inc     0
loan_status    0
dtype: int64

In [14]:
#Check for duplicate values
loan_data.duplicated().sum()

0

## <b> Feature Creation </b>
<b> fund_perc: </b> Ratio of the amount sanctioned to the loan amount requested. A higher value indicates that the bank was more willing to lend to the customer. <br>
<b> incToloan_perc: </b> Ratio of annual income to the loan amount requested. A higher value suggests that the customer is more likely to repay without defaulting.

In [15]:
# Adding new variables
# fund_perc represents the ratio of the funded amount to the loan amount
loan_data["fund_perc"] = loan_data["funded_amnt"]/loan_data["loan_amnt"] 

# incToloan_perc represents the ratio of annual income to the loan amount
loan_data["incToloan_perc"] = loan_data["annual_inc"]/loan_data["loan_amnt"] 


In [16]:
loan_data.head(2)

Unnamed: 0,id,loan_amnt,funded_amnt,int_rate,installment,emp_length,annual_inc,loan_status,fund_perc,incToloan_perc
0,1077501,5000,5000,10.65,162.87,10,24000.0,Low Risk,1.0,4.8
1,1077430,2500,2500,15.27,59.83,1,30000.0,High Risk,1.0,12.0
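
As a quick sanity check on the two derived features, the sketch below recomputes them for the two rows shown above. The small `sample` frame is a hand-copied stand-in for `loan_data`, not the real dataset.

```python
import pandas as pd

# First two rows of the dataset, copied from the head() output above
sample = pd.DataFrame({
    "loan_amnt": [5000, 2500],
    "funded_amnt": [5000, 2500],
    "annual_inc": [24000.0, 30000.0],
})

# Ratio of sanctioned amount to requested amount (1.0 means fully funded)
sample["fund_perc"] = sample["funded_amnt"] / sample["loan_amnt"]

# Ratio of annual income to requested amount
sample["incToloan_perc"] = sample["annual_inc"] / sample["loan_amnt"]

print(sample[["fund_perc", "incToloan_perc"]])
```

Both features are plain ratios, so despite the `_perc` suffix they are not multiplied by 100.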


In [19]:
# Understanding distribution of all the numerical variables in dataset
loan_data.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,38642.0,681040.355391,211304.549803,54734.0,513435.0,662770.5,836491.25,1077501.0
loan_amnt,38642.0,11291.615988,7462.136215,500.0,5500.0,10000.0,15000.0,35000.0
funded_amnt,38642.0,11017.101211,7193.038828,500.0,5500.0,9950.0,15000.0,35000.0
int_rate,38642.0,12.052427,3.716705,5.42,9.32,11.86,14.59,24.59
installment,38642.0,326.760477,209.143908,15.69,168.4425,282.83,434.3975,1305.19
emp_length,38642.0,5.09205,3.408338,1.0,2.0,4.0,9.0,10.0
annual_inc,38642.0,69608.277211,64253.20224,4000.0,41400.0,60000.0,83199.99,6000000.0
fund_perc,38642.0,0.985571,0.070317,0.10125,1.0,1.0,1.0,1.0
incToloan_perc,38642.0,8.91595,13.845454,1.204819,4.0,6.066667,10.016699,1266.667


In [22]:
#column names
loan_data.columns

Index(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment',
       'emp_length', 'annual_inc', 'loan_status', 'fund_perc',
       'incToloan_perc'],
      dtype='object')

In [23]:
loan_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38642 entries, 0 to 38641
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              38642 non-null  int64  
 1   loan_amnt       38642 non-null  int64  
 2   funded_amnt     38642 non-null  int64  
 3   int_rate        38642 non-null  float64
 4   installment     38642 non-null  float64
 5   emp_length      38642 non-null  int64  
 6   annual_inc      38642 non-null  float64
 7   loan_status     38642 non-null  object 
 8   fund_perc       38642 non-null  float64
 9   incToloan_perc  38642 non-null  float64
dtypes: float64(5), int64(4), object(1)
memory usage: 2.9+ MB


## <b> Train-Test Split </b>

In [26]:
loan_data.select_dtypes(np.number).columns

Index(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment',
       'emp_length', 'annual_inc', 'fund_perc', 'incToloan_perc'],
      dtype='object')

In [29]:
# choosing all the numerical variables as independent variables (the classifier can only take numerical input)
# dropping two variables: id (an identifier, not a predictor) and funded_amnt (replaced by the derived fund_perc variable)
X = loan_data.select_dtypes(np.number).drop(['id','funded_amnt'],axis=1)

#Dependent variable representing status of the loan
Y = loan_data['loan_status']

#splitting the dataset in train and test datasets using a split ratio of 70:30
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.3,random_state=100)

# standardizing all the variables using standard scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
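
The split above is a plain random split. Because the target is imbalanced, one option (not used in this notebook) is a stratified split, which keeps each class's share the same in the train and test sets. The sketch below shows the idea on synthetic data; `X_demo`, `y_demo` and the class counts are made up for illustration. It also repeats the pattern used above of fitting the scaler on the training split only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(100)

# Synthetic stand-in for the loan data: 1000 rows, 3 features, imbalanced labels
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array(["Low Risk"] * 830 + ["High Risk"] * 140 + ["Medium Risk"] * 30)

# stratify=y_demo keeps each class's share the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100, stratify=y_demo
)

# Fit the scaler on the training split only, then reuse it on the test split
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# Class counts in the test split
test_counts = dict(zip(*np.unique(y_te, return_counts=True)))
print(test_counts)
```

Fitting the scaler on the training data alone keeps information from the test set out of the preprocessing step.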

## <b> Model Building </b>

In [32]:
# Import the required libraries
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings("ignore")

In [33]:

# Building a classification model using the one-vs-rest method

lr = LogisticRegression()
oneVsrest = OneVsRestClassifier(lr)

# Fitting the model with training data

oneVsrest.fit(X_train_scaled,y_train)


OneVsRestClassifier(estimator=LogisticRegression())

## <b> Model Prediction </b>

In [34]:
# Making a prediction on the test set
prediction_oneVsrest = oneVsrest.predict(X_test_scaled)

   
# Evaluating the model

print(f"Test Set Accuracy :{accuracy_score(y_test,prediction_oneVsrest)*100}%\n\n")
print(f"Classification Report :\n\n{classification_report(y_test,prediction_oneVsrest)}")


Test Set Accuracy :82.94660571034245%


Classification Report :

              precision    recall  f1-score   support

   High Risk       0.47      0.01      0.02      1665
    Low Risk       0.83      1.00      0.91      9612
 Medium Risk       0.00      0.00      0.00       316

    accuracy                           0.83     11593
   macro avg       0.43      0.34      0.31     11593
weighted avg       0.76      0.83      0.75     11593



<b> Accuracy : </b> Accuracy is the most intuitive performance measure: the ratio of correctly predicted observations to the total number of observations. On imbalanced data it can mislead. In the report above, 83% accuracy coincides with a recall of 0.01 for High Risk and 0.00 for Medium Risk, and always predicting Low Risk would already give 9612/11593 ≈ 82.9%. For one class treated as "positive", the formula is: <br>
<b> *Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)* </b> <br> <br>
<b> Precision : </b> Of all the observations predicted as a given class, the fraction that actually belong to that class. The formula is : <br>
<b> *Precision = True Positives / (True Positives + False Positives)* </b> <br> <br>
<b> Recall : </b> Of all the observations that actually belong to a given class, the fraction the model identifies correctly. It is calculated as the number of true positives divided by the total number of true positives and false negatives. The result is a value between 0.0 for no recall and 1.0 for full or perfect recall. The formula is : <br>
<b> *Recall = True Positives / (True Positives + False Negatives)* </b> <br> <br>
<b> F1 score : </b> F1 score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account. The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, if either the precision or the recall is zero. The formula is : <br>
<b> *F1 score = 2\*((precision\*recall)/(precision+recall))* </b> <br> <br>
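
To make the formulas above concrete, here is a minimal sketch that derives precision, recall and F1 for one class directly from a confusion matrix. The labels in `y_true` and `y_pred` are a tiny hypothetical example, not taken from the loan data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Tiny hypothetical example (not taken from the loan data)
y_true = ["High", "High", "High", "Low", "Low", "Low", "Low", "Medium"]
y_pred = ["High", "Low", "Low", "Low", "Low", "Low", "High", "Low"]
labels = ["High", "Low", "Medium"]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)

# One-vs-rest counts for the "High" class
k = labels.index("High")
tp = cm[k, k]                    # actual High, predicted High
fp = cm[:, k].sum() - tp         # predicted High, actually something else
fn = cm[k, :].sum() - tp         # actual High, predicted something else
tn = cm.sum() - tp - fp - fn     # everything else

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Multiclass accuracy: correctly classified rows over all rows
accuracy = np.trace(cm) / cm.sum()

# Cross-check against scikit-learn's per-class scores
sk_precision = precision_score(y_true, y_pred, labels=["High"], average=None)[0]
sk_recall = recall_score(y_true, y_pred, labels=["High"], average=None)[0]
print(precision, recall, f1, accuracy)
```

The per-class rows of `classification_report` are computed the same way, one class at a time.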

## <b>Analysing the probabilities and classification values </b>

In [38]:
# Adding the following variables to the test dataset

# Compute the class probabilities once and reuse them below
proba_oneVsrest = oneVsrest.predict_proba(X_test_scaled)

# Scaled feature array
X_test["Scaled_features"] = X_test_scaled.tolist()

# Actual target variable
X_test["Actual"] = y_test

# OnevsRest target prediction
X_test["Prediction_oneVsrest"] = prediction_oneVsrest

# OnevsRest probability prediction (one list of three probabilities per row)
X_test["Prob_pred_oneVsrest"] = proba_oneVsrest.tolist()

# OnevsRest individual class probabilities; columns follow the order of oneVsrest.classes_
X_test['prob_oneVsrest_highRisk'] = proba_oneVsrest[:, 0]
X_test['prob_oneVsrest_lowRisk'] = proba_oneVsrest[:, 1]
X_test['prob_oneVsrest_mediumRisk'] = proba_oneVsrest[:, 2]
X_test.head()

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,fund_perc,incToloan_perc,Scaled_features,Actual,Prediction_oneVsrest,Prob_pred_oneVsrest,prob_oneVsrest_highRisk,prob_oneVsrest_lowRisk,prob_oneVsrest_mediumRisk
13544,24000,11.99,533.75,10,84996.0,1.0,3.5415,"[1.6976067970323159, -0.017103257487488508, 0....",Low Risk,Low Risk,"[0.145407048287554, 0.6705568181704201, 0.1840...",0.145407,0.670557,0.184036
20268,12500,13.43,423.77,1,58000.0,1.0,4.64,"[0.15662313180383905, 0.37089505172108733, 0.4...",Low Risk,Low Risk,"[0.13443332466561403, 0.8637730013278444, 0.00...",0.134433,0.863773,0.001794
35271,1700,13.16,57.41,1,14000.0,1.0,8.235294,"[-1.2905615277150784, 0.29814536874447944, -1....",Low Risk,Low Risk,"[0.1752122244548154, 0.8066509312052925, 0.018...",0.175212,0.806651,0.018137
29133,10000,12.73,335.67,10,60000.0,1.0,6.0,"[-0.1783733171588733, 0.1822847625224742, 0.03...",High Risk,Low Risk,"[0.13835020398997258, 0.8570853687482336, 0.00...",0.13835,0.857085,0.004564
2974,1300,7.9,40.68,1,41000.0,1.0,31.538462,"[-1.3441609595491122, -1.1191262329479577, -1....",Low Risk,Low Risk,"[0.0718346970559761, 0.925985425818249, 0.0021...",0.071835,0.925985,0.00218


## <b>Display the coefficient and intercept values for each Logistic Regression model </b>

In [40]:
# Classes for which individual models are created
print(oneVsrest.classes_)

# Recent scikit-learn releases no longer expose coef_ / intercept_ on
# OneVsRestClassifier, so stack the values from the fitted binary models
coeff_array = np.vstack([est.coef_ for est in oneVsrest.estimators_])
intercept_array = np.array([est.intercept_ for est in oneVsrest.estimators_])

# Shape of the coefficient matrix: (number of classes, number of features)
print(coeff_array.shape)

# Intercept values for all the models created
print("\nIntercept values")
print(intercept_array)

# Coefficient values for all the models created
print("\nCoefficient values")
print(coeff_array)

['High Risk' 'Low Risk' 'Medium Risk']
(3, 7)

Intercept values
[[-1.98702862]
 [ 1.81436432]
 [-5.11248147]]

Coefficient values
[[ 0.39851727  0.59347116 -0.38860129  0.04331238 -0.39891493  0.02934282
  -0.01457014]
 [-1.16799074 -0.63873289  1.13309613 -0.07050531  0.30833649 -0.21800131
   0.09024585]
 [ 4.9602886   0.96920326 -5.59110894  0.17542626  0.1334617   1.39128615
  -0.64155322]]
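
The raw matrix is easier to read once each row is labelled with its class and each column with its feature. The sketch below does that with the values printed above. The feature order is assumed to match the columns of `X` (as seen in the `X_test` preview), and `coef_table` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Class and feature order as shown in the outputs above
classes = ["High Risk", "Low Risk", "Medium Risk"]
features = ["loan_amnt", "int_rate", "installment", "emp_length",
            "annual_inc", "fund_perc", "incToloan_perc"]

# Coefficients copied from the printed matrix above
coefs = np.array([
    [0.39851727, 0.59347116, -0.38860129, 0.04331238, -0.39891493, 0.02934282, -0.01457014],
    [-1.16799074, -0.63873289, 1.13309613, -0.07050531, 0.30833649, -0.21800131, 0.09024585],
    [4.9602886, 0.96920326, -5.59110894, 0.17542626, 0.1334617, 1.39128615, -0.64155322],
])

coef_table = pd.DataFrame(coefs, index=classes, columns=features)
print(coef_table.round(3))

# Feature with the largest absolute coefficient in each one-vs-rest model
top_feature = coef_table.abs().idxmax(axis=1)
print(top_feature)
```

Because the inputs were standardized, coefficient magnitudes are roughly comparable across features. Correlated features (loan_amnt and installment, for instance) can still split weight between them, so individual values should be read with some caution.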


## <b>Analyse probability values for one test sample</b>

In [41]:
print(X_test.iloc[0]["Prob_pred_oneVsrest"])

[0.145407048287554, 0.6705568181704201, 0.18403613354202591]


## <b>Understand the mathematics and calculations inside the Model </b>

In [None]:
#The example below demonstrates how the prediction probability is calculated for one observation in the dataset
#The demonstration uses the coefficient values of each model for the calculation

# Choose the first observation
arr = X_test.iloc[0]
# Class calculates the log of odds value for a given class

# Calculates the probability values given the log of odds


#Non-normalized probability of all the classes


#Normalized probability of all the classes
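
One way the steps outlined above could be filled in is sketched below. It is self-contained and uses synthetic data (`X_demo`, `y_demo`) in place of the loan data, so the numbers will not match the notebook. The helper names `log_odds` and `sigmoid` are illustrative. The sketch assumes each binary model's probability is the sigmoid of its log-odds and that one-vs-rest divides by the row sum, then checks the result against `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class data standing in for the scaled loan features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, n_classes=3, random_state=0
)
ovr = OneVsRestClassifier(LogisticRegression()).fit(X_demo, y_demo)

# Choose the first observation
x = X_demo[0]

def log_odds(estimator, x):
    """Log-odds of 'this class vs. the rest' for one observation."""
    return estimator.intercept_[0] + np.dot(estimator.coef_[0], x)

def sigmoid(z):
    """Convert a log-odds value into a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Non-normalized probability from each binary model (need not sum to 1)
raw = np.array([sigmoid(log_odds(est, x)) for est in ovr.estimators_])

# Normalized probabilities: divide by the sum so they add up to 1
normalized = raw / raw.sum()

print(raw)
print(normalized)
print(ovr.predict_proba(X_demo[:1])[0])
```

In the notebook itself, the same calculation would use `oneVsrest.estimators_` and the `Scaled_features` entry of the chosen test row.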


## <b> Building Logistic Regression Model and using it in One vs One Classifier </b>

In [None]:
#Classification using OnevsOne method


# Fitting the model with training data

   
# Making a prediction on the test set
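
A minimal sketch of the one-vs-one steps listed above, again on synthetic data, not the loan dataset. Names such as `ovo` and `pred_ovo` are illustrative; in the notebook the same calls would take `X_train_scaled`, `y_train` and `X_test_scaled`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier

# Synthetic 3-class data standing in for the scaled loan features
X_demo, y_demo = make_classification(
    n_samples=600, n_features=7, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

# One-vs-one trains one binary model per pair of classes
ovo = OneVsOneClassifier(LogisticRegression())

# Fitting the model with training data
ovo.fit(X_tr, y_tr)

# Making a prediction on the test set
pred_ovo = ovo.predict(X_te)
print(f"Test set accuracy: {accuracy_score(y_te, pred_ovo):.3f}")
```

With three classes, one-vs-one fits three pairwise models, the same number as one-vs-rest. The counts diverge as the number of classes grows.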


## <b> Model Prediction </b>

In [None]:
# Evaluating the model


## <b>Analysing the probabilities and classification values </b>

In [None]:
# Adding the following variables to the test dataset

#OnevsOne target prediction


## <b>Display the parameters and coefficients for each Logistic Regression model </b>

In [None]:
#OneVsOne.classes_
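
For reference, the sketch below shows how the pairwise models of a fitted one-vs-one classifier could be inspected. It uses synthetic data with the notebook's class names mapped on for readability. The pairing order is assumed to follow the sorted class pairs, which is how the estimators are built internally.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier

X_demo, y_num = make_classification(
    n_samples=300, n_features=7, n_informative=4, n_classes=3, random_state=0
)
# Map numeric labels to the notebook's class names for readability
names = np.array(["High Risk", "Low Risk", "Medium Risk"])
y_demo = names[y_num]

ovo = OneVsOneClassifier(LogisticRegression()).fit(X_demo, y_demo)
print(ovo.classes_)

# Estimators follow the order of class pairs: (0, 1), (0, 2), (1, 2)
pairs = list(combinations(ovo.classes_, 2))
for (a, b), est in zip(pairs, ovo.estimators_):
    print(f"{a} vs {b}: intercept={est.intercept_[0]:.3f}, coef shape={est.coef_.shape}")

# OneVsOneClassifier offers decision_function but no predict_proba
has_proba = hasattr(ovo, "predict_proba")
print(has_proba)
```

Since there is no `predict_proba`, the per-class probability columns built for one-vs-rest have no direct one-vs-one equivalent. `decision_function` scores are the closest substitute.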