# Loan Default Prediction

Financial inclusion has expanded access to banking services. The main function of banks is to lend money to borrowers using money saved by depositors. 

When deciding whom to lend to, banks assess borrowers on their traits and on how they handled previous loans. This type of assessment is captured in the 5 C's of credit.

In [1]:
import pandas as pd

data = pd.read_csv(r'C:\Users\user\Documents\DATA ANALYSIS FILES\CSV FILE\Default_Fin.csv')

In [2]:
#Let's see what we have in the data
data.head(2)

   Index  Employed  Bank Balance  Annual Salary  Defaulted?
0      1         1       8754.36      532339.56           0
1      2         0       9806.16      145273.56           0


In [3]:
#we don't need the index loaded with the csv
data.drop('Index',axis=1,inplace=True)
data.head(2)

   Employed  Bank Balance  Annual Salary  Defaulted?
0         1       8754.36      532339.56           0
1         0       9806.16      145273.56           0


In [4]:
data.isna().sum()

Employed         0
Bank Balance     0
Annual Salary    0
Defaulted?       0
dtype: int64

## Select Target

The target is the column you want to predict.


In [5]:
#target:

y = data['Defaulted?']

In [6]:
y.head(2)

0    0
1    0
Name: Defaulted?, dtype: int64
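Before modelling, it can help to check how many rows fall in each class of the target, because a rare positive class changes how accuracy should be read. A minimal sketch, assuming a small made-up series `y_demo` in place of `data['Defaulted?']`:

```python
import pandas as pd

# Toy target column standing in for data['Defaulted?'] (values are made up)
y_demo = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 0], name='Defaulted?')

# Number of rows in each class
class_counts = y_demo.value_counts()

# Share of rows in each class (fractions that sum to 1)
class_share = y_demo.value_counts(normalize=True)

print(class_counts)
print(class_share)
```

On the real data, the same two calls on `y` would show how rare defaults are.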

## Select Features


In [7]:
#view the list of all columns
data.columns

Index(['Employed', 'Bank Balance', 'Annual Salary', 'Defaulted?'], dtype='object')

In [8]:
#one way to select all columns except one of them is to simply drop the one you don't want
X = data.drop('Defaulted?', axis=1)

In [9]:
X.head(2)

   Employed  Bank Balance  Annual Salary
0         1       8754.36      532339.56
1         0       9806.16      145273.56


## Create train and test datasets

Split your data set into a train and test set. Your test set will be used to evaluate the trained model.
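As a rough illustration of what a 70/30 split produces, the sketch below splits a made-up array. The names `X_demo` and `y_demo` are hypothetical, and `stratify=y_demo` is an optional extra (not used in this notebook's own split) that keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, 10% positive labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 10 + [0] * 90)

# 30% of the rows go to the test set; stratify keeps 10% positives in each part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.sum(), y_te.sum())
```

Fixing `random_state` makes the split reproducible from run to run.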

In [10]:
#use train_test_split to divide the dataset into train and test datasets
from sklearn.model_selection import train_test_split

#### create X_train, X_test, y_train, y_test using test_size of 30%


In [11]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=101)

In [12]:
print('\n',X_train.head(2))
print('\n',X_test.head(2))
print('\n',y_train.head(2))
print('\n',y_test.head(2))


       Employed  Bank Balance  Annual Salary
803          1      22831.32      640728.96
1387         1       7977.84      394737.60

       Employed  Bank Balance  Annual Salary
6676         1       2139.36      427391.16
6421         1      21666.72      481437.48

 803     1
1387    0
Name: Defaulted?, dtype: int64

 6676    0
6421    1
Name: Defaulted?, dtype: int64


## Train Your Classification Model With A Decision Tree Classifier

In [13]:
#import the decision tree classifier algorithm
from sklearn.tree import DecisionTreeClassifier

Create the model


In [14]:
#your code here
loan_model = DecisionTreeClassifier(random_state=30)

Fit the model

In [15]:
#Your code here

loan_model.fit(X_train,y_train)

DecisionTreeClassifier(random_state=30)

Create your predictions

In [16]:
#your code here
loan_DT_preds = loan_model.predict(X_test)

## Score Your Decision Tree Model

In [17]:
from sklearn.metrics import accuracy_score

In [18]:
print(accuracy_score(y_test, loan_DT_preds))

0.9586666666666667


### What is the accuracy score of your decision tree model?

The accuracy score is about 96% (0.9587).

## Train Your Classification Model With A Random Forest Classifier

In [19]:
#import the random forest classifier 
from sklearn.ensemble import RandomForestClassifier

In [20]:
#TODO: create the model
RF_loan_model =  RandomForestClassifier(random_state=30)


In [21]:
#fit the model
RF_loan_model.fit(X_train, y_train)

RandomForestClassifier(random_state=30)

In [22]:
#create your predictions
RF_loan_preds = RF_loan_model.predict(X_test)

## Score Your Random Forest Model

In [23]:
#import the accuracy_score metric
from sklearn.metrics import accuracy_score

In [24]:
print(accuracy_score(y_test, RF_loan_preds))

0.971


### What is the accuracy score of your random forest model?

In [25]:
#your answer here
#The accuracy score is about 97% (0.971)

* accuracy
* precision
* recall
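
The three metrics listed above can be worked out by hand from the counts of true and false positives and negatives. A minimal sketch on made-up labels, where 1 stands for a default:

```python
# Toy labels (made up): 1 = defaulted, 0 = did not default
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # defaults correctly flagged
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # good loans correctly passed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # good loans wrongly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # defaults that were missed

accuracy = (tp + tn) / len(y_true)  # share of all predictions that are right
precision = tp / (tp + fp)          # share of flagged loans that really defaulted
recall = tp / (tp + fn)             # share of real defaults that were flagged

print(accuracy, precision, recall)  # 0.6 0.5 0.25
```

In this toy case accuracy looks moderate while recall is low, because three of the four defaults are missed.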

### What are some advantages of a Random Forest over a Decision Tree?

In [26]:
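#your answer here
#Error reduction: averaging many trees lowers variance, so the forest overfits less than a single tree
#More precise results: here accuracy rose from about 96% (decision tree) to about 97% (random forest)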
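
One way to check whether a forest generalises better than a single tree is to compare cross-validated accuracy. The sketch below uses synthetic data from `make_classification` as a stand-in (it is not the loan dataset), and the sample size, `n_estimators=50` and `cv=5` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the loan dataset
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, random_state=0
)

# 5-fold cross-validated accuracy for a single tree and for a forest
tree_scores = cross_val_score(
    DecisionTreeClassifier(random_state=30), X_demo, y_demo, cv=5
)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=30), X_demo, y_demo, cv=5
)

print("tree   mean accuracy:", tree_scores.mean())
print("forest mean accuracy:", forest_scores.mean())
```

Comparing the means (and the spread across folds) gives a fairer picture than a single train/test split.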

### Business understanding: what other features could be added to this dataset to cover the 5 C's of credit?

In [27]:
#1.Capital

In [28]:
#2.Collateral

In [29]:
#3. Character

In [30]:
#4. Conditions

In [31]:
#5. Capacity

**With the loan default data set**

Train 2 additional models apart from decision trees and random forest which you have already used.

Some options:
1. Logistic Regression
2. KNeighbors
3. Support Vector Machines
etc

Compare all 4 models that you have now trained.

1. Which one gave the best accuracy?
2. Which one gave the best recall?
3. For a loan default prediction, should we be more worried about lending to bad customers or rejecting good customers?
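
A possible shape for the comparison is sketched below with two of the suggested options, KNeighbors and a support vector classifier. It runs on synthetic, imbalanced data from `make_classification` rather than the loan dataset, and the names `candidates` and `results` are hypothetical. Each model is wrapped in a pipeline with `StandardScaler`, since both are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (roughly 10% positives); not the loan dataset
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)

# Scaling and the model are fitted together, on the training rows only
candidates = {
    "KNeighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVC": make_pipeline(StandardScaler(), SVC(random_state=30)),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    preds = model.predict(X_te)
    # Metric functions take the true labels first, then the predictions
    results[name] = {
        "accuracy": accuracy_score(y_te, preds),
        "recall": recall_score(y_te, preds),
    }

for name, scores in results.items():
    print(name, scores)
```

Collecting accuracy and recall side by side makes it easier to answer the first two questions for all four models.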

In [32]:
#logistic regression trains more reliably on scaled features (its solver converges faster and regularization treats features evenly)
#import StandardScaler
from sklearn.preprocessing import StandardScaler

## What is the importance of scaling data?

In [33]:
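#your answer here:
# scaling puts features on a comparable range (here Annual Salary is far larger than Bank Balance),
# so no feature dominates just because of its units and gradient-based solvers converge faster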
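
To make the effect concrete, the sketch below standardises a small made-up array whose two columns loosely mimic a balance and a salary. The arrays `train_demo` and `test_demo` are hypothetical. After scaling, each training column has mean 0 and standard deviation 1, and the test row is transformed with the statistics learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: column 0 is on a much smaller scale than column 1
train_demo = np.array([
    [8000.0, 500000.0],
    [10000.0, 150000.0],
    [22000.0, 640000.0],
    [2000.0, 430000.0],
])
test_demo = np.array([[12000.0, 400000.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learn mean/std, then scale
test_scaled = scaler_demo.transform(test_demo)        # reuse the training mean/std

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```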


In [34]:
#data scaling steps for train data set: fit the scaler on the train set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_train = pd.DataFrame(X_train_scaled, index=X_train.index, columns=X_train.columns)

In [35]:
#data scaling steps for test dataset: reuse the scaler fitted on the train set (transform only)
X_test_scaled = scaler.transform(X_test)
X_test = pd.DataFrame(X_test_scaled, index=X_test.index, columns=X_test.columns)

In [36]:
from sklearn.linear_model import LogisticRegression

In [37]:
#initialize the logistic regression model
LR = LogisticRegression()

In [38]:
#train the model
LR.fit(X_train,y_train)

LogisticRegression()

In [39]:
#predict and store your predictions to a variable
LR_preds = LR.predict(X_test)

In [40]:
#import classification report so we view multiple classification metrics
from sklearn.metrics import classification_report

In [41]:
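#print classification report
#note: classification_report expects (y_true, y_pred); predictions are passed first here, so the
#precision and recall columns in the output are swapped and support counts predicted labels
print(classification_report(LR_preds, y_test))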

              precision    recall  f1-score   support

           0       1.00      0.98      0.99      2970
           1       0.27      0.83      0.40        30

    accuracy                           0.98      3000
   macro avg       0.63      0.91      0.70      3000
weighted avg       0.99      0.98      0.98      3000
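
Scikit-learn's metric functions take the true labels as the first argument and the predictions as the second. The toy labels below (made up for illustration) show that reversing the two swaps precision and recall, which matters when reading a report like the one above.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels (made up): 4 actual defaulters, the model flags only 1 of them
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Conventional order: true labels first, predictions second
correct_precision = precision_score(y_true, y_pred)
correct_recall = recall_score(y_true, y_pred)

# Reversed order: the two metrics trade places
swapped_precision = precision_score(y_pred, y_true)
swapped_recall = recall_score(y_pred, y_true)

print(correct_precision, correct_recall)  # 1.0 0.25
print(swapped_precision, swapped_recall)  # 0.25 1.0
```

For a default model, recall on class 1 (the share of actual defaulters that are caught) is the figure that is most affected by this ordering.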

