<a href="https://colab.research.google.com/github/TabithaWKariuki/Loan-Default-Prediction/blob/main/Task_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Defining the Question

### a) Specifying the Data Analytic Question

Use credit_risk_dataset_training.csv to train a model and
credit_risk_dataset_test.csv to predict the missing value (‘loan_status’).
Document the steps and methods used, and report the accuracy or other evaluation metric used to
assess the model.

### b) Defining the Metric for Success

1. Predict the outcome of a loan: is a customer likely to satisfy or default on the loan
obligations?

2. Document the steps and methods used, and report the accuracy or other evaluation metric used to
assess the model.


### c) Understanding the context

Credit default risk is the risk a lender takes on that a borrower will fail to make the required
loan payments. The main purpose of this analysis is to predict whether a new customer is likely to
be reliable. Such predictions can help the bank avoid defaults and increase its revenue, and can be
used to automate the approval and rejection of loan applications more accurately.

### d) Recording the Experimental Design

1. Obtaining the dataset for our study.
2. Importing all the necessary libraries for data analysis. 
3. Loading and viewing our data to better understand it.
4. Data modelling.
5. Model evaluation with regard to accuracy.
6. Model improvement and tuning.
7. Deployment using Streamlit.
8. Conclusion.

### e) Data Relevance

The datasets, credit_risk_dataset_training and credit_risk_dataset_test, have the appropriate columns to answer the question and are relevant for modeling.

## 2. Reading the Data

In [None]:
# Import libraries for the analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

# Set global parameters
%matplotlib inline
sns.set()
plt.rcParams['figure.figsize'] = (10.0, 8.0)
warnings.filterwarnings('ignore')

In [None]:
# Loading the cleaned training dataset produced in Task 1 (CSV)

train = pd.read_csv('new_training_data.csv')

In [None]:
# Loading the test dataset from the source CSV

testset = pd.read_csv('credit_risk_dataset_test.csv')

## 3. Understanding the Data

In [None]:
# Determining the no. of records in our dataset

train.shape
# The training dataset has 22765 rows (records) and 13 columns

(22765, 13)

In [None]:
# Determining the no. of records in our dataset

testset.shape
# The test dataset has 9731 rows (records) and 12 columns

(9731, 12)

In [None]:
# Previewing the top of our test dataset

testset.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,loan_status
0,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,0.57,N,3,
1,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,0.55,Y,4,
2,24,78956,RENT,5.0,MEDICAL,B,35000,11.11,0.44,N,4,
3,26,108160,RENT,4.0,EDUCATION,E,35000,18.39,0.32,N,4,
4,23,92111,RENT,7.0,MEDICAL,F,35000,20.25,0.32,N,4,


## 4. Modeling

**K Nearest Neighbours**

In [None]:
train.columns

Index(['Unnamed: 0', 'person_age', 'person_income', 'person_home_ownership',
       'person_emp_length', 'loan_intent', 'loan_grade', 'loan_amnt',
       'loan_int_rate', 'loan_status', 'loan_percent_income',
       'cb_person_default_on_file', 'cb_person_cred_hist_length'],
      dtype='object')

In [None]:
# Label encoder
# We know that machine learning models take only numbers as inputs and can not process strings. 
# So, we have to deal with the categories present in the dataset and convert them into numbers.

from sklearn.preprocessing import LabelEncoder
enc=LabelEncoder()

train['person_home_ownership']=enc.fit_transform(train['person_home_ownership'])
train['person_emp_length']=enc.fit_transform(train['person_emp_length'])
train['loan_intent']=enc.fit_transform(train['loan_intent'])
train['loan_grade']=enc.fit_transform(train['loan_grade'])
train['cb_person_default_on_file']=enc.fit_transform(train['cb_person_default_on_file'])
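As a hedged illustration of what `LabelEncoder` does in the cell above: it numbers the class labels in sorted order, so the resulting codes depend on which categories are present when the encoder is fitted. The sketch below uses small made-up frames (`toy_train`, `toy_test`) and a hypothetical `encoders` dict, none of which appear in the notebook, to show one way of keeping a fitted encoder per column so the same codes can be reused on new data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames standing in for the real train/test data (illustrative values only)
toy_train = pd.DataFrame({
    "person_home_ownership": ["RENT", "MORTGAGE", "OWN", "OTHER"],
    "cb_person_default_on_file": ["N", "Y", "N", "N"],
})
toy_test = pd.DataFrame({
    "person_home_ownership": ["MORTGAGE", "RENT"],
    "cb_person_default_on_file": ["N", "Y"],
})

# Fit one encoder per categorical column on the training data and keep it
encoders = {}
for col in list(toy_train.columns):
    encoders[col] = LabelEncoder().fit(toy_train[col])
    toy_train[col] = encoders[col].transform(toy_train[col])

# Reuse the fitted encoders so the test data gets the same integer codes
for col, enc in encoders.items():
    toy_test[col] = enc.transform(toy_test[col])

# Codes follow the sorted order of the class labels
home_codes = {
    str(label): code
    for code, label in enumerate(encoders["person_home_ownership"].classes_)
}
print(home_codes)
```

With all four ownership categories present, RENT receives code 3, which agrees with the encoded values shown in the notebook's previews.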

In [None]:
# Defining the x and y variables

X= train.drop(['loan_status'],axis=1).values
y= train['loan_status'].values

In [None]:
# Train Test Split
# To estimate how well the model generalises (and to detect over-fitting), we split the data 80/20
# into training and test sets. The model is fitted on the 80% and evaluated on the 20%,
# so it is tested on data it has not seen.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
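The split above uses no `random_state`, so the reported scores can change from run to run. A minimal sketch of a repeatable, class-balanced split follows. It runs on synthetic data; the names `X_demo` and `y_demo` and the seed 42 are assumptions for illustration, not values from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features/labels: 1000 rows, 22% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 220 + [0] * 780)

# random_state makes the split repeatable;
# stratify keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Stratifying may matter here because the loan data is imbalanced, with far fewer class 1 rows than class 0 rows.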

In [None]:
# Feature Scaling
# Before making any actual predictions, it is always a good practice to scale the features 
# so that all of them can be uniformly evaluated.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
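The scaling step learns its statistics from the training split only and then applies them to both splits. The sketch below mirrors that idea with the standard library on made-up numbers (`train_vals` and `test_vals` are illustrative). It uses the population standard deviation, which is what `StandardScaler` uses.

```python
from statistics import fmean, pstdev

# Made-up loan amounts standing in for one feature column
train_vals = [1000.0, 2500.0, 5500.0, 35000.0]
test_vals = [5500.0, 12000.0]

# "fit": learn the mean and population standard deviation from the training split only
mu = fmean(train_vals)
sigma = pstdev(train_vals)

def standardize(values):
    """Apply the training-set statistics to any split (the "transform" step)."""
    return [(v - mu) / sigma for v in values]

train_scaled = standardize(train_vals)
test_scaled = standardize(test_vals)

print(round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
print([round(v, 3) for v in test_scaled])
```

The scaled training column has mean 0 and standard deviation 1. The test column generally does not, because it is scaled with the training statistics.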

In [None]:
# Training and Predictions
# We import the KNeighborsClassifier class from sklearn.neighbors and initialise it with one
# parameter, n_neighbors, which is the value of K.
# There is no ideal value for K; it is selected after testing and evaluation.
# To start out, 5 is a commonly used value (it is also scikit-learn's default).

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

KNeighborsClassifier()
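Since K is chosen by testing and evaluation, one possible way to do that is a cross-validated grid search. This is a sketch on synthetic data, not the notebook's tuning procedure: the candidate grid, the `X_demo`/`y_demo` data and the pipeline are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the loan features
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Scaling inside the pipeline is refitted on each cross-validation fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Candidate values of K; the grid is an arbitrary choice for illustration
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9, 11]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print(best_k, round(search.best_score_, 3))
```

Putting the scaler in the pipeline keeps each validation fold out of the scaling statistics, matching the fit-on-train-only approach used earlier.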

In [None]:
# The final step is to make predictions on our test data

y_pred = classifier.predict(X_test)

In [None]:
# Evaluating the Algorithm
# For evaluating an algorithm, confusion matrix, precision, recall and f1 score are the most commonly used metrics. 
# The confusion_matrix and classification_report methods of the sklearn.metrics can be used to calculate these metrics. 

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[3396  177]
 [ 444  536]]
              precision    recall  f1-score   support

           0       0.88      0.95      0.92      3573
           1       0.75      0.55      0.63       980

    accuracy                           0.86      4553
   macro avg       0.82      0.75      0.77      4553
weighted avg       0.86      0.86      0.86      4553
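The class 1 figures in the report can be recomputed by hand from the confusion matrix, which makes clear what each metric measures. The sketch below uses only the four counts printed above. Rows are true classes and columns are predicted classes, following scikit-learn's convention.

```python
# Confusion matrix reported above: rows are true classes, columns are predicted classes
cm = [[3396, 177],
      [444, 536]]

tn, fp = cm[0]
fn, tp = cm[1]

precision_1 = tp / (tp + fp)   # of the rows predicted as class 1, the share that are class 1
recall_1 = tp / (tp + fn)      # of the true class 1 rows, the share that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f} accuracy={accuracy:.4f}")
# prints: precision=0.75 recall=0.55 f1=0.63 accuracy=0.8636
```

The recall of 0.55 means the KNN model misses a little under half of the class 1 rows, even though overall accuracy is about 86%.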



In [None]:
from sklearn.metrics import accuracy_score
print("Accuracy score for 80 20 split is: ", accuracy_score(y_test, y_pred)*100)

Accuracy score for 80 20 split is:  86.36064133538326


**Our KNN model had an accuracy score of 86.4%**
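Because the classes are imbalanced, it can help to compare this accuracy with a trivial baseline that always predicts the majority class. The sketch below uses the support counts from the classification report. The baseline comparison is an addition for context, not part of the original evaluation.

```python
# Class counts taken from the "support" column of the report above
support = {0: 3573, 1: 980}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = support[majority_class] / total

knn_accuracy = 0.8636  # accuracy reported for KNN above
lift = knn_accuracy - baseline_accuracy
print(f"baseline={baseline_accuracy:.3f} knn={knn_accuracy:.3f} lift={lift:.3f}")
```

On this split, always predicting class 0 would already score roughly 78%, so the KNN model's gain over the baseline is about eight percentage points.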

**Random Forest Classifier**

In [None]:
# Separating the dependent variable (loan_status) from the independent variables

X = train[['person_age', 'person_income', 'person_home_ownership', 'person_emp_length', 
           'loan_intent', 'loan_grade', 'loan_amnt',
           'loan_int_rate', 'loan_percent_income',
           'cb_person_default_on_file', 'cb_person_cred_hist_length']]
y = train.loan_status
X.shape, y.shape

((22765, 11), (22765,))

In [None]:
# split our dataset into a training and validation set, 
# so that we can train the model on the training set and
# evaluate its performance on the validation set

from sklearn.model_selection import train_test_split
x_train, x_cv, y_train, y_cv = train_test_split(X,y, test_size = 0.2, random_state = 10)

In [None]:
from sklearn.preprocessing import LabelEncoder
enc=LabelEncoder()

train['person_home_ownership']=enc.fit_transform(train['person_home_ownership'])

In [None]:
train.head()

Unnamed: 0.1,Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,0,22,59000,3,35,4,3,35000,16.02,1,0.59,1,3
1,1,21,9600,2,5,1,1,1000,11.14,0,0.1,0,2
2,2,23,65500,3,4,3,2,35000,15.23,1,0.53,0,2
3,3,21,9900,2,2,5,0,2500,7.14,1,0.25,0,2
4,4,26,77100,3,8,1,1,35000,12.42,1,0.45,0,3


In [None]:
# Training the random forest model using the training set

from sklearn.ensemble import RandomForestClassifier 
model = RandomForestClassifier(max_depth=4, random_state = 10) 
model.fit(x_train, y_train)

RandomForestClassifier(max_depth=4, random_state=10)

In [None]:
# Checking performance on the validation set (the held-out 20%)

from sklearn.metrics import accuracy_score
pred_cv = model.predict(x_cv)
accuracy_score(y_cv,pred_cv)

0.8980891719745223

**The validation set (20% of the training data) has an accuracy score of 89.8%.**

In [None]:
# Checking performance on the training set (the remaining 80%)

pred_train = model.predict(x_train)
accuracy_score(y_train,pred_train)

0.9041840544695805

**The training set (80% of the training data) has a 90.4% accuracy score.**
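A single 80/20 split gives one accuracy estimate. Cross-validation is one way to see how much that estimate varies across different held-out portions. The sketch below keeps the same hyperparameters as the model above but runs on synthetic data, so its scores say nothing about the loan dataset. `X_demo` and `y_demo` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the loan data (roughly 80/20 classes)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, weights=[0.8, 0.2], random_state=10
)

# Same hyperparameters as the model above
rf = RandomForestClassifier(max_depth=4, random_state=10)

# Five folds; each fold is held out once for scoring
scores = cross_val_score(rf, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.round(3), scores.mean().round(3))
```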

# Applying our model to the test dataset

In [None]:
# Storing the test set's column names in the variable columns

columns = list(testset.columns)

In [None]:
# Checking null values
# The model cannot make predictions on rows that contain null values

testset.isnull().sum()

person_age                       0
person_income                    0
person_home_ownership            0
person_emp_length              282
loan_intent                      0
loan_grade                       0
loan_amnt                        0
loan_int_rate                  969
loan_percent_income              0
cb_person_default_on_file        0
cb_person_cred_hist_length       0
loan_status                   9731
dtype: int64

In [None]:
# Replacing null values with 0

df1 = testset.fillna(0)
df1.isnull().sum()

person_age                    0
person_income                 0
person_home_ownership         0
person_emp_length             0
loan_intent                   0
loan_grade                    0
loan_amnt                     0
loan_int_rate                 0
loan_percent_income           0
cb_person_default_on_file     0
cb_person_cred_hist_length    0
loan_status                   0
dtype: int64
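Filling with 0 makes a missing interest rate look like a 0% loan and a missing employment length look like no work history. One alternative, sketched here on made-up frames rather than the real data, is to fill each column with the median learned from the training set. Whether this improves predictions on this dataset is not something the notebook tests.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the training and test data (values are illustrative)
toy_train = pd.DataFrame({
    "person_emp_length": [1.0, 4.0, 8.0, 5.0],
    "loan_int_rate": [12.87, 11.11, 18.39, 14.27],
})
toy_test = pd.DataFrame({
    "person_emp_length": [np.nan, 7.0],
    "loan_int_rate": [20.25, np.nan],
})

# Learn the fill values from the training data only
medians = toy_train[["person_emp_length", "loan_int_rate"]].median()

# Passing a Series to fillna fills each column with its own value
toy_test_filled = toy_test.fillna(medians)
print(toy_test_filled)
```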

In [None]:
# Label Encoding
# Changing categorical variables to numericals

from sklearn.preprocessing import LabelEncoder
enc=LabelEncoder()

df1['person_home_ownership']=enc.fit_transform(df1['person_home_ownership'])
df1['loan_intent']=enc.fit_transform(df1['loan_intent'])
df1['cb_person_default_on_file']=enc.fit_transform(df1['cb_person_default_on_file'])
df1['loan_grade']=enc.fit_transform(df1['loan_grade'])


In [None]:
# Passing our model through the test set and previewing


y_test_pred = model.predict(df1[columns[:11]])

y_test_pred

array([1, 1, 1, ..., 1, 0, 1])

In [None]:
# Filling loan_status with the model's predictions and previewing
# the result to check that the missing values have been replaced

df1['loan_status'] = model.predict(df1[columns[:11]])

df1.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,loan_status
0,25,9600,0,1.0,3,2,5500,12.87,0.57,0,3,1
1,24,54400,3,8.0,3,2,35000,14.27,0.55,1,4,1
2,24,78956,3,5.0,3,1,35000,11.11,0.44,0,4,1
3,26,108160,3,4.0,1,4,35000,18.39,0.32,0,4,1
4,23,92111,3,7.0,3,5,35000,20.25,0.32,0,4,1


# Deploying the Model

In [None]:
# Saving the model

import pickle 
pickle_out = open("classifier.pkl", mode = "wb") 
pickle.dump(model, pickle_out) 
pickle_out.close()
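A small variation on the saving step uses a `with` block, which closes the file even if `pickle.dump` raises, and then loads the file back to confirm the round trip. In this sketch a plain dict stands in for the trained model and the file goes to a temporary directory. Both are assumptions that keep the example dependency-free.

```python
import os
import pickle
import tempfile

# A plain dict stands in for the trained model so the sketch has no dependencies
stand_in_model = {"name": "RandomForestClassifier", "max_depth": 4, "random_state": 10}

path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")

# The with-block closes the file even if dump() raises
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Load it back and confirm the round trip preserved the object
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```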

In [None]:
# pyngrok is a python wrapper for ngrok which helps to open secure tunnels from public URLs to localhost. 
# This will help us to host our web app.
# Streamlit will be used to make our web app. 


!pip install -q pyngrok

!pip install -q streamlit

!pip install -q streamlit_ace


In [None]:
%%writefile app.py
# Saving the script as app.py (the %%writefile magic must be the first line of the cell)

# Importing the libraries

import pickle
import streamlit as st
 
# loading the trained model and saving it in the variable classifier

with open('classifier.pkl', 'rb') as pickle_in:
    classifier = pickle.load(pickle_in)

# defining the function which will make the prediction using the data which the user inputs
@st.cache()
def prediction(person_home_ownership, loan_intent, cb_person_default_on_file):
 
    # Pre-processing user input
    # NOTE: these 0/1 codes differ from the LabelEncoder codes used in training
    if person_home_ownership == "RENT":
        person_home_ownership = 0
    else:
        person_home_ownership = 1

    if loan_intent == "MEDICAL":
        loan_intent = 0
    else:
        loan_intent = 1

    if cb_person_default_on_file == "YES":
        cb_person_default_on_file = 0
    else:
        cb_person_default_on_file = 1


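    # Making predictions
    # NOTE: the pickled model was fitted on 11 features, so predict() raises a ValueError
    # for this 3-value row unless the model is retrained on these 3 columns only
    prediction = classifier.predict(
        [[person_home_ownership, loan_intent, cb_person_default_on_file]])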
     
    if prediction == 0:
        pred = 'Rejected'
    else:
        pred = 'Approved'
    return pred
      
  
# this is the main function in which we define our webpage  
def main():       
    # front end elements of the web page 
    html_temp = """ 
    <div style ="background-color:yellow;padding:13px"> 
    <h1 style ="color:black;text-align:center;">Loan Approval App</h1> 
    </div> 
    """
      
    # display the front end aspect
    st.markdown(html_temp, unsafe_allow_html = True) 
      
    # following lines create boxes in which user can enter data required to make prediction 
    HomeOwnership = st.selectbox('person_home_ownership',("RENT","MORTGAGE","OWN", "OTHER"))
    LoanIntent = st.selectbox('loan_intent',("MEDICAL","VENTURE", "EDUCATION", "DEBTCONSOLIDATION", "PERSONAL", "HOMEIMPROVEMENT")) 
    DefaultOnFile = st.selectbox('cb_person_default_on_file',("YES","NO"))
    result =""
      
    # when 'Predict' is clicked, make the prediction and display it
    if st.button("Predict"):
        result = prediction(HomeOwnership, LoanIntent, DefaultOnFile)
        st.success('Your loan is {}'.format(result))
     
if __name__=='__main__': 
    main()

Writing app.py
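The saved random forest was fitted on 11 columns, so the app would need to supply all 11 values in the training order. The following is a sketch of a helper that could do this, not the notebook's implementation: `FEATURE_ORDER`, `CATEGORY_CODES` and `build_feature_row` are hypothetical names. The codes assume `LabelEncoder`'s sorted ordering and that every listed category (including loan grades A to G) was present when the encoders were fitted. Numeric fields are passed through as raw values, as the notebook does for the test set.

```python
# Column order used when the random forest was trained
FEATURE_ORDER = [
    "person_age", "person_income", "person_home_ownership", "person_emp_length",
    "loan_intent", "loan_grade", "loan_amnt", "loan_int_rate",
    "loan_percent_income", "cb_person_default_on_file", "cb_person_cred_hist_length",
]

# Labels listed in sorted order, which is how LabelEncoder numbers classes.
# Assumption: every category listed here was present in the training data.
CATEGORY_CODES = {
    "person_home_ownership": ["MORTGAGE", "OTHER", "OWN", "RENT"],
    "loan_intent": ["DEBTCONSOLIDATION", "EDUCATION", "HOMEIMPROVEMENT",
                    "MEDICAL", "PERSONAL", "VENTURE"],
    "loan_grade": ["A", "B", "C", "D", "E", "F", "G"],
    "cb_person_default_on_file": ["N", "Y"],
}

def build_feature_row(inputs):
    """Turn a dict of raw form values into one model-ready row (hypothetical helper)."""
    row = []
    for name in FEATURE_ORDER:
        value = inputs[name]
        if name in CATEGORY_CODES:
            # list.index raises ValueError for labels that are not in the list
            value = CATEGORY_CODES[name].index(value)
        row.append(value)
    return [row]  # 2-D shape, as expected by predict()

# Raw values taken from the second row of the test-set preview
example = {
    "person_age": 24, "person_income": 54400, "person_home_ownership": "RENT",
    "person_emp_length": 8.0, "loan_intent": "MEDICAL", "loan_grade": "C",
    "loan_amnt": 35000, "loan_int_rate": 14.27, "loan_percent_income": 0.55,
    "cb_person_default_on_file": "Y", "cb_person_cred_hist_length": 4,
}
print(build_feature_row(example))
```

The printed row matches the encoded second row of the test-set preview (home ownership 3, intent 3, grade 2, default on file 1). In a Streamlit app, each key of `example` would come from its own input widget.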


In [None]:
!streamlit run app.py &>/dev/null&

In [None]:
from pyngrok import ngrok
 
public_url = ngrok.connect('8501')
public_url



<NgrokTunnel: "http://cafd-35-234-16-133.ngrok.io" -> "http://localhost:8501">

# Conclusions

Of the two models tested, the Random Forest Classifier performed best, with a validation accuracy of 89.8% (about 90%) compared with 86.4% for KNN.