<a href="https://colab.research.google.com/github/stogaja/Bank-Loan-Status-Classification/blob/main/Loan_Status_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LOAN STATUS CLASSIFICATION

## 1. Defining the Question

### a) Specifying the Question

We intend to build a classification model that assesses specifics of a client's information, for example current income, credit score, and the purpose of the loan, and uses machine learning algorithms to classify each loan application as either good or bad. This in turn helps the bank decide which loan applications to grant. We will therefore build several models and select the one that performs best.

### b) Defining the Metric for Success

Our study will be considered successful if we meet the objectives below.

**Main Objective**

To find the groups of people applying for loans in banks at an individual level by building an unsupervised clustering model. 

**Specific Objectives**

i. To determine the characteristics responsible for customer loan classification through feature selection.

ii. To determine the maximum loan limit for certain clients based on these features.

iii. To check for anomalies in the number of open accounts held by an individual.

iv. To predict whether a client is likely to pay off their loan or not.

v. To determine the most common purpose for loan applications.


### c) Understanding the Context

Loan classification, risk management, and provisioning processes are closely intertwined in a bank's operations. Loan pricing, the frequency and intensity of review and analysis, the rigor of monitoring, and the tolerance for loan losses are all determined by the characteristics of the various risk rating classes, and each should be proportional to the amount of risk indicated by a loan's assigned risk rating grade. Regulatory capital is essential for absorbing unforeseen losses. When loan classification systems are combined with management's ability to recognize negative trends, decision making improves through portfolio management and early reporting techniques.

A loan classification system is an important component of a bank's credit risk assessment and valuation process, as it classifies loans and groups of loans with comparable credit risk characteristics into risk categories. Underwriting and approval, monitoring and managing credit quality, early identification of adverse trends and potentially problem loans, loan loss provisioning, management reporting, and the determination of regulatory capital requirements are all areas where a loan classification system can be useful. Loan classification systems are recognized by both accounting frameworks and Basel II/III regulatory capital frameworks as suitable instruments for accurately assessing credit risk and establishing groupings of loans for collective evaluation for loan loss calculation.

### d) Recording the Experimental Design

### e) Data Relevance

We shall be using the datasets below:

i. Credit test dataset ( https://www.kaggle.com/code/sazack/loan-status-classification/data?select=credit_test.csv )

ii. Credit train dataset ( https://www.kaggle.com/code/sazack/loan-status-classification/data?select=credit_train.csv )

## 2. Reading the Data

In [None]:
# let's load the R extension into the environment
%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


In [57]:
%%R
#Installing the relevant R libraries

suppressWarnings(
        suppressMessages(if
                         (!require(e1071, quietly=TRUE))
                install.packages("e1071")))
library(e1071)

suppressWarnings(
        suppressMessages(if
                         (!require(factoextra, quietly=TRUE))
                install.packages("factoextra")))
library(factoextra)

suppressWarnings(
        suppressMessages(if
                         (!require(devtools, quietly=TRUE))
                install.packages("devtools")))
library(devtools)

suppressWarnings(
        suppressMessages(if
                         (!require(Rtsne, quietly=TRUE))
                install.packages("Rtsne")))
library(Rtsne)

suppressWarnings(
        suppressMessages(if
                         (!require(VIM, quietly=TRUE))
                install.packages("VIM")))
library(VIM)

suppressWarnings(
        suppressMessages(if
                         (!require(CatEncoders, quietly=TRUE))
                install.packages("CatEncoders")))
library(CatEncoders)

#Installing and loading our caret package

suppressWarnings(
        suppressMessages(if
                         (!require(caret, quietly=TRUE))
                install.packages("caret")))
library(caret)

#Installing and loading the corrplot package for plotting

suppressWarnings(
        suppressMessages(if
                         (!require(corrplot, quietly=TRUE))
                install.packages("corrplot")))
library(corrplot)

suppressWarnings(
        suppressMessages(if
                         (!require(tibble, quietly=TRUE))
                install.packages("tibble")))
library(tibble)

suppressWarnings(
        suppressMessages(if
                         (!require(data.table, quietly=TRUE))
                install.packages("data.table")))
library(data.table)

suppressWarnings(
        suppressMessages(if
                         (!require(dplyr, quietly=TRUE))
                install.packages("dplyr")))
library(dplyr)

suppressWarnings(
        suppressMessages(if
                         (!require(readr, quietly=TRUE))
                install.packages("readr")))
library(readr)

suppressWarnings(
        suppressMessages(if
                         (!require(magrittr, quietly=TRUE))
                install.packages("magrittr")))
library(magrittr)

suppressWarnings(
        suppressMessages(if
                         (!require(knitr, quietly=TRUE))
                install.packages("knitr")))
library(knitr)

suppressWarnings(
        suppressMessages(if
                         (!require(tidyverse, quietly=TRUE))
                install.packages("tidyverse")))
library(tidyverse)

suppressWarnings(
        suppressMessages(if
                         (!require(Amelia, quietly=TRUE))
                install.packages("Amelia")))
library(Amelia)

suppressWarnings(
        suppressMessages(if
                         (!require(Rcpp, quietly=TRUE))
                install.packages("Rcpp")))
library(Rcpp)
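
The install-and-load pattern above repeats the same three steps for every package. As a minimal sketch, the same behaviour could be wrapped in a loop. The helper name `load_or_install` is hypothetical and not part of the original notebook, and the example call uses only base packages so that nothing is downloaded.

```r
# Hypothetical helper: load each package, installing it first only if it is missing.
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    # require() returns FALSE instead of raising an error when a package is absent
    available <- suppressWarnings(suppressMessages(
      require(pkg, character.only = TRUE, quietly = TRUE)
    ))
    if (!available) {
      install.packages(pkg)
      library(pkg, character.only = TRUE)
    }
  }
  invisible(pkgs)
}

# Example call with base packages that ship with every R installation
loaded <- load_or_install(c("stats", "utils"))
```

In the notebook, the vector passed to the helper would list the packages loaded above, for example `c("e1071", "factoextra", "caret")`.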


In [None]:
%%R
# let's read the dataset
loan <- read.csv("credit_train.csv", sep = ",")

In [None]:
%%R
# let's preview the dataset
head(loan)

                               Loan.ID                          Customer.ID
1 14dd8831-6af5-400b-83ec-68e61888a048 981165ec-3274-42f5-a3b4-d104041a9ca9
2 4771cc26-131a-45db-b5aa-537ea4ba5342 2de017a3-2e01-49cb-a581-08169e83be29
3 4eed4e6a-aa2f-4c91-8651-ce984ee8fb26 5efb2b2b-bf11-4dfd-a572-3761a2694725
4 77598f7b-32e7-4e3b-a6e5-06ba0d98fe8a e777faab-98ae-45af-9a86-7ce5b33b1011
5 d4062e70-befa-4995-8643-a0de73938182 81536ad9-5ccf-4eb8-befb-47a4d608658e
6 89d8cb0c-e5c2-4f54-b056-48a645c543dd 4ffe99d3-7f2a-44db-afc1-40943f1f9750
  Loan.Status Current.Loan.Amount       Term Credit.Score Annual.Income
1  Fully Paid              445412 Short Term          709       1167493
2  Fully Paid              262328 Short Term           NA            NA
3  Fully Paid            99999999 Short Term          741       2231892
4  Fully Paid              347666  Long Term          721        806949
5  Fully Paid              176220 Short Term           NA            NA
6 Charged Off              206602 Sh

In [None]:
%%R
# let's preview the bottom 6 rows
tail(loan)

       Loan.ID Customer.ID Loan.Status Current.Loan.Amount Term Credit.Score
100509                                                  NA                NA
100510                                                  NA                NA
100511                                                  NA                NA
100512                                                  NA                NA
100513                                                  NA                NA
100514                                                  NA                NA
       Annual.Income Years.in.current.job Home.Ownership Purpose Monthly.Debt
100509            NA                                                       NA
100510            NA                                                       NA
100511            NA                                                       NA
100512            NA                                                       NA
100513            NA                                                   

The last six rows are entirely empty, so the dataset contains plenty of null values.
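
The `tail(loan)` output shows rows whose character fields are blank and whose numeric fields are `NA`. The sketch below shows one way such fully empty rows might be detected and dropped. The `toy` data frame and the `is_blank` helper are hypothetical stand-ins, not part of the original notebook; the same steps would apply to `loan` inside a `%%R` cell.

```r
# Toy stand-in for `loan`: the last two rows are completely empty
toy <- data.frame(
  loan.status   = c("Fully Paid", "Charged Off", "", ""),
  credit.score  = c(709L, NA, NA, NA),
  annual.income = c(1167493L, 806949L, NA, NA),
  stringsAsFactors = FALSE
)

# A cell counts as blank if it is NA or, for text columns, an empty string
is_blank <- function(x) {
  if (is.character(x)) is.na(x) | trimws(x) == "" else is.na(x)
}

# Logical matrix with one column per variable, TRUE where the cell is blank
blank_matrix <- do.call(cbind, lapply(toy, is_blank))

# A row is empty when every one of its cells is blank
empty_rows <- rowSums(blank_matrix) == ncol(toy)
toy_clean  <- toy[!empty_rows, , drop = FALSE]

cat("Empty rows dropped:", sum(empty_rows), "\n")
```

Rows that are only partly missing, such as the second row of `toy`, are kept so that they can be imputed later.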

In [None]:
%%R
# let's check the shape of the data
cat('Number of rows = ', nrow(loan), 'and number of columns = ', ncol(loan), '.')

Number of rows =  100514 and number of columns =  19 .

In [None]:
%%R
# let's get the data types of our columns
sapply(loan, class)

                     Loan.ID                  Customer.ID 
                 "character"                  "character" 
                 Loan.Status          Current.Loan.Amount 
                 "character"                    "integer" 
                        Term                 Credit.Score 
                 "character"                    "integer" 
               Annual.Income         Years.in.current.job 
                   "integer"                  "character" 
              Home.Ownership                      Purpose 
                 "character"                  "character" 
                Monthly.Debt      Years.of.Credit.History 
                   "numeric"                    "numeric" 
Months.since.last.delinquent      Number.of.Open.Accounts 
                   "integer"                    "integer" 
   Number.of.Credit.Problems       Current.Credit.Balance 
                   "integer"                    "integer" 
         Maximum.Open.Credit                 Bankruptcie

In [None]:
%%R
# let's check the structure of our dataset
str(loan)

'data.frame':	100514 obs. of  19 variables:
 $ Loan.ID                     : chr  "14dd8831-6af5-400b-83ec-68e61888a048" "4771cc26-131a-45db-b5aa-537ea4ba5342" "4eed4e6a-aa2f-4c91-8651-ce984ee8fb26" "77598f7b-32e7-4e3b-a6e5-06ba0d98fe8a" ...
 $ Customer.ID                 : chr  "981165ec-3274-42f5-a3b4-d104041a9ca9" "2de017a3-2e01-49cb-a581-08169e83be29" "5efb2b2b-bf11-4dfd-a572-3761a2694725" "e777faab-98ae-45af-9a86-7ce5b33b1011" ...
 $ Loan.Status                 : chr  "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
 $ Current.Loan.Amount         : int  445412 262328 99999999 347666 176220 206602 217646 648714 548746 215952 ...
 $ Term                        : chr  "Short Term" "Short Term" "Short Term" "Long Term" ...
 $ Credit.Score                : int  709 NA 741 721 NA 7290 730 NA 678 739 ...
 $ Annual.Income               : int  1167493 NA 2231892 806949 NA 896857 1184194 NA 2559110 1454735 ...
 $ Years.in.current.job        : chr  "8 years" "10+ years" "8 years" "3 y

In [None]:
%%R
# let's count the number of unique values in each column
loan %>% summarise_all(n_distinct)

  Loan.ID Customer.ID Loan.Status Current.Loan.Amount Term Credit.Score
1   82000       82000           3               22005    3          325
  Annual.Income Years.in.current.job Home.Ownership Purpose Monthly.Debt
1         36175                   13              5      17        65766
  Years.of.Credit.History Months.since.last.delinquent Number.of.Open.Accounts
1                     507                          117                      52
  Number.of.Credit.Problems Current.Credit.Balance Maximum.Open.Credit
1                        15                  32731               44597
  Bankruptcies Tax.Liens
1            9        13
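
The output reports 82,000 distinct `Loan.ID` values against 100,514 rows, which suggests that some loans appear more than once. The sketch below shows how repeated records might be counted and removed with base R's `duplicated()`. The `toy` data frame is a hypothetical stand-in for `loan`, and whether duplicates should be dropped at all is an assumption the notebook does not confirm.

```r
# Toy stand-in for `loan` with repeated loan IDs
toy <- data.frame(
  loan.id      = c("a1", "a1", "b2", "c3", "c3"),
  loan.status  = c("Fully Paid", "Fully Paid", "Charged Off", "Fully Paid", "Fully Paid"),
  credit.score = c(709L, 709L, 721L, NA, 741L),
  stringsAsFactors = FALSE
)

# Rows that match an earlier row in every column
exact_dupes <- sum(duplicated(toy))

# Rows whose loan ID has already appeared, even if other fields differ
id_dupes <- sum(duplicated(toy$loan.id))

# Keep the first occurrence of each fully identical row
toy_unique <- toy[!duplicated(toy), , drop = FALSE]

cat("Exact duplicate rows:", exact_dupes, "\n")
cat("Repeated loan IDs:", id_dupes, "\n")
```

Rows that share an ID but differ in other fields (here `c3`) survive the exact-duplicate filter and would need a separate rule, for example keeping the record with the fewest missing values.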


## 3. External Data Validation



The data was provided by the bank and is based on previous, related banking data, so there is no need for external validation.

## 4. Data Preparation

### a) Uniformity

In [None]:
%%R
colnames(loan)

 [1] "Loan.ID"                      "Customer.ID"                 
 [3] "Loan.Status"                  "Current.Loan.Amount"         
 [5] "Term"                         "Credit.Score"                
 [7] "Annual.Income"                "Years.in.current.job"        
 [9] "Home.Ownership"               "Purpose"                     
[11] "Monthly.Debt"                 "Years.of.Credit.History"     
[13] "Months.since.last.delinquent" "Number.of.Open.Accounts"     
[15] "Number.of.Credit.Problems"    "Current.Credit.Balance"      
[17] "Maximum.Open.Credit"          "Bankruptcies"                
[19] "Tax.Liens"                   


In [None]:
%%R
colnames(loan) <- tolower(colnames(loan))
colnames(loan)

 [1] "loan.id"                      "customer.id"                 
 [3] "loan.status"                  "current.loan.amount"         
 [5] "term"                         "credit.score"                
 [7] "annual.income"                "years.in.current.job"        
 [9] "home.ownership"               "purpose"                     
[11] "monthly.debt"                 "years.of.credit.history"     
[13] "months.since.last.delinquent" "number.of.open.accounts"     
[15] "number.of.credit.problems"    "current.credit.balance"      
[17] "maximum.open.credit"          "bankruptcies"                
[19] "tax.liens"                   


The column names contained capital letters, so we converted them to lowercase for easy reference.

### b) Completeness

In [None]:
%%R
# let's check for missing data
missing.values <- sum(is.na(loan))
cat(missing.values)

97833

The data has a lot of missing values: 97,833 in total.
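
The single total hides which columns are affected. With the default `read.csv()` settings, blank text fields are read as empty strings rather than `NA`, so `is.na()` does not count them. The sketch below breaks the count down per column and reports both kinds of gap. The `toy` data frame and the `count_blank` helper are hypothetical stand-ins, not part of the original notebook.

```r
# Toy stand-in for `loan` with NA values and empty strings
toy <- data.frame(
  term         = c("Short Term", "", "Long Term", ""),
  credit.score = c(709L, NA, 721L, NA),
  monthly.debt = c(5214.74, 33295.98, NA, NA),
  stringsAsFactors = FALSE
)

# NA values per column (what sum(is.na(...)) adds up)
na_per_col <- colSums(is.na(toy))

# Empty strings per column; only text columns can contain them
count_blank <- function(x) {
  if (is.character(x)) sum(!is.na(x) & trimws(x) == "") else 0L
}
blank_per_col <- sapply(toy, count_blank)

# One row per column with both counts and the combined share of missing cells
missing_summary <- data.frame(
  column      = names(toy),
  n_na        = as.integer(na_per_col),
  n_blank     = as.integer(blank_per_col),
  pct_missing = unname(round(100 * (na_per_col + blank_per_col) / nrow(toy), 1))
)

print(missing_summary)
```

Passing `na.strings = c("", "NA")` to `read.csv()` is an alternative that turns blank text fields into `NA` when the file is read.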

In [55]:
%%R
# let's select the categorical columns by their class
cat_cols <- loan[, colnames(loan)[grepl('factor|logical|character', sapply(loan, class))]]

In [53]:
%%R
# let's store the numerical and categorical columns in separate variables
num_cols <- select_if(loan, is.numeric)
# the categorical columns are all the columns that are not numeric
cat_cols <- select(loan, -all_of(names(num_cols)))

In [None]:
%%R
Impute_data <- Amelia(loan, m = 10, maxit = 10, norms = )