<a href="https://colab.research.google.com/github/ChuquEmeka/C5.0-Bank-Loan-Analysis/blob/main/c5_0_bank_loan_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Identifying Risky Bank Loans Using the C5.0 Decision Tree Algorithm

### Presented by Edeh Emeka N.

In [1]:
library("readr")

## **DATA EXPLORATION**

In [2]:
credit <- read.csv("credit.csv")
str(credit)

'data.frame':	1000 obs. of  17 variables:
 $ checking_balance    : chr  "< 0 DM" "1 - 200 DM" "unknown" "< 0 DM" ...
 $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
 $ credit_history      : chr  "critical" "good" "critical" "good" ...
 $ purpose             : chr  "furniture/appliances" "furniture/appliances" "education" "furniture/appliances" ...
 $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ savings_balance     : chr  "unknown" "< 100 DM" "< 100 DM" "< 100 DM" ...
 $ employment_duration : chr  "> 7 years" "1 - 4 years" "4 - 7 years" "4 - 7 years" ...
 $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
 $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
 $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
 $ other_credit        : chr  "none" "none" "none" "none" ...
 $ housing             : chr  "own" "own" "own" "other" ...
 $ existing_loans_count: int  2 1 1 1 2 1 1 1 1 2 ...
 $ job                 : chr  "skilled

In [3]:
summary(credit)

 checking_balance   months_loan_duration credit_history       purpose         
 Length:1000        Min.   : 4.0         Length:1000        Length:1000       
 Class :character   1st Qu.:12.0         Class :character   Class :character  
 Mode  :character   Median :18.0         Mode  :character   Mode  :character  
                    Mean   :20.9                                              
                    3rd Qu.:24.0                                              
                    Max.   :72.0                                              
     amount      savings_balance    employment_duration percent_of_income
 Min.   :  250   Length:1000        Length:1000         Min.   :1.000    
 1st Qu.: 1366   Class :character   Class :character    1st Qu.:2.000    
 Median : 2320   Mode  :character   Mode  :character    Median :3.000    
 Mean   : 3271                                          Mean   :2.973    
 3rd Qu.: 3972                                          3rd Qu.:4.000    
 Ma

In [4]:
# Tabulate the applicants' checking and savings account balances
table(credit$checking_balance)
table(credit$savings_balance)


    < 0 DM   > 200 DM 1 - 200 DM    unknown 
       274         63        269        394 


     < 100 DM     > 1000 DM  100 - 500 DM 500 - 1000 DM       unknown 
          603            48           103            63           183 

In [5]:
# Convert the default column to a factor (C5.0 requires a factor outcome)
credit$default <- as.factor(credit$default)

In [6]:
# look at two characteristics of the loan
summary(credit$months_loan_duration)
summary(credit$amount)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    4.0    12.0    18.0    20.9    24.0    72.0 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    250    1366    2320    3271    3972   18424 

**The summaries above show that loan amounts range from 250 to 18,424 Deutsche Marks (DM), with terms of 4 to 72 months. The data was collected in Germany, where the DM was the currency at the time.**

In [7]:
# look at the class variable
table(credit$default)


 no yes 
700 300 

**The output above shows that 300 of the 1,000 loans (30%) went into default.**

## **DATA PREPARATION**
#### **CREATING RANDOM TRAINING AND TEST DATA**

In [8]:
# creating a random sample for training and test data

set.seed(123)                       # fix the seed so sample() returns the same draw on every run
train_sample <- sample(1000, 700)   # draw 700 of the 1000 row numbers at random, without replacement
range(train_sample)                 # smallest and largest row number drawn


In [9]:
str(train_sample)


 int [1:700] 415 463 179 526 195 938 818 118 299 229 ...


**`train_sample` is a vector of 700 randomly chosen row numbers. These rows will form the training set.**

In [10]:
# split the data frames
credit_train <- credit[train_sample, ]
credit_test  <- credit[-train_sample, ]
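The split relies on two base R behaviours: `sample()` draws distinct row numbers when called without `replace = TRUE`, and a negative index drops those rows. A minimal sketch on a made-up 10-row data frame, where the names `toy`, `idx`, `toy_train`, and `toy_test` are illustrative and not part of the notebook:

```r
# Toy data frame standing in for `credit` (hypothetical data)
toy <- data.frame(id = 1:10, value = letters[1:10])

set.seed(123)                 # make the random draw reproducible
idx <- sample(nrow(toy), 7)   # 7 distinct row numbers, drawn without replacement

toy_train <- toy[idx, ]       # rows selected by the sample
toy_test  <- toy[-idx, ]      # every row NOT in the sample

# The two pieces are disjoint and together cover all rows
nrow(toy_train)                               # 7
nrow(toy_test)                                # 3
length(intersect(toy_train$id, toy_test$id))  # 0
```

Using `nrow(toy)` instead of a hard-coded count keeps the split correct if the number of rows changes.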

In [11]:
head(credit_train)

head(credit_test)


| | checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_duration | percent_of_income | years_at_residence | age | other_credit | housing | existing_loans_count | job | dependents | phone | default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<int>` | `<chr>` | `<fct>` |
| 415 | < 0 DM | 24 | good | car | 1381 | unknown | 1 - 4 years | 4 | 2 | 35 | none | own | 1 | skilled | 1 | no | yes |
| 463 | 1 - 200 DM | 12 | good | furniture/appliances | 3017 | < 100 DM | < 1 year | 3 | 1 | 34 | none | rent | 1 | management | 1 | no | no |
| 179 | unknown | 12 | good | furniture/appliances | 1963 | < 100 DM | 4 - 7 years | 4 | 2 | 31 | none | rent | 2 | management | 2 | yes | no |
| 526 | 1 - 200 DM | 26 | good | car | 7966 | < 100 DM | < 1 year | 2 | 3 | 30 | none | own | 2 | skilled | 1 | no | no |
| 195 | 1 - 200 DM | 45 | good | furniture/appliances | 3031 | 100 - 500 DM | 1 - 4 years | 4 | 4 | 21 | none | rent | 1 | skilled | 1 | no | yes |
| 938 | 1 - 200 DM | 6 | good | furniture/appliances | 2063 | < 100 DM | < 1 year | 4 | 3 | 30 | none | rent | 1 | management | 1 | yes | no |


| | checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_duration | percent_of_income | years_at_residence | age | other_credit | housing | existing_loans_count | job | dependents | phone | default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<int>` | `<chr>` | `<fct>` |
| 1 | < 0 DM | 6 | critical | furniture/appliances | 1169 | unknown | > 7 years | 4 | 4 | 67 | none | own | 2 | skilled | 1 | yes | no |
| 3 | unknown | 12 | critical | education | 2096 | < 100 DM | 4 - 7 years | 2 | 3 | 49 | none | own | 1 | unskilled | 2 | no | no |
| 4 | < 0 DM | 42 | good | furniture/appliances | 7882 | < 100 DM | 4 - 7 years | 2 | 4 | 45 | none | other | 1 | skilled | 2 | no | no |
| 7 | unknown | 24 | good | furniture/appliances | 2835 | 500 - 1000 DM | > 7 years | 3 | 4 | 53 | none | own | 1 | skilled | 1 | no | no |
| 9 | unknown | 12 | good | furniture/appliances | 3059 | > 1000 DM | 4 - 7 years | 2 | 4 | 61 | none | own | 1 | unskilled | 1 | no | no |
| 12 | < 0 DM | 48 | good | business | 4308 | < 100 DM | < 1 year | 3 | 4 | 24 | none | rent | 1 | skilled | 1 | no | yes |


In [12]:
# check the proportion of class variable
prop.table(table(credit_train$default))
prop.table(table(credit_train$default))*100

prop.table(table(credit_test$default))
prop.table(table(credit_test$default))*100


       no       yes 
0.7085714 0.2914286 


      no      yes 
70.85714 29.14286 


  no  yes 
0.68 0.32 


 no yes 
 68  32 

**The split looks reasonably balanced: defaults make up about 29% of the training set and 32% of the test set, both close to the 30% rate in the full data.**

## **Training a model on the data**

In [13]:
# install the C50 package (only needed once per environment)
install.packages("C50")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘plyr’, ‘libcoin’, ‘mvtnorm’, ‘Formula’, ‘inum’, ‘reshape2’, ‘partykit’, ‘Cubist’




In [14]:
credit_train$default<-as.factor(credit_train$default)

In [15]:
library(C50)
credit_model <- C5.0(credit_train[-17], credit_train$default)

In [16]:
# display simple facts about the tree
credit_model


Call:
C5.0.default(x = credit_train[-17], y = credit_train$default)

Classification Tree
Number of samples: 700 
Number of predictors: 16 

Tree size: 52 

Non-standard options: attempt to group attributes


In [17]:
# display detailed information about the tree
summary(credit_model)


Call:
C5.0.default(x = credit_train[-17], y = credit_train$default)


C5.0 [Release 2.07 GPL Edition]  	Sun Jan 23 13:18:34 2022
-------------------------------

Class specified by attribute `outcome'

Read 700 cases (17 attributes) from undefined.data

Decision tree:

checking_balance = unknown: no (272/31)
checking_balance in {< 0 DM,1 - 200 DM,> 200 DM}:
:...months_loan_duration > 42:
    :...savings_balance in {500 - 1000 DM,> 1000 DM}: yes (0)
    :   savings_balance = unknown: no (5)
    :   savings_balance in {< 100 DM,100 - 500 DM}:
    :   :...years_at_residence <= 1: no (3/1)
    :       years_at_residence > 1: yes (29/2)
    months_loan_duration <= 42:
    :...credit_history in {very good,perfect}:
        :...age <= 23: no (4)
        :   age > 23:
        :   :...percent_of_income > 3: yes (21/2)
        :       percent_of_income <= 3:
        :       :...dependents > 1: yes (2)
        :           dependents <= 1:
        :           :...housing = rent: yes (3)
        :

**According to the model summary, the tree misclassified only 82 (11.7%) of the 700 training instances, which corresponds to a training accuracy of 88.3%.**
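The percentages quoted above follow directly from the two counts. A small sketch of the arithmetic, where the variable names are illustrative and the counts are the ones stated in the text:

```r
# Figures quoted from the training summary
n_train  <- 700   # training instances
n_errors <- 82    # misclassified training instances

train_error    <- n_errors / n_train   # share of training cases the tree got wrong
train_accuracy <- 1 - train_error      # share it got right

round(100 * train_error, 1)     # 11.7
round(100 * train_accuracy, 1)  # 88.3
```

Training accuracy tends to be optimistic because the tree has already seen these cases, so it should be read alongside the test-set results.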

## **Evaluating model performance**

In [18]:

# create a factor vector of predictions on test data
credit_pred <- predict(credit_model, credit_test)
head(credit_pred)

#### Cross-tabulation of predicted versus actual classes using gmodels

In [19]:
# install and load gmodels, which provides CrossTable()
install.packages("gmodels")
library(gmodels)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘gtools’, ‘gdata’




In [20]:
# set prop.c and prop.r to FALSE to drop the column and row percentages
accuracy_chk<- CrossTable(credit_test$default, credit_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  300 

 
               | predicted default 
actual default |        no |       yes | Row Total | 
---------------|-----------|-----------|-----------|
            no |       176 |        28 |       204 | 
               |     0.587 |     0.093 |           | 
---------------|-----------|-----------|-----------|
           yes |        55 |        41 |        96 | 
               |     0.183 |     0.137 |           | 
---------------|-----------|-----------|-----------|
  Column Total |       231 |        69 |       300 | 
---------------|-----------|-----------|-----------|

 


In [21]:
# build the confusion matrix (rows = actual class, columns = predicted class)
accuracy_chk<-table(credit_test$default, credit_pred)

accuracy_chk

     credit_pred
       no yes
  no  176  28
  yes  55  41

In [22]:
#Calculating the accuracy of the model
accuracy_result<- sum(diag(accuracy_chk))/sum(accuracy_chk)
accuracy_result


**Of the 300 test loan records, the model correctly predicted 176 non-defaults and 41 defaults, an accuracy of about 72%, which is well below the 88.3% obtained on the training data. It also caught only 41 of the 96 actual defaults (about 43%).**
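Accuracy alone hides how the errors are distributed between the two classes. As a rough sketch, the test-set counts from the table above can be re-entered by hand to derive accuracy, sensitivity, and specificity; `conf_mat` and the metric names are illustrative and not objects created by the notebook:

```r
# Test-set confusion matrix copied from the output above
# (rows = actual default, columns = predicted default)
conf_mat <- matrix(c(176, 55, 28, 41), nrow = 2,
                   dimnames = list(actual    = c("no", "yes"),
                                   predicted = c("no", "yes")))

# Share of all test loans classified correctly: (176 + 41) / 300
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Share of actual defaults the model flagged: 41 / 96
sensitivity <- conf_mat["yes", "yes"] / sum(conf_mat["yes", ])

# Share of repaid loans the model cleared: 176 / 204
specificity <- conf_mat["no", "no"] / sum(conf_mat["no", ])

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 3)
# accuracy 0.723, sensitivity 0.427, specificity 0.863
```

For a lender, the low sensitivity is the more worrying figure, since each missed default is a loan that was approved and not repaid.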

## **Improving model performance**

In [23]:
## Boosting the accuracy of decision trees
# boosted decision tree with 10 trials
credit_boost10 <- C5.0(credit_train[-17], credit_train$default,
                       trials = 10)
credit_boost10
summary(credit_boost10)


Call:
C5.0.default(x = credit_train[-17], y = credit_train$default, trials = 10)

Classification Tree
Number of samples: 700 
Number of predictors: 16 

Number of boosting iterations: 10 
Average tree size: 43.8 

Non-standard options: attempt to group attributes



Call:
C5.0.default(x = credit_train[-17], y = credit_train$default, trials = 10)


C5.0 [Release 2.07 GPL Edition]  	Sun Jan 23 13:18:44 2022
-------------------------------

Class specified by attribute `outcome'

Read 700 cases (17 attributes) from undefined.data

-----  Trial 0:  -----

Decision tree:

checking_balance = unknown: no (272/31)
checking_balance in {< 0 DM,1 - 200 DM,> 200 DM}:
:...months_loan_duration > 42:
    :...savings_balance in {500 - 1000 DM,> 1000 DM}: yes (0)
    :   savings_balance = unknown: no (5)
    :   savings_balance in {< 100 DM,100 - 500 DM}:
    :   :...years_at_residence <= 1: no (3/1)
    :       years_at_residence > 1: yes (29/2)
    months_loan_duration <= 42:
    :...credit_history in {very good,perfect}:
        :...age <= 23: no (4)
        :   age > 23:
        :   :...percent_of_income > 3: yes (21/2)
        :       percent_of_income <= 3:
        :       :...dependents > 1: yes (2)
        :           dependents <= 1:
        :           

In [24]:
credit_boost_pred10 <- predict(credit_boost10, credit_test)
CrossTable(credit_test$default, credit_boost_pred10,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  300 

 
               | predicted default 
actual default |        no |       yes | Row Total | 
---------------|-----------|-----------|-----------|
            no |       176 |        28 |       204 | 
               |     0.587 |     0.093 |           | 
---------------|-----------|-----------|-----------|
           yes |        56 |        40 |        96 | 
               |     0.187 |     0.133 |           | 
---------------|-----------|-----------|-----------|
  Column Total |       232 |        68 |       300 | 
---------------|-----------|-----------|-----------|
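To compare the boosted model with the single tree, the counts from the two cross tables can be placed side by side. This is a hand-entered sketch, with the vectors `single` and `boosted` and the helper `acc()` introduced only for illustration:

```r
# Cell counts copied from the two CrossTable outputs
# TN = actual no / predicted no, FP = actual no / predicted yes,
# FN = actual yes / predicted no, TP = actual yes / predicted yes
single  <- c(TN = 176, FP = 28, FN = 55, TP = 41)
boosted <- c(TN = 176, FP = 28, FN = 56, TP = 40)

# Accuracy = correctly classified cases / all cases
acc <- function(m) unname((m["TN"] + m["TP"]) / sum(m))

round(acc(single), 4)   # 0.7233
round(acc(boosted), 4)  # 0.72
```

On this particular split, 10 boosting trials left test accuracy essentially unchanged (one additional missed default), so boosting alone does not appear to address the weak detection of defaults here.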

 


In [25]:
# creating dimensions for a cost matrix
matrix_dimensions <- list(c("no", "yes"), c("no", "yes"))
names(matrix_dimensions) <- c("predicted", "actual")
matrix_dimensions

In [26]:
# building the matrix
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2, dimnames = matrix_dimensions)
error_cost

| predicted (rows) / actual (columns) | no | yes |
|---|---|---|
| no | 0 | 4 |
| yes | 1 | 0 |
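Because `matrix()` fills column by column, the order of the four values decides which error receives the penalty of 4. The sketch below rebuilds the same matrix under the illustrative names `vals`, `m`, and `m_byrow` and shows what would change with `byrow = TRUE`:

```r
vals <- c(0, 1, 4, 0)

# Default column-wise fill: the first two values form the "actual = no" column
m <- matrix(vals, nrow = 2,
            dimnames = list(predicted = c("no", "yes"),
                            actual    = c("no", "yes")))

m["no", "yes"]   # 4: predicting "no" for a loan that actually defaults
m["yes", "no"]   # 1: predicting "yes" for a loan that is actually repaid

# Row-wise fill with the same values swaps the two penalties
m_byrow <- matrix(vals, nrow = 2, byrow = TRUE, dimnames = dimnames(m))
m_byrow["no", "yes"]   # 1
m_byrow["yes", "no"]   # 4
```

With the column-wise fill used in the notebook, a missed default is treated as four times as costly as a wrongly rejected good loan.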


In [27]:
# applying the cost matrix to the tree
credit_cost <- C5.0(credit_train[-17], credit_train$default,
                          costs = error_cost)
credit_cost_pred <- predict(credit_cost, credit_test)

CrossTable(credit_test$default, credit_cost_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  300 

 
               | predicted default 
actual default |        no |       yes | Row Total | 
---------------|-----------|-----------|-----------|
            no |       125 |        79 |       204 | 
               |     0.417 |     0.263 |           | 
---------------|-----------|-----------|-----------|
           yes |        27 |        69 |        96 | 
               |     0.090 |     0.230 |           | 
---------------|-----------|-----------|-----------|
  Column Total |       152 |       148 |       300 | 
---------------|-----------|-----------|-----------|
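The effect of the cost matrix can be summarised by re-entering the counts from the first (unweighted) tree and from the cost-sensitive tree. This is a sketch built from the printed tables, with `cost`, `baseline`, `cost_sens`, and the helper functions named here only for illustration:

```r
dims <- list(predicted = c("no", "yes"), actual = c("no", "yes"))

# Same 4:1 penalty matrix as in the notebook (rows = predicted, columns = actual)
cost <- matrix(c(0, 1, 4, 0), nrow = 2, dimnames = dims)

# Test-set counts, re-oriented to match the cost matrix
baseline  <- matrix(c(176, 28, 55, 41), nrow = 2, dimnames = dims)  # first tree
cost_sens <- matrix(c(125, 79, 27, 69), nrow = 2, dimnames = dims)  # cost-sensitive tree

accuracy   <- function(cm) sum(diag(cm)) / sum(cm)
recall_yes <- function(cm) cm["yes", "yes"] / sum(cm[, "yes"])  # defaults caught
total_cost <- function(cm) sum(cm * cost)                       # penalty-weighted errors

rbind(baseline  = c(accuracy = accuracy(baseline),
                    recall   = recall_yes(baseline),
                    cost     = total_cost(baseline)),
      cost_sens = c(accuracy = accuracy(cost_sens),
                    recall   = recall_yes(cost_sens),
                    cost     = total_cost(cost_sens)))
```

Under these counts, accuracy drops from about 72% to about 65%, but the share of defaults caught rises from about 43% to about 72% and the penalty-weighted cost falls from 248 to 187. Whether that trade is worthwhile depends on how realistic the assumed 4:1 penalty is for the lender.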

 
