# Identifying risky bank loans using C5.0 Decision Trees

## Exploring and preparing the data

In [None]:
credit <- read.csv("Credit_Data.csv", stringsAsFactors = TRUE)
head(credit)

,default,account_check_status,duration_in_month,credit_history,purpose,credit_amount,savings,present_emp_since,installment_as_income_perc,personal_status_sex,⋯,present_res_since,property,age,other_installment_plans,housing,credits_this_bank,job,people_under_maintenance,telephone,foreign_worker
,<int>,<fct>,<int>,<fct>,<fct>,<int>,<fct>,<fct>,<int>,<fct>,⋯,<int>,<fct>,<int>,<fct>,<fct>,<int>,<fct>,<int>,<fct>,<fct>
1,0,< 0 DM,6,critical account/ other credits existing (not at this bank),domestic appliances,1169,unknown/ no savings account,.. >= 7 years,4,male : single,⋯,4,real estate,67,none,own,2,skilled employee / official,1,"yes, registered under the customers name",yes
2,1,0 <= ... < 200 DM,48,existing credits paid back duly till now,domestic appliances,5951,... < 100 DM,1 <= ... < 4 years,2,female : divorced/separated/married,⋯,2,real estate,22,none,own,1,skilled employee / official,1,none,yes
3,0,no checking account,12,critical account/ other credits existing (not at this bank),(vacation - does not exist?),2096,... < 100 DM,4 <= ... < 7 years,2,male : single,⋯,3,real estate,49,none,own,1,unskilled - resident,2,none,yes
4,0,< 0 DM,42,existing credits paid back duly till now,radio/television,7882,... < 100 DM,4 <= ... < 7 years,2,male : single,⋯,4,if not A121 : building society savings agreement/ life insurance,45,none,for free,1,skilled employee / official,2,none,yes
5,1,< 0 DM,24,delay in paying off in the past,car (new),4870,... < 100 DM,1 <= ... < 4 years,3,male : single,⋯,4,unknown / no property,53,none,for free,2,skilled employee / official,2,none,yes
6,0,no checking account,36,existing credits paid back duly till now,(vacation - does not exist?),9055,unknown/ no savings account,1 <= ... < 4 years,2,male : single,⋯,4,unknown / no property,35,none,for free,1,unskilled - resident,2,"yes, registered under the customers name",yes


In [None]:
str(credit)

'data.frame':	1000 obs. of  21 variables:
 $ default                   : int  0 1 0 0 1 0 0 0 0 1 ...
 $ account_check_status      : Factor w/ 4 levels "< 0 DM",">= 200 DM / salary assignments for at least 1 year",..: 1 3 4 1 1 4 4 3 4 3 ...
 $ duration_in_month         : int  6 48 12 42 24 36 24 36 12 30 ...
 $ credit_history            : Factor w/ 5 levels "all credits at this bank paid back duly",..: 2 4 2 4 3 4 4 4 4 2 ...
 $ purpose                   : Factor w/ 10 levels "(vacation - does not exist?)",..: 5 5 1 8 3 1 8 4 5 3 ...
 $ credit_amount             : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ savings                   : Factor w/ 5 levels ".. >= 1000 DM ",..: 5 2 2 2 2 5 4 2 1 2 ...
 $ present_emp_since         : Factor w/ 5 levels ".. >= 7 years",..: 1 3 4 4 3 3 1 3 4 5 ...
 $ installment_as_income_perc: int  4 2 2 2 3 2 3 2 2 4 ...
 $ personal_status_sex       : Factor w/ 4 levels "female : divorced/separated/married",..: 4 1 4 4 4 4 4 4 2 3 ...
 $ o

In [None]:
# level counts for the checking-account and savings-account features
table(credit$account_check_status)



                                            < 0 DM 
                                               274 
>= 200 DM / salary assignments for at least 1 year 
                                                63 
                                 0 <= ... < 200 DM 
                                               269 
                               no checking account 
                                               394 

In [None]:
table(credit$savings)


             .. >= 1000 DM                 ... < 100 DM 
                         48                         603 
        100 <= ... < 500 DM       500 <= ... < 1000 DM  
                        103                          63 
unknown/ no savings account 
                        183 

In [None]:
# summary statistics for two numeric features: loan amount and loan duration
summary(credit$credit_amount)
summary(credit$duration_in_month)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    250    1366    2320    3271    3972   18424 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    4.0    12.0    18.0    20.9    24.0    72.0 

In [None]:
# percentage of loans in each class: 30% of the applicants defaulted
prop.table(table(credit$default)) * 100


 0  1 
70 30 

## Data preparation: Training and Test Datasets

In [None]:
# shuffle the rows by ordering them on uniform random numbers (seeded for reproducibility)
set.seed(12345)
credit_rand <- credit[order(runif(nrow(credit))), ]


In [None]:
str(credit_rand)

'data.frame':	1000 obs. of  21 variables:
 $ default                   : int  1 0 0 0 1 0 1 1 0 0 ...
 $ account_check_status      : Factor w/ 4 levels "< 0 DM",">= 200 DM / salary assignments for at least 1 year",..: 1 3 3 1 3 4 1 3 3 1 ...
 $ duration_in_month         : int  24 7 12 24 9 18 33 9 20 15 ...
 $ credit_history            : Factor w/ 5 levels "all credits at this bank paid back duly",..: 2 4 4 4 2 4 2 4 3 4 ...
 $ purpose                   : Factor w/ 10 levels "(vacation - does not exist?)",..: 3 5 5 8 1 2 8 8 7 8 ...
 $ credit_amount             : int  1199 2576 1103 4020 1501 1568 4281 918 2629 1845 ...
 $ savings                   : Factor w/ 5 levels ".. >= 1000 DM ",..: 2 2 2 2 2 3 4 2 2 2 ...
 $ present_emp_since         : Factor w/ 5 levels ".. >= 7 years",..: 1 3 4 3 1 3 3 3 3 2 ...
 $ installment_as_income_perc: int  4 2 4 2 2 3 1 4 2 4 ...
 $ personal_status_sex       : Factor w/ 4 levels "female : divorced/separated/married",..: 4 4 4 4 1 1 1 1 4 1 ...
 $ othe

In [None]:
levels(credit_rand$account_check_status)
# levels(credit_rand$age)
levels(credit_rand$credit_history)
# levels(credit_rand$credits_this_bank)
# levels(credit_rand$default)
# levels(credit_rand$duration_in_month)
levels(credit_rand$foreign_worker)
levels(credit_rand$housing)
# levels(credit_rand$installment_as_income_perc)
levels(credit_rand$job)
levels(credit_rand$other_debtors)
levels(credit_rand$other_installment_plans)
# levels(credit_rand$people_under_maintenance)
levels(credit_rand$personal_status_sex)
levels(credit_rand$present_emp_since)
# levels(credit_rand$present_res_since)
levels(credit_rand$property)
levels(credit_rand$purpose)
levels(credit_rand$savings)
levels(credit_rand$telephone)
# levels(credit_rand$credit_amount)

In [None]:
# rewrite the levels of every factor column as syntactically valid names
factor_cols <- names(credit_rand)[sapply(credit_rand, is.factor)]
for (col in factor_cols) {
  levels(credit_rand[[col]]) <- make.names(levels(credit_rand[[col]]))
}
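
The effect of `make.names()` on the level labels can be seen on a few strings taken from the `account_check_status` factor. This is a small sketch in base R; the `a b` / `a.b` pair at the end is a made-up example, included only to show that distinct labels can collapse to the same name unless `unique = TRUE` is set.

```r
# Level strings copied from the account_check_status factor above
raw_levels <- c("< 0 DM", "0 <= ... < 200 DM", "no checking account")

# make.names() prepends "X" when the first character cannot start a name
# and replaces every other invalid character with "."
clean_levels <- make.names(raw_levels)
print(clean_levels)

# Hypothetical labels that map to the same name; unique = TRUE keeps them distinct
deduplicated <- make.names(c("a b", "a.b"), unique = TRUE)
print(deduplicated)
```

These cleaned names are the ones that appear later in the printed tree, for example `X..0.DM` and `no.checking.account`.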

In [None]:
# confirm that shuffling only reordered the rows: both summaries should be identical
summary(credit$credit_amount)
summary(credit_rand$credit_amount)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    250    1366    2320    3271    3972   18424 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    250    1366    2320    3271    3972   18424 

In [None]:
# splitting the data, train 90% and test 10%
credit_train<-credit_rand[1:900,]
credit_test<-credit_rand[901:1000,]

In [None]:
prop.table(table(credit_train$default))
prop.table(table(credit_test$default))


        0         1 
0.7022222 0.2977778 


   0    1 
0.68 0.32 
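
The shuffle-then-slice split used above can be sketched on a toy data frame. The `toy` data and its `id` column are hypothetical stand-ins for the credit data; only the technique (ordering on `runif()` and then slicing by position) comes from the notebook.

```r
# Toy stand-in for the credit data: 10 rows with made-up labels
set.seed(12345)
toy <- data.frame(id = 1:10, default = c(0, 1, 0, 0, 1, 0, 0, 0, 0, 1))

# Shuffle: ordering on uniform random numbers gives a random permutation of the rows
toy_rand <- toy[order(runif(nrow(toy))), ]

# Slice: the first 90% of the shuffled rows train the model, the rest test it
n_train   <- floor(0.9 * nrow(toy_rand))
toy_train <- toy_rand[seq_len(n_train), ]
toy_test  <- toy_rand[(n_train + 1):nrow(toy_rand), ]

# Every row lands in exactly one of the two sets
stopifnot(setequal(c(toy_train$id, toy_test$id), toy$id))
```

Because the split is purely random, the class proportions in the two sets are close but not identical, which is what the 0.70 versus 0.68 figures above show.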

## Training the model on the data

In [None]:
# C5.0 needs a factor outcome, so convert the training labels from integer to factor
credit_train$default <- factor(credit_train$default)
str(credit_train$default)

 Factor w/ 2 levels "0","1": 2 1 1 1 2 1 2 2 1 1 ...
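
One detail worth keeping in mind after this conversion: the factor stores internal codes 1 and 2, not the original 0 and 1. A short base R sketch using the first five training labels shown above:

```r
# First five training labels, as printed by str(credit_rand)
default_int <- c(1L, 0L, 0L, 0L, 1L)
default_fct <- factor(default_int)

levels(default_fct)       # "0" "1"
as.integer(default_fct)   # internal codes: 2 1 1 1 2

# Go through character to recover the original 0/1 values
recovered <- as.integer(as.character(default_fct))
print(recovered)          # 1 0 0 0 1
```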


In [None]:
# fit the tree on all predictors: column 1 ('default') is dropped from x
# and passed separately as the outcome
# install.packages("C50")
library(C50)
credit_model <- C5.0(credit_train[, -1], credit_train$default)

In [None]:
# print basic facts about the fitted tree
credit_model


Call:
C5.0.default(x = credit_train[, -1], y = credit_train$default)

Classification Tree
Number of samples: 900 
Number of predictors: 20 

Tree size: 57 

Non-standard options: attempt to group attributes


In [None]:
# getting the summary of the model
summary(credit_model)


Call:
C5.0.default(x = credit_train[, -1], y = credit_train$default)


C5.0 [Release 2.07 GPL Edition]  	Sat Oct 22 13:24:00 2022
-------------------------------

Class specified by attribute `outcome'

Read 900 cases (21 attributes) from undefined.data

Decision tree:

account_check_status = no.checking.account: 0 (358/44)
account_check_status in {X..0.DM,
:                        X...200.DM...salary.assignments.for.at.least.1.year,
:                        X0..........200.DM}:
:...foreign_worker = no:
    :...other_installment_plans in {none,stores}: 0 (17/1)
    :   other_installment_plans = bank:
    :   :...present_res_since <= 3: 1 (2)
    :       present_res_since > 3: 0 (2)
    foreign_worker = yes: [S1]

SubTree [S1]

credit_history in {all.credits.at.this.bank.paid.back.duly,
:                  no.credits.taken..all.credits.paid.back.duly}: 1 (61/20)
credit_history in {critical.account..other.credits.existing..not.at.this.bank.,
:                  delay.in.paying.off.in.the.

## Evaluating model performance

In [None]:
# creating predicted values using the test data
credit_pred<-predict(credit_model, credit_test)

In [None]:
# install.packages("gmodels")
library(gmodels)
CrossTable(credit_test$default, credit_pred, dnn = c("actual default", "predicted default"))


 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
               | predicted default 
actual default |         0 |         1 | Row Total | 
---------------|-----------|-----------|-----------|
             0 |        54 |        14 |        68 | 
               |     2.173 |     4.035 |           | 
               |     0.794 |     0.206 |     0.680 | 
               |     0.831 |     0.400 |           | 
               |     0.540 |     0.140 |           | 
---------------|-----------|-----------|-----------|
             1 |        11 |        21 |        32 | 
               |     4.617 |     8.575 |           | 
               |     0.344 |     0.656 |     0.320 | 
               |     0.169 |     0.600 |           | 
               |     0.110 |     0.210 |           | 
-------

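The single tree classified 75 of the 100 test loans correctly (54 + 21), an accuracy of 75%. It missed 11 of the 32 actual defaults and flagged 14 of the 68 good loans as defaults.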
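
The accuracy figure can be checked directly from the counts in the cross table. The sketch below rebuilds the table as a plain matrix in base R; the object names are illustrative.

```r
# Counts copied from the cross table (rows = actual, columns = predicted)
conf <- matrix(c(54, 11, 14, 21), nrow = 2,
               dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)

# The two kinds of error
false_alarms    <- conf["0", "1"]   # good loans predicted as defaults
missed_defaults <- conf["1", "0"]   # defaults predicted as good loans

print(accuracy)          # 0.75
print(false_alarms)      # 14
print(missed_defaults)   # 11
```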

## Boosting the accuracy of decision trees

In [None]:
# the 'trials' parameter sets the number of separate decision trees
# to use in the boosted team
credit_boost10 <- C5.0(credit_train[-1], credit_train$default, trials = 10)
credit_boost10


Call:
C5.0.default(x = credit_train[-1], y = credit_train$default, trials = 10)

Classification Tree
Number of samples: 900 
Number of predictors: 20 

Number of boosting iterations: 10 
Average tree size: 47.3 

Non-standard options: attempt to group attributes


In [None]:
# the average tree size shrank from 57 (single tree) to 47.3 (boosted)
summary(credit_boost10)


Call:
C5.0.default(x = credit_train[-1], y = credit_train$default, trials = 10)


C5.0 [Release 2.07 GPL Edition]  	Sat Oct 22 13:38:00 2022
-------------------------------

Class specified by attribute `outcome'

Read 900 cases (21 attributes) from undefined.data

-----  Trial 0:  -----

Decision tree:

account_check_status = no.checking.account: 0 (358/44)
account_check_status in {X..0.DM,
:                        X...200.DM...salary.assignments.for.at.least.1.year,
:                        X0..........200.DM}:
:...foreign_worker = no:
    :...other_installment_plans in {none,stores}: 0 (17/1)
    :   other_installment_plans = bank:
    :   :...present_res_since <= 3: 1 (2)
    :       present_res_since > 3: 0 (2)
    foreign_worker = yes: [S1]

SubTree [S1]

credit_history in {all.credits.at.this.bank.paid.back.duly,
:                  no.credits.taken..all.credits.paid.back.duly}: 1 (61/20)
credit_history in {critical.account..other.credits.existing..not.at.this.bank.,
:          

In [None]:
credit_boost_pred10<-predict(credit_boost10, credit_test)
CrossTable(credit_test$default, credit_boost_pred10, dnn = c('actual default', 'predicted default'))


 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
               | predicted default 
actual default |         0 |         1 | Row Total | 
---------------|-----------|-----------|-----------|
             0 |        63 |         5 |        68 | 
               |     1.603 |     6.031 |           | 
               |     0.926 |     0.074 |     0.680 | 
               |     0.797 |     0.238 |           | 
               |     0.630 |     0.050 |           | 
---------------|-----------|-----------|-----------|
             1 |        16 |        16 |        32 | 
               |     3.407 |    12.815 |           | 
               |     0.500 |     0.500 |     0.320 | 
               |     0.203 |     0.762 |           | 
               |     0.160 |     0.160 |           | 
-------

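Boosting cut the test error from 25% to 21%, giving an accuracy of 79% (63 + 16 of 100).  
However, the boosted model identifies only 16 of the 32 actual defaults (50%), down from 21 of 32 (about 66%) for the single tree, so the gain in accuracy comes from fewer false alarms rather than from catching more defaults.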
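
To compare the two models on more than overall accuracy, the counts from both cross tables can be put through the same small helper. This is a sketch in base R; `default_metrics` is a hypothetical helper name, and class `1` (default) is treated as the positive class.

```r
# Test-set counts copied from the two cross tables (rows = actual, columns = predicted)
dims <- list(actual = c("0", "1"), predicted = c("0", "1"))
single_tree <- matrix(c(54, 11, 14, 21), nrow = 2, dimnames = dims)
boosted     <- matrix(c(63, 16,  5, 16), nrow = 2, dimnames = dims)

# Accuracy, recall and precision with "1" (default) as the positive class
default_metrics <- function(conf) {
  c(accuracy  = sum(diag(conf)) / sum(conf),
    recall    = conf["1", "1"] / sum(conf["1", ]),
    precision = conf["1", "1"] / sum(conf[, "1"]))
}

comparison <- rbind(single_tree = default_metrics(single_tree),
                    boosted     = default_metrics(boosted))
print(round(comparison, 3))
```

Recall on defaults drops from 21/32 to 16/32 while precision rises from 21/35 to 16/21, which is the trade-off hidden behind the higher accuracy.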

In [None]:
# intended as a cost-sensitive tree, but no 'costs' matrix is passed to C5.0(),
# so this call refits the default single tree and the table below matches the first model
credit_cost <- C5.0(credit_train[-1], credit_train$default)
credit_cost_pred <- predict(credit_cost, credit_test)
CrossTable(credit_test$default, credit_cost_pred, prop.chisq = FALSE,
           prop.c = FALSE, prop.r = FALSE, dnn = c('actual default', 'predicted default'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
               | predicted default 
actual default |         0 |         1 | Row Total | 
---------------|-----------|-----------|-----------|
             0 |        54 |        14 |        68 | 
               |     0.540 |     0.140 |           | 
---------------|-----------|-----------|-----------|
             1 |        11 |        21 |        32 | 
               |     0.110 |     0.210 |           | 
---------------|-----------|-----------|-----------|
  Column Total |        65 |        35 |       100 | 
---------------|-----------|-----------|-----------|
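
A cost-sensitive fit would need an explicit cost matrix passed through the `costs` argument of `C5.0()`. The sketch below shows one possible setup. The 4:1 penalty for a missed default is an illustrative assumption, not a value from this notebook, and the orientation (rows = predicted, columns = actual) follows my reading of the C50 documentation, so it is worth verifying before relying on it. The model is only fitted if the C50 package and the notebook's `credit_train` and `credit_test` objects are available.

```r
# Cost matrix sketch: rows = predicted class, columns = actual class (assumed orientation)
cost_dims  <- list(predicted = c("0", "1"), actual = c("0", "1"))

# Assumed costs: a missed default (predicted 0, actual 1) costs 4,
# a false alarm (predicted 1, actual 0) costs 1, correct predictions cost 0
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2, dimnames = cost_dims)
print(error_cost)

# Fit only when the package and the notebook's data frames exist in this session
if (requireNamespace("C50", quietly = TRUE) &&
    exists("credit_train") && exists("credit_test")) {
  credit_cost <- C50::C5.0(credit_train[-1], credit_train$default,
                           costs = error_cost)
  credit_cost_pred <- predict(credit_cost, credit_test)
  print(table(actual = credit_test$default, predicted = credit_cost_pred))
}
```

With a penalty like this, the tree would be expected to trade some false alarms for fewer missed defaults, but the actual counts depend on the data and are not shown here.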

 
