## Decision Tree - Loan Repayment  
*Source: https://www.youtube.com/watch?v=RmajweUFKvM&list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy&index=18*

### What is a Decision Tree algorithm?

As the name suggests, a decision tree uses a flowchart-like tree structure to show the predictions that result from a series of feature-based splits. It starts at a root node and ends with a decision made at a leaf node.  

![image.png](attachment:image.png)  
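A decision tree can be read as a set of nested `if`/`else` checks. The sketch below is a hand-written toy, not a fitted model: the function name `predict_repayment` and both thresholds are invented purely to show where the root node, an internal node and the leaves sit.

```python
def predict_repayment(initial_payment, last_payment):
    """Toy hand-written tree; the thresholds are invented for illustration."""
    # Root node: the first feature-based split
    if last_payment <= 12000:
        # Internal node: a second split further down the tree
        if initial_payment <= 300:
            return "yes"  # leaf
        return "No"       # leaf
    return "No"           # leaf


print(predict_repayment(201, 10018))
print(predict_repayment(413, 14914))
```

A fitted tree has the same shape. The difference is that the algorithm chooses the features and thresholds from the training data.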


**You should read the content in this link:** https://www.analyticsvidhya.com/blog/2021/08/decision-tree-algorithm/  

When you read about decision tree algorithms, you will see that they can handle both numerical and categorical data. However, we are using the scikit-learn implementation, which for now does not support categorical variables directly, so any categorical feature has to be encoded as numbers before fitting.
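If a dataset did contain a categorical feature, one common workaround is to one-hot encode it before fitting. The example below is a minimal sketch on made-up data: the `Employment` column and all of its values are hypothetical and are not part of `LoanDataset.csv`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: "Employment" does not exist in LoanDataset.csv
toy = pd.DataFrame({
    "Credit Score": [250, 395, 523, 927, 109, 613],
    "Employment": ["salaried", "self-employed", "salaried",
                   "unemployed", "salaried", "unemployed"],
    "Result": ["yes", "yes", "No", "No", "yes", "No"],
})

# One-hot encode the categorical column so every feature is numeric
X_toy = pd.get_dummies(toy[["Credit Score", "Employment"]],
                       columns=["Employment"], dtype=int)
y_toy = toy["Result"]

clf_toy = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
print(list(X_toy.columns))
```

The loan dataset used here has only numeric features, so no encoding step is needed in this notebook.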

In [1]:
#Import the libraries
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [14]:
balance_data = pd.read_csv('LoanDataset.csv', sep=',', header=0)
balance_data

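| | Initial Payment | Last Payment | Credit Score | House Number | Result |
|---|---|---|---|---|---|
| 0 | 201 | 10018 | 250 | 3046 | yes |
| 1 | 205 | 10016 | 395 | 3044 | yes |
| 2 | 257 | 10129 | 109 | 3251 | yes |
| 3 | 246 | 10064 | 324 | 3137 | yes |
| 4 | 117 | 10115 | 496 | 3094 | yes |
| ... | ... | ... | ... | ... | ... |
| 995 | 413 | 14914 | 523 | 4683 | No |
| 996 | 359 | 14423 | 927 | 4838 | No |
| 997 | 316 | 14872 | 613 | 4760 | No |
| 998 | 305 | 14926 | 897 | 4572 | No |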


In [3]:
print("Dataset Length: ", len(balance_data))

Dataset Length:  1000


In [4]:
print("Dataset Shape: ", balance_data.shape)

Dataset Shape:  (1000, 5)


In [5]:
balance_data.head()

| | Initial Payment | Last Payment | Credit Score | House Number | Result |
|---|---|---|---|---|---|
| 0 | 201 | 10018 | 250 | 3046 | yes |
| 1 | 205 | 10016 | 395 | 3044 | yes |
| 2 | 257 | 10129 | 109 | 3251 | yes |
| 3 | 246 | 10064 | 324 | 3137 | yes |
| 4 | 117 | 10115 | 496 | 3094 | yes |


### Extracting the Dependent and Independent Variables

In [6]:
#Extract the independent variables
X = balance_data.values[:, 0:4]
#Extract the dependent variable
y = balance_data.values[:,4]

In [7]:
#Display X
X

array([[201, 10018, 250, 3046],
       [205, 10016, 395, 3044],
       [257, 10129, 109, 3251],
       ...,
       [316, 14872, 613, 4760],
       [305, 14926, 897, 4572],
       [168, 14798, 834, 4937]], dtype=object)
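The `dtype=object` in the output appears because `balance_data.values` converts the whole frame, including the string `Result` column, into one array before it is sliced. scikit-learn can usually still fit on such an array, but selecting the feature columns first keeps a numeric dtype. The sketch below shows the difference on the first five rows of the dataset; the names `sample`, `X_obj`, `X_num` and `y_lbl` are illustrative only.

```python
import pandas as pd

# First five rows of the dataset, as displayed by balance_data.head()
sample = pd.DataFrame({
    "Initial Payment": [201, 205, 257, 246, 117],
    "Last Payment": [10018, 10016, 10129, 10064, 10115],
    "Credit Score": [250, 395, 109, 324, 496],
    "House Number": [3046, 3044, 3251, 3137, 3094],
    "Result": ["yes"] * 5,
})

# Converting the whole frame mixes integers and strings, giving dtype=object
X_obj = sample.values[:, 0:4]

# Selecting the four feature columns first keeps an integer dtype
X_num = sample.iloc[:, 0:4].to_numpy()
y_lbl = sample["Result"].to_numpy()

print(X_obj.dtype, X_num.dtype)
```

Both arrays hold the same numbers. Only the dtype differs.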

In [8]:
#Display y
y

array(['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
       'yes', 'yes',

### Splitting the Data into Training and Testing Sets

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)
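With `test_size=0.3`, 30% of the 1000 rows (300) are held out for testing and the other 700 are used for training. The sketch below checks this on a synthetic stand-in with the same number of rows. The array contents and the 50/50 class balance are assumptions. The `stratify` argument is an optional extra that the notebook does not use; it keeps the class proportions equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the loan data (1000)
X_demo = np.arange(4000).reshape(1000, 4)
y_demo = np.array(["yes"] * 500 + ["No"] * 500)  # assumed 50/50 balance

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(np.unique(y_te, return_counts=True))
```

Fixing `random_state` makes the split reproducible, so rerunning the cell gives the same train and test rows.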

### Fitting the Model to the Training Set

In [10]:
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100, max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)
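With `criterion="entropy"`, each split is chosen to maximise information gain, that is, the drop in Shannon entropy from the parent node to its children. The helper functions below (`entropy` and `information_gain`, both written for this illustration) sketch that calculation on a tiny made-up label array. They are not scikit-learn's internal implementation.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted


# Made-up node with five "yes" and five "No" labels
parent = np.array(["yes"] * 5 + ["No"] * 5)
left, right = parent[:5], parent[5:]  # a split that separates the classes perfectly

print(entropy(parent), information_gain(parent, left, right))
```

A node with an even class mix has the maximum entropy of 1 bit, and a split that produces two pure children recovers all of it. The other two parameters limit tree growth: `max_depth=3` caps the number of consecutive splits, and `min_samples_leaf=5` rejects any split that would leave fewer than five training samples in a leaf.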

### Testing the Prediction

In [11]:
y_pred_en = clf_entropy.predict(X_test)
y_pred_en

array(['yes', 'yes', 'No', 'yes', 'No', 'yes', 'yes', 'yes', 'No', 'No',
       'No', 'No', 'yes', 'No', 'No', 'yes', 'yes', 'No', 'yes', 'No',
       'No', 'yes', 'No', 'yes', 'yes', 'No', 'No', 'yes', 'No', 'No',
       'No', 'yes', 'yes', 'yes', 'yes', 'No', 'No', 'No', 'yes', 'No',
       'yes', 'yes', 'yes', 'No', 'No', 'yes', 'yes', 'yes', 'No', 'No',
       'yes', 'No', 'yes', 'yes', 'yes', 'yes', 'No', 'yes', 'No', 'yes',
       'yes', 'No', 'yes', 'yes', 'No', 'yes', 'yes', 'yes', 'No', 'No',
       'No', 'No', 'No', 'yes', 'No', 'yes', 'yes', 'No', 'yes', 'No',
       'No', 'No', 'No', 'yes', 'No', 'yes', 'No', 'yes', 'yes', 'No',
       'yes', 'yes', 'yes', 'yes', 'yes', 'No', 'yes', 'yes', 'yes',
       'yes', 'No', 'No', 'yes', 'yes', 'No', 'yes', 'yes', 'yes', 'No',
       'yes', 'yes', 'yes', 'yes', 'No', 'No', 'yes', 'yes', 'yes', 'No',
       'No', 'No', 'No', 'yes', 'yes', 'No', 'yes', 'yes', 'yes', 'No',
       'No', 'yes', 'yes', 'No', 'yes', 'yes', 'yes', 'No', 'ye
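To see which splits lead to a given prediction, the fitted tree's rules can be printed as text with `sklearn.tree.export_text`. The sketch below uses synthetic data instead of `LoanDataset.csv`: the random features and the rule that the label depends only on the second column are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in: the label depends only on the second feature
rng = np.random.default_rng(0)
X_demo = rng.integers(100, 15000, size=(200, 2))
y_demo = np.where(X_demo[:, 1] < 12000, "yes", "No")

clf_demo = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                                  min_samples_leaf=5, random_state=100)
clf_demo.fit(X_demo, y_demo)

# Print the learned splits as indented if/else rules
rules = export_text(clf_demo, feature_names=["Initial Payment", "Last Payment"])
print(rules)

# Each prediction follows one path from the root to a leaf
print(clf_demo.predict([[300, 10000], [300, 14000]]))
```

The same call should work on `clf_entropy`, given the four feature names of the loan dataset.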

### Evaluating the Model

In [12]:
#Evaluate the model using the confusion matrix, classification report and accuracy score
print('Confusion Matrix \n', confusion_matrix(y_test, y_pred_en))
print('\n')
print('Classification Report \n', classification_report(y_test, y_pred_en))
print('\n')
print('Accuracy Of Our Model ', accuracy_score(y_test, y_pred_en))

Confusion Matrix 
 [[134  13]
 [  6 147]]


Classification Report 
               precision    recall  f1-score   support

          No       0.96      0.91      0.93       147
         yes       0.92      0.96      0.94       153

    accuracy                           0.94       300
   macro avg       0.94      0.94      0.94       300
weighted avg       0.94      0.94      0.94       300



Accuracy Of Our Model  0.9366666666666666
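The figures in the classification report can be derived directly from the confusion matrix. In scikit-learn's layout, rows are the true classes and columns are the predicted classes, here in the order `No`, `yes`. The sketch below recomputes the metrics from the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class,
# in the order ["No", "yes"]
cm = np.array([[134, 13],
               [6, 147]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted class (column totals)
recall = np.diag(cm) / cm.sum(axis=1)     # per true class (row totals)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))
print(precision.round(2), recall.round(2), f1.round(2))
```

For example, 134 of the 147 true `No` cases were predicted correctly (recall of about 0.91), and 134 of the 140 `No` predictions were right (precision of about 0.96).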
