# Dataset Name: Home Mortgage Disclosure Act (HMDA) data
Model chosen: Decision Tree Classifier

Commands to install pydotplus and graphviz from the Anaconda prompt: <br>
conda install -c conda-forge pydotplus <br>
pip install graphviz 

Objective: To classify the action taken on a loan application from the applicant income, loan amount, loan type, loan purpose, and agency.

In [9]:
# Importing the libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics
import pydotplus
from IPython.display import Image
import matplotlib.pyplot as plt

# Loading Data: <br>
After importing the required libraries, load the dataset with pandas' `read_csv` function.

In [4]:
# Importing the dataset
dataset = pd.read_csv('hmda_lar.csv')


# Feature Selection: <br>
We excluded the attributes that have null values in more than 300 of the 500 observations. From the remaining attributes, we identified the following features for our objective.

These are the independent variables selected from the dataset:


In [6]:
features = ['applicant_income_000s','loan_amount_000s','loan_type','loan_purpose','agency_code']

In [7]:
target = ['action_taken']
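The null-count filter described in this section is not shown in the notebook. A minimal sketch of how it could be done with pandas follows, using a five-row toy frame instead of the 500-row dataset. The toy values, the names `toy`, `max_nulls`, and `filtered`, and the scaled-down threshold of 3 out of 5 are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the HMDA frame: 5 rows instead of 500 (illustrative values).
toy = pd.DataFrame({
    "loan_amount_000s": [120, 85, 300, 150, 95],            # no nulls: kept
    "applicant_income_000s": [60, np.nan, 110, 72, 48],     # 1 null: kept
    "rate_spread": [np.nan, np.nan, np.nan, 1.7, np.nan],   # 4 nulls: dropped
})

# Hypothetical threshold: 3 of 5 rows here, mirroring 300 of 500 in the text.
max_nulls = 3

# Count nulls per column and keep only columns at or below the threshold.
null_counts = toy.isnull().sum()
keep_cols = null_counts[null_counts <= max_nulls].index
filtered = toy[keep_cols]

print(list(filtered.columns))
```

On the real dataset the same pattern with `max_nulls = 300` would drop a column such as `rate_spread`, which the null summary in the next section lists with 479 missing values.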

# Removing Null values and working with missing data:
With the columns selected, we inspect the correlations between the numeric attributes, identify the columns with null values, and fill the nulls.

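import seaborn as sns
# Correlation is defined only for numeric columns (numeric_only needs pandas >= 1.5)
corr = dataset.corr(numeric_only=True)
fig, ax = plt.subplots(figsize=(16, 10))
# Correlations lie in [-1, 1], so pin the colour scale to that range
sns.heatmap(corr.iloc[:, 1:], annot=True, linewidths=0.5, vmin=-1, vmax=1, cmap='RdBu', ax=ax)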



In [5]:
print(dataset.isnull().sum())  # number of null values in each column
# Column "applicant_income_000s" has null values; replace them with the column mean
dataset['applicant_income_000s'] = dataset['applicant_income_000s'].fillna(dataset['applicant_income_000s'].mean())


action_taken                        0
action_taken_name                   0
agency_code                         0
agency_abbr                         0
agency_name                         0
                                 ... 
number_of_owner_occupied_units    221
minority_population               221
population                        221
rate_spread                       479
tract_to_msamd_income             221
Length: 78, dtype: int64


# Splitting Data: <br>
To assess model performance, we divide the dataset into a training set (75%) and a test set (25%). The training set is used to fit the model and the test set to verify its performance.
We used `sklearn.model_selection.train_test_split` to split the dataset.
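As a quick check on the proportions, the sketch below applies the same `test_size=0.25` split to a 500-row placeholder array. The row count follows the 500 observations mentioned under feature selection; the array contents and the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, and `y_te` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 500 rows, 5 feature columns (values are arbitrary).
X_demo = np.arange(500 * 5).reshape(500, 5)
y_demo = np.arange(500) % 7  # seven dummy class labels

# Same split parameters as the notebook cell.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

print(X_tr.shape, X_te.shape)  # (375, 5) (125, 5)
```

The 125 test rows correspond to the total support reported in the classification report further down.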




In [11]:
#Split test and train data
X = dataset[features]
y = dataset[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

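# Evaluating Model: <br>
We fit a decision tree (entropy criterion, maximum depth 5) on the training set and predict labels for the test set. Accuracy is then computed by comparing the actual test-set labels with the predicted ones.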



In [13]:
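# Fit a depth-limited tree that chooses splits by information gain (entropy)
clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=5)
clf.fit(X_train, y_train)
# Predict the action taken for each test-set application
y_pred = clf.predict(X_test)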
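With `criterion="entropy"`, the tree prefers splits that reduce label entropy the most. The following is a rough pure-Python sketch of that idea, not scikit-learn's actual implementation (which also handles sample weights and many candidate thresholds). The helper names `entropy` and `information_gain` and the eight example labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Made-up action_taken labels for eight applications.
parent = [1, 1, 1, 1, 3, 3, 3, 3]
weak_left, weak_right = [1, 1, 1, 3], [1, 3, 3, 3]   # classes still mixed
pure_left, pure_right = [1, 1, 1, 1], [3, 3, 3, 3]   # classes fully separated

weak_gain = information_gain(parent, weak_left, weak_right)
pure_gain = information_gain(parent, pure_left, pure_right)

print(entropy(parent))  # 1.0
print(weak_gain)
print(pure_gain)        # 1.0
```

The split that separates the two classes completely recovers the full 1 bit of entropy, while the mixed split recovers only part of it, so the tree would pick the former.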

In [14]:
from sklearn.metrics import confusion_matrix
# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[26  1 11  0  5  0  0]
 [ 3  1  0  0  0  0  5]
 [ 5  1 24  0  1  1  1]
 [ 2  0  4  0  0  0  0]
 [ 8  0  3  0  5  0  0]
 [ 0  1  1  0  1  1  5]
 [ 1  1  0  0  0  0  7]]



# Accuracy = number of correct classifications / total number of classifications

In [23]:
print("Decision Tree Accuracy:", metrics.accuracy_score(y_test, y_pred)*100)

Decision Tree Accuracy: 51.2
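The same figure can be recovered by hand from the confusion matrix printed earlier: correct classifications sit on the diagonal, and every cell counts one test sample. The sketch below copies the matrix values from that output; the names `cm_values`, `correct`, and `total` are introduced here for illustration only.

```python
# Confusion matrix copied from the printed output (rows: actual, columns: predicted).
cm_values = [
    [26, 1, 11, 0, 5, 0, 0],
    [3, 1, 0, 0, 0, 0, 5],
    [5, 1, 24, 0, 1, 1, 1],
    [2, 0, 4, 0, 0, 0, 0],
    [8, 0, 3, 0, 5, 0, 0],
    [0, 1, 1, 0, 1, 1, 5],
    [1, 1, 0, 0, 0, 0, 7],
]

correct = sum(cm_values[i][i] for i in range(len(cm_values)))  # diagonal entries
total = sum(sum(row) for row in cm_values)                     # all test samples

print(correct, total, correct / total)  # 64 125 0.512
```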


# Performance metric:

In [24]:
print("----- Decision Tree Classification Report  -----")
print(metrics.classification_report(y_test, y_pred))

----- Decision Tree Classification Report  -----
              precision    recall  f1-score   support

           1       0.58      0.60      0.59        43
           2       0.20      0.11      0.14         9
           3       0.56      0.73      0.63        33
           4       0.00      0.00      0.00         6
           6       0.42      0.31      0.36        16
           7       0.50      0.11      0.18         9
           8       0.39      0.78      0.52         9

    accuracy                           0.51       125
   macro avg       0.38      0.38      0.35       125
weighted avg       0.48      0.51      0.48       125



  _warn_prf(average, modifier, msg_start, len(result))
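The `_warn_prf(...)` fragment at the end of the output appears to be the tail of a scikit-learn warning about an undefined metric. A likely cause, visible in the confusion matrix, is that no test sample was predicted as class 4 (its column is all zeros), so precision for that class has a zero denominator and is reported as 0.00. The sketch below recomputes per-class precision and recall from the matrix to make this visible. It assumes the class codes in the report line up with the matrix rows in sorted order; the names `labels`, `cm_values`, and `results` are illustrative.

```python
labels = [1, 2, 3, 4, 6, 7, 8]  # class codes as listed in the report

# Confusion matrix copied from the printed output (rows: actual, columns: predicted).
cm_values = [
    [26, 1, 11, 0, 5, 0, 0],
    [3, 1, 0, 0, 0, 0, 5],
    [5, 1, 24, 0, 1, 1, 1],
    [2, 0, 4, 0, 0, 0, 0],
    [8, 0, 3, 0, 5, 0, 0],
    [0, 1, 1, 0, 1, 1, 5],
    [1, 1, 0, 0, 0, 0, 7],
]

results = {}
for k, label in enumerate(labels):
    tp = cm_values[k][k]                          # correctly predicted as this class
    actual = sum(cm_values[k])                    # row sum = support
    predicted = sum(row[k] for row in cm_values)  # column sum = times predicted
    recall = tp / actual
    # Precision is undefined when the class is never predicted; report 0.0 then.
    precision = tp / predicted if predicted else 0.0
    results[label] = (precision, recall, actual, predicted)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} support={actual}")
```

Class 4 has six actual samples but is never predicted, which is why both its precision and recall come out as zero and pull the macro average down.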


In [None]:
# View the decision tree
data_feature_names = ['Applicant Income', 'Loan Amount', 'Loan Type', 'Loan Purpose', 'Agency']
plotData = tree.export_graphviz(clf, out_file=None, filled=True, rounded=True, special_characters=True,
               feature_names=data_feature_names)
graph = pydotplus.graph_from_dot_data(plotData)
graph.write_png('DecisionTree.png')
print("Decision Tree Created")
# Keep Image(...) as the last expression so the notebook renders it inline
Image(graph.create_png())
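If the Graphviz executables are not available, one alternative that the notebook itself does not use is `sklearn.tree.export_text` (scikit-learn 0.21 and later), which returns the fitted rules as plain text. The sketch below trains a tiny tree on made-up data purely to show the call; `X_toy`, `y_toy`, `toy_clf`, and `rules` are hypothetical names, and only two of the five features are used.

```python
from sklearn import tree

# Tiny made-up training set: [applicant income, loan amount], both in thousands.
X_toy = [[40, 200], [45, 220], [90, 150], [120, 100]]
y_toy = [3, 3, 1, 1]  # hypothetical action_taken codes

toy_clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)
toy_clf.fit(X_toy, y_toy)

# export_text returns the fitted tree's decision rules as an indented string.
rules = tree.export_text(toy_clf, feature_names=["Applicant Income", "Loan Amount"])
print(rules)
```

Passing the notebook's own `clf` and `data_feature_names` in place of the toy objects should print the rules of the trained model.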