# Q7
Tree Based Models - Q07- 08/July

Consider the dataset in credit_data.csv at the following location:
https://drive.google.com/drive/folders/1Jl8iDu7nGmrqCECbrLqmVafgwE5PYfiU

A1 - A15 = Attributes, T - Target (positive or negative credit)

Split the data into train and test sets (75% / 25%) and train a decision tree using sklearn's DecisionTreeClassifier() with two different split criteria:

    - 1) ID3  (criterion = 'entropy')
    - 2) CART (criterion = 'gini')

Find the accuracy on the test data for the two methods. Why do you think the accuracies turn out this way?
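
The two criteria differ only in the impurity measure used to score candidate splits. scikit-learn's tree implementation is CART-based, so `criterion='entropy'` swaps the impurity measure rather than running the original ID3 algorithm. The sketch below computes both measures for a few class-count vectors; the counts and the helper names `entropy` and `gini` are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in bits) of a vector of class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # classes with zero probability contribute nothing
    return float((p * np.log2(1.0 / p)).sum())

def gini(counts):
    """Gini impurity of a vector of class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return float(1.0 - (p ** 2).sum())

# Hypothetical class counts, chosen only to show the shape of both measures
for counts in ([50, 50], [90, 10], [100, 0]):
    print(counts, round(entropy(counts), 3), round(gini(counts), 3))
```

Both measures are zero for a pure node and largest for an even class mix, which is why they tend to prefer similar splits.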

In [1]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

pd.set_option('display.max_columns', None)

In [2]:
df = pd.read_csv("01_credit_data.csv")
df.head(2)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,T
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+


In [3]:
cat_var_list = ['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']
num_vrb_list = ['A2', 'A3', 'A8', 'A11', 'A14', 'A15']

In [4]:
# A plain astype(float) would fail on non-numeric entries, so coerce them to NaN instead
for col in num_vrb_list:
    df[col] = pd.to_numeric(df[col], errors='coerce')
df = df.dropna(axis=0)
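
To see what the coercion step does, here is a minimal sketch on a toy frame. It assumes that missing entries are stored as a non-numeric placeholder such as `'?'`; that placeholder and the toy values are assumptions for illustration, not confirmed from the file.

```python
import pandas as pd

# Hypothetical rows; the '?' placeholder is an assumption about the raw file
raw = pd.DataFrame({"A1": ["b", "a", "a"], "A2": ["30.83", "?", "58.67"]})

# Non-numeric strings become NaN instead of raising an error
raw["A2"] = pd.to_numeric(raw["A2"], errors="coerce")
n_missing = int(raw["A2"].isna().sum())

# Drop every row that contains at least one NaN
clean = raw.dropna(axis=0)

print(n_missing)
print(clean.shape)
```

Only the columns passed through `pd.to_numeric` are affected, so a placeholder inside a categorical column would survive as its own category.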

In [5]:
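# One-hot encode the categorical columns, dropping the first level of each
df_encoded = pd.get_dummies(df, columns=cat_var_list, prefix_sep='_', drop_first=True)
# Binary target: '+' -> 1, '-' -> 0
df_encoded['target'] = np.where(df_encoded['T'] == '+', 1, 0)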
#df_encoded.head()
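
The effect of `drop_first=True` is easiest to see on a small example. The sketch below uses a hypothetical two-column frame whose levels mirror A9 and A13; the rows are made up for illustration.

```python
import pandas as pd

# Toy frame: A9 has levels f/t, A13 has levels g/p/s (rows are hypothetical)
toy = pd.DataFrame({"A9": ["t", "f", "t"], "A13": ["g", "p", "s"]})

# The first level of each column (in sorted order) is dropped as the baseline
encoded = pd.get_dummies(toy, columns=["A9", "A13"], prefix_sep="_", drop_first=True)

print(list(encoded.columns))
```

A column with k levels therefore yields k - 1 indicator columns, with the dropped level represented by all indicators being zero.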

In [6]:
# verify encoding
pd.crosstab(df_encoded['T'], df_encoded['target'])

T / target,0,1
+,0,299
-,367,0


In [7]:
df_encoded.drop('T', axis=1, inplace=True)

In [8]:
df_train, df_test = train_test_split(df_encoded, test_size=0.25, random_state=25)
print(df_train.shape)
print(df_test.shape)

(499, 41)
(167, 41)


# 1. With entropy as the split criterion

In [9]:
x_cols = ['A2', 'A3', 'A8', 'A11', 'A14', 'A15', 'A1_a', 'A1_b', 'A4_u', 'A4_y',
          'A5_gg', 'A5_p', 'A6_aa', 'A6_c', 'A6_cc', 'A6_d', 'A6_e', 'A6_ff',
          'A6_i', 'A6_j', 'A6_k', 'A6_m', 'A6_q', 'A6_r', 'A6_w', 'A6_x', 'A7_bb',
          'A7_dd', 'A7_ff', 'A7_h', 'A7_j', 'A7_n', 'A7_o', 'A7_v', 'A7_z',
          'A9_t', 'A10_t', 'A12_t', 'A13_p', 'A13_s'
         ]
X = df_train[x_cols]
y = df_train['target']
model_decision_tree = DecisionTreeClassifier(random_state=0, criterion='entropy')
model_decision_tree.fit(X, y)
# Score the already-fitted tree on the held-out test set (no second fit needed)
model_accuracy = model_decision_tree.score(df_test[x_cols], df_test['target'])
print(f"model_accuracy is {np.round(model_accuracy * 100, 2)}%")

model_accuracy is 84.43%


# 2. With Gini as the split criterion

In [10]:
x_cols = ['A2', 'A3', 'A8', 'A11', 'A14', 'A15', 'A1_a', 'A1_b', 'A4_u', 'A4_y',
          'A5_gg', 'A5_p', 'A6_aa', 'A6_c', 'A6_cc', 'A6_d', 'A6_e', 'A6_ff',
          'A6_i', 'A6_j', 'A6_k', 'A6_m', 'A6_q', 'A6_r', 'A6_w', 'A6_x', 'A7_bb',
          'A7_dd', 'A7_ff', 'A7_h', 'A7_j', 'A7_n', 'A7_o', 'A7_v', 'A7_z',
          'A9_t', 'A10_t', 'A12_t', 'A13_p', 'A13_s'
         ]
X = df_train[x_cols]
y = df_train['target']
model_decision_tree = DecisionTreeClassifier(random_state=0, criterion='gini')
model_decision_tree.fit(X, y)
# Score the already-fitted tree on the held-out test set (no second fit needed)
model_accuracy = model_decision_tree.score(df_test[x_cols], df_test['target'])
print(f"model_accuracy is {np.round(model_accuracy * 100, 2)}%")

model_accuracy is 83.23%


# Answers
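    - 1. With entropy as the split criterion, the test accuracy is 84.43%.
    - 2. With Gini as the split criterion, the test accuracy is 83.23%.
    - 3. The two results are close: on 167 test rows, 84.43% and 83.23% correspond to 141 and 139 correct predictions, a difference of two rows. Entropy and Gini impurity usually rank candidate splits similarly, so a gap this small is more plausibly an effect of this particular train/test split (random_state=25) than evidence that one criterion is better.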
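
One way to judge whether a gap of about one percentage point is meaningful is to score both criteria over several folds instead of a single split. The sketch below does this with 5-fold cross-validation on synthetic data from `make_classification`, used here only as a stand-in with the same row count (666) as the cleaned credit data; the feature settings are assumptions, and the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption): 666 rows, binary target, 20 numeric features
X_demo, y_demo = make_classification(
    n_samples=666, n_features=20, n_informative=6, random_state=0
)

results = {}
for criterion in ("entropy", "gini"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    scores = cross_val_score(tree, X_demo, y_demo, cv=5)  # accuracy per fold
    results[criterion] = (scores.mean(), scores.std())
    print(f"{criterion}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

To apply the same check to the credit data, `X_demo` and `y_demo` could be replaced with `df_encoded[x_cols]` and `df_encoded['target']`. If the fold-to-fold spread is larger than the gap between the two means, the single-split difference should not be read as a real advantage for either criterion.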