# Modeling with DecisionTreeClassifier

In [1]:
import pandas as pd
import numpy as np

#Sklearn for model fitting and scoring
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

##Seaborn for fancy plots. 
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = (8,8)

In [2]:
#Load data
df = pd.read_csv("Data/training.csv")
df = df.drop(columns=["id"])
df.sample(5)

Unnamed: 0,target,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,...,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199,var_200
39,1,0.945,0.604,0.381,0.724,0.863,0.378,0.724,0.95,0.663,...,0.951,0.568,0.413,0.225,0.434,0.81,0.866,0.583,0.462,0.671
58,1,0.653,0.172,0.091,0.845,0.238,0.532,0.92,0.891,0.76,...,0.558,0.728,0.494,0.802,0.744,0.257,0.889,0.92,0.735,0.502
237,0,0.244,0.751,0.802,0.871,0.18,0.144,0.272,0.766,0.835,...,0.133,0.388,0.856,0.194,0.641,0.852,0.752,0.894,0.627,0.885
102,0,0.734,0.287,0.674,0.313,0.184,0.647,0.719,0.053,0.593,...,0.51,0.744,0.825,0.666,0.582,0.462,0.124,0.667,0.141,0.346
178,1,0.414,0.934,0.677,0.476,0.187,0.167,0.571,0.26,0.681,...,0.311,0.917,0.58,0.589,0.256,0.107,0.178,0.533,0.455,0.669


### Starting

The first stage is an initial exploratory data analysis. We know that the data consists of numerical variables that are already on a common scale. The records are complete and there are no null values, so the data is in a clean state.
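The completeness and scale claims above can be checked directly rather than taken on trust. The following is a minimal sketch, not the notebook's own EDA: `demo_df` is a small synthetic stand-in (hypothetical values) with the same layout as the training data, a binary `target` plus numeric `var_*` columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (hypothetical values):
# a binary target plus a few numeric features drawn from [0, 1).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    rng.random((250, 5)), columns=[f"var_{i}" for i in range(1, 6)]
)
demo_df.insert(0, "target", rng.integers(0, 2, size=250))

# Total number of missing values across all columns.
n_missing = int(demo_df.isna().sum().sum())

# Per-feature minimum and maximum, to see what scale the features are on.
feature_range = demo_df.drop(columns=["target"]).agg(["min", "max"])

# Share of each class in the target.
class_share = demo_df["target"].value_counts(normalize=True)

print("Missing values:", n_missing)
print(feature_range)
print(class_share)
```

Running the same three checks on `df` would confirm whether the real data has no nulls and whether all features share the same range.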

In [3]:
import ml_utils
df_eda = ml_utils.edaDF(df, "target")
num_var = list(df.columns)
df_eda.setNum(num_var)

In [4]:
df_eda.fullEDA()



In [6]:
df.shape

(250, 201)

### Model fitting

As mentioned above, our data is in a clean state, so we proceed directly to model fitting.
The model used is the DecisionTreeClassifier.
Considering that the data set is small (250 rows), we use a 70/30 split for the train/test data.

In [15]:
#Use a 1-D target array, which is the shape scikit-learn estimators expect
y = np.array(df["target"])
X = np.array(df.drop(columns=["target"]))

#train size = 70%
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=.7)
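With 250 rows, a 70/30 split leaves 175 rows for training and 75 for testing. The sketch below illustrates this on synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins). It also passes `stratify`, which the notebook does not use; this is shown only as an option that keeps the class proportions similar in both parts, which can matter on a data set this small.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the notebook's data (hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.random((250, 200))
y_demo = rng.integers(0, 2, size=250)

# 70/30 split; stratify keeps the class balance similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0, stratify=y_demo
)

print("Train shape:", X_tr.shape)
print("Test shape:", X_te.shape)
print("Positive share (train, test):", y_tr.mean(), y_te.mean())
```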


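Model 1: The DecisionTreeClassifier is used with its default hyperparameters. The perfect training accuracy against a testing accuracy of about 0.63 shows that the unconstrained tree overfits the training data.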

In [16]:
pipeline_model1 = [
    ('scaler', StandardScaler()),
    ('DT', DecisionTreeClassifier(random_state=0))
    ]

model1 = Pipeline(pipeline_model1)
model1 = model1.fit(X_train, y_train)
print("Training Accuracy:", model1.score(X_train, y_train))
print("Testing Accuracy:", model1.score(X_test, y_test))

Training Accuracy: 1.0
Testing Accuracy: 0.6266666666666667
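A single 75-row test split gives a noisy accuracy estimate, so differences of one or two percentage points between models are hard to interpret. The notebook imports `cross_val_score` but does not call it; the sketch below shows one way it could be used to get a more stable estimate. The data comes from `make_classification` and is purely a hypothetical stand-in, so the printed numbers say nothing about the real data set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 175-row training split (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=175, n_features=200, n_informative=10, random_state=0
)

# Same pipeline structure as Model 1.
demo_model = Pipeline([
    ("scaler", StandardScaler()),
    ("DT", DecisionTreeClassifier(random_state=0)),
])

# 5-fold cross-validated accuracy: each fold is held out once for scoring.
cv_scores = cross_val_score(demo_model, X_demo, y_demo, cv=5, scoring="accuracy")

print("Fold accuracies:", cv_scores)
print("Mean: %.3f, std: %.3f" % (cv_scores.mean(), cv_scores.std()))
```

The spread across folds gives a rough sense of how much of a difference between two models could be due to the split alone.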


Model 2: The DecisionTreeClassifier is used with the criterion set to "entropy".
The testing accuracy barely changes (about 0.61 versus about 0.63 for Model 1), and the training accuracy is still 1.0.

In [17]:

pipeline_model2 = [
    ('scaler', StandardScaler()),
    ('DT', DecisionTreeClassifier(random_state=0, criterion="entropy"))
    ]
model2 = Pipeline(pipeline_model2)
model2 = model2.fit(X_train, y_train)
print("Training Accuracy:", model2.score(X_train, y_train))
print("Testing Accuracy:", model2.score(X_test, y_test))

Training Accuracy: 1.0
Testing Accuracy: 0.6133333333333333


Model 3: The DecisionTreeClassifier is used with the default criterion and max_depth set to 5.
The testing accuracy improved slightly (to 0.64) while the training accuracy dropped from 1.0 to about 0.93. This was taken as the best model because further tuning did not appear to yield better results.

In [19]:
#Build pipeline
pipeline_best = [
    ('scaler', StandardScaler()),
    ('DT', DecisionTreeClassifier(random_state=0, max_depth=5))
    ]

best = Pipeline(pipeline_best)
best = best.fit(X_train, y_train)
print("Training Accuracy:", best.score(X_train, y_train))
print("Testing Accuracy:", best.score(X_test, y_test))

Training Accuracy: 0.9257142857142857
Testing Accuracy: 0.64
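The choice of `max_depth` can also be made systematically instead of by hand. The sketch below is one possible approach, not the procedure used in this notebook: it runs `GridSearchCV` over a few depths, addressing the tree inside the pipeline with the `DT__` prefix. The candidate depths and the synthetic data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training split (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=175, n_features=200, n_informative=10, random_state=0
)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("DT", DecisionTreeClassifier(random_state=0)),
])

# Candidate depths are illustrative; None means the tree grows until pure.
param_grid = {"DT__max_depth": [2, 3, 5, 8, None]}

# Each depth is scored by 5-fold cross-validation on the training data only.
search = GridSearchCV(demo_pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best depth:", search.best_params_["DT__max_depth"])
print("Best cross-validated accuracy: %.3f" % search.best_score_)
```

Because the search only uses the training data, the held-out test split stays untouched for the final evaluation.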


### Finishing

In [21]:
print(best.score(X_test, y_test))
print(best)

0.64
Pipeline(steps=[('scaler', StandardScaler()),
                ('DT', DecisionTreeClassifier(max_depth=5, random_state=0))])
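Beyond printing the pipeline, the fitted tree itself can be inspected to see how complex it is. The sketch below pulls the tree out of a pipeline through `named_steps` and reports its depth, leaf count, and top-level rules. It is fitted on synthetic data (a hypothetical stand-in), so the rules shown are not those of the notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the training split (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=175, n_features=200, n_informative=10, random_state=0
)

# Same structure as the notebook's best pipeline.
demo_best = Pipeline([
    ("scaler", StandardScaler()),
    ("DT", DecisionTreeClassifier(random_state=0, max_depth=5)),
]).fit(X_demo, y_demo)

# Access the fitted tree by the step name given in the pipeline.
fitted_tree = demo_best.named_steps["DT"]

print("Depth:", fitted_tree.get_depth())
print("Leaves:", fitted_tree.get_n_leaves())

# Text rendering of the first levels of the tree.
rules = export_text(fitted_tree, max_depth=2)
print(rules)
```

The already imported `plot_tree` could be called on the same `fitted_tree` object to draw the tree instead of printing it.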


### Testing

In [20]:
#Load Test Data
test_df = pd.read_csv("testing.csv")
test_df = test_df.drop(columns=["id"])
#Create tests and score
test_y = np.array(test_df["target"])
test_X = np.array(test_df.drop(columns=["target"]))

preds = best.predict(test_X)

#Note: roc_auc_score is computed on hard class predictions here
roc_score = roc_auc_score(test_y, preds)
acc_score = accuracy_score(test_y, preds)

print(roc_score)
print(acc_score)
#name is not defined in the cells shown; the value is taken from the printed output
name = "Adewale Adeniji"
print(name, np.mean([roc_score, acc_score]))


0.5429167644854836
0.5427848101265823
Adewale Adeniji 0.542850787306033
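The ROC AUC above is computed from hard 0/1 predictions. For binary labels this reduces to the average of the true positive and true negative rates, which is likely why it sits so close to the accuracy. `roc_auc_score` is normally given scores or probabilities instead. The sketch below compares the two on synthetic data (a hypothetical stand-in, so the values do not reflect the real test set), using `predict_proba` from the pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the data (hypothetical).
X_demo, y_demo = make_classification(
    n_samples=250, n_features=200, n_informative=10, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0, stratify=y_demo
)

demo_model = Pipeline([
    ("scaler", StandardScaler()),
    ("DT", DecisionTreeClassifier(random_state=0, max_depth=5)),
]).fit(X_tr, y_tr)

# Hard class labels, as used in the notebook.
hard_preds = demo_model.predict(X_te)
# Predicted probability of the positive class (column 1).
proba_pos = demo_model.predict_proba(X_te)[:, 1]

auc_from_labels = roc_auc_score(y_te, hard_preds)
auc_from_proba = roc_auc_score(y_te, proba_pos)

print("ROC AUC from hard labels:   %.3f" % auc_from_labels)
print("ROC AUC from probabilities: %.3f" % auc_from_proba)
```

The two values can differ because the probability version uses the ranking of the test rows rather than a single decision threshold.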


### Changes Made to Improve Accuracy

What was done to try to increase accuracy and/or limit overfitting:
<ul> 
<li>First, the train_size of train_test_split was set to 0.7 instead of the default (0.75), which leaves 30% of the data (75 rows) for testing.</li>
<li>Second, the max_depth of the DecisionTreeClassifier was set to 5, as this gave the best testing accuracy (0.64) of the settings tried while reducing the training accuracy from 1.0 to about 0.93.</li>
</ul>