## Winters, Alexander (V00970263)

# Problem 4. SK Learn

### Sources:

https://scikit-learn.org/stable/modules/tree.html#tree

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

In [1]:
import numpy as np
np.random.seed(1337)

In [2]:
import pandas as pd
# Plotting support
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [3]:
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score, zero_one_loss

In [4]:
elections_df = pd.read_csv('elections_clean.csv')
elections_df.columns

Index(['votes', 'unemployment', 'med_hh_inc', 'per_capita_inc',
       'poverty_all_ages', 'deep_pov_all', 'deep_pov_children', 'population',
       'total_area', 'pop_density', 'total_male', 'total_female',
       'voter_turnout', 'democrat', 'county', 'state', 'education', 'religion',
       'age_young', 'age_adult', 'age_old', 'ethnic_male', 'ethnic_female'],
      dtype='object')

In [5]:
# Get the label vector
label_vector = elections_df.pop('democrat')

# Take only the categorical features
categorical_features = ['education', 'religion', 'ethnic_male', 'ethnic_female']

# We only want the categorical features and our label vector
elections_df = elections_df[categorical_features]
# One-hot encode the categorical features; columns= names the columns to encode
# (the second positional argument of get_dummies is prefix, not columns)
elections_df = pd.get_dummies(elections_df, columns=categorical_features)
elections_df['democrat'] = label_vector
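
As a minimal sketch of what the one-hot encoding step produces, the example below encodes a small toy frame. The frame `toy_df` and its category values are hypothetical and are not taken from `elections_clean.csv`; only `pd.get_dummies` with its `columns` parameter is the actual pandas API.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only
toy_df = pd.DataFrame({
    'education': ['high', 'low', 'high'],
    'religion': ['a', 'b', 'a'],
})

# Each category of each listed column becomes its own indicator column,
# named <column>_<category>
encoded = pd.get_dummies(toy_df, columns=['education', 'religion'])
print(sorted(encoded.columns))
```

Every row ends up with exactly one active indicator per original column, which is the representation the decision tree splits on.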

In [6]:
# Split the data 70/30
X_train, X_test, Y_train, Y_test = train_test_split(elections_df.drop('democrat', axis=1), elections_df['democrat'], train_size=0.7)
# Generate decision tree
DTree = tree.DecisionTreeClassifier(criterion='entropy')
# Train the decision tree
DTree.fit(X_train, Y_train)

# Predictions on the training and test sets
train_prediction = DTree.predict(X_train) 
test_prediction = DTree.predict(X_test)

# Calculate accuracy and error for training and testing data
train_acc = accuracy_score(Y_train, train_prediction) * 100.0 
train_err = zero_one_loss(Y_train, train_prediction) * 100.0

test_acc = accuracy_score(Y_test, test_prediction) * 100.0 
test_err = zero_one_loss(Y_test, test_prediction) * 100.0 

print("The training accuracy of the Decision-Tree: " + str(train_acc) + " %")
print("The training error of the Decision-Tree: " + str(train_err) + " %\n")

print("The validation accuracy of the Decision-Tree: " + str(test_acc) + " %")
print("The validation error of the Decision-Tree: " + str(test_err) + " %\n")

max_depth = DTree.get_depth()
print("Maximum Depth of Decision-Tree is: " + str(max_depth))

The training accuracy of the Decision-Tree: 90.82235347569286 %
The training error of the Decision-Tree: 9.177646524307137 %

The validation accuracy of the Decision-Tree: 87.8177966101695 %
The validation error of the Decision-Tree: 12.182203389830503 %

Maximum Depth of Decision-Tree is: 17


In [7]:
import graphviz
dot_data = tree.export_graphviz(DTree, out_file=None,
                               feature_names=list(X_train.columns),  # feature columns only, label excluded
                               filled=True, rounded=True,
                               special_characters=True)
graph = graphviz.Source(dot_data)
# graph.render("elections") # Renders the graph into a .pdf
# graph # Renders inside jupyter-notebook

Comparing this decision tree to the one built from scratch in Problem 2, the prediction accuracy of the two trees is very similar. However, the maximum depths of the trees differ significantly. Maximum depth depends strongly on the implementation and its parameters: by default, sklearn expands all nodes until the leaves are pure or contain fewer than `min_samples_split` samples (unless limits such as `max_depth` are specified), whereas I assume my implementation stops earlier and leaves some leaves impure.
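
The effect of the stopping rule on depth can be sketched on synthetic data. The example below is an illustration only: it uses `make_classification` rather than the elections data, and `max_depth=3` is an arbitrary choice, not a setting from Problem 2.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration (not the elections data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Default settings: nodes are expanded until the leaves are pure
# or contain fewer than min_samples_split samples
full_tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)

# max_depth caps the growth, which usually leaves some leaves impure
capped_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3,
                                     random_state=0).fit(X, y)

for name, model in [('full', full_tree), ('capped', capped_tree)]:
    # tree_.impurity holds the entropy of every node; leaves have children_left == -1
    is_leaf = model.tree_.children_left == -1
    print(name,
          'depth:', model.get_depth(),
          'max leaf entropy:', model.tree_.impurity[is_leaf].max(),
          'train accuracy:', model.score(X, y))
```

On continuous features like these, the unrestricted tree can separate every training sample. With one-hot encoded categorical features, rows that share identical feature values but carry different labels cannot be separated by any split, which would be one plausible reason a fully grown tree still shows training error.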

Furthermore, both implementations reach similar accuracy because they split on the same features and both use entropy-based splitting.
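
For reference, entropy-based splitting can be sketched as follows. This is a minimal illustration of the information-gain criterion, not the code from Problem 2 and not sklearn's internal implementation; the helper names `entropy` and `information_gain` are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, mask):
    """Entropy reduction obtained by splitting labels with a boolean mask."""
    n = len(labels)
    left, right = labels[mask], labels[~mask]
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted

y = np.array([1, 1, 0, 0])
perfect = np.array([True, True, False, False])  # separates the classes completely
useless = np.array([True, False, True, False])  # both sides keep a 50/50 mix

print(information_gain(y, perfect))  # 1.0
print(information_gain(y, useless))  # 0.0
```

A tree builder that uses this criterion picks, at each node, the candidate split with the largest gain, so two implementations given the same features tend to choose similar splits near the root.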