# Python Decision Tree Notebook

This notebook follows documentation from http://scikit-learn.org/stable/modules/tree.html


## 1. Sourcing Data

Before we do anything, we need to get our data.

Currently that data is sitting in a csv file ("hr_clean.csv" or "hr_data.csv"), and we need to read it into Python.

To do that, we are going to use one of the most popular Python libraries for data analysis (and in coding generally) - Pandas. You can learn about the story of Pandas here: https://qz.com/1126615/the-story-of-the-most-important-tool-in-data-science/

A quick Google search for how to read a csv file with Pandas might take us to the following walkthrough:
https://www.geeksforgeeks.org/python-read-csv-using-pandas-read_csv/



In [None]:
import pandas as pd
# File names are case-sensitive on some systems; use the exact name of the file on disk.
hrdata = pd.read_csv("HR_clean.csv")
hrdata.head()

## 2. Exploring Data

In [None]:
hrdata.describe()
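Since the target column here is `resigned`, it can also help to check how balanced the two classes are, because that sets the accuracy a model would reach by always guessing the majority class. The sketch below uses a tiny made-up frame (the values are hypothetical, not taken from the HR file) to show the idea.

```python
import pandas as pd

# Hypothetical stand-in for the real data: 6 employees stayed, 2 resigned.
demo = pd.DataFrame({"resigned": [0, 0, 0, 1, 0, 1, 0, 0]})

# Share of rows in each class (fractions that sum to 1).
class_share = demo["resigned"].value_counts(normalize=True)

# Accuracy of a "model" that always predicts the most common class.
majority_baseline = class_share.max()

print(class_share)
print(majority_baseline)
```

On the real data the same two lines would run against `hrdata["resigned"]`.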

Ok, so we're starting to understand our data a bit better. Summary statistics only go so far, though. To see how the columns relate to each other, it's time to visualize.

A quick google for "graphs in python" or something similar steers us to a library called matplotlib.

https://matplotlib.org/tutorials/introductory/sample_plots.html

In [None]:
%matplotlib notebook
plot1 = hrdata.plot.scatter(x = 'average_monthly_hours', y = 'satisfaction_level', c = 'resigned', colormap = 'viridis')

## 3. Building a Model

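We will build a decision tree classifier with scikit-learn, following its documentation: http://scikit-learn.org/stable/modules/tree.html

First we separate the column we want to predict ("resigned") from the columns we will use to predict it.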

In [None]:
classes = hrdata["resigned"]
features = hrdata.iloc[:,4:-1]
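The positional slice above depends on the column order of the csv file. As a possible alternative, features can be selected by name, which keeps working if columns are reordered and makes it explicit that the target is excluded. This is a minimal sketch on a made-up frame: `employee_id` is a hypothetical column, and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the HR data.
demo = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],                # hypothetical identifier column
    "satisfaction_level": [0.9, 0.2, 0.7, 0.1],
    "average_monthly_hours": [160, 280, 150, 300],
    "resigned": [0, 1, 0, 1],
})

target_column = "resigned"

# The labels we want to predict.
demo_classes = demo[target_column]

# Everything except the target and the identifier becomes a feature.
demo_features = demo.drop(columns=[target_column, "employee_id"])

print(list(demo_features.columns))
```

Dropping by name also guards against accidentally leaving the target column among the features.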

In [None]:
from sklearn import tree
model = tree.DecisionTreeClassifier(max_depth = 5)
model = model.fit(features, classes)

## 4. Evaluating the Model

In [None]:
import graphviz 

In [None]:
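dot_data = tree.export_graphviz(model, out_file=None,
                         feature_names=list(features.columns),  # plain list of column names
                         class_names=["stay", "leave"],  # assumes resigned is coded 0 = stay, 1 = leave
                         filled=True, rounded=True,
                         special_characters=True)
graph = graphviz.Source(dot_data)
graph  # renders the tree inline in the notebook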

In [None]:
model.score(features,classes)
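The score above is computed on the same rows the model was trained on, so it tends to be optimistic. One common check is to hold some rows back and score on those instead. The sketch below is an illustration under stated assumptions: the data is synthetic, and the rule that generates `resigned` is invented for the example rather than taken from the HR file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HR data (hypothetical values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "satisfaction_level": rng.random(500),
    "average_monthly_hours": rng.integers(90, 310, size=500),
})
# Invented rule: dissatisfied employees working long hours resign.
demo["resigned"] = (
    (demo["satisfaction_level"] < 0.4) & (demo["average_monthly_hours"] > 200)
).astype(int)

X = demo[["satisfaction_level", "average_monthly_hours"]]
y = demo["resigned"]

# Keep 25% of the rows aside; stratify so both splits have a similar class mix.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

holdout_model = DecisionTreeClassifier(max_depth=5, random_state=0)
holdout_model.fit(X_train, y_train)

train_accuracy = holdout_model.score(X_train, y_train)  # seen during fitting
test_accuracy = holdout_model.score(X_test, y_test)     # never seen during fitting
print(train_accuracy, test_accuracy)
```

A large gap between the two numbers is a sign of overfitting. Cross-validation repeats this idea over several different splits.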

In [None]:
# http://scikit-learn.org/stable/modules/cross_validation.html

from sklearn.model_selection import cross_val_score
train_test = cross_val_score(model, features, classes)
print(train_test)

In [None]:
train_test.mean()

In [None]:
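from sklearn.model_selection import cross_val_score

# Request 3 folds explicitly so the label below is accurate;
# recent scikit-learn releases default to 5 folds.
train_test = cross_val_score(model, features, classes, cv=3)
print('Score for 3 splits of the data (1/3 test)', train_test)
print('Average score', train_test.mean())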

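Well then, how do we know what the best model could possibly be? One approach is to try a range of tree depths and keep the one with the best cross-validated score.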

In [None]:
from sklearn.model_selection import GridSearchCV

parameters = {'max_depth':range(2,20)}

modelSearch = GridSearchCV(tree.DecisionTreeClassifier(), parameters, n_jobs=4)

modelSearch.fit(features, classes)

tree_model = modelSearch.best_estimator_

print(modelSearch.best_score_, modelSearch.best_params_) 

model = tree.DecisionTreeClassifier(max_depth=modelSearch.best_params_["max_depth"])
model.fit(features,classes)
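The search reports only the winning depth, but it can be useful to see how the score changes across all the depths that were tried. A fitted `GridSearchCV` stores this in its `cv_results_` attribute. The sketch below runs a small search on synthetic data from `make_classification` (an assumption for the example, not the HR data) and tabulates the results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification problem standing in for the HR data.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": range(2, 8)},
    cv=3,
)
search.fit(X, y)

# One row per candidate depth, with its mean and spread across the folds.
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

If several depths score within a standard deviation of each other, the shallower tree is often the easier one to interpret.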

In [None]:
# The labels are stored in `classes`; `target` was never defined.
score = model.score(features, classes)

## 5. Final Step

In [1]:
from IPython import display
display_gif = "https://media.giphy.com/media/35OOkbcHtrFr8cQD7E/giphy.gif"
if score > 0.99:
    display_gif = 'https://media.giphy.com/media/g9582DNuQppxC/giphy.gif'
display.HTML('<img src="{}">'.format(display_gif))
