## 1. Import libraries

TELL: These are tools that someone else has already written. This is how data scientists work in practice: the hard part is deciding what to do, not necessarily how to do it. Here, we'll use *pandas* to manipulate our data, *sklearn* to build our decision tree model, and *numpy* to hold the values for a single new prediction at the end.

In [1]:
import pandas as pd
import numpy as np
from sklearn import tree

## 2. Load data
TELL: We're going to load our data from the CSV file we saw before. It looks just like a spreadsheet, but we want to access the information using code rather than reading the whole thing ourselves. The data.head() line shows us the first few rows so we can get a sense of what the rest of the data holds. We COULD show the whole dataset (ONLY FACILITATOR DO: demonstrate creating a new cell, typing data, and showing the output), but that's not particularly useful.

Notice that you can scroll horizontally.

In [2]:
data = pd.read_csv("HR_data.csv")
data.head()

Unnamed: 0,department,region,termination_date,bracket_salary 1,salary,satisfaction_level,last_evaluation,number_project,average_monthly_hours,time_spend_company,work_accident,promotion_last_5years,resigned
0,technical,US,2015-06-27T00:00:00Z,low,52000,0.82,0.63,4,232,4,0,0,0
1,product_mng,Europe,2017-02-24T00:00:00Z,low,36000,0.72,0.79,4,154,3,0,0,0
2,sales,Canada,2008-12-06T00:00:00Z,medium,77000,0.71,0.88,3,140,2,0,0,0
3,support,US,2015-11-22T00:00:00Z,medium,70000,0.53,0.75,4,239,2,1,0,0
4,technical,US,2009-03-29T00:00:00Z,medium,76000,0.49,0.49,2,245,3,0,0,0


## 3. Build model
TELL: Now that we have our dataset, we're ready to begin modelling to build our automated decision tree. As we've discussed before, machine learning works by taking inputs and learning to map them to outputs. We'll use a premade tool for modelling (similar to the premade template we used in the human-powered tree).

SHOW: https://scikit-learn.org/stable/modules/tree.html and talk through example.

We need to tell the model what the inputs and outputs should be. We'll use the standard names X and Y.

Let's start with the output -- which column are we trying to predict? "resigned"
Notice that the values of the resigned column have been converted from No/Yes into 0/1. This is because our tree handles only numerical data.
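
For facilitators who want to show how such a conversion could be done, here is a minimal sketch. It assumes a hypothetical raw column holding the strings "No" and "Yes"; the workshop's HR_data.csv already stores 0/1, so this step is not needed in the notebook itself.

```python
import pandas as pd

# Hypothetical raw column; the workshop file already contains 0/1 values.
raw = pd.DataFrame({"resigned": ["No", "Yes", "No", "No"]})

# Map each label to a number so the tree can compare values.
raw["resigned"] = raw["resigned"].map({"No": 0, "Yes": 1})

print(raw["resigned"].tolist())  # [0, 1, 0, 0]
```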

We want to use all rows of the "resigned" column, which in Python we write as:
Y = data.loc[:, "resigned"]

What will we use to predict whether each person resigned? We might be tempted to use "the rest," but remember: our tree can only use numerical data. In the human-powered one, we saw there was a difference between words and numbers. For example, we could say "Department is equal to Sales," but not "Department is greater than Sales." So we want all rows, but only the columns with numerical data, which appear from the column "salary" to the column "promotion_last_5years." In Python, we write it as:

X = data.loc[:, "salary":"promotion_last_5years"]

In [3]:
# Learners write this
# Look at your data and determine the columns for your output (Y) and usable input (X)
Y = data.loc[:, "resigned"]
X = data.loc[:, "salary":"promotion_last_5years"]
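
One detail worth pointing out: when `.loc` slices by column name, both ends of the range are included, which is why "promotion_last_5years" ends up in X. The sketch below shows this on a tiny made-up table (the `toy` frame and its values are illustrative only and borrow just a few of the real column names).

```python
import pandas as pd

# Tiny made-up frame; only the column names come from the workshop data.
toy = pd.DataFrame({
    "department": ["sales", "support"],
    "salary": [77000, 70000],
    "satisfaction_level": [0.71, 0.53],
    "promotion_last_5years": [0, 0],
    "resigned": [0, 0],
})

Y_toy = toy.loc[:, "resigned"]                        # all rows, one column
X_toy = toy.loc[:, "salary":"promotion_last_5years"]  # all rows, a range of columns

# The label slice includes both endpoints and skips "department" and "resigned".
print(list(X_toy.columns))
# ['salary', 'satisfaction_level', 'promotion_last_5years']
```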

TELL: You might notice our code here is pretty minimal and matches the documentation closely. Data scientists typically try to use standard notation and naming conventions so that it's easier to collaborate and reuse code. The only difference you'll see between this code and the documentation is that we've specified that we want our tree to be at most 2 layers deep, just like our human-powered version.

In [4]:
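# Create a tree that may ask at most 2 questions in a row
clf = tree.DecisionTreeClassifier(max_depth=2)
# Learn the questions from our inputs (X) and known outcomes (Y)
clf = clf.fit(X, Y)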

## 4. Evaluate

TELL: That's it! That's it? How do you know whether this is a good tree? It'd be great to get a sense of how accurately it sorted the data.

In [5]:
clf.score(X,Y)

0.8525374376039934

TELL: That seems OK. We noted before that humans are difficult to make predictions about, so it's not surprising that with only two layers, we don't reach 100% accuracy.
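
For a classifier, `score` reports accuracy: the fraction of rows where the prediction matches the known answer. The sketch below checks this by hand on a small made-up dataset (`X_toy`, `Y_toy`, and `toy_clf` are illustrative names, not part of the workshop notebook). Note that in both the notebook and this sketch, the score is measured on the same rows the tree learned from, so it may be more optimistic than the accuracy on people the tree has never seen.

```python
import numpy as np
from sklearn import tree

# Made-up data: 6 rows, 2 numeric inputs, and a 0/1 outcome.
X_toy = np.array([[1, 10], [2, 20], [3, 30], [4, 40], [5, 50], [6, 60]])
Y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, Y_toy)

# Accuracy by hand: compare each prediction to the known answer, then average.
manual_accuracy = np.mean(toy_clf.predict(X_toy) == Y_toy)

# The built-in score should agree with the hand calculation.
print(manual_accuracy, toy_clf.score(X_toy, Y_toy))
```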

Let's see what questions were chosen.

In [None]:
import graphviz

# Export the fitted tree in DOT format and render it inline
dot_data = tree.export_graphviz(clf, out_file=None,
                         feature_names=list(X.columns),
                         class_names=["stay", "leave"],
                         filled=True, rounded=True,
                         special_characters=True)
graph = graphviz.Source(dot_data)
graph
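
If graphviz is not installed on a learner's machine, a plain-text view of the questions is a possible fallback. The sketch below uses `tree.export_text` on a small made-up dataset; the toy values and the two feature names chosen here are assumptions for illustration, and in the notebook you would pass `clf` and `list(X.columns)` instead.

```python
import numpy as np
from sklearn import tree

# Made-up data with two numeric inputs and a 0/1 outcome.
X_toy = np.array([[30000, 0.2], [40000, 0.3], [60000, 0.8], [80000, 0.9]])
Y_toy = np.array([1, 1, 0, 0])
toy_clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, Y_toy)

# Render the learned questions as indented text instead of a picture.
rules = tree.export_text(toy_clf, feature_names=["salary", "satisfaction_level"])
print(rules)
```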

ASK: Did anyone ask similar questions? We wouldn't expect the questions to be identical to any human-powered tree because we used two different data sets.

TELL: So now we're ready to make our prediction. We could manually trace our tree to predict our original question, "Will Maria quit?", or we could use code.

In [8]:
# Maria's values, in the same column order as X ("salary" through "promotion_last_5years")
maria_data = np.array([[58000, .82, .85, 4, 274, 5, 0, 0]])
clf.predict(maria_data)

array([0])
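
The output is one label per row we asked about. Since "resigned" is coded 0/1 and the tree picture used the class names "stay" and "leave" in that order, 0 reads as "stay". As a hedged sketch of how the label could be turned into words, and how confident the tree is, the example below uses `predict_proba` on a small made-up dataset (`toy_clf`, `person`, and the toy values are illustrative, not the workshop data).

```python
import numpy as np
from sklearn import tree

# Made-up training data: [salary, satisfaction_level] -> resigned (0 = stay, 1 = leave).
X_toy = np.array([[30000, 0.2], [35000, 0.3], [40000, 0.4], [60000, 0.8], [80000, 0.9]])
Y_toy = np.array([1, 1, 0, 0, 0])
toy_clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, Y_toy)

# One new person, written as a table with a single row.
person = np.array([[58000, 0.82]])

label = toy_clf.predict(person)[0]        # the predicted class, 0 or 1
proba = toy_clf.predict_proba(person)[0]  # share of training rows per class in the matching leaf

class_names = ["stay", "leave"]  # index 0 = stay, index 1 = leave
print(class_names[label], dict(zip(class_names, proba)))
```

The probabilities come from the mix of outcomes in the leaf where the person lands, so with a shallow tree they are coarse estimates rather than precise odds.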