## Decision Tree
A decision tree is a flow chart that can help you make decisions based on previous experience.

To make a decision tree with scikit-learn, all data has to be numerical.
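
One way to do this conversion is with the pandas `map()` method, sketched below on a small made-up column (the sample values are hypothetical and are not taken from data.csv). Any value that is missing from the dictionary becomes `NaN`, so it may be worth checking for unmapped values before training.

```python
import pandas

# Hypothetical sample column, for illustration only.
df = pandas.DataFrame({'Nationality': ['UK', 'USA', 'N', 'SE']})

# Dictionary describing how each text value is converted to a number.
d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)

# 'SE' is not in the dictionary, so map() converted it to NaN.
unmapped = int(df['Nationality'].isna().sum())
print(df)
print("unmapped values:", unmapped)
```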

## Gini

Use the Gini method to split the samples.

The Gini method uses this formula:

$$\text{Gini} = 1 - \left(\frac{x}{n}\right)^2 - \left(\frac{y}{n}\right)^2$$
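
As a quick check of the formula, the following minimal sketch computes the Gini value for a two-class node in plain Python. The function name `gini` and its arguments are illustrative choices, not part of scikit-learn. Here `x` and `y` are the counts of positive and negative answers, and `n = x + y` is the number of samples.

```python
def gini(x, y):
    """Gini value of a node with x positive and y negative samples."""
    n = x + y
    if n == 0:
        raise ValueError("node has no samples")
    return 1 - (x / n) ** 2 - (y / n) ** 2


print(round(gini(7, 6), 3))  # 7 "GO" and 6 "NO" answers -> 0.497
print(gini(13, 0))           # every sample has the same answer -> 0.0
print(gini(5, 5))            # answers split evenly -> 0.5
```

The last two calls show the two extremes for a two-class node: 0.0 when all samples agree, and 0.5 when they are evenly divided.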

In [None]:
# In this example, a person tries to decide whether they should go to a comedy show or not.
# We have to convert the non-numerical columns 'Nationality' and 'Go' into numerical values.
# pandas has a map() method that takes a dictionary describing how to convert the values.
# {'UK': 0, 'USA': 1, 'N': 2}

# Use the Gini method
# Gini = 1 - (x/n)**2 - (y/n)**2
# x is the number of positive answers ("GO"),
# n is the number of samples,
# y is the number of negative answers ("NO"), which gives us this calculation:
# 1 - (7 / 13)**2 - (6 / 13)**2 = 0.497

import pandas
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

df = pandas.read_csv("data.csv")

d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)

features = ['Age', 'Experience', 'Rank', 'Nationality']

# We have to separate the feature columns from the target column.
# The feature columns are the columns that we try to predict from, and the target column is the column with the values we try to predict.
X = df[features]
y = df['Go']

dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)

tree.plot_tree(dtree, feature_names=features)
plt.show()  # display the figure when running outside a notebook

# The decision tree uses your earlier decisions to estimate whether you would want to go see a comedian or not.

# Rank
# Rank <= 6.5 means that every comedian with a rank of 6.5 or lower will follow the True arrow (to the left), and the rest will follow the False arrow (to the right).

# gini = 0.497 is the Gini value of the samples at this step. With two possible results it is always a number between 0.0 and 0.5, where 0.0 means all of the samples got the same result, and 0.5 means the samples are divided exactly in the middle between the two results.

# samples = 13 means that there are 13 comedians left at this point in the decision, which is all of them since this is the first step.

# value = [6, 7] means that of these 13 comedians, 6 will get a "NO", and 7 will get a "GO".

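# We can use the decision tree to predict new values.
# Example: a 40-year-old comedian with 10 years of experience, a rank of 7, and nationality 1 (USA).
# Passing a DataFrame with the same column names as the training data avoids a feature-name warning.
new_comedian = pandas.DataFrame([[40, 10, 7, 1]], columns=features)
print(dtree.predict(new_comedian))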