## Decision Trees

In scikit-learn, decision trees are built with an optimised version of the CART algorithm. CART stands for Classification And Regression Trees, which tells us little beyond the fact that it can perform both classification and regression.
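
To give a feel for what CART does at each node, the sketch below scores a candidate binary split using Gini impurity, which is the default split criterion of scikit-learn's `DecisionTreeClassifier`. The helper names `gini` and `split_impurity` and the toy labels are illustrative assumptions, not part of scikit-learn.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Impurity of a binary split: the size-weighted average of the two children."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


# hypothetical class labels arriving at one node
parent = [0, 0, 0, 1, 1, 1]

# two candidate splits of the same six observations
mixed = split_impurity([0, 1, 0], [1, 0, 1])
clean = split_impurity([0, 0, 0], [1, 1, 1])

print(gini(parent), mixed, clean)
```

The split with the lower weighted impurity is the better one, so a tree grown this way would prefer the `clean` split.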

We have already looked at regression when we used the Linear Regression algorithm. Here we want to use the Decision Tree algorithm to create a classification model.

By classification we mean that, given the feature values of an observation, we want to place the observation into one class or another. We can have as many classes as we want, but often there are only two, making for a boolean decision.

Examples include customer churn (or not) and mortgage application acceptance (or not).

For our example we are again going to use a dataset from scikit-learn, in this case the Wisconsin breast cancer dataset. The dataset has many features but the 'target' has only two classes: 'Malignant' (0) or 'Benign' (1).
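
The encoding of the two classes can be checked directly from the dataset object. The short sketch below prints each numeric code with its name and the number of observations in that class; the variable name `class_counts` is just an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# count how many observations carry each numeric target code
class_counts = np.bincount(cancer.target)

for code, name in enumerate(cancer.target_names):
    print(code, name, class_counts[code])
```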

The process that we will follow very closely mirrors the regression approach.

Much of the work we need to do occurs before we construct the model.

The tests we do afterwards to establish how good the model is are somewhat different from those we used for regression, but they serve the same purpose.

In [None]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

cancer = load_breast_cancer()
print(cancer.DESCR)

In [None]:
df_cancer = pd.DataFrame(cancer.data,columns=cancer.feature_names)
df_cancer['target'] = pd.Series(cancer.target)
#df_cancer.head()

In [None]:
df_cancer.describe()

## Missing values

As with the Boston data there are no missing values for us to deal with.
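
A quick way to confirm this is to count the missing values in each column. This is a minimal sketch; `df_check` is a throwaway name used so that the example stands on its own.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df_check = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# isna() flags missing cells; summing gives a count per column
missing_per_column = df_check.isna().sum()
total_missing = int(missing_per_column.sum())

print(total_missing)
```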

As before, we will create a few plots of the data to aid our understanding.

In [None]:
# only need the pyplot functions
import matplotlib.pyplot as plt

# needed by jupyter to ensure that the plots appear inline (in the usual output cell)
%matplotlib inline

In [None]:
# both histograms ...
for col in df_cancer.columns:
    df_cancer[col].hist(bins = 20)
    plt.title('Histogram of ' + col)
    plt.show() 

In [None]:
#  ...   and boxplots can be useful

for col in df_cancer.columns:
    df_cancer.boxplot(column = col)
    plt.title('Boxplot of ' + col)
    plt.show() 

In [None]:
# variable correlations (with 31 columns this draws a 31 x 31 grid of plots, so it can be slow)
import seaborn as sns
sns.pairplot(df_cancer)
plt.show()

In [None]:
# we can look at the correlation between each pair of variables

corr = df_cancer.corr()

# easier to see in a csv
#corr.to_csv('cancer_corr.csv')

In [None]:
# or graphically with a heatmap 
import seaborn as sns

fig, ax = plt.subplots(figsize=(15,12))
heat_map = sns.heatmap(corr)
plt.show()

## Outliers and Normalisation 

Do nothing in this case.

The box plots show some technical outliers but nothing isolated.

Because a decision tree splits on one predictor at a time using a threshold, the scale of each predictor does not affect the splits, so there is no need to normalise the data.
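
The small sketch below illustrates the point on made-up data: the same tree is fitted once on the raw values and once after each column has been multiplied by an arbitrary constant, and the predictions agree. The arrays `X`, `y`, `scale` and `X_new` are toy values chosen for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# toy data: the first column separates the two classes cleanly
X = np.array([[1.0, 20.0], [2.0, 10.0], [3.0, 40.0],
              [4.0, 30.0], [5.0, 60.0], [6.0, 50.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# rescale each column by a very different factor
scale = np.array([1000.0, 0.001])

# unseen observations to predict
X_new = np.array([[2.5, 35.0], [4.5, 35.0]])

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X * scale, y)

pred_raw = tree_raw.predict(X_new)
pred_scaled = tree_scaled.predict(X_new * scale)

print(pred_raw, pred_scaled)
```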

## Categorical Variables

Some decision tree algorithms can deal with categorical values by potentially creating a split on each value. The CART algorithm, on the other hand, always uses a binary split, and the scikit-learn implementation only accepts numeric features, so it cannot deal with categorical values directly.

In the case of the Cancer dataset, there are in fact no categorical values anyway. However if there were we would need to introduce dummy variables to deal with them.

We shall have a slight digression to show how we could deal with this situation.



In [None]:
# Read in a dataset (very small)

df_ratings = pd.read_csv('ratings.csv')
df_ratings.head()

In [None]:
# The 'Rating' column has categorical values 

# We can create another dataframe which has a column for each of the different values in the 'Rating' column.
# The number of rows in the dataframe will match the original, and the values in the new columns will be
# either 0 or 1 (shown as False or True in recent pandas versions) depending on the original 'Rating' value.

df_dummies = pd.get_dummies(df_ratings['Rating'])

In [None]:
df_dummies

In [None]:
# If you need to create dummy variables for more than one column, it can be useful to name them with a prefix, 
# typically of the original column name

df_dummies = pd.get_dummies(df_ratings['Rating'], prefix = 'Rating')
df_dummies

In [None]:
# Next we want to combine the two dataframes and drop the original categorical column

result = pd.concat([df_ratings, df_dummies], axis=1)
result.drop(['Rating'], axis = 1, inplace=True)
result
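
The create, combine and drop steps can also be done in one call by passing the `columns` argument to `pd.get_dummies`. The sketch below uses a small made-up dataframe (`df_example`, with hypothetical 'Product' and 'Rating' columns) rather than the ratings file, and asks for integer 0/1 values with `dtype=int`.

```python
import pandas as pd

# made-up data standing in for a file with one categorical column
df_example = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'D'],
    'Rating': ['Good', 'Bad', 'OK', 'Good'],
})

# encode 'Rating' in place: the original column is replaced by its dummy columns
encoded = pd.get_dummies(df_example, columns=['Rating'], prefix='Rating', dtype=int)

print(encoded)
```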

## Splitting the data

We will do this in a similar way as we did for regression. However, because there is an imbalance in the target values, we will stratify the split on the target so that both values are fairly represented in both the training and test datasets.

In [None]:
print(df_cancer['target'].value_counts())

In [None]:
# the predictors
df_cancer_X = pd.DataFrame(df_cancer,columns=cancer.feature_names)
#df_cancer_X.describe()
# the targets
df_cancer_y = pd.DataFrame(df_cancer,columns=['target'])
#df_cancer_y.describe()

In [None]:
# now we can use the train_test_split function from sklearn

# We need to provide both the predictors and the target dataframes
# We also provide a 'test_size' value to indicate the proportion of the rows to be used for the test dataframe.
#
# Because there is an uneven split in the values of the target, we will use the stratify parameter in the train_test_split
# function to ensure that each value is represented in the same proportion in the test and train dataframes.

from sklearn import model_selection

#X_train, X_test, y_train, y_test = model_selection.train_test_split(df_cancer_X, df_cancer_y, test_size = 0.2, random_state = 42)
X_train, X_test, y_train, y_test = model_selection.train_test_split(df_cancer_X, df_cancer_y, test_size = 0.2, random_state = 42, stratify = df_cancer_y)

In [None]:
# get shape of test and training sets that have been created
print('Training Set Row Count: ', X_train.shape[0])
print('Test Set Row Count: ', X_test.shape[0])
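
To see what stratifying achieves, the share of one class can be compared across the full data, the training set and the test set. The sketch below repeats the split on the raw arrays so that it stands on its own; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42, stratify=cancer.target)

# the target is coded 0/1, so the mean is the share of class 1
overall_share = cancer.target.mean()
train_share = y_tr.mean()
test_share = y_te.mean()

print(round(overall_share, 3), round(train_share, 3), round(test_share, 3))
```

With stratification the three shares should be nearly identical; without it they can drift apart, especially on small datasets.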

## Creating the model

In this example we are going to build a decision tree model.

We are also importing the export_graphviz function to help us visualise the created decision tree.

Graphviz is an open source standalone visualisation product that you need to install. There are Windows, Mac and Linux versions available. You also need to install a couple of libraries into the Python environment:

#pip install graphviz

#pip install pydotplus


In [None]:
from sklearn import tree 
from sklearn.tree import export_graphviz
#import pydotplus
import graphviz

model = tree.DecisionTreeClassifier()


In [None]:
## We can now fit the model

model = model.fit(X_train,y_train)

In [None]:
# we can find out which features were used and how important they were to the model

d = dict(zip(cancer.feature_names, model.feature_importances_))
for key in d :
    print(key, ":", d[key])

In [None]:
# We can use graphviz to create a picture of the model
# The picture makes it very easy to understand and explain the model.
# In fact you could use it to make your own predictions.

list_col = list(df_cancer.columns)
list_col.remove('target')
#dot_data = export_graphviz(model)
graphviz.Source(export_graphviz(model,out_file=None, feature_names = list_col, impurity = True))

## Explain the tree
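
Each box in the picture is a node. A node that splits shows the test it applies (a feature name, `<=` and a threshold), the Gini impurity of the observations reaching it, how many observations that is (`samples`), and how they divide between the classes (`value`). Observations that pass the test follow the left branch. A node with no test is a leaf, and its majority class is the prediction.

If Graphviz is not available, the same rules can be printed as text. The sketch below is one way to do that with `export_text`; it fits a deliberately shallow tree (`max_depth=2`, an arbitrary choice made to keep the output short) on the full dataset, so it will not match the tree fitted above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

cancer = load_breast_cancer()

# a shallow tree keeps the printed rules short enough to read
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(cancer.data, cancer.target)

# one indented line per test, and one 'class:' line per leaf
tree_rules = export_text(small_tree, feature_names=list(cancer.feature_names))
print(tree_rules)
```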

In [None]:
# you may prefer to save the tree image as a pdf

# to output to a .dot file
with open("cancer_classifier.dot", "w") as f:
    # feature names must stay in the same order as the training columns, so do not sort them
    tree.export_graphviz(model, out_file=f, feature_names = list_col)

# Then use a command line instruction to generate the pdf in the local directory
!dot -Tpdf cancer_classifier.dot -o cancer_classifier.pdf

## How good is the model?

We are going to use two metrics to look at how good the model is:

a confusion matrix and the Receiver Operating Characteristic Area Under the Curve (ROC AUC).

## What is a confusion matrix?

A confusion matrix tabulates the actual classes against the predicted classes for every test observation, so the number of correct predictions can be compared with the total number of predictions (i.e. the number of test observations).

The confusion matrix is an n*n matrix where n is the number of classes in the classification. In our case there are only 2.

<img src="Confusion_matrix.png" />

To calculate the accuracy of the model we sum the values in the diagonal cells and divide by the sum of the values in all of the cells.
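
As a small worked example of that calculation, the sketch below builds a confusion matrix from eight made-up labels and compares the diagonal-over-total figure with scikit-learn's `accuracy_score`. In scikit-learn's layout the rows are the actual classes and the columns are the predicted classes. The lists `y_true` and `y_hat` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# made-up actual and predicted labels
y_true = [0, 0, 1, 1, 1, 0, 1, 1]
y_hat = [0, 1, 1, 1, 0, 0, 1, 1]

cm = confusion_matrix(y_true, y_hat)

# correct predictions sit on the diagonal
manual_accuracy = np.trace(cm) / cm.sum()

print(cm)
print(manual_accuracy, accuracy_score(y_true, y_hat))
```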

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

In [None]:
y_pred = model.predict(X_test)
y_pred

In [None]:
cf = confusion_matrix(y_test,y_pred)
print(cf)

In [None]:
accuracy = (cf[0,0] + cf[1,1])/(cf[0,0] + cf[0,1] + cf[1,0] + cf[1,1] )
print("Accuracy of the model is : ", round(accuracy,2))

In [None]:
# ROC_AUC

# See https://en.wikipedia.org/wiki/Receiver_operating_characteristic 
# for details

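# use the held-out test data; an unpruned tree scores almost perfectly on its own training data
probabilities = model.predict_proba(X_test)
fpra, tpra, threshold = roc_curve(y_test, probabilities[:,1])
print("ROC AUC : ", round(roc_auc_score(y_test, probabilities[:,1]), 2))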
plt.title('ROC curve')
plt.plot(fpra,tpra, 'b')
plt.plot([0,1],[0,1], 'g--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.ylabel('True Positive rate')
plt.xlabel('False Positive rate')
plt.show()

Both of our measures seem to indicate excellent results for the model. 

However, decision trees are prone to overfitting.

Exercise: What is overfitting and what can you do about it?
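
As a possible starting point for the exercise, the sketch below compares accuracy on the training data with accuracy on the test data, first for an unrestricted tree and then for one limited with `max_depth`. A large gap between the two figures is the usual symptom to look for. The depth of 3 is an arbitrary illustrative value, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42, stratify=cancer.target)

# (training accuracy, test accuracy) for each depth setting; None means no limit
scores = {}
for depth in (None, 3):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(depth, scores[depth])
```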



## Other models are available

In [None]:
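from sklearn.ensemble import RandomForestClassifier

# 1 Create an object (a classifier, because the target is a class label)
rfc = RandomForestClassifier()
# 2 fit the model (ravel turns the one-column target dataframe into the 1-d array expected)
rfc.fit(X_train, y_train.values.ravel())

# 3 try some predictions
y_pred = rfc.predict(X_test)
y_pred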

In [None]:
cf = confusion_matrix(y_test,y_pred.round())
print(cf)

In [None]:
accuracy = (cf[0,0] + cf[1,1])/(cf[0,0] + cf[0,1] + cf[1,0] + cf[1,1] )
print("Accuracy of the model is : ", round(accuracy,2))