# Data Science Ex 07 - Classification (Decision Trees)

05.04.2020, Lukas Kretschmar (lukas.kretschmar@hsr.ch)

## Let's have some Fun with Decision Trees!

In this exercise, we are going to use Decision Trees to predict the class an entry belongs to.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

## Introduction

In this introduction, we are going to look at the decision tree algorithm provided by scikit-learn.
It is an implementation of CART (Classification and Regression Trees), which basically means that the result is a binary tree (every inner node has exactly two children).
One consequence: the classifier only accepts numerical data, so categorical values cannot be used directly.
Categories with more than two values should be split into separate columns (see the end of the introduction).
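
To make the idea of a binary split more concrete, here is a minimal sketch of how a CART-style tree could score one candidate threshold with the Gini impurity (the default criterion of scikit-learn's classifier). The helpers `gini` and `split_impurity` and the toy data are made up for this illustration; the real implementation is considerably more involved.

```python
from collections import Counter


def gini(classes):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(classes)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(classes).values())


def split_impurity(values, classes, threshold):
    """Weighted Gini impurity of the two children of the split `value <= threshold`."""
    left = [c for v, c in zip(values, classes) if v <= threshold]
    right = [c for v, c in zip(values, classes) if v > threshold]
    n = len(classes)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Toy data (hypothetical): RAM in MB and the price class of six phones
ram = [512, 1024, 1500, 2500, 3000, 3900]
price = [0, 0, 0, 1, 1, 1]

print(gini(price))                       # impurity before splitting
print(split_impurity(ram, price, 2000))  # both children are pure
print(split_impurity(ram, price, 1200))  # the right child is still mixed
```

The lower the weighted impurity of the two children, the better the split; a tree of this kind picks, at every node, the feature and threshold with the best score.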

Today, we are going to create a classifier for mobile phone price classification.

### Data

In [None]:
data = pd.read_csv("./Demo_MobilePhones.csv")
print(len(data))
data.head(5)

As you can see, we have 2000 mobile phone specs and their *price_range* from `0` (low) to `3` (very high).

In [None]:
labels = ["low", "medium", "high", "very high"]
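features = data.columns.drop("price_range").tolist()  # plain list of column names; some scikit-learn versions reject a pandas Index in plot_tree()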

We now split this data into the two parts we need:
one dataset containing all the features and one containing the *price_range*.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = data.drop("price_range", axis=1)
y = data["price_range"]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8, random_state=42)
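
What `train_size` and `random_state` do can be checked on a tiny example. This is only a sketch on made-up data (`X_demo`, `y_demo` and the `Xd_*`/`yd_*` names are hypothetical), not part of the phone dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

# 80% of the rows end up in the training set, the rest in the test set
print(Xd_train.shape, Xd_test.shape)
```

With a fixed `random_state`, the same rows are selected on every run, which keeps the results of the notebook reproducible.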

### Creating Decision Trees

First, we have to import our classifier.

In [None]:
from sklearn.tree import DecisionTreeClassifier

And now we can create the model and train it.

In [None]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

As you can see, the `DecisionTreeClassifier()` comes with several **hyperparameters** that we can use to tweak the model.
We will have a look at them later.

In [None]:
y_pred = model.predict(X_test)

### Evaluating the Tree

We can use the same approaches we used in the last exercise to get feedback on the model's performance.

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

In [None]:
accuracy_score(y_test, y_pred)

We are more than 80% correct on our test set, which is not bad.

In [None]:
matrix = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(5, 5))
sns.heatmap(matrix.T, ax=ax, square=True, annot=True, fmt="d", cbar=False, xticklabels=labels, yticklabels=labels)
ax.set(xlabel="True Labels", ylabel="Predicted Labels")

Most of the time we are correct, or at least in a price range next to the expected one.
And the deviation is about the same on both sides.
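
A note on orientation: `confusion_matrix()` returns the true classes in the rows and the predicted classes in the columns, which is why the heatmap above plots `matrix.T`. The following sketch uses made-up labels (`y_true_demo`, `y_pred_demo`) to show this and how the accuracy can be read off the diagonal.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only
y_true_demo = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred_demo = [0, 1, 1, 1, 2, 3, 3, 3]

cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)    # rows: true class, columns: predicted class
print(cm.T)  # transposed as in the heatmap: rows predicted, columns true

# Correct predictions sit on the diagonal, so this ratio equals the accuracy
diag_accuracy = np.trace(cm) / cm.sum()
print(diag_accuracy, accuracy_score(y_true_demo, y_pred_demo))
```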

### Visualizing the Tree

To get a better understanding of the model, we can visualize it using the `plot_tree()` function from scikit-learn.

In [None]:
from sklearn.tree import plot_tree

In [None]:
sns.reset_orig() # We need to reset the seaborn style, otherwise we cannot see the arrows as they are plotted in white
                 #(and I was unable to find the styling information to fix that with active seaborn styling)
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(model, ax=ax, filled=True, rounded=True, feature_names=features, class_names=labels)

As you can see, we get a big tree.
This is due to the default parameter values of the model.

We can have a closer look at the model by setting `max_depth` and `fontsize`.

In [None]:
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(model, ax=ax, filled=True, rounded=True, feature_names=features, class_names=labels, max_depth=4, fontsize=10)
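
If the plot is hard to read, a text rendering can be an alternative. The sketch below uses `export_text()` from scikit-learn on synthetic data; the feature names in `demo_features` are only borrowed for illustration and have nothing to do with the real phone data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the phone data (hypothetical feature names)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
demo_features = ["ram", "battery_power", "px_width", "px_height"]

demo_model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Only the first levels are printed; deeper branches are truncated
tree_text = export_text(demo_model, feature_names=demo_features, max_depth=2)
print(tree_text)
```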

We can also see which features are more relevant for the classification than others.

In [None]:
feature_importance = pd.DataFrame(X_train.columns,columns=["feature"])
feature_importance["importance"] = model.feature_importances_
feature_importance.sort_values("importance", ascending=False)
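
How to read these numbers: the importances are normalized so that they sum to 1, and a value of 0 typically means that the feature was not used in any split. A small sketch on synthetic data (all names here are hypothetical):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 5 features, only 2 of them carry information
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=1
)
demo_model = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_demo, y_demo)

importance = pd.Series(
    demo_model.feature_importances_, index=[f"f{i}" for i in range(5)]
)
print(importance.sort_values(ascending=False))
print(importance.sum())        # normalized to 1
print((importance > 0).sum())  # number of features with a non-zero importance
```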

### Problem: Overfitting

Let's have a closer look at the parameters of the model.
These parameters are called *hyperparameters*.

In [None]:
model.get_params()

To limit our tree, we will set the following hyperparameters:
- `max_depth`
- `min_samples_leaf`
- `min_samples_split`

`max_depth` is `None` by default, which means the tree keeps growing until all leaves are pure or contain fewer than `min_samples_split` samples.
`min_samples_split` with a default value of `2` means that every node can be split as long as it holds 2 or more samples of different classes.
And `min_samples_leaf` with a default value of `1` allows the tree to contain leaves with a single sample, which then alone determines the resulting class.
So with these values, we will run into the problem of **overfitting**.

Overfitting means that the model is fitted too closely to the training data and therefore generalizes poorly to new data.
With the default parameters of `max_depth = None`, `min_samples_split = 2` and `min_samples_leaf = 1` we do exactly that.
The `model` is perfectly adapted to our training data, since the tree keeps growing until every leaf is pure, in the extreme case with a leaf for a single training entry.

In [None]:
accuracy_score(y_train, model.predict(X_train))
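
The same effect can be reproduced on synthetic data. The following sketch (all names hypothetical, data generated with `make_classification`) trains an unrestricted tree on deliberately noisy labels and compares the accuracy on the training and the test set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data: flip_y assigns random labels to 20% of the samples
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0
)

full_tree = DecisionTreeClassifier(random_state=0).fit(Xd_train, yd_train)
train_acc = accuracy_score(yd_train, full_tree.predict(Xd_train))
test_acc = accuracy_score(yd_test, full_tree.predict(Xd_test))

print(train_acc, test_acc)                              # memorized vs. unseen data
print(full_tree.get_depth(), full_tree.get_n_leaves())  # size of the unrestricted tree
```

A training accuracy that is far above the test accuracy is a typical sign that the tree has memorized the noise in the training data.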

The solution to this is pruning.

### Solution: Pruning

Pruning essentially means that we allow the tree to grow only under certain conditions (strictly speaking, limiting the growth up front is called pre-pruning).
The class of a leaf is then the class with the most samples within that leaf.
Setting `max_depth` stops the growth at a certain depth, regardless of how many samples are in the leaf.
With `min_samples_leaf`, a node is only split if each child keeps at least the specified number of samples.
And `min_samples_split` prevents a node from being split when it contains too few samples.
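
Whether the limits are respected can be verified on the fitted tree. This is a sketch on synthetic data (hypothetical names); it reads the node statistics from the `tree_` attribute of the fitted classifier.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
small_tree = DecisionTreeClassifier(
    min_samples_leaf=20, min_samples_split=40, max_depth=7, random_state=0
).fit(X_demo, y_demo)

# Leaves are the nodes without children (children_left == -1)
is_leaf = small_tree.tree_.children_left == -1
leaf_sizes = small_tree.tree_.n_node_samples[is_leaf]

print(full_tree.get_n_leaves(), small_tree.get_n_leaves())
print(small_tree.get_depth(), leaf_sizes.min())
```

Note that a node with fewer than 40 samples cannot be split into two children of at least 20 samples each, so with `min_samples_leaf=20` the setting `min_samples_split=40` adds no further restriction.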

In [None]:
model = DecisionTreeClassifier(min_samples_leaf=20, min_samples_split=40, max_depth=7)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
matrix = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(5, 5))
sns.heatmap(matrix.T, ax=ax, square=True, annot=True, fmt="d", cbar=False, xticklabels=labels, yticklabels=labels)
ax.set(xlabel="True Labels", ylabel="Predicted Labels")

Preventing the tree from growing actually helped it perform better.
We now have an accuracy of 85%.
And our tree is smaller.

In [None]:
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(model, ax=ax, filled=True, rounded=True, feature_names=features, class_names=labels)

In [None]:
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(model, ax=ax, filled=True, rounded=True, feature_names=features, class_names=labels, max_depth=4, fontsize=10)

And besides the tree size, the number of features used to predict the price range also decreased.

In [None]:
feature_importance = pd.DataFrame(X_train.columns,columns=["feature"])
feature_importance["importance"] = model.feature_importances_
feature_importance.sort_values("importance", ascending=False)

Here, we just set some values directly to get our classifier to perform better.
Finding good values for these hyperparameters is a science in itself.
We will have a look at it later on in this course.
Defining them by trial and error is one approach, but certainly not the most efficient.

But for now, you just need to know that usually every model comes with its own set of hyperparameters and that you can change them to get better results faster.
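
As a rough idea of what a slightly more systematic trial and error could look like, the following sketch tries a few values for `max_depth` and compares them with `cross_val_score()`. Cross-validation has not been introduced in this exercise, and the data and all names are assumptions made for this illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data as a stand-in for a real dataset
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)

scores = {}
for depth in [2, 4, 6, 8, None]:
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Mean accuracy over 5 cross-validation folds
    scores[depth] = cross_val_score(candidate, X_demo, y_demo, cv=5).mean()

best_depth = max(scores, key=scores.get)
print(scores)
print(best_depth)
```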

### Using the Model

We have another dataset that contains only mobile phone specs without any price ranges.
And we can now use our trained model to predict them for these phones.

In [None]:
data_test = pd.read_csv("./Demo_MobilePhones_Test.csv")
data_test.head()

In [None]:
price_range = pd.Series(model.predict(data_test), name="price_range")
data_test_p = pd.concat([price_range, data_test], axis=1)
data_test_p.head(5)

### Dummy Variables

As you have seen, the data used in the introduction only contained numbers.
And the decision tree classifier used here can only work with numbers.
In many datasets, however, we also have categorical data (e.g. [weak, strong], [low, high], [friend, neutral, enemy]).

If we encounter such data, we need to transform the values into numbers.
One possible way is to create a dictionary that assigns a number to every value and then replace each value with its number (you've already seen that in previous exercises and introductions).
This approach works well when a category has many values.
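
As a short reminder, the dictionary approach could look like the following sketch. The series `threat` and the mapping `threat_codes` are made up for this illustration.

```python
import pandas as pd

# Made-up categorical data and a hand-written mapping
threat = pd.Series(["friend", "neutral", "enemy", "friend"])
threat_codes = {"friend": 0, "neutral": 1, "enemy": 2}

encoded = threat.map(threat_codes)
print(encoded.tolist())
```

Values that are missing from the dictionary become `NaN` with `map()`, so it is worth checking the result for `NaN` values afterwards.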

Another possibility is to create a column per value.
Pandas offers a function to achieve this.
With `pd.get_dummies(data)` [Reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) you get a column for every value that occurs in `data`.

In [None]:
eg = pd.Series(list("abcabc"))
eg

In [None]:
pd.get_dummies(eg)

As you can see, every value got its own column.
And within a column, you only have `0` or `1` (recent pandas versions show `False` and `True` instead, unless you pass `dtype=int`).
This is a nice feature if you don't want to encode the values as numbers but prefer a separate column for each of them.

*Hint:* You can also set a prefix for the columns.

In [None]:
pd.get_dummies(eg, prefix="char")

## Exercises

### Ex01 - Password Strength

In this exercise, you are going to train a classifier that can say if your password is weak or strong.
First, load **Ex07_01_Data.csv**.
This file contains the data you'll use for training and testing.

How many passwords are in the dataset?

Create the labels array containing *weak*, *medium* and *strong*.

Create the features array without *strength* and *password*.

Split the data into two parts.
One dataframe containing all the values of the features and one only containing the *strength*.

Create the train (2/3) and test sets.

Train the model and predict the strength of the test set.
What's the accuracy of your model?

Plot the confusion matrix.

Plot the entire tree of your model.

As you can see, the strength is solely based on one feature.
So, the enforced rules (lower case, upper case, digit and special character) are just there to introduce some complexity.

#### Bonus: Do the exercise without the *length*.

The following method creates a new dataframe for a given password.
Use this to predict the following passwords.

In [None]:
import re

def decodePassword(pw):
    data = pd.DataFrame()
    data["password"] = pd.Series(pw)
    data["length"] = pd.Series(len(pw))
    data["lower"] = pd.Series(len(re.findall("[a-z]", pw)))
    data["upper"] = pd.Series(len(re.findall("[A-Z]", pw)))
    data["digits"] = pd.Series(len(re.findall("[0-9]", pw)))
    data["special_chars"] = pd.Series(len(re.findall("[^a-zA-Z0-9]", pw)))
    data = data.set_index("password")
    return data

- "asdf"

- "Admin1234"

- "WellThisPWisVeryLong"

- "$bwZKaw.T34o2!"

#### Solution

In [None]:
# %load ./Ex07_01_Sol.py

### Ex02 - Home Loans

In this exercise, you'll try to predict if a customer is eligible to get a home loan.
First, load **Ex07_02_Data.csv**.
*LoanStatus* contains the information if a customer got a loan (`1`) or not (`0`).

Create arrays for the labels and features.
Ignore the *Loan_ID*.

Create the train (70%) and test sets.

Create the model (`min_samples_leaf = 24`), train it, predict the classes for the test set and show the accuracy score.

74% isn't that bad. Let's have a look at the confusion matrix.

Plot the tree of the model.

Do you see a problem with the decision making process?

Since the class is far from unambiguous in several leaves, you can report the probability per class instead.
But first, load another dataset (**Ex07_02_Data_Use.csv**) that has no *LoanStatus* assigned.

Predict the probabilities with `predict_proba()`.
Don't forget to ignore *Loan_ID*.
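
One detail that helps with the next steps: the columns returned by `predict_proba()` follow the order of `model.classes_`, and every row sums to 1. The sketch below shows this on made-up toy data (one hypothetical feature, label `1` = loan granted); it is not the exercise dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): one feature, label 1 = loan granted, 0 = not granted
X_demo = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y_demo = np.array([0, 0, 1, 0, 1, 1, 0, 1])

# min_samples_leaf=4 only allows one split (4 samples on each side)
demo_model = DecisionTreeClassifier(min_samples_leaf=4, random_state=0).fit(X_demo, y_demo)
proba = demo_model.predict_proba(X_demo)

print(demo_model.classes_)  # column order of predict_proba
print(proba[0], proba[-1])  # class shares of the leaf each sample falls into
```

The probabilities are simply the class shares within the leaf, which is why impure leaves yield values between 0 and 1.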

Pack the resulting probabilities in a new `DataFrame` and assign column names.
Try to have *Yes* before *No*.

Attach this information to the dataset.
Try to have it right after the *Loan_ID* column.

Congratulations!
The model isn't perfect, but with this approach you can support the decision-making process of the home loan consultant.

#### Solution

In [None]:
# %load ./Ex07_02_Sol.py

### Ex03 - Hotel Bookings (Part 1)

In this and the next exercise, you are going to predict if a hotel guest will cancel the reservation.
But before you can build the model (next exercise), you need to process the provided data.
There is a preprocessed dataset available for the next exercise.
So you don't have to succeed here to go on.

Let's start: load **Ex07_03_Data.csv**.

Check if there are any `NaN` values.

If there are, fill them with `0`.
And don't forget to transform the columns to `int` after you have removed the `NaN` values.
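
The reason for the second step: a column that contains `NaN` is stored as `float`, so the conversion to `int` only works once the gaps are filled. A minimal sketch on a made-up frame (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up frame with one gap
demo = pd.DataFrame({"children": [0.0, np.nan, 2.0], "adults": [2, 1, 2]})
print(demo.isna().sum())  # number of NaN values per column

# Fill the gap first, then convert the float column to int
demo["children"] = demo["children"].fillna(0).astype(int)
print(demo.dtypes)
```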

Display some information about the dataset with `info()`.

As you can see, there are several columns that do not contain numbers.
It's now your task to transform them into a format the decision tree classifier can read.

Don't forget to always remove the old column when you add new columns.

*Hint:* If you have to use `pd.get_dummies()`, your code should look like
```python
pd.concat([df.drop(column, axis=1), pd.get_dummies(df[column])], axis=1)
```
*df* is the variable containing the dataset.
And *column* stands for the name of the column you need to transform.
The first part of the concat takes your data without the column you are going to transform, and the second part is the transformation of the column.
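
Applied to a tiny made-up frame, the pattern behaves as follows. The names `df_demo` and `nights` are hypothetical, and `dtype=int` is only passed so that the new columns contain `0`/`1` in every pandas version.

```python
import pandas as pd

# Made-up frame with one categorical column
df_demo = pd.DataFrame({"nights": [2, 5, 1], "meal": ["BB", "HB", "BB"]})
column = "meal"

# Drop the old column and append one new column per value
df_demo = pd.concat(
    [df_demo.drop(column, axis=1), pd.get_dummies(df_demo[column], dtype=int)],
    axis=1,
)
print(df_demo)
```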

First, transform the *hotel* column with `pd.get_dummies()` and attach the new columns to the end of the dataset.

Now, get rid of the month names.
We need the number of the month, not its name.
The following dictionary will help you with that.

In [None]:
import calendar
# Map month name -> month number; calendar.month_name[0] is an empty string and is skipped
months = {name: number for number, name in enumerate(calendar.month_name) if name}
months

Now, transform the *meal* column.
Again into multiple columns with `get_dummies()`.

Do the same with *distribution_channel*.
But this time, add a prefix of "dist_ch".

And also for *deposit_type*.
But this time without a prefix again.

And finally, replace the *customer_type* the same way.

Check your dataset again.
Only the room columns should be left.

So, replace the rooms with numbers.
But since there are many different room names, you'll replace the actual values with numbers instead of creating a column per value.
The following dictionary will help you.

In [None]:
import string
rooms = dict(zip(string.ascii_uppercase, range(1,27)))
rooms

Do a final check whether all the values within your dataset are now numbers.

Congratulations!
You completed this challenge as well.
Now, go on with the next exercise and work with this dataset.

#### Solution

In [None]:
# %load ./Ex07_03_Sol.py

### Ex04 - Hotel Bookings (Part 2)

In this exercise, you can use your final dataset of the previous exercise or load **Ex07_04_Data.csv**.

In [None]:
# Load the data in here if needed

Create two arrays, one containing the labels *No* and *Yes* for cancellation (or you can choose your own values).
And the other with the features.

Create the train (75%) and test sets.

Create the model, train it and predict the cancellations.
Show the accuracy score of your model.

Now, plot the confusion matrix.

And finally, show the first 4 levels of the resulting tree.

#### Solution

In [None]:
# %load ./Ex07_04_Sol.py