# Data Science Ex 08 - Classification (Ensemble Methods)

19.04.2023, Lukas Kretschmar (lukas.kretschmar@ost.ch)

## Let's have some Fun with Random Forests and Parameter Optimization!

In this exercise, we'll look at some advanced methods for classification and for Data Science in general.
Specifically, we'll introduce the concepts of **bagging** and **random forests** (multiple decision trees).
You'll also get to know an approach for finding good values for the **hyperparameters** of your models.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

## Introduction

### Ensemble Methods

In the previous exercises, we ran into several problems.
We encountered *overfitting* (the classifier is fitted so closely to the training set that it generalizes poorly to new data) and saw that some features have a huge impact on the result (and therefore might point in the wrong direction).
We also know that slightly different training data results in different classifier models.
Usually, the results vary to some degree, but they shouldn't be far apart.

These problems and differences mainly arise because we use only one model that is trained once with a specific training set.
One way to mitigate these effects is an *ensemble method*.
Basically, we train many classifiers, each on a subset of the data and features.
The resulting class for a new data point is then the class that the majority of these models return.

This approach is called *bagging*, and if we use bagging with decision trees, the classifier is called a *random forest* (I mean, obviously many trees in the same area are called a forest).
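
The voting idea itself fits in a few lines. The following is only a sketch with made-up predictions (the `votes` list and the `majority_vote` helper are illustrative and not part of scikit-learn):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class label that occurs most often in the predictions."""
    # most_common(1) returns a list holding the single most frequent (label, count) pair
    label, _count = Counter(predictions).most_common(1)[0]
    return label

# Hypothetical predictions of five classifiers for one new data point
votes = ["medium", "high", "medium", "medium", "low"]
print(majority_vote(votes))  # medium
```

Note that the scikit-learn ensembles used below usually average the predicted class probabilities of their members instead of counting hard votes, but the intuition is the same.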

In the last exercise, we tried to predict the price ranges for mobile phones based on their specs.
And with a simple decision tree classifier we got an accuracy of 80%.

In [None]:
from sklearn.model_selection import train_test_split

data = pd.read_csv("./Demo_MobilePhones.csv", sep=";")
labels = ["low", "medium", "high", "very high"]
features = data.columns.drop("price_range")

X = data.drop("price_range", axis=1)
y = data["price_range"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8, random_state=42)

data.head(5)

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

In the last exercise, we also used pruning to increase the accuracy.
But here, we ignore pruning and go with the default values of the classifier.

#### Bagging

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html

Let's see if we can increase the accuracy from above.
A general bagging classifier can be imported from the `ensemble` module.

In [None]:
from sklearn.ensemble import BaggingClassifier

The application is the same as with every other classifier you saw.
There are `fit()` and `predict()` methods that we can use.
As you can see below, the first argument of the `BaggingClassifier` is the model that should be used within the ensemble.
It is also possible to parameterize the used classifier (here a `DecisionTreeClassifier`).

In [None]:
submodel = DecisionTreeClassifier()
model = BaggingClassifier(submodel, n_estimators=100, random_state=42)
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

Above, we combined bagging with decision trees.
`n_estimators=100` means that we trained 100 different trees, each on its own random sample of the training data.
And as you can see, the ensemble performs noticeably better than the single decision tree from the beginning.

Let's see if we can do even better.
We can limit the training data for every tree to 80% by setting `max_samples=.8`.

In [None]:
submodel = DecisionTreeClassifier()
model = BaggingClassifier(submodel, n_estimators=100, max_samples=.8, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

And we increased the accuracy again.

As you saw above, the `BaggingClassifier` comes with its own set of hyperparameters.
We used `n_estimators` and `max_samples`.
Another interesting hyperparameter is `max_features`.
With this one, we can specify how many features should be used per tree.
This can further increase the accuracy of the ensemble method since every model has its own training set with a subset of all features.
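
If you want to check which rows and features each tree actually received, a fitted `BaggingClassifier` exposes `estimators_samples_` and `estimators_features_`. Here is a small sketch on synthetic data (the data from `make_classification` is an assumption and only stands in for the phone dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 500 rows, 20 features, 4 classes
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=8, n_classes=4, random_state=42
)

bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=10,
    max_samples=.8,    # every tree draws 80% as many rows as the training set has
    max_features=.8,   # every tree sees 80% of the features
    random_state=42,
)
bag.fit(X_demo, y_demo)

# Inspect the rows and columns used by the first three trees
for i in range(3):
    rows = bag.estimators_samples_[i]
    cols = bag.estimators_features_[i]
    print(f"Tree {i}: {len(rows)} sampled rows ({len(set(rows))} distinct), {len(cols)} features")
```

Because rows are drawn with replacement by default (`bootstrap=True`), the number of distinct rows per tree is typically lower than the number of sampled rows.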

In the next example, we'll use 80% of all features.
And we get into the same range of accuracy as we saw before.

In [None]:
f"Number of features: {len(features)} --> 80% => {int(len(features)*.8)}"

In [None]:
model = BaggingClassifier(submodel, n_estimators=100, max_features=.8, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

#### Random Forest

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Since the combination of bagging and a decision tree classifier is common, there is a specific classifier for that.
And the `RandomForestClassifier` comes with its own set of hyperparameters.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.get_params()

In [None]:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
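
One difference to a `BaggingClassifier` around decision trees is worth knowing: a random forest draws a new random subset of features at every split, while `max_features` of the `BaggingClassifier` draws the subset once per tree. If you want to compare the three approaches on more than a single train/test split, cross validation is one option. The following sketch uses synthetic data as a stand-in (an assumption), so the numbers will not match the phone dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 600 rows, 20 features, 4 classes
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=4, random_state=42
)

candidates = {
    "single tree": DecisionTreeClassifier(random_state=42),
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Mean accuracy over 5 cross-validation folds for every candidate
scores = {}
for name, clf in candidates.items():
    scores[name] = cross_val_score(clf, X_demo, y_demo, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```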

#### Disassemble the Forest

Both the `BaggingClassifier` and the `RandomForestClassifier` allow you to access the models that were trained and used for the prediction via the `estimators_` attribute.

In [None]:
print(len(model.estimators_))
model.estimators_[:5]

And you can plot them with the same method you used in the last exercise.

In [None]:
from sklearn.tree import plot_tree

In [None]:
models = model.estimators_

fig, ax = plt.subplots(2,2,figsize=(20, 20))
# list(features): some scikit-learn versions only accept a plain list for feature_names
plot_tree(models[0], ax=ax[0,0], filled=True, rounded=True, feature_names=list(features), class_names=labels, max_depth=2, fontsize=10)
ax[0,0].set(title="Estimator #1")
# We can create a dictionary holding all the styling parameters
tree_style = {"filled":True, "rounded":True, "feature_names":list(features), "class_names":labels, "max_depth":2, "fontsize":10}
# And use it as an argument at the end of the method
plot_tree(models[32], ax=ax[0,1], **tree_style)
ax[0,1].set(title="Estimator #33")
plot_tree(models[65], ax=ax[1,0], **tree_style)
ax[1,0].set(title="Estimator #66")
plot_tree(models[99], ax=ax[1,1], **tree_style)
ax[1,1].set(title="Estimator #100")
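
If a plot is too large to read, a text dump of a single tree can be an alternative. A minimal sketch with `export_text` on synthetic data (the data and the `feature_*` names are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic stand-in data (assumption) with generic feature names
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Top levels of the first tree as indented text; deeper branches are truncated
tree_text = export_text(forest.estimators_[0], feature_names=feature_names, max_depth=2)
print(tree_text)
```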

### Hyperparameter Optimization

Finding good values for hyperparameters is a science in itself.
In these exercises, we already get good results with the default values.
And we can simply test some combinations by hand or with `for` loops.

But depending on the data we have, the model we want to use and the accuracy of the default values, it might be necessary to automate the process of finding good values for the hyperparameters.

A simple solution is the scikit-learn [grid search algorithm](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) that tries all possible combinations for a given set of hyperparameters.

In [None]:
from sklearn.model_selection import GridSearchCV

The algorithm offers many parameters that we can use to configure its execution.
The most interesting ones for us are:
- `estimator`: The model we want to find good hyperparameter values for.
- `param_grid`: A dictionary containing the hyperparameters we want to test and the values to try for each of them.
- `cv` (optional): The number of cross-validation folds the algorithm should use. Default is `5`.
- `n_jobs` (optional): Number of parallel jobs. The default `None` usually means `1`, `-1` means all available processors (= 100% CPU usage).

Let's run a grid search for the given problem.
We want to limit the number of features considered per split (`max_features`), the depth of the trees (`max_depth`) and the minimum number of samples a node needs before it may be split further (`min_samples_split`).

In [None]:
params = {
    "max_features": [12, 16, "sqrt", None],   # 12, 16, sqrt(n_features) or all features ("auto" was removed in newer scikit-learn versions)
    "max_depth": [4, None],                   # 4 levels, or full depth
    "min_samples_split": [2, 10, 18]
}
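
To get a feeling for how much work such a grid means, you can count the combinations yourself. This is only a sketch with the standard library; `demo_params` and `cv_folds` are illustrative names that mirror a grid like the one above:

```python
from itertools import product

demo_params = {
    "max_features": [12, 16, "sqrt", None],
    "max_depth": [4, None],
    "min_samples_split": [2, 10, 18],
}
cv_folds = 5

# Every combination takes one value from each list
combinations = list(product(*demo_params.values()))
n_fits = len(combinations) * cv_folds

print(len(combinations))                         # 4 * 2 * 3 = 24 combinations
print(n_fits)                                    # 24 * 5 = 120 model fits
print(dict(zip(demo_params, combinations[0])))   # the first combination as a dictionary
```

With the default `refit=True`, `GridSearchCV` adds one more fit at the end to train the best combination on all the data it was given.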

The preparation for the grid search algorithm looks like the following:

In [None]:
submodel = RandomForestClassifier(n_estimators=100, random_state=42)
grid = GridSearchCV(submodel, params, cv=5, n_jobs=4)

And to run the algorithm, we call the `fit()` method.
Depending on the number of hyperparameters we test, and their ranges, the execution can take a while.
Thus, the usage of `n_jobs` could be a wise choice.

*Note:* Since the algorithm uses cross validation internally, we can use the whole dataset (`X`) and not just the training set (`X_train`).

In [None]:
grid.fit(X, y)

When the algorithm is finished, we can get the best combination from the `best_params_` attribute.

In [None]:
grid.best_params_

The cross validation score for the best model can be found in the `best_score_` attribute.

In [None]:
grid.best_score_
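
If you are curious how the other combinations performed, the fitted search object also stores all results in `cv_results_`. A small sketch on synthetic data with a deliberately tiny grid (data, grid and variable names are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=42)

demo_grid = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    {"max_depth": [4, None], "min_samples_split": [2, 10]},
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

# One row per tested combination, best rank first
results = pd.DataFrame(demo_grid.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
```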

And we can even get a configured model from the `best_estimator_` attribute.

In [None]:
model = grid.best_estimator_
model

In [None]:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

As you can see, we were able to increase the accuracy of our classifier.
And with other parameters or broader ranges, further improvements might still be possible.
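
One caveat: the grid search above saw the whole dataset, including the rows of the test set, so the accuracy on `X_test` may be slightly optimistic. A stricter variant runs the search on the training data only and uses the test set just once at the end. A sketch on synthetic data (data, grid and variable names are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=.8, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    {"max_depth": [4, None], "min_samples_split": [2, 10]},
    cv=3,
)
search.fit(X_tr, y_tr)   # the test rows are never seen during the search

# With the default refit=True, predict() uses the best model refitted on X_tr
holdout_accuracy = accuracy_score(y_te, search.predict(X_te))
print(search.best_params_)
print(f"Hold-out accuracy: {holdout_accuracy:.3f}")
```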

## Exercises

### Ex01 - Home Loans

In the last exercise, you trained a classifier to suggest if a person is eligible for a loan.
And you reached an accuracy of 74%.

Now you'll do the same classification again, but this time you'll use a `RandomForestClassifier`.

Load the file **Ex08_01_Data.csv**.

One of the problems identified in the last exercise was that the decision tree basically denied loans for every new customer.
Thus, let's see if we have a problem with class imbalance.
To do so, count the values in `LoanStatus` and show them in a bar chart.

So, class imbalance is probably an issue within this dataset.
Thus, build a new dataset that takes all the accepted applications (`LoanStatus==1`) and adds 400 entries of denied loan applications (`LoanStatus==0`).
Then plot the bar chart again.

*Hint:* `sample()` (Ex04) and `pd.concat()` (Ex02) may come in handy.
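
In case you don't remember how the two fit together, here is a sketch on a made-up toy frame (the columns `value` and `label` are hypothetical and not the loan data): keep all rows of one class and draw a fixed number of rows from the other.

```python
import pandas as pd

# Toy frame (assumption): 3 rows with label 1, 7 rows with label 0
toy = pd.DataFrame({"value": range(10), "label": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]})

kept = toy[toy["label"] == 1]                                # keep all rows of this class
drawn = toy[toy["label"] == 0].sample(n=3, random_state=42)  # draw 3 of the 7 rows without replacement

balanced = pd.concat([kept, drawn])
print(balanced["label"].value_counts())
```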

Create a train set with 80% of the resampled data.
And don't forget to drop the `Loan_ID` column.

Create variables containing the labels and features.

Create your `RandomForestClassifier` and train it.
Use 100 internal trees and `min_samples_split=24`.

Predict the classes and show the accuracy score.

With the new classifier, you were able to increase the accuracy by a few percentage points.

Plot the confusion matrix for the model.

Now, draw the top 3 levels of 6 decision trees in a 2x3 grid (like in the introduction).
You are free to choose which 6 trees you want to plot.

#### Solution

In [None]:
# %load ./Ex08_01_Sol.py

### Ex02 - Marketing Campaign

In this exercise you are going to analyse marketing data of a bank.
The goal is to predict whether a customer will open a deposit account or not if targeted by a campaign (`campaign_success`).

First, load the data from **Ex08_02_Data.csv**.

Check if there is a problem with class imbalance.
Plot a bar chart for that.

As you can see, there are far more data points for failed campaigns than for successful ones.
Thus, you need to upsample the successful cases.
Generate 35000 successful data points from the existing ones, take all the failed cases and show that the class imbalance is gone in your new dataset.
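
Upsampling means drawing with replacement, so existing rows get repeated. A sketch on a made-up toy frame (the columns `value` and `label` are hypothetical and not the campaign data):

```python
import pandas as pd

# Toy frame (assumption): 2 rows with label 1, 8 rows with label 0
toy = pd.DataFrame({"value": range(10), "label": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]})

rare = toy[toy["label"] == 1]
common = toy[toy["label"] == 0]

# replace=True allows drawing more rows than exist, so rows are repeated
rare_upsampled = rare.sample(n=len(common), replace=True, random_state=42)

upsampled = pd.concat([common, rare_upsampled], ignore_index=True)
print(upsampled["label"].value_counts())
```

Keep in mind that copies of the same row can end up in both the train and the test set when you upsample before splitting, which tends to make the test accuracy look better than it is.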

Create the labels and features collection.

Create your train (80%) and test sets.

Create your `RandomForestClassifier` using 100 trees and 80% of the samples per tree.
What's the accuracy?

Nice!
Over 95% - great work.
Show the confusion matrix to see where the errors are.

That looks quite good.
Thus, create a new model (with the same hyperparameters used above) and train it on the full dataset.

Load another dataset (**Ex08_02_Data_Use.csv**) that has no information on the campaign success.

Predict the probabilities of a campaign's success for these new data points.
Use the `predict_proba()` method and add the new columns to the beginning of the dataset.
The column containing the success probability should be the first column.
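
`predict_proba()` returns one column per class, in the order given by the `classes_` attribute of the model. The following sketch shows the mechanics on synthetic data (the data and the column names `p_success` and `p_failure` are assumptions, not the campaign dataset):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 250 rows for training, 50 "new" rows
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
train = pd.DataFrame(X_demo[:250], columns=[f"f{i}" for i in range(5)])
new_data = pd.DataFrame(X_demo[250:], columns=train.columns)

clf = RandomForestClassifier(n_estimators=20, random_state=42).fit(train, y_demo[:250])

proba = clf.predict_proba(new_data)         # shape (n_rows, n_classes), columns ordered like clf.classes_
success_col = list(clf.classes_).index(1)   # position of class 1 ("success")

# DataFrame.insert() places a new column at the given position
scored = new_data.copy()
scored.insert(0, "p_success", proba[:, success_col])
scored.insert(1, "p_failure", 1 - scored["p_success"])
print(scored.head())
```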

Now, to not lose time, sort the data points by their expected success with the most likely customer to sign up for a new account on top of the list.

Great work!
Your job is done.
Now you can send the sheet to your marketing department. It's their job to reach out to these customers.

#### Solution

In [None]:
# %load ./Ex08_02_Sol.py

### Ex03 - eCommerce

Let's assume you run an eCommerce business.
And you want to know when people buy something from your website.

For every visitor on your site, you log certain information (e.g. which pages were visited, how long was the visitor on a page, when was that, did you already know the visitor, etc).
Based on this information, you'll try to estimate if the visitor will actually order something.

The data is logged in **Ex08_03_Data.csv**.

The `Revenue` column is what you want to know.
But before you can go on, check for class imbalance.
Show it with a bar chart.

As you can see, you have to fix this problem first.
Generate 10,000 data points where a visitor actually ordered something, and combine them with all the data points where the visitor left without any purchase.
Show the result again as a bar chart to be sure the class imbalance is gone.

Create the dataset containing the features and the series containing the `Revenue` column.

Now, instead of just training one `RandomForestClassifier`, do a grid search with:
- `max_features`: `[8, 16, "sqrt", None]`
- `max_samples`: `[.8, None]`
- `min_samples_split`: `[2, 10, 18]`

What's the best combination of values for your hyperparameters? 

What's the cross validation score of the best model?

Get the best model and train it with the given data.

Now, load **Ex08_03_Data_Use.csv**.
This file contains new logs of visitors on your page.

Predict the probability that a visitor will buy something from you.
And show the value in the first column.

Congratulations, you've created a classifier that predicts if a visitor will buy something.
Now you can optimize your website to increase the likelihood of a purchase for those who are likely to buy something.

#### Solution

In [None]:
# %load ./Ex08_03_Sol.py