## Class 6: Random Forests
---

A few examples are reproduced or adapted from

https://github.com/jakevdp/PythonDataScienceHandbook

The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT).


# 1.  $\underline{{\rm Random\ Forest\ Classifier}}$

In [None]:
#Let's start with our imports, you might notice some new ones
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, StratifiedKFold, KFold, cross_validate
from sklearn import metrics
from sklearn.metrics import ConfusionMatrixDisplay

## 1.1 Why choose an ensemble method (such as a RF)

Let's make up some data:

In [None]:
from sklearn.datasets import make_blobs

In [None]:
pos, colors = make_blobs(n_samples=500, centers=4,
                  random_state=0, cluster_std=1.5)
plt.figure(figsize=(10,10))
plt.scatter(pos[:, 0], pos[:, 1], c=colors, s=50, cmap='rainbow');

How should we attack this as a classification problem? But first, why might we need an ensemble learning method and not just a single decision tree?

We can see that there are several areas of overlap. How might one single tree make these splits?

Let's do a 5-fold cross-validation with a decision tree to get an idea of the performance and whether we are suffering from high bias or high variance.

In [None]:
#I'm going to make a random seed to use throughout my whole notebook!
seed = 5

In [None]:
model = DecisionTreeClassifier()

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=seed)

In [None]:
scores = cross_validate(model, pos, colors, cv=cv, scoring = 'accuracy', \
                        return_train_score = True)
scores

In [None]:
test = scores['test_score']
train = scores['train_score']

print('Test scores:', test.mean(), test.std())
print('Train scores:', train.mean(), train.std())

Let's take a look at our scores. Are they within one standard deviation of each other? What do we think about the difference between the training and test scores?

The classes are balanced, so imbalance isn't the issue here, and the near-perfect training score tells us bias is low. The gap between the training and test scores, however, points to high variance (overfitting). This is where a Random Forest could come in handy!

Let's do the same thing as above but using a Random Forest with default parameters, apart from setting n_estimators=50.

In [None]:
model_rf = RandomForestClassifier(n_estimators=50)

In [None]:
scores = cross_validate(model_rf, pos, colors, cv=cv, scoring = 'accuracy', \
                        return_train_score = True)
scores

In [None]:
test = scores['test_score']
train = scores['train_score']

print('Test scores:', test.mean(), test.std())
print('Train scores:', train.mean(), train.std())

Ok, while those aren't great scores, we can see that the RF performed slightly better.

#### Averaging over an ensemble of trees, as a Random Forest does, lessens overfitting.

## 1.2 Hyperparameter tuning with a Grid Search

Scikit-learn comes with great built-in functions that search the space of hyperparameters to find the combination that results in the best model. Let's see how it's implemented.

We begin by creating a dictionary that holds all of the hyperparameters we want to explore.

In [None]:
### LET'S DESCRIBE THE SYNTAX
hyperparam_grid = {
'max_depth': [7,15],
#'max_features': [2, 4],
'min_samples_leaf': [2, 5, 10],
#'min_samples_split': [2,3, 5],
'n_estimators': [50, 100]
}
### AND THERE ARE SO MANY MORE OPTIONS

#### Note: The hyperparameters one typically tunes in a RF are: n_estimators, min_samples_leaf, min_samples_split, max_depth, and max_features.

n_estimators is the number of trees. Increasing the number of trees is typically good, but at some point the score stops improving, and more trees = more time.

max_features is a good parameter to explore (it's the size of the random subset of features considered at each split), but in this data set there are only two features, so not much fun.

min_samples_split is the minimum number of samples a node must contain before it can be split, and min_samples_leaf is the minimum number of samples that must end up in each leaf node. Setting one of these parameters to a higher number (their default values are 2 and 1, respectively) is a great way to avoid overfitting.

max_depth is the maximum depth of a tree, i.e. the largest number of splits between the root and any leaf. This is also a good parameter to tune to avoid overfitting.
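Before launching a search it helps to know how much work we are asking for. Here is a minimal sketch that counts the combinations in the grid above using scikit-learn's `ParameterGrid`; the variable names `n_candidates` and `n_folds` are just for illustration, and `n_folds = 5` assumes the same 5-fold `cv` object we defined earlier.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above (the commented-out entries are left out)
hyperparam_grid = {
    'max_depth': [7, 15],
    'min_samples_leaf': [2, 5, 10],
    'n_estimators': [50, 100],
}

# Every combination of the listed values is one candidate model
n_candidates = len(ParameterGrid(hyperparam_grid))  # 2 * 3 * 2 combinations
n_folds = 5                                          # assumed: KFold(n_splits=5)

print('candidate combinations:', n_candidates)
print('total model fits:', n_candidates * n_folds)
```

Every value we add to a list multiplies the number of fits, so uncommenting `max_features` and `min_samples_split` would make the search noticeably longer.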

#### Now we create a variable that will hold the result of our grid search, like so:

In [None]:
search_1 = GridSearchCV(estimator=RandomForestClassifier(), param_grid = hyperparam_grid,\
                        scoring='recall_weighted', cv = cv, verbose = 1,\
                           return_train_score=True)

In [None]:
### this is for timing how long my code takes
import time

And we perform the search

In [None]:
start = time.time()
search_1.fit(pos, colors)
print('number of minutes to perform the search:', (time.time() - start)/60)

In [None]:
print('mean test scores:',search_1.cv_results_['mean_test_score'])
print('std test scores:',search_1.cv_results_['std_test_score'])
print('mean train scores:',search_1.cv_results_['mean_train_score'])
print('std train scores:',search_1.cv_results_['std_train_score'])
print('best test score:',search_1.best_score_)
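The raw arrays are hard to read because we can't see which score belongs to which hyperparameter combination. Here's a hedged sketch of one way to line them up, using a small stand-in data set and grid (`X_demo`, `demo_grid`, `demo_search` are made-up names for this example, chosen small so it runs quickly); the same loop should work on `search_1.cv_results_`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Small stand-in data and grid (illustration only, not the notebook's search)
X_demo, y_demo = make_blobs(n_samples=200, centers=4,
                            random_state=0, cluster_std=1.5)
demo_grid = {'max_depth': [3, 7], 'min_samples_leaf': [2, 10]}
demo_cv = KFold(n_splits=3, shuffle=True, random_state=5)

demo_search = GridSearchCV(RandomForestClassifier(n_estimators=20, random_state=5),
                           param_grid=demo_grid, scoring='accuracy',
                           cv=demo_cv, return_train_score=True)
demo_search.fit(X_demo, y_demo)

res = demo_search.cv_results_
# rank_test_score is 1 for the best candidate, so sorting puts it first
order = np.argsort(res['rank_test_score'], kind='stable')
for i in order:
    gap = res['mean_train_score'][i] - res['mean_test_score'][i]
    print(res['params'][i],
          'test=%.3f' % res['mean_test_score'][i],
          'train=%.3f' % res['mean_train_score'][i],
          'gap=%.3f' % gap)
```

The `gap` column is the train-minus-test difference, which is a quick way to spot the combinations that overfit the most.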

So what do we notice?

And now let's see which hyperparameter values our search found to be best:

In [None]:
results_1 = search_1.best_params_
results_1

Ok! Let's create a model and train.

First things first... split the data

In [None]:
### WHAT ARE OUR FEATURES AND TARGETS???
features = pos
target   = colors

In [None]:
### A way to complement a visual representation of a distribution is to use np.unique
### to get numerical values

classes, counts = np.unique(target, return_counts=True)
print(classes)
print(counts)

So we see we have 4 classes (0, 1, 2, 3), each with 125 data points (instances).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2,\
                                                    random_state=seed)

In [None]:
np.unique(y_train, return_counts=True)

In [None]:
np.unique(y_test, return_counts=True)

In [None]:
### LET'S BUILD OUR MODEL USING THE GRID SEARCH RESULTS
### We plug in the values from the dictionary returned by best_params_
model_1 = RandomForestClassifier(n_estimators=results_1['n_estimators'],\
                                 max_depth=results_1['max_depth'],\
                                 min_samples_leaf=results_1['min_samples_leaf'])
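Since the keys of a `best_params_` dictionary are exactly the keyword names the estimator expects, the same model can also be built with dictionary unpacking. A small sketch, where `example_params` is a hypothetical dictionary standing in for `results_1`:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical best_params_ dictionary, standing in for results_1
example_params = {'max_depth': 7, 'min_samples_leaf': 5, 'n_estimators': 100}

# ** unpacks the dictionary into keyword arguments, so this is the same as
# RandomForestClassifier(max_depth=7, min_samples_leaf=5, n_estimators=100)
model_unpacked = RandomForestClassifier(**example_params)

print(model_unpacked.get_params()['max_depth'])
print(model_unpacked.get_params()['n_estimators'])
```

This saves typing when the grid has many hyperparameters. Note that the model is still unfitted, just like `model_1` above.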

# $\underline{{\rm Exercise\ A.}}$
Please complete the procedure! Your turn to:
- fit the model
- get predictions from the model
- evaluate the model using a confusion matrix and a numerical score

You can read all about the RF Classifier here:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

# 2.  $\underline{{\rm Splitting\ Imbalanced\ Data}}$

What happens if our data is seriously imbalanced?

In [None]:
blob_dist=np.array([200, 150, 125, 100, 40, 20, 12, 10, 5, 4])

In [None]:
pos, colors = make_blobs(n_samples=blob_dist,
                  random_state=0, cluster_std=1.5)
plt.figure(figsize=(10,10))
plt.scatter(pos[:, 0], pos[:, 1], c=colors, s=50, cmap='rainbow');

Let's see what happens in a standard 80/20 train/test split:

In [None]:
colors.shape

In [None]:
features_blobs = np.zeros((colors.shape[0], 2))
features_blobs[:,0] =colors
features_blobs[:,1] = pos[:,1]
target_blobs = pos[:,0]

In [None]:
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(features_blobs, target_blobs,\
                                                                    test_size=0.2,\
                                                                    random_state=seed)

We know that there are 10 colors, so let's check the color values of the whole data set:

In [None]:
# color values of whole data set
np.unique(colors, return_counts=True)

And now let's check whether all color values made it to the test set:

In [None]:
# these are the color values represented in the test set....
np.unique(X_test_2[:,0], return_counts=True)

In [None]:
#compare to what's in the training set...
np.unique(X_train_2[:,0], return_counts=True)

### We can make sure that all "classes" are represented in both the training and test set by using stratification

## 2.1 Stratified KFold

We can use a stratified KFold cross-validation generator to make sure all classes are included in each of the 5 folds during cross-validation.

You implement something like this:

In [None]:
strat_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
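To see what the stratified generator actually does, here's a minimal sketch on made-up labels (`y_demo` and `X_demo` are hypothetical, with 50, 10 and 5 instances per class) that counts how many instances of each class land in the test part of every fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 50, 10 and 5 instances of classes 0, 1, 2
y_demo = np.repeat([0, 1, 2], [50, 10, 5])
X_demo = np.zeros((y_demo.size, 1))  # feature values are not used to build folds

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)

# For each fold, count the class labels that ended up in the test indices
fold_counts = [np.bincount(y_demo[test_idx], minlength=3)
               for _, test_idx in skf.split(X_demo, y_demo)]

for k, counts in enumerate(fold_counts):
    print('fold', k, 'test-set class counts:', counts)
```

Each fold keeps the 10:2:1 class ratio of the full label array. One caveat: if a class has fewer members than `n_splits` (like our blob with 4 points), scikit-learn warns, because that class can't appear in every fold.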

And when creating the train/test split, you set the keyword "stratify=" equal to the property you want to make sure is represented in both the training and test sets.

This can be EITHER a feature OR a target:


In [None]:
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(features_blobs, target_blobs,\
                                                            test_size=0.2,\
                                                            stratify=features_blobs[:,0],\
                                                            random_state=seed)

And now let's see the features in our test set:

In [None]:
np.unique(X_test_3[:,0], return_counts=True)
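To make the effect of `stratify=` easier to see, here's a small sketch on hypothetical labels (`y_demo`, with 80, 15 and 5 instances per class) that compares the test-set class counts with and without stratification:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80, 15 and 5 instances of classes 0, 1, 2
y_demo = np.repeat([0, 1, 2], [80, 15, 5])
X_demo = np.arange(y_demo.size).reshape(-1, 1)  # placeholder feature column

# Plain split: the rare class may be under- or over-represented by chance
_, _, _, y_te_plain = train_test_split(X_demo, y_demo, test_size=0.2,
                                       random_state=5)

# Stratified split: class proportions are preserved in the test set
_, _, _, y_te_strat = train_test_split(X_demo, y_demo, test_size=0.2,
                                       random_state=5, stratify=y_demo)

print('plain     :', np.bincount(y_te_plain, minlength=3))
print('stratified:', np.bincount(y_te_strat, minlength=3))
```

With stratification, the 20 test instances follow the same 80/15/5 proportions as the full label array; the plain split depends on the luck of the shuffle.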

# 3.  $\underline{{\rm The\ Random\ Forest\ Regressor}}$

The RF is a powerful regression tool.


## 3.1 How can we turn this into a regression problem?

In [None]:
pos, colors = make_blobs(n_samples=1000, centers=50,
                  random_state=0, cluster_std=1.5)
plt.figure(figsize=(10,10))
plt.scatter(pos[:, 0], pos[:, 1], c=colors, s=50, cmap='rainbow');

In [None]:
features_reg = np.zeros((colors.shape[0],2))

In [None]:
features_reg[:,0] = colors
features_reg[:,1] = pos[:,1]
target_reg = pos[:,0] #now I've made the target a "continuous" variable

In [None]:
plt.hist(target_reg);

This is a nice "bell"-like distribution

## Normalizing data

What can we do if the range in our feature values is very large?

We can normalize our data!

### We can transform the data so that its range is limited (min-max scaling also keeps the shape of the distribution)
### Common normalizing strategies:
- for values that span many orders of magnitude, e.g. from $10^{-5}$ to $10^{12}$, we often just take the log base 10 of the values:

$\log_{10}(y)$

- limit the range of the data between 0 and 1 with the function:

$y_{\rm norm} = \frac{y - y_{\rm min}}{y_{\rm max} - y_{\rm min}}$
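As a quick illustration of the log strategy, here's a tiny sketch with made-up values (`values` is hypothetical) spanning 17 orders of magnitude:

```python
import numpy as np

# Hypothetical values spanning many orders of magnitude
values = np.array([1e-5, 1.0, 1e3, 1e12])

# log10 turns each value into (roughly) its exponent
log_values = np.log10(values)

print('original range  :', values.min(), 'to', values.max())
print('log10-ed values :', log_values)
```

Keep in mind the log is only defined for positive values, so this strategy assumes the data contains no zeros or negative numbers.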

Let's write a norming function

In [None]:
def norm_func(array):
  '''This function takes a 1D array and normalizes the elements
  such that they maintain the same distribution but range from 0 to 1
  '''
  n = (array - np.min(array))/(np.max(array)-np.min(array))
  return n
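Scikit-learn also ships a transformer that applies the same min-max formula to every column at once, `MinMaxScaler`. Here's a hedged sketch on a small made-up array (`data` is hypothetical) checking that it agrees with the formula written out by hand:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two feature columns with very different ranges
data = np.array([[2.0, 100.0],
                 [4.0, 300.0],
                 [10.0, 500.0]])

# The min-max formula applied column by column (axis=0)
manual = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

# The same transformation using scikit-learn
scaled = MinMaxScaler().fit_transform(data)

print(manual)
print(np.allclose(manual, scaled))
```

One reason to prefer the scaler object in a train/test workflow is that it remembers the min and max it was fitted on, so the test set can be transformed with the training set's values via `.transform()`.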

In [None]:
# here's our data
for i in range(features_reg.shape[1]):
  plt.hist(features_reg[:,i], alpha=0.5)

In [None]:
# here's our normed data
### DESCRIBE THE OUTPUT COMPARED TO THAT ABOVE
feature_0_normed = norm_func(features_reg[:,0])
feature_1_normed = norm_func(features_reg[:,1])

plt.hist(feature_0_normed, alpha=0.5)
plt.hist(feature_1_normed, alpha=0.5);

# $\underline{{\rm Exercise\ B.}}$
Run the complete Random Forest Regressor training.
- split the data
- get the best params from a grid search (you will need to change your estimator and scoring metric!)
- make the model
- fit the model
- get predictions from the model
- evaluate the model using a numerical score
- make a scatter plot of true values on the x-axis and predictions on the y-axis

## 3.2 The feature importance attribute!

In [None]:
### CHANGE TO THE NAME OF YOUR MODEL
importances = model_reg.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(features_reg.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure(figsize=(16,6))
plt.title("Feature importances")
plt.bar(range(features_reg.shape[1]), importances[indices],
       color="r", align="center")
plt.xticks(range(features_reg.shape[1]), indices)
plt.xlim([-1, features_reg.shape[1]])

This feature importance plot isn't too exciting because there are only two features. But it can be a powerful tool for analyzing results.
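To see what the attribute looks like when one feature clearly matters more than the others, here's a minimal sketch on synthetic data. Everything here (`X_demo`, `y_demo`, `demo_forest`) is made up for illustration: three random features, with the target built almost entirely from feature 0.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (illustration only): the target depends on feature 0,
# features 1 and 2 are pure noise
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.1 * rng.normal(size=300)

demo_forest = RandomForestRegressor(n_estimators=50, random_state=5)
demo_forest.fit(X_demo, y_demo)

demo_importances = demo_forest.feature_importances_
print('importances:', demo_importances)
print('sum        :', demo_importances.sum())
print('top feature:', np.argmax(demo_importances))
```

The importances are normalized to sum to 1, so each value can be read as that feature's share of the total impurity reduction across the forest.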

# $\underline{{\rm Exercise\ C.}}$: Discussion

What happens if we have imbalanced data in a regression problem? How would we know our data is imbalanced if there are no classes? What might we do to address the imbalance if we can't use stratification?

* You only need to jot your ideas down, this exercise doesn't involve code

