# Random Forests 🌴🌳🌴🌳🌳🌴🌴

After this session you will
- have a basic intuition for the algorithm behind (bagged) non-linear classification
- be able to apply the Random Forest classifier to your own datasets
- understand and use the scikit-learn framework for an ML workflow (prepare data, train a model, evaluate the model, make predictions)
- be able to explain in your own words the pros and cons of the Random Forest classifier


Note:
- scikit-learn provides one Random Forest model for classification and one for regression; we will only cover the classification model.

In [3]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


In [7]:
df = pd.read_csv('penguins_simple.csv', sep=';')
df

,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,36.7,19.3,193.0,3450.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE
...,...,...,...,...,...,...
328,Gentoo,47.2,13.7,214.0,4925.0,FEMALE
329,Gentoo,46.8,14.3,215.0,4850.0,FEMALE
330,Gentoo,50.4,15.7,222.0,5750.0,MALE
331,Gentoo,45.2,14.8,212.0,5200.0,FEMALE


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Species              333 non-null    object 
 1   Culmen Length (mm)   333 non-null    float64
 2   Culmen Depth (mm)    333 non-null    float64
 3   Flipper Length (mm)  333 non-null    float64
 4   Body Mass (g)        333 non-null    float64
 5   Sex                  333 non-null    object 
dtypes: float64(4), object(2)
memory usage: 15.7+ KB


In [9]:
X = df[["Culmen Length (mm)", "Body Mass (g)"]]
y = df["Species"]

In [10]:
train, test = train_test_split(df, random_state=42)

#### 1. Inspect the shape of the train and test DataFrames

In [12]:
train.shape,test.shape

((249, 6), (84, 6))

### Train a Baseline Model

In [16]:
X_train = train[["Culmen Length (mm)", "Body Mass (g)"]]
X_test = test[["Culmen Length (mm)", "Body Mass (g)"]]
y_train = train["Species"]
y_test = test["Species"]

In [18]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((249, 2), (84, 2), (249,), (84,))

#### 2. Train a Decision Tree with maximum depth 2

In [20]:
clf_DT=DecisionTreeClassifier(max_depth=2,random_state=42)

In [21]:
clf_DT.fit(X_train,y_train)

DecisionTreeClassifier(max_depth=2, random_state=42)

#### 3. Calculate training and test accuracy

In [28]:
print("Accuracy score of baseline classifier (on training set):",round(clf_DT.score(X_train,y_train),3))
print("Accuracy score of baseline classifier (on test set):",round(clf_DT.score(X_test,y_test),3))

Accuracy score of baseline classifier (on training set): 0.928
Accuracy score of baseline classifier (on test set): 0.929
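
To build intuition for what a depth-2 tree has learned, it can help to print its split rules. The sketch below is only an illustration: it uses two columns of scikit-learn's bundled iris data as a stand-in for the penguin features (an assumption, so that the block runs without `penguins_simple.csv`), and the names `X_demo`, `y_demo` and `tree_demo` are made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data (assumption): two iris features instead of the penguin columns.
iris = load_iris()
X_demo = iris.data[:, [2, 3]]  # petal length, petal width
y_demo = iris.target

tree_demo = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_demo.fit(X_demo, y_demo)

# export_text renders the learned thresholds as nested if/else rules.
rules = export_text(
    tree_demo, feature_names=["petal length (cm)", "petal width (cm)"]
)
print(rules)
print("depth:", tree_demo.get_depth(), "leaves:", tree_demo.get_n_leaves())
```

With `max_depth=2` the tree can ask at most two questions per data point, so it ends up with at most four leaves. The same call should work on `clf_DT` with the two penguin column names.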


### Train a Random Forest from Scratch

#### 4. Build a list of trees

Repeat the following 100 times:
    
* draw 50 random penguins from the training set (with `df.sample()`)
* train a decision tree on the sample with `max_depth=3`
* add the tree to the forest

In [47]:
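forest = []

for t in range(100):
    # Draw 50 training rows; seeding with t makes every sample reproducible.
    sample = train.sample(50, random_state=t)
    X_sample = sample[["Culmen Length (mm)", "Body Mass (g)"]]
    y_sample = sample["Species"]
    tree = DecisionTreeClassifier(max_depth=3, random_state=t)
    tree.fit(X_sample, y_sample)
    forest.append(tree)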

In [48]:
len(forest)

100
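
A note on the sampling step: `DataFrame.sample()` draws without replacement by default, so each of the 50-row samples above contains 50 distinct penguins. Classic bagging (bootstrap aggregating) draws with replacement instead, which is also the default behaviour of `RandomForestClassifier` (`bootstrap=True`). The sketch below shows the difference on a hypothetical ten-row frame called `toy`, which only stands in for `train`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `train`.
toy = pd.DataFrame({"penguin_id": range(10)})

# Default behaviour: every row can be drawn at most once.
without_replacement = toy.sample(10, replace=False, random_state=0)

# Bootstrap sample: rows may repeat, so some rows are left out.
with_replacement = toy.sample(10, replace=True, random_state=0)

print("unique rows without replacement:", without_replacement["penguin_id"].nunique())
print("unique rows with replacement:   ", with_replacement["penguin_id"].nunique())
```

To turn the loop above into a bootstrap version, one option would be `train.sample(len(train), replace=True, random_state=t)`.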

#### 5. Calculate a list of training scores for all trees on the full training set

In [49]:
training_scores = []
for tree in forest:
    # score() returns the accuracy of this tree on the full training set
    training_scores.append(tree.score(X_train, y_train))
training_scores


[0.8955823293172691,
 0.9076305220883534,
 0.8995983935742972,
 0.8995983935742972,
 0.9236947791164659,
 0.8393574297188755,
 0.9156626506024096,
 0.8955823293172691,
 0.8674698795180723,
 0.8674698795180723,
 0.9236947791164659,
 0.927710843373494,
 0.8875502008032129,
 0.8995983935742972,
 0.8674698795180723,
 0.8554216867469879,
 0.8795180722891566,
 0.9076305220883534,
 0.9076305220883534,
 0.8955823293172691,
 0.8514056224899599,
 0.9036144578313253,
 0.8634538152610441,
 0.891566265060241,
 0.8393574297188755,
 0.8835341365461847,
 0.8755020080321285,
 0.9116465863453815,
 0.8594377510040161,
 0.9076305220883534,
 0.8955823293172691,
 0.8393574297188755,
 0.891566265060241,
 0.8955823293172691,
 0.9076305220883534,
 0.8955823293172691,
 0.8554216867469879,
 0.8755020080321285,
 0.8995983935742972,
 0.8674698795180723,
 0.891566265060241,
 0.8674698795180723,
 0.9036144578313253,
 0.891566265060241,
 0.8995983935742972,
 0.9076305220883534,
 0.8955823293172691,
 0.895582329317269

In [53]:
training_scores_forest = [tree.score(X_train, y_train) for tree in forest]
training_scores_forest


#### 6. Calculate the mean training score
Is the mean score better or worse than the baseline?

In [51]:
print("Average accuracy score of Random Forest classifier (on training set, without mode):", round(sum(training_scores_forest)/len(training_scores_forest), 3))

Average accuracy score of Random Forest classifier (on training set, without mode): 0.89


#### 7. Calculate the mean test score in the same way
Is the mean score better or worse than the baseline?

In [55]:
test_scores_forest = [tree.score(X_test, y_test) for tree in forest]
print("Average accuracy score of Random Forest classifier (on test set, without mode):", round(sum(test_scores_forest)/len(test_scores_forest), 3))

Average accuracy score of Random Forest classifier (on test set, without mode): 0.892


### Majority Vote

#### 8. Create a list of predictions for every tree

In [65]:

predictions_forest = [tree.predict(X_test) for tree in forest]
predictions_forest

[array(['Adelie', 'Gentoo', 'Adelie', 'Chinstrap', 'Adelie', 'Gentoo',
        'Gentoo', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Gentoo',
        'Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Adelie', 'Adelie',
        'Chinstrap', 'Adelie', 'Adelie', 'Chinstrap', 'Adelie', 'Gentoo',
        'Chinstrap', 'Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Chinstrap',
        'Gentoo', 'Chinstrap', 'Gentoo', 'Adelie', 'Adelie', 'Gentoo',
        'Gentoo', 'Chinstrap', 'Chinstrap', 'Adelie', 'Adelie', 'Adelie',
        'Adelie', 'Chinstrap', 'Chinstrap', 'Adelie', 'Adelie', 'Gentoo',
        'Adelie', 'Adelie', 'Gentoo', 'Adelie', 'Gentoo', 'Gentoo',
        'Adelie', 'Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Chinstrap',
        'Chinstrap', 'Gentoo', 'Gentoo', 'Gentoo', 'Adelie', 'Adelie',
        'Gentoo', 'Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Adelie',
        'Adelie', 'Gentoo', 'Adelie', 'Gentoo', 'Adelie', 'Adelie',
        'Adelie', 'Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Adelie',
        'Adel

#### 9. Convert the list into a DataFrame
Inspect the result

* The shape of the DataFrame should be (100, 84)
* What do the dimensions of the DataFrame mean?
* Do the trees' predictions agree?

In [57]:
predictions_forest = pd.DataFrame(predictions_forest)
predictions_forest.shape

(100, 84)
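
One way to approach the questions above: each row holds the predictions of one tree and each column holds all votes for one test penguin. Agreement can then be measured per column as the share of trees that back the most common class. The sketch below uses a small hypothetical vote table (5 trees, 4 data points) named `votes`; the same two lines should apply to the real `predictions_forest` DataFrame.

```python
import pandas as pd

# Hypothetical votes: 5 trees (rows) x 4 data points (columns).
votes = pd.DataFrame([
    ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    ["Adelie", "Gentoo", "Adelie",    "Adelie"],
    ["Adelie", "Gentoo", "Chinstrap", "Gentoo"],
    ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    ["Adelie", "Gentoo", "Adelie",    "Adelie"],
])

# Majority vote per data point: first row of the column-wise mode.
majority = votes.mode().iloc[0]

# Share of trees that voted for the winning class, per data point.
agreement = votes.apply(lambda col: col.value_counts(normalize=True).iloc[0])

print(majority.tolist())   # ['Adelie', 'Gentoo', 'Chinstrap', 'Adelie']
print(agreement.tolist())  # [1.0, 1.0, 0.6, 0.8]
```

Columns with an agreement close to 1.0 are points on which nearly all trees agree; low values mark the penguins the forest is unsure about.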

#### 10. Calculate accuracy from most frequent prediction on each data point
* Is the overall accuracy better than the accuracy of the baseline?
* Do you have more or less overfitting?

In [58]:
# mode() works column-wise; its first row holds the most frequent prediction per data point
most_frequent_prediction = predictions_forest.mode().iloc[0]
print('majority vote test score:', round(accuracy_score(y_test, most_frequent_prediction), 3))

majority vote test score: 0.94


## Random Forest with scikit-learn

In [60]:
clf_RF=RandomForestClassifier(n_estimators=100,max_depth=2)

In [62]:
clf_RF.fit(X_train,y_train)

RandomForestClassifier(max_depth=2)

In [64]:
print("Accuracy score Random Forest classifier (training set):", round(clf_RF.score(X_train, y_train), 3))
print("Accuracy score Random Forest classifier (test set):", round(clf_RF.score(X_test, y_test), 3))

Accuracy score Random Forest classifier (training set): 0.936
Accuracy score Random Forest classifier (test set): 0.929


Question: clarify the aggregation component in bagging.
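
On the aggregation step: bagging trains many models on resampled data and then combines their outputs. The from-scratch forest above combines them with a hard majority vote (one vote per tree). According to the scikit-learn documentation, `RandomForestClassifier` instead averages the class probabilities of its trees and predicts the class with the highest mean probability (a soft vote). The sketch below compares both on the bundled iris data, used here as a stand-in for the penguins (an assumption); `rf_demo`, `soft_vote` and `hard_vote` are names made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Stand-in data (assumption): iris instead of the penguin dataset.
X_demo, y_demo = load_iris(return_X_y=True)

rf_demo = RandomForestClassifier(n_estimators=25, max_depth=2, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Soft vote: average the per-tree class probabilities, then pick the best class.
mean_proba = np.mean(
    [tree.predict_proba(X_demo) for tree in rf_demo.estimators_], axis=0
)
soft_vote = rf_demo.classes_[mean_proba.argmax(axis=1)]

# Hard vote: every tree casts exactly one vote per data point.
# The fitted sub-trees predict class indices (as floats), hence the int cast.
tree_votes = np.array(
    [tree.predict(X_demo) for tree in rf_demo.estimators_]
).astype(int)
hard_vote = rf_demo.classes_[
    [np.bincount(col, minlength=len(rf_demo.classes_)).argmax() for col in tree_votes.T]
]

print("averaged probabilities match rf_demo.predict_proba:",
      np.allclose(mean_proba, rf_demo.predict_proba(X_demo)))
print("share of points where hard and soft vote agree:",
      (hard_vote == soft_vote).mean())
```

The two schemes usually agree on most points. They can differ when a few confident trees outweigh a larger number of unsure ones.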