### **Trees: Ensemble Methods - Bagging**

In summary, bagging trains many individual models in parallel, each on a random subset of the data, and then combines their predictions.

BAGGing stands for Bootstrapping (sampling with replacement) and AGGregating (averaging predictions or, for classification, voting).
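The two steps can be sketched by hand. This is a minimal illustration, not how scikit-learn implements bagging internally; the number of models (`n_models = 25`), the seeds, and the use of iris are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(0)
n_models = 25  # illustrative choice, not a recommendation
trees = []
for _ in range(n_models):
    # Bootstrapping: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Aggregating: average the per-class probabilities, then pick the top class
avg_proba = np.mean([t.predict_proba(X_test) for t in trees], axis=0)
bagged_pred = trees[0].classes_[avg_proba.argmax(axis=1)]
accuracy = (bagged_pred == y_test).mean()
print(f"bagged accuracy: {accuracy:.3f}")
```

Each tree sees a slightly different version of the training set, so the trees disagree on some samples, and averaging smooths those disagreements out.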

A Random Forest goes one step further: in addition to training each tree on a random subset of the data, it considers only a random selection of features, rather than all features, when growing the trees. An ensemble of many such randomized trees is called a Random Forest.

With Random Forest, our goal is to reduce the variance of a single decision tree. We end up with an ensemble of different models, and the average of the predictions from all the trees is more robust than the prediction of any single tree.

- The base learners in a forest (deep decision trees) have high variance and low bias
- Bagging decreases the model’s variance
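Why averaging reduces variance can be made precise with a standard identity. The symbols are notation introduced here for illustration: $B$ is the number of trees, $T_b(x)$ the prediction of tree $b$, $\sigma^2$ the variance of a single tree's prediction, and $\rho$ the pairwise correlation between trees (assumed equal for every pair).

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

The second term shrinks as more trees are added, while the first term shrinks only if the trees are less correlated. Under these assumptions, that is the motivation for the random feature selection in a Random Forest: it lowers $\rho$.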

<img src="./images/boostrap_aggregating.png" width="500" height="500" />

### <strong> Extremely Randomized Trees </strong>

Extremely Randomized Trees, abbreviated as ExtraTrees in scikit-learn, add one more step of randomization to the random forest algorithm.

Random forests will:

1. compute the optimal split for each feature within the randomly selected subset, and then choose the best feature to split on.
2. build multiple trees with `bootstrap=True` (the default), which means each tree is trained on a sample drawn with replacement. (scikit-learn's ExtraTrees estimators default to `bootstrap=False`.)

ExtraTrees, by contrast, draws a random split for each feature within that random subset and then chooses the best feature to split on by comparing those randomly chosen splits. In other words, nodes are split on random thresholds, not optimal ones.

Like random forest, the Extra Trees algorithm will randomly sample the features at each split point of a decision tree. Unlike random forest, which uses a greedy algorithm to select an optimal split point, the Extra Trees algorithm selects a split point at random.
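The difference between the two split strategies can be sketched for a single feature. This is a simplified illustration, not scikit-learn's implementation: the helper names (`gini`, `weighted_gini`, `best_split`, `random_split`) and the toy data are hypothetical, and Gini impurity is assumed as the split criterion.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(x, y, threshold):
    """Weighted impurity of the two children produced by x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(x, y):
    """Random-forest style: evaluate every midpoint and keep the best one."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    scores = [weighted_gini(x, y, t) for t in candidates]
    return candidates[int(np.argmin(scores))]

def random_split(x, rng):
    """Extra-Trees style: draw one threshold uniformly between min and max."""
    return rng.uniform(x.min(), x.max())

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
rng = np.random.default_rng(0)

t_best = best_split(x, y)      # searches all candidate thresholds
t_rand = random_split(x, rng)  # a single random draw, no search

print(t_best, weighted_gini(x, y, t_best))
print(t_rand, weighted_gini(x, y, t_rand))
```

`best_split` has to score every candidate threshold, while `random_split` scores none. That skipped search is where the time saving comes from.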

In terms of computational cost, and therefore execution time, the Extra Trees algorithm is faster. The overall procedure is the same, but it saves time by choosing each split point at random instead of computing the optimal one.

Extremely randomized trees are typically cheaper to train than random forests, and their predictive performance is often comparable. In some cases, they may even perform better!
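Claims like these are worth checking on your own data. The following is a rough sketch of such a comparison, assuming iris as a stand-in dataset, 100 trees, and 5-fold cross-validation. On a dataset this small the timing difference may be negligible or noisy.

```python
from time import perf_counter

from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

results = {}
for name, Model in [("random forest", RandomForestClassifier),
                    ("extra trees", ExtraTreesClassifier)]:
    model = Model(n_estimators=100, random_state=42)
    start = perf_counter()
    scores = cross_val_score(model, X, y, cv=5)  # accuracy per fold
    results[name] = (scores.mean(), perf_counter() - start)

for name, (acc, seconds) in results.items():
    print(f"{name}: mean accuracy={acc:.3f}, time={seconds:.2f}s")
```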

![Bagging](./images/rf_extra.png)

Link to Paper: https://link.springer.com/content/pdf/10.1007/s10994-006-6226-1.pdf

In [1]:
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score

In [7]:
#load dataset

X,y = load_iris(return_X_y=True)

#train,test split

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)

#random forest with gini
rf = RandomForestClassifier(criterion='gini',n_estimators=150,max_depth=4,n_jobs=-1)

rf.fit(X_train,y_train)  #fit on the data

rf_predict = rf.predict(X_test)

f1_score(y_test, rf_predict, average=None)

array([1., 1., 1.])

In [8]:
rf.score(X_test, y_test)

1.0

In [10]:
#random forest with entropy
rf = RandomForestClassifier(criterion='entropy',n_estimators=200,max_depth=4,n_jobs=-1)

rf.fit(X_train,y_train)

rf_predict = rf.predict(X_test)

f1_score(y_test, rf_predict, average=None)

array([1., 1., 1.])

In [11]:
rf.score(X_test, y_test)

1.0

Exercise: The random forest used at https://shoe-size-predict.herokuapp.com/ currently has a mean absolute error of 1.033. Can you get a better one?

In [2]:
from sklearn.metrics import mean_absolute_error

In [3]:
df = pd.read_csv('./data/shoesize_data/shoesizes.csv', index_col=0)

In [4]:
df.head()

| | height | sex_no | shoe_size |
|---|---|---|---|
| 0 | 160.0 | 2 | 40 |
| 1 | 171.0 | 2 | 39 |
| 2 | 174.0 | 2 | 39 |
| 3 | 176.0 | 2 | 40 |
| 4 | 195.0 | 1 | 46 |


In [35]:
df.describe()

| | height | sex_no | shoe_size |
|---|---|---|---|
| count | 174.0 | 174.0 | 174.0 |
| mean | 167.465517 | 1.632184 | 40.522989 |
| std | 31.266641 | 0.506945 | 4.80571 |
| min | 1.0 | 0.0 | 35.0 |
| 25% | 163.0 | 1.0 | 38.0 |
| 50% | 170.0 | 2.0 | 39.0 |
| 75% | 175.75 | 2.0 | 42.75 |
| max | 364.0 | 2.0 | 88.0 |


In [6]:
X, y = df.loc[:, ["height","sex_no"]], df.loc[:, "shoe_size"]  # raw, uncleaned data
X.shape, y.shape

((174, 2), (174,))

In [39]:
(df.height < 139) | (df.height > 210)

0     False
1     False
2     False
3     False
4     False
      ...  
39    False
40    False
41    False
42    False
43    False
Name: height, Length: 174, dtype: bool

In [40]:
df_clean = df.drop(df[(df.height < 139) | (df.height > 210)].index)
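One caveat: the index printed above repeats (the last label is 43 although the Series has 174 rows), and `drop` removes every row that shares a label with an outlier, not just the outlier itself. The toy frame below is a made-up illustration of that behaviour, not the real data; a boolean mask is one way to avoid it.

```python
import pandas as pd

# Toy frame with a repeated index, as can happen after concatenating files
toy = pd.DataFrame(
    {"height": [170.0, 364.0, 180.0, 165.0]},
    index=[0, 1, 0, 1],
)

outlier = (toy.height < 139) | (toy.height > 210)

by_label = toy.drop(toy[outlier].index)  # drops every row labelled 1
by_mask = toy[~outlier]                  # drops only the outlier row

print(len(by_label), len(by_mask))
```

Here the label-based drop also removes the valid 165.0 row, because it shares the label `1` with the 364.0 outlier. Whether this affected the real shoe-size data depends on which labels the outliers carry.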

In [41]:
df_clean.describe()

| | height | sex_no | shoe_size |
|---|---|---|---|
| count | 169.0 | 169.0 | 169.0 |
| mean | 170.242604 | 1.621302 | 40.284024 |
| std | 10.452003 | 0.510388 | 3.192595 |
| min | 140.0 | 0.0 | 35.0 |
| 25% | 163.0 | 1.0 | 38.0 |
| 50% | 170.0 | 2.0 | 39.0 |
| 75% | 176.0 | 2.0 | 43.0 |
| max | 206.0 | 2.0 | 50.0 |


In [48]:
X, y = df_clean.loc[:, ["height","sex_no"]], df_clean.loc[:, "shoe_size"]

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [50]:
from sklearn.ensemble import ExtraTreesRegressor

In [51]:
etc = ExtraTreesRegressor(
    n_estimators=175,
    #criterion='squared_error',
    max_depth=4,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features=1.0,  # 'auto' was removed in scikit-learn 1.3; 1.0 uses all features
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    bootstrap=True,
    oob_score=False,
    n_jobs=-1,
    random_state=42,
    verbose=0,
    warm_start=False,
    ccp_alpha=0.0,
    max_samples=None
)


In [52]:
etc.fit(X_train, y_train)

ExtraTreesRegressor(bootstrap=True, max_depth=4, n_estimators=175, n_jobs=-1,
                    random_state=42)

In [53]:
etc_pred = etc.predict(X_test)

In [54]:
mean_absolute_error(y_test, etc_pred)

1.1008033475842989
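This result (about 1.10) does not yet beat the 1.033 baseline. One possible next step is a small hyperparameter search scored on mean absolute error. The sketch below makes several assumptions: it runs on synthetic stand-in data rather than the real CSV, and the grid values are arbitrary starting points. With the real data, `X_train` and `y_train` from the cells above would take the place of the synthetic arrays.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the shoe-size data (assumption: NOT the real CSV)
rng = np.random.default_rng(42)
n = 170
height = rng.normal(170, 10, size=n)
sex_no = rng.integers(1, 3, size=n)  # values 1 or 2
shoe_size = np.round(0.2 * height + 2.0 * (sex_no == 1) + 5 + rng.normal(0, 1, size=n))
X = np.column_stack([height, sex_no])
y = shoe_size

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Arbitrary illustrative grid; widen or narrow it as needed
param_grid = {
    "max_depth": [2, 3, 4, 6],
    "min_samples_leaf": [1, 3, 5],
}
search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=100, random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",  # higher (closer to 0) is better
    cv=5,
)
search.fit(X_train, y_train)

test_mae = mean_absolute_error(y_test, search.predict(X_test))
print(search.best_params_, round(test_mae, 3))
```

Selecting hyperparameters by cross-validation on the training split keeps the test split untouched, so the final MAE remains an honest estimate.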