### <strong> Trees: Ensemble Methods - Bagging </strong>

Bagging in short: many individual models are trained in parallel, each on a random subset of the data, and their predictions are combined.

BAGGing stands for Bootstrapping (sampling with replacement) and AGGregating (averaging predictions, or taking a majority vote for classification).
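The two steps can be sketched by hand. This is a minimal illustration, not how scikit-learn implements bagging internally; the ensemble size `n_models`, the seeds, and the use of the iris data are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
n_models = 25                   # hypothetical ensemble size

all_votes = []
for _ in range(n_models):
    # Bootstrapping: draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    all_votes.append(tree.predict(X_test))

# Aggregating: majority vote per test sample (for regression you would average)
votes = np.stack(all_votes)  # shape (n_models, n_test)
bagged_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])

print("bagged accuracy:", (bagged_pred == y_test).mean())
```

Each tree sees a slightly different training set, so the trees disagree on some samples; the vote smooths those disagreements out.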

### <strong> Random Forest </strong>

In addition to training each tree on a random subset of the data, Random Forest also considers only a random selection of features, rather than all features, when growing the trees. An ensemble of many such randomized trees is called a Random Forest.

With Random Forest, our goal is to reduce the variance of a single decision tree. We end up with an ensemble of different models, and the average of the predictions from the individual trees is used, which is more robust than the prediction of a single decision tree.

- Fully grown decision trees = high variance, low bias base learners
- Bagging decreases the model’s variance
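The variance argument above can be checked roughly by comparing cross-validation scores of a single tree and a forest. This is a sketch on synthetic data; the dataset shape and all hyperparameters are assumptions chosen only for illustration, and the forest will often, though not always, score higher with less spread across folds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption; any tabular classification set would do)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

# 5-fold cross-validated accuracy for each model
tree_scores = cross_val_score(tree, X_demo, y_demo, cv=5)
forest_scores = cross_val_score(forest, X_demo, y_demo, cv=5)

print(f"single tree  : mean={tree_scores.mean():.3f}, std={tree_scores.std():.3f}")
print(f"random forest: mean={forest_scores.mean():.3f}, std={forest_scores.std():.3f}")
```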

<img src="./images/boostrap_aggregating.png" width="500" height="500" />

### <strong> Extremely Randomized Trees </strong>

Extremely Randomized Trees, abbreviated as ExtraTrees in scikit-learn, add one more step of randomization to the random forest algorithm.

Random forests will

1. compute the optimal split for each feature within the randomly selected subset, and then choose the best feature to split on.
2. build multiple trees with bootstrap = True (by default), which means each tree is trained on a sample of the data drawn with replacement.
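The bootstrap default mentioned above can be read directly off the estimators. A small check, assuming a reasonably recent scikit-learn release:

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

rf_default = RandomForestClassifier()
et_default = ExtraTreesClassifier()

# Random forests resample the rows per tree by default; extra trees do not
print("RandomForest bootstrap:", rf_default.bootstrap)
print("ExtraTrees   bootstrap:", et_default.bootstrap)
```

Both estimators accept `bootstrap=True` or `bootstrap=False` explicitly, so the default is only a starting point.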

ExtraTrees, in contrast to Random Forests, draws a random split for each feature within that random subset and then chooses the best feature to split on by comparing those randomly drawn splits. (Nodes are split on random splits, not on the best splits.) In scikit-learn, the ExtraTrees estimators also default to bootstrap = False, so each tree is trained on the whole training set.

Like random forest, the Extra Trees algorithm will randomly sample the features at each split point of a decision tree. Unlike random forest, which uses a greedy algorithm to select an optimal split point, the Extra Trees algorithm selects a split point at random.
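The difference between the two split rules can be sketched for a single feature. This is a simplified illustration, not scikit-learn's implementation; the helper names (`gini`, `split_impurity`, `best_split`, `random_split`) and the toy arrays are hypothetical.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of integer class labels."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

def split_impurity(x, y, threshold):
    """Weighted Gini impurity after splitting feature x at threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

def best_split(x, y):
    """Random-forest style: evaluate every midpoint between sorted unique values."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    scores = [split_impurity(x, y, t) for t in candidates]
    return candidates[int(np.argmin(scores))]

def random_split(x, rng):
    """Extra-trees style: one threshold drawn uniformly between min and max."""
    return rng.uniform(x.min(), x.max())

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])  # toy feature values
y = np.array([0, 0, 0, 1, 1, 1])              # toy class labels

t_best = best_split(x, y)
t_rand = random_split(x, rng)
print("best threshold  :", t_best, "impurity:", split_impurity(x, y, t_best))
print("random threshold:", t_rand, "impurity:", split_impurity(x, y, t_rand))
```

The exhaustive search scores every candidate threshold, while the random rule scores only one, which is where the time saving comes from. Both algorithms then compare the chosen thresholds across the sampled features and keep the best feature.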

In terms of computational cost, and therefore execution time, the Extra Trees algorithm is usually faster. The overall procedure is the same, but it saves time by drawing the split point at random instead of calculating the optimal one.

The predictive performance of extremely randomized trees is often comparable to that of random forests. In some cases, they may even perform better!
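One rough way to check the cost claim on your own machine is to time both estimators on the same data. The synthetic dataset and its size are assumptions, and the measured times depend on hardware and library version, so treat the numbers as indicative only.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Synthetic data (an assumption), large enough for the split search to matter
X_big, y_big = make_classification(
    n_samples=2000, n_features=30, n_informative=10, random_state=0
)

fit_seconds = {}
for name, model in [
    ("RandomForest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("ExtraTrees", ExtraTreesClassifier(n_estimators=100, random_state=0)),
]:
    start = time.perf_counter()
    model.fit(X_big, y_big)
    fit_seconds[name] = time.perf_counter() - start

for name, seconds in fit_seconds.items():
    print(f"{name}: {seconds:.2f} s")
```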

![Bagging](./images/rf_extra.png)

Link to Paper: https://link.springer.com/content/pdf/10.1007/s10994-006-6226-1.pdf

In [4]:
#import libraries
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score

In [5]:
#load dataset

X,y = load_iris(return_X_y=True)

#train,test split

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)

#random forest with gini
rf = RandomForestClassifier(criterion='gini',n_estimators=150,max_depth=4,n_jobs=-1)

rf.fit(X_train,y_train)  #fit on the data

rf_predict = rf.predict(X_test)

f1_score(y_test, rf_predict, average=None)

array([1., 1., 1.])

In [6]:
#random forest with entropy
rf = RandomForestClassifier(criterion='entropy',n_estimators=200,max_depth=4,n_jobs=-1)

rf.fit(X_train,y_train)

rf_predict = rf.predict(X_test)

f1_score(y_test, rf_predict, average=None)

array([1., 1., 1.])

Exercise: Using the data in scout_data, build a model to predict the product tier (classification) and a model to predict the number of detail views (regression).

In [24]:
#import libraries
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, mean_squared_error

In [25]:
df = pd.read_csv('./data/scout_data/Data_Description.csv',sep=';')

#df.info()
df.head()

Unnamed: 0,column name,description
0,article_id,unique article identifier
1,product_tier,premium status of the article
2,make_name,name of the car manufacturer
3,price,price of the article
4,first_zip_digit,first digit of the zip code of the region the ...


In [26]:
df = pd.read_csv('./data/scout_data/Case_Study_Data.csv',sep=';')

df.head()

Unnamed: 0,article_id,product_tier,make_name,price,first_zip_digit,first_registration_year,created_date,deleted_date,search_views,detail_views,stock_days,ctr
0,350625839,Basic,Mitsubishi,16750,5,2013,24.07.18,24.08.18,3091.0,123.0,30,0.037803299902944
1,354412280,Basic,Mercedes-Benz,35950,4,2015,16.08.18,07.10.18,3283.0,223.0,52,0.06792567773378
2,349572992,Basic,Mercedes-Benz,11950,3,1998,16.07.18,05.09.18,3247.0,265.0,51,0.0816137973514013
3,350266763,Basic,Ford,1750,6,2003,20.07.18,29.10.18,1856.0,26.0,101,0.0140086206896551
4,355688985,Basic,Mercedes-Benz,26500,3,2014,28.08.18,08.09.18,490.0,20.0,12,0.0408163265306122


In [32]:
#separate features and target for each task
X_class, y_class = df.drop(columns=['product_tier']), df['product_tier']
X_reg, y_reg = df.drop(columns=['detail_views']), df['detail_views']

In [37]:
#train test split for classification
X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(
    X_class, y_class, test_size=0.2, random_state=42
)

#train test split for regression
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

In [41]:
#classifier for random forests
rfc = RandomForestClassifier(criterion='gini',n_estimators=150,max_depth=4,n_jobs=-1)
rfc.fit(X_train_class, y_train_class)

y_predict_class = rfc.predict(X_test_class)
print('Random Forest Classification f1: ')
print(f1_score(y_test_class, y_predict_class, average=None))  #average=None (not the string "none") gives one score per class

#classifier for extra trees
etc = ExtraTreesClassifier(criterion='gini',n_estimators=150,max_depth=4,n_jobs=-1)
etc.fit(X_train_class, y_train_class)

y_predict_class_et = etc.predict(X_test_class)  #score the extra trees predictions, not the random forest ones
print('Extra Trees Classification f1: ')
print(f1_score(y_test_class, y_predict_class_et, average=None))

ValueError: could not convert string to float: 'Nissan'
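The error above comes from string columns such as make_name: the tree ensembles in scikit-learn expect numeric features. One possible fix is to one-hot encode the categorical columns inside a pipeline. The sketch below uses a tiny made-up frame (`toy`) that only mimics the column layout shown earlier; on the real data, the date columns (created_date, deleted_date) would also need to be dropped or converted to numbers.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows for illustration only, not taken from the case study data
toy = pd.DataFrame({
    "make_name": ["Nissan", "Ford", "Nissan", "Audi", "Ford", "Audi"],
    "price": [16750, 1750, 9900, 35950, 4500, 26500],
    "product_tier": ["Basic", "Basic", "Basic", "Premium", "Basic", "Premium"],
})
X_toy, y_toy = toy.drop(columns=["product_tier"]), toy["product_tier"]

# One-hot encode the string column, pass numeric columns through unchanged
preprocess = ColumnTransformer(
    [("make", OneHotEncoder(handle_unknown="ignore"), ["make_name"])],
    remainder="passthrough",
)
clf = make_pipeline(preprocess, RandomForestClassifier(n_estimators=50, random_state=0))
clf.fit(X_toy, y_toy)

toy_pred = clf.predict(X_toy)
print(toy_pred)
```

Keeping the encoder inside the pipeline means it is fitted on the training split only, which avoids leaking categories from the test split.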

In [36]:
#regressor for random forests
rfr = RandomForestRegressor(criterion='squared_error',n_estimators=150,max_depth=4,n_jobs=-1)
rfr.fit(X_train_reg, y_train_reg)

y_predict_reg = rfr.predict(X_test_reg)
print('Random Forest Regressor MSE: ')
print(mean_squared_error(y_test_reg, y_predict_reg))

#regressor for extra trees
etr = ExtraTreesRegressor(criterion='squared_error',n_estimators=150,max_depth=4,n_jobs=-1)
etr.fit(X_train_reg, y_train_reg)

y_predict_reg_et = etr.predict(X_test_reg)
print('Extra Trees Regressor MSE: ')
print(mean_squared_error(y_test_reg, y_predict_reg_et))

Note: 'gini' is a classification criterion, so the regressors here use 'squared_error'. Passing criterion='gini' to a regressor raises:

InvalidParameterError: The 'criterion' parameter of RandomForestRegressor must be a str among {'friedman_mse', 'poisson', 'squared_error', 'absolute_error'}. Got 'gini' instead.

The features also have to be numeric before fitting (see the ValueError in the classification cell).