## Lab 7. Ensemble Models (meta-models)
#### Objective: getting familiar with
 
 * bagging
 * boosting
 * voting
 * stacking
#### and several classification and regression models
 * RandomForestClassifier / RandomForestRegressor
 * AdaBoostClassifier / AdaBoostRegressor
 * GradientBoostingClassifier / GradientBoostingRegressor
 * VotingClassifier / VotingRegressor
 * StackingClassifier / StackingRegressor

In [1]:
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import BaggingRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets
from sklearn.model_selection import cross_val_score


In [2]:
# data for classification tasks
iris = datasets.load_iris()
X = iris.data
y = iris.target

In [3]:
# data for binary classification tasks (X and y are overwritten here, so the experiments below use this dataset)
BreastCancer = datasets.load_breast_cancer()
X = BreastCancer.data
y = BreastCancer.target

In [4]:
# data for regression tasks
diabetes = datasets.load_diabetes()
Xregr = diabetes.data
yregr = diabetes.target

### 1. Bagging = bootstrap aggregation

In [5]:
# Decision Tree classifier
dt = DecisionTreeClassifier(max_depth=None, min_samples_split=2,random_state=0)
scoresDT = cross_val_score(dt, X, y, cv=5)
scoresDT.mean()

0.9173730787144851

In [6]:
# Generic bagging: train several instances of a base classifier on randomly selected subsets of the training set and combine their predictions;
# impact: it reduces the variance of the base estimator
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=None, min_samples_split=2,random_state=0),n_estimators=10, max_samples=0.5, max_features=0.5)
scoresB = cross_val_score(bagging, X, y, cv=5)
scoresB.mean()

0.9508150908244062
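
The two steps behind the name (bootstrap, then aggregate) can be sketched by hand. This is only an illustration, not how `BaggingClassifier` works internally: it uses full-size bootstrap samples and all features, whereas the cell above uses `max_samples=0.5` and `max_features=0.5`. The hold-out split, the choice of 25 trees, and the names `X_demo`, `votes`, and `bagged_pred` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and split (assumption: not part of the original lab cells)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a binary majority vote cannot tie
votes = np.zeros((n_trees, len(X_te)), dtype=int)

for i in range(n_trees):
    # Bootstrap step: sample training indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(random_state=i).fit(X_tr[idx], y_tr[idx])
    votes[i] = tree.predict(X_te)

# Aggregation step: majority vote over the trees (labels are 0/1)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

tree_accs = (votes == y_te).mean(axis=1)   # accuracy of each individual tree
bagged_acc = (bagged_pred == y_te).mean()  # accuracy of the majority vote
print("mean accuracy of single trees: %.3f" % tree_accs.mean())
print("accuracy of the bagged vote:   %.3f" % bagged_acc)
```

Comparing the two printed values gives a rough feeling for the variance reduction mentioned in the comment above; the exact numbers depend on the split and the random seed.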

In [7]:
# Bagging based on decision trees -> Random Forest = ensemble of randomized trees, each trained on a random subset of the data, with the best split searched over a random subset of the features
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
rf = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, random_state=0)
scoresRF = cross_val_score(rf, X, y, cv=5)
scoresRF.mean()

0.9631113181183046

In [8]:
# Extra Trees classifier = ensemble of extremely randomized trees (split thresholds are drawn at random for each candidate feature)
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
etc = ExtraTreesClassifier(n_estimators=100, max_depth=None, min_samples_split=2, random_state=0)
scoresETC = cross_val_score(etc, X, y, cv=5)
scoresETC.mean()

0.9701288619779538

#### Example 1: visualization of decision surfaces for Random Forest and Extra Trees (iris dataset)
 * https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_iris.html#sphx-glr-download-auto-examples-ensemble-plot-forest-iris-py

### 2. Boosting - ensemble of (weak) models trained sequentially with different weights for the data instances

- AdaBoost
- Gradient Boosting

In [9]:
# AdaBoost
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
from sklearn.ensemble import AdaBoostClassifier
ab = AdaBoostClassifier(n_estimators=100, algorithm='SAMME',random_state=0)
scoresAB = cross_val_score(ab, X, y, cv=5)
scoresAB.mean()

0.9771619313771154

In [10]:
# Gradient Boosting
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
scoresGB = cross_val_score(gb, X, y, cv=5)
scoresGB.mean()

0.9666200900481293

### 3. Voting Classifiers

Idea: combine conceptually different classifiers and aggregate their results using:
(a) majority vote (hard vote)
(b) average of the predicted probabilities (soft vote)

Remark: Such a classifier can be useful for a set of equally well-performing models, in order to balance out their individual weaknesses.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html
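
The difference between the two aggregation rules shows up on a toy case. The probabilities below are made up for illustration (they do not come from any fitted model): two classifiers lean weakly towards class 0, while the third is very confident in class 1.

```python
import numpy as np

# Hypothetical predicted probabilities for ONE sample and two classes,
# one row per classifier (invented numbers, for illustration only)
proba = np.array([
    [0.55, 0.45],   # classifier A: weakly prefers class 0
    [0.60, 0.40],   # classifier B: weakly prefers class 0
    [0.05, 0.95],   # classifier C: strongly prefers class 1
])

# Hard vote: each classifier casts one vote for its most probable class
individual = proba.argmax(axis=1)
hard = np.bincount(individual, minlength=proba.shape[1]).argmax()

# Soft vote: average the probabilities, then pick the most probable class
mean_proba = proba.mean(axis=0)
soft = mean_proba.argmax()

print("individual votes:", individual)
print("hard vote ->", hard)
print("mean probabilities:", mean_proba, "soft vote ->", soft)
```

Here the hard vote picks class 0 (two votes against one), while the soft vote picks class 1 because the averaged probabilities are 0.4 and 0.6. In `VotingClassifier` the rule is selected with `voting='hard'` or `voting='soft'`; the soft variant requires every component to provide `predict_proba`.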

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

In [12]:
# classifiers included in the collection
clf1 = DecisionTreeClassifier(random_state=0)
clf2 = RandomForestClassifier(n_estimators=50, random_state=0)
clf3 = GaussianNB()

In [13]:
# definition of the collection
eclf = VotingClassifier(
    estimators=[('dt', clf1), ('rf', clf2), ('gnb', clf3)],
    voting='hard')

In [14]:
# performance analysis (the individual components and the collection)
for clf, label in zip([clf1, clf2, clf3, eclf], ['Decision Tree', 'Random Forest', 'naive Bayes', 'Ensemble']):
    scores = cross_val_score(clf, X, y, scoring='accuracy', cv=5)
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Accuracy: 0.92 (+/- 0.02) [Decision Tree]
Accuracy: 0.96 (+/- 0.02) [Random Forest]
Accuracy: 0.94 (+/- 0.01) [naive Bayes]
Accuracy: 0.95 (+/- 0.02) [Ensemble]


### 4. Stacking (layered ensemble)

Idea: train a model that aggregates the outputs of the models that are part of the ensemble

Example for a regression task
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingRegressor.html
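
Before using `StackingRegressor`, it may help to see the idea written out by hand. The sketch below is a simplified illustration, not the library's exact procedure: out-of-fold predictions of the first-level models become the training inputs of the second-level model. The choice of base models, the `LinearRegression` final model, and the names `meta_train`, `meta_test`, and `meta_model` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsRegressor

X_demo, y_demo = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

base_models = [RidgeCV(), KNeighborsRegressor(n_neighbors=20)]

# Level 1: out-of-fold predictions, so that the second-level model never sees
# a prediction made by a model that was trained on that same instance
meta_train = np.column_stack(
    [cross_val_predict(m, X_tr, y_tr, cv=5) for m in base_models])

# Level 2: a model that learns how to combine the first-level outputs
meta_model = LinearRegression().fit(meta_train, y_tr)

# At prediction time the base models are refitted on the whole training set
for m in base_models:
    m.fit(X_tr, y_tr)
meta_test = np.column_stack([m.predict(X_te) for m in base_models])

stacked_r2 = meta_model.score(meta_test, y_te)
print("R2 of the hand-made stack: %.2f" % stacked_r2)
```

Using out-of-fold predictions rather than in-sample ones is the key design choice: otherwise the second level would learn to trust whichever base model overfits the training data most.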

In [15]:
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import StackingRegressor

In [16]:
# Estimators for the first level (their outputs form the input vector of the second level)
estimators = [('ridge', RidgeCV()),
              ('lasso', LassoCV(random_state=42)),
              ('knr', KNeighborsRegressor(n_neighbors=20,metric='euclidean'))]

In [17]:
# Estimator for the second level (it can itself be an ensemble model)
final_estimator = GradientBoostingRegressor(n_estimators=25, subsample=0.5, min_samples_leaf=25, max_features=1,random_state=42)


In [18]:
# Definition of the stack of models
# (for classification: StackingClassifier)
reg = StackingRegressor(
    estimators=estimators,
    final_estimator=final_estimator)

In [19]:
# build the train/test datasets, then train the stack
X_train, X_test, y_train, y_test = train_test_split(Xregr, yregr, random_state=42)
reg.fit(X_train, y_train)

In [20]:
# performance evaluation
from sklearn.metrics import r2_score
y_pred = reg.predict(X_test)
print('R2 score: {:.2f}'.format(r2_score(y_test, y_pred)))

R2 score: 0.53


In [21]:
# extract the outputs of the stacked (first-level) estimators
reg.transform(X_test[:10])

array([[142.36209608, 138.30724927, 146.1       ],
       [179.700576  , 182.89812552, 151.75      ],
       [139.89817956, 132.46803343, 158.25      ],
       [286.95180286, 292.65695767, 225.4       ],
       [126.88317154, 124.1215975 , 164.65      ],
       [ 97.81615945,  90.22087426, 120.15      ],
       [248.29388426, 255.81999336, 237.65      ],
       [186.10079865, 180.14531235, 184.2       ],
       [ 88.04803953,  86.17057352,  83.45      ],
       [115.28314663, 109.14944015, 127.9       ]])
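
Each column returned by `transform` is the prediction of one first-level estimator, and with the default `passthrough=False` these columns are exactly what the final estimator receives. The following sketch checks this on a smaller stack; the estimators chosen and the names `stack`, `level1`, `manual`, and `direct` are assumptions made for this example, not the `reg` object fitted above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X_demo, y_demo = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

stack = StackingRegressor(
    estimators=[('ridge', RidgeCV()),
                ('knr', KNeighborsRegressor(n_neighbors=20))],
    final_estimator=LinearRegression())
stack.fit(X_tr, y_tr)

# One column per first-level estimator, one row per instance
level1 = stack.transform(X_te[:5])

# Feeding those columns to the fitted final estimator reproduces predict()
manual = stack.final_estimator_.predict(level1)
direct = stack.predict(X_te[:5])

print(level1.shape)
print(np.allclose(manual, direct))
```

The fitted first-level models are available in `stack.estimators_`, in the same order as the columns of `level1`.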

In [22]:
# multi-layer stack (the final estimator is itself a stacking model)

final_layer_rfr = RandomForestRegressor(
    n_estimators=10, max_features=1, max_leaf_nodes=5,random_state=42)
final_layer_gbr = GradientBoostingRegressor(
    n_estimators=10, max_features=1, max_leaf_nodes=5,random_state=42)
final_layer = StackingRegressor(
    estimators=[('rf', final_layer_rfr),
                ('gbrt', final_layer_gbr)],
    final_estimator=RidgeCV()
    )
multi_layer_regressor = StackingRegressor(
    estimators=[('ridge', RidgeCV()),
                ('lasso', LassoCV(random_state=42)),
                ('knr', KNeighborsRegressor(n_neighbors=20,
                                            metric='euclidean'))],
    final_estimator=final_layer
)
multi_layer_regressor.fit(X_train, y_train)

print('R2 score: {:.2f}'
      .format(multi_layer_regressor.score(X_test, y_test)))

R2 score: 0.53


#### Exercise/Homework:
 * analyze the strategies for defining pipelines (https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html)
 * analyze an example of constructing stacking-based models (https://scikit-learn.org/stable/auto_examples/ensemble/plot_stack_predictors.html#sphx-glr-auto-examples-ensemble-plot-stack-predictors-py)
 * design a stacking-based classifier for the airlines_delay.csv dataset (see Lab 3) or for a dataset of your choice
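
As a possible starting point for the last item, here is a minimal sketch of a stacking classifier in which one first-level model is a pipeline. It uses the built-in breast cancer dataset because the columns of airlines_delay.csv are not described here; the choice of models, their hyperparameters, and the names `level1_models` and `stack_clf` are assumptions, not a reference solution.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# First level: a scaled k-NN pipeline and a random forest
level1_models = [
    ('knn', make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Second level: logistic regression on the first-level outputs
stack_clf = StackingClassifier(
    estimators=level1_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)

scores = cross_val_score(stack_clf, X_demo, y_demo, cv=5)
print("Accuracy: %0.2f (+/- %0.2f) [Stacking]" % (scores.mean(), scores.std()))
```

Wrapping the scaler and the k-NN model in one pipeline keeps the scaling inside each cross-validation fold, so no information from the validation part leaks into the preprocessing.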