# Ensemble Model
- Voting model
    - hard voting (mode) - only for classification
    - soft voting (prob) - classification; for regression the predictions are averaged instead
        - Averaging model

- Bagging (combination)
    - combine many copies of the same base model, each trained on a bootstrap sample
- Boosting (self-revise)
    - AdaBoost (sequential)
    - Gradient Boosting
        - XGBoost (parallel tree construction)
        - LightGBM (parallel tree construction)
        - CatBoost
- Stacking (relay races)
    - Hand made using Sklearn
        - prepare dataset
        - build first layer of estimators
        - append predictions to the dataset
        - build second layer meta estimator
        - use the stacked model for prediction
    - MLxtend
        - A package that provides ready-made stacking models

### Voting classifier

- The final prediction is the mode (the most common label) of the individual models' predictions.
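
To make the "mode" idea concrete, here is a minimal pure-Python sketch of hard voting. The function name `hard_vote` and the three prediction lists are hypothetical, and the tie rule (first label seen wins) is an arbitrary choice for this sketch rather than a description of any library's behavior.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Return the majority label for each sample.

    predictions_per_model: list of equal-length label lists, one per model.
    """
    voted = []
    for sample_preds in zip(*predictions_per_model):
        # most_common(1) returns [(label, count)] for the most frequent label
        voted.append(Counter(sample_preds).most_common(1)[0][0])
    return voted

# Hypothetical predictions from three models on four samples
preds_knn = [0, 1, 1, 0]
preds_lr = [0, 1, 0, 0]
preds_dt = [1, 1, 0, 0]

final = hard_vote([preds_knn, preds_lr, preds_dt])
print(final)
```

Using an odd number of models on a binary target avoids ties altogether.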

In [2]:
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import VotingRegressor


In [None]:
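# The voting classifier is used for classification; the final prediction is the mode
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Instantiate the individual models
clf_knn = KNeighborsClassifier(n_neighbors=5)
clf_lr = LogisticRegression(class_weight="balanced")
clf_dt = DecisionTreeClassifier(min_samples_leaf=3, min_samples_split=9, random_state=500)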

# Create and fit the voting classifier
clf_vote = VotingClassifier(
    estimators=[('knn', clf_knn), ('lr', clf_lr), ('dt', clf_dt)]
)
clf_vote.fit(X_train, y_train)

# Calculate the predictions using the voting classifier
pred_vote = clf_vote.predict(X_test)


In [None]:

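from sklearn.metrics import f1_score, classification_report

# Calculate the F1-Score of the voting classifier (the default setting assumes a binary target)
score_vote = f1_score(y_test, pred_vote)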
print('F1-Score: {:.3f}'.format(score_vote))

# Calculate the classification report
report = classification_report(y_test, pred_vote)
print(report)

### Averaging (soft voting)

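- Soft voting is used for classification with the argument voting='soft', optionally with weights that give each model a different influence based on its individual performance. For regression, VotingRegressor averages the models' predictions instead.

- The class probabilities predicted by each model are averaged, and the label with the highest average probability is the result.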
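
The averaging step can be sketched in a few lines of plain Python. This is only an illustration under assumed inputs: `soft_vote` and the probability rows `p_lr`, `p_dt`, `p_svm` are hypothetical and chosen so that soft voting disagrees with a simple majority vote.

```python
def soft_vote(probas_per_model, weights=None):
    """Weighted average of class probabilities, then argmax per sample.

    probas_per_model: one entry per model, each a list of per-class
    probability rows such as [[0.6, 0.4], ...].
    """
    n_models = len(probas_per_model)
    weights = weights or [1.0] * n_models
    total = sum(weights)
    labels, averaged = [], []
    for rows in zip(*probas_per_model):      # rows holds one row per model
        n_classes = len(rows[0])
        avg = [sum(w * row[c] for w, row in zip(weights, rows)) / total
               for c in range(n_classes)]
        averaged.append(avg)
        labels.append(max(range(n_classes), key=avg.__getitem__))
    return labels, averaged

# Hypothetical probabilities for a single sample from three models
p_lr = [[0.55, 0.45]]    # leans towards class 0
p_dt = [[0.10, 0.90]]    # confident in class 1
p_svm = [[0.60, 0.40]]   # leans towards class 0

labels_equal, avg_equal = soft_vote([p_lr, p_dt, p_svm])
labels_weighted, avg_weighted = soft_vote([p_lr, p_dt, p_svm], weights=[1, 2, 1])
print(labels_equal, labels_weighted)
```

Two of the three models lean towards class 0, so a majority vote would pick class 0, while the averaged probabilities favor class 1 because the one dissenting model is much more confident.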

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Build the individual models
clf_lr = LogisticRegression(class_weight='balanced')
clf_dt = DecisionTreeClassifier(min_samples_leaf=3, min_samples_split=9, random_state=500)
clf_svm = SVC(probability=True, class_weight='balanced', random_state=500)

# List of (string, estimator) tuples
estimators = [('lr', clf_lr), ('dt', clf_dt), ('svm', clf_svm)]

# Build and fit an averaging classifier; weights gives the decision tree twice the influence
clf_avg = VotingClassifier(estimators=estimators, voting='soft', weights=[1, 2, 1])
clf_avg.fit(X_train, y_train)

# Evaluate model performance
acc_avg = accuracy_score(y_test,  clf_avg.predict(X_test))
print('Accuracy: {:.2f}'.format(acc_avg))


### Bagging
- Combine N copies of the same weak model, each trained on a different sample of the training data.
- OOB (out-of-bag): each model is fit on a sample drawn from the training set rather than on all of it, so the rows left out of that model's sample can serve as a test set to evaluate the bagging ensemble.
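
A small sketch of where the out-of-bag rows come from, using only the standard library. The sample size and seed are arbitrary assumptions. With sampling with replacement, roughly a third of the rows are typically left out of each bootstrap sample.

```python
import random

rng = random.Random(500)          # fixed seed so the sketch is repeatable
n_samples = 1000
indices = list(range(n_samples))

# Bootstrap sample: draw n_samples row indices with replacement
in_bag = [rng.choice(indices) for _ in range(n_samples)]

# Out-of-bag rows: never drawn for this particular model
oob = sorted(set(indices) - set(in_bag))
oob_fraction = len(oob) / n_samples
print('OOB fraction: {:.3f}'.format(oob_fraction))
```

Each model in the ensemble gets its own bootstrap sample, so each has a different set of out-of-bag rows to be evaluated on.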



In [7]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
from sklearn.metrics import accuracy_score

# Instantiate the base model
clf_dt = DecisionTreeClassifier(max_depth=4, class_weight='balanced')
# class_weight='balanced' weights classes inversely to their frequency (useful for imbalanced targets)

# Build and train the bagging classifier
clf_bag = BaggingClassifier(
  clf_dt,   # base model
  n_estimators=21,   # number of models in the ensemble (more is usually better, e.g. 100-500)
  oob_score=True,   # compute the out-of-bag score
  random_state=500,
  bootstrap=True)   # sample with replacement; oob_score=True requires bootstrap=True
clf_bag.fit(X_train, y_train)

# Print the out-of-bag score
print('OOB-Score: {:.3f}'.format(clf_bag.oob_score_))

# Evaluate the performance on the test set to compare
pred = clf_bag.predict(X_test)
print('Accuracy: {:.3f}'.format(accuracy_score(y_test, pred)))

### Boosting
- Model fitting works like: try, get feedback, correct the errors
    - iterative learning
- It is a sequential process, not a parallel one: each model depends on the errors of the previous ones.
- Stop training when the residuals look like white noise (no pattern left to learn).
- Models:
    - AdaBoost
    - Gradient Boosting
        - XGBoost (parallel tree construction)
        - LightGBM
        - CatBoost

### AdaBoost (Adaptive Boosting)
- AdaBoost combines multiple “weak classifiers” into a single “strong classifier”: each round gives more weight to the samples the previous models got wrong.
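
The re-weighting idea can be sketched with the classic binary AdaBoost update. This is a textbook-style illustration, not the exact variant implemented in scikit-learn. The function `adaboost_round` and the toy inputs are hypothetical, and the sketch assumes the weighted error is strictly between 0 and 1.

```python
import math

def adaboost_round(weights, is_correct):
    """One round of the classic binary AdaBoost re-weighting.

    weights: current sample weights (summing to 1).
    is_correct: per-sample booleans for the weak classifier just trained.
    Returns (alpha, new_weights).
    """
    # Weighted error of the weak classifier
    eps = sum(w for w, ok in zip(weights, is_correct) if not ok)
    # Vote strength of this classifier in the final ensemble
    alpha = 0.5 * math.log((1 - eps) / eps)
    # Shrink the weights of correct samples, grow the weights of mistakes
    raw = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, is_correct)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

# Hypothetical round: 10 equally weighted samples, the weak model misses 2
weights = [0.1] * 10
is_correct = [True] * 8 + [False] * 2
alpha, new_weights = adaboost_round(weights, is_correct)
print(alpha, new_weights[0], new_weights[-1])
```

After the update the misclassified samples together carry half of the total weight, which forces the next weak classifier to focus on them.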

In [None]:
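import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error

# Build a base model
# AdaBoost can be used for both regression and classification
# (the normalize argument was removed from LinearRegression in scikit-learn 1.2)
reg_lm = LinearRegression()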

# Build and fit an AdaBoost regressor
reg_ada = AdaBoostRegressor(reg_lm, n_estimators=12, random_state=500)
reg_ada.fit(X_train, y_train)

# Calculate the predictions on the test set
pred = reg_ada.predict(X_test)

# Evaluate the performance using the RMSE
rmse = np.sqrt(mean_squared_error(y_test, pred))
print('RMSE: {:.3f}'.format(rmse))

In [None]:
# Build and fit a tree-based AdaBoost regressor
# Without an explicit base model, AdaBoostRegressor uses a shallow decision tree
reg_ada = AdaBoostRegressor(n_estimators=12, random_state=500)
reg_ada.fit(X_train, y_train)

# Calculate the predictions on the test set
pred = reg_ada.predict(X_test)

# Evaluate the performance using the RMSE
rmse = np.sqrt(mean_squared_error(y_test, pred))
print('RMSE: {:.3f}'.format(rmse))

### Gradient Boosting
- The initial model is a weak estimator; each later model is fit to the errors left by the ensemble so far.
- It is recommended to use all the features when building the classifier.
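
A from-scratch sketch of the idea for squared-error regression: start from a weak model (the mean) and repeatedly fit a one-split stump to the current residuals. The helper names `fit_stump` and `mse`, the toy data, and the learning rate are all assumptions made for illustration. Library implementations are considerably more elaborate.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for threshold in sorted(set(x))[:-1]:    # keep both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)

# Hypothetical toy data with three plateaus
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 1.1, 3.0, 3.2, 3.1, 6.0, 6.2]

learning_rate = 0.5
pred = [sum(y) / len(y)] * len(y)      # initial weak model: predict the mean
history = [mse(y, pred)]

for _ in range(10):
    residuals = [yi - pi for yi, pi in zip(y, pred)]   # what is still unexplained
    stump = fit_stump(x, residuals)                    # weak learner fit to the errors
    pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    history.append(mse(y, pred))

print(['{:.3f}'.format(h) for h in history])
```

The training error in `history` never increases from one round to the next, because every stump explains part of the remaining residuals.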

In [None]:
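from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Build and fit a Gradient Boosting classifier
clf_gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=500)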
clf_gbm.fit(X_train, y_train)

# Calculate the predictions on the test set
pred = clf_gbm.predict(X_test)

# Evaluate the performance based on the accuracy
acc = accuracy_score(y_test, pred)
print('Accuracy: {:.3f}'.format(acc))

# Get and show the Confusion Matrix
cm = confusion_matrix(y_test, pred)
print(cm)

### XGBoost
- optimized for distributed computing
- boosting rounds are still sequential, but the construction of each tree is parallelized

In [None]:
import xgboost as xgb

# Build and fit an XGBoost regressor
reg_xgb = xgb.XGBRegressor(max_depth=3, learning_rate=0.1, n_estimators=100, random_state=500)
reg_xgb.fit(X_train, y_train)

# Calculate the predictions and evaluate the regressor
pred_xgb = reg_xgb.predict(X_test)
rmse_xgb = np.sqrt(mean_squared_error(y_test, pred_xgb))

print('RMSE (XGBoost): {:.3f}'.format(rmse_xgb))

### LightGBM
- fast and efficient training
- low memory usage
- optimized for parallel and GPU processing
- very useful for big datasets

In [None]:
import lightgbm as lgb

# Build and fit a LightGBM regressor
reg_lgb = lgb.LGBMRegressor(max_depth=3, learning_rate=0.1, n_estimators=100, random_state=500)
reg_lgb.fit(X_train, y_train)

# Calculate the predictions and evaluate the regressor
pred_lgb = reg_lgb.predict(X_test)
rmse_lgb = np.sqrt(mean_squared_error(y_test, pred_lgb))

print('RMSE (LightGBM): {:.3f}'.format(rmse_lgb))

### CatBoost
- Has built-in handling of categorical features, so we do not need to encode them ourselves
- Accurate, robust, fast and scalable
- Comes in its own package, called catboost

In [12]:
# Install the catboost package
%pip install catboost
%pip install ipywidgets
# jupyter is a shell command, not a line magic, so it needs the ! prefix
!jupyter nbextension enable --py widgetsnbextension

Note: you may need to restart the kernel to use updated packages.


In [13]:
import catboost as cb

In [None]:
# Build and fit a CatBoost regressor
reg_cat = cb.CatBoostRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=500)
reg_cat.fit(X_train, y_train)

# Calculate the predictions on the test set
pred = reg_cat.predict(X_test)

# Evaluate the performance using the RMSE
rmse_cat = np.sqrt(mean_squared_error(y_test, pred))
print('RMSE (CatBoost): {:.3f}'.format(rmse_cat))

### Stacking
- can be built by hand with scikit-learn, as in the steps below, or with a ready-made implementation such as MLxtend
- process:
    - prepare dataset
    - build first layer of estimators
    - append predictions to the dataset
    - build second layer meta estimator
    - use the stacked model for prediction
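
For comparison with these steps, scikit-learn (version 0.22 and later) also ships a `StackingClassifier` in `sklearn.ensemble`. The sketch below is one possible way to use it and is not part of the original notebook. The synthetic data from `make_classification` and all parameter values are assumptions. `passthrough=True` corresponds to appending the first-layer predictions to the original features, and `cv=5` makes the meta estimator learn from out-of-fold predictions rather than in-sample ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; replace with your own X and y)
X, y = make_classification(n_samples=300, n_features=8, random_state=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=500)

# First layer of estimators
first_layer = [
    ('dt', DecisionTreeClassifier(min_samples_leaf=3, min_samples_split=9,
                                  random_state=500)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
]

# Second-layer meta estimator on top of the first layer
clf_stack = StackingClassifier(
    estimators=first_layer,
    final_estimator=DecisionTreeClassifier(random_state=500),
    cv=5,               # meta estimator is trained on out-of-fold predictions
    passthrough=True,   # also pass the original features to the meta estimator
)
clf_stack.fit(X_train, y_train)

# Use the stacked model for prediction
pred_stack = clf_stack.predict(X_test)
acc_stack = accuracy_score(y_test, pred_stack)
print('Accuracy: {:0.4f}'.format(acc_stack))
```

Training the meta estimator on out-of-fold predictions is a common way to keep it from simply trusting first-layer models that have memorized the training set.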


#### First, build first layer of estimators

In [None]:
# Build and fit a Decision Tree classifier
clf_dt = DecisionTreeClassifier(min_samples_leaf=3, min_samples_split=9, random_state=500)
clf_dt.fit(X_train, y_train)

# Build and fit a 5-nearest neighbors classifier using the 'Ball-Tree' algorithm
clf_knn = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')
clf_knn.fit(X_train, y_train)

# Evaluate the performance using the accuracy score
print('Decision Tree: {:0.4f}'.format(accuracy_score(y_test, clf_dt.predict(X_test))))
print('5-Nearest Neighbors: {:0.4f}'.format(accuracy_score(y_test, clf_knn.predict(X_test))))

#### Second, append predictions to dataset - X_new_train

In [None]:
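import pandas as pd

# Predictions of the first-layer models on the training set
pred_dt = clf_dt.predict(X_train)
pred_knn = clf_knn.predict(X_train)

# Create a Pandas DataFrame with the predictions
pred_df = pd.DataFrame({
    'pred_dt': pred_dt,
    'pred_knn': pred_knn
}, index=X_train.index)    # reuse X_train.index so the rows line up when concatenating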

# Concatenate X_train with the predictions DataFrame
X_train_2nd = pd.concat([X_train, pred_df], axis=1)


#### Third, Build second layer meta estimator

In [None]:
# Build the second-layer meta estimator
clf_stack = DecisionTreeClassifier(random_state=500)
clf_stack.fit(X_train_2nd, y_train)

#### Fourth, create a Pandas DataFrame with the predictions (X_new_test) and do the prediction

In [None]:
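# Predictions of the first-layer models on the test set
pred_dt = clf_dt.predict(X_test)
pred_knn = clf_knn.predict(X_test)

# Create a Pandas DataFrame with the predictions
pred_df = pd.DataFrame({
    'pred_dt': pred_dt,
    'pred_knn': pred_knn
}, index=X_test.index)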

# Concatenate X_test with the predictions DataFrame
X_test_2nd = pd.concat([X_test, pred_df ], axis=1)

# Obtain the final predictions from the second-layer estimator
pred_stack = clf_stack.predict(X_test_2nd)

# Evaluate the new performance on the test set
print('Accuracy: {:0.4f}'.format(accuracy_score(y_test, pred_stack)))

### MLxtend

In [17]:
# Install the mlxtend package
%pip install mlxtend

Note: you may need to restart the kernel to use updated packages.


In [1]:
from mlxtend.classifier import StackingClassifier
from mlxtend.regressor import StackingRegressor

In [None]:
#Classification stacking

# Instantiate the first-layer classifiers
clf_dt = DecisionTreeClassifier(min_samples_leaf=3, min_samples_split=9, random_state=500)
clf_knn = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')

# Instantiate the second-layer meta classifier
clf_meta = DecisionTreeClassifier(random_state=500)

# Build the Stacking classifier
clf_stack = StackingClassifier(classifiers=[clf_dt, clf_knn], meta_classifier=clf_meta, use_features_in_secondary=True)
clf_stack.fit(X_train, y_train)

# Evaluate the performance of the Stacking classifier
pred_stack = clf_stack.predict(X_test)
print("Accuracy: {:0.4f}".format(accuracy_score(y_test, pred_stack)))

In [None]:
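# Regression stacking
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error

# Instantiate the 1st-layer regressors
reg_dt = DecisionTreeRegressor(min_samples_leaf=11, min_samples_split=33, random_state=500)
reg_lr = LinearRegression()   # the normalize argument was removed in scikit-learn 1.2
reg_ridge = Ridge(random_state=500)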

# Instantiate the 2nd-layer regressor
reg_meta = LinearRegression()

# Build the Stacking regressor
reg_stack = StackingRegressor(regressors=[reg_dt, reg_lr, reg_ridge], meta_regressor=reg_meta)
reg_stack.fit(X_train, y_train)

# Evaluate the performance on the test set using the MAE metric
pred = reg_stack.predict(X_test)
print('MAE: {:.3f}'.format(mean_absolute_error(y_test, pred)))