# Ensemble of ensembles - model stacking

* **Ensemble with different types of classifiers**: 
  * Different types of classifiers (e.g., logistic regression, decision trees, random forest) are fitted on the same training data
  * Results are combined by either 
    * majority voting (classification) or 
    * averaging (regression)
  

* **Ensemble with a single type of classifier**: 
  * Bootstrap samples are drawn from the training data 
  * A model (e.g., a decision tree or a random forest) is fitted on each bootstrap sample 
  * The results of all fitted models are combined to create an ensemble. 
  * Suitable for highly flexible models that are prone to overfitting (high variance). 
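
The two ensemble types above can be sketched with scikit-learn's `VotingClassifier` and `BaggingClassifier`. This is a minimal illustration on synthetic data, not the attrition dataset used later; the names `X_demo`, `voting_demo` and `bagging_demo` and all parameter values are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the attrition dataset used below).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Ensemble with different types of classifiers: all are fitted on the same
# training data and combined by majority ("hard") voting.
voting_demo = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="hard",
)
voting_demo.fit(Xd_train, yd_train)

# Ensemble with a single type of classifier: each tree is fitted on its own
# bootstrap sample of the training data.
bagging_demo = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=50,
    bootstrap=True,
    random_state=0,
)
bagging_demo.fit(Xd_train, yd_train)

print("voting accuracy: ", voting_demo.score(Xd_test, yd_test))
print("bagging accuracy:", bagging_demo.score(Xd_test, yd_test))
```
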

***

## Combining Method

* **Majority voting or average**: 
  * Classification: Largest number of votes (mode) 
  * Regression problems: Average (mean).
  
  
* **Meta-classifier applied to predicted outcomes**: 
  * Each individual classifier outputs a binary outcome (0 / 1).
  * A meta-classifier is fitted on top of these outcomes. 
  
  
* **Meta-classifier applied to predicted probabilities**: 
  * Each individual classifier outputs a class probability. 
  * A meta-classifier is fitted on top of these probabilities.
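
As a small worked illustration of these combining methods, the sketch below uses hand-made arrays (hypothetical values, not model outputs from this notebook) to show a per-sample majority vote, a per-sample average, and how probabilities would be arranged as input features for a meta-classifier.

```python
import numpy as np

# Hypothetical 0 / 1 outputs of three classifiers for four samples
# (rows = classifiers, columns = samples).
votes = np.array([[1, 0, 1, 1],
                  [1, 1, 0, 1],
                  [0, 0, 1, 1]])

# Majority voting: the most frequent label (mode) per sample.
majority = np.array([np.bincount(col).argmax() for col in votes.T])

# Regression: average the individual predictions per sample.
reg_preds = np.array([[2.0, 3.5],
                      [2.4, 3.1],
                      [2.2, 3.3]])
averaged = reg_preds.mean(axis=0)

# Probabilities instead of 0 / 1 outcomes: a meta-classifier would take one
# column per base classifier as its input features.
probs = np.array([[0.9, 0.4, 0.6, 0.8],
                  [0.7, 0.6, 0.3, 0.9],
                  [0.4, 0.2, 0.7, 0.6]])
meta_features = probs.T  # shape (n_samples, n_classifiers)

print(majority)
print(averaged)
print(meta_features.shape)
```
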
  

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from google.colab import files

In [0]:
upload_file = files.upload()

Saving WA_Fn-UseC_-HR-Employee-Attrition.csv to WA_Fn-UseC_-HR-Employee-Attrition.csv


In [0]:
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
num_col = list(df.describe().columns)
col_categorical = list(set(df.columns).difference(num_col))
remove_list = ['EmployeeCount', 'EmployeeNumber', 'StandardHours']
col_numerical = [e for e in num_col if e not in remove_list]
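# Note the encoding: 'Yes' (the employee left) maps to 0 and 'No' maps to 1,
# so class 0 is the minority class in every report below.
attrition_to_num = {'Yes': 0,
                    'No': 1}
df['Attrition_num'] = df['Attrition'].map(attrition_to_num)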
col_categorical.remove('Attrition')
df_cat = pd.get_dummies(df[col_categorical])
X = pd.concat([df[col_numerical], df_cat], axis=1)
y = df['Attrition_num']

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2)

In [0]:
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    '''
    print the accuracy score, classification report and confusion matrix of classifier
    '''
    if train:
        '''
        training performance
        '''
        print("Train Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, clf.predict(X_train))))
        print("Classification Report: \n {}\n".format(classification_report(y_train, clf.predict(X_train))))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_train, clf.predict(X_train))))

        res = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
        print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
        print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))
        
    elif train==False:
        '''
        test performance
        '''
        print("Test Result:\n")        
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, clf.predict(X_test))))
        print("Classification Report: \n {}\n".format(classification_report(y_test, clf.predict(X_test))))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, clf.predict(X_test))))    

In [0]:
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix, roc_auc_score
def print_score(clf, X_train, X_test, y_train, y_test, train=True):
    '''
    v0.1 Follow the scikit-learn argument order (X_train, X_test, y_train, y_test).
    Print the accuracy score, classification report, confusion matrix and
    ROC AUC (computed from hard class predictions) of a fitted classifier.
    '''
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_train)
    if train:
        # training performance
        res_train = clf.predict(X_train)  # predict once, reuse for every metric
        print("Train Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, res_train)))
        print("Classification Report: \n {}\n".format(classification_report(y_train, res_train)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_train, res_train)))
        print("ROC AUC: {0:.4f}\n".format(roc_auc_score(lb.transform(y_train), 
                                                        lb.transform(res_train))))

        #cv_res = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
        #print("Average Accuracy: \t {0:.4f}".format(np.mean(cv_res)))
        #print("Accuracy SD: \t\t {0:.4f}".format(np.std(cv_res)))
        
    else:
        # test performance
        res_test = clf.predict(X_test)  # predict once, reuse for every metric
        print("Test Result:\n")        
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, res_test)))
        print("Classification Report: \n {}\n".format(classification_report(y_test, res_test)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, res_test)))      
        print("ROC AUC: {0:.4f}\n".format(roc_auc_score(lb.transform(y_test), lb.transform(res_test))))
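
`print_score` feeds hard class predictions into `roc_auc_score`, which reduces the ROC curve to a single operating point. `roc_auc_score` also accepts continuous scores such as `clf.predict_proba(X)[:, 1]`. The sketch below uses hypothetical labels and probabilities (not outputs from this notebook) to show how the two inputs can give different values.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and positive-class probabilities (illustration only).
y_true = np.array([0, 0, 1, 1])
proba_pos = np.array([0.1, 0.2, 0.3, 0.4])

# Hard labels at the default 0.5 threshold: every sample is predicted as 0.
hard_labels = (proba_pos >= 0.5).astype(int)

# AUC from hard labels only sees the single thresholded operating point.
auc_from_labels = roc_auc_score(y_true, hard_labels)
# AUC from probabilities uses the full ranking of the samples.
auc_from_scores = roc_auc_score(y_true, proba_pos)

print(auc_from_labels, auc_from_scores)
```

Here the probabilities rank both positives above both negatives, yet none of them crosses the 0.5 threshold, so the label-based value is much lower than the score-based one.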
        

## Model 1: Decision Tree

In [0]:
from sklearn.tree import DecisionTreeClassifier

In [0]:
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [0]:
print_score(tree_clf, X_train, X_test, y_train, y_test, train=True)
print_score(tree_clf, X_train, X_test, y_train, y_test, train=False)

Train Result:

accuracy score: 1.0000

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       177
           1       1.00      1.00      1.00       999

    accuracy                           1.00      1176
   macro avg       1.00      1.00      1.00      1176
weighted avg       1.00      1.00      1.00      1176


Confusion Matrix: 
 [[177   0]
 [  0 999]]

ROC AUC: 1.0000

Test Result:

accuracy score: 0.7585

Classification Report: 
               precision    recall  f1-score   support

           0       0.37      0.27      0.31        60
           1       0.82      0.88      0.85       234

    accuracy                           0.76       294
   macro avg       0.60      0.58      0.58       294
weighted avg       0.73      0.76      0.74       294


Confusion Matrix: 
 [[ 16  44]
 [ 27 207]]

ROC AUC: 0.5756



## Model 2: Random Forest

In [0]:
from sklearn.ensemble import RandomForestClassifier

In [0]:
rf_clf = RandomForestClassifier(n_estimators=100)
rf_clf.fit(X_train, y_train.ravel())

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [0]:
print_score(rf_clf, X_train, X_test, y_train, y_test, train=True)
print_score(rf_clf, X_train, X_test, y_train, y_test, train=False)

Train Result:

accuracy score: 1.0000

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       177
           1       1.00      1.00      1.00       999

    accuracy                           1.00      1176
   macro avg       1.00      1.00      1.00      1176
weighted avg       1.00      1.00      1.00      1176


Confusion Matrix: 
 [[177   0]
 [  0 999]]

ROC AUC: 1.0000

Test Result:

accuracy score: 0.8095

Classification Report: 
               precision    recall  f1-score   support

           0       0.70      0.12      0.20        60
           1       0.81      0.99      0.89       234

    accuracy                           0.81       294
   macro avg       0.76      0.55      0.55       294
weighted avg       0.79      0.81      0.75       294


Confusion Matrix: 
 [[  7  53]
 [  3 231]]

ROC AUC: 0.5519



In [0]:
en_en = pd.DataFrame()

In [0]:
tree_clf.predict_proba(X_train)

array([[0., 1.],
       [1., 0.],
       [0., 1.],
       ...,
       [0., 1.],
       [1., 0.],
       [0., 1.]])

In [0]:
en_en['tree_clf'] = pd.DataFrame(tree_clf.predict_proba(X_train))[1]
en_en['rf_clf'] =  pd.DataFrame(rf_clf.predict_proba(X_train))[1]
col_name = en_en.columns
en_en = pd.concat([en_en, pd.DataFrame(y_train).reset_index(drop=True)], axis=1)

In [0]:
en_en.head()

|   | tree_clf | rf_clf | Attrition_num |
|---|---|---|---|
| 0 | 1.0 | 0.89 | 1 |
| 1 | 0.0 | 0.28 | 0 |
| 2 | 1.0 | 0.89 | 1 |
| 3 | 1.0 | 0.97 | 1 |
| 4 | 0.0 | 0.25 | 0 |


In [0]:
tmp = list(col_name)
tmp.append('ind')
en_en.columns = tmp

In [0]:
en_en.head()

|   | tree_clf | rf_clf | ind |
|---|---|---|---|
| 0 | 1.0 | 0.89 | 1 |
| 1 | 0.0 | 0.28 | 0 |
| 2 | 1.0 | 0.89 | 1 |
| 3 | 1.0 | 0.97 | 1 |
| 4 | 0.0 | 0.25 | 0 |


## Meta Classifier

In [0]:
from sklearn.linear_model import LogisticRegression

In [0]:
m_clf = LogisticRegression(fit_intercept=False, solver='lbfgs')

In [0]:
m_clf.fit(en_en[['tree_clf', 'rf_clf']], en_en['ind'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=False,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
en_test = pd.DataFrame()

In [0]:
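# Positive-class probabilities from each base model become the meta-features
en_test['tree_clf'] = pd.DataFrame(tree_clf.predict_proba(X_test))[1]
en_test['rf_clf'] =  pd.DataFrame(rf_clf.predict_proba(X_test))[1]
# The meta-classifier combines the two probability columns into one prediction
en_test['combined'] = m_clf.predict(en_test[['tree_clf', 'rf_clf']])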

In [0]:
col_name = en_test.columns
tmp = list(col_name)
tmp.append('ind')

In [0]:
tmp

['tree_clf', 'rf_clf', 'combined', 'ind']

In [0]:
en_test = pd.concat([en_test, pd.DataFrame(y_test).reset_index(drop=True)], axis=1)

In [0]:
en_test.columns = tmp

In [0]:
print(pd.crosstab(en_test['ind'], en_test['combined']))

combined   0    1
ind              
0         16   44
1         27  207


In [0]:
print(round(accuracy_score(en_test['ind'], en_test['combined']), 4))

0.7585


In [0]:
print(classification_report(en_test['ind'], en_test['combined']))

              precision    recall  f1-score   support

           0       0.37      0.27      0.31        60
           1       0.82      0.88      0.85       234

    accuracy                           0.76       294
   macro avg       0.60      0.58      0.58       294
weighted avg       0.73      0.76      0.74       294
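
The stacked test confusion matrix above is identical to the one reported for the decision tree alone. A likely reason is that the meta-classifier was fitted on predictions for the same rows the base models were trained on, where the tree's probabilities are exactly 0 or 1 and match the labels, so the meta-classifier can simply follow the tree. A common remedy is to build the meta-features from out-of-fold predictions. The sketch below shows this with `cross_val_predict` on synthetic stand-in data; the variable names, fold count and model settings are assumptions for illustration, not the notebook's method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; swap in X_train / y_train from above).
X_demo, y_demo = make_classification(n_samples=600, n_features=12,
                                     weights=[0.2, 0.8], random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

base_models = {
    "tree_clf": DecisionTreeClassifier(random_state=0),
    "rf_clf": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Out-of-fold probabilities: every training row is scored by a model that
# did not see that row during fitting.
oof_train = np.column_stack([
    cross_val_predict(model, Xs_train, ys_train, cv=5,
                      method="predict_proba")[:, 1]
    for model in base_models.values()
])

# Fit the meta-classifier on the out-of-fold features.
meta_demo = LogisticRegression(solver="lbfgs")
meta_demo.fit(oof_train, ys_train)

# Refit the base models on the full training set to score the test set.
test_features = np.column_stack([
    model.fit(Xs_train, ys_train).predict_proba(Xs_test)[:, 1]
    for model in base_models.values()
])
stacked_pred = meta_demo.predict(test_features)
print("stacked accuracy:", (stacked_pred == ys_test).mean())
```

scikit-learn also provides `StackingClassifier`, which wraps the same out-of-fold procedure in a single estimator.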



***

## Single Classifier

In [0]:
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

In [0]:
df.head()

|   | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | 2 | Female | 94 | 3 | 2 | Sales Executive | 4 | Single | 5993 | 19479 | 8 | Y | Yes | 11 | 3 | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | 3 | Male | 61 | 2 | 2 | Research Scientist | 2 | Married | 5130 | 24907 | 1 | Y | No | 23 | 4 | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | 4 | Male | 92 | 2 | 1 | Laboratory Technician | 3 | Single | 2090 | 2396 | 6 | Y | Yes | 15 | 3 | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | 4 | Female | 56 | 3 | 1 | Research Scientist | 3 | Married | 2909 | 23159 | 1 | Y | Yes | 11 | 3 | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | 1 | Male | 40 | 3 | 1 | Laboratory Technician | 2 | Married | 3468 | 16632 | 9 | Y | No | 12 | 3 | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


In [0]:
df.Attrition.value_counts() / df.Attrition.count()

No     0.838776
Yes    0.161224
Name: Attrition, dtype: float64

In [0]:
from sklearn.ensemble import RandomForestClassifier

In [0]:
from sklearn.ensemble import BaggingClassifier

In [0]:
from sklearn.ensemble import AdaBoostClassifier

In [0]:
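# Each class is weighted by the share of the other class in the full dataset,
# so the minority class 0 ('Yes') receives the larger weight.
class_weight = {0:0.839, 1:0.161}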

In [0]:
pd.Series(list(y_train)).value_counts() / pd.Series(list(y_train)).count()

1    0.84949
0    0.15051
dtype: float64
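
The hard-coded weights above come from the class shares of the full dataset, which differ slightly from the shares in `y_train`. As an alternative sketch, weights can be derived from the labels themselves with scikit-learn's `compute_class_weight`. The label vector below is hypothetical and only mimics the imbalance; the `"balanced"` heuristic is a swapped-in technique, not what the notebook uses.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label vector with roughly the same imbalance as y_train
# (about 15% class 0 and 85% class 1).
y_demo = np.array([0] * 15 + [1] * 85)

classes = np.unique(y_demo)
# "balanced" weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
class_weight_demo = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight_demo)
```

Passing `class_weight="balanced"` directly to `RandomForestClassifier` applies the same heuristic to the data it is fitted on.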

In [0]:
forest = RandomForestClassifier(class_weight=class_weight, n_estimators=100)

In [0]:
ada = AdaBoostClassifier(base_estimator=forest, n_estimators=100,
                         learning_rate=0.5, random_state=42)

In [0]:
ada.fit(X_train, y_train.ravel())

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=RandomForestClassifier(bootstrap=True,
                                                         class_weight={0: 0.839,
                                                                       1: 0.161},
                                                         criterion='gini',
                                                         max_depth=None,
                                                         max_features='auto',
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
            

In [0]:
print_score(ada, X_train, X_test, y_train, y_test, train=True)
print_score(ada, X_train, X_test, y_train, y_test, train=False)

Train Result:

accuracy score: 1.0000

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       177
           1       1.00      1.00      1.00       999

    accuracy                           1.00      1176
   macro avg       1.00      1.00      1.00      1176
weighted avg       1.00      1.00      1.00      1176


Confusion Matrix: 
 [[177   0]
 [  0 999]]

ROC AUC: 1.0000

Test Result:

accuracy score: 0.8061

Classification Report: 
               precision    recall  f1-score   support

           0       0.71      0.08      0.15        60
           1       0.81      0.99      0.89       234

    accuracy                           0.81       294
   macro avg       0.76      0.54      0.52       294
weighted avg       0.79      0.81      0.74       294


Confusion Matrix: 
 [[  5  55]
 [  2 232]]

ROC AUC: 0.5374



In [0]:
bag_clf = BaggingClassifier(base_estimator=ada, n_estimators=50,
                            max_samples=1.0, max_features=1.0, bootstrap=True,
                            bootstrap_features=False, n_jobs=-1,
                            random_state=42)

In [0]:
bag_clf.fit(X_train, y_train.ravel())

BaggingClassifier(base_estimator=AdaBoostClassifier(algorithm='SAMME.R',
                                                    base_estimator=RandomForestClassifier(bootstrap=True,
                                                                                          class_weight={0: 0.839,
                                                                                                        1: 0.161},
                                                                                          criterion='gini',
                                                                                          max_depth=None,
                                                                                          max_features='auto',
                                                                                          max_leaf_nodes=None,
                                                                                          min_impurity_decrease=0.0,
                                       

In [0]:
print_score(bag_clf, X_train, X_test, y_train, y_test, train=True)
print_score(bag_clf, X_train, X_test, y_train, y_test, train=False)

Train Result:

accuracy score: 0.9634

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      0.76      0.86       177
           1       0.96      1.00      0.98       999

    accuracy                           0.96      1176
   macro avg       0.98      0.88      0.92      1176
weighted avg       0.96      0.96      0.96      1176


Confusion Matrix: 
 [[134  43]
 [  0 999]]

ROC AUC: 0.8785

Test Result:

accuracy score: 0.8027

Classification Report: 
               precision    recall  f1-score   support

           0       0.75      0.05      0.09        60
           1       0.80      1.00      0.89       234

    accuracy                           0.80       294
   macro avg       0.78      0.52      0.49       294
weighted avg       0.79      0.80      0.73       294


Confusion Matrix: 
 [[  3  57]
 [  1 233]]

ROC AUC: 0.5229



***