# Stacking multiple ML models for classification

In this notebook, I demonstrate an ensemble method called stacking. In stacking, multiple ML models are trained on the same dataset, and their outputs (optionally together with the original features) are then fed into a metamodel (another ML model), which makes the final predictions.

First, I load the necessary modules and the <a href="https://archive.ics.uci.edu/ml/datasets/mushroom"> dataset</a>, which is available from the UCI ML repository. This dataset contains various properties of mushroom species, and my goal is to build a model that predicts whether a mushroom is toxic or not.
<br>

In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV



In [9]:
# Abbreviated column names: ssar/ssbr = stalk-surface-above/below-ring,
# scar/scbr = stalk-color-above/below-ring, spc = spore-print-color.
df=pd.read_csv('agaricus-lepiota.data', delimiter=",",header=None,
              names=['class','cap-shape','cap-surface','cap-color','bruises','odor','gill-attachment',
                    'gill-spacing','gill-size','gill-color','stalk-shape','stalk-root','ssar',
                    'ssbr','scar','scbr','veil-type','veil-color','ring-number','ring-type','spc',
                    'population','habitat'])

In [10]:
df.head()

,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,ssbr,scar,scbr,veil-type,veil-color,ring-number,ring-type,spc,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [12]:
df.dtypes

class              object
cap-shape          object
cap-surface        object
cap-color          object
bruises            object
odor               object
gill-attachment    object
gill-spacing       object
gill-size          object
gill-color         object
stalk-shape        object
stalk-root         object
ssar               object
ssbr               object
scar               object
scbr               object
veil-type          object
veil-color         object
ring-number        object
ring-type          object
spc                object
population         object
habitat            object
dtype: object

In [15]:
df.replace('?',np.nan,inplace=True)

In [16]:
df.isna().sum()

class                 0
cap-shape             0
cap-surface           0
cap-color             0
bruises               0
odor                  0
gill-attachment       0
gill-spacing          0
gill-size             0
gill-color            0
stalk-shape           0
stalk-root         2480
ssar                  0
ssbr                  0
scar                  0
scbr                  0
veil-type             0
veil-color            0
ring-number           0
ring-type             0
spc                   0
population            0
habitat               0
dtype: int64

<br>
The columns are named according to the <a href="https://archive.ics.uci.edu/ml/datasets/mushroom"> description</a> provided on the dataset webpage. Every column is categorical, so the object dtype is fine. The description tells us that null values occur only in the 'stalk-root' column, where they are marked with a '?'. After replacing these with NaN, I find 2480 missing entries, which is around 30% of the column. This is a substantial amount, so I drop this column from the dataset before training a model. One could instead try other tactics, such as imputation, and compare which approach produces the best model. I also separate out the 'class' column, since this is the target variable.
<br>
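As a rough sketch of the imputation alternative mentioned above, the missing 'stalk-root' entries could either be kept as a category of their own or be filled with the most frequent observed value. The toy column below is an assumption for illustration only (its values are not taken from the dataset), and which option works best would need to be checked on the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'stalk-root' column; the values are illustrative only.
toy = pd.DataFrame({'stalk-root': ['b', np.nan, 'e', 'b', np.nan, 'c']})

# Option 1: treat "missing" as a category of its own.
as_category = toy['stalk-root'].fillna('missing')

# Option 2: fill missing entries with the most frequent observed value.
most_common = toy['stalk-root'].mode()[0]
most_frequent = toy['stalk-root'].fillna(most_common)

print(as_category.tolist())
print(most_frequent.tolist())
```

Either version of the column could then be kept in the feature set instead of being dropped.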

In [20]:
y=df['class']
features=df.drop(['class','stalk-root'],axis=1)

In [24]:
y.value_counts(normalize=True)

e    0.517971
p    0.482029
Name: class, dtype: float64

<br>
There are roughly equal numbers of instances of each class, so I do not need the extra steps required when dealing with imbalanced classes. Next, I one-hot encode the features, so that each category of each feature becomes its own column, with a 1 or 0 indicating whether that category is present or not. This drastically increases the number of columns, since every category (apart from the first level of each feature, which drop_first=True removes) now has its own column.
<br>
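To make the effect of the encoding concrete, here is a small sketch on a made-up two-column frame (the values are assumptions for illustration, not rows from the dataset). It compares the columns produced with and without drop_first=True. Note that recent pandas versions return boolean columns by default, so dtype=int is passed here to get 0/1 values.

```python
import pandas as pd

# Made-up categorical data, for illustration only.
toy = pd.DataFrame({'odor': ['a', 'n', 'p', 'n'],
                    'bruises': ['t', 'f', 't', 'f']})

# One column per category of each feature.
full = pd.get_dummies(toy, dtype=int)

# drop_first=True removes the first category of each feature,
# since it is implied when all the other indicators are 0.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```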

In [25]:
df2=pd.get_dummies(features,drop_first=True)

In [30]:
df2.head()

,cap-shape_c,cap-shape_f,cap-shape_k,cap-shape_s,cap-shape_x,cap-surface_g,cap-surface_s,cap-surface_y,cap-color_c,cap-color_e,...,population_n,population_s,population_v,population_y,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
0,0,0,0,0,1,0,1,0,0,0,...,0,1,0,0,0,0,0,0,1,0
1,0,0,0,0,1,0,1,0,0,0,...,1,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0,...,1,0,0,0,0,0,1,0,0,0
3,0,0,0,0,1,0,0,1,0,0,...,0,1,0,0,0,0,0,0,1,0
4,0,0,0,0,1,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0


<br>
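I encode the target variable so that each instance is represented by a 1 or 0. LabelEncoder assigns the integers in sorted label order, so edible ('e') becomes 0 and poisonous ('p') becomes 1.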
<br>
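The mapping can be checked on a short, made-up list of labels (an assumption for illustration). The classes_ attribute shows the sorted labels, and the position of each label in it is the integer it is encoded as.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels: 'p' = poisonous, 'e' = edible.
labels = ['p', 'e', 'e', 'p']

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

print(list(le_demo.classes_))   # sorted labels; index = encoded value
print(encoded.tolist())

# inverse_transform maps the integers back to the original labels.
decoded = le_demo.inverse_transform([0, 1]).tolist()
print(decoded)
```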

In [39]:
le=LabelEncoder()
target=le.fit_transform(y)

<br>
Now I build my stacking classifier. For the base models, I use a K-nearest neighbors, a decision tree, and a Gaussian naive Bayes classifier. For the metamodel, I use logistic regression. I use sklearn's StackingClassifier class to stack the models.
<br>

In [53]:
estimators=[('knn', KNeighborsClassifier(algorithm='ball_tree')), 
            ('dt', DecisionTreeClassifier(random_state=500)), 
            ('nb', GaussianNB())]

clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
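
To give an idea of what happens when such a stacked model is fit, the sketch below builds a stack by hand: each base model produces out-of-fold predictions via cross-validation, and the metamodel is trained on those predictions. This is only an approximation in the spirit of StackingClassifier, whose internals differ in detail. The synthetic data from make_classification and the names X, y_toy, and meta_features are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data stands in for the mushroom features (illustration only).
X, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

base_models = [KNeighborsClassifier(),
               DecisionTreeClassifier(random_state=0),
               GaussianNB()]

# Out-of-fold probability of the positive class from each base model.
# Using out-of-fold predictions keeps the metamodel from being trained
# on predictions the base models made for samples they had already seen.
meta_features = np.column_stack([
    cross_val_predict(model, X, y_toy, cv=5, method='predict_proba')[:, 1]
    for model in base_models
])

# The metamodel learns how to combine the base models' predictions.
meta_model = LogisticRegression().fit(meta_features, y_toy)

print(meta_features.shape)  # one column per base model
```

StackingClassifier also has a passthrough parameter; setting passthrough=True feeds the original features to the metamodel alongside the base models' predictions.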

<br>
To demonstrate how to perform hyperparameter tuning on a stacked model, I build a parameter grid with a few parameters to explore for some of the base models. I then tune the model with GridSearchCV, using 5-fold cross-validation for each of these parameter settings. The search is scored using accuracy, which is acceptable here since the classes are balanced. However, if the problem has some specific criteria, e.g., if we want to be extra cautious and make sure the model doesn't misclassify a toxic mushroom as edible (i.e., minimize the number of false negatives, taking toxic as the positive class), we can use another metric such as recall.
<br>
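The difference between the two metrics can be seen on a small, made-up set of predictions (an assumption for illustration): a model that misses two of three poisonous mushrooms still reaches 75% accuracy, while its recall for the poisonous class is only one third. The sketch also builds a scorer with make_scorer that could be passed as the scoring argument of GridSearchCV when the labels are the strings 'e' and 'p'.

```python
from sklearn.metrics import accuracy_score, make_scorer, recall_score

# Made-up labels: 'p' = poisonous (positive class), 'e' = edible.
y_true = ['p', 'p', 'p', 'e', 'e', 'e', 'e', 'e']
y_pred = ['p', 'e', 'e', 'e', 'e', 'e', 'e', 'e']  # two poisonous predicted edible

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, pos_label='p')
print(acc, rec)

# Scorer that rewards finding poisonous mushrooms; usable as
# GridSearchCV(..., scoring=poison_recall) with string labels.
poison_recall = make_scorer(recall_score, pos_label='p')
```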

In [None]:
params = {'knn__n_neighbors': [5,10],
          'dt__min_samples_leaf': [5,10], 'dt__min_samples_split':[10,15]}
grid = GridSearchCV(estimator=clf, param_grid=params,verbose=1, scoring='accuracy', cv=5,n_jobs=-1)

I vary 3 parameters in the grid, each with 2 values, so there are $2^{3}=8$ candidate parameter combinations. Each candidate is fit once per fold, and since there are 5 folds, a total of $5\times8=40$ fits are performed during the search.
<br>
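The count can be double-checked with sklearn's ParameterGrid, which enumerates the combinations of a parameter grid. The grid below repeats the one defined above, and n_folds is a name introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

params = {'knn__n_neighbors': [5, 10],
          'dt__min_samples_leaf': [5, 10],
          'dt__min_samples_split': [10, 15]}

# Number of distinct parameter combinations in the grid.
n_candidates = len(ParameterGrid(params))

# Each candidate is fit once per cross-validation fold.
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```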

In [54]:
grid.fit(df2,target)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


GridSearchCV(cv=5,
             estimator=StackingClassifier(estimators=[('knn',
                                                       KNeighborsClassifier(algorithm='ball_tree')),
                                                      ('dt',
                                                       DecisionTreeClassifier(random_state=500)),
                                                      ('nb', GaussianNB())],
                                          final_estimator=LogisticRegression()),
             n_jobs=-1,
             param_grid={'dt__min_samples_leaf': [5, 10],
                         'dt__min_samples_split': [10, 15],
                         'knn__n_neighbors': [5, 10]},
             scoring='accuracy', verbose=1)

In [55]:
grid.best_params_

{'dt__min_samples_leaf': 5,
 'dt__min_samples_split': 10,
 'knn__n_neighbors': 10}

In [57]:
grid.best_score_

0.8963252747252748

<br>
Above I show the best parameters found by the search, which reach a cross-validated accuracy of almost 90%, not bad for such a small grid! As an exercise, one can repeat this analysis, this time including the 'stalk-root' column and/or using a larger grid, to see if the model can be improved.