At times, a single model might not be able to provide satisfactory results, and we might need to combine more than one model to get better results. This technique of combining more than one model is called ensembling.
<pre>
Ensemble models can be categorised into:
Bagging model - the combined models work in parallel on parts of the training data, and their results are aggregated
Boosting model - the combined models work sequentially on the entire data; subsequent models rectify the errors of previous models (multiple weaker models, chained to become a single strong model)
</pre>
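The two categories above can be sketched side by side. This is a minimal illustration, not part of the demo itself: it assumes a synthetic dataset from `make_classification` as a stand-in for real data, and the `*_demo` and `Xd_*`/`yd_*` names are made up so they do not clash with the variables used later.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the spambase dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Bagging: each tree is fitted independently on a bootstrap sample of the
# training data, and the predictions of all trees are aggregated
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0)
bagging.fit(Xd_train, yd_train)

# Boosting: weak learners (depth-1 trees by default) are fitted one after
# another, each giving more weight to samples the previous ones misclassified
boosting = AdaBoostClassifier(n_estimators=10, random_state=0)
boosting.fit(Xd_train, yd_train)

print("bagging :", bagging.score(Xd_test, yd_test))
print("boosting:", boosting.score(Xd_test, yd_test))
```

Which of the two scores higher depends on the data, so the printed numbers should not be read as a general ranking of the techniques.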

In [1]:
#importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

#### Read data
In this demo, we work with the spambase dataset, which records the normalized frequency of different words in each email. These frequencies are the features used to classify an email as spam (1) or not spam (0).

In [2]:
#reading input data from csv file
spam_data = pd.read_csv("datasets/spambase.csv")

#### Splitting the data into train and test set

In [3]:
from sklearn.model_selection import train_test_split
features = spam_data.columns.drop('spam')
target = "spam"
X=spam_data[features]
Y=spam_data[target]
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,random_state=100)


### Bagging using Random Forest Algorithm

In [4]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=10,
                               min_samples_split=20,
                               min_impurity_decrease=0.05)
model.fit(X_train,Y_train)
train_accuracy = model.score(X_train,Y_train)
test_accuracy = model.score(X_test,Y_test)
print(train_accuracy,test_accuracy)

0.8633152173913043 0.8577633007600435


#### Reviewing the feature importances 

In [5]:
feature_imps = pd.DataFrame(np.array([features,
                                      model.feature_importances_]).T,
                            columns=["feature","importance"])
feature_imps.sort_values(by="importance",ascending=False)

Unnamed: 0,feature,importance
20,word_freq_your,0.213714
15,word_freq_free,0.200673
52,char_freq_$,0.187872
51,char_freq_!,0.139886
6,word_freq_remove,0.069712
22,word_freq_000,0.0551342
18,word_freq_you,0.0432799
23,word_freq_money,0.0322997
54,capital_run_length_average,0.029615
19,word_freq_credit,0.0278139


#### Optional Exercise: Tune the hyperparameters of the random forest classifier using KFold cross validation
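One possible starting point for this exercise is a grid search over a few hyperparameters, scored with K-fold cross validation. The sketch below is an assumption-laden example: it uses synthetic data from `make_classification` so that it runs on its own (in the notebook, `X_train` and `Y_train` would take its place), and the grid values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (an assumption; swap in X_train, Y_train in the notebook)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate hyperparameter values (illustrative only)
param_grid = {
    "n_estimators": [10, 50],
    "min_samples_split": [2, 20],
}

# 5-fold cross validation with shuffling, seeded for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=100)

search = GridSearchCV(
    RandomForestClassifier(random_state=100),
    param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# Best combination found and its mean cross-validated accuracy
print(search.best_params_, search.best_score_)
```

After the search, `search.best_estimator_` holds a model refitted on all the supplied data with the best combination, which can then be scored on the held-out test set.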

### Boosting using AdaBoost Algorithm

In [6]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier(n_estimators=10)
model.fit(X_train,Y_train)
train_accuracy = model.score(X_train,Y_train)
test_accuracy = model.score(X_test,Y_test)
print(train_accuracy,test_accuracy)

0.9195652173913044 0.9272529858849077


#### Reviewing the feature importances 

In [7]:
feature_imps = pd.DataFrame(np.array([features,
                                      model.feature_importances_]).T,
                            columns=["feature","importance"])
feature_imps.sort_values(by="importance",ascending=False)

Unnamed: 0,feature,importance
15,word_freq_free,0.1
45,word_freq_edu,0.1
55,capital_run_length_longest,0.1
52,char_freq_$,0.1
6,word_freq_remove,0.1
51,char_freq_!,0.1
26,word_freq_george,0.1
44,word_freq_re,0.1
24,word_freq_hp,0.1
36,word_freq_1999,0.1


#### Optional Exercise: Tune the AdaBoost classifier using cross validation
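A simple way to approach this exercise is to loop over a few candidate settings and compare their mean cross-validated accuracy with `cross_val_score`. The sketch below is only one possible approach: it assumes synthetic data from `make_classification` so that it is runnable on its own (in the notebook, `X_train` and `Y_train` would be used instead), and the candidate values for `n_estimators` and `learning_rate` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; swap in X_train, Y_train in the notebook)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Mean 5-fold accuracy for every candidate combination (values are illustrative)
results = {}
for n_estimators in [10, 50]:
    for learning_rate in [0.5, 1.0]:
        candidate = AdaBoostClassifier(
            n_estimators=n_estimators,
            learning_rate=learning_rate,
            random_state=100,
        )
        scores = cross_val_score(candidate, X_demo, y_demo, cv=5)
        results[(n_estimators, learning_rate)] = scores.mean()

# Pick the combination with the highest mean accuracy
best_setting = max(results, key=results.get)
print(best_setting, results[best_setting])
```

The explicit loop keeps every intermediate score visible; `GridSearchCV` would do the same search with less code.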