# Ensemble


Ensemble learning helps improve machine learning results by combining several models. This approach allows the production of better predictive performance compared to a single model.

Most ensemble methods use a single base learning algorithm i.e. learners of the same type, leading to homogeneous ensembles.

There are also some methods that use heterogeneous learners, i.e. learners of different types, leading to heterogeneous ensembles. In order for ensemble methods to be more accurate than any of its individual members, the base learners have to be as accurate as possible and as diverse as possible.



### Bagging

Bagging stands for bootstrap aggregation. One way to reduce the variance of an estimate is to average together multiple estimates. For example, we can train M different trees on different subsets of the data (chosen randomly with replacement).

Bagging uses bootstrap sampling to obtain the data subsets for training the base learners. For aggregating the outputs of base learners, bagging uses voting for classification and averaging for regression.

### Boosting

Boosting is a general ensemble method that creates a strong classifier from a number of weak classifiers.

This is done by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. Models are added until the training set is predicted perfectly or a maximum number of models are added.

AdaBoost is one of the most successful boosting algorithms developed for binary classification.

### Libraries useful in Ensemble are listed below

### Import all the libraries required

In [1]:
import pandas as pd
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score

### Load the data

In [2]:
# import dataset
dataset = 'model/gesture_classifier/gesture_akshay.csv'

df = pd.read_csv(dataset, header = None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,33,34,35,36,37,38,39,40,41,42
0,0,-0.422222,0.0,-0.255556,-0.4,-0.183333,-0.316667,-0.222222,-0.138889,0.022222,...,-0.455556,-1.0,0.311111,-0.866667,-0.266667,-0.016667,-0.255556,-0.044444,-0.244444,-0.072222
1,0,-0.418478,0.0,-0.228261,-0.358696,-0.157609,-0.298913,-0.228261,-0.125,0.032609,...,-0.461957,-1.0,0.288043,-0.891304,-0.25,-0.01087,-0.244565,-0.032609,-0.255435,-0.065217
2,0,-0.362637,0.0,-0.186813,-0.373626,-0.131868,-0.318681,-0.241758,-0.175824,0.104396,...,-0.340659,-1.0,0.373626,-0.807692,-0.247253,-0.06044,-0.230769,-0.054945,-0.236264,-0.076923
3,0,-0.394444,0.0,-0.25,-0.394444,-0.166667,-0.311111,-0.211111,-0.133333,0.022222,...,-0.472222,-1.0,0.288889,-0.877778,-0.266667,-0.011111,-0.25,-0.038889,-0.244444,-0.072222
4,0,-0.387435,0.0,-0.282723,-0.403141,-0.141361,-0.324607,-0.21466,-0.162304,0.078534,...,-0.429319,-1.0,0.34555,-0.837696,-0.287958,-0.041885,-0.246073,-0.052356,-0.240838,-0.068063


### Split the dataset into training and testing parts (70-30 ratio with a random state value 30)

In [3]:
# Select the independent variables and the target attribute
X = df.iloc[:, 1:]
Y = df.iloc[:, :1]

# print(X.head)
# print(Y.head)

print(X)
print(Y)

            1    2         3         4         5         6         7   \
0    -0.422222  0.0 -0.255556 -0.400000 -0.183333 -0.316667 -0.222222   
1    -0.418478  0.0 -0.228261 -0.358696 -0.157609 -0.298913 -0.228261   
2    -0.362637  0.0 -0.186813 -0.373626 -0.131868 -0.318681 -0.241758   
3    -0.394444  0.0 -0.250000 -0.394444 -0.166667 -0.311111 -0.211111   
4    -0.387435  0.0 -0.282723 -0.403141 -0.141361 -0.324607 -0.214660   
...        ...  ...       ...       ...       ...       ...       ...   
7777 -0.512346  0.0 -0.376543 -0.395062 -0.259259 -0.327160 -0.203704   
7778 -0.500000  0.0 -0.382716 -0.401235 -0.265432 -0.320988 -0.203704   
7779 -0.500000  0.0 -0.382716 -0.395062 -0.271605 -0.327160 -0.209877   
7780 -0.506173  0.0 -0.388889 -0.407407 -0.265432 -0.327160 -0.216049   
7781 -0.506173  0.0 -0.388889 -0.407407 -0.265432 -0.327160 -0.216049   

            8         9         10  ...        33   34        35        36  \
0    -0.138889  0.022222 -0.288889  ... -0.45

In [4]:
# Divide the dataset into training and testing partition
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)

### Q1. Ensemble Method by manipulation of Dataset (Bagged Decision Trees)

Bagging performs best with algorithms that have high variance. A popular example are decision trees, often constructed without pruning.

We will create decision tree classifiers with and without bagging ensemble method and compare their performance.

In [5]:
# Implement the decision tree classifier using entropy and random state value as 30
dtree_entropy = DecisionTreeClassifier(criterion='entropy', random_state = 30)

In [6]:
# Use k-fold cross validation with k=5
dtree_entropy = dtree_entropy.fit(X_train,Y_train)
scores = cross_val_score(dtree_entropy, X_train, Y_train, cv=5, scoring='accuracy')
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.98990826 0.99541284 0.99265381 0.99081726 0.9862259 ]
mean score:  0.9910036141228801


### Prediction and Evaluation

In [7]:
# Predict results on the testing part
predictions = dtree_entropy.predict(X_test)

In [8]:
# Calculate and print confusion matrix and other performance measures 
print(classification_report(Y_test,predictions))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       158
           1       0.99      0.99      0.99       188
           2       1.00      1.00      1.00       317
           3       0.99      1.00      1.00       225
           4       1.00      1.00      1.00       228
           5       0.99      1.00      1.00       374
           6       0.97      0.97      0.97        76
           7       1.00      0.99      0.99       238
           8       1.00      0.99      1.00       193
           9       1.00      0.99      1.00       189
          10       0.92      0.92      0.92        25
          11       1.00      0.95      0.98        42
          12       0.98      1.00      0.99        82

    accuracy                           0.99      2335
   macro avg       0.99      0.98      0.99      2335
weighted avg       0.99      0.99      0.99      2335

Confusion Matrix
[[157   0   0   0   0   0   1   0   0   0   0   0   0]
 [  0 1

### Comparison with Bagged Decision Tree

In [9]:
# Create a model using bagging using 5 decision tree classifiers
from sklearn.ensemble import BaggingClassifier

seed = 30
dtree = DecisionTreeClassifier(criterion='entropy', random_state = 30) 
num_trees = 5
model = BaggingClassifier(base_estimator=dtree, n_estimators=num_trees, random_state=seed)

In [10]:
# Use k-fold cross validation with k=5
model_bagged = model.fit(X_train,Y_train)
scores = cross_val_score(model_bagged, X_train, Y_train, cv=5, scoring='accuracy')
print('scores: ', scores)
print('mean score: ', scores.mean())

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


scores:  [0.99082569 0.99633028 0.98806244 0.99632691 0.99081726]
mean score:  0.9924725149746001


### Prediction and Evaluation

In [11]:
# Predict results on the testing part
predictions = model_bagged.predict(X_test)

In [12]:
# Calculate and print confusion matrix and other performance measures 
print(classification_report(Y_test,predictions))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       158
           1       1.00      0.99      1.00       188
           2       1.00      1.00      1.00       317
           3       0.99      1.00      1.00       225
           4       1.00      1.00      1.00       228
           5       0.99      1.00      1.00       374
           6       0.97      0.99      0.98        76
           7       1.00      0.99      0.99       238
           8       1.00      0.98      0.99       193
           9       1.00      1.00      1.00       189
          10       0.92      0.96      0.94        25
          11       0.98      1.00      0.99        42
          12       0.99      1.00      0.99        82

    accuracy                           0.99      2335
   macro avg       0.99      0.99      0.99      2335
weighted avg       0.99      0.99      0.99      2335

Confusion Matrix
[[156   0   0   0   0   1   0   0   0   0   1   0   0]
 [  0 1

### Q2. Ensemble Method by manipulation of Classifiers (using Voting Classifier)

The VotingClassifier takes in a list of different estimators as arguments and a voting method. The **hard** voting method uses the predicted labels and a majority rules system, while the **soft** voting method predicts a label based on the argmax/largest predicted value of the sum of the predicted probabilities.

After we provide the desired classifiers, we need to fit the resulting ensemble classifier object. We can then get predictions and use accuracy metrics.

In [23]:
# Implement the different classifiers
from sklearn.ensemble import VotingClassifier

dtree_gini = DecisionTreeClassifier(criterion='gini')
K3_euclidean = KNeighborsClassifier(n_neighbors=3, metric = "euclidean")
K5_euclidean = KNeighborsClassifier(n_neighbors=5, metric = "euclidean")
K5_manhattan = KNeighborsClassifier(n_neighbors=5, metric = "manhattan")
nb = GaussianNB()

In [26]:
# Build Voting Classifier using above estimators and hard voting method
# Function to be used: VotingClassifier(estimators,voting)
# Estimators represent the base classifiers used taken as ('base classifier name', variable_name)
model = VotingClassifier(estimators=[('dt', dtree_gini), ('knn3', K3_euclidean), ('knn5', K5_euclidean),('knn5_man', K5_manhattan), ('gnb', nb)], voting='hard')

In [27]:
# Fit the voting classifier model and print scores using k-fold cross validation with k=5
model_voting = model.fit(X_train,Y_train)
scores = cross_val_score(model_voting, X_train, Y_train, cv=5, scoring='accuracy')
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.94178571 0.94392857 0.94357143 0.94464286 0.94392857]
mean score:  0.9435714285714285


### Prediction and Evaluation

In [28]:
# Predict results on the testing part
predictions = model_voting.predict(X_test)

In [29]:
# Calculate and print confusion matrix and other performance measures 
print(classification_report(Y_test,predictions))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions))

              precision    recall  f1-score   support

           A       0.98      1.00      0.99       229
           B       0.84      0.97      0.90       228
           C       0.96      0.96      0.96       220
           D       0.88      0.98      0.93       219
           E       0.95      0.92      0.93       232
           F       0.93      0.93      0.93       225
           G       0.94      0.91      0.93       234
           H       0.89      0.92      0.90       206
           I       0.94      0.97      0.96       236
           J       0.97      0.93      0.95       209
           K       0.93      0.91      0.92       213
           L       0.99      0.95      0.97       239
           M       0.97      0.98      0.98       240
           N       0.98      0.95      0.96       239
           O       0.91      0.94      0.92       243
           P       0.95      0.94      0.95       243
           Q       0.96      0.96      0.96       228
           R       0.95    

### Q3. Manipulating the features

In [73]:
# Generate five random vectors
import numpy as np
vector = []
for i in range(5):
 vect=np.random.choice(np.arange(1,17),10, replace=False)
 vector.append(vect)
    
df1 = df [vector[0]]
df2 = df [vector[1]]
df3 = df [vector[2]]
df4 = df [vector[3]] 
df5 = df [vector[4]]


In [74]:
# Model 1
# Select the independent variables 
# select only the target lableled column

X = df1[df1.columns[1:]]
Y = df[df.columns[0]]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)


# Train the model
model1 = DecisionTreeClassifier(criterion='entropy')
model1 = model1.fit(X_train,Y_train)

In [75]:
# Model 2
# Select the independent variables 
# select only the target lableled column
X = df2[df2.columns[1:]]
Y = df[df.columns[0]]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)

# Train the model
model2 = DecisionTreeClassifier(criterion='entropy')
model2 = model2.fit(X_train,Y_train)

In [76]:
# Model 3
# Select the independent variables 
# select only the target lableled column
X = df3[df3.columns[1:]]
Y = df[df.columns[0]]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)

# Train the model
model3 = DecisionTreeClassifier(criterion='entropy')
model3 = model3.fit(X_train,Y_train)

In [77]:
# Model 4
# Select the independent variables 
# select only the target lableled column
X = df4[df4.columns[1:]]
Y = df[df.columns[0]]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)

# Train the model
model4 = DecisionTreeClassifier(criterion='entropy')
model4 = model4.fit(X_train,Y_train)

In [78]:
# Model 5
# Select the independent variables 
# select only the target lableled column
X = df5[df5.columns[1:]]
Y = df[df.columns[0]]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)

# Train the model
model5 = DecisionTreeClassifier(criterion='entropy')
model5 = model5.fit(X_train,Y_train)

In [79]:
# Apply Voting Classifier
model = VotingClassifier(estimators=[('m1', model1), ('m2', model2), ('m3', model3),('m4', model4), ('m5', model5)], voting='hard')

In [80]:
# Calculate and print confusion matrix and other performance measures 

model_voting = model.fit(X_train,Y_train)

# Predict results on the testing part
predictions = model_voting.predict(X_test)

print(classification_report(Y_test,predictions))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions))

              precision    recall  f1-score   support

           A       0.86      0.84      0.85       229
           B       0.67      0.79      0.73       228
           C       0.84      0.80      0.82       220
           D       0.70      0.75      0.73       219
           E       0.83      0.82      0.82       232
           F       0.72      0.68      0.70       225
           G       0.69      0.72      0.70       234
           H       0.76      0.73      0.74       206
           I       0.91      0.89      0.90       236
           J       0.87      0.87      0.87       209
           K       0.81      0.83      0.82       213
           L       0.86      0.83      0.85       239
           M       0.82      0.81      0.81       240
           N       0.82      0.84      0.83       239
           O       0.83      0.77      0.80       243
           P       0.80      0.82      0.81       243
           Q       0.73      0.79      0.76       228
           R       0.71    

### Q4. Manipulating the classes

In [65]:
# Generate 5 sets of two class representation
import random
feature_vector = np.zeros((5,13))
for i in range(5):
 feature_vector[i]=random.sample(range(1,26),13)
print(feature_vector)
    
df1 = df.copy (deep=True) 
df2 = df.copy (deep=True)
df3 = df.copy (deep=True)
df4 = df.copy (deep=True)
df5 = df.copy (deep=True)

for idx in range(len(df)):
    col = ord(df.iloc[idx,0]) -64
    
    if col not in feature_vector[0]:
        df1.iloc[idx,0] = '1'
    else:
        df1.iloc[idx,0] = '0'
        
    if col not in feature_vector[1]:
        df1.iloc[idx,0] = '1'
    else:
        df1.iloc[idx,0] = '0'
        
    if col not in feature_vector[2]:
        df1.iloc[idx,0] = '1'
    else:
        df1.iloc[idx,0] = '0'
        
    if col not in feature_vector[3]:
        df1.iloc[idx,0] = '1'
    else:
        df1.iloc[idx,0] = '0'
        
    if col not in feature_vector[4]:
        df1.iloc[idx,0] = '1'
    else:
        df1.iloc[idx,0] = '0'

[[ 8. 21.  2. 13.  9. 11. 23. 22. 24. 16. 14.  6. 17.]
 [ 5.  8. 17. 22.  9. 18.  1. 21. 20. 23.  6. 12. 14.]
 [ 7. 18.  1. 21. 20. 14. 15. 11.  8. 17.  4.  2. 10.]
 [ 5. 20.  3.  2. 24. 11.  9.  8. 14. 23.  4.  7. 19.]
 [15. 18. 10.  5. 23. 19. 20.  8.  9.  7.  6. 11. 22.]]


In [66]:
# Model 1
# Select the independent variables 
# select only the target lableled column
X = df1[df1.columns[1:]]
Y = df1[df1.columns[0]]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)


# Train the model
model1 = DecisionTreeClassifier(criterion='entropy')
model1 = model1.fit(X_train,Y_train)

In [67]:
# Model 2
# Select the independent variables 
# select only the target lableled column
X = df2[df2.columns[1:]]
Y = df2[df2.columns[0]]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)

# Train the model
model2 = DecisionTreeClassifier(criterion='entropy')
model2 = model2.fit(X_train,Y_train)

In [68]:
# Model 3
# Select the independent variables 
# select only the target lableled column
X = df3[df3.columns[1:]]
Y = df3[df3.columns[0]]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)

# Train the model
model3 = DecisionTreeClassifier(criterion='entropy')
model3 = model3.fit(X_train,Y_train)

In [69]:
# Model 4
# Select the independent variables 
# select only the target lableled column
X = df4[df4.columns[1:]]
Y = df4[df4.columns[0]]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)

# Train the model
model4 = DecisionTreeClassifier(criterion='entropy')
model4 = model4.fit(X_train,Y_train)

In [70]:
# Model 5
# Select the independent variables 
# select only the target lableled column
X = df5[df5.columns[1:]]
Y = df5[df5.columns[0]]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)

# Train the model
model5 = DecisionTreeClassifier(criterion='entropy')
model5 = model5.fit(X_train,Y_train)

In [71]:
# Apply Voting Classifier
model = VotingClassifier(estimators=[('m1', model1), ('m2', model2), ('m3', model3),('m4', model4), ('m5', model5)], voting='hard')

In [72]:
# Calculate and print confusion matrix and other performance measures 
model_voting = model.fit(X_train,Y_train)

# Predict results on the testing part
predictions = model_voting.predict(X_test)

print(classification_report(Y_test,predictions))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions))

              precision    recall  f1-score   support

           A       0.94      0.90      0.92       229
           B       0.83      0.84      0.83       228
           C       0.88      0.90      0.89       220
           D       0.79      0.86      0.82       219
           E       0.85      0.87      0.86       232
           F       0.82      0.80      0.81       225
           G       0.87      0.80      0.83       234
           H       0.75      0.78      0.76       206
           I       0.88      0.93      0.90       236
           J       0.93      0.89      0.91       209
           K       0.84      0.84      0.84       213
           L       0.92      0.92      0.92       239
           M       0.91      0.93      0.92       240
           N       0.88      0.89      0.89       239
           O       0.88      0.81      0.85       243
           P       0.84      0.92      0.88       243
           Q       0.88      0.83      0.86       228
           R       0.83    

### Q5. Which method performs the best

Getting an accuracy of about 87% by manipulating the classes (at question no. 4)