# Name:- Ravindra kumar sharma
# Bits id :- 2020H1030126P

# Ensemble


Ensemble learning improves machine learning results by combining several models. The combined model often achieves better predictive performance than any single model.

Most ensemble methods use a single base learning algorithm, i.e. learners of the same type, which leads to homogeneous ensembles.

Some methods instead use heterogeneous learners, i.e. learners of different types, which leads to heterogeneous ensembles. For an ensemble to be more accurate than any of its individual members, the base learners have to be as accurate as possible and as diverse as possible.
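To see why accuracy and diversity both matter, consider an idealised case: an odd number of base learners that are each correct with the same probability and make their errors independently of one another. Real learners trained on the same data are correlated, so this is only an assumption for illustration, and the function name below is made up. Under that assumption the accuracy of a majority vote follows from the binomial distribution:

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent voters,
    each correct with probability p, give the correct answer."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 3, 5):
    print(n, round(majority_vote_accuracy(0.7, n), 4))
# 1 0.7
# 3 0.784
# 5 0.8369
```

The gain depends on the learners being better than chance: with `p = 0.4`, three voters reach only about 0.352, which is worse than a single learner.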



### Bagging

Bagging stands for bootstrap aggregation. One way to reduce the variance of an estimate is to average together multiple estimates. For example, we can train M different trees on different subsets of the data (chosen randomly with replacement).

Bagging uses bootstrap sampling to obtain the data subsets for training the base learners. For aggregating the outputs of base learners, bagging uses voting for classification and averaging for regression.
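The two ingredients described above, bootstrap sampling and aggregation, can be sketched with the standard library alone. This is a minimal illustration with made-up helper names and toy data, not the implementation used by scikit-learn:

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows uniformly at random, with replacement."""
    return rng.choices(rows, k=len(rows))

def aggregate_classification(predictions):
    """Majority vote over the base learners' predicted labels."""
    return Counter(predictions).most_common(1)[0][0]

def aggregate_regression(predictions):
    """Average of the base learners' predicted values."""
    return mean(predictions)

rng = random.Random(30)
data = list(range(10))
samples = [bootstrap_sample(data, rng) for _ in range(3)]
for s in samples:
    print(sorted(s))  # typically some rows repeat and others are missing

print(aggregate_classification(["T", "I", "T"]))  # T
print(aggregate_regression([2.0, 3.0, 4.0]))      # 3.0
```

Each base learner would be trained on one of the bootstrap samples, and the aggregation function combines their outputs for a new observation.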

### Boosting

Boosting is a general ensemble method that creates a strong classifier from a number of weak classifiers.

This is done by building a model from the training data, then creating a second model that attempts to correct the errors of the first. Models are added until the training set is predicted perfectly or a maximum number of models is reached.

AdaBoost is one of the most successful boosting algorithms developed for binary classification.
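The reweighting idea behind AdaBoost can be sketched on a toy problem. The code below is a minimal version of discrete AdaBoost for labels in {-1, +1}, using one-dimensional threshold "stumps" as weak learners. The dataset and all helper names are assumptions made for this illustration; they do not come from the notebook or from scikit-learn.

```python
import math

def stump_predict(x, threshold, polarity):
    """Predict polarity for x <= threshold and the opposite label otherwise."""
    return polarity if x <= threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return (threshold, polarity, weighted_error) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[2]:
                best = (threshold, polarity, error)
    return best

def adaboost(xs, ys, rounds):
    """Train `rounds` stumps, reweighting the samples after each one."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        threshold, polarity, error = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified samples gain weight, correct ones lose weight
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted sum of stump predictions."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
for rounds in (1, 3):
    model = adaboost(xs, ys, rounds)
    print(rounds, sum(predict(model, x) == y for x, y in zip(xs, ys)))
# 1 7
# 3 10
```

No single stump can separate this data, but three reweighted stumps together classify all ten points correctly.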

### Import all the libraries required for the ensemble experiments

In [1]:
import pandas as pd
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score

### Load the "letter-recognition" data

In [2]:
# import dataset
df = pd.read_csv("letter-recognition.data.txt", header = None)
df.head()

   0  1   2  3  4  5   6   7  8  9  10  11  12  13  14  15  16
0  T  2   8  3  5  1   8  13  0  6   6  10   8   0   8   0   8
1  I  5  12  3  7  2  10   5  5  4  13   3   9   2   8   4  10
2  D  4  11  6  8  6  10   6  2  6  10   3   7   3   7   3   9
3  N  7  11  6  6  3   5   9  4  6   4   4  10   6  10   2   8
4  G  2   1  3  1  1   8   6  6  6   6   5   9   1   7   5  10


### Split the dataset into training and testing parts (70-30 ratio with a random state value 30)

In [3]:
# Select the independent variables and the target attribute
X = df[df.columns[1:]] # Select the independent variables (columns 1 to 16)
Y = df[df.columns[0]] # Select only the target column (the letter label)
X.head()

   1   2  3  4  5   6   7  8  9  10  11  12  13  14  15  16
0  2   8  3  5  1   8  13  0  6   6  10   8   0   8   0   8
1  5  12  3  7  2  10   5  5  4  13   3   9   2   8   4  10
2  4  11  6  8  6  10   6  2  6  10   3   7   3   7   3   9
3  7  11  6  6  3   5   9  4  6   4   4  10   6  10   2   8
4  2   1  3  1  1   8   6  6  6   6   5   9   1   7   5  10


In [141]:
Y

0        T
1        I
2        D
3        N
4        G
        ..
19995    D
19996    C
19997    T
19998    S
19999    A
Name: 0, Length: 20000, dtype: object

In [226]:
# Divide the dataset into training and testing partition
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)

### Q1. Ensemble Method by manipulation of Dataset (Bagged Decision Trees)

Bagging performs best with algorithms that have high variance. A popular example is decision trees, which are often constructed without pruning.

We will create decision tree classifiers with and without bagging ensemble method and compare their performance.

In [5]:
# Implement the decision tree classifier using entropy and random state value as 30
dtree_entropy = DecisionTreeClassifier(criterion='entropy', random_state = 30) 

In [6]:
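# Fit on the full training partition; this fitted tree is used later for test-set prediction
dtree_entropy = dtree_entropy.fit(X_train,Y_train)
# Use k-fold cross validation with k=5 (cross_val_score refits a copy of the model on each fold)
scores = cross_val_score(dtree_entropy, X_train, Y_train, cv=5, scoring='accuracy')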
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.865      0.86964286 0.85214286 0.86       0.87      ]
mean score:  0.8633571428571427
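The five scores come from five train/validation splits of the training partition. As a rough illustration, the sketch below partitions row positions into k contiguous folds. It is a simplification: for a classifier and an integer `cv`, scikit-learn uses stratified folds that preserve class proportions. The helper name `k_fold_indices` is made up for this example.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_positions, validation_positions) for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size

# 70% of the 20000 rows form the training partition
for train, validation in k_fold_indices(14000, 5):
    print(len(train), len(validation))  # 11200 2800 on every fold
```

With 2800 validation rows per fold, a score of 0.865 corresponds to 2422 of 2800 validation rows classified correctly.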


### Prediction and Evaluation

In [7]:
# Predict results on the testing part
Y_pred = dtree_entropy.predict(X_test)

In [8]:
# Calculate and print confusion matrix and other performance measures 
print(classification_report(Y_test,Y_pred))
print("Confusion Matrix")
print(confusion_matrix(Y_test,Y_pred))
print("\n Accuracy")
print(accuracy_score(Y_test,Y_pred))

              precision    recall  f1-score   support

           A       0.93      0.91      0.92       229
           B       0.82      0.83      0.83       228
           C       0.91      0.88      0.89       220
           D       0.78      0.86      0.82       219
           E       0.84      0.87      0.85       232
           F       0.83      0.77      0.80       225
           G       0.87      0.80      0.83       234
           H       0.74      0.79      0.76       206
           I       0.88      0.92      0.90       236
           J       0.90      0.89      0.90       209
           K       0.83      0.84      0.84       213
           L       0.92      0.92      0.92       239
           M       0.91      0.91      0.91       240
           N       0.89      0.87      0.88       239
           O       0.90      0.81      0.85       243
           P       0.85      0.92      0.88       243
           Q       0.87      0.82      0.84       228
           R       0.81    

### Comparison with Bagged Decision Tree

In [9]:
# Create a bagging model using 5 decision tree classifiers
from sklearn.ensemble import BaggingClassifier

seed = 30
dtree = DecisionTreeClassifier(criterion='entropy', random_state = seed) 
num_trees = 5
# The base estimator is passed positionally because its keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2
model = BaggingClassifier(dtree, n_estimators=num_trees, random_state=seed)

In [10]:
# Use k-fold cross validation with k=5
scores = cross_val_score(model, X_train, Y_train, cv=5, scoring='accuracy')
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.88714286 0.89035714 0.89       0.88285714 0.88714286]
mean score:  0.8875


### Prediction and Evaluation

In [12]:
# Predict results on the testing part
Y_pred = model.fit(X_train, Y_train).predict(X_test)

In [13]:
# Calculate and print confusion matrix and other performance measures 
print(classification_report(Y_test,Y_pred))
print("Confusion Matrix")
print(confusion_matrix(Y_test,Y_pred))
print("\n Accuracy")
print(accuracy_score(Y_test,Y_pred))

              precision    recall  f1-score   support

           A       0.93      0.98      0.95       229
           B       0.78      0.92      0.84       228
           C       0.88      0.89      0.89       220
           D       0.78      0.90      0.84       219
           E       0.84      0.91      0.87       232
           F       0.87      0.81      0.84       225
           G       0.85      0.82      0.84       234
           H       0.82      0.86      0.84       206
           I       0.90      0.93      0.91       236
           J       0.93      0.89      0.91       209
           K       0.87      0.92      0.89       213
           L       0.94      0.92      0.93       239
           M       0.93      0.93      0.93       240
           N       0.96      0.90      0.92       239
           O       0.87      0.81      0.84       243
           P       0.89      0.93      0.91       243
           Q       0.89      0.89      0.89       228
           R       0.89    

### Q2. Ensemble Method by manipulation of Classifiers (using Voting Classifier)

The VotingClassifier takes a list of named estimators and a voting method. **Hard** voting takes each estimator's predicted label and returns the majority label, while **soft** voting sums the predicted class probabilities across estimators and returns the class with the largest total (the argmax).

After we provide the desired classifiers, we need to fit the resulting ensemble classifier object. We can then get predictions and use accuracy metrics.
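The two voting rules can disagree when one classifier is much more confident than the others. The following sketch uses made-up probabilities for three hypothetical classifiers and two classes; the function names are illustrative and are not part of scikit-learn.

```python
from collections import Counter

def hard_vote(probabilities, classes):
    """Each classifier votes for its most probable class; the most frequent vote wins."""
    votes = [classes[max(range(len(classes)), key=lambda i: p[i])]
             for p in probabilities]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probabilities, classes):
    """Sum the per-class probabilities over classifiers and take the argmax."""
    totals = [sum(p[i] for p in probabilities) for i in range(len(classes))]
    return classes[max(range(len(classes)), key=lambda i: totals[i])]

classes = ["A", "B"]
probabilities = [
    [0.90, 0.10],  # classifier 1 is very sure of A
    [0.40, 0.60],  # classifier 2 leans towards B
    [0.45, 0.55],  # classifier 3 leans towards B
]
print(hard_vote(probabilities, classes))  # B
print(soft_vote(probabilities, classes))  # A
```

Soft voting requires every base classifier to provide probability estimates, whereas hard voting only needs predicted labels.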

In [14]:
# Import required libraries
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier


In [15]:
# Implement the different classifiers
clf1 = DecisionTreeClassifier(criterion='entropy', random_state = 30) 
clf2 = KNeighborsClassifier(n_neighbors=3, metric = 'euclidean')
clf3 = KNeighborsClassifier(n_neighbors=5, metric = 'euclidean')
clf4 = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
clf5 = GaussianNB()

In [19]:
# Build Voting Classifier using above estimators and hard voting method
# Function to be used: VotingClassifier(estimators,voting)
# estimators is a list of ('base classifier name', classifier_variable) tuples
eclf1 = VotingClassifier(estimators=[
        ('dtree', clf1), ('3nn', clf2), ('5nn_euc', clf3), ('5nn_man', clf4), ('nb', clf5)], voting='hard')

In [21]:
# Fit the voting classifier model and print scores using k-fold cross validation with k=5

scores = cross_val_score(eclf1, X_train, Y_train, cv=5, scoring='accuracy')
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.94392857 0.9425     0.9425     0.945      0.94428571]
mean score:  0.9436428571428571


### Prediction and Evaluation

In [22]:
# Predict results on the testing part
Y_pred = eclf1.fit(X_train, Y_train).predict(X_test)

In [23]:
# Calculate and print confusion matrix and other performance measures 
print(classification_report(Y_test,Y_pred))
print("Confusion Matrix")
print(confusion_matrix(Y_test,Y_pred))
print("\n Accuracy")
print(accuracy_score(Y_test,Y_pred))

              precision    recall  f1-score   support

           A       0.98      1.00      0.99       229
           B       0.82      0.97      0.89       228
           C       0.96      0.95      0.96       220
           D       0.88      0.98      0.93       219
           E       0.96      0.94      0.95       232
           F       0.93      0.93      0.93       225
           G       0.95      0.91      0.93       234
           H       0.89      0.91      0.90       206
           I       0.94      0.97      0.96       236
           J       0.98      0.93      0.96       209
           K       0.94      0.89      0.91       213
           L       0.99      0.95      0.97       239
           M       0.96      0.98      0.97       240
           N       0.97      0.93      0.95       239
           O       0.92      0.95      0.93       243
           P       0.96      0.94      0.95       243
           Q       0.97      0.96      0.96       228
           R       0.93    

### Q3. Manipulating the features

In [98]:
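# Generate five random vectors of 10 feature positions each (values 0-15);
# positions are drawn with replacement, so a vector may repeat a feature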
import numpy as np
random_state = np.random.RandomState(seed=5)
x = random_state.randint(16, size=(5, 10))

In [99]:
x

array([[ 3, 14, 15, 13,  6,  6,  0,  9,  8,  4],
       [ 7, 14, 11,  0, 14,  0,  7, 12, 15,  1],
       [ 5,  7,  0, 11, 12, 13, 11,  1, 15, 14],
       [ 4,  6, 13,  2, 14,  9,  9, 10,  9,  9],
       [14,  1,  2,  7,  0, 14,  5, 10,  0,  0]])

In [123]:
# Model 1
# Select the feature subset for this model (column positions given by x[0])
# The target column is unchanged
# Train the model
X_train_1 = X_train.iloc[:, x[0]]
Y_train_1 = Y_train
dtree_entropy_1 = DecisionTreeClassifier(criterion='entropy', random_state = 30,max_depth=50) 
dtree_entropy_1 = dtree_entropy_1.fit(X_train_1,Y_train_1)


In [124]:
# Model 2
# Select the feature subset for this model (column positions given by x[1])
# The target column is unchanged
# Train the model
X_train_2 = X_train.iloc[:, x[1]]
Y_train_2 = Y_train
dtree_entropy_2 = DecisionTreeClassifier(criterion='entropy', random_state = 30) 
dtree_entropy_2 = dtree_entropy_2.fit(X_train_2,Y_train_2)


In [125]:
# Model 3
# Select the feature subset for this model (column positions given by x[2])
# The target column is unchanged
# Train the model
X_train_3 = X_train.iloc[:, x[2]]
Y_train_3 = Y_train
dtree_entropy_3 = DecisionTreeClassifier(criterion='entropy', random_state = 30) 
dtree_entropy_3 = dtree_entropy_3.fit(X_train_3,Y_train_3)

In [126]:
# Model 4
# Select the independent variables 
# select only the target lableled column
# Train the model
X_train_4 = X_train.iloc[:, x[3]]
Y_train_4 = Y_train
dtree_entropy_4 = DecisionTreeClassifier(criterion='entropy', random_state = 30) 
dtree_entropy_4 = dtree_entropy_4.fit(X_train_4,Y_train_4)

In [127]:
# Model 5
# Select the feature subset for this model (column positions given by x[4])
# The target column is unchanged
# Train the model
X_train_5 = X_train.iloc[:, x[4]]
Y_train_5 = Y_train
dtree_entropy_5 = DecisionTreeClassifier(criterion='entropy', random_state = 30) 
dtree_entropy_5 = dtree_entropy_5.fit(X_train_5,Y_train_5)

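In [129]:
# Apply Voting Classifier
# Note: VotingClassifier.fit() clones each estimator and refits it on the data it is given,
# so the five trees are retrained on all 16 features of X_train and this ensemble does not
# use the feature subsets selected above (the report below matches the single tree in Q1)
dtree_eclf_1 = VotingClassifier(estimators=[
        ('dtree_1', dtree_entropy_1), ('dtree_2',dtree_entropy_2), ('dtree_3', dtree_entropy_3), ('dtree_4', dtree_entropy_4), ('dtree_5', dtree_entropy_5)], voting='hard')

In [130]:
# Fit the voting classifier and predict results on the testing part
Y_pred = dtree_eclf_1.fit(X_train, Y_train).predict(X_test)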

In [131]:
print(classification_report(Y_test,Y_pred))
print("Confusion Matrix")
print(confusion_matrix(Y_test,Y_pred))
print("\n Accuracy")
print(accuracy_score(Y_test,Y_pred))

              precision    recall  f1-score   support

           A       0.93      0.91      0.92       229
           B       0.82      0.83      0.83       228
           C       0.91      0.88      0.89       220
           D       0.78      0.86      0.82       219
           E       0.84      0.87      0.85       232
           F       0.83      0.77      0.80       225
           G       0.87      0.80      0.83       234
           H       0.74      0.79      0.76       206
           I       0.88      0.92      0.90       236
           J       0.90      0.89      0.90       209
           K       0.83      0.84      0.84       213
           L       0.92      0.92      0.92       239
           M       0.91      0.91      0.91       240
           N       0.89      0.87      0.88       239
           O       0.90      0.81      0.85       243
           P       0.85      0.92      0.88       243
           Q       0.87      0.82      0.84       228
           R       0.81    
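Because `VotingClassifier.fit` refits its estimators on whatever data it receives, a feature-subset ensemble needs each already fitted model to see only its own columns at prediction time, with the vote taken afterwards. A minimal sketch of that idea follows. It uses plain lists and a made-up `ThresholdModel` stand-in (any object with a `predict` method would do), so it is an illustration under those assumptions and not the notebook's actual models.

```python
from collections import Counter

class ThresholdModel:
    """Hypothetical stand-in for a fitted classifier:
    predicts 'hi' when the row sum exceeds a cut-off, else 'lo'."""
    def __init__(self, cutoff):
        self.cutoff = cutoff

    def predict(self, rows):
        return ["hi" if sum(row) > self.cutoff else "lo" for row in rows]

def predict_with_feature_subsets(models, feature_sets, rows):
    """Give each model only its own columns, then take a majority vote per row."""
    per_model = []
    for model, columns in zip(models, feature_sets):
        subset = [[row[c] for c in columns] for row in rows]
        per_model.append(model.predict(subset))
    # zip(*per_model) regroups the predictions row by row
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*per_model)]

models = [ThresholdModel(5), ThresholdModel(5), ThresholdModel(5)]
feature_sets = [[0, 1], [1, 2], [0, 2]]
rows = [[6, 0, 0], [1, 1, 1], [3, 3, 0]]
print(predict_with_feature_subsets(models, feature_sets, rows))
# ['hi', 'lo', 'lo']
```

In the notebook, the equivalent step would be to call `dtree_entropy_1.predict(X_test.iloc[:, x[0]])`, and likewise for the other four trees, before taking the majority label for each test row.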

### Q4. Manipulating the classes

In [135]:
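# Generate 5 random two-class groupings: each row holds 13 letter indices (0-25, drawn with
# replacement) that form class 0; all remaining letters form class 1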
random_state = np.random.RandomState(seed=5)
x = random_state.randint(26, size=(5, 13))

In [296]:
x

array([[ 3, 14, 15,  6, 22, 16,  9,  8,  4,  7, 16, 16,  7],
       [12, 15, 17, 21,  7, 16, 12, 13, 11,  1, 15, 20, 22],
       [18, 25,  9, 10,  9,  9,  1, 18,  7, 16, 14,  5,  0],
       [16,  4, 14,  4,  9, 19,  2,  4,  6,  9, 19, 19, 18],
       [17, 21,  7,  4, 12, 13, 11, 11,  3,  1, 23,  3, 14]])

In [306]:
# Make one copy of the dataset for each two-class relabelling
df_new_1 = df.copy(deep=True)
df_new_2 = df.copy(deep=True)
df_new_3 = df.copy(deep=True)
df_new_4 = df.copy(deep=True)
df_new_5 = df.copy(deep=True)


In [308]:
# Map each letter to its index (A=0 ... Z=25), then relabel it as class 0 if the index
# appears in the corresponding row of x and as class 1 otherwise
letter_index = df[0].map(lambda c: ord(c) - ord('A'))
for k, df_k in enumerate([df_new_1, df_new_2, df_new_3, df_new_4, df_new_5]):
    # Store the labels as integers so the classifier accepts them as a target
    df_k[0] = (~letter_index.isin(x[k])).astype(int)

In [137]:
# Model 1
# Select the independent variables and the two-class target from df_new_1,
# restricted to the training rows so that the test partition stays unseen
# Train the model
X_train_1 = df_new_1.loc[X_train.index, df_new_1.columns[1:]]
Y_train_1 = df_new_1.loc[X_train.index, df_new_1.columns[0]].astype(int)
dtree_entropy_1 = DecisionTreeClassifier(criterion='entropy', random_state = 30,max_depth=50) 
dtree_entropy_1 = dtree_entropy_1.fit(X_train_1,Y_train_1)

In [None]:
# Model 2
# Select the independent variables and the two-class target from df_new_2,
# restricted to the training rows so that the test partition stays unseen
# Train the model
X_train_2 = df_new_2.loc[X_train.index, df_new_2.columns[1:]]
Y_train_2 = df_new_2.loc[X_train.index, df_new_2.columns[0]].astype(int)
dtree_entropy_2 = DecisionTreeClassifier(criterion='entropy', random_state = 30,max_depth=50) 
dtree_entropy_2 = dtree_entropy_2.fit(X_train_2,Y_train_2)

In [None]:
# Model 3
# Select the independent variables and the two-class target from df_new_3,
# restricted to the training rows so that the test partition stays unseen
# Train the model
X_train_3 = df_new_3.loc[X_train.index, df_new_3.columns[1:]]
Y_train_3 = df_new_3.loc[X_train.index, df_new_3.columns[0]].astype(int)
dtree_entropy_3 = DecisionTreeClassifier(criterion='entropy', random_state = 30,max_depth=50) 
dtree_entropy_3 = dtree_entropy_3.fit(X_train_3,Y_train_3)

In [None]:
# Model 4
# Select the independent variables and the two-class target from df_new_4,
# restricted to the training rows so that the test partition stays unseen
# Train the model
X_train_4 = df_new_4.loc[X_train.index, df_new_4.columns[1:]]
Y_train_4 = df_new_4.loc[X_train.index, df_new_4.columns[0]].astype(int)
dtree_entropy_4 = DecisionTreeClassifier(criterion='entropy', random_state = 30,max_depth=50) 
dtree_entropy_4 = dtree_entropy_4.fit(X_train_4,Y_train_4)

In [None]:
# Model 5
# Select the independent variables and the two-class target from df_new_5,
# restricted to the training rows so that the test partition stays unseen
# Train the model
X_train_5 = df_new_5.loc[X_train.index, df_new_5.columns[1:]]
Y_train_5 = df_new_5.loc[X_train.index, df_new_5.columns[0]].astype(int)
dtree_entropy_5 = DecisionTreeClassifier(criterion='entropy', random_state = 30,max_depth=50) 
dtree_entropy_5 = dtree_entropy_5.fit(X_train_5,Y_train_5)
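One possible way to turn several two-class models back into a letter prediction is to count, for every letter, how many binary predictions are consistent with it, and to return the letter with the most votes. This decoding rule is an assumption made for illustration; the notebook does not specify one. The groupings and predictions below are hypothetical.

```python
import string

LETTERS = string.ascii_uppercase

def decode_letter(binary_predictions, groups):
    """Combine two-class predictions into one letter by vote counting.

    binary_predictions[k] is 0 if model k says the letter is in groups[k], else 1.
    Returns the winning letter and its vote count.
    """
    votes = [0] * len(LETTERS)
    for prediction, group in zip(binary_predictions, groups):
        members = set(group)
        for idx in range(len(LETTERS)):
            # A letter earns a vote when the prediction agrees with its class
            if (prediction == 0) == (idx in members):
                votes[idx] += 1
    best = max(range(len(LETTERS)), key=lambda i: votes[i])  # ties go to the earliest letter
    return LETTERS[best], votes[best]

# Hypothetical groupings (letter indices, A=0) and model outputs for one sample
groups = [[0, 1, 2], [0, 3, 4], [0, 5, 6]]
print(decode_letter([0, 0, 0], groups))  # ('A', 3)
print(decode_letter([0, 1, 1], groups))  # ('B', 3)
```

With only a handful of binary models, many letters can receive the same number of votes (B and C tie in the second call), so the tie-breaking rule has a large influence on the result.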

In [None]:
# Apply Voting Classifier

In [None]:
# Calculate and print confusion matrix and other performance measures 

### Q5. Which method performs the best?