# Ensembles

**Q1(a)**  
Load the Wine dataset using the CSV file provided, and assess the accuracy of a decision tree classifier using 10-fold cross-validation.  
What percentage of instances are correctly classified?

In [50]:
import pandas as pd
import numpy as np
from collections import Counter
wine_pd = pd.read_csv('Wine.csv')
wine_pd.head()

Unnamed: 0,Alcohol,Malic_acid,Ash,Alcalinity_of_ash,Magnesium,Total_phenols,Flavanoids,Nonflavanoid_phenols,Proanthocyanins,Color_intensity,Hue,OD280/OD315_of_diluted_wines,Proline,class
0,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065,Type1
1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050,Type1
2,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185,Type1
3,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480,Type1
4,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735,Type1


In [51]:
y = wine_pd.pop('class').values
X = wine_pd.values
X.shape

(178, 13)

In [52]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import BaggingClassifier

dtree = DecisionTreeClassifier(criterion='entropy')
folds = 10
v = 0

In [53]:
scores_tree = cross_val_score(dtree, X, y, cv=folds, n_jobs = -1)
print("Mean for Tree {:.2f}".format(scores_tree.mean()))

Mean for Tree 0.92


**Q1(b)**  
Now apply ensemble classification with a decision tree as the base classifier, using bagging to achieve diversity.  
What percentage of instances are now correctly classified with an ensemble of size 10? 


In [54]:
tree_bag = BaggingClassifier(dtree,
                             n_estimators = 10,
                             max_samples = 1.0, # bootstrap resampling 
                             bootstrap = True)

scores_tree_bag = cross_val_score(tree_bag, X, y, cv=folds, n_jobs = -1)
print("Mean for 10 D_Tree_bag {:.2f}".format(scores_tree_bag.mean()))

Mean for 10 D_Tree_bag 0.94


**Q1(c)**   
Repeat (b), for ensembles of size 10, 50, 100, 200 and 300 classifiers.  
What level of improvement does this provide, in terms of percentage of instances correctly classified?

In [55]:
trees = [10,50,100,200,300]
for n_trees in trees:   # dedicated loop variable, so the verbosity setting `v` is not overwritten
    tree_bag = BaggingClassifier(dtree,
                                 n_estimators = n_trees,
                                 max_samples = 1.0, # bootstrap resampling 
                                 bootstrap = True)

    scores_tree_bag = cross_val_score(tree_bag, X, y, cv=folds, n_jobs = -1)
    print("Mean for {} D_Tree_bag {:.2f}".format(n_trees, scores_tree_bag.mean()))

Mean for 10 D_Tree_bag 0.91
Mean for 50 D_Tree_bag 0.96
Mean for 100 D_Tree_bag 0.96
Mean for 200 D_Tree_bag 0.96
Mean for 300 D_Tree_bag 0.96
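
The 10-member ensemble scores 0.94 in (b) and 0.91 in (c), which is consistent with run-to-run randomness in the bootstrap samples and in tree construction. The sketch below shows one way to make such a run repeatable by fixing `random_state`. It assumes scikit-learn's built-in `load_wine` data is an acceptable stand-in for `Wine.csv`; the helper `bagged_cv_accuracy` and the `_demo` names are illustrative only.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: the built-in Wine data stands in for Wine.csv.
X_demo, y_demo = load_wine(return_X_y=True)

def bagged_cv_accuracy(n_estimators, seed):
    """Mean 10-fold accuracy of a bagged decision tree with a fixed seed."""
    base = DecisionTreeClassifier(criterion='entropy', random_state=seed)
    bag = BaggingClassifier(base,
                            n_estimators=n_estimators,
                            bootstrap=True,
                            random_state=seed)
    return cross_val_score(bag, X_demo, y_demo, cv=10).mean()

first = bagged_cv_accuracy(10, seed=0)
second = bagged_cv_accuracy(10, seed=0)
print("Same seed, same score:", first == second)
```

With the seed fixed, differences between ensemble sizes can be attributed to the size itself rather than to the random draw.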


**Q1(d)**  
Why does the level of improvement in accuracy often “level off” after an ensemble has been increased to a certain size?  
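
One way to build intuition for this question is an idealised calculation: if every ensemble member were correct independently with the same probability `p` on a two-class task, the accuracy of a majority vote follows from the binomial distribution. The function name and the value `p = 0.6` below are illustrative assumptions, and real bagged trees are not independent, so this is a sketch of the trend rather than a model of the Wine results.

```python
from math import comb

def majority_vote_accuracy(n_members, p):
    """P(strict majority of n independent voters is correct), odd n, binary task."""
    k_min = n_members // 2 + 1
    return sum(comb(n_members, k) * p**k * (1 - p)**(n_members - k)
               for k in range(k_min, n_members + 1))

for n in (1, 11, 51, 101, 201, 301):
    print("{:>3} members: {:.3f}".format(n, majority_vote_accuracy(n, 0.6)))
```

Even in this idealised setting most of the gain comes from the first few dozen members. When members make correlated errors, as trees trained on overlapping bootstrap samples tend to, the curve can be expected to flatten earlier and at a lower level.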

## Q2
**Q2(a)**  

Load the Blood Alcohol Content (BAC) dataset using the CSV file provided.   
This dataset contains a mix of numeric and categorical data, so use one-hot encoding to convert it to a numeric format. When this dataset was collected, the BAC limit for driving was 0.8 mg/ml. Convert this to a classification task by adding a binary Over/Under feature, where Over is a BAC level > 0.8 mg/ml. 


In [56]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
BAC_pd = pd.read_csv('BAC.csv')
BAC_pd.head()

Unnamed: 0,Gender,FrameSize,AmountConsumed,Meal,Duration,BAC
0,male,1,1,snack,60,0.2
1,female,2,3,none,120,0.8
2,female,4,4,full,90,0.8
3,male,4,6,none,120,1.0
4,male,4,3,none,60,0.5


In [57]:
BAC_pd['Over']='Yes'
BAC_pd.loc[BAC_pd['BAC'] <= 0.8 ,'Over'] = 'No'
Counter(BAC_pd['Over'])

Counter({'Yes': 44, 'No': 41})

In [58]:
BAC_cat = BAC_pd[['Gender','Meal']]                         # categorical features 
BAC_num = BAC_pd[['FrameSize','AmountConsumed','Duration']] # numeric features

In [59]:
onehot_encoder = OneHotEncoder(sparse_output=False, drop = 'first')
BAC_cat_oh = onehot_encoder.fit_transform(BAC_cat)
BAC_cat_oh.shape

(85, 3)

In [60]:
BAC_num_np = BAC_num.values
X = np.concatenate((BAC_cat_oh, BAC_num_np), axis=1) # merge the two numeric arrays 
y = BAC_pd['Over'].values
X.shape, y.shape

((85, 6), (85,))

**Q2(b)**  
Using 10-fold cross validation, compare the performance of:  
- a single decision tree, 
- a bagging ensemble (100 members) and 
- a boosting ensemble (also 100 members).

**Q2(c)**  
Are the results from a single cross validation run stable?


In [61]:
dtree = DecisionTreeClassifier(criterion='entropy')

tree_bag = BaggingClassifier(dtree, 
                            n_estimators = 100,
                            max_samples = 1.0, # bootstrap resampling 
                            bootstrap = True)

GBC = GradientBoostingClassifier(n_estimators=100, random_state=4)

In [62]:
def mean(el):
    return np.array(el).mean()

In [63]:
tree_l, bag_l, boost_l = [],[],[]
reps = 1
v = 0   # keep cross_val_score quiet

for rep in range(reps):
    kf = KFold(n_splits=10, shuffle = True) # shuffle so each run uses a different random partition
    tree_l.append(mean(cross_val_score(dtree, X, y, cv=kf, verbose = v, n_jobs = -1)))
    bag_l.append(mean(cross_val_score(tree_bag, X, y, cv=kf, verbose = v, n_jobs = -1)))
    boost_l.append(mean(cross_val_score(GBC, X, y, cv=kf, verbose = v, n_jobs = -1)))
    
print("Mean for Tree {:.2f}".format(mean(tree_l)))
print("Mean for Tree Bag {:.2f}".format(mean(bag_l)))
print("Mean for Tree Boost {:.2f}".format(mean(boost_l)))


*With reps set to 1 we get quite different results from successive runs.*     
**Q2(d)**  
Repeat the 10-fold cross validation comparison 50 times to get a more robust comparison.
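
A compact way to sketch repeated cross-validation is `RepeatedKFold`, which generates all repetitions from one splitter. The example below runs on synthetic data from `make_classification` with the same shape as the BAC feature matrix (85 by 6), and only for a single decision tree; the data and the `_demo` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in with the same shape as the BAC feature matrix.
X_demo, y_demo = make_classification(n_samples=85, n_features=6, random_state=0)

rkf = RepeatedKFold(n_splits=10, n_repeats=50, random_state=0)
tree_demo = DecisionTreeClassifier(criterion='entropy', random_state=0)
scores = cross_val_score(tree_demo, X_demo, y_demo, cv=rkf)   # 50 x 10 fold scores

# Splits are produced repetition by repetition, so each row is one 10-fold run.
rep_means = scores.reshape(50, 10).mean(axis=1)
print("Mean {:.2f}, std across repetitions {:.3f}".format(rep_means.mean(),
                                                          rep_means.std()))
```

Reporting the spread across repetitions alongside the mean makes it easier to judge whether a difference between two methods is larger than the noise of a single run.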

## Q3
**Q3(a)**  
Load the glass dataset from glass.csv.  
Evaluate a 1-NN classifier using 10-fold cross-validation.   
What is the overall accuracy achieved?

**Q3(b)**   
Apply bagging with a 1-NN classifier for an ensemble size of 100.  
What is the improvement in terms of overall accuracy?

In [64]:
from sklearn.preprocessing import StandardScaler
glass_pd = pd.read_csv('glass.csv')
glass_pd.head()

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
0,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,1
1,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,1
2,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,1
3,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.0,0.0,1
4,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.0,0.0,1


In [65]:
y = glass_pd.pop('Type').values
Xr = glass_pd.values

gScal = StandardScaler().fit(Xr)
X = gScal.transform(Xr)         # Caveat: the scaler is fitted on all the data, so test folds influence the scaling.
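
Fitting the scaler on the full dataset lets the test folds influence the scaling. A minimal sketch of one way to avoid this is to wrap the scaler and the classifier in a pipeline, so the scaler is refitted on the training part of every fold. The synthetic data and the names `scaled_kNN`, `kf_demo` and `scores_demo` are assumptions for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data with 9 features.
X_demo, y_demo = make_classification(n_samples=200, n_features=9,
                                     n_informative=5, random_state=0)

# The pipeline fits StandardScaler on the training folds only, then applies it
# to the held-out fold before the 1-NN prediction.
scaled_kNN = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))

kf_demo = KFold(n_splits=10, shuffle=True, random_state=0)
scores_demo = cross_val_score(scaled_kNN, X_demo, y_demo, cv=kf_demo)
print("Mean for scaled 1-NN {:.2f}".format(scores_demo.mean()))
```

The same pipeline object can be passed to `BaggingClassifier` as the base estimator if the ensemble members should also scale within their own samples.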

In [66]:
from sklearn.neighbors import KNeighborsClassifier
kNN = KNeighborsClassifier(n_neighbors = 1)
kNN_bag = BaggingClassifier(kNN, 
                            n_estimators = 100,
                            max_samples = 1.0, # bootstrap resampling 
                            bootstrap = True)

In [67]:
def mean(el):
    return np.array(el).mean()

In [68]:
kNN_l, bag_l = [],[]
reps = 20
v = 0   # keep cross_val_score quiet

for rep in range(reps):
    kf = KFold(n_splits=10, shuffle = True) # shuffle so each repetition uses a different random partition
    kNN_l.append(mean(cross_val_score(kNN, X, y, cv=kf, verbose = v, n_jobs = -1)))
    bag_l.append(mean(cross_val_score(kNN_bag, X, y, cv=kf, verbose = v, n_jobs = -1)))
    
print("Mean for kNN {:.2f}".format(mean(kNN_l)))
print("Mean for kNN Bag {:.2f}".format(mean(bag_l)))



**Q3(c)**  
Now apply random subspacing with a 1-NN classifier for an ensemble size of 100.  
How does it compare to the results from (b)? How do you explain this difference?


In [69]:
kNN_RS = BaggingClassifier(kNN, 
                            n_estimators = 100,
                            max_features = 0.5,
                            bootstrap = False)

In [70]:
kNN_l, bag_l, RS_l = [],[],[]
reps = 20
v = 0

for rep in range(reps):
    kf = KFold(n_splits=10, shuffle = True) # shuffle so each repetition uses a different random partition
    kNN_l.append(mean(cross_val_score(kNN, X, y, cv=kf, verbose = v, n_jobs = -1)))
    bag_l.append(mean(cross_val_score(kNN_bag, X, y, cv=kf, verbose = v, n_jobs = -1)))
    RS_l.append(mean(cross_val_score(kNN_RS, X, y, cv=kf, verbose = v, n_jobs = -1)))

print("Mean for kNN {:.2f}".format(mean(kNN_l)))
print("Mean for kNN Bag {:.2f}".format(mean(bag_l)))
print("Mean for kNN RS {:.2f}".format(mean(RS_l)))


*Bagging doesn't achieve diversity with kNN whereas Random Subspacing does.*  
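
A possible explanation for the lack of diversity under bagging is that a bootstrap sample of size n still contains roughly 63% of the distinct training instances, so for many queries the nearest neighbour, or a close same-class substitute, survives in most samples. The sketch below computes that fraction; the function names are illustrative, and the link to 1-NN stability is offered as a plausible reading rather than a result from this notebook.

```python
import random

def expected_unique_fraction(n):
    """Expected share of distinct instances in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n

def simulated_unique_fraction(n, seed=0):
    """Share of distinct instances in one simulated bootstrap sample."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n

for n in (50, 200, 1000):
    print("n={:>4}: expected {:.3f}, one simulated draw {:.3f}".format(
        n, expected_unique_fraction(n), simulated_unique_fraction(n)))
```

Random subspacing changes the distance measure itself by dropping features, which can change which instance is nearest even though every training instance is kept.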
**Q3(d)**  
What happens to the overall ensemble accuracy when we increase the subspace size to a value closer to 1 (e.g. max_features=0.8)?  
What is the explanation for the change in accuracy?


In [71]:
kNN_RS = BaggingClassifier(kNN, 
                            n_estimators = 100,
                            max_features = 0.8,
                            bootstrap = False)

In [72]:
RS_l = []
reps = 20


for rep in range(reps):
    kf = KFold(n_splits=10, shuffle = True) # shuffle so each repetition uses a different random partition
    RS_l.append(mean(cross_val_score(kNN_RS, X, y, cv=kf, verbose = v, n_jobs = -1)))
print("Mean for kNN RS {:.2f}".format(mean(RS_l)))

Mean for kNN RS 0.71
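
One angle on the explanation asked for in Q3(d) is to count how many distinct feature subsets are available at each setting. The sketch assumes the glass data has the nine features shown in the head output (RI through Fe) and that a fractional `max_features` is converted to a count by truncation, with a minimum of one; both are assumptions worth checking against the scikit-learn documentation for the installed version.

```python
from math import comb

n_features = 9   # assumption: RI, Na, Mg, Al, Si, K, Ca, Ba, Fe

def subspace_size(max_features, n_features):
    """Assumed conversion of a fractional max_features to a feature count."""
    return max(1, int(max_features * n_features))

for frac in (0.5, 0.8):
    k = subspace_size(frac, n_features)
    print("max_features={}: {} features per member, {} distinct subsets".format(
        frac, k, comb(n_features, k)))
```

With larger subspaces there are fewer distinct subsets and any two members share most of their features, so a 100-member ensemble contains many near-duplicate classifiers and less diversity to exploit.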
