# Ensembles

**Q1(a)**  
Load the Wine dataset using the CSV file provided, and assess the accuracy of a decision tree classifier using 10-fold cross-validation.  
What percentage of instances are correctly classified?

In [7]:
import pandas as pd
import numpy as np
from collections import Counter
wine_pd = pd.read_csv('Wine.csv')
wine_pd.head()

| | Alcohol | Malic_acid | Ash | Alcalinity_of_ash | Magnesium | Total_phenols | Flavanoids | Nonflavanoid_phenols | Proanthocyanins | Color_intensity | Hue | OD280/OD315_of_diluted_wines | Proline | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 | Type1 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 | Type1 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 | Type1 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 | Type1 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 | Type1 |


In [8]:
y = wine_pd.pop('class').values
X = wine_pd.values
X.shape

(178, 13)

In [9]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import BaggingClassifier

dtree = DecisionTreeClassifier(criterion='entropy')
folds = 10
v = 0

In [10]:
scores_tree = cross_val_score(dtree, X, y, cv=folds, verbose = v, n_jobs = -1)
print("Mean for Tree {:.2f}".format(scores_tree.mean()))

Mean for Tree 0.89


**Q1(b)**  
Now apply ensemble classification, using bagging to achieve diversity, with a decision tree as the base classifier.  
What percentage of instances are now correctly classified with an ensemble of size 10? 


In [6]:
wine_bag = BaggingClassifier(dtree,
                            n_estimators = 10,
                            max_samples = 1.0, # bootstrap resampling
                            bootstrap = True)
scores_bag = cross_val_score(wine_bag,X,y,cv= folds,verbose= v, n_jobs= -1)
print("Mean for Tree Bag {:.2f}".format(scores_bag.mean()))

**Q1(c)**   
Repeat (b), for ensembles of size 10, 50, 100, 200 and 300 classifiers.  
What level of improvement does this provide, in terms of percentage of instances correctly classified?

In [5]:
trees = [10,50,100,200,300]
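
One way to run the comparison is to loop over the ensemble sizes and cross-validate a fresh bagging ensemble at each size. The sketch below is self-contained, so it makes some assumptions: it uses scikit-learn's bundled `load_wine` data as a stand-in for `Wine.csv`, and the names `X_demo`, `y_demo`, `base_tree` and `bag_means` are introduced here for illustration only.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled Wine data stands in for Wine.csv.
X_demo, y_demo = load_wine(return_X_y=True)

base_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
sizes = [10, 50, 100, 200, 300]
bag_means = {}

for n in sizes:
    # Same bagging setup as in (b), with only the ensemble size changed.
    bag = BaggingClassifier(base_tree, n_estimators=n, max_samples=1.0,
                            bootstrap=True, random_state=0)
    scores = cross_val_score(bag, X_demo, y_demo, cv=10)
    bag_means[n] = scores.mean()
    print("Mean for Bagging with {:3d} trees: {:.2f}".format(n, bag_means[n]))
```

Fixing `random_state` makes the run repeatable. Drop it to see how much the figures move between runs.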


**Q1(d)**  
Why does the level of improvement in accuracy often “level off” after an ensemble has been increased to a certain size?  

**Q2(a)**  

Load the Blood Alcohol Content (BAC) dataset using the CSV file provided.   
This dataset contains a mix of numeric and categorical data; use one-hot encoding to convert it to a numeric format. When this dataset was collected, the BAC limit for driving was 0.8 mg/ml. Convert this to a classification task by adding a binary Over/Under feature, where Over is a BAC level > 0.8 mg/ml. 


In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
BAC_pd = pd.read_csv('BAC.csv')
BAC_pd.head()
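
The two preprocessing steps can be sketched on a toy frame. The column names and values below are hypothetical, because the real columns of `BAC.csv` are not shown here. The sketch uses `pd.get_dummies` for the one-hot encoding (the `OneHotEncoder` imported above would also work). It removes the raw BAC column before encoding so that the target does not leak into the features.

```python
import pandas as pd

# Toy stand-in for BAC.csv: these column names and values are hypothetical.
toy = pd.DataFrame({
    "Gender": ["male", "female", "female", "male"],
    "Beers": [5, 2, 9, 3],
    "Weight": [80, 60, 55, 95],
    "BAC": [1.0, 0.3, 1.9, 0.2],
})

# Binary target: Over when BAC exceeds the 0.8 limit, Under otherwise.
# pop() also removes BAC from the features.
y_toy = (toy.pop("BAC") > 0.8).map({True: "Over", False: "Under"}).values

# One-hot encode the categorical column; numeric columns pass through unchanged.
X_toy = pd.get_dummies(toy, columns=["Gender"], dtype=float)

print(X_toy)
print(y_toy)
```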

**Q2(b)**  
Using 10-fold cross validation, compare the performance of:  
- a single decision tree, 
- a bagging ensemble (100 members) and 
- a boosting ensemble (also 100 members).

**Q2(c)**  
Are the results from a single cross validation run stable?


In [None]:
dtree = DecisionTreeClassifier(criterion='entropy')

tree_bag = BaggingClassifier(dtree, 
                            n_estimators = 100,
                            max_samples = 1.0, # bootstrap resampling 
                            bootstrap = True)

GBC = GradientBoostingClassifier(n_estimators=100, random_state=4)

In [None]:
def mean(el):
    return np.array(el).mean()

In [None]:
tree_l, bag_l, boost_l = [],[],[]
reps = 1

for rep in range(reps):
    kf = KFold(n_splits=10, shuffle = True) # shuffle so each repetition uses a different split
    tree_l.append(mean(cross_val_score(dtree, X, y, cv=kf, verbose = v, n_jobs = -1)))
    bag_l.append(mean(cross_val_score(tree_bag, X, y, cv=kf, verbose = v, n_jobs = -1)))
    boost_l.append(mean(cross_val_score(GBC, X, y, cv=kf, verbose = v, n_jobs = -1)))
    
print("Mean for Tree {:.2f}".format(mean(tree_l)))
print("Mean for Tree Bag {:.2f}".format(mean(bag_l)))
print("Mean for Tree Boost {:.2f}".format(mean(boost_l)))

*With reps set to 1 we get quite different results from successive runs.*     
**Q2(d)**  
Repeat the 10-fold cross validation comparison 50 times to get a more robust comparison.

*It seems both Bagging and Boosting improve over a single tree.  
There is not much to choose between the two ensemble options.*  
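
As an alternative to the explicit `reps` loop, scikit-learn's `RepeatedKFold` can generate all 50 reshuffled 10-fold splits in one go, which also makes it easy to see how much a single run varies. This is a sketch under assumptions: it uses the bundled Wine data as a stand-in for `BAC.csv` and evaluates only a single tree. The names `fold_scores` and `rep_means` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: stand-in data, not BAC.csv.
X_demo, y_demo = load_wine(return_X_y=True)

n_splits, n_repeats = 10, 50
rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)

# One score per fold, ordered repetition by repetition.
fold_scores = cross_val_score(tree, X_demo, y_demo, cv=rkf)

# Average the 10 folds of each repetition to get 50 single-run estimates.
rep_means = fold_scores.reshape(n_repeats, n_splits).mean(axis=1)

print("Mean over {} repetitions: {:.2f}".format(n_repeats, rep_means.mean()))
print("Std of per-repetition means: {:.3f}".format(rep_means.std()))
print("Range of single-run means: {:.2f} to {:.2f}".format(rep_means.min(), rep_means.max()))
```

The spread of `rep_means` is a direct measure of how unstable a single cross-validation run is.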

**Q3(a)**  
Load the glass dataset from glass.csv.  
Evaluate a 1-NN classifier using 10-fold cross-validation.   
What is the overall accuracy achieved?

**Q3(b)**   
Apply bagging with a 1-NN classifier for an ensemble size of 100.  
What is the improvement in terms of overall accuracy?

In [None]:
from sklearn.preprocessing import StandardScaler
glass_pd = pd.read_csv('glass.csv')
glass_pd.head()

In [None]:
y = glass_pd.pop('Type').values
Xr = glass_pd.values

gScal = StandardScaler().fit(Xr)
X = gScal.transform(Xr)         # Cheating: the scaler is fit on all the data, so test folds leak into the scaling.
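
To avoid the scaling shortcut, the scaler can be wrapped in a pipeline with the classifier, so that cross-validation refits it on the training folds only. A minimal sketch, assuming the bundled Wine data as a stand-in for `glass.csv`; `scaled_kNN` and `pipe_scores` are names introduced here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: stand-in data, not glass.csv.
X_raw, y_demo = load_wine(return_X_y=True)

# Inside each CV split the scaler is fit on the training folds only,
# then applied to the held-out fold.
scaled_kNN = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))

kf = KFold(n_splits=10, shuffle=True, random_state=0)
pipe_scores = cross_val_score(scaled_kNN, X_raw, y_demo, cv=kf)
print("Mean for 1-NN with in-fold scaling {:.2f}".format(pipe_scores.mean()))
```

The same pipeline object can be passed to `BaggingClassifier` as the base estimator in place of the bare classifier.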

In [None]:
from sklearn.neighbors import KNeighborsClassifier
kNN = KNeighborsClassifier(n_neighbors = 1)
kNN_bag = BaggingClassifier(kNN, 
                            n_estimators = 100,
                            max_samples = 1.0, # bootstrap resampling 
                            bootstrap = True)

In [None]:
kNN_l, bag_l = [],[]
reps = 20


**Q3(c)**  
Now apply random subspacing with a 1-NN classifier for an ensemble size of 100.  
How does it compare to the results from (b)? How do you explain this difference?


In [None]:
kNN_RS = BaggingClassifier(kNN, 
                            n_estimators = 100,
                            max_features = 0.5,
                            bootstrap = False)

In [None]:
kNN_l, bag_l, RS_l = [],[],[]
reps = 20
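
The repeated comparison can follow the same pattern as the tree, bagging and boosting loop in Q2. The sketch below is one possible version, not necessarily the original cell. It assumes the bundled Wine data as a stand-in for `glass.csv`, scales the whole dataset up front (the same shortcut as the notebook), and uses a small `demo_reps` so it runs quickly. The exercise itself uses `reps = 20`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: stand-in data, not glass.csv.
X_raw, y_demo = load_wine(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_raw)  # whole-dataset scaling shortcut

kNN = KNeighborsClassifier(n_neighbors=1)
models = {
    "kNN": kNN,
    "kNN Bag": BaggingClassifier(kNN, n_estimators=100,
                                 max_samples=1.0, bootstrap=True),
    "kNN RS": BaggingClassifier(kNN, n_estimators=100,
                                max_features=0.5, bootstrap=False),
}
results = {name: [] for name in models}
demo_reps = 2  # kept small for speed; the exercise uses 20

for rep in range(demo_reps):
    # A new shuffled split per repetition, shared by all three models.
    kf = KFold(n_splits=10, shuffle=True, random_state=rep)
    for name, model in models.items():
        results[name].append(cross_val_score(model, X_demo, y_demo, cv=kf).mean())

for name, vals in results.items():
    print("Mean for {} {:.2f}".format(name, np.mean(vals)))
```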



*Bagging doesn't achieve diversity with kNN whereas Random Subspacing does.*  
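
The diversity claim can be checked by measuring how often pairs of ensemble members disagree on held-out data. This is a sketch under assumptions: the bundled Wine data stands in for `glass.csv`, a single train/test split replaces cross-validation, and `mean_disagreement` is a helper written for this illustration. It relies on the fitted ensemble's `estimators_` and `estimators_features_` attributes.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def mean_disagreement(ensemble, X_test):
    """Average fraction of test points on which two ensemble members disagree."""
    preds = [est.predict(X_test[:, feats])
             for est, feats in zip(ensemble.estimators_, ensemble.estimators_features_)]
    return float(np.mean([np.mean(a != b) for a, b in combinations(preds, 2)]))


# Assumption: stand-in data, not glass.csv.
X_raw, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_demo, test_size=0.3,
                                          random_state=0, stratify=y_demo)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

kNN = KNeighborsClassifier(n_neighbors=1)
bag = BaggingClassifier(kNN, n_estimators=100, max_samples=1.0,
                        bootstrap=True, random_state=0).fit(X_tr, y_tr)
rs = BaggingClassifier(kNN, n_estimators=100, max_features=0.5,
                       bootstrap=False, random_state=0).fit(X_tr, y_tr)

d_bag = mean_disagreement(bag, X_te)
d_rs = mean_disagreement(rs, X_te)
print("Pairwise disagreement, bagging:         {:.3f}".format(d_bag))
print("Pairwise disagreement, random subspace: {:.3f}".format(d_rs))
```

A higher disagreement rate means more diverse members. How large the gap is will depend on the dataset.
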
**Q3(d)**  
What happens to the overall ensemble accuracy when we increase the subspace size to a value closer to 1 (e.g. max_features=0.8)?  
What is the explanation for the change in accuracy?


In [None]:
kNN_RS = BaggingClassifier(kNN, 
                            n_estimators = 100,
                            max_features = 0.8,
                            bootstrap = False)

In [None]:
RS_l = []
reps = 20



*With max_features = 0.8, Random Subspacing no longer helps in this case. Presumably there is not enough diversity when 80% of the features are selected each time.*  
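
A quick count of the possible feature subsets helps explain the loss of diversity. The sketch assumes the glass data has 9 predictor columns, which is not stated above, so adjust `n_features` to match `glass_pd`. It also assumes that the fractional `max_features` is truncated to an integer number of features, which is how I understand scikit-learn's bagging code to behave.

```python
from math import comb

n_features = 9   # assumption: number of predictor columns in the glass data
n_members = 100  # ensemble size used in the exercise

subsets = {}
for frac in (0.5, 0.8):
    # Assumed rule: the fraction is truncated, with at least one feature kept.
    k = max(1, int(frac * n_features))
    subsets[frac] = (k, comb(n_features, k))
    print("max_features={}: {} features per member, {} distinct subsets for {} members"
          .format(frac, k, subsets[frac][1], n_members))
```

Under these assumptions there are fewer distinct subsets at 0.8 than there are ensemble members, so many members must be identical, and the rest share most of their features.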