## About iPython Notebooks ##

iPython Notebooks are interactive coding environments embedded in a webpage. You will be using iPython notebooks in this class. Make sure you fill in any place that says `# BEGIN CODE HERE #END CODE HERE`. After writing your code, you can run the cell by either pressing "SHIFT"+"ENTER" or by clicking on "Run" (denoted by a play symbol). Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). 

 **What you need to remember:**

- Run your cells using SHIFT+ENTER (or "Run cell")
- Write code in the designated areas using Python 3 only
- Do not modify the code outside of the designated areas
- In some cases you will also need to explain the results. There will also be designated areas for that. 

Fill in your **NAME** and **AEM** below:

In [None]:
NAME = "Sophia Katsaki"
AEM = "3656"

---

# Assignment 3 - Ensemble Methods #

Welcome to your third assignment. This exercise will test your understanding of Ensemble Methods.

In [None]:
# Always run this cell
import numpy as np
import pandas as pd

# USE THE FOLLOWING RANDOM STATE FOR YOUR CODE
RANDOM_STATE = 42

## Download the Dataset ##
Download the dataset using the following cell or from this [link](https://github.com/sakrifor/public/tree/master/machine_learning_course/EnsembleDataset) and put the files in the same folder as the .ipynb file. 
In this assignment you are going to work with a dataset that originates from [ImageCLEFmed: The Medical Task 2016](https://www.imageclef.org/2016/medical) and its **Compound figure detection** subtask. The goal of this subtask is to identify whether a figure is a compound figure (one image that consists of more than one figure) or not. The train dataset consists of 4197 examples/figures, and each figure has 4096 features which were extracted using a deep neural network. The *CLASS* column represents the class of each example, where 1 is a compound figure and 0 is not. 


In [None]:
import urllib.request
url_train = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/train_set.csv'
filename_train = 'train_set.csv'
urllib.request.urlretrieve(url_train, filename_train)
url_test = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/test_set_noclass.csv'
filename_test = 'test_set_noclass.csv'
urllib.request.urlretrieve(url_test, filename_test)

('test_set_noclass.csv', <http.client.HTTPMessage at 0x7f1d3691cd10>)

In [None]:
# Run this cell to load the data
train_set = pd.read_csv("train_set.csv").sample(frac=1).reset_index(drop=True)
train_set.head()
X = train_set.drop(columns=['CLASS'])
y = train_set['CLASS'].values

## 1.0 Testing different ensemble methods ##
In this part of the assignment you are asked to create and test different ensemble methods using the train_set.csv dataset. You should use **10-fold cross validation** for your tests and report the average f-measure weighted and balanced accuracy of your models. You can use [cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) and select both metrics to be measured during the evaluation. Otherwise, you can use [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold).
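
As a minimal sketch of the `cross_validate` route, the snippet below evaluates both metrics in a single 10-fold run. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, `cv_demo` and `results` are assumptions made for illustration; they are not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the assignment data (assumption, not the real dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

cv_demo = KFold(n_splits=10, shuffle=True, random_state=42)
results = cross_validate(
    DecisionTreeClassifier(max_depth=3, random_state=42),
    X_demo,
    y_demo,
    scoring=["f1_weighted", "balanced_accuracy"],  # both metrics in one pass
    cv=cv_demo,
)

# cross_validate returns one array of fold scores per metric, keyed "test_<metric>".
avg_f1 = results["test_f1_weighted"].mean()
avg_bal_acc = results["test_balanced_accuracy"].mean()
print(f"F1 weighted: {avg_f1:.4f}, balanced accuracy: {avg_bal_acc:.4f}")
```

Compared with calling `cross_val_score` once per metric, this fits each fold only once, which matters when the estimator is slow to train.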

### !!! Use n_jobs=-1 wherever possible to use all the cores of a machine for running your tests ###

### 1.1 Voting ###
Create a voting classifier which uses three **simple** estimators/classifiers. Test both soft and hard voting and choose the best one. Consider as simple estimators the following:


*   Decision Trees
*   Linear Models
*   Probabilistic Models (Naive Bayes)
*   KNN Models  
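
The difference between hard and soft voting can be seen on a single sample. The sketch below uses made-up probabilities (an assumption for illustration, not output from the assignment's models) and plain NumPy rather than `VotingClassifier`.

```python
import numpy as np

# Hypothetical class-1 probabilities from three classifiers for one sample.
proba_class1 = np.array([0.45, 0.45, 0.90])

# Hard voting: each classifier casts one vote for its most likely class,
# and the majority wins.
votes = (proba_class1 >= 0.5).astype(int)            # two votes for 0, one for 1
hard_prediction = int(votes.sum() > len(votes) / 2)  # majority says class 0

# Soft voting: average the probabilities first, then pick the likelier class.
mean_proba = proba_class1.mean()                     # about 0.6
soft_prediction = int(mean_proba >= 0.5)             # average says class 1

print(hard_prediction, soft_prediction)
```

Two weakly confident classifiers outvote one strongly confident classifier under hard voting, while soft voting lets the confident one dominate. Soft voting needs every base estimator to provide `predict_proba`.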

In [None]:
# BEGIN CODE HERE
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

#Three simple classifiers
cls1 = DecisionTreeClassifier(random_state=RANDOM_STATE,max_depth=3) #Decision Tree Classifier with max depth = 3
cls2 = LogisticRegression(random_state=RANDOM_STATE) # Logistic Regression Classifier
cls3 = KNeighborsClassifier() # K-Nearest Neighbors Classifier

#Soft and Hard Voting Classifiers
soft_vcls = VotingClassifier(estimators = [('dt',cls1),('lr',cls2),('knn',cls3)], voting ='soft', n_jobs= 4) # Soft Voting Classifier
hard_vcls = VotingClassifier(estimators = [('dt',cls1),('lr',cls2),('knn',cls3)], n_jobs= -1) # Hard Voting Classifier: the default voting is "Hard Voting"

#Cross validation: cross_val_score only computes the cross-validation scores and DOES NOT return a fitted model, but we don't need a fitted model at this point.
cv = KFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)
svlcs_scores = [cross_val_score(soft_vcls, X, y, scoring='balanced_accuracy', cv=cv,n_jobs=-1), cross_val_score(soft_vcls, X, y, scoring='f1_weighted', cv=cv,n_jobs=-1)]
hvlcs_scores = [cross_val_score(hard_vcls, X, y,scoring='balanced_accuracy' ,cv=cv, n_jobs=-1), cross_val_score(hard_vcls, X, y,scoring='f1_weighted',cv=cv,n_jobs =-1)]

#F-measure weighted and Balanced Accuracy of Classifiers
s_avg_fmeasure = svlcs_scores[1].mean() # The average f-measure weighted
s_avg_accuracy = svlcs_scores[0].mean() # The average balanced accuracy
h_avg_fmeasure = hvlcs_scores[1].mean() # The average f-measure weighted
h_avg_accuracy = hvlcs_scores[0].mean() # The average balanced accuracy
#END CODE HERE

In [None]:
print("Classifier:")
print(soft_vcls) 
print("F1 Weighted-Score: {} & Balanced Accuracy: {}".format(round(s_avg_fmeasure,4), round(s_avg_accuracy,4)))

Classifier:
VotingClassifier(estimators=[('dt',
                              DecisionTreeClassifier(max_depth=3,
                                                     random_state=42)),
                             ('lr', LogisticRegression(random_state=42)),
                             ('knn', KNeighborsClassifier())],
                 n_jobs=4, voting='soft')
F1 Weighted-Score: 0.8447 & Balanced Accuracy: 0.8369


You should achieve above 82% (Soft Voting Classifier)

In [None]:
print("Classifier:")
print(hard_vcls)
print("F1 Weighted-Score: {} & Balanced Accuracy: {}".format(round(h_avg_fmeasure,4), round(h_avg_accuracy,4)))

Classifier:
VotingClassifier(estimators=[('dt',
                              DecisionTreeClassifier(max_depth=3,
                                                     random_state=42)),
                             ('lr', LogisticRegression(random_state=42)),
                             ('knn', KNeighborsClassifier())],
                 n_jobs=-1)
F1 Weighted-Score: 0.825 & Balanced Accuracy: 0.8167


You should achieve above 80% in both! (Hard Voting Classifier)

### 1.2 Stacking ###
Create a stacking classifier which uses two more complex estimators. Try different simple classifiers (like the ones mentioned before) for the combination of the initial estimators. Report your results in the following cell.

Consider as complex estimators the following:

*   Random Forest
*   SVM
*   Gradient Boosting
*   MLP




In [None]:
# BEGIN CODE HERE
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

#Three complex estimators
cls1 = RandomForestClassifier(max_depth=5,n_jobs=-1,random_state = RANDOM_STATE) # Random Forest Classifier with the default n=100 estimators
#cls2 = GradientBoostingClassifier(n_estimators=50,max_depth=1,random_state = RANDOM_STATE) # Gradient Boosting Classifier
cls2 = SVC(random_state = RANDOM_STATE) #C-Support Vector Classification
cls3 =  MLPClassifier(random_state=RANDOM_STATE) #Multi-Layer Perceptron Classifier

#Stacking Classifier
s_est = LogisticRegression(random_state=RANDOM_STATE,n_jobs=-1)
cls = [('rf',cls1),('svm',cls2),('mlp',cls3)] #the three complex base estimators
scls = StackingClassifier(estimators=cls, final_estimator=s_est) 

#Cross Validation
cv_s = KFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)
scores = [cross_val_score(scls, X, y, scoring='balanced_accuracy', cv=cv_s,n_jobs=-1), cross_val_score(scls, X, y, scoring='f1_weighted', cv=cv_s,n_jobs=-1)]
avg_fmeasure = scores[1].mean() # The average f-measure weighted
avg_accuracy = scores[0].mean() # The average balanced accuracy 
#END CODE HERE



In [None]:
print("Classifier:")
print(scls)
print("F1 Weighted Score: {} & Balanced Accuracy: {}".format(round(avg_fmeasure,4), round(avg_accuracy,4)))

Classifier:
StackingClassifier(estimators=[('rf',
                                RandomForestClassifier(max_depth=5, n_jobs=-1,
                                                       random_state=42)),
                               ('svm', SVC(random_state=42)),
                               ('mlp', MLPClassifier(random_state=42))],
                   final_estimator=LogisticRegression(n_jobs=-1,
                                                      random_state=42))
F1 Weighted Score: 0.8617 & Balanced Accuracy: 0.8557


You should achieve above 85% in both

## 2.0 Randomization ##

**2.1** You are asked to create three ensembles of decision trees where each one uses a different method for producing homogeneous ensembles. Compare them with a simple decision tree classifier and report your results in the dictionaries (dict) below using as key the given name of your classifier and as value the f1_weighted/balanced_accuracy score. The dictionaries should contain four different elements.  

In [None]:
# BEGIN CODE HERE
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

#Simple DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=RANDOM_STATE)

#Random Subspace
ens1 = BaggingClassifier(random_state=RANDOM_STATE,max_features=0.65) #The base estimator is the default DecisionTreeClassifier. Each estimator can access 65% of the features of the dataset.

#Random Patches
ens2 = BaggingClassifier(random_state=RANDOM_STATE, max_features=0.65, max_samples=0.65) #Each estimator can access 65% of the samples and 65% of the features of the set.

#Simple Bagging
ens3 = BaggingClassifier(random_state=RANDOM_STATE)

#10-Fold Cross Validation
cv= KFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)

#Four lists with the balanced_accuracy and f1_weighted scores. One list for every Classifier.
tree_scores = [cross_val_score(tree, X, y, scoring='balanced_accuracy', cv=cv,n_jobs=-1), cross_val_score(tree, X, y, scoring='f1_weighted', cv=cv,n_jobs=-1)]
rs_scores = [cross_val_score(ens1, X, y, scoring='balanced_accuracy', cv=cv,n_jobs=-1), cross_val_score(ens1, X, y, scoring='f1_weighted', cv=cv,n_jobs=-1)]
rp_scores = [cross_val_score(ens2, X, y, scoring='balanced_accuracy', cv=cv,n_jobs=-1), cross_val_score(ens2, X, y, scoring='f1_weighted', cv=cv,n_jobs=-1)]
b_scores = [cross_val_score(ens3, X, y, scoring='balanced_accuracy', cv=cv,n_jobs=-1), cross_val_score(ens3, X, y, scoring='f1_weighted', cv=cv,n_jobs=-1)]

#f_weighted and balanced accuracy dictionaries
f_measures = dict()
accuracies = dict()

f_measures['Simple Decision Tree'] = tree_scores[1].mean() #The average f1_weighted score for the simple DecisionTreeClassifier
accuracies['Simple Decision Tree'] = tree_scores[0].mean() #The average balanced_accuracy for the simple DecisionTreeClassifier

f_measures['Random subspace'] = rs_scores[1].mean() #The average f1_weighted score for the Random subspace ensemble
accuracies['Random subspace'] = rs_scores[0].mean() #The average balanced_accuracy score for the Random subspace ensemble

f_measures['Random Patches'] = rp_scores[1].mean() #The average f1_weighted score for the Random Patches ensemble
accuracies['Random Patches'] = rp_scores[0].mean() #The average balanced_accuracy score for the Random Patches ensemble

f_measures['Bagging'] = b_scores[1].mean() #The average f1_weighted score for the Bagging ensemble
accuracies['Bagging'] = b_scores[0].mean() #The average balanced_accuracy score for the Bagging ensemble
#END CODE HERE

In [None]:
print(ens1)
print(ens2)
print(ens3)
print(tree)
for name,score in f_measures.items():
    print("Classifier:{} -  F1 Weighted:{}".format(name,round(score,4)))
for name,score in accuracies.items():
    print("Classifier:{} -  BalancedAccuracy:{}".format(name,round(score,4)))

BaggingClassifier(max_features=0.65, random_state=42)
BaggingClassifier(max_features=0.65, max_samples=0.65, random_state=42)
BaggingClassifier(random_state=42)
DecisionTreeClassifier(random_state=42)
Classifier:Simple Decision Tree -  F1 Weighted:0.704
Classifier:Random subspace -  F1 Weighted:0.7676
Classifier:Random Patches -  F1 Weighted:0.7767
Classifier:Bagging -  F1 Weighted:0.7789
Classifier:Simple Decision Tree -  BalancedAccuracy:0.695
Classifier:Random subspace -  BalancedAccuracy:0.7628
Classifier:Random Patches -  BalancedAccuracy:0.7718
Classifier:Bagging -  BalancedAccuracy:0.7732


**2.2** Describe your classifiers and your results.

### Classifiers ###
The first classifier is a plain DecisionTreeClassifier whose nodes are expanded with no limit on the maximum depth. The second is a BaggingClassifier that uses the default base estimator (a DecisionTreeClassifier) with max_features set to 0.65, so each base estimator sees a random 65% of the features. This follows the Random Subspaces technique, in which every estimator is trained on a random subset of the features. Because bootstrap keeps its default value (True), each estimator is also trained on a bootstrap sample of the rows. The third is also a BaggingClassifier, with both max_features and max_samples set to 0.65, so each estimator is trained on a random patch of the data: 65% of the samples and 65% of the features. The fourth is a plain BaggingClassifier with default settings, where each estimator is trained on a bootstrap sample of the rows and on all of the features.
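
The sampling schemes can be compared side by side with a short sketch. The configuration names follow the terminology of the scikit-learn user guide, and the synthetic data and the names `X_demo`, `y_demo`, `configs` and `summary` are assumptions for illustration. Unlike the cell in 2.1, the feature-sampling variants here set `bootstrap=False`, so rows are not resampled with replacement.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic stand-in for the assignment data (assumption, not the real dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)

configs = {
    # Rows drawn with replacement, all features.
    "Bagging": dict(bootstrap=True),
    # Rows drawn without replacement, all features.
    "Pasting": dict(bootstrap=False, max_samples=0.65),
    # Every row, random subset of the features.
    "Random Subspaces": dict(bootstrap=False, max_features=0.65),
    # Random subset of the rows and of the features.
    "Random Patches": dict(bootstrap=False, max_samples=0.65, max_features=0.65),
}

summary = {}
for name, params in configs.items():
    model = BaggingClassifier(n_estimators=10, random_state=42, **params)
    model.fit(X_demo, y_demo)
    # Inspect what the first base estimator was actually trained on.
    n_rows = len(set(model.estimators_samples_[0]))
    n_features = len(model.estimators_features_[0])
    summary[name] = (n_rows, n_features)
    print(f"{name}: {n_rows} distinct rows, {n_features} features")
```

The `estimators_samples_` and `estimators_features_` attributes make it possible to check which rows and columns each base estimator received, which is a quick way to confirm that a configuration does what its name suggests.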

### Results ###
After 10-fold cross validation, the f1_weighted and balanced_accuracy scores of all three bagging ensembles are higher than those of the single DecisionTreeClassifier (about 0.76-0.78 versus about 0.70; the detailed results are in 2.1). Among the ensembles, plain Bagging scores slightly higher than Random Patches, which in turn scores slightly higher than Random Subspaces. This is expected: an unpruned decision tree has high variance, and bagging reduces variance by combining many trees that are each trained on a different random part of the original data.


**2.3** Increasing the number of estimators in a bagging classifier can drastically increase the training time of a classifier. Is there any solution to this problem? Can the same solution be applied to boosting classifiers?

Yes: we can set the n_jobs parameter of the bagging classifier. It sets the number of jobs that run in parallel for both fitting and predicting, and n_jobs=-1 uses all available processors. This works because the base estimators of a bagging ensemble are independent of each other. The same solution cannot be applied to boosting classifiers. Their models are trained sequentially (one after the other), and each model tries to correct the mistakes of the ones trained before it. Because each model depends on the previous ones, they cannot be trained in parallel.
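
A small sketch makes the difference visible in scikit-learn's API. The synthetic data and the variable names are assumptions for illustration; AdaBoostClassifier stands in for boosting in general.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

# Synthetic stand-in for the assignment data (assumption, not the real dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

# Bagging: the trees are independent, so they can be fitted in parallel jobs.
bagging = BaggingClassifier(n_estimators=20, n_jobs=2, random_state=42)
bagging.fit(X_demo, y_demo)

# Boosting: each stage depends on the previous one, so it is fitted sequentially.
boosting = AdaBoostClassifier(n_estimators=20, random_state=42)
boosting.fit(X_demo, y_demo)

# Only the bagging ensemble exposes an n_jobs parameter.
bagging_has_n_jobs = "n_jobs" in bagging.get_params(deep=False)
boosting_has_n_jobs = "n_jobs" in boosting.get_params(deep=False)
print(bagging_has_n_jobs, boosting_has_n_jobs)
```

The check on `get_params` only shows which constructor arguments exist; how much speed-up `n_jobs` gives in practice depends on the number of cores and the size of the data.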

## 3.0 Creating the best classifier ##

**3.1** In this part of the assignment you are asked to train the best possible ensemble! Describe the process you followed to achieve this result. How did you choose your classifier and your parameters and why. Report the f-measure (weighted) & balanced accuracy (10-fold cross validation) of your final classifier and results of classifiers you tried in the cell following the code. Can you achieve a balanced accuracy over 83-84%?

In [None]:
# BEGIN CODE HERE
from sklearn.ensemble import VotingClassifier,RandomForestClassifier,GradientBoostingClassifier,StackingClassifier,AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold,cross_val_score
from sklearn import svm

#Three complex estimators
#cls1 = RandomForestClassifier(max_depth=1,n_jobs=-1,random_state = RANDOM_STATE) # Random Forest Classifier with the default n=100 estimators
#cls2 = SVC(random_state = RANDOM_STATE) #C-Support Vector Classification
#cls3 =  MLPClassifier(random_state=RANDOM_STATE) #Multi-Layer Perceptron Classifier
#adb = AdaBoostClassifier(cls1,n_estimators=1000,random_state=RANDOM_STATE)

#Stacking Classifier
#s_est = LogisticRegression(random_state=RANDOM_STATE,n_jobs=-1)
#cls = [('rf',cls1),('svm',cls2),('mlp',cls3)] #the two simple estimators
best_cls = scls

#Cross Validation
#cv_s = KFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)
#scores = [cross_val_score(best_cls, X, y, scoring='balanced_accuracy', cv=cv_s,n_jobs=-1), cross_val_score(best_cls, X, y, scoring='f1_weighted', cv=cv_s,n_jobs=-1)]
best_fmeasure = scores[1].mean() # The average f-measure weighted
best_accuracy = scores[0].mean() # The average balanced accuracy 
#END CODE HERE

In [None]:
print("Classifier:")
#print(best_cls)  
print("F1 Weighted-Score:{} & Balanced Accuracy:{}".format(best_fmeasure, best_accuracy))

Classifier:
F1 Weighted-Score:0.8616744215730234 & Balanced Accuracy:0.8557083558436347


**3.2** Describe the process you followed to achieve this result. How did you choose your classifier and your parameters and why. Report the f-measure & accuracy (10-fold cross validation) of your final classifier and results of classifiers you tried in the cell following the code.

We want a strong, complex final estimator, even if it takes some time to run. The idea was to build a complex model without running large grid searches for the best parameters. The stacking estimator combines a RandomForestClassifier, an SVC, and an MLPClassifier, with Logistic Regression as the meta-estimator. Throughout the selection process the focus was on the choice and combination of models rather than on parameter tuning; the 'optimal' parameters could be found afterwards with grid searches. 
- The scores of the final stacking estimator are:
F1 Weighted Score: 0.8617 & Balanced Accuracy: 0.8557

The scores of other models were the following:
- The stacking classifier from 3.1 was combined with the voting classifier from 1.1 (as they were the best estimators so far). The final estimator combined those two, again through stacking. The run time was VERY high. The results are: 
 F1 Weighted-Score:0.8593 & Balanced Accuracy:0.8529

- The stacking classifier described above was combined with AdaBoost through a voting classifier, but the results were around 75% for both metrics.

- The stacking classifier of complex models was combined with the voting classifier of simple models from 1.1. The final estimator was again a voting classifier that combined those two, but the results were not as good as those of the stacking classifier on its own:
F1 Weighted-Score:0.8463 & Balanced Accuracy:0.8460

- AdaBoost with a random forest as its base estimator. The results were lower than those of the stacking classifier:
F1 Weighted-Score:0.8437 & Balanced Accuracy:0.8359

- AdaBoost with gradient boosting as its base estimator (to see whether two boosting classifiers can be combined in that way). It took a very long time to run, so it is probably not a practical idea.

- The usual stacking classifier with the complex models, was combined with an adaboost classifier that had as its base the random forest, through a voting classifier. The results were:
F1 Weighted-Score:0.8498 & Balanced Accuracy:0.8483

In general, we can combine a lot of simple and complex models and techniques together to find the best ensemble.
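
The grid search mentioned above was not run for this assignment. As a rough sketch of how it could look for a stacking ensemble, the snippet below tunes a few nested parameters on synthetic data. The data, the reduced two-estimator stack, the small grid, and all variable names are assumptions for illustration and say nothing about which values would work best on the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the assignment data (assumption, not the real dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=20, random_state=42)),
        ("svm", SVC(random_state=42)),
    ],
    final_estimator=LogisticRegression(random_state=42),
    cv=3,  # internal folds used to build the meta-features
)

# Nested parameters are addressed as "<estimator name>__<parameter>".
param_grid = {
    "rf__max_depth": [3, 5],
    "svm__C": [0.1, 1.0],
    "final_estimator__C": [0.1, 1.0],
}

search = GridSearchCV(stack, param_grid, scoring="balanced_accuracy", cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 4))
```

Every extra value in the grid multiplies the number of stacking fits, so on the full 4096-feature dataset a grid like this would need to stay small or be replaced by a randomized search.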


**3.3** Create a classifier that is going to be used in production - in a live system. Use the *test_set_noclass.csv* to make predictions. Store the predictions in a list.  

In [None]:
# BEGIN CODE HERE
from sklearn.ensemble import BaggingClassifier,RandomForestClassifier,GradientBoostingClassifier,StackingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import urllib.request
from sklearn.svm import SVC

#Creating the test set
url_test = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/test_set_noclass.csv'
filename_test = 'test_set_noclass.csv'
urllib.request.urlretrieve(url_test, filename_test)
test_set = pd.read_csv("test_set_noclass.csv")

#Three complex estimators
#cls1 = RandomForestClassifier(max_depth=1,n_jobs=-1,random_state = RANDOM_STATE) # Random Forest Classifier with the default n=100 estimators
#cls2 = GradientBoostingClassifier(n_estimators=50,max_depth=1,random_state = RANDOM_STATE) # Gradient Boosting Classifier
#cls2 = SVC(random_state = RANDOM_STATE)
#cls3 =  MLPClassifier(random_state=RANDOM_STATE) #Multi-Layer Perceptron Classifier

#Stacking Classifier
#s_est = LogisticRegression(random_state=RANDOM_STATE,n_jobs=-1)
#cls = [('rf',cls1),('svm',cls2),('mlp',cls3)] #the two simple estimators
predictionsclassifier = scls

#10-Fold Cross Validation
#cv= KFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)
#scores = [cross_val_score(predictionsclassifier, X, y, scoring='balanced_accuracy', cv=cv,n_jobs=-1), cross_val_score(predictionsclassifier, X, y, scoring='f1_weighted', cv=cv,n_jobs=-1)]
predictionsclassifier.fit(X,y)
fmeasureb = scores[1].mean() # The average f-measure weighted
accuracyb = scores[0].mean() # The average balanced accuracy 
predictions = predictionsclassifier.predict(test_set)
print("F1 Weighted-Score:{} & Balanced Accuracy:{}".format(fmeasureb, accuracyb))
print(predictions)
#END CODE HERE

F1 Weighted-Score:0.8616744215730234 & Balanced Accuracy:0.8557083558436347
[1 1 1 ... 1 1 1]


The main idea behind this classifier is to avoid overfitting, since it will make predictions on a test set it has never seen and we want those predictions to be as accurate as possible. Ensemble methods are a sensible choice here because they reduce variance.

#### The following cell will not be executed. The test_set.csv with the classes will be made available after the deadline and this cell is for testing purposes!!! Do not modify it! ####

In [None]:
if False:
  from sklearn.metrics import f1_score, balanced_accuracy_score
  final_test_set = pd.read_csv('test_set.csv')
  ground_truth = final_test_set['CLASS']
  print("Balanced Accuracy: {}".format(balanced_accuracy_score(predictions, ground_truth)))
  print("F1 Weighted-Score: {}".format(f1_score(predictions, ground_truth, average='weighted')))

Both should aim above 85%!