## About iPython Notebooks ##

iPython Notebooks are interactive coding environments embedded in a webpage. You will be using iPython notebooks in this class. Make sure you fill in every area between `# BEGIN CODE HERE` and `#END CODE HERE`. After writing your code, you can run the cell by either pressing "SHIFT"+"ENTER" or by clicking on "Run" (denoted by a play symbol). Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

 **What you need to remember:**

- Run your cells using SHIFT+ENTER (or "Run cell")
- Write code in the designated areas using Python 3 only
- Do not modify the code outside of the designated areas
- In some cases you will also need to explain the results. There will also be designated areas for that. 

Fill in your **NAME** and **AEM** below:

In [1]:
NAME = "Kaparinos Nikos"
AEM = "9245"

---

# Assignment 3 - Ensemble Methods #

Welcome to your third assignment. This exercise will test your understanding of ensemble methods.

In [2]:
# Always run this cell
import numpy as np
import pandas as pd

# USE THE FOLLOWING RANDOM STATE FOR YOUR CODE
RANDOM_STATE = 42

## Download the Dataset ##
Download the dataset using the following cell or from this [link](https://github.com/sakrifor/public/tree/master/machine_learning_course/EnsembleDataset) and put the files in the same folder as the .ipynb file. 
In this assignment you are going to work with a dataset that originates from [ImageCLEFmed: The Medical Task 2016](https://www.imageclef.org/2016/medical) and its **Compound figure detection** subtask. The goal of this subtask is to identify whether a figure is a compound figure (one image that consists of more than one figure) or not. The train dataset consists of 4197 examples/figures, and each figure has 4096 features which were extracted using a deep neural network. The *CLASS* column represents the class of each example, where 1 is a compound figure and 0 is not.


In [3]:
import urllib.request
url_train = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/train_set.csv'
filename_train = 'train_set.csv'
urllib.request.urlretrieve(url_train, filename_train)
url_test = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/test_set_noclass.csv'
filename_test = 'test_set_noclass.csv'
urllib.request.urlretrieve(url_test, filename_test)

('test_set_noclass.csv', <http.client.HTTPMessage at 0x1497c8db3d0>)

In [4]:
# Run this cell to load the data
train_set = pd.read_csv("train_set.csv").sample(frac=1).reset_index(drop=True)
train_set.head()
X = train_set.drop(columns=['CLASS'])
y = train_set['CLASS'].values

## 1.0 Testing different ensemble methods ##
In this part of the assignment you are asked to create and test different ensemble methods using the train_set.csv dataset. You should use **10-fold cross validation** for your tests and report the average f-measure and accuracy of your models.

### !!! Use n_jobs=-1 wherever possible to use all the cores of the machine when running your tests ###

### 1.1 Voting ###
Create a voting classifier which uses three estimators/classifiers. Test both soft and hard voting and choose the best one.

In [7]:
# BEGIN CODE HERE
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import time

# Options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
start = time.perf_counter()
# print(X.info())

# Classifiers
cls1 = MLPClassifier(hidden_layer_sizes=(50), random_state=RANDOM_STATE)
cls2 = MLPClassifier(hidden_layer_sizes=(50, 50), random_state=RANDOM_STATE)
cls3 = MLPClassifier(hidden_layer_sizes=(50, 50, 50), random_state=RANDOM_STATE)
vcls = VotingClassifier(estimators=[
    ('cls1', cls1), ('cls2', cls2), ('cls3', cls3)],
    voting='hard')

# Pipeline
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('vcls', vcls)])
print(pipe)

# Cross validate
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)
cv_results = cross_validate(pipe, X, y, cv=cv, n_jobs=8, verbose=1, scoring=['accuracy', 'f1'])
cv_results = pd.DataFrame(cv_results)
print(cv_results)

# Average results
avg_fmeasure = cv_results['test_f1'].mean()  # The average f-measure
avg_accuracy = cv_results['test_accuracy'].mean()  # The average accuracy

# Execution Time
end = time.perf_counter()
print(f"\nExecution time = {end - start:.2f} second(s)")
#END CODE HERE

Pipeline(steps=[('scale', StandardScaler()),
                ('vcls',
                 VotingClassifier(estimators=[('cls1',
                                               MLPClassifier(hidden_layer_sizes=50,
                                                             random_state=42)),
                                              ('cls2',
                                               MLPClassifier(hidden_layer_sizes=(50,
                                                                                 50),
                                                             random_state=42)),
                                              ('cls3',
                                               MLPClassifier(hidden_layer_sizes=(50,
                                                                                 50,
                                                                                 50),
                                                             random_state=42))]))])


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:  2.0min remaining:  1.3min
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:  2.7min finished


     fit_time  score_time  test_accuracy   test_f1
0  117.529907    0.083073       0.857143  0.877551
1  116.957398    0.112101       0.850000  0.876712
2  117.919256    0.081074       0.871429  0.892430
3  117.214613    0.095087       0.861905  0.885375
4  115.064306    0.141777       0.864286  0.887129
5  115.297407    0.128049       0.852381  0.875000
6  113.735050    0.135124       0.849642  0.873239
7  113.190559    0.132122       0.854415  0.872651
8   47.387132    0.063108       0.835322  0.857143
9   46.247769    0.070063       0.873508  0.895464

Execution time = 161.27 second(s)


In [6]:
print("Classifier:")
print(vcls)
print("F1-Score:{} & Accuracy:{}".format(avg_fmeasure,avg_accuracy))

Classifier:
VotingClassifier(estimators=[('cls1', SVC(gamma=1, random_state=0)),
                             ('cls2',
                              MLPClassifier(hidden_layer_sizes=(50, 50),
                                            random_state=42)),
                             ('cls3', DecisionTreeClassifier(random_state=0))])
F1-Score:0.7866967837254234 & Accuracy:0.8386478884439328


### 1.2 Stacking ###
Create a stacking classifier which uses two estimators/classifiers. Try different classifiers for the combination of the initial classifiers. Report your results in the following cell.

In [8]:
# BEGIN CODE HERE

cls1 = "" # Classifier #1 
cls2 = "" # Classifier #2 
scls = "" # Stacking Classifier
avg_fmeasure = 0 # The average f-measure
avg_accuracy = 0 # The average accuracy

#END CODE HERE

In [None]:
print("Classifier:")
print(scls)
print("F1-Score:{} & Accuracy:{}".format(avg_fmeasure,avg_accuracy))

### 1.3 Report the results ###  
Report the results of your experiments in the following cell. How did you choose your initial classifiers? 

***
<font size="4">**1.1 Voting**</font>


Both homogeneous and heterogeneous voting classifiers were tested. The goal when using ensemble techniques such as voting is to combine models that behave differently, meaning that they make correct predictions on different subsets of the dataset. Combining such models will then probably improve overall performance. In addition, every model on its own should be as accurate as possible, provided the models still behave differently.

To build models that behave differently, two approaches were followed: using heterogeneous models, and using homogeneous models with varying hyperparameter values.

<font size="3">**Homogeneous**</font>
* Decision Tree Classifiers:
    Decision Trees with different maximum depths were used.<br>
    
    Classifiers Used:
        1) DecisionTreeClassifier(random_state=RANDOM_STATE)
        2) DecisionTreeClassifier(max_depth=10, random_state=RANDOM_STATE)
        3) DecisionTreeClassifier(max_depth=5, random_state=RANDOM_STATE)


   

| Voting Type | F1 Score | Accuracy |
| --- | --- | --- |
| Soft | 0.7214  | 0.7670 |
| Hard | 0.7221 | 0.7670 |

* Support Vector Machine Classifiers:
    SVMs with different gamma values were used.<br>
    
    Classifiers Used:
        1) SVC(gamma=1, kernel='rbf', random_state=RANDOM_STATE)
        2) SVC(gamma=0.5, kernel='rbf', random_state=RANDOM_STATE)
        3) SVC(gamma=1.5, kernel='rbf', random_state=RANDOM_STATE)
        
| Voting Type | F1 Score | Accuracy |
| --- | --- | --- |
| Soft | 0.5879  | 0.7405 |
| Hard | 0.5879 | 0.7405 |

* Neural Network classifiers:
    Multi-layer perceptrons with different numbers of hidden layers were used.<br>
    
    Classifiers Used:
        1) MLPClassifier(hidden_layer_sizes=(50), random_state=RANDOM_STATE)
        2) MLPClassifier(hidden_layer_sizes=(50, 50), random_state=RANDOM_STATE)
        3) MLPClassifier(hidden_layer_sizes=(50, 50, 50), random_state=RANDOM_STATE)
        
| Voting Type | F1 Score | Accuracy |
| --- | --- | --- |
| Soft | 0.8558  | 0.8780 |
| Hard | 0.8580 | 0.8797 |

<font size="3">**Heterogeneous**</font>
* Heterogeneous model:
     A combination of the classifiers mentioned above was used.<br>
       
       Classifiers Used:
        1) DecisionTreeClassifier(random_state=RANDOM_STATE)
        2) MLPClassifier(hidden_layer_sizes=(50, 50), random_state=RANDOM_STATE)
        3) SVC(gamma=1, random_state=RANDOM_STATE)
        
| Voting Type | F1 Score | Accuracy |
| --- | --- | --- |
| Soft | 0.7753  | 0.8270 |
| Hard | 0.7855 | 0.8371 |

<font size="3">**Voting Report**</font>

In these experiments, hard voting matched or slightly outperformed soft voting in every configuration. Also, using voting as an ensemble technique improved performance compared to single models.
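The hard versus soft comparison above can be sketched as follows. This is a minimal illustration and not the code behind the reported numbers: it runs on a small synthetic dataset from `make_classification` (an assumption, standing in for `train_set.csv`), and the helper name `evaluate_voting` and the three base models are chosen only for the example. Note that soft voting needs `predict_proba`, so `SVC` has to be created with `probability=True`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

RANDOM_STATE = 42

# Synthetic stand-in for the assignment data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, random_state=RANDOM_STATE
)


def evaluate_voting(voting):
    """Return (mean F1, mean accuracy) of a three-model voting ensemble."""
    estimators = [
        ("deep_tree", DecisionTreeClassifier(random_state=RANDOM_STATE)),
        ("shallow_tree", DecisionTreeClassifier(max_depth=5, random_state=RANDOM_STATE)),
        # probability=True makes SVC expose predict_proba, which soft voting needs.
        ("svc", SVC(probability=True, random_state=RANDOM_STATE)),
    ]
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("vote", VotingClassifier(estimators, voting=voting)),
    ])
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)
    scores = cross_validate(pipe, X_demo, y_demo, cv=cv, scoring=["accuracy", "f1"])
    return scores["test_f1"].mean(), scores["test_accuracy"].mean()


voting_results = {voting: evaluate_voting(voting) for voting in ("hard", "soft")}
for voting, (f1, acc) in voting_results.items():
    print(f"{voting}: F1={f1:.4f}, accuracy={acc:.4f}")
```

Using the same `StratifiedKFold` splits for both voting types keeps the comparison fair, since both ensembles are scored on identical folds.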

***
<font size="4">**1.2 Stacking**</font>

The three types of classifier mentioned above were used in pairs for stacking. The meta-estimator is logistic regression. The output that each base estimator passes to the meta-estimator can be either a probability or simply a class prediction. In addition, the input data can be passed through to the meta-estimator alongside the model predictions. Stacking estimators were created for all 4 possible options (probability with passthrough, prediction with passthrough, probability without passthrough, prediction without passthrough).
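The four stacking options can be enumerated roughly as in the sketch below. It assumes scikit-learn's `StackingClassifier`, whose `stack_method` and `passthrough` parameters correspond to the two choices described above, and it uses synthetic data rather than the assignment dataset. The helper `build_stack` and the decision tree plus SVM pairing are illustrative only.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

RANDOM_STATE = 42

# Synthetic stand-in for the assignment data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, random_state=RANDOM_STATE
)


def build_stack(stack_method, passthrough):
    """Build a two-model stack with a logistic regression meta-estimator."""
    base_models = [
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=RANDOM_STATE)),
        # probability=True is only needed when stack_method="predict_proba".
        ("svc", SVC(probability=True, random_state=RANDOM_STATE)),
    ]
    return StackingClassifier(
        estimators=base_models,
        final_estimator=LogisticRegression(max_iter=1000),
        stack_method=stack_method,  # "predict_proba" or "predict"
        passthrough=passthrough,    # True: meta-estimator also sees the raw features
    )


cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)
stacking_results = {}
for stack_method, passthrough in product(("predict_proba", "predict"), (True, False)):
    scores = cross_validate(
        build_stack(stack_method, passthrough),
        X_demo, y_demo, cv=cv, scoring=["accuracy", "f1"],
    )
    stacking_results[(stack_method, passthrough)] = (
        scores["test_f1"].mean(),
        scores["test_accuracy"].mean(),
    )

for (stack_method, passthrough), (f1, acc) in stacking_results.items():
    print(f"{stack_method:>13} passthrough={passthrough}: F1={f1:.4f}, accuracy={acc:.4f}")
```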

<font size="3">**Stacking Results**</font>

<font size="2">**Decision Tree and Support Vector Machine**</font>

Accuracy:

|  | Probability | Prediction |
| --- | --- | --- |
| Passthrough |0.8291  |0.8276 |
| Not Passthrough |0.7223 |0.7183  |

F1 Score:


|  | Probability | Prediction |
| --- | --- | --- |
| Passthrough |0.8545  |0.8536 |
| Not Passthrough |0.7689 |0.7632  |

<font size="2">**Decision Tree and Multi-Layer Perceptron**</font>

Accuracy:


|  | Probability | Prediction |
| --- | --- | --- |
| Passthrough |0.8296  |0.8329 |
| Not Passthrough |0.8560 | 0.8551 |

F1 Score:


|  | Probability | Prediction |
| --- | --- | --- |
| Passthrough |0.8554  |0.8580 |
| Not Passthrough |0.8789 | 0.8775 |

<font size="2">**Support Vector Machine and Multi-Layer Perceptron**</font>

Accuracy:


|  | Probability | Prediction |
| --- | --- | --- |
| Passthrough |0.8269 |0.8310 |
| Not Passthrough |0.8522 |0.8548  |

F1 Score:


|  | Probability | Prediction |
| --- | --- | --- |
| Passthrough |0.8524 |0.8563 |
| Not Passthrough |0.8747 | 0.8769 |

<font size="3">**Stacking Report**</font>

Stacking, like voting before, improved performance compared to single models. Neither passing the input data through nor the type of output taken from the base models had a consistent effect: passthrough helped the decision tree and SVM pair, while the two pairs that include the multi-layer perceptron scored higher without it. The optimal stacking configuration therefore depends on the combination of base models.

## 2.0 Randomization ##

**2.1** You are asked to create three ensembles of decision trees where each one uses a different method for producing homogeneous ensembles. Compare them with a simple decision tree classifier and report your results in the dictionaries (dict) below using as key the given name of your classifier and as value the f1/accuracy score. The dictionaries should contain four different elements.  

In [None]:
# BEGIN CODE HERE
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier
ens1 = ""
ens2 = ""
ens3 = ""
tree = ""

f_measures = dict()
accuracies = dict()
# Example f_measures = {'Simple Decision':0.8551, 'Ensemble with random ...': 0.92, ...}


#END CODE HERE

In [None]:
print(ens1)
print(ens2)
print(ens3)
print(tree)
for name,score in f_measures.items():
    print("Classifier:{} -  F1:{}".format(name,score))
for name,score in accuracies.items():
    print("Classifier:{} -  Accuracy:{}".format(name,score))

**2.2** Describe your classifiers and your results.

***
<font size="3">**Randomization Report**</font>

The three ensembles that were used are:

Random Forest<br>
Bagging<br>
Ada Boost<br>

These ensembles were compared to a standard decision tree classifier. A grid search based on accuracy was conducted for each classifier in order to tune its hyper-parameters. All three ensembles provided substantial performance gains over the simple decision tree. Their performance was very similar, although the AdaBoost ensemble slightly outperformed the other two in both accuracy and F1.
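A comparison of this kind could be set up as in the following sketch. It is not the tuned configuration from the report: the hyper-parameters are arbitrary defaults, the data is synthetic (an assumption), and the dictionary keys are only example names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

RANDOM_STATE = 42

# Synthetic stand-in for the assignment data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, random_state=RANDOM_STATE
)

# Untuned example settings; the report's grid-searched values are not reproduced here.
models = {
    "Simple Decision Tree": DecisionTreeClassifier(random_state=RANDOM_STATE),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=RANDOM_STATE),
    "Bagging": BaggingClassifier(
        DecisionTreeClassifier(random_state=RANDOM_STATE),
        n_estimators=50,
        random_state=RANDOM_STATE,
    ),
    "AdaBoost": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1, random_state=RANDOM_STATE),
        n_estimators=50,
        random_state=RANDOM_STATE,
    ),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)
f_measures_demo, accuracies_demo = {}, {}
for name, model in models.items():
    scores = cross_validate(model, X_demo, y_demo, cv=cv, scoring=["accuracy", "f1"])
    f_measures_demo[name] = scores["test_f1"].mean()
    accuracies_demo[name] = scores["test_accuracy"].mean()

for name in models:
    print(f"{name}: F1={f_measures_demo[name]:.4f}, accuracy={accuracies_demo[name]:.4f}")
```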


**2.3** Increasing the number of estimators in a bagging classifier can drastically increase the training time of a classifier. Is there any solution to this problem? Can the same solution be applied to boosting classifiers?

A bagging classifier can be trained in a parallel or distributed environment, because each estimator is fitted independently on its own bootstrap sample. In theory this can reduce training time substantially. The same approach cannot be applied to boosting classifiers, because the models are trained sequentially: each one needs the output of the previous model.
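In scikit-learn this difference shows up directly in the estimator parameters, as the short sketch below illustrates on synthetic data (an assumption). `BaggingClassifier` accepts `n_jobs`, while `AdaBoostClassifier` has no such parameter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

RANDOM_STATE = 42

# Synthetic stand-in for the assignment data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, random_state=RANDOM_STATE
)

# Bagging: the estimators are independent, so they can be fitted on all cores.
bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=RANDOM_STATE),
    n_estimators=100,
    n_jobs=-1,
    random_state=RANDOM_STATE,
)
bagging.fit(X_demo, y_demo)

# Boosting: each round reweights the samples based on the previous round,
# so the rounds run one after another.
boosting = AdaBoostClassifier(n_estimators=100, random_state=RANDOM_STATE)
boosting.fit(X_demo, y_demo)

bagging_has_n_jobs = "n_jobs" in bagging.get_params()
boosting_has_n_jobs = "n_jobs" in boosting.get_params()
print(f"Bagging exposes n_jobs: {bagging_has_n_jobs}")
print(f"AdaBoost exposes n_jobs: {boosting_has_n_jobs}")
```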

## 3.0 Creating the best classifier ##

**3.1** In this part of the assignment you are asked to train the best possible ensemble! Describe the process you followed to achieve this result. How did you choose your classifier and your parameters, and why? Report the f-measure & accuracy (10-fold cross validation) of your final classifier and the results of the classifiers you tried in the cell following the code. Can you achieve an accuracy over 83-84%?

In [None]:
# BEGIN CODE HERE
best_cls = ""

best_fmeasure = ""
best_accuracy = ""


#END CODE HERE

In [None]:
print("Classifier:")
print(best_cls)
print("F1-Score:{} & Accuracy:{}".format(best_fmeasure,best_accuracy))

**3.2** Describe the process you followed to achieve this result. How did you choose your classifier and your parameters, and why? Report the f-measure & accuracy (10-fold cross validation) of your final classifier and the results of the classifiers you tried in the cell following the code.

***
<font size="3">**Approach**</font>

The approach was to tune the hyper-parameters of various models and then use them as base models in voting and stacking ensembles. Models with good performance will most likely be the best candidates for base models, so an extensive grid search based on accuracy was conducted for each model. However, since base models that behave differently usually provide the best results when ensembled, experiments should also be made with sub-optimal models. These sub-optimal models can either have different hyper-parameter values than the best models, or they can simply be different types of classifiers.
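The tune-then-ensemble workflow described above might look like the sketch below. The parameter grids, the choice of three base models, and the synthetic data are all assumptions made for the example, not the search that was actually run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

RANDOM_STATE = 42

# Synthetic stand-in for the assignment data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, random_state=RANDOM_STATE
)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)

# Hypothetical, deliberately small grids; a real search would be much wider.
search_spaces = {
    "tree": (DecisionTreeClassifier(random_state=RANDOM_STATE),
             {"max_depth": [3, 5, None]}),
    "rbf_svc": (SVC(kernel="rbf", random_state=RANDOM_STATE),
                {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}),
    "linear_svc": (SVC(kernel="linear", random_state=RANDOM_STATE),
                   {"C": [0.1, 1]}),
}

# Step 1: tune every base model separately, selecting on accuracy.
tuned = []
for name, (model, grid) in search_spaces.items():
    search = GridSearchCV(model, grid, scoring="accuracy", cv=cv)
    search.fit(X_demo, y_demo)
    tuned.append((name, search.best_estimator_))
    print(f"{name}: best params={search.best_params_}, accuracy={search.best_score_:.4f}")

# Step 2: combine the tuned models in a hard-voting ensemble and score it.
ensemble = VotingClassifier(tuned, voting="hard")
ensemble_accuracy = cross_val_score(
    ensemble, X_demo, y_demo, cv=cv, scoring="accuracy"
).mean()
print(f"Voting ensemble accuracy: {ensemble_accuracy:.4f}")
```

Because the hyper-parameters are selected on the same folds that are later used to score the ensemble, the final estimate is somewhat optimistic. Nested cross validation would give a cleaner number at a higher computational cost.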

<font size="3">**Results**</font>

**3.3** Create a classifier that is going to be used in production - in a live system. Use the *test_set_noclass.csv* to make predictions. Store the predictions in a list.  
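The general pattern for this step is to fit the chosen model on all labelled data and then call `predict` on the unlabelled rows, as in the sketch below. It uses synthetic data and a random forest purely as placeholders (assumptions). With the real files, the unlabelled features would presumably come from `pd.read_csv('test_set_noclass.csv')`, with the same column order as the training features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42

# Synthetic stand-in: 240 labelled rows for training and 60 rows treated as unlabelled.
X_all, y_all = make_classification(
    n_samples=300, n_features=20, random_state=RANDOM_STATE
)
X_train_demo, X_unlabelled_demo, y_train_demo, _ = train_test_split(
    X_all, y_all, test_size=0.2, random_state=RANDOM_STATE
)

# Placeholder model; in the assignment this would be the best ensemble found above.
final_cls = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE)
final_cls.fit(X_train_demo, y_train_demo)  # fit on every labelled example

# predict returns a NumPy array; tolist() converts it to a plain Python list.
predictions_demo = final_cls.predict(X_unlabelled_demo).tolist()
print(predictions_demo[:10])
```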

In [None]:
# BEGIN CODE HERE
cls = ""
predictions = []

#END CODE HERE

In [None]:
print(cls)
print(predictions)

LEAVE HERE ANY COMMENTS ABOUT YOUR CLASSIFIER

#### The following cell will not be executed. The test_set.csv with the classes will be made available after the deadline and this cell is for testing purposes!!! Do not modify it! ###

In [None]:
from sklearn.metrics import f1_score,accuracy_score
final_test_set = pd.read_csv('test_set.csv')
ground_truth = final_test_set['CLASS']
print("Accuracy:{}".format(accuracy_score(predictions,ground_truth)))
print("F1-Score:{}".format(f1_score(predictions,ground_truth)))