## About iPython Notebooks ##

iPython Notebooks are interactive coding environments embedded in a webpage. You will be using iPython notebooks in this class. Make sure you fill in any place that says `# BEGIN CODE HERE #END CODE HERE`. After writing your code, you can run the cell by either pressing "SHIFT"+"ENTER" or by clicking on "Run" (denoted by a play symbol). Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

 **What you need to remember:**

- Run your cells using SHIFT+ENTER (or "Run cell")
- Write code in the designated areas using Python 3 only
- Do not modify the code outside of the designated areas
- In some cases you will also need to explain the results. There will also be designated areas for that.

Fill in your **NAME** and **AEM** below:

In [None]:
NAME = "Βασίλειος Παπαστέργιος"
AEM = "3651"

---

# Assignment 3 - Ensemble Methods #

Welcome to your third assignment. This exercise will test your understanding of Ensemble Methods.

In [None]:
# Always run this cell
import numpy as np
import pandas as pd

# USE THE FOLLOWING RANDOM STATE FOR YOUR CODE
RANDOM_STATE = 42

## Download the Dataset ##
Download the dataset using the following cell or from this [link](https://github.com/sakrifor/public/tree/master/machine_learning_course/EnsembleDataset) and put the files in the same folder as the .ipynb file.
In this assignment you are going to work with a dataset originating from [ImageCLEFmed: The Medical Task 2016](https://www.imageclef.org/2016/medical) and its **Compound figure detection** subtask. The goal of this subtask is to identify whether a figure is a compound figure (one image consisting of more than one figure) or not. The train dataset consists of 4197 examples/figures, and each figure has 4096 features which were extracted using a deep neural network. The *CLASS* column represents the class of each example, where 1 is a compound figure and 0 is not.


In [None]:
import urllib.request
url_train = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/train_set.csv'
filename_train = 'train_set.csv'
urllib.request.urlretrieve(url_train, filename_train)
url_test = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/test_set_noclass.csv'
filename_test = 'test_set_noclass.csv'
urllib.request.urlretrieve(url_test, filename_test)

('test_set_noclass.csv', <http.client.HTTPMessage at 0x7f11ec90c820>)

In [None]:
# Run this cell to load the data
train_set = pd.read_csv("train_set.csv").sample(frac=1).reset_index(drop=True)
train_set.head()
X = train_set.drop(columns=['CLASS'])
y = train_set['CLASS'].values

In [None]:
!pip install -U imbalanced-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


The following code will reduce the number of instances, dealing with the small imbalance of the dataset, as well as reducing the size of the dataset!

In [None]:
from collections import Counter
from imblearn.under_sampling import NeighbourhoodCleaningRule, RandomUnderSampler

ncr = NeighbourhoodCleaningRule()
X_res, y_res = ncr.fit_resample(X, y)
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_res, y_res)
print('Resampled dataset shape %s' % Counter(y_res))
X = X_res
y = y_res

Resampled dataset shape Counter({0: 1687, 1: 1687})


## 1.0 Testing different ensemble methods ##
In this part of the assignment you are asked to create and test different ensemble methods using the train_set.csv dataset. You should use **5-fold cross validation** for your tests and report the average weighted f-measure and balanced accuracy of your models. You can use [cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) and select both metrics to be measured during the evaluation.

### !!! Use n_jobs=-1 wherever possible to use all the cores of a machine for running your tests ###

### 1.1 Voting ###
Create a voting classifier which uses two **simple** estimators/classifiers. Test both soft and hard voting and report the results. Consider as simple estimators the following:


*   Decision Trees
*   Linear Models
*   KNN Models  
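The difference between hard and soft voting can be illustrated with a minimal sketch. The probabilities below are made-up numbers for a single sample and three hypothetical classifiers; they are not taken from the dataset or from any model in this notebook.

```python
import numpy as np

# Hypothetical predicted probabilities for ONE sample from three classifiers.
# Columns are P(class 0), P(class 1); the numbers are invented for illustration.
probas = np.array([
    [0.10, 0.90],  # classifier A: confident in class 1
    [0.60, 0.40],  # classifier B: leans towards class 0
    [0.60, 0.40],  # classifier C: leans towards class 0
])

# Hard voting: each classifier casts one vote for its most likely class,
# and the class with the most votes wins.
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=2).argmax()

# Soft voting: average the probabilities first, then pick the most likely class.
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print("hard voting predicts class", hard_prediction)
print("soft voting predicts class", soft_prediction, "with mean probabilities", mean_proba)
```

In this toy case the two schemes disagree: two of the three votes go to class 0, while the averaged probabilities favour class 1 because classifier A is far more confident than the other two.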

In [None]:
### BEGIN SOLUTION
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import  VotingClassifier
from sklearn.model_selection import cross_validate

# USE RANDOM STATE!
cls1 = KNeighborsClassifier() # Classifier #1
cls2 = LogisticRegression(random_state=RANDOM_STATE, solver='sag', tol=0.2) # Classifier #2

soft_vcls = VotingClassifier(estimators=[
            ('knn', cls1), ('decision_tree', cls2)], voting = 'soft')  # Soft Voting Classifier
hard_vcls =  VotingClassifier(estimators=[
            ('knn', cls1), ('decision_tree', cls2)], voting = 'hard') # Hard Voting Classifier

svlcs_scores = cross_validate(estimator = soft_vcls, X = X, y = y, scoring = ['f1_weighted', 'balanced_accuracy'], cv = 5, n_jobs = -1)
s_avg_fmeasure = np.average(svlcs_scores['test_f1_weighted']) # The average f-measure
s_avg_accuracy = np.average(svlcs_scores['test_balanced_accuracy']) # The average accuracy

hvlcs_scores = cross_validate(estimator = hard_vcls, X = X, y = y, scoring = ['f1_weighted', 'balanced_accuracy'], cv = 5, n_jobs = -1)
h_avg_fmeasure = np.average(hvlcs_scores['test_f1_weighted']) # The average f-measure
h_avg_accuracy = np.average(hvlcs_scores['test_balanced_accuracy']) # The average accuracy

### END SOLUTION

print("Classifier:")
print(soft_vcls)
print("F1 Weighted-Score: {} & Balanced Accuracy: {}".format(round(s_avg_fmeasure,4), round(s_avg_accuracy,4)))

print("Classifier:")
print(hard_vcls)
print("F1 Weighted-Score: {} & Balanced Accuracy: {}".format(round(h_avg_fmeasure,4), round(h_avg_accuracy,4)))

Classifier:
VotingClassifier(estimators=[('knn', KNeighborsClassifier()),
                             ('decision_tree',
                              LogisticRegression(random_state=42, solver='sag',
                                                 tol=0.2))],
                 voting='soft')
F1 Weighted-Score: 0.8977 & Balanced Accuracy: 0.8977
Classifier:
VotingClassifier(estimators=[('knn', KNeighborsClassifier()),
                             ('decision_tree',
                              LogisticRegression(random_state=42, solver='sag',
                                                 tol=0.2))])
F1 Weighted-Score: 0.8563 & Balanced Accuracy: 0.8577


For both soft/hard voting classifiers the F1 weighted score should be above 0.74 and 0.79, respectively, and for balanced accuracy 0.74 and 0.80. Remember! This should be the average performance of each fold, as measured through cross-validation with 5 folds!

### 1.2 Randomization

You are asked to create three ensembles of decision trees where each one uses a different method for producing homogeneous ensembles. Compare them with a simple decision tree classifier and report your results in the dictionaries (dict) below using as key the given name of your classifier and as value the f1_weighted/balanced_accuracy score. The dictionaries should contain four different elements. Use the same cross-validation approach as before!

In [None]:
### BEGIN SOLUTION
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier

ens1 = BaggingClassifier(estimator = DecisionTreeClassifier(random_state=RANDOM_STATE, max_depth=7), n_estimators=10, random_state=RANDOM_STATE, n_jobs=-1)
ens2 = BaggingClassifier(estimator = DecisionTreeClassifier(random_state=RANDOM_STATE, max_depth=7), n_estimators=10, random_state=RANDOM_STATE, n_jobs=-1, bootstrap_features=True, max_samples=0.7, max_features=0.5)
ens3 = RandomForestClassifier(max_depth=5, random_state=RANDOM_STATE, n_estimators=10, n_jobs=-1)
tree = DecisionTreeClassifier(random_state=RANDOM_STATE, max_depth = 7)

ens1_scores = cross_validate(estimator=ens1, X=X, y=y, cv=5, scoring=['f1_weighted', 'balanced_accuracy'], n_jobs=-1)
ens2_scores = cross_validate(estimator=ens2, X=X, y=y, cv=5, scoring=['f1_weighted', 'balanced_accuracy'], n_jobs=-1)
ens3_scores = cross_validate(estimator=ens3, X=X, y=y, cv=5, scoring=['f1_weighted', 'balanced_accuracy'], n_jobs=-1)
tree_scores = cross_validate(estimator=tree, X=X, y=y, cv=5, scoring=['f1_weighted', 'balanced_accuracy'], n_jobs=-1)

# Example f_measures = {'Simple Decision':0.8551, 'Ensemble with random ...': 0.92, ...}

f_measures = {'Simple Decision':np.average([tree_scores['test_f1_weighted']]),
              'Ensemble with bagging':np.average(ens1_scores['test_f1_weighted']),
              'Ensemble with random patches':np.average(ens2_scores['test_f1_weighted']),
              'Ensemble with random forest':np.average(ens3_scores['test_f1_weighted'])}

accuracies = {'Simple Decision':np.average([tree_scores['test_balanced_accuracy']]),
              'Ensemble with bagging':np.average(ens1_scores['test_balanced_accuracy']),
              'Ensemble with random patches':np.average(ens2_scores['test_balanced_accuracy']),
              'Ensemble with random forest':np.average(ens3_scores['test_balanced_accuracy'])}
### END SOLUTION

print(ens1)
print(ens2)
print(ens3)
print(tree)
for name,score in f_measures.items():
    print("Classifier: {} -  F1 Weighted: {}".format(name,round(score,4)))
for name,score in accuracies.items():
    print("Classifier: {} -  BalancedAccuracy: {}".format(name,round(score,4)))

BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=7,
                                                   random_state=42),
                  n_jobs=-1, random_state=42)
BaggingClassifier(bootstrap_features=True,
                  estimator=DecisionTreeClassifier(max_depth=7,
                                                   random_state=42),
                  max_features=0.5, max_samples=0.7, n_jobs=-1,
                  random_state=42)
RandomForestClassifier(max_depth=5, n_estimators=10, n_jobs=-1, random_state=42)
DecisionTreeClassifier(max_depth=7, random_state=42)
Classifier: Simple Decision -  F1 Weighted: 0.7615
Classifier: Ensemble with bagging -  F1 Weighted: 0.8184
Classifier: Ensemble with random patches -  F1 Weighted: 0.8228
Classifier: Ensemble with random forest -  F1 Weighted: 0.7717
Classifier: Simple Decision -  BalancedAccuracy: 0.762
Classifier: Ensemble with bagging -  BalancedAccuracy: 0.8186
Classifier: Ensemble with random patches -  BalancedAccuracy

### 1.3 Question

Increasing the number of estimators in a bagging classifier can drastically increase the training time of a classifier. Is there any solution to this problem? Can the same solution be applied to boosting classifiers?

 **Answer**: It is true that increasing the number of estimators in a bagging classifier can drastically increase its training time. Intuitively, the more models there are to train, the more computation is needed. To keep this extra computation from translating into extra training time, one solution is task parallelism (parallel programming). Specifically, because the estimators of a bagging ensemble are independent of one another, several of them (ideally all) can be trained simultaneously, depending on the GPU and multi-threading capacities of the execution machine(s); the estimators need not all be trained on the same machine.

 Such an approach requires that the machine/hardware specifications of the execution machine can apply multiple execution (estimator training) threads in parallel. When feasible, this is the more effective solution.
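A minimal sketch of the parallel-training idea, assuming scikit-learn's `BaggingClassifier` and a small synthetic dataset from `make_classification` (the dataset, its size and the helper `fit_seconds` are illustrative assumptions, not part of the assignment):

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption, for illustration only).
X_demo, y_demo = make_classification(n_samples=1000, n_features=50, random_state=42)


def fit_seconds(n_jobs):
    """Fit a bagging ensemble of 40 trees and return the wall-clock fit time."""
    model = BaggingClassifier(
        DecisionTreeClassifier(max_depth=7, random_state=42),
        n_estimators=40,
        n_jobs=n_jobs,  # 1 = sequential, -1 = use all available cores
        random_state=42,
    )
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    return time.perf_counter() - start


sequential = fit_seconds(1)
parallel = fit_seconds(-1)
print("n_jobs=1 : {:.2f}s".format(sequential))
print("n_jobs=-1: {:.2f}s".format(parallel))
```

The measured times depend on the machine. On a dataset this small, the overhead of starting worker processes can outweigh the gain, so the benefit of `n_jobs=-1` is expected to show mainly with larger data or more estimators.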

 However, sometimes we may not have enough threads available to reach a high degree of parallelism. Under these circumstances, there are a few more (compromise) solutions. One of them is to work with a subset of the available data, which reduces the time needed to train each estimator and therefore the overall training time of the ensemble. The subset can be obtained by feature selection, by keeping only a percentage of the available examples, or by both. Such an approach is obviously not optimal, yet it can be an acceptable compromise under some circumstances.

 Last but not least, we would like to note that increasing the number of estimators does not necessarily lead to improved model performance. In particular, there can be a threshold beyond which adding estimators no longer improves (or even degrades) the model's performance. As a result, reducing the training time can sometimes come down to selecting an appropriate number of estimators.
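One way to look for such a point is to sweep the number of estimators and compare cross-validated scores. The sketch below assumes a synthetic dataset and an arbitrary list of candidate sizes; both are illustrative choices, not values used in this assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption, for illustration only).
X_demo, y_demo = make_classification(n_samples=600, n_features=30, random_state=42)

results = {}
for n in (5, 10, 20, 40):  # candidate ensemble sizes (arbitrary choice)
    model = BaggingClassifier(
        DecisionTreeClassifier(max_depth=7, random_state=42),
        n_estimators=n,
        random_state=42,
    )
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="balanced_accuracy")
    results[n] = scores.mean()

for n, score in results.items():
    print("n_estimators={:>2}: balanced accuracy {:.4f}".format(n, score))
```

If the score stops improving between two consecutive sizes, the smaller ensemble gives roughly the same quality for less training time.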

Concerning boosting classifiers, parallel training of the estimators is, unfortunately, not an option. The ensemble is constructed sequentially: the estimators are no longer independent of one another, since each new classifier focuses on the errors made by the previous one. The "data subset" solution is problematic, too. Boosting is almost intolerant of wasted information, since many examples are needed for the estimators to make the false predictions that later estimators learn from. Finally, picking an appropriate number of estimators can be applied to boosting classifiers, for the same reasons analyzed previously.
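As a hedged illustration of these two points with scikit-learn: `BaggingClassifier` exposes an `n_jobs` parameter while `GradientBoostingClassifier` does not, and the number of boosting stages can be limited through early stopping (`n_iter_no_change` with a held-out `validation_fraction`). The synthetic data and the specific values below are assumptions chosen only for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

# Synthetic stand-in for the real data (assumption, for illustration only).
X_demo, y_demo = make_classification(n_samples=600, n_features=30, random_state=42)

# Bagging can train its estimators in parallel; gradient boosting has no such switch.
bagging_has_n_jobs = "n_jobs" in BaggingClassifier().get_params()
boosting_has_n_jobs = "n_jobs" in GradientBoostingClassifier().get_params()
print("BaggingClassifier has n_jobs:", bagging_has_n_jobs)
print("GradientBoostingClassifier has n_jobs:", boosting_has_n_jobs)

# Early stopping: allow up to 500 stages, but stop once the score on the
# held-out validation fraction has not improved for 5 consecutive stages.
gb = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.2,
    validation_fraction=0.1,
    n_iter_no_change=5,
    random_state=42,
)
gb.fit(X_demo, y_demo)
print("Boosting stages actually fitted:", gb.n_estimators_)
```

The fitted attribute `n_estimators_` reports how many stages were kept, which may be well below the upper bound when the validation score plateaus early.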

## 2.0 Creating the best classifier ##
In the second part of this assignment, we will try to train the best classifier and evaluate it using stratified cross-validation.

### 2.1 Good Performing Ensemble

In this part of the assignment you are asked to train a well-performing ensemble that can be used in a production environment! Describe the process you followed to achieve this result. How did you choose your classifier and your parameters, and why? Report the f-measure (weighted) & balanced accuracy of your final classifier, using 10-fold stratified cross validation. Can you achieve a balanced accuracy over 88%, while keeping the training time low? (Tip 1: You can even use a model from the previous parts, but you are advised to test additional configurations and ensemble architectures. Tip 2: If you try a lot of models/ensembles/configurations or even grid searches, leave only the classifier you selected as the best in your answer!)
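A minimal sketch of 10-fold stratified cross-validation that also reports training time, assuming a synthetic dataset and a plain `LogisticRegression` as a placeholder model (neither is the classifier selected in this notebook). Passing `cv=10` to `cross_validate` with a classifier already uses stratified folds; the explicit `StratifiedKFold` below only makes the shuffling and seed visible.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the real data (assumption, for illustration only).
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=42)

# Explicit stratified splitter: each fold keeps the class proportions of y_demo.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

cv_results = cross_validate(
    LogisticRegression(max_iter=1000),  # placeholder model
    X_demo,
    y_demo,
    cv=cv,
    scoring=["f1_weighted", "balanced_accuracy"],
)

print("F1 weighted      :", round(cv_results["test_f1_weighted"].mean(), 4))
print("Balanced accuracy:", round(cv_results["test_balanced_accuracy"].mean(), 4))
print("Mean fit time (s):", round(cv_results["fit_time"].mean(), 4))
```

The `fit_time` entry returned by `cross_validate` gives a per-fold training time, which is one simple way to compare candidate ensembles on both quality and cost.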

In [None]:
# BEGIN CODE HERE
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

clf1 = BaggingClassifier(estimator=LogisticRegression(random_state=RANDOM_STATE, solver='sag', tol=0.05), max_features=0.7, max_samples=0.85, n_estimators=25, n_jobs=6, random_state=RANDOM_STATE)
clf2 = LinearSVC(tol=0.05, random_state=RANDOM_STATE)
clf3 = GradientBoostingClassifier(n_estimators=113, learning_rate = 0.2, verbose = 0, random_state=RANDOM_STATE) # Classifier #3
clf4 = MLPClassifier(hidden_layer_sizes = 200, random_state=RANDOM_STATE)

best_cls = VotingClassifier([('bagging_logistic', clf1), ('svc', clf2), ('grad_boosting', clf3), ('mlp', clf4)], voting="hard") # Hard voting Classifier.

print(best_cls)

scores = cross_validate(best_cls, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=-1)
best_fmeasure = np.average(scores['test_f1_weighted'])  # Trials gave an average of ~ 0.85.
best_accuracy = np.average(scores['test_balanced_accuracy'])  # Trials gave an average of ~ 0.85.

#END CODE HERE

print("Classifier:")
print(best_cls)
print("F1 Weighted-Score: {} & Balanced Accuracy: {}".format(best_fmeasure, best_accuracy))

VotingClassifier(estimators=[('bagging_logistic',
                              BaggingClassifier(estimator=LogisticRegression(random_state=42,
                                                                             solver='sag',
                                                                             tol=0.05),
                                                max_features=0.7,
                                                max_samples=0.85,
                                                n_estimators=25, n_jobs=6,
                                                random_state=42)),
                             ('svc', LinearSVC(random_state=42, tol=0.05)),
                             ('grad_boosting',
                              GradientBoostingClassifier(learning_rate=0.2,
                                                         n_estimators=113,
                                                         random_state=42)),
                             ('mlp',
                

## **Classifier construction process**

### For the base models

At first, we tried a wide variety of different models, as well as "bagged" (ensemble) variants of them. For each of these models, we measured the balanced accuracy and F1 score. Based on these evaluation metrics, we selected the three best-performing models.

The next step we took was to fine-tune the level-0 models, using grid search. The fine-tuning process resulted in three models:
  - a random forest classifier
  - a logistic regression classifier
  - a decision tree classifier

1. At first, we tried plain and bagged variants of all the classifiers (both simple and complex) covered in class and within this notebook.
2. After evaluating these models, we chose some well-performing ones and tried different architectures to combine them.
3. Then, these were combined in different ways as an ensemble (hard voting, stacking with logistic regression, etc.).
4. After reaching a good result (>85%, but still a little lower than 1.2, taken as the baseline), I tried slightly different well-performing models as base models and/or tweaked some hyperparameters of the existing base models (based on the results of their individual tuning), trying to get the best possible final scores (and, of course, something better than the 1.2 results). This is because the best-performing and best-tuned individual (base) models did not necessarily translate into the best ensemble.
5. Finally, having a good model and not reaching a better one with the tests above, I stuck with the best one achieved so far.
6. In conclusion, the model was not much better than the one in 1.2, but it has a (slightly) better balanced accuracy and about the same F1 score.
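The tuning step for an individual base model can be sketched as follows. This is an illustrative example under stated assumptions, not the search that was actually run: it uses `RandomizedSearchCV` rather than an exhaustive grid to keep the run short, a synthetic dataset, and a hypothetical parameter space for a bagged logistic regression.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real data (assumption, for illustration only).
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=42)

# Base model to tune: a bagged logistic regression.
bagged_lr = BaggingClassifier(LogisticRegression(max_iter=1000), random_state=42)

# Hypothetical search space; the candidate values are illustrative.
param_distributions = {
    "n_estimators": [5, 10, 15],
    "max_features": [0.5, 0.7, 0.9],
    "max_samples": [0.5, 0.85, 1.0],
}

search = RandomizedSearchCV(
    bagged_lr,
    param_distributions,
    n_iter=5,  # number of sampled configurations
    cv=3,
    scoring="balanced_accuracy",
    random_state=42,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best balanced accuracy:", round(search.best_score_, 4))
```

The best configuration found for each base model would then be a starting point for the ensemble, keeping in mind the observation above that the best individual tuning does not always give the best combined model.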




### 2.2 Question
 What other ensemble architectures did you try, and why did you not choose them as your final classifier?

### **Results of attempted final ensemble classifiers (10 fold cross-validation)**
* `StackingClassifier(estimators=[('bagging_logistic', BaggingClassifier(base_estimator=LogisticRegression(random_state=42, solver='sag', tol=0.05), max_features=0.7, max_samples=0.85, n_estimators=25, n_jobs=6, random_state=42)), ('bagging_sgd', BaggingClassifier(base_estimator=SGDClassifier(random_state=42), max_features=0.9, max_samples=0.5, n_estimators=35, n_jobs=4, random_state=42)), ('bagging_mlp', BaggingClassifier(base_estimator=MLPClassifier(hidden_layer_sizes=5, random_state=42), n_estimators=3, n_jobs=1, random_state=42)), ('grad_boosting', GradientBoostingClassifier(learning_rate=0.2, n_estimators=113))], final_estimator=GaussianNB(), n_jobs=1)`

  Classifier:

  F1 Weighted-Score:0.853 & Balanced Accuracy:0.849


* `VotingClassifier(estimators=[('bagging_logistic', BaggingClassifier(base_estimator=LogisticRegression(random_state=42, solver='sag', tol=0.05), max_features=0.7, max_samples=0.85, n_estimators=25, n_jobs=6, random_state=42)), ('svc', LinearSVC(random_state=42, tol=0.05)), ('grad_boosting', GradientBoostingClassifier(learning_rate=0.2, n_estimators=113)), ('mlp', MLPClassifier(hidden_layer_sizes=200, random_state=42))])`

  Classifier:

  F1 Weighted-Score:0.859 & Balanced Accuracy:0.857

* `StackingClassifier(estimators=[('bagging_logistic', BaggingClassifier(base_estimator=LogisticRegression(random_state=42, solver='sag', tol=0.05), max_features=0.7, n_estimators=15, n_jobs=6, random_state=42)), ('mlp', MLPClassifier(hidden_layer_sizes=113, random_state=42))], final_estimator=LogisticRegression(random_state=42), n_jobs=1)`

  Classifier:

  F1 Weighted-Score:0.856 & Balanced Accuracy:0.850
* `StackingClassifier(estimators=[('bagging_logistic', BaggingClassifier(base_estimator=LogisticRegression(random_state=42, solver='sag', tol=0.05), max_features=0.7, n_estimators=15, n_jobs=6, random_state=42)), ('svc', LinearSVC(random_state=42, tol=0.05)), ('grad_boosting', GradientBoostingClassifier(learning_rate=0.25, n_estimators=113)), ('mlp', MLPClassifier(hidden_layer_sizes=113, random_state=42))], final_estimator=GaussianNB(), n_jobs=1)`

  Classifier:

  F1 Weighted-Score:0.856 & Balanced Accuracy:0.852

* `StackingClassifier(estimators=[('bagging_logistic', BaggingClassifier(base_estimator=LogisticRegression(random_state=42, solver='sag', tol=0.05), max_features=0.7, max_samples=0.85, n_estimators=25, n_jobs=6, random_state=42)), ('svc', LinearSVC(random_state=42, tol=0.05)), ('random_forest', RandomForestClassifier(n_estimators=200, n_jobs=6, random_state=42))], final_estimator=GaussianNB(), n_jobs=1)`

  Classifier:

  F1 Weighted-Score:0.849 & Balanced Accuracy:0.844


### Results of the final classifier (10-fold cross-validation)
Model: `VotingClassifier(estimators=[('bagging_logistic',BaggingClassifier(base_estimator=LogisticRegression(random_state=42,solver='sag',tol=0.05),max_features=0.7,max_samples=0.85,n_estimators=25,n_jobs=6,random_state=42)),('svc',LinearSVC(random_state=42,tol=0.05)),('grad_boosting',GradientBoostingClassifier(learning_rate=0.2,n_estimators=113,random_state=42)),('mlp',MLPClassifier(hidden_layer_sizes=200,random_state=42))])`



Metrics (10-fold stratified cross validation):
Classifier:
F1 Weighted-Score:0.857 & Balanced Accuracy:0.855

### 2.3 Setup the Final Classifier
Finally, in this last cell, set the cls variable to either the best model as determined by the stratified cross-validation, or choose to retrain your classifier on the whole dataset (X, y). There is no correct answer, but try to explain your choice. Then, save your model using pickle and upload it with your submission to e-learning!

In [None]:
import pickle
from sklearn.metrics import balanced_accuracy_score

### BEGIN SOLUTION
cls = best_cls.fit(X, y)

# save with pickle (note: the file is in pickle format, despite the .joblib extension)
file_name = "best_cls_3651.joblib"
with open(file_name, 'wb') as f:
    pickle.dump(cls, f)
### END SOLUTION


# load
with open(file_name, "rb") as f:
    cls = pickle.load(f)

predictions = cls.predict(X)
print(balanced_accuracy_score(y, predictions))

# test_set = pd.read_csv("test_set_noclass.csv")
# predictions = cls.predict(test_set)

# We are going to run the following code
if False:
  from sklearn.metrics import f1_score, balanced_accuracy_score
  final_test_set = pd.read_csv('test_set.csv')
  ground_truth = final_test_set['CLASS']
  # scikit-learn metrics expect (y_true, y_pred), in that order
  print("Balanced Accuracy: {}".format(balanced_accuracy_score(ground_truth, predictions)))
  print("F1 Weighted-Score: {}".format(f1_score(ground_truth, predictions, average='weighted')))

0.998814463544754


Both metrics should aim above 82%! This is going to be tested by us! Make sure your cross validation or your retrained model achieves high balanced accuracy and f1_score (based on 2.1) (more than 88%) as it should achieve at least 82% in our unknown test set!


### Best classifier selection

We opted for keeping the

Please provide your feedback regarding this project! Did you enjoy it?

The project was really interesting and constructive! ML is absolutely the best course in CSD AUTh. Thank you for the quality you offer!