## About iPython Notebooks ##

iPython Notebooks are interactive coding environments embedded in a webpage. You will be using iPython notebooks in this class. Make sure you fill in any place that says `# BEGIN CODE HERE #END CODE HERE`. After writing your code, you can run the cell by either pressing "SHIFT"+"ENTER" or by clicking on "Run" (denoted by a play symbol). Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

 **What you need to remember:**

- Run your cells using SHIFT+ENTER (or "Run cell")
- Write code in the designated areas using Python 3 only
- Do not modify the code outside of the designated areas
- In some cases you will also need to explain the results. There will also be designated areas for that.

Fill in your **NAME** and **AEM** below:

In [1]:
NAME = ""
AEM = ""

---

# Assignment 3 - Ensemble Methods #

Welcome to your third assignment. This exercise will test your understanding on Ensemble Methods.

In [2]:
# Always run this cell
import numpy as np
import pandas as pd

# USE THE FOLLOWING RANDOM STATE FOR YOUR CODE
RANDOM_STATE = 42

## Download the Dataset ##
Download the dataset using the following cell or from this [link](https://github.com/sakrifor/public/tree/master/machine_learning_course/EnsembleDataset) and put the files in the same folder as the .ipynb file.
In this assignment you are going to work with a dataset originated from the [ImageCLEFmed: The Medical Task 2016](https://www.imageclef.org/2016/medical) and the **Compound figure detection** subtask. The goal of this subtask is to identify whether a figure is a compound figure (one image consists of more than one figure) or not. The train dataset consits of 4197 examples/figures and each figure has 4096 features which were extracted using a deep neural network. The *CLASS* column represents the class of each example where 1 is a compoung figure and 0 is not.


In [3]:
import urllib.request
url_train = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/train_set.csv'
filename_train = 'train_set.csv'
urllib.request.urlretrieve(url_train, filename_train)
url_test = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/test_set_noclass.csv'
filename_test = 'test_set_noclass.csv'
urllib.request.urlretrieve(url_test, filename_test)

('test_set_noclass.csv', <http.client.HTTPMessage at 0x7f52c1c681c0>)

In [4]:
# Run this cell to load the data
train_set = pd.read_csv("train_set.csv").sample(frac=1).reset_index(drop=True)
train_set.head()
X = train_set.drop(columns=['CLASS'])
y = train_set['CLASS'].values

In [5]:
!pip install -U imbalanced-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


The following code will reduce the number of instances, dealing with the small imbalance of the dataset, as well as reducing the size of the dataset!

In [6]:
from collections import Counter
from imblearn.under_sampling import NeighbourhoodCleaningRule, RandomUnderSampler

ncr = NeighbourhoodCleaningRule()
X_res, y_res = ncr.fit_resample(X, y)
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_res, y_res)
print('Resampled dataset shape %s' % Counter(y_res))
X = X_res
y = y_res

Resampled dataset shape Counter({0: 1687, 1: 1687})


## 1.0 Testing different ensemble methods ##
In this part of the assignment you are asked to create and test different ensemble methods using the train_set.csv dataset. You should use **5-fold cross validation** for your tests and report the average f-measure weighted and balanced accuracy of your models. You can use [cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) and select both metrics to be measured during the evaluation.

### !!! Use n_jobs=-1 where is posibble to use all the cores of a machine for running your tests ###

### 1.1 Voting ###
Create a voting classifier which uses two **simple** estimators/classifiers. Test both soft and hard voting and report the results. Consider as simple estimators the following:


*   Decision Trees
*   Linear Models
*   KNN Models  

In [7]:
### BEGIN SOLUTION

from sklearn.model_selection import cross_validate, KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import f1_score, balanced_accuracy_score

# USE RANDOM STATE!
cls1 = DecisionTreeClassifier(random_state=RANDOM_STATE)
cls2 = LogisticRegression(random_state=RANDOM_STATE)
#cls3 = KNeighborsClassifier()

# Use 2 classifiers
soft_vcls = VotingClassifier(estimators=[('dt', cls1), ('lr', cls2)], voting='soft', n_jobs=-1)
hard_vcls = VotingClassifier(estimators=[('dt', cls1), ('lr', cls2)], voting='hard', n_jobs=-1)

# Use 3 classifiers
#soft_vcls = VotingClassifier(estimators=[('dt', cls1), ('lr', cls2), ('knn', cls3)], voting='soft', n_jobs=-1)
#hard_vcls = VotingClassifier(estimators=[('dt', cls1), ('lr', cls2), ('knn', cls3)], voting='hard', n_jobs=-1)

scoring = ['f1_weighted', 'balanced_accuracy']
#scoring = {'balanced_accuracy': make_scorer(balanced_accuracy_score),'f1_weighted': make_scorer(f1_score)}

#kfold = KFold(n_splits=5, random_state=RANDOM_STATE, shuffle=True)

svlcs_scores = cross_validate(soft_vcls, X, y, cv=5, scoring=scoring, n_jobs=-1)
s_avg_fmeasure = np.mean(svlcs_scores['test_f1_weighted'])
s_avg_accuracy = np.mean(svlcs_scores['test_balanced_accuracy'])

hvlcs_scores = cross_validate(hard_vcls, X, y, cv=5, scoring=scoring, n_jobs=-1)
h_avg_fmeasure = np.mean(hvlcs_scores['test_f1_weighted'])
h_avg_accuracy = np.mean(hvlcs_scores['test_balanced_accuracy'])

### END SOLUTION

print("Classifier:")
print(soft_vcls)
print("F1 Weighted-Score: {} & Balanced Accuracy: {}".format(round(s_avg_fmeasure,4), round(s_avg_accuracy,4)))

print("Classifier:")
print(hard_vcls)
print("F1 Weighted-Score: {} & Balanced Accuracy: {}".format(round(h_avg_fmeasure,4), round(h_avg_accuracy,4)))

Classifier:
VotingClassifier(estimators=[('dt', DecisionTreeClassifier(random_state=42)),
                             ('lr', LogisticRegression(random_state=42))],
                 n_jobs=-1, voting='soft')
F1 Weighted-Score: 0.7483 & Balanced Accuracy: 0.7484
Classifier:
VotingClassifier(estimators=[('dt', DecisionTreeClassifier(random_state=42)),
                             ('lr', LogisticRegression(random_state=42))],
                 n_jobs=-1)
F1 Weighted-Score: 0.816 & Balanced Accuracy: 0.8192


For both soft/hard voting classifiers the F1 weighted score should be above 0.74 and 0.79, respectively, and for balanced accuracy 0.74 and 0.80. Remember! This should be the average performance of each fold, as measured through cross-validation with 5 folds!

### 1.2 Randomization

You are asked to create three ensembles of decision trees where each one uses a different method for producing homogeneous ensembles. Compare them with a simple decision tree classifier and report your results in the dictionaries (dict) below using as key the given name of your classifier and as value the f1_weighted/balanced_accuracy score. The dictionaries should contain four different elements. Use the same cross-validation approach as before!

In [8]:
### BEGIN SOLUTION

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

ens1 = BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=12), max_features=0.8, n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1)
ens2 = RandomForestClassifier(criterion='gini', max_depth=10, max_features='auto', n_estimators=500,random_state=RANDOM_STATE, n_jobs=-1)
ens3 = GradientBoostingClassifier(learning_rate=0.1, max_depth=5, n_estimators=100, random_state=RANDOM_STATE)
#ens3 = ExtraTreesClassifier(random_state=RANDOM_STATE, n_jobs=-1)
tree = DecisionTreeClassifier(criterion='gini',max_depth=7, min_samples_leaf=2,splitter='best', random_state=RANDOM_STATE)

f_measures = dict()
accuracies = dict()
# Example f_measures = {'Simple Decision': 0.8551, 'Ensemble with random ...': 0.92, ...}

# A couple of functions to help us
def evaluation(model, X, y):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
    scores = cross_validate(model, X, y, scoring=('accuracy', 'f1'), cv=cv, verbose=2, n_jobs=-1)
    return scores

def scoring(model, X, y):
    scores = evaluation(model, X, y)
    f1_score = np.mean(scores['test_f1'])
    acc_score = np.mean(scores['test_accuracy'])
    return f1_score, acc_score

f1_ens1, acc_ens1 = scoring(ens1, X, y)
f1_ens2, acc_ens2 = scoring(ens2, X, y)
f1_ens3, acc_ens3 = scoring(ens3, X, y)
f1_tree, acc_tree = scoring(tree, X, y)

f_measures['Bagging'] = f1_ens1
f_measures['RandomForest'] = f1_ens2
f_measures['GradientBoost'] = f1_ens3
f_measures['DecisionTree'] = f1_tree

accuracies['Bagging'] = acc_ens1
accuracies['RandomForest'] = acc_ens2
accuracies['GradientBoost'] = acc_ens3
accuracies['DecisionTree'] = acc_tree

### END SOLUTION

print(ens1)
print(ens2)
print(ens3)
print(tree)
for name,score in f_measures.items():
    print("Classifier: {} -  F1 Weighted: {}".format(name,round(score,4)))
for name,score in accuracies.items():
    print("Classifier: {} -  BalancedAccuracy: {}".format(name,round(score,4)))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed: 10.9min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed: 10.9min finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.1min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.1min finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  9.9min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  9.9min finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.


BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=12),
                  max_features=0.8, n_estimators=100, n_jobs=-1,
                  random_state=42)
RandomForestClassifier(max_depth=10, max_features='auto', n_estimators=500,
                       n_jobs=-1, random_state=42)
GradientBoostingClassifier(max_depth=5, random_state=42)
DecisionTreeClassifier(max_depth=7, min_samples_leaf=2, random_state=42)
Classifier: Bagging -  F1 Weighted: 0.8606
Classifier: RandomForest -  F1 Weighted: 0.8549
Classifier: GradientBoost -  F1 Weighted: 0.8741
Classifier: DecisionTree -  F1 Weighted: 0.7534
Classifier: Bagging -  BalancedAccuracy: 0.8583
Classifier: RandomForest -  BalancedAccuracy: 0.8524
Classifier: GradientBoost -  BalancedAccuracy: 0.874
Classifier: DecisionTree -  BalancedAccuracy: 0.7584


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   12.4s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   12.4s finished


Gradient Boost is the most accurate.

| Classifier     | F1 Score | Accuracy |
|----------------|----------|----------|
| Bagging        | 0.8606   | 0.8583   |
| Random Forest  | 0.8549   | 0.8524   |
| Gradient Boost | 0.8741   | 0.8740   |
| Decision Tree  | 0.7534   | 0.7584   |

### 1.3 Question

Increasing the number of estimators in a bagging classifier can drastically increase the training time of a classifier. Is there any solution to this problem? Can the same solution be applied to boosting classifiers?

Increasing the number of estimators in a bagging classifier will indeed increase the training time of classifier, as each estimator is trained seperately. The solution to this problem is the well-known **parallelization**. To do this we utilize multiple processors or cores on a machine, distributing the training process across them. The n_jobs parameter in scikit-learn's bagging classifiers actually specify the number of jobs to run parallel. So, we set n_jobs=-1 to use all the cores of our machine for running our tests. Of course, the processing time depends on our machine.

In the case of boosting classifiers the solution of parallelization is not as helpful as in bagging. Boosting classifiers are sequential algorithms, meaning each estimator is trained sequentially and each subsequent estimator focuses on improving the mistakes made by the previous estimators. So, one approach to speed up the training process here is to use a subset of the data for each iteration rather than the entire dataset. By using smaller subsets, the training time can be reduced while still achieving good performance.

## 2.0 Creating the best classifier ##
In the second part of this assignment, we will try to train the best classifier, as well as to evaluate it using stratified cross valdiation.

### 2.1 Good Performing Ensemble

In this part of the assignment you are asked to train a good performing ensemble, that is able to be used in a production environment! Describe the process you followed to achieve this result. How did you choose your classifier and your parameters and why. Report the f-measure (weighted) & balanced accuracy, using 10-fold stratified cross validation, of your final classifier. Can you achieve a balanced accuracy over 88%, while keeping the training time low? (Tip 1: You can even use a model from the previous parts, but you are advised to test additional configurations, and ensemble architectures, Tip 2: If you try a lot of models/ensembles/configurations or even grid searches, in your answer leave only the classifier you selected as the best!)

In [None]:
### BEGIN SOLUTION
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC

# Set one
ens1 = SVC(C=10, gamma=0.0001, probability = False , random_state=RANDOM_STATE)
ens2 = SVC(C=1, gamma=0.01, probability = False , random_state=RANDOM_STATE)
ens3 = KNeighborsClassifier(n_neighbors=8, weights='distance' )
ens4 = KNeighborsClassifier(n_neighbors=15, weights='distance' )
ens5 = DecisionTreeClassifier(criterion='entropy', max_depth=15, min_samples_leaf=50)
ens6 = DecisionTreeClassifier(criterion='gini', max_depth=8, min_samples_leaf=5)

classifiers = [('svm1', ens1), ('svm2', ens2), ('knn1', ens3), ('knn2', ens4), ('dt1', ens5), ('dt2', ens6)]

# Set two
#ens1 = RandomForestClassifier(criterion='gini', max_depth=8, max_features='auto', n_estimators=500, random_state=RANDOM_STATE)
#ens2 = KNeighborsClassifier(n_neighbors=8, weights='distance')
#ens3 = KNeighborsClassifier(n_neighbors=15)
#ens4 = SVC(C= 0.1, gamma=1, kernel='poly', probability = False, random_state=RANDOM_STATE)
#ens5 = SVC(C=10, gamma=0.0001, probability = False, random_state=RANDOM_STATE)

#classifiers = [('rf', ens1), ('knn1', ens2), ('knn2', ens3), ('svm1', ens4), ('svm2', ens5)]

# Set three
#ens1 = KNeighborsClassifier(n_neighbors=8, weights='distance')
#ens2 = KNeighborsClassifier(n_neighbors=15)
#ens3 = SVC(C= 0.1, gamma=1, kernel='poly', probability = False ,random_state=RANDOM_STATE)
#ens4 = SVC(C=10, gamma=0.0001, probability = False ,random_state=RANDOM_STATE)
#ens5 = RandomForestClassifier(criterion='gini', max_depth=8, max_features='auto', n_estimators=500, random_state=RANDOM_STATE)
#ens6 = GradientBoostingClassifier(learning_rate=0.1, max_depth=5, n_estimators=100, random_state=RANDOM_STATE)

#classifiers = [('knn1',ens1),('knn2',ens2),('svm1',ens3),('svm2',ens4),('rf',ens5),('gb',ens6)]

best_cls = StackingClassifier(classifiers, cv=10, n_jobs=1)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)
scores_stacking = cross_validate(best_cls, X, y, scoring=('accuracy', 'f1'), cv=cv, verbose=2, n_jobs=-1)

best_fmeasure = np.mean(scores_stacking['test_f1'])
best_accuracy = np.mean(scores_stacking['test_accuracy'])
### END SOLUTION

print("Classifier:")
print(best_cls)
print("F1 Weighted-Score: {} & Balanced Accuracy: {}".format(best_fmeasure, best_accuracy))

LEAVE HERE ANY COMMENTS ABOUT YOUR CLASSIFIER

As in the first sextion of the assignment, various random combinations were tested.

| Set | F1 Score | Accuracy | Time |
|-----|----------|----------|------|
| 1   | 0.91364  | 0.91375  | 61.8 |
| 2   | 0.91280  | 0.91286  | 67.8 |
| 3   |    -     |    -     | 201+ |

I explored different sets of classifiers, including SVM, KNN, Decision Trees, Random Forest and Gradient Boosting. Each set contained different variations of these classifiers with various parameter settings. I aimed to capture different characteristics of the data and leverage their strengths for ensemble learning. In this way I identify the optimal combination that maximizes the performance metrics.

To assess the performance of each ensemble configuration, I used 10-fold stratified cross-validation as needed. This was done because it ensures that the dataset is divided into balanced folds while preserving the distribution of class labels.

I used the StackingClassifier, which combines the predictions of multiple base classifiers using a meta-classifier.

Finally, to compare the models I focused on the f-measure (weighted) and balanced accuracy metrics.

### 2.2 Question
 What other ensemble architectures you tried, and why you did not choose them as your final classifier?

During the experimentation process, I explored several ensemble architectures in the StackingClassifier. I need to refer the set that I used RandomForestClassifier along with GradientBoostingClassifier. This set has significantly longer training time, so in addition with the impressive results the other sets have, I decided to reject it. Speaking for the other two sets, they have approximately the same results, but the first is a little faster, so I choose this over the second. I was expecting these results as the main difference is that  I used DecisionTreeClassifier instead of RandomForest. They are too far away.

### 2.3 Setup the Final Classifier
Finally, in this last cell, set the cls variable to either the best model as occured by the stratified cross_validation, or choose to retrain your classifier in the whole dataset (X, y). There is no correct answer, but try to explain your choice. Then, save your model using pickle and upload it with your submission to e-learning!

In [None]:
import pickle

### BEGIN SOLUTION
cls = best_cls
cls.fit(X, y)
# save with pickle
file_name = "final_model.pkl"
pickle.dump(cls, open(file_name, "wb"))
### END SOLUTION


# load
cls = pickle.load(open(file_name, "rb"))

test_set = pd.read_csv("test_set_noclass.csv")
predictions = cls.predict(test_set)

# We are going to run the following code
if False:
  from sklearn.metrics import f1_score, balanced_accuracy_score
  final_test_set = pd.read_csv('test_set.csv')
  ground_truth = final_test_set['CLASS']
  print("Balanced Accuracy: {}".format(balanced_accuracy_score(predictions, ground_truth)))
  print("F1 Weighted-Score: {}".format(f1_score(predictions, ground_truth, average='weighted')))

Both metrics should aim above 82%! This is going to be tested by us! Make sure your cross validation or your retrained model achieves high balanced accuracy and f1_score (based on 2.1) (more than 88%) as it should achieve at least 82% in our unknown test set!


Please provide your feedback regarding this project! Did you enjoy it?

In [None]:
# YOUR ANSWER HERE

It was very interesting to explore different classifiers, ensemble architectures and parameter configurations to find the optimal model. This project allowed me to gain practical experience and improve my skills in machine learning. Overall, it was a rewarding and enjoyable project!