## About iPython Notebooks ##

iPython Notebooks are interactive coding environments embedded in a webpage. You will be using iPython notebooks in this class. Make sure you fill in any place that says `# BEGIN CODE HERE #END CODE HERE`. After writing your code, you can run the cell by either pressing "SHIFT"+"ENTER" or by clicking on "Run" (denoted by a play symbol). Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). 

 **What you need to remember:**

- Run your cells using SHIFT+ENTER (or "Run cell")
- Write code in the designated areas using Python 3 only
- Do not modify the code outside of the designated areas
- In some cases you will also need to explain the results. There will also be designated areas for that. 

Fill in your **NAME** and **AEM** below:

In [1]:
NAME = "Ιωάννης Μπαρακλιλής"
AEM = "3685"

---

# Assignment 3 - Ensemble Methods #

Welcome to your third assignment. This exercise will test your understanding on Ensemble Methods.

In [2]:
# Always run this cell
import numpy as np
import pandas as pd

# USE THE FOLLOWING RANDOM STATE FOR YOUR CODE
RANDOM_STATE = 42

## Download the Dataset ##
Download the dataset using the following cell or from this [link](https://github.com/sakrifor/public/tree/master/machine_learning_course/EnsembleDataset) and put the files in the same folder as the .ipynb file. 
In this assignment you are going to work with a dataset originated from the [ImageCLEFmed: The Medical Task 2016](https://www.imageclef.org/2016/medical) and the **Compound figure detection** subtask. The goal of this subtask is to identify whether a figure is a compound figure (one image consists of more than one figure) or not. The train dataset consits of 4197 examples/figures and each figure has 4096 features which were extracted using a deep neural network. The *CLASS* column represents the class of each example where 1 is a compoung figure and 0 is not. 


In [3]:
import urllib.request
url_train = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/train_set.csv'
filename_train = 'train_set.csv'
urllib.request.urlretrieve(url_train, filename_train)
url_test = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/test_set_noclass.csv'
filename_test = 'test_set_noclass.csv'
urllib.request.urlretrieve(url_test, filename_test)

('test_set_noclass.csv', <http.client.HTTPMessage at 0x7f7f8ce3b350>)

In [3]:
# Run this cell to load the data
train_set = pd.read_csv("train_set.csv").sample(frac=1).reset_index(drop=True)
train_set.head()
X = train_set.drop(columns=['CLASS'])
y = train_set['CLASS'].values

## 1.0 Testing different ensemble methods ##
In this part of the assignment you are asked to create and test different ensemble methods using the train_set.csv dataset. You should use **10-fold cross validation** for your tests and report the average f-measure weighted and balanced accuracy of your models. You can use [cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) and select both metrics to be measured during the evaluation. Otherwise, you can use [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold).

### !!! Use n_jobs=-1 where is posibble to use all the cores of a machine for running your tests ###

### 1.1 Voting ###
Create a voting classifier which uses three **simple** estimators/classifiers. Test both soft and hard voting and choose the best one. Consider as simple estimators the following:


*   Decision Trees
*   Linear Models
*   Probabilistic Models (Naive Bayes)
*   KNN Models  

In [None]:
# BEGIN CODE HERE

# Imports.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_validate

# To pick the individual models, I tested each one individually in the console and picked (and then tuned) the best performing 3. These results are shown in the results-comments of 3.0 (as similar tests had to be run to choose the base models).

""" Tuning: Best combined models should lead to a good ensemble model.
from sklearn.model_selection import GridSearchCV
parameters = {
    'criterion': ["gini", "entropy"],
    'max_depth': [2, 5, 20, 50, 100, 200, None],
    'max_leaf_nodes': [2, 5, 20, 50, 100, 200, None]
}
clf = GridSearchCV(DecisionTreeClassifier(random_state=RANDOM_STATE), parameters, cv =10, verbose = 5, n_jobs = 6, scoring=['balanced_accuracy', 'f1_weighted'], refit = 'balanced_accuracy')
clf.fit(X, y)
cls = clf.best_estimator_
print(cls)  # DecisionTreeClassifier(max_depth=20, max_leaf_nodes=50, random_state=RANDOM_STATE).
scores = cross_validate(cls, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=10, verbose=2)
avg_fmeasure = np.average(scores["test_f1_weighted"]) # The average f-measure -- Should be 0.732930.
avg_accuracy = np.average(scores["test_balanced_accuracy"]) # The average accuracy -- Should be 0.720639.
print(dict(avg_fmeasure=avg_fmeasure, avg_accuracy=avg_accuracy))
"""
cls1 = DecisionTreeClassifier(max_depth=20, max_leaf_nodes=50, random_state=RANDOM_STATE) # Classifier #1

""" Tuning: Best combined models should lead to a good ensemble model.
from sklearn.model_selection import GridSearchCV
parameters = {
    'tol': [1e-2, 1e-3]
}
clf = GridSearchCV(LogisticRegression(solver = 'sag', random_state = RANDOM_STATE), parameters, cv =10, verbose = 5, n_jobs = 4, scoring=['balanced_accuracy', 'f1_weighted'], refit = 'balanced_accuracy')
clf.fit(X, y)
cls = clf.best_estimator_
print(cls)  # LogisticRegression(random_state=RANDOM_STATE, solver='sag', tol=0.01).
scores = cross_validate(cls, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=4, verbose=2)
avg_fmeasure = np.average(scores["test_f1_weighted"]) # The average f-measure -- Should be 0.851290.
avg_accuracy = np.average(scores["test_balanced_accuracy"]) # The average accuracy -- Should be 0.845610.
print(dict(avg_fmeasure=avg_fmeasure, avg_accuracy=avg_accuracy))
"""
cls2 = LogisticRegression(random_state=RANDOM_STATE, solver='sag', tol=0.01) # Classifier #2

""" Tuning: Best combined models should lead to a good ensemble model.
from sklearn.model_selection import GridSearchCV
parameters = {
    'n_neighbors': [1, 2, 3, 5, 10, 20, 50, 100],
    'weights': ['uniform', 'distance']
}
clf = GridSearchCV(KNeighborsClassifier(), parameters, cv =10, verbose = 5, n_jobs = 6, scoring=['balanced_accuracy', 'f1_weighted'], refit = 'balanced_accuracy')
clf.fit(X, y)
cls = clf.best_estimator_
print(cls)  # KNeighborsClassifier(n_neighbors=10, weights='distance').
scores = cross_validate(cls, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=4, verbose=2)
avg_fmeasure = np.average(scores["test_f1_weighted"]) # The average f-measure -- Should be 0.814493.
avg_accuracy = np.average(scores["test_balanced_accuracy"]) # The average accuracy -- Should be 0.807473.
print(dict(avg_fmeasure=avg_fmeasure, avg_accuracy=avg_accuracy))
"""
cls3 = KNeighborsClassifier(n_neighbors=10, weights='distance') # Classifier #1

soft_vcls = VotingClassifier([('dt', cls1), ('lr', cls2), ('knn', cls3)], voting="soft") # Voting Classifier
hard_vcls = VotingClassifier([('dt', cls1), ('lr', cls2), ('knn', cls3)], voting="hard") # Voting Classifier

svlcs_scores = cross_validate(soft_vcls, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=1, verbose = 2)
s_avg_fmeasure = np.average(svlcs_scores["test_f1_weighted"]) # The average f-measure -- Should be 0.8431.
s_avg_accuracy = np.average(svlcs_scores["test_balanced_accuracy"]) # The average accuracy -- Should be 0.835.

hvlcs_scores = cross_validate(hard_vcls, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=1, verbose = 2)
h_avg_fmeasure = np.average(hvlcs_scores["test_f1_weighted"]) # The average f-measure -- Should be 0.8386.
h_avg_accuracy = np.average(hvlcs_scores["test_balanced_accuracy"]) # The average accuracy -- 0.8306.
#END CODE HERE

In [5]:
print("Classifier:")
print(soft_vcls)
print("F1 Weighted-Score: {} & Balanced Accuracy: {}".format(round(s_avg_fmeasure,4), round(s_avg_accuracy,4)))

Classifier:
VotingClassifier(estimators=[('dt',
                              DecisionTreeClassifier(max_depth=20,
                                                     max_leaf_nodes=50,
                                                     random_state=42)),
                             ('lr',
                              LogisticRegression(random_state=42, solver='sag',
                                                 tol=0.01)),
                             ('knn',
                              KNeighborsClassifier(n_neighbors=10,
                                                   weights='distance'))],
                 voting='soft')
F1 Weighted-Score: 0.8431 & Balanced Accuracy: 0.835


You should achive above 82% (Soft Voting Classifier)

In [6]:
print("Classifier:")
print(hard_vcls)
print("F1 Weighted-Score: {} & Balanced Accuracy: {}".format(round(h_avg_fmeasure,4), round(h_avg_accuracy,4)))

Classifier:
VotingClassifier(estimators=[('dt',
                              DecisionTreeClassifier(max_depth=20,
                                                     max_leaf_nodes=50,
                                                     random_state=42)),
                             ('lr',
                              LogisticRegression(random_state=42, solver='sag',
                                                 tol=0.01)),
                             ('knn',
                              KNeighborsClassifier(n_neighbors=10,
                                                   weights='distance'))])
F1 Weighted-Score: 0.8386 & Balanced Accuracy: 0.8306


You should achieve above 80% in both! (Hard Voting Classifier)

### 1.2 Stacking ###
Create a stacking classifier which uses two more complex estimators. Try different simple classifiers (like the ones mentioned before) for the combination of the initial estimators. Report your results in the following cell.

Consider as complex estimators the following:

*   Random Forest
*   SVM
*   Gradient Boosting
*   MLP




In [None]:
# BEGIN CODE HERE
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import GradientBoostingClassifier

# To pick the individual models, I tested each one individually in the console and picked (and then tuned) the best performing 3. These results can be seen in the results of 3.0 (as similar tests had to be run to choose the base models).

""" Tuning:
from matplotlib import pyplot as plt
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RANDOM_STATE)
all_metrics = {'tol': {}}
for tol in [1e-3, 1e-2, 5e-2, 1e-1, 0.5]:
    if tol not in all_metrics['tol']:
        all_metrics['tol'][tol] = {'train': [], 'test': []}

    clf = LinearSVC(tol=tol, random_state=RANDOM_STATE)
    clf.fit(X_train, y_train)

    bal_acc_train = balanced_accuracy_score(y_train, clf.predict(X_train))
    bal_acc_test = balanced_accuracy_score(y_test, clf.predict(X_test))

    print(dict(tol=tol), end='\n\t\t')
    print(dict(bal_acc_train=bal_acc_train, bal_acc_test=bal_acc_test), end='\n\t\t')
    print(dict(f1_train=f1_score(y_train, clf.predict(X_train), average='weighted'),
          f1_test=f1_score(y_test, clf.predict(X_test), average='weighted')))
    all_metrics['tol'][tol]['train'].append(bal_acc_train)
    all_metrics['tol'][tol]['test'].append(bal_acc_test)

# Plot to arrive to conclusions.
plotx = sorted(all_metrics['tol'].keys())
ploty = [np.average(all_metrics['tol'][i]['train']) for i in plotx]
plt.plot(plotx, ploty, 'r', label = 'Train')
plotx = sorted(all_metrics['tol'].keys())
ploty = [np.average(all_metrics['tol'][i]['test']) for i in plotx]
plt.plot(plotx, ploty, 'g', label = 'Test')
plt.title('Average balanced accuracy for tol parameter')
plt.legend()
plt.show()

# ---- After we see the graphs, it can be observed that all choices seem extremely close, so the model picked will be pretty much the same regardless of the specific parameter chosen.
# So, after (hand) testing final ensembles of models, the submodel that had the best scores was the LinearSVC(tol=0.05, random_state=RANDOM_STATE), and so tol=0.05 was selected.

clf = LinearSVC(tol=0.05, random_state=RANDOM_STATE)
scores = cross_validate(clf, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=1, verbose=5)
print('avg_fmeasure =', np.average(scores["test_f1_weighted"]) )# The average f-measure -- Should be about 0.810540.
print('avg_accuracy =', np.average(scores["test_balanced_accuracy"])) # The average accuracy -- Should be about 0.804951.
"""
cls1 = LinearSVC(tol=5e-2, random_state=RANDOM_STATE) # Classifier #1

""" Tuning:
from matplotlib import pyplot as plt
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RANDOM_STATE)
all_metrics = {'n_estimators': {}, 'learning_rate': {}}
for n_estimators in [1, 10, 50, 100, 113, 150, 200, 300]:
    if n_estimators not in all_metrics['n_estimators']:
        all_metrics['n_estimators'][n_estimators] = {'train': [], 'test': []}
    for learning_rate in [.1, .2, .5, .7, 1.]:
        if learning_rate not in all_metrics['learning_rate']:
            all_metrics['learning_rate'][learning_rate] = {'train': [], 'test': []}

        clf = GradientBoostingClassifier(n_estimators=n_estimators, learning_rate = learning_rate, verbose = 0)
        clf.fit(X_train, y_train)

        bal_acc_train = balanced_accuracy_score(y_train, clf.predict(X_train))
        bal_acc_test = balanced_accuracy_score(y_test, clf.predict(X_test))

        print(dict(n_estimators=n_estimators, learning_rate=learning_rate), end='\n\t\t')
        print(dict(bal_acc_train=bal_acc_train, bal_acc_test=bal_acc_test), end='\n\t\t')
        print(dict(f1_train=f1_score(y_train, clf.predict(X_train), average='weighted'),
              f1_test=f1_score(y_test, clf.predict(X_test), average='weighted')))

        all_metrics['n_estimators'][n_estimators]['train'].append(bal_acc_train)
        all_metrics['n_estimators'][n_estimators]['test'].append(bal_acc_test)
        all_metrics['learning_rate'][learning_rate]['train'].append(bal_acc_train)
        all_metrics['learning_rate'][learning_rate]['test'].append(bal_acc_test)

# Plot to arrive to conclusions.
plotx = sorted(all_metrics['n_estimators'].keys())
ploty = [np.average(all_metrics['n_estimators'][i]['train']) for i in plotx]
plt.plot(plotx, ploty, 'r', label = 'Train')
plotx = sorted(all_metrics['n_estimators'].keys())
ploty = [np.average(all_metrics['n_estimators'][i]['test']) for i in plotx]
plt.plot(plotx, ploty, 'g', label = 'Test')
plt.title('Average balanced accuracy for n_estimators parameter')
plt.legend()
plt.show()
plotx = sorted(all_metrics['learning_rate'].keys())
ploty = [np.average(all_metrics['learning_rate'][i]['train']) for i in plotx]
plt.plot(plotx, ploty, 'r', label = 'Train')
plotx = sorted(all_metrics['learning_rate'].keys())
ploty = [np.average(all_metrics['learning_rate'][i]['test']) for i in plotx]
plt.plot(plotx, ploty, 'g', label = 'Test')
plt.title('Average balanced accuracy for learning_rate parameter')
plt.legend()
plt.show()

# ---- After we see the graphs, it can be observed that, in average, the best parameters that optimize both train and test scores (balanced accuracy because it is the hardest to raise and it is observed that with high accuracy we have high f1 as well) is:
#     * n_estimators = 113, after which both train and test scores have rise that does not justify the increased complexity.
#     * learning_rate = 0.2, because the model overtrains and the testing score does not increase significantly.
# As for validation:
clf = GradientBoostingClassifier(n_estimators=113, learning_rate = 0.2, random_state=RANDOM_STATE)
scores = cross_validate(clf, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=1, verbose=5)
print('avg_fmeasure =', np.average(scores["test_f1_weighted"]) )# The average f-measure -- Should be about 0.826294.
print('avg_accuracy =', np.average(scores["test_balanced_accuracy"])) # The average accuracy -- Should be about 0.816828.
"""
cls2 = GradientBoostingClassifier(n_estimators=113, learning_rate = 0.2, verbose = 0, random_state=RANDOM_STATE) # Classifier #2

""" Tuning:
from matplotlib import pyplot as plt
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RANDOM_STATE)
all_metrics = {'hidden_layer_sizes': {}}
for hidden_layer_sizes in [1, 10, 50, 100, 200, 300]:
    all_metrics['hidden_layer_sizes'][hidden_layer_sizes] = {'train': [], 'test': []}
    clf = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes, random_state=RANDOM_STATE)
    clf.fit(X_train, y_train)
    bal_acc_train = balanced_accuracy_score(y_train, clf.predict(X_train))
    bal_acc_test = balanced_accuracy_score(y_test, clf.predict(X_test))
    f1_train = f1_score(y_train, clf.predict(X_train), average='weighted')
    f1_test = f1_score(y_test, clf.predict(X_test), average='weighted')
    print(dict(hidden_layer_sizes=hidden_layer_sizes), end='\n\t\t')
    print(dict(bal_acc_train=bal_acc_train, bal_acc_test=bal_acc_test), end='\n\t\t')
    print(dict(f1_train=f1_train, f1_test=f1_test))
    all_metrics['hidden_layer_sizes'][hidden_layer_sizes]['train'].append(bal_acc_train)
    all_metrics['hidden_layer_sizes'][hidden_layer_sizes]['test'].append(bal_acc_test)

# Plot to arrive to conclusions.
plotx = sorted(all_metrics['hidden_layer_sizes'].keys())
ploty = [np.average(all_metrics['hidden_layer_sizes'][i]['train']) for i in plotx]
plt.plot(plotx, ploty, 'r', label='Train')
plotx = sorted(all_metrics['hidden_layer_sizes'].keys())
ploty = [np.average(all_metrics['hidden_layer_sizes'][i]['test']) for i in plotx]
plt.plot(plotx, ploty, 'g', label='Test')
plt.title('Average balanced accuracy for hidden_layer_sizes parameter')
plt.legend()
plt.show()


# ---- After we see the graphs, it can be observed that, in average, the best parameters that optimize both train and test scores (balanced accuracy because it is the hardest to raise and it is observed that with high accuracy we have high f1 as well) is:
#     * hidden_layer_sizes = 200, after which train scores stay flat and test scores slowly drop.
# As for validation:
clf = MLPClassifier(hidden_layer_sizes = 200, random_state=RANDOM_STATE)
scores = cross_validate(clf, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=1, verbose=5)
print('avg_fmeasure =', np.average(scores["test_f1_weighted"]) )# The average f-measure -- Should be about 0.858986.
print('avg_accuracy =', np.average(scores["test_balanced_accuracy"])) # The average accuracy -- Should be about 0.852655.
"""
cls3 = MLPClassifier(hidden_layer_sizes = 200, random_state=RANDOM_STATE) # Classifier #3 (Optional)

# For the final estimator, I tried LogisticRegression, Decision Tree, Gaussian Naive Bayes and KNN with k=3. The best performing one was LogisticRegression, so that one was used in the end.
scls = StackingClassifier(estimators=[("SVC", cls1), ("gbc", cls2), ("mlp", cls3)], final_estimator=LogisticRegression(random_state=RANDOM_STATE), n_jobs=1) # Stacking Classifier
scores = cross_validate(scls, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=1, verbose=5)
avg_fmeasure = np.average(scores["test_f1_weighted"]) # The average f-measure -- Should be 0.8568.
avg_accuracy = np.average(scores["test_balanced_accuracy"]) # The average accuracy -- Should be 0.8501.
#END CODE HERE

In [8]:
print("Classifier:")
print(scls)
print("F1 Weighted Score: {} & Balanced Accuracy: {}".format(round(avg_fmeasure,4), round(avg_accuracy,4)))

Classifier:
StackingClassifier(estimators=[('SVC', LinearSVC(random_state=42, tol=0.05)),
                               ('gbc',
                                GradientBoostingClassifier(learning_rate=0.2,
                                                           n_estimators=113,
                                                           random_state=42)),
                               ('mlp',
                                MLPClassifier(hidden_layer_sizes=200,
                                              random_state=42))],
                   final_estimator=LogisticRegression(random_state=42),
                   n_jobs=1)
F1 Weighted Score: 0.8568 & Balanced Accuracy: 0.8501


You should achieve above 85% in both

## 2.0 Randomization ##

**2.1** You are asked to create three ensembles of decision trees where each one uses a different method for producing homogeneous ensembles. Compare them with a simple decision tree classifier and report your results in the dictionaries (dict) below using as key the given name of your classifier and as value the f1_weighted/balanced_accuracy score. The dictionaries should contain four different elements.  

In [None]:
# BEGIN CODE HERE
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
""" Hyper-parameter tuning process:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
ns = []
baccs = []
f1s = []
# Trying out different n_estimators.
for i in list(range(20, 1001, 100)): # Then list(range(100, 201, 50)), then list(range(140, 161, 5)). Each new range was tested after the result of the previous ones (focusing on the areas of great scores).
    clf = BaggingClassifier(DecisionTreeClassifier(random_state=RANDOM_STATE), n_estimators=i, random_state=RANDOM_STATE,
                      n_jobs=6)
    clf.fit(X_train, y_train)
    ns.append(i)
    baccs.append(balanced_accuracy_score(y_test, clf.predict(X_test)))
    f1s.append(f1_score(y_test, clf.predict(X_test), average="weighted"))
    print('%s done: balanced_accuracy = %f, f1_score = %f'%(i, baccs[-1], f1s[-1]))
# Plot results.
plt.plot(ns, baccs, 'b', label = 'Balanced accuracy.'); plt.plot(ns, f1s, 'r', label = 'F1 weighted.'); plt.legend(); plt.show()

# The best number of estimators looked like 155: BaggingClassifier(DecisionTreeClassifier(random_state=RANDOM_STATE), n_estimators=155, random_state=RANDOM_STATE, n_jobs=6)
# Scores (computed below for all classifier ensembles): F1 Weighted = 0.8062, Balanced Accuracy = 0.792.
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
"""
ens1 = BaggingClassifier(DecisionTreeClassifier(random_state=RANDOM_STATE), n_estimators=155, random_state=RANDOM_STATE, n_jobs=6) # Bagging with default trees (155 estimators because with more, the scores wouldn't get any better).

""" Hyper-parameter tuning process:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
params = {
    'n_estimators': [300, 350, 400, 450, 500],  # At first attempt (before refinement): [10, 50, 100, 200, 300, 400].
    'max_samples': [.55, .58, .6, .65, .68],  # At first attempt (before refinement): [.1, .2, .3, .4, .5, .6, .7, .8, .9, 1.0].
}

clf = GridSearchCV(BaggingClassifier(DecisionTreeClassifier(random_state=RANDOM_STATE), random_state=RANDOM_STATE, n_jobs=6), params, n_jobs = 1, cv = 3, verbose = 5, scoring=['balanced_accuracy', 'f1_weighted'], refit = 'balanced_accuracy')
clf.fit(X_train, y_train)
print(clf.best_estimator_)  # BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=RANDOM_STATE), max_samples=0.68, n_estimators=350, n_jobs=6, random_state=RANDOM_STATE).
print('Best balanced accuracy: %f' % clf.best_score)

# Classifier scores (computed below for all classifier ensembles): F1 Weighted = 0.8114, Balanced Accuracy = 0.7975.
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
"""
ens2 = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=RANDOM_STATE),
                         max_samples=0.68, n_estimators=350, n_jobs=6,
                         random_state=RANDOM_STATE) # Pasting: 68% of samples used.


""" Hyper-parameter tuning process:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RANDOM_STATE)

models=[]
metricss=[]
max_metrics = (-1, -1)
max_model = None
for n_estimators in [100, 200, 300]:
    for max_samples in [.1, .2, .3, .4, .5, .6, .7, .8, .9]:
        for max_features in [.1, .2, .3, .4, .5, .6, .7, .8, .9]:
            clf = BaggingClassifier(DecisionTreeClassifier(random_state=RANDOM_STATE), random_state=RANDOM_STATE, n_jobs=6, n_estimators=n_estimators
                                    , max_samples = max_samples, max_features = max_features)
            clf.fit(X_train, y_train)
            metrics = balanced_accuracy_score(y_test, clf.predict(X_test)), f1_score(y_test, clf.predict(X_test), average = "weighted")
            # Find max model (and update variable). Priority is given to balanced accuracy because it tends to be lower than f1.
            models.append(clf), metricss.append(metrics)
            if metrics[0] >= max_metrics[0] and metrics[1] > max_metrics[1]:
                max_metrics = metrics
                max_model = clf
            print({'n_estimators': n_estimators, 'max_samples': max_samples, 'max_features': max_features}, metrics)
print(max_model)

# The best number of estimators looked like max_features=0.3, max_samples=0.9, n_estimators=200 :
#                       BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=RANDOM_STATE), max_features=0.3, max_samples=0.9, n_estimators=200, n_jobs=6, random_state=RANDOM_STATE).
# Scores (computed below for all classifier ensembles): F1 Weighted = 0.8149, Balanced Accuracy = 0.8008.
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
"""
ens3 = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=RANDOM_STATE),
                         max_features=0.3, max_samples=0.9, n_estimators=200, n_jobs=6,
                         random_state=RANDOM_STATE) # Random Patches: 90% of samples and 30% of features used.

""" Hyper-parameter tuning process:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
params = {
'criterion': ["gini", "entropy"],
'max_depth': [2, 10, 25, 50, 70, 100, 250, 500, 600, 700, 800, 1000, None],
'max_features': [None] + list(range(2, 4096, 100)),
'min_samples_leaf': [1, 5, 7, 10, 30, 40, 50, 70, 80, 100, 200],
'min_samples_split': [2, 5, 7, 10, 30, 40, 50, 70, 80, 100, 200],
'max_leaf_nodes': [2, 10, 50, 100, 200, 300, 500, 700, 900, 1000, 1100, 1200, 1350, 1500, 1700, 2000, 3000, 5000, None]
}

clf = RandomizedSearchCV(DecisionTreeClassifier(random_state = 42), params, n_jobs = 5, n_iter = 500, cv = 5, verbose = 1, scoring=['balanced_accuracy', 'f1_weighted'], refit = 'balanced_accuracy')
# No random state because I tried running it a number of times until it resulted in a passable model.
clf.fit(X, y)
print(clf.best_estimator_)  # DecisionTreeClassifier(criterion='entropy', max_depth=25, max_features=1102, max_leaf_nodes=900, min_samples_leaf=50, random_state=RANDOM_STATE).
print('bacc =', clf.best_score_)

# Scores (computed below for all classifier ensembles): F1 Weighted = 0.7267, Balanced Accuracy = 0.7148.
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
"""
tree = DecisionTreeClassifier(criterion='entropy', max_depth=25, max_features=1102, max_leaf_nodes=900, min_samples_leaf=50, random_state=RANDOM_STATE) # Simple tree classifier with optimized parameters.

ens1_scores = cross_validate(ens1, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=1, verbose=5)
ens2_scores = cross_validate(ens2, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=1, verbose=5)
ens3_scores = cross_validate(ens3, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=1, verbose=5)
tree_scores = cross_validate(tree, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=1, verbose=5)


f_measures = dict()
accuracies = dict()
# Example f_measures = {'Simple Decision': 0.8551, 'Ensemble with random ...': 0.92, ...}
f_measures = {'Simple Decision': np.average(tree_scores['test_f1_weighted']), 'Plain Bagging': np.average(ens1_scores['test_f1_weighted']), 'Pasting': np.average(ens2_scores['test_f1_weighted']),
              'Random Patches': np.average(ens3_scores['test_f1_weighted'])}
             # Should be {'Simple Decision': 0.7267, 'Plain Bagging': 0.8062, 'Pasting': 0.8114, 'Random Patches': 0.8149}
accuracies = {'Simple Decision': np.average(tree_scores['test_balanced_accuracy']), 'Plain Bagging': np.average(ens1_scores['test_balanced_accuracy']), 'Pasting': np.average(ens2_scores['test_balanced_accuracy']),
              'Random Patches': np.average(ens3_scores['test_balanced_accuracy'])}
             # Should be {'Simple Decision': 0.7148, 'Plain Bagging': 0.792, 'Pasting': 0.7975, 'Random Patches': 0.8008}

#END CODE HERE

In [10]:
print(ens1)
print(ens2)
print(ens3)
print(tree)
for name,score in f_measures.items():
    print("Classifier:{} -  F1 Weighted:{}".format(name,round(score,4)))
for name,score in accuracies.items():
    print("Classifier:{} -  BalancedAccuracy:{}".format(name,round(score,4)))

BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=42),
                  n_estimators=155, n_jobs=6, random_state=42)
BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=42),
                  max_samples=0.68, n_estimators=350, n_jobs=6,
                  random_state=42)
BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=42),
                  max_features=0.3, max_samples=0.9, n_estimators=200, n_jobs=6,
                  random_state=42)
DecisionTreeClassifier(criterion='entropy', max_depth=25, max_features=1102,
                       max_leaf_nodes=900, min_samples_leaf=50,
                       random_state=42)
Classifier:Simple Decision -  F1 Weighted:0.7267
Classifier:Plain Bagging -  F1 Weighted:0.8062
Classifier:Pasting -  F1 Weighted:0.8114
Classifier:Random Patches -  F1 Weighted:0.8149
Classifier:Simple Decision -  BalancedAccuracy:0.7148
Classifier:Plain Bagging -  BalancedAccuracy:0.792
Classifier:Pasting -  Balance

**2.2** Describe your classifiers and your results.

YOUR ANSWER HERE
***
There are four classifiers tuned by using either gridsearch or holdout:
1. **Simple Decision**: A decision tree classifier which has been optimized (having the optimal hyperparameters after tuning). It cannot model the data as well as the ensemble models and has the worse scores in all the metrics (F1 Weighted:0.7267, BalancedAccuracy:0.7148).
2. **Plain Bagging**: A bagging ensemble having a plain (with no parameters deviating from the default) tree as a base model which has been tuned to maximize the two metrics. The number of estimators is 155, because any additional wouldn't result in (significantly) better scores. Its scores are much better than the simple decision tree (F1 Weighted:0.8062, BalancedAccuracy:0.792), meaning it overcame some of its problems.
3. **Pasting**: A bagging ensemble having a plain (with no parameters deviating from the default) tree as a base model which uses pasting having every base model use 68% of the training data given and number of estimators = 350 (these numbers were calculated during tuning). We can see that it was more successful than the plain tree having better scores than it (F1 Weighted:0.8114, BalancedAccuracy:0.7975), and slight better than plain bagging.
4. **Random Patches**: A bagging ensemble having a plain (with no parameters deviating from the default) tree as a base model which uses random patches having every base model use 90% of the training data and 30% of the features (of the data) given and number of estimators = 200 (these numbers were calculated during tuning). We can see that it was the most successful model overall (F1 Weighted:0.8149, BalancedAccuracy:0.8008) but close enough to the pasting ensemble to be considered its equivalent.

From the results above, it can be observed that **all the ensembles have better results** than the plain (base) decision tree.

If we were to pick one of these models (as the best one), it would be the random patches classifier due to the fact that it appears to have a (very) slightly better score than its counterparts.
However, since the difference is so slight, it can be argued that the tree ensemble models are equivalent and picking any one of them would be about the same, but still much better than the ordinary decision tree.

Also, we can observe that the tree ensembles were not as successful as the other ensembles constructed in 1.1 and 1.2, since even with a lot of classifiers and tuning they (in the best case) reached about 81% F1 score and 80% balanced accuracy while the others (the ones in 1.1, 1.2) reached scores above 82% in both metrics with a much smaller number of classifiers.

(Note: For some reason I cannot find a solution to, each time the notebook is run, different results arise, even though there should be the fixed random state in all models.
So, I record all my findings the last time run (you should be able to see the saved output of the cell above) and my conclusions.
Also, it should also be noted that after each run the above conclusions can be still made: All the ensembles are always better than the plain tree and have about the same scores.)

**2.3** Increasing the number of estimators in a bagging classifier can drastically increase the training time of a classifier. Is there any solution to this problem? Can the same solution be applied to boosting classifiers?

YOUR ANSWER HERE
***

#### About the bagging classifiers
It is true that increasing the number of estimators in a bagging classifier will result in higher a training time.

1. Using parallelism.
  *   The simplest way to combat this is to use multiple threads and cores (the **n_jobs** parameter in sklearn) and as many possible ideally (-1 to use) all available CPU threads, so that many models can be trained in parallel and reduce the total time spent training.

  *   However, even though increasing the n_jobs parameter does reduce the ensemble's total training time, it **is highly dependent on the computer's CPU capabilities**, available threads and current workload (among other things) and  may not work for every situation. But, if it is available to us and is effective for the problem at hand, it is the best choice around (it speeds up training without compromising it).

2. Using a subset of the given data.
  *   Alternatively, the number of samples used in each model's training can be reduced (**max_samples** parameter in sklearn can be used (change the default 1.0 to a lower number) to reduce the number of training examples each sub-model uses in training). This should reduce the burden of training each model and, as the result, reduce the ensemble's overall training time.
  *   Similarly, the number of features used in each model's training can be reduced, (the max_features in sklearn (change the default 1.0 to a lower number)), expecting similar results.
  *   However, choosing to do one (or both) of the two above options, changes the set of training data given to each sub-model, having the **risk of changing the model's results** or even making the ensemble model as a whole less accurate since some important information has a higher chance of getting lost in the (re)sampling.
3. Picking the best number of estimators.

    Lastly, infinitely increasing the number of estimators won't result into an infinitely "better" model since its metrics will, eventually, reach a plateau and won't increase or maybe will even decrease.
    So, we can set the number of estimators to the one that maximizes the important metrics after searching for it, either by hand (or by graphing the results) or using a more methodic approach such as GridSearchCV for a wide range of estimator numbers.

    Even though this solution can reduce training time, it cannot solve the underlying problem of the time needing to train each sub-model, especially if the "best" number of estimators is high.



#### About the boosting classifiers
Mirroring the solutions above, about the solutions for training time increase caused by increasing the number of estimators in a boosting classifier:
1. Using parallelism.
  *   Unfortunately, due to the nature of the boosting classifiers requiring one model having the results of the previous model's training, parallelism is **impossible** for these kinds of classifiers.
2. Using a subset of the given data.
  *   This too is **not possible** (in the same way as in bagging classifiers), because each sub-model learns from the previous one's errors, so no data can be missing.
3. Picking the best number of estimators.
  *   Fortunately, this solution **can be used**, but as stated above, it cannot solve the problem entirely, but it can help somewhat.
  *   Additionally, if available, the **learning rate training parameter** can be increased, so that the classifier reaches a low total error more quickly and in that case the number of estimators can be decreased to accommodate for the increased learning speed (since it lowers the error faster, it needs fewer steps to take and thus, a lower number of estimators).
    But, this too isn't always the best choice since it might compromise the learning process and result in a worse model.

However, there as boosting classifiers such as the Gradient Boosting Classifier that can use approximation (like XGBClassifier, for example) to speed up the training process. This may potentially yield worse results than the normal training process but is much quicker because of the approximation.



## 3.0 Creating the best classifier ##

**3.1** In this part of the assignment you are asked to train the best possible ensemble! Describe the process you followed to achieve this result. How did you choose your classifier and your parameters and why. Report the f-measure (weighted) & balanced accuracy (10-fold cross validation) of your final classifier and results of classifiers you tried in the cell following the code. Can you achieve a balanced accuracy over 83-84%?

In [None]:
# BEGIN CODE HERE
from sklearn.naive_bayes import GaussianNB

""" Tuning:
from matplotlib import pyplot as plt
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=RANDOM_STATE)
all_metrics = {'n_estimators':{}, 'max_features': {}, 'max_samples': {}}
for n_estimators in [1, 10, 25, 50, 60]:
    all_metrics['n_estimators'][n_estimators] = {'train': [], 'test': []}
    for max_features in [.2, .4, .7, .9, 1.]:
        if max_features not in all_metrics['max_features']:
            all_metrics['max_features'][max_features] = {'train': [], 'test': []}
        for max_samples in [.5, .7, .85, .9, 1.]:
            if max_samples not in all_metrics['max_samples']:
                all_metrics['max_samples'][max_samples] = {'train': [], 'test': []}

            clf = BaggingClassifier(base_estimator=LogisticRegression(random_state=RANDOM_STATE, solver='sag', tol=0.05),
                     max_features=max_features, max_samples=max_samples, n_estimators=n_estimators, n_jobs=4, random_state=RANDOM_STATE)
            clf.fit(X_train, y_train)

            bal_acc_train = balanced_accuracy_score(y_train, clf.predict(X_train))
            bal_acc_test = balanced_accuracy_score(y_test, clf.predict(X_test))

            print(dict(max_features=max_features, max_samples=max_samples, n_estimators=n_estimators), end='\n\t\t')
            print(dict(bal_acc_train=bal_acc_train, bal_acc_test=bal_acc_test), end='\n\t\t')
            print(dict(f1_train=f1_score(y_train, clf.predict(X_train), average='weighted'),
                  f1_test=f1_score(y_test, clf.predict(X_test), average='weighted')))
            all_metrics['n_estimators'][n_estimators]['train'].append(bal_acc_train)
            all_metrics['n_estimators'][n_estimators]['test'].append(bal_acc_test)
            all_metrics['max_features'][max_features]['train'].append(bal_acc_train)
            all_metrics['max_features'][max_features]['test'].append(bal_acc_test)
            all_metrics['max_samples'][max_samples]['train'].append(bal_acc_train)
            all_metrics['max_samples'][max_samples]['test'].append(bal_acc_test)

# Plot to arrive to conclusions.
plotx = sorted(all_metrics['n_estimators'].keys())
ploty = [np.average(all_metrics['n_estimators'][i]['train']) for i in plotx]
plt.plot(plotx, ploty, 'r', label = 'Train')

plotx = sorted(all_metrics['n_estimators'].keys())
ploty = [np.average(all_metrics['n_estimators'][i]['test']) for i in plotx]
plt.plot(plotx, ploty, 'g', label = 'Test')

plt.title('Average balanced accuracy for n_estimators parameter')
plt.legend()
plt.show()


plotx = sorted(all_metrics['max_features'].keys())
ploty = [np.average(all_metrics['max_features'][i]['train']) for i in plotx]
plt.plot(plotx, ploty, 'r', label = 'Train')

plotx = sorted(all_metrics['max_features'].keys())
ploty = [np.average(all_metrics['max_features'][i]['test']) for i in plotx]
plt.plot(plotx, ploty, 'g', label = 'Test')

plt.title('Average balanced accuracy for max_features parameter')
plt.legend()
plt.show()


plotx = sorted(all_metrics['max_samples'].keys())
ploty = [np.average(all_metrics['max_samples'][i]['train']) for i in plotx]
plt.plot(plotx, ploty, 'r', label = 'Train')

plotx = sorted(all_metrics['max_samples'].keys())
ploty = [np.average(all_metrics['max_samples'][i]['test']) for i in plotx]
plt.plot(plotx, ploty, 'g', label = 'Test')

plt.title('Average balanced accuracy for max_samples parameter')
plt.legend()
plt.show()

# ---- After we see the graphs, it can be observed that, in average, the best parameters that optimize both train and test scores (balanced accuracy because it is the hardest to raise and it is observed that with high accuracy we have high f1 as well) are:
#     * n_estimators = 25, which is a local peak, after which train scores don't increase significantly and test slowly fall off.
#     * max_features = 0.7, since it is near the peak and after (hand) testing final ensembles of models, the submodel that had the best scores had this parameter.
#     * max_samples = 0.85 since it is near the peak and after (hand) testing final ensembles of models, the submodel that had the best scores had this parameter.

# As for validation:
clf = BaggingClassifier(base_estimator=LogisticRegression(random_state=RANDOM_STATE, solver='sag', tol=0.05),
                        max_features=0.7, max_samples=0.85, n_estimators=25, n_jobs=6, random_state=RANDOM_STATE)
scores = cross_validate(clf, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=1, verbose=5)
print('avg_fmeasure =', np.average(scores["test_f1_weighted"]) )# The average f-measure -- Should be ~ 0.847500.
print('avg_accuracy =', np.average(scores["test_balanced_accuracy"])) # The average accuracy -- Should be ~ 0.840846.
"""
clf1 = BaggingClassifier(base_estimator=LogisticRegression(random_state=RANDOM_STATE, solver='sag', tol=0.05), max_features=0.7, max_samples=0.85, n_estimators=25, n_jobs=6, random_state=RANDOM_STATE)


""" Tuning:
from matplotlib import pyplot as plt
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RANDOM_STATE)
all_metrics = {'tol': {}}
for tol in [1e-3, 1e-2, 5e-2, 1e-1, 0.5]:
    if tol not in all_metrics['tol']:
        all_metrics['tol'][tol] = {'train': [], 'test': []}

    clf = LinearSVC(tol=tol, random_state=RANDOM_STATE)
    clf.fit(X_train, y_train)

    bal_acc_train = balanced_accuracy_score(y_train, clf.predict(X_train))
    bal_acc_test = balanced_accuracy_score(y_test, clf.predict(X_test))

    print(dict(tol=tol), end='\n\t\t')
    print(dict(bal_acc_train=bal_acc_train, bal_acc_test=bal_acc_test), end='\n\t\t')
    print(dict(f1_train=f1_score(y_train, clf.predict(X_train), average='weighted'),
          f1_test=f1_score(y_test, clf.predict(X_test), average='weighted')))
    all_metrics['tol'][tol]['train'].append(bal_acc_train)
    all_metrics['tol'][tol]['test'].append(bal_acc_test)

# Plot to arrive to conclusions.
plotx = sorted(all_metrics['tol'].keys())
ploty = [np.average(all_metrics['tol'][i]['train']) for i in plotx]
plt.plot(plotx, ploty, 'r', label = 'Train')
plotx = sorted(all_metrics['tol'].keys())
ploty = [np.average(all_metrics['tol'][i]['test']) for i in plotx]
plt.plot(plotx, ploty, 'g', label = 'Test')
plt.title('Average balanced accuracy for tol parameter')
plt.legend()
plt.show()

# ---- After we see the graphs, it can be observed that all choices seem extremely close, so (since it is a simple model compared to others) the model choice will be pretty much the same.
# So, after (hand) testing final ensembles of models, the submodel that had the best scores was the LinearSVC(tol=0.05, random_state=RANDOM_STATE), and so tol=0.05 was selected.

clf = LinearSVC(tol=0.05, random_state=RANDOM_STATE)
scores = cross_validate(clf, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=1, verbose=5)
print('avg_fmeasure =', np.average(scores["test_f1_weighted"]) )# The average f-measure -- Should be ~ 0.810540.
print('avg_accuracy =', np.average(scores["test_balanced_accuracy"])) # The average accuracy -- Should be ~ 0.804951.
"""
clf2 = LinearSVC(tol=0.05, random_state=RANDOM_STATE)

""" Tuning:
from matplotlib import pyplot as plt
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RANDOM_STATE)
all_metrics = {'n_estimators': {}, 'learning_rate': {}}
for n_estimators in [1, 10, 50, 100, 113, 150, 200, 300]:
    if n_estimators not in all_metrics['n_estimators']:
        all_metrics['n_estimators'][n_estimators] = {'train': [], 'test': []}
    for learning_rate in [.1, .2, .5, .7, 1.]:
        if learning_rate not in all_metrics['learning_rate']:
            all_metrics['learning_rate'][learning_rate] = {'train': [], 'test': []}

        clf = GradientBoostingClassifier(n_estimators=n_estimators, learning_rate = learning_rate, verbose = 0)
        clf.fit(X_train, y_train)

        bal_acc_train = balanced_accuracy_score(y_train, clf.predict(X_train))
        bal_acc_test = balanced_accuracy_score(y_test, clf.predict(X_test))

        print(dict(n_estimators=n_estimators, learning_rate=learning_rate), end='\n\t\t')
        print(dict(bal_acc_train=bal_acc_train, bal_acc_test=bal_acc_test), end='\n\t\t')
        print(dict(f1_train=f1_score(y_train, clf.predict(X_train), average='weighted'),
              f1_test=f1_score(y_test, clf.predict(X_test), average='weighted')))

        all_metrics['n_estimators'][n_estimators]['train'].append(bal_acc_train)
        all_metrics['n_estimators'][n_estimators]['test'].append(bal_acc_test)
        all_metrics['learning_rate'][learning_rate]['train'].append(bal_acc_train)
        all_metrics['learning_rate'][learning_rate]['test'].append(bal_acc_test)

# Plot to arrive to conclusions.
plotx = sorted(all_metrics['n_estimators'].keys())
ploty = [np.average(all_metrics['n_estimators'][i]['train']) for i in plotx]
plt.plot(plotx, ploty, 'r', label = 'Train')
plotx = sorted(all_metrics['n_estimators'].keys())
ploty = [np.average(all_metrics['n_estimators'][i]['test']) for i in plotx]
plt.plot(plotx, ploty, 'g', label = 'Test')
plt.title('Average balanced accuracy for n_estimators parameter')
plt.legend()
plt.show()
plotx = sorted(all_metrics['learning_rate'].keys())
ploty = [np.average(all_metrics['learning_rate'][i]['train']) for i in plotx]
plt.plot(plotx, ploty, 'r', label = 'Train')
plotx = sorted(all_metrics['learning_rate'].keys())
ploty = [np.average(all_metrics['learning_rate'][i]['test']) for i in plotx]
plt.plot(plotx, ploty, 'g', label = 'Test')
plt.title('Average balanced accuracy for learning_rate parameter')
plt.legend()
plt.show()

# ---- After we see the graphs, it can be observed that, in average, the best parameters that optimize both train and test scores (balanced accuracy because it is the hardest to raise and it is observed that with high accuracy we have high f1 as well) are:
#     * n_estimators = 113, after which both train and test scores have a rise that does not justify the increased complexity.
#     * learning_rate = 0.2, because the model overtrains and the testing score does not increase significantly.
# As for validation:
clf = GradientBoostingClassifier(n_estimators=113, learning_rate = 0.2, random_state=RANDOM_STATE)
scores = cross_validate(clf, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=1, verbose=5)
print('avg_fmeasure =', np.average(scores["test_f1_weighted"]) )# The average f-measure -- Should be ~ 0.829364.
print('avg_accuracy =', np.average(scores["test_balanced_accuracy"])) # The average accuracy -- Should be ~ 0.819656.
"""
clf3 = GradientBoostingClassifier(n_estimators=113, learning_rate = 0.2, verbose = 0, random_state=RANDOM_STATE) # Classifier #2

""" Tuning:
from matplotlib import pyplot as plt
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RANDOM_STATE)
all_metrics = {'hidden_layer_sizes': {}}
for hidden_layer_sizes in [1, 10, 50, 100, 200, 300]:
    all_metrics['hidden_layer_sizes'][hidden_layer_sizes] = {'train': [], 'test': []}
    clf = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes, random_state=RANDOM_STATE)
    clf.fit(X_train, y_train)
    bal_acc_train = balanced_accuracy_score(y_train, clf.predict(X_train))
    bal_acc_test = balanced_accuracy_score(y_test, clf.predict(X_test))
    f1_train = f1_score(y_train, clf.predict(X_train), average='weighted')
    f1_test = f1_score(y_test, clf.predict(X_test), average='weighted')
    print(dict(hidden_layer_sizes=hidden_layer_sizes), end='\n\t\t')
    print(dict(bal_acc_train=bal_acc_train, bal_acc_test=bal_acc_test), end='\n\t\t')
    print(dict(f1_train=f1_train, f1_test=f1_test))
    all_metrics['hidden_layer_sizes'][hidden_layer_sizes]['train'].append(bal_acc_train)
    all_metrics['hidden_layer_sizes'][hidden_layer_sizes]['test'].append(bal_acc_test)

# Plot to arrive to conclusions.
plotx = sorted(all_metrics['hidden_layer_sizes'].keys())
ploty = [np.average(all_metrics['hidden_layer_sizes'][i]['train']) for i in plotx]
plt.plot(plotx, ploty, 'r', label='Train')
plotx = sorted(all_metrics['hidden_layer_sizes'].keys())
ploty = [np.average(all_metrics['hidden_layer_sizes'][i]['test']) for i in plotx]
plt.plot(plotx, ploty, 'g', label='Test')
plt.title('Average balanced accuracy for hidden_layer_sizes parameter')
plt.legend()
plt.show()


# ---- After we see the graphs, it can be observed that, in average, the best parameters that optimize both train and test scores (balanced accuracy because it is the hardest to raise and it is observed that with high accuracy we have high f1 as well) is:
#     * hidden_layer_sizes = 200, after which train scores stay flat and test scores slowly drop.
# As for validation:
clf = MLPClassifier(hidden_layer_sizes = 200, random_state=RANDOM_STATE)
scores = cross_validate(clf, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=1, verbose=5)
print('avg_fmeasure =', np.average(scores["test_f1_weighted"]) )# The average f-measure -- Should be ~ 0.858986.
print('avg_accuracy =', np.average(scores["test_balanced_accuracy"])) # The average accuracy -- Should be ~ 0.852655.
"""
clf4 = MLPClassifier(hidden_layer_sizes = 200, random_state=RANDOM_STATE)

"""
# Trying out different ensemble combinations to form a new ensemble.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RANDOM_STATE)
hard_vcls = VotingClassifier([('bagging_logistic', clf1), ('svc', clf2), ('grad_boosting', clf3), ('mlp', clf4)], voting="hard") # Hard voting Classifier.
scls_lr = StackingClassifier(estimators=[('bagging_logistic', clf1), ('svc', clf2), ('grad_boosting', clf3), ('mlp', clf4)],
                             final_estimator=LogisticRegression(random_state=RANDOM_STATE), n_jobs=1) # Stacking Classifier.
scls_gnb = StackingClassifier(estimators=[('bagging_logistic', clf1), ('svc', clf2), ('grad_boosting', clf3), ('mlp', clf4)],
                             final_estimator=GaussianNB(), n_jobs=1) # Stacking Classifier.
scls_knn = StackingClassifier(estimators=[('bagging_logistic', clf1), ('svc', clf2), ('grad_boosting', clf3), ('mlp', clf4)],
                             final_estimator=KNeighborsClassifier(n_neighbors=5), n_jobs=1) # Stacking Classifier.
scls_dt = StackingClassifier(estimators=[('bagging_logistic', clf1), ('svc', clf2), ('grad_boosting', clf3), ('mlp', clf4)],
                             final_estimator=DecisionTreeClassifier(random_state=RANDOM_STATE), n_jobs=1) # Stacking Classifier.

# Pick the model with the maximum balanced accuracy (since usually (in the tuning tests above) it is lower than f1 and harder to raise while f1 can usually be raised to a good enough point).
# Also, results will be printed to verify the above and ensure a good model.

validate_best_accuracy = -1  # Initial (non-achievable) low max.
validate_best_cls = hard_vcls

clf_ens = hard_vcls.fit(X_train, y_train)
ba = balanced_accuracy_score(y_test, clf_ens.predict(X_test))
print({'balanced_accuracy': ba, 'f1_weighted': f1_score(y_test, clf_ens.predict(X_test), average="weighted")})
if ba > validate_best_accuracy:
    validate_best_accuracy = ba
    validate_best_cls = hard_vcls

clf_ens = scls_lr.fit(X_train, y_train)
ba = balanced_accuracy_score(y_test, clf_ens.predict(X_test))
print({'balanced_accuracy': ba, 'f1_weighted': f1_score(y_test, clf_ens.predict(X_test), average="weighted")})
if ba > validate_best_accuracy:
    validate_best_accuracy = ba
    validate_best_cls = scls_lr

clf_ens = scls_gnb.fit(X_train, y_train)
ba = balanced_accuracy_score(y_test, clf_ens.predict(X_test))
print({'balanced_accuracy': ba, 'f1_weighted': f1_score(y_test, clf_ens.predict(X_test), average="weighted")})
if ba > validate_best_accuracy:
    validate_best_accuracy = ba
    validate_best_cls = scls_gnb

clf_ens = scls_knn.fit(X_train, y_train)
ba = balanced_accuracy_score(y_test, clf_ens.predict(X_test))
print({'balanced_accuracy': ba, 'f1_weighted': f1_score(y_test, clf_ens.predict(X_test), average="weighted")})
if ba > validate_best_accuracy:
    validate_best_accuracy = ba
    validate_best_cls = scls_knn

clf_ens = scls_dt.fit(X_train, y_train)
ba = balanced_accuracy_score(y_test, clf_ens.predict(X_test))
print({'balanced_accuracy': ba, 'f1_weighted': f1_score(y_test, clf_ens.predict(X_test), average="weighted")})
if ba > validate_best_accuracy:
    validate_best_accuracy = ba
    validate_best_cls = scls_dt

best_cls = validate_best_cls
print(best_cls)

# The best classifier was hard_vcls.
"""

# Best clf is the one that had the best balanced accuracy score.
best_cls = VotingClassifier([('bagging_logistic', clf1), ('svc', clf2), ('grad_boosting', clf3), ('mlp', clf4)], voting="hard") # Hard voting Classifier.

print(best_cls)

scores = cross_validate(best_cls, X, y, cv = 10, scoring=["f1_weighted", "balanced_accuracy"], n_jobs=1, verbose=5)
best_fmeasure = np.average(scores['test_f1_weighted'])  # Should be ~ 0.8575316429138551.
best_accuracy = np.average(scores['test_balanced_accuracy'])  # Should be ~ 0.8550719461773506.

#END CODE HERE

In [15]:
print("Classifier:")
#print(best_cls)
print("F1 Weighted-Score:{} & Balanced Accuracy:{}".format(best_fmeasure, best_accuracy))

Classifier:
F1 Weighted-Score:0.8575316429138551 & Balanced Accuracy:0.8550719461773506


**3.2** Describe the process you followed to achieve this result. How did you choose your classifier and your parameters and why. Report the f-measure & accuracy (10-fold cross validation) of your final classifier and results of classifiers you tried in the cell following the code.

YOUR ANSWER HERE
***
### **Process**
1. Firstly, I tried (tested interactively in the console (some of these tests are recorded below) and with default hyperparameters) normal and bagged "versions" of all the classifiers (both simple and complex) mentioned in this notebook (specifically, in "1.0 Testing different ensemble methods").
2. From these tests, I chose some well performing models and tuned them (you can see the tuning process in the multiline comments above each classifier as well as the results and their explanations) to be as accurate as possible (as good base models should lead to good ensembles).
3. Then, these were combined in different ways as an ensemble (this can be seen in the cell above where the tests for different model mixing ways (hard voting, stacking with logistic regression, etc.).
4. After that, after reaching a good result (>85% but still a little lower than 1.2, having that as the baseline), I tried slightly different well-performing models as base models and/or tweaking some hyperparameters of present base models (base on the results on their individual tuning) trying to get as good final scores as possible (and of course to get something better than 1.2 results). This is because it was observed that the absolute best performing models and best tuning of individual (base) models of the ensemble didn't necessarily translate to the best ensemble.
5. Finally, after already having a good model and not reaching a better one with the tests above, I stick with the best achieved so far.
6. In conclusion the model was not too much better than the one in 1.2 but has a (slightly) better Balanced Accuracy and about the same f1 score.

We can see that a balanced accuracy > 84% was achieved as well as F1 weighted > 85%.

### **Results of attempted base / simple ensemble classifiers (scores obtained with 10-fold validation).**
* `DecisionTreeClassifier(random_state=42)`

  Classifier:

  F1 Weighted-Score:0.6863592882008598 & Balanced Accuracy:0.6774439103601141
* `LogisticRegression(random_state=42, solver='sag', tol=0.05)`

  Classifier:

  F1 Weighted-Score:0.852717589259744 & Balanced Accuracy:0.8463461294771439
* `KNeighborsClassifier()`

  Classifier:

  F1 Weighted-Score:0.8037038987819196 & Balanced Accuracy:0.7962067223144675
* `GaussianNB()`

  Classifier:

  F1 Weighted-Score:0.6880036235582245 & Balanced Accuracy:0.6684547863432078
* `LinearSVC(random_state=42, tol=0.05)`

  Classifier:

  F1 Weighted-Score:0.8132495370951286 & Balanced Accuracy:0.8072063253684082
* `MLPClassifier(random_state=42)`

  Classifier:

  F1 Weighted-Score:0.8552265578078104 & Balanced Accuracy:0.8481423086810057
* `RandomForestClassifier(n_jobs=6, random_state=42)`

  Classifier:

  F1 Weighted-Score:0.7998545059385732 & Balanced Accuracy:0.7849452081564534
* `BaggingClassifier(base_estimator=LogisticRegression(random_state=42, solver='sag', tol=0.05), n_jobs=6, random_state=42)`

  Classifier:

  F1 Weighted-Score:0.851510234473268 & Balanced Accuracy:0.8448135971300434
* `BaggingClassifier(base_estimator=KNeighborsClassifier(), random_state=42)`

  Classifier:

  F1 Weighted-Score:0.8040109139580689 & Balanced Accuracy:0.797505718022774
* `BaggingClassifier(base_estimator=GaussianNB(), n_jobs=6, random_state=42)`

  Classifier:

  F1 Weighted-Score:0.6889119940989887 & Balanced Accuracy:0.669408426557572
* `BaggingClassifier(base_estimator=LinearSVC(random_state=42, tol=0.05), n_jobs=4, random_state=42)`

  Classifier:

  F1 Weighted-Score:0.8390275836624488 & Balanced Accuracy:0.8352138257160011
* `BaggingClassifier(base_estimator=MLPClassifier(hidden_layer_sizes=20, random_state=42), random_state=42)`

  Classifier:

  F1 Weighted-Score:0.8578569784546044 & Balanced Accuracy:0.8518975756890494
* `BaggingClassifier(base_estimator=RandomForestClassifier(n_jobs=6, random_state=42), random_state=42)`

  Classifier:

  F1 Weighted-Score:0.7968607680608969 & Balanced Accuracy:0.7796641415284126
* `GradientBoostingClassifier(random_state=42)`

  Classifier:

  F1 Weighted-Score:0.8185840971893368 & Balanced Accuracy:0.8075612858804577
* `BaggingClassifier(base_estimator=GradientBoostingClassifier(random_state=42), n_jobs=4, random_state=42)`

  Classifier:

  F1 Weighted-Score:0.8203764375551932 & Balanced Accuracy:0.8083381914829314


### **Results of attempted final (complex) classifiers (10 fold cross-validation)**
* `StackingClassifier(estimators=[('bagging_logistic', BaggingClassifier(base_estimator=LogisticRegression(random_state=42, solver='sag', tol=0.05), max_features=0.7, max_samples=0.85, n_estimators=25, n_jobs=6, random_state=42)), ('bagging_sgd', BaggingClassifier(base_estimator=SGDClassifier(random_state=42), max_features=0.9, max_samples=0.5, n_estimators=35, n_jobs=4, random_state=42)), ('bagging_mlp', BaggingClassifier(base_estimator=MLPClassifier(hidden_layer_sizes=5, random_state=42), n_estimators=3, n_jobs=1, random_state=42)), ('grad_boosting', GradientBoostingClassifier(learning_rate=0.2, n_estimators=113))], final_estimator=GaussianNB(), n_jobs=1)`

  Classifier:

  F1 Weighted-Score:0.8539910263036601 & Balanced Accuracy:0.8499916515400712


* `VotingClassifier(estimators=[('bagging_logistic', BaggingClassifier(base_estimator=LogisticRegression(random_state=42, solver='sag', tol=0.05), max_features=0.7, max_samples=0.85, n_estimators=25, n_jobs=6, random_state=42)), ('svc', LinearSVC(random_state=42, tol=0.05)), ('grad_boosting', GradientBoostingClassifier(learning_rate=0.2, n_estimators=113)), ('mlp', MLPClassifier(hidden_layer_sizes=200, random_state=42))])`

  Classifier:

  F1 Weighted-Score:0.8594937452161666 & Balanced Accuracy:0.8574680661770182

* `StackingClassifier(estimators=[('bagging_logistic', BaggingClassifier(base_estimator=LogisticRegression(random_state=42, solver='sag', tol=0.05), max_features=0.7, n_estimators=15, n_jobs=6, random_state=42)), ('mlp', MLPClassifier(hidden_layer_sizes=113, random_state=42))], final_estimator=LogisticRegression(random_state=42), n_jobs=1)`

  Classifier:

  F1 Weighted-Score:0.8567770768849808 & Balanced Accuracy:0.8503984450908069
* `StackingClassifier(estimators=[('bagging_logistic', BaggingClassifier(base_estimator=LogisticRegression(random_state=42, solver='sag', tol=0.05), max_features=0.7, n_estimators=15, n_jobs=6, random_state=42)), ('svc', LinearSVC(random_state=42, tol=0.05)), ('grad_boosting', GradientBoostingClassifier(learning_rate=0.25, n_estimators=113)), ('mlp', MLPClassifier(hidden_layer_sizes=113, random_state=42))], final_estimator=GaussianNB(), n_jobs=1)`

  Classifier:

  F1 Weighted-Score:0.8567441188433959 & Balanced Accuracy:0.8528436948615201

* `StackingClassifier(estimators=[('bagging_logistic', BaggingClassifier(base_estimator=LogisticRegression(random_state=42, solver='sag', tol=0.05), max_features=0.7, max_samples=0.85, n_estimators=25, n_jobs=6, random_state=42)), ('svc', LinearSVC(random_state=42, tol=0.05)), ('random_forest', RandomForestClassifier(n_estimators=200, n_jobs=6, random_state=42))], final_estimator=GaussianNB(), n_jobs=1)`

  Classifier:

  F1 Weighted-Score:0.8492895967124776 & Balanced Accuracy:0.8444886288625206


### Results of the final classifier (10-fold cross-validation)
Model: `VotingClassifier(estimators=[('bagging_logistic',BaggingClassifier(base_estimator=LogisticRegression(random_state=42,solver='sag',tol=0.05),max_features=0.7,max_samples=0.85,n_estimators=25,n_jobs=6,random_state=42)),('svc',LinearSVC(random_state=42,tol=0.05)),('grad_boosting',GradientBoostingClassifier(learning_rate=0.2,n_estimators=113,random_state=42)),('mlp',MLPClassifier(hidden_layer_sizes=200,random_state=42))])`



Metrics (10-fold):
Classifier:
F1 Weighted-Score:0.8575316429138551 & Balanced Accuracy:0.8550719461773506

**3.3** Create a classifier that is going to be used in production - in a live system. Use the *test_set_noclass.csv* to make predictions. Store the predictions in a list.  

In [None]:
# BEGIN CODE HERE
cls = VotingClassifier(estimators=[('bagging_logistic',
                              BaggingClassifier(base_estimator=LogisticRegression(random_state=RANDOM_STATE,
                                                                                  solver='sag',
                                                                                  tol=0.05),
                                                max_features=0.7,
                                                max_samples=0.85,
                                                n_estimators=25, n_jobs=6,
                                                random_state=RANDOM_STATE)),
                             ('svc', LinearSVC(random_state=RANDOM_STATE, tol=0.05)),
                             ('grad_boosting',
                              GradientBoostingClassifier(learning_rate=0.2,
                                                         n_estimators=113,
                                                         random_state=RANDOM_STATE)),
                             ('mlp',
                              MLPClassifier(hidden_layer_sizes=200,
                                            random_state=RANDOM_STATE))])
cls.fit(X,y)
#END CODE HERE
test_set = pd.read_csv("test_set_noclass.csv")
predictions = cls.predict(test_set)

LEAVE HERE ANY COMMENTS ABOUT YOUR CLASSIFIER
***

The same problem as with 2.1 is observed: With each time the notebook is run the resulting accuracies change.

As before, I report my last findings.

#### This following cell will not be executed. The test_set.csv with the classes will be made available after the deadline and this cell is for testing purposes!!! Do not modify it! ###

In [None]:
if False:
  from sklearn.metrics import f1_score, balanced_accuracy_score
  final_test_set = pd.read_csv('test_set.csv')
  ground_truth = final_test_set['CLASS']
  print("Balanced Accuracy: {}".format(balanced_accuracy_score(predictions, ground_truth)))
  print("F1 Weighted-Score: {}".format(f1_score(predictions, ground_truth, average='weighted')))

Both should aim above 85%!