# Model Definition and Evaluation
## Table of Contents
1. [Model Selection](#model-selection)
2. [Feature Engineering](#feature-engineering)
3. [Hyperparameter Tuning](#hyperparameter-tuning)
4. [Implementation](#implementation)
5. [Evaluation Metrics and Comparative Analysis](#evaluation-metrics-and-comparative-analysis)


In [1]:
from scipy.io.arff import loadarff
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy.fft import rfft
from sklearn.preprocessing import LabelEncoder

# ML

## Models
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

## Cross Validation
import time
from sklearn.model_selection import RepeatedStratifiedKFold 
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

## evaluation
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

## Model Selection

For this task, several classical machine learning classifiers are considered:
- Gaussian Naive Bayes
- Support Vector Classifier
- Random Forest Classifier
- K-Neighbors Classifier
- Decision Tree Classifier

These are among the most commonly used classifiers for machine learning classification tasks. They are all implemented in scikit-learn (`sklearn`) and share the same interface, which makes them easy to swap and compare.
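
Because all five classifiers follow the same `fit`/`score` interface, they can be trained and compared in one loop. The following is a minimal sketch under stated assumptions: the data is a synthetic stand-in generated with `make_classification`, not the insect sound data, and the names `candidates` and `scores` are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: this is NOT the insect sound dataset).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Illustrative dictionary of the five candidate models with default settings.
candidates = {
    "gnb": GaussianNB(),
    "svc": SVC(),
    "random_forest": RandomForestClassifier(random_state=1),
    "kNN": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=1),
}

# Every estimator exposes fit() and score(), so one loop handles all of them.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)  # plain accuracy on the held-out part
    print(f"{name}: {scores[name]:.3f}")
```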

## Feature Engineering

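Each recording is converted from the time domain to the frequency domain with a real-valued fast Fourier transform (`rfft`), and the magnitude of each frequency bin is used as a feature. The class labels are encoded as integers with a `LabelEncoder`. To check how much the higher frequencies contribute, a second, smaller feature set is built from the first 150 frequency bins only.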
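
To illustrate the frequency-domain features computed in the next cell, here is a minimal sketch on two synthetic sine waves. The signal length of 600 samples and the chosen frequencies are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
from numpy.fft import rfft

n_samples = 600            # hypothetical signal length
freq_bins = [25, 60]       # hypothetical frequencies, in cycles per signal

# One row per signal, mirroring a table with one recording per row.
t = np.arange(n_samples)
signals = np.stack([np.sin(2 * np.pi * k * t / n_samples) for k in freq_bins])

# rfft works along the last axis; abs() keeps the magnitude and drops the phase.
spectra = np.abs(rfft(signals))
print(spectra.shape)               # (2, 301): n_samples // 2 + 1 bins per signal

# The strongest bin of each row recovers the frequency of its sine wave.
peaks = spectra.argmax(axis=1).tolist()
print(peaks)                       # [25, 60]

# Keeping only the first 150 bins discards the higher frequencies.
spectra_smaller = spectra[:, :150]
print(spectra_smaller.shape)       # (2, 150)
```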


In [2]:
# Load the dataset
data_train = loadarff('../1_DatasetCharacteristics/data/InsectSound_TRAIN.arff')
data_test = loadarff('../1_DatasetCharacteristics/data/InsectSound_TEST.arff')

df_train = pd.DataFrame(data_train[0])
df_test = pd.DataFrame(data_test[0])


# Fast Fourier Transform
X_train = pd.DataFrame(abs(rfft(df_train.drop(columns = "target"))))
X_test = pd.DataFrame(abs(rfft(df_test.drop(columns = "target"))))

# Create an instance of LabelEncoder.
le = LabelEncoder()
# Fit the encoder on the training labels only and reuse the same mapping for the test labels
y_train = le.fit_transform(df_train['target'])
y_test = le.transform(df_test['target'])

# Create a dataset with fewer frequencies (first half) to compare results
X_train_smaller = X_train.iloc[:, :150]
X_test_smaller = X_test.iloc[:, :150]

## Hyperparameter Tuning

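This notebook uses nested cross-validation with a grid search: an inner loop selects the hyperparameters, and an outer loop estimates the performance of the tuned model. To keep the runs comparable, the number of splits and repeats is configured globally at the top of the following cell.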
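
The idea can be shown in a few lines before the full helper function: a `GridSearchCV` object (inner loop) is passed as the estimator to `cross_validate` (outer loop). This is a minimal sketch that assumes synthetic data, a k-nearest-neighbors model, and a tiny grid chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: NOT the insect sound dataset).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Inner loop: picks the hyperparameters. Outer loop: scores the tuned model.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5]},  # illustrative grid
    cv=inner_cv,
    scoring="balanced_accuracy",
)

# The whole search is refitted on the training part of each outer fold,
# so the held-out part never influences the choice of hyperparameters.
cv_results = cross_validate(search, X, y, cv=outer_cv, scoring="balanced_accuracy")
outer_scores = cv_results["test_score"]
print(outer_scores.mean(), outer_scores.std())
```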

In [3]:
# Global configuration (avoids magic numbers)
num_trials = 3
num_inner_repeats = 3
num_inner_splits = 3
num_outer_splits = 3


def nested_cv(estimator, grid, features, targets):

    start = time.time()
    accs = np.zeros((num_trials,num_outer_splits))
    baccs = np.zeros((num_trials,num_outer_splits)) # balanced accuracy
    fit_times = np.zeros((num_trials,num_outer_splits))
    test_times = np.zeros((num_trials,num_outer_splits))

    for i in range(num_trials):
        print("Running Outer CV in iteration ", i , " at time ", time.time()-start)

        # inner CV: selects the best hyperparameters
        inner_cv = RepeatedStratifiedKFold(n_splits=num_inner_splits, n_repeats=num_inner_repeats, random_state=1)

        # outer CV: estimates the performance of the tuned model
        outer_cv = StratifiedKFold(n_splits=num_outer_splits, shuffle=True, random_state=i)

        # grid search tries all combinations of the grid and refits with the best-scoring hyperparameters
        clf = GridSearchCV(estimator=Pipeline([("estimator", estimator)]), param_grid=grid, cv=inner_cv, scoring="balanced_accuracy", n_jobs=8, refit=True)
        # the grid search is rerun on the training part of every outer fold and scored on the held-out part; returns a dictionary
        cv_results = cross_validate(clf, X=features, y=targets, cv=outer_cv, scoring=("accuracy", "balanced_accuracy"), n_jobs=8)

        accs[i] = cv_results["test_accuracy"]
        baccs[i] = cv_results["test_balanced_accuracy"]
        fit_times[i] = cv_results["fit_time"]
        test_times[i] = cv_results["score_time"]
    
    print("Total time : ", time.time()-start, " sec")
    return accs, baccs, fit_times, test_times

def add_results(results, name, accs, baccs, fit_times, test_times):
    row = {"name" : name,
        "accs_mean" : np.mean(accs),
        "accs_std" : np.std(accs),
        "accs_min" : np.min(accs),
        "accs_max" : np.max(accs),
        "baccs_mean" : np.mean(baccs),
        "baccs_std" : np.std(baccs),
        "baccs_min" : np.min(baccs),
        "baccs_max" : np.max(baccs),
        "fit_time" : fit_times.mean(),
        "test_time" : test_times.mean()
    }
    return pd.concat([results, pd.DataFrame(row, index = [0])], ignore_index = True)

def train_prod_model(X_train, y_train, estimator, grid):
    pipe = Pipeline([("estimator",estimator)])
    cv = RepeatedStratifiedKFold(n_splits=num_inner_splits, n_repeats=num_inner_repeats, random_state=1)
    # no scoring is passed, so the grid search selects by the estimator's default score (accuracy)
    clf = GridSearchCV(estimator = pipe, param_grid=grid, cv = cv)
    clf.fit(X_train, y_train)
    return clf

def plot_results(y_test, y_pred, clf) : 
    cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                                display_labels=clf.classes_)
    disp.plot();
    
    return accuracy_score(y_test, y_pred)

results = pd.DataFrame()

## Implementation

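Each classifier is evaluated with the nested cross-validation on the full feature set and on the smaller one. A production model is then tuned on the complete training set with `train_prod_model` and scored on the held-out test set.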
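
The parameter grids in the cells below use keys such as `estimator__var_smoothing`. The prefix comes from the pipeline step name: scikit-learn addresses a parameter of a step as `<step name>__<parameter>`. A minimal sketch of that naming rule, using the same step name `"estimator"` as the helper functions above:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# A one-step pipeline whose only step is called "estimator".
pipe = Pipeline([("estimator", GaussianNB())])

# Parameters of the step are exposed as "<step name>__<parameter>".
has_key = "estimator__var_smoothing" in pipe.get_params()
print(has_key)  # True

# Setting the prefixed key changes the parameter on the wrapped model,
# which is what the grid search does for every grid entry.
pipe.set_params(estimator__var_smoothing=0.1)
smoothing = pipe.named_steps["estimator"].var_smoothing
print(smoothing)  # 0.1
```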


### Gaussian Naive Bayes

In [4]:
gnb_grid = {"estimator__var_smoothing" : [.05, .1, .3, .5, 1]}

#### Full Dataset

In [5]:
gnb = GaussianNB()
accs, baccs, fit_times, test_times = nested_cv(gnb,gnb_grid, X_train, y_train)
results = add_results(results=results, name = "gnb", accs = accs, baccs = baccs, fit_times = fit_times, test_times= test_times)
clf_gnb = train_prod_model(X_train, y_train, gnb, gnb_grid)

print(clf_gnb.best_params_)

accuracy_score(y_test, clf_gnb.predict(X_test))

Running Outer CV in iteration  0  at time  0.0
Running Outer CV in iteration  1  at time  16.274003982543945
Running Outer CV in iteration  2  at time  31.15364980697632
Total time :  45.508363008499146  sec
{'estimator__var_smoothing': 0.1}


0.5992

#### Smaller Dataset

In [6]:
gnb_sm = GaussianNB()

accs, baccs, fit_times, test_times = nested_cv(gnb_sm,gnb_grid, X_train_smaller, y_train)
results = add_results(results=results, name = "gnb_sm", accs = accs, baccs = baccs, fit_times = fit_times, test_times= test_times)

clf_gnb_sm = train_prod_model(X_train_smaller, y_train, gnb_sm, gnb_grid)

print(clf_gnb_sm.best_params_)


accuracy_score(y_test, clf_gnb_sm.predict(X_test_smaller))

Running Outer CV in iteration  0  at time  0.0
Running Outer CV in iteration  1  at time  7.254518270492554
Running Outer CV in iteration  2  at time  14.60692834854126
Total time :  21.830267906188965  sec
{'estimator__var_smoothing': 0.05}


0.5992

### Support Vector Classifier

In [7]:
svc_grid = {"estimator__kernel" : ['linear','rbf'], "estimator__C": [0.1, 1]}

#### Full Dataset

In [8]:
svc = SVC()

accs, baccs, fit_times, test_times = nested_cv(svc,svc_grid, X_train, y_train)
results = add_results(results=results, name = "svc", accs = accs, baccs = baccs, fit_times = fit_times, test_times= test_times)

clf_svc = train_prod_model(X_train, y_train, svc, svc_grid)

print(clf_svc.best_params_)

Running Outer CV in iteration  0  at time  0.0
Running Outer CV in iteration  1  at time  1153.7479915618896
Running Outer CV in iteration  2  at time  2261.4552471637726
Total time :  3371.8896701335907  sec
{'estimator__C': 1, 'estimator__kernel': 'rbf'}


#### Smaller Dataset

In [9]:
svc = SVC()

accs, baccs, fit_times, test_times = nested_cv(svc,svc_grid, X_train_smaller, y_train)
results = add_results(results=results, name = "svc_sm", accs = accs, baccs = baccs, fit_times = fit_times, test_times= test_times)

clf_svc_sm = train_prod_model(X_train_smaller, y_train, svc, svc_grid)

print(clf_svc_sm.best_params_)
accuracy_score(y_test, clf_svc_sm.predict(X_test_smaller))

Running Outer CV in iteration  0  at time  0.0
Running Outer CV in iteration  1  at time  824.3327133655548
Running Outer CV in iteration  2  at time  2268.736865758896
Total time :  3079.410781145096  sec
{'estimator__C': 1, 'estimator__kernel': 'rbf'}


0.728

### Random Forest Classifier

In [10]:
random_forest_grid = {"estimator__n_estimators" : [10, 100, 150, 300]}

#### Full Dataset

In [11]:
# Random Forest Complete DataSet
rfc = RandomForestClassifier(random_state=1)

accs, baccs, fit_times, test_times = nested_cv(rfc,random_forest_grid, X_train, y_train)
results = add_results(results=results, name = "random_forest", accs = accs, baccs = baccs, fit_times = fit_times, test_times= test_times)

clf_rfc = train_prod_model(X_train, y_train, rfc, random_forest_grid)

print(clf_rfc.best_params_)
accuracy_score(y_test, clf_rfc.predict(X_test))

Running Outer CV in iteration  0  at time  0.0
Running Outer CV in iteration  1  at time  603.921338558197
Running Outer CV in iteration  2  at time  1208.6200604438782
Total time :  1817.7600269317627  sec
{'estimator__n_estimators': 300}


0.7356

#### Smaller Dataset

In [12]:
rfc = RandomForestClassifier(random_state=1)

accs, baccs, fit_times, test_times = nested_cv(rfc,random_forest_grid, X_train_smaller, y_train)
results = add_results(results=results, name = "random_forest_sm", accs = accs, baccs = baccs, fit_times = fit_times, test_times= test_times)

clf_rfc_sm = train_prod_model(X_train_smaller, y_train, rfc, random_forest_grid)

print(clf_rfc_sm.best_params_)
accuracy_score(y_test, clf_rfc_sm.predict(X_test_smaller))

Running Outer CV in iteration  0  at time  0.0
Running Outer CV in iteration  1  at time  432.44741201400757
Running Outer CV in iteration  2  at time  868.3287403583527
Total time :  1303.5031235218048  sec
{'estimator__n_estimators': 300}


0.73528

### K-Neighbors Classifier

In [13]:
knn_grid = {"estimator__weights" : ("uniform","distance"), "estimator__n_neighbors": range(1,3)}

#### Full Dataset

In [14]:
knn = KNeighborsClassifier()
accs, baccs, fit_times, test_times = nested_cv(knn,knn_grid, X_train.values, y_train)
results = add_results(results=results, name = "kNN", accs = accs, baccs = baccs, fit_times = fit_times, test_times= test_times)

clf_knn = train_prod_model(X_train.values, y_train, knn, knn_grid)

print(clf_knn.best_params_)
accuracy_score(y_test, clf_knn.predict(X_test.values))

Running Outer CV in iteration  0  at time  0.0
Running Outer CV in iteration  1  at time  37.25303101539612
Running Outer CV in iteration  2  at time  74.22644710540771
Total time :  110.39417338371277  sec
{'estimator__n_neighbors': 1, 'estimator__weights': 'uniform'}


0.63788

#### Smaller Dataset

In [15]:
knn_sm = KNeighborsClassifier()
accs, baccs, fit_times, test_times = nested_cv(knn_sm,knn_grid, X_train_smaller, y_train)
results = add_results(results=results, name = "kNN_sm", accs = accs, baccs = baccs, fit_times = fit_times, test_times= test_times)

clf_knn_sm = train_prod_model(X_train_smaller, y_train, knn_sm, knn_grid)

print(clf_knn_sm.best_params_)
accuracy_score(y_test, clf_knn_sm.predict(X_test_smaller))

Running Outer CV in iteration  0  at time  0.0
Running Outer CV in iteration  1  at time  22.86045265197754
Running Outer CV in iteration  2  at time  45.442070960998535
Total time :  67.5814061164856  sec
{'estimator__n_neighbors': 1, 'estimator__weights': 'uniform'}


0.63104

### Decision Tree Classifier

In [16]:
tree_grid = {"estimator__criterion": ("gini", "entropy"), "estimator__max_depth" : [2**i for i in range(0,7)]}

#### Full Dataset

In [17]:
tree = DecisionTreeClassifier(random_state=1)

accs, baccs, fit_times, test_times = nested_cv(tree,tree_grid, X_train, y_train)
results = add_results(results=results, name = "decision_tree", accs = accs, baccs = baccs, fit_times = fit_times, test_times= test_times)

clf_tree = train_prod_model(X_train, y_train, tree, tree_grid)

print(clf_tree.best_params_)
accuracy_score(y_test, clf_tree.predict(X_test))

Running Outer CV in iteration  0  at time  0.0
Running Outer CV in iteration  1  at time  447.24935030937195
Running Outer CV in iteration  2  at time  893.8723459243774
Total time :  1340.5227127075195  sec
{'estimator__criterion': 'entropy', 'estimator__max_depth': 8}


0.60628

#### Smaller Dataset

In [18]:
tree = DecisionTreeClassifier(random_state=1)

accs, baccs, fit_times, test_times = nested_cv(tree,tree_grid, X_train_smaller, y_train)
results = add_results(results=results, name = "decision_tree_sm", accs = accs, baccs = baccs, fit_times = fit_times, test_times= test_times)

clf_tree_sm = train_prod_model(X_train_smaller, y_train, tree, tree_grid)

print(clf_tree_sm.best_params_)
accuracy_score(y_test, clf_tree_sm.predict(X_test_smaller))

Running Outer CV in iteration  0  at time  0.0
Running Outer CV in iteration  1  at time  221.86227536201477
Running Outer CV in iteration  2  at time  443.13639783859253
Total time :  665.6937670707703  sec
{'estimator__criterion': 'entropy', 'estimator__max_depth': 8}


0.60864

## Evaluation Metrics and Comparative Analysis

In this section the calculated accuracy scores for the different classifiers are evaluated and compared to each other as well as to the baseline model.
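
The nested cross-validation above records both plain accuracy and balanced accuracy. The two differ when classes are unevenly represented: balanced accuracy is the mean of the per-class recalls. A minimal sketch on toy labels (invented for illustration, not taken from the dataset):

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy labels: 8 samples of class 0 and 2 of class 1; the model always predicts 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)            # 8 of 10 predictions are correct
bacc = balanced_accuracy_score(y_true, y_pred)  # mean of recalls: (1.0 + 0.0) / 2

print(acc)   # 0.8
print(bacc)  # 0.5
```

On balanced data the two metrics coincide, so a gap between them hints at classes that the model handles poorly.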

In [21]:
df = results.copy()
df.drop(columns=df.columns[df.columns.str.contains("baccs")], inplace=True)
# Map the short identifiers in the "name" column to readable names
# (DataFrame.rename with a dict would only relabel the index, not the column values)
df["name"] = df["name"].replace({"gnb": "Gaussian Naive Bayes","gnb_sm": "Gaussian Naive Bayes small", "svc": "Support Vector Classifier", "svc_sm": "Support Vector Classifier small","random_forest": "Random Forest", "random_forest_sm": "Random Forest small" , "kNN": "k-Nearest-Neighbors", "kNN_sm": "k-Nearest-Neighbors small", "kNN_argmax": "k-Nearest-Neighbors (max frequency)", "decision_tree": "Decision Tree", "decision_tree_sm": "Decision Tree small"})
df.sort_values("accs_mean")

| Unnamed: 0 | name | accs_mean | accs_std | accs_min | accs_max | fit_time | test_time |
|---|---|---|---|---|---|---|---|
| 8 | Decision Tree | 0.596813 | 0.008794 | 0.586153 | 0.610944 | 445.767351 | 0.011173 |
| 9 | Decision Tree small | 0.599493 | 0.006254 | 0.586753 | 0.607871 | 220.333219 | 0.007535 |
| 0 | Gaussian Naive Bayes | 0.601413 | 0.001692 | 0.598872 | 0.603864 | 12.527852 | 0.502562 |
| 1 | Gaussian Naive Bayes small | 0.601453 | 0.003275 | 0.595392 | 0.607824 | 6.748821 | 0.199094 |
| 7 | k-Nearest-Neighbors small | 0.623613 | 0.004456 | 0.614545 | 0.629905 | 19.800821 | 1.718615 |
| 6 | k-Nearest-Neighbors | 0.628587 | 0.004327 | 0.620905 | 0.634825 | 32.170527 | 2.719509 |
| 3 | Support Vector Classifier small | 0.720613 | 0.002929 | 0.715829 | 0.724829 | 794.224772 | 105.264107 |
| 2 | Support Vector Classifier | 0.726213 | 0.002954 | 0.721502 | 0.730589 | 1084.920991 | 35.096847 |
| 5 | Random Forest small | 0.729293 | 0.004982 | 0.723149 | 0.73839 | 432.022286 | 0.510126 |
| 4 | Random Forest | 0.731787 | 0.00407 | 0.725342 | 0.73971 | 603.630489 | 0.544061 |


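The Random Forest Classifier achieves the highest mean accuracy in the nested cross-validation (0.73), closely followed by the Support Vector Classifier (0.72 to 0.73). Fitting the Random Forest is slow (about 600 s per outer fold), but once it is fitted, scoring takes only about 0.5 s, compared with 35 to 105 s for the Support Vector Classifier. Restricting the features to the first 150 frequency bins changes the mean accuracy of each model by less than 0.01 while reducing the fit time. All classifiers are superior to the baseline, which has an accuracy of 0.1.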

### Setbacks and Further Improvements

Testing all combinations of hyperparameters takes a long time, so only a few parameters and values are tested. The results would likely improve if more hyperparameters were tuned on a finer grid.
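
One common way to cover a larger search space within a fixed time budget is a randomized search, which samples a fixed number of parameter combinations instead of trying all of them. This is a different technique from the exhaustive grid search used in this notebook. The sketch below assumes synthetic data and parameter ranges chosen only for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: NOT the insect sound dataset).
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Distributions instead of fixed lists; the ranges are illustrative only.
param_distributions = {
    "n_estimators": randint(10, 50),
    "max_depth": randint(2, 10),
}

# n_iter caps the number of sampled combinations, and with it the runtime.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="balanced_accuracy",
    random_state=1,
)
search.fit(X, y)
print(search.best_params_)
```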

In addition, all of the models are classical machine learning classifiers. A neural network is also likely to lead to better results.