**DVAMI20h**

- Arlind Iseni
- Alexander Jamal

## Assignment 2
The aim of Assignment 2 is to experimentally compare the computational and predictive performance of three learning algorithms on a spam detection task.

**Group assignment:** Max 2 students

**Prerequisite reading:** sections 12.1 - 12.3 in the main literature

**Language:** Python (Already implemented supervised learning algorithms and standard libraries can be used. However, it is NOT permitted to use any library or API that directly computes the Friedman and Nemenyi tests.)

**Data:** Spambase Dataset, https://archive.ics.uci.edu/ml/datasets/Spambase

**Algorithms**  
three supervised classification learning algorithms of your choice.

**Evaluation measures:** perform a comparison between the selected algorithms based on 1) computational performance in terms of training time, 2) predictive performance based on accuracy, and 3) predictive performance based on F-measure.

**Procedure**  
(repeat steps 2, 3, and 4 for each evaluation measure above)

1. Run stratified ten-fold cross-validation tests.
2. Present the results exactly as in the table in example 12.4 of the main literature.
3. Conduct the ***Friedman test*** and report the results exactly as in the table in example 12.8 of the main literature.
4. Determine whether the average ranks as a whole display significant differences on the **0.05** $\alpha$-level and, if so, use the Nemenyi test to calculate the critical difference in order to determine which algorithms perform significantly differently from each other.

**Compute**  
the size of possible instances
the size of hypothesis space (the number of possible extensions)
the number of possible conjunctive concepts according to the descriptions in Section 4.1 of the main literature
Implement the algorithm and verify that it works as expected.
Compute the accuracy of the model and report the generated model, i.e., the conjunctive rule.

**Written report**  
Template: The IEEE conference template and citation style should be followed (templates in MS Word and LaTeX).
Language: English without spelling mistakes.
Style: Clear.
Content: The report should give an overview of the conducted experiments and the obtained results. It should contain (but not be limited to) information about the used classifiers, a brief description of the Friedman and Nemenyi tests along with the formulas, results of the experiment as stated above, and results of the comparison stating whether or not the algorithms perform significantly differently from each other for each performance measure.
Format: PDF.
Page limit: 2 pages excluding references (no abstract should be included)

**Code**   
Provide meaningful comments for different blocks of the code. 
A README.TXT file must clearly state exactly how to execute the code and any necessary setups.

**Submission**  
Make sure to include your names in the report and the code.
The report must be submitted as a PDF separately (not to be included in the ZIP file).
Code and additional files related to implementation must be archived using ZIP.

### Import modules and dataset

In [275]:
import pandas as pd
import numpy as np
from time import time
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import f1_score, accuracy_score

### Configure settings

In [276]:
%matplotlib widget
%matplotlib inline
plt.rcParams['figure.figsize'] = (18, 12)
plt.rcParams['figure.constrained_layout.use'] = True

### Load and read dataframe

In [277]:
# column names are stored in data/names.txt, one per line; read them into a list without the newline characters.
with open("data/names.txt", "r") as f:
    columns = f.read().splitlines()

In [278]:
df = pd.read_csv("data/spambase.data", names=columns)

In [279]:
df.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_orders,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,is_spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


### Data exploration

In [280]:
# proportion of spam (1) and non-spam (0) in the data set
df["is_spam"].value_counts(normalize=True)

0    0.605955
1    0.394045
Name: is_spam, dtype: float64

### Data cleaning

In [281]:
# if any missing values
df.isna().sum().any()

False

In [282]:
# number of duplicated rows
df.duplicated().sum()

391

In [283]:
# check whether any value in any column is below zero (which would be an error)
(df < 0).any().any()

False

### Data transformation

In [284]:
def fitter(X_: pd.DataFrame, params: dict):
    """
    Description:
        fits an algorithm to data.
    
    Args:
        X_: pandas dataframe (unlabeled data)
        params: parameters used to initialize transformer function
    
    Returns:
        transformation function object
    """
    return KBinsDiscretizer(**params).fit(X_)

In [285]:
def discretizer(X_: pd.DataFrame, fitter_: KBinsDiscretizer) -> pd.DataFrame:
    """
    Description:
        transforms dataframe.
    
    Args:
        X_: pandas data frame (unlabeled data)
        fitter_: the fit we use to transform
    
    Returns:
        pandas dataframe
    """
    return pd.DataFrame(fitter_.transform(X_))

### Data splitting

In [286]:
# split data into target (label, Y) and non-target (X)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [287]:
def stratified(X_: pd.DataFrame, y_: pd.Series, clf, metric, params=None) -> list[float]:
    """
    Description:
        loops through stratified k-fold training / test data splits, trains the model
        on each training split and scores it on the matching test split.
    
    Args:
        X_: pandas data frame (unlabeled data)
        y_: pandas series (label data)
        clf: machine learning classifier
        metric: scoring function called as metric(y_true, y_pred)
        params: dictionary with arguments for the transformer (None or empty skips discretization)
    
    Returns:
        list of values for a given metric, one per fold
    """
    skf = StratifiedKFold(n_splits=10, shuffle=False, random_state=None)
    metric_lst = list()
    for train_index, test_index in skf.split(X_, y_):
        # fit the discretizer on the training split only, so no test information leaks in
        fit = fitter(X_.iloc[train_index], params) if params else None
        X_train = X_.iloc[train_index] if fit is None else discretizer(X_.iloc[train_index], fit)
        X_test = X_.iloc[test_index] if fit is None else discretizer(X_.iloc[test_index], fit)
        # positional indexing, since the split yields positions rather than index labels
        y_train, y_test = y_.iloc[train_index], y_.iloc[test_index]
        
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        
        metric_lst.append(metric(y_test, y_pred))
        
    return metric_lst
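
The assignment also asks for training time, which the `stratified` helper above does not record. A minimal sketch of one way to time a single `fit` call follows. `timed_fit` and the stand-in `SlowDummy` classifier are hypothetical names used only for illustration, and `time.perf_counter` is assumed to be an acceptable clock for this purpose.

```python
from time import perf_counter, sleep


def timed_fit(clf, X_train, y_train) -> float:
    """Fit clf on the training data and return the elapsed wall-clock seconds."""
    start = perf_counter()
    clf.fit(X_train, y_train)
    return perf_counter() - start


class SlowDummy:
    """Hypothetical stand-in for a scikit-learn classifier (illustration only)."""

    def fit(self, X, y):
        sleep(0.01)  # pretend that training takes some time
        self.fitted_ = True
        return self


clf = SlowDummy()
seconds = timed_fit(clf, [[0.0], [1.0]], [0, 1])
print(f"training took {seconds:.4f} s")
```

Inside the fold loop, the returned value could be appended to a list in the same way as the metric values, giving one training time per fold.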

### Experiment

https://medium.com/mlearning-ai/comparing-classifiers-friedman-and-nemenyi-tests-32294103ee12

In [288]:
# Instantiation
svm_clf = SVC()
ada_clf = AdaBoostClassifier()
rf_clf = RandomForestClassifier()

In [335]:
def friedman_table(metric) -> pd.DataFrame:
    """
    Description:
        builds a per-fold table of metric values with each algorithm's rank in parentheses.

    Args:
        metric: scoring function called as metric(y_true, y_pred)

    Returns:
        pandas dataframe with one row per fold plus "Average", "Std" and "Average rank" rows
    """
    # create an empty data frame with columns and indices set
    frame = pd.DataFrame(columns=["Support Vector Machine", "AdaBoost", "Random Forest"], index=[i for i in range(1, 11)]) 
    
    # fill each column with the respective algorithm's metric value for every fold.
    frame.loc[:, "Support Vector Machine"] = stratified(X, y, svm_clf, metric, params=dict(n_bins=7, encode="ordinal", strategy="kmeans"))
    frame.loc[:, "AdaBoost"] = stratified(X, y, ada_clf, metric)
    frame.loc[:, "Random Forest"] = stratified(X, y, rf_clf, metric)

    # calculate average and std over the ten folds first, so that the
    # "Average" row is not included when the std is computed
    mean, std = frame.mean(), frame.std()
    frame.loc["Average", :] = mean
    frame.loc["Std", :] = std
    
    # create a ranking table (rank 1 = highest value in the row)
    # note: method="max" gives tied values the larger of their ranks;
    # the Friedman test is usually described with average ranks (method="average")
    ranks = frame.rank(axis=1, method="max", ascending=False).astype(np.int8)
    # get the average rank of each algorithm over the fold rows only
    avg = ranks.loc[~ranks.index.isin(["Average", "Std"])].mean()
    table = frame.apply(lambda x: x.astype(str).str.cat("(" + ranks[x.name].astype(str) + ")", sep=" "))
    table.loc["Average rank"] = avg

    return table

In [336]:
accuracy = friedman_table(accuracy_score)

In [337]:
accuracy

Unnamed: 0,Support Vector Machine,AdaBoost,Random Forest
1,0.9240780911062907 (3),0.9392624728850325 (2),0.9457700650759219 (1)
2,0.9326086956521739 (3),0.95 (2),0.95 (2)
3,0.9282608695652174 (3),0.9369565217391305 (2),0.9369565217391305 (2)
4,0.9434782608695652 (2),0.9369565217391305 (3),0.9521739130434783 (1)
5,0.9434782608695652 (3),0.9565217391304348 (2),0.9586956521739131 (1)
6,0.9456521739130435 (3),0.9478260869565217 (2),0.95 (1)
7,0.95 (2),0.9456521739130435 (3),0.9717391304347827 (1)
8,0.9456521739130435 (3),0.9652173913043478 (2),0.9760869565217392 (1)
9,0.8934782608695652 (1),0.8456521739130435 (3),0.8913043478260869 (2)
10,0.8760869565217392 (1),0.8565217391304348 (3),0.8608695652173913 (2)
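
In folds 2 and 3 of the accuracy table, AdaBoost and Random Forest have identical scores and both receive rank 2, which is the effect of `method="max"`. The Friedman test is usually described with tied values sharing the mean of the ranks they occupy (1.5 each here). The sketch below shows that convention in plain Python. `average_ranks` is a hypothetical helper, not part of the notebook; in pandas the same behaviour should be available through `rank(method="average")`.

```python
def average_ranks(scores: list[float]) -> list[float]:
    """Rank scores in descending order (1 = best); tied scores share the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # find the end of the run of tied scores starting at pos
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        # positions pos..end correspond to ranks pos+1..end+1; use their mean
        shared = (pos + 1 + end + 1) / 2
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


# fold 2 of the accuracy table: SVM, AdaBoost, Random Forest
fold_2 = [0.9326086956521739, 0.95, 0.95]
print(average_ranks(fold_2))  # [3.0, 1.5, 1.5]
```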


In [340]:
fmeasure = friedman_table(f1_score)

In [341]:
fmeasure

Unnamed: 0,Support Vector Machine,AdaBoost,Random Forest
1,0.9014084507042254 (3),0.9199999999999999 (2),0.9261363636363636 (1)
2,0.9131652661064426 (3),0.9362880886426593 (1),0.9359331476323121 (2)
3,0.9065155807365438 (3),0.9169054441260746 (1),0.9069767441860465 (2)
4,0.9257142857142856 (2),0.9196675900277009 (3),0.934844192634561 (1)
5,0.9265536723163841 (3),0.943502824858757 (2),0.9438202247191012 (1)
6,0.9322493224932249 (3),0.9361702127659575 (2),0.947945205479452 (1)
7,0.9329446064139941 (2),0.9283667621776505 (3),0.9548022598870056 (1)
8,0.9283667621776505 (3),0.956043956043956 (2),0.9664804469273743 (1)
9,0.8650137741046833 (2),0.8202531645569621 (3),0.8727272727272726 (1)
10,0.835734870317003 (1),0.8156424581005587 (3),0.8199445983379502 (2)
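
The Friedman statistic itself can be computed by hand from the ranks in this table. The sketch below uses one common formulation: the spread of the average ranks around the overall mean rank $(k+1)/2$, divided by the overall spread of all individual ranks. Without ties this coincides with the classic $\chi^2_F = \frac{12n}{k(k+1)}\left[\sum_j R_j^2 - \frac{k(k+1)^2}{4}\right]$. `friedman_statistic` is a hypothetical helper name, and the rank matrix is copied from the F-measure table above.

```python
def friedman_statistic(ranks: list[list[float]]) -> float:
    """Friedman statistic for an n x k table of ranks (rows = folds, columns = algorithms)."""
    n, k = len(ranks), len(ranks[0])
    mean_rank = (k + 1) / 2
    # average rank of each algorithm over all folds
    avg = [sum(row[j] for row in ranks) / n for j in range(k)]
    # spread of the average ranks around the overall mean rank
    between = n * sum((r - mean_rank) ** 2 for r in avg)
    # spread of all individual ranks around the overall mean rank
    overall = sum((r - mean_rank) ** 2 for row in ranks for r in row) / (n * (k - 1))
    return between / overall


# ranks from the F-measure table: SVM, AdaBoost, Random Forest
fmeasure_ranks = [
    [3, 2, 1], [3, 1, 2], [3, 1, 2], [2, 3, 1], [3, 2, 1],
    [3, 2, 1], [2, 3, 1], [3, 2, 1], [2, 3, 1], [1, 3, 2],
]
statistic = friedman_statistic(fmeasure_ranks)
print(round(statistic, 3))  # 7.8
```

The resulting value still has to be compared against the critical value for k = 3 and n = 10 at the 0.05 level from the main literature, which is not reproduced here.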


### Nemenyi test

To conduct the ***Nemenyi*** test on our three algorithms, we use the average rank of each algorithm over the ten folds, which `friedman_table` has already computed for the F-measure and stored in *fmeasure*.

We continue by calculating the critical distance (CD) using the following formula: $CD = q_\alpha \times \sqrt{\frac{k \times (k+1)}{6n}}$

where $q_\alpha$ depends on the significance level $\alpha$ and on the number of algorithms **k**; for $\alpha$ = 0.05 and **k** = 3 it is 2.343. **n** is the number of measurements per algorithm, in our case the **n** = 10 folds. This gives a **CD** of approximately 1.048. The average F-measure ranks for Support Vector Machine, AdaBoost and Random Forest are *2.5, 2.2 and 1.3* respectively, so the pairwise differences are 1.2 (Support Vector Machine vs. Random Forest), 0.9 (AdaBoost vs. Random Forest) and 0.3 (Support Vector Machine vs. AdaBoost). Only the difference between Support Vector Machine and Random Forest exceeds the CD, so we conclude that Random Forest performs significantly better than Support Vector Machine, while the other two pairs show no significant difference.

In [342]:
def nemenyi(k: int, N: int) -> float:
    """
    Description:
        calculates the critical distance of the nemenyi test at the 0.05 significance level.
        
    Args:
        k: number of models to evaluate (2 to 10)
        N: number of rows
    
    Returns:
        critical distance
    """
    # q_alpha values for alpha = 0.05; the first entry belongs to k = 2
    q_alpha = [
        1.960, 2.343, 2.569, 
        2.728, 2.850, 2.949, 
        3.031, 3.102, 3.164]
    if not 2 <= k <= len(q_alpha) + 1:
        raise ValueError("k needs to be between 2 and 10")
    
    # the table starts at k = 2, so shift the index by two
    return float(q_alpha[k - 2] * np.sqrt((k*(k+1)) / (6*N)))

In [343]:
round(nemenyi(3, 10), 3)

1.048
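
The pairwise comparison described in the text can also be written out explicitly. The following is a minimal sketch in plain Python that takes $q_\alpha$ = 2.343, **k** = 3, **n** = 10 and the average F-measure ranks directly from the text above; the variable names are chosen for illustration only.

```python
from itertools import combinations
from math import sqrt

# values taken from the text: q_alpha for k = 3 at alpha = 0.05, and n = 10 folds
q_alpha, k, n = 2.343, 3, 10
cd = q_alpha * sqrt(k * (k + 1) / (6 * n))

# average F-measure ranks reported in the text
avg_rank = {"Support Vector Machine": 2.5, "AdaBoost": 2.2, "Random Forest": 1.3}

significant = []
for a, b in combinations(avg_rank, 2):
    diff = abs(avg_rank[a] - avg_rank[b])
    if diff > cd:
        # the rank difference exceeds the critical distance
        significant.append((a, b))
    print(f"{a} vs {b}: |diff| = {diff:.1f}, significant = {diff > cd}")
```

Only the Support Vector Machine and Random Forest pair ends up in `significant`, which matches the conclusion stated above.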