**DVAMI20h**

- Arlind Iseni
- Alexander Jamal

## Assignment 2
The aim of Assignment 2 is to experimentally compare the computational and predictive performance of three learning algorithms on a spam detection task.

**Group assignment:** Max 2 students

**Prerequisite reading:** sections 12.1 - 12.3 in the main literature

**Language:** Python (Already implemented supervised learning algorithms and standard libraries can be used. However, it is NOT permitted to use any library or API that directly computes the Friedman and Nemenyi tests.)

**Data:** Spambase Dataset, https://archive.ics.uci.edu/ml/datasets/Spambase

**Algorithms**  
Three supervised classification learning algorithms of your choice.

**Evaluation measures:** perform a comparison between the selected algorithms based on 1) computational performance in terms of training time, 2) predictive performance based on accuracy, and 3) predictive performance based on F-measure.

**Procedure**  
(repeat steps 2, 3, and 4 for each evaluation measure above)

1. Run stratified ten-fold cross-validation tests.
2. Present the results exactly as in the table in example 12.4 of the main literature.
3. Conduct the ***Friedman test*** and report the results exactly as in the table in example 12.8 of the main literature.
4. Determine whether the average ranks as a whole display significant differences at the **0.05** $\alpha$-level and, if so, use the Nemenyi test to calculate the critical difference in order to determine which algorithms perform significantly differently from each other.

**Compute**  
- The number of possible instances.
- The size of the hypothesis space (the number of possible extensions).
- The number of possible conjunctive concepts according to the descriptions in Section 4.1 of the main literature.
- Implement the algorithm and verify that it works as expected.
- Compute the accuracy of the model and report the generated model, i.e., the conjunctive rule.

**Written report**  
- Template: The IEEE conference template and citation style should be followed (templates in MS Word and LaTeX).
- Language: English without spelling mistakes.
- Style: Clear.
- Content: The report should give an overview of the conducted experiments and the obtained results. It should contain (but not be limited to) information about the classifiers used, a brief description of the Friedman and Nemenyi tests along with the formulas, the results of the experiment as stated above, and the results of the comparison, stating for each performance measure whether or not the algorithms perform significantly differently from each other.
- Format: PDF.
- Page limit: 2 pages excluding references (no abstract should be included).

**Code**   
- Provide meaningful comments for different blocks of the code.
- A README.TXT file must clearly state exactly how to execute the code and any necessary setups.

**Submission**  
- Make sure to include your names in the report and the code.
- The report must be submitted as a PDF separately (not to be included in the ZIP file).
- Code and additional files related to implementation must be archived using ZIP.

### Import modules and dataset

In [25]:
import pandas as pd
from time import time
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import f1_score, accuracy_score

### Load and read dataframe

In [26]:
# column names are stored one per line in data/names.txt; read them into a list without the newline characters
with open("data/names.txt", "r") as f:
    columns = f.read().splitlines()

In [27]:
df = pd.read_csv("data/spambase.data", names=columns)

In [28]:
df.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_orders,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,is_spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


### Data exploration

In [29]:
# proportion of non-spam (0) and spam (1) in the data set
df["is_spam"].value_counts(normalize=True)

0    0.605955
1    0.394045
Name: is_spam, dtype: float64

### Data cleaning

In [30]:
# check whether any column contains missing values
df.isna().sum().any()

False

In [32]:
# count the duplicated rows
df.duplicated().sum()

391

In [31]:
# check whether any single value is below zero (which would be an error)
(df < 0).any().any()

False

### Data transformation

In [33]:
def fit(X_: pd.DataFrame, params: dict) -> KBinsDiscretizer:
    """
    Description:
        Fits a KBinsDiscretizer to the data.
    
    Args:
        X_: pandas dataframe (unlabeled data).
        params: parameters used to initialize transformer function.
    
    Returns:
        Returns transformation function object.
    """
    return KBinsDiscretizer(**params).fit(X_)

In [34]:
def discretizer(X_: pd.DataFrame, fitter_: KBinsDiscretizer) -> pd.DataFrame:
    """
    Description:
        Discretizes a dataframe with an already fitted KBinsDiscretizer.
    
    Args:
        X_: pandas data frame (unlabeled data).
        fitter_: KBinsDiscretizer object we use to transform our data.
    
    Returns:
        Returns pandas dataframe.
    """
    return pd.DataFrame(fitter_.transform(X_))

### Data splitting

In [35]:
# split data into target (label, Y) and non-target (X)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [37]:
def train_test(X_: pd.DataFrame, y_: pd.Series, clf, metric, params=None) -> list[float]:
    """
    Description:
        Loops through k folds (balanced) and trains and tests classifier.
    
    Args:
        X_: pandas data frame (unlabeled data).
        y_: pandas series (label data).
        clf: machine learning classifier.
        metric: performance function.
        params: dictionary with arguments for the transformer.
    
    Returns:
        List of values for a given metric.
    """
    skf = StratifiedKFold(n_splits=10, shuffle=False)
    metric_lst = []
    params = {} if params is None else params
    
    for train_index, test_index in skf.split(X_, y_):
        # positional indexing for both features and labels
        X_train, X_test = X_.iloc[train_index], X_.iloc[test_index]
        y_train, y_test = y_.iloc[train_index], y_.iloc[test_index]
        
        # discretize only if transformer parameters are given; fit on the training fold only
        if params:
            fitter = fit(X_train, params)
            X_train, X_test = discretizer(X_train, fitter), discretizer(X_test, fitter)
        
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        
        metric_lst.append(metric(y_test, y_pred))
        
    return metric_lst
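
The assignment also lists training time as a computational measure, and `time` is imported above but not used in `train_test`. A minimal sketch of how per-fold training time could be collected, assuming the same fold structure; `fit_times` and its `splits` argument are hypothetical names, not part of the original notebook:

```python
from time import perf_counter


def fit_times(X_, y_, clf, splits):
    """Return the wall-clock training time in seconds for each fold.

    X_ and y_ are expected to support positional `.iloc` indexing
    (pandas DataFrame / Series); `splits` is an iterable of
    (train_index, test_index) pairs, e.g. from StratifiedKFold.split.
    """
    times = []
    for train_index, _ in splits:
        X_train, y_train = X_.iloc[train_index], y_.iloc[train_index]
        start = perf_counter()          # time only the call to fit
        clf.fit(X_train, y_train)
        times.append(perf_counter() - start)
    return times
```

With the objects defined in this notebook it could be called as `fit_times(X, y, rf, StratifiedKFold(n_splits=10).split(X, y))`. Note that for training time a lower value is better, so the ranking direction is the opposite of accuracy and F-measure.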

### Experiment

https://medium.com/mlearning-ai/comparing-classifiers-friedman-and-nemenyi-tests-32294103ee12

In [38]:
# Instantiation
svm = SVC()
ada = AdaBoostClassifier()
rf = RandomForestClassifier()

In [14]:
# import custom module for friedman table
from friedman_table import Friedman

In [15]:
# number of blocks (cross-validation folds)
blocks = 10

In [16]:
# treatment group with accuracy
treatments_accuracy = {
    "Support Vector Machine": train_test(X, y, svm, accuracy_score, params=dict(n_bins=7, encode="ordinal", strategy="kmeans")), 
    "AdaBoost": train_test(X, y, ada, accuracy_score),
    "Random Forest": train_test(X, y, rf, accuracy_score)}

In [17]:
accuracy = Friedman(blocks, treatments_accuracy)

In [18]:
# treatment group with f1-score
treatments_f1 = {
    "Support Vector Machine": train_test(X, y, svm, f1_score, params=dict(n_bins=7, encode="ordinal", strategy="kmeans")), 
    "AdaBoost": train_test(X, y, ada, f1_score),
    "Random Forest": train_test(X, y, rf, f1_score)}                          

In [19]:
f1 = Friedman(blocks, treatments_f1)

In [20]:
accuracy.table()

Fold,Support Vector Machine,AdaBoost,Random Forest
1,0.9240780911062907 (3.0),0.9392624728850325 (2.0),0.9414316702819957 (1.0)
2,0.9326086956521739 (3.0),0.95 (1.0),0.9456521739130435 (2.0)
3,0.9282608695652174 (3.0),0.9369565217391305 (1.0),0.9326086956521739 (2.0)
4,0.9434782608695652 (2.0),0.9369565217391305 (3.0),0.9478260869565217 (1.0)
5,0.9434782608695652 (3.0),0.9565217391304348 (2.0),0.9608695652173913 (1.0)
6,0.9456521739130435 (3.0),0.9478260869565217 (2.0),0.9543478260869566 (1.0)
7,0.95 (2.0),0.9456521739130435 (3.0),0.9695652173913043 (1.0)
8,0.9456521739130435 (3.0),0.9652173913043478 (2.0),0.9717391304347827 (1.0)
9,0.8934782608695652 (2.0),0.8456521739130435 (3.0),0.8934782608695652 (2.0)
10,0.8760869565217392 (1.0),0.8565217391304348 (3.0),0.8565217391304348 (3.0)
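
Folds 9 and 10 in the table above contain tied accuracies. The Friedman test conventionally gives tied values the average of the ranks they span; a minimal sketch of that convention in plain Python follows. `average_ranks` is a hypothetical helper and not part of the `friedman_table` module.

```python
def average_ranks(scores, higher_is_better=True):
    """Rank one block of scores (best = 1); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # extend `end` over the run of scores equal to the one at `pos`
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1    # mean of the 1-based positions pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


# accuracies of fold 9 and fold 10, copied from the table above
# (order: Support Vector Machine, AdaBoost, Random Forest)
fold_9 = [0.8934782608695652, 0.8456521739130435, 0.8934782608695652]
fold_10 = [0.8760869565217392, 0.8565217391304348, 0.8565217391304348]
print(average_ranks(fold_9))
print(average_ranks(fold_10))
```

For training time, where lower is better, the same helper could be called with `higher_is_better=False`.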


In [21]:
f1.table()

Fold,Support Vector Machine,AdaBoost,Random Forest
1,0.9014084507042254 (3.0),0.9199999999999999 (2.0),0.9408450704225353 (1.0)
2,0.9131652661064426 (3.0),0.9362880886426593 (1.0),0.9329608938547486 (2.0)
3,0.9065155807365438 (3.0),0.9169054441260746 (1.0),0.9132947976878613 (2.0)
4,0.9257142857142856 (2.0),0.9196675900277009 (3.0),0.934844192634561 (1.0)
5,0.9265536723163841 (3.0),0.943502824858757 (2.0),0.952112676056338 (1.0)
6,0.9322493224932249 (3.0),0.9361702127659575 (2.0),0.9450549450549451 (1.0)
7,0.9329446064139941 (2.0),0.9283667621776505 (3.0),0.9458689458689459 (1.0)
8,0.9283667621776505 (3.0),0.956043956043956 (2.0),0.9664804469273743 (1.0)
9,0.8650137741046833 (2.0),0.8202531645569621 (3.0),0.8720626631853786 (1.0)
10,0.835734870317003 (1.0),0.8156424581005587 (3.0),0.8232044198895029 (2.0)
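
The rank tables above are the input to the Friedman test. A minimal sketch of one common formulation follows, assuming the statistic is the ratio of the spread of the average ranks to the spread of all individual ranks. `friedman_statistic` is a hypothetical helper, not the `Friedman` class imported earlier, and the toy ranks are illustrative rather than the results above. It uses no statistics library, in line with the restriction in the assignment.

```python
def friedman_statistic(ranks):
    """Friedman statistic for an n x k table of ranks (n blocks, k treatments)."""
    n = len(ranks)
    k = len(ranks[0])
    mean_rank = (k + 1) / 2
    # average rank of each treatment over all blocks
    avg_ranks = [sum(row[j] for row in ranks) / n for j in range(k)]
    # spread of the average ranks around the overall mean rank
    between = n * sum((r - mean_rank) ** 2 for r in avg_ranks)
    # spread of all individual ranks around the overall mean rank
    within = sum((r - mean_rank) ** 2 for row in ranks for r in row) / (n * (k - 1))
    return between / within


# illustrative ranks only: 4 blocks (rows) x 3 treatments (columns)
toy_ranks = [
    [1, 2, 3],
    [1, 2, 3],
    [1, 3, 2],
    [2, 1, 3],
]
print(friedman_statistic(toy_ranks))
```

For untied ranks this ratio reduces to $\frac{12n}{k(k+1)}\sum_j \left(R_j - \frac{k+1}{2}\right)^2$, where $R_j$ is the average rank of treatment $j$. The resulting value still has to be compared against the critical value for the chosen $\alpha$ given in the main literature.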


### Nemenyi test

To conduct the ***Nemenyi*** test on our three algorithms, we need the average rank of each algorithm over the ten folds, which the `Friedman` objects (e.g. *f1*) have already computed.

We continue by calculating the critical distance (CD) using the following formula: $CD = q_\alpha \sqrt{\frac{k(k+1)}{6n}}$

where $q_\alpha$ depends on the significance level $\alpha$ as well as **k**: for $\alpha$ = 0.05 and **k** = 3 it is X. Here **n** is the number of measurements taken per algorithm, which in our case is **n** = 10, leading to a **CD** of X. Since our average ranks for Support Vector Machine, AdaBoost and Random Forest are *X1, X2 and X3* respectively, ...
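
The formula above, and the pairwise comparison the Nemenyi test relies on, can be sketched as follows. Two algorithms are considered significantly different when their average ranks differ by more than the CD. The value `q_alpha = 2.343` is an assumption taken from commonly tabulated Nemenyi critical values for $\alpha$ = 0.05 and **k** = 3 and should be checked against the main literature; the constant used inside the custom `Friedman.nemenyi()` method is not shown in this notebook, so this sketch need not reproduce its output. The function names and the average ranks are hypothetical.

```python
from itertools import combinations
from math import sqrt


def critical_difference(k, n, q_alpha):
    """CD = q_alpha * sqrt(k * (k + 1) / (6 * n))."""
    return q_alpha * sqrt(k * (k + 1) / (6 * n))


def significant_pairs(avg_ranks, cd):
    """Return the pairs of algorithms whose average ranks differ by more than cd."""
    return [(a, b) for a, b in combinations(avg_ranks, 2)
            if abs(avg_ranks[a] - avg_ranks[b]) > cd]


# assumption: q_alpha for alpha = 0.05 and k = 3, to be verified in the literature
cd = critical_difference(k=3, n=10, q_alpha=2.343)

# illustrative average ranks only, not the results of this experiment
example_ranks = {"A": 1.2, "B": 2.0, "C": 2.8}
print(round(cd, 3))
print(significant_pairs(example_ranks, cd))
```

Passing `q_alpha` in as an argument keeps the tabulated constant visible, so it is easy to swap for the value given in the course literature.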

In [22]:
accuracy.nemenyi()

1.2199986885238854

In [23]:
f1.nemenyi()

1.2199986885238854

Support Vector Machine and AdaBoost both exceed the critical distance threshold, and therefore we reject our null hypothesis (that there are no significant differences between the treatments).