**DVAMI20h**

- Arlind Iseni
- Alexander Jamal

## Assignment 2
The aim of Assignment 2 is to experimentally compare the computational and predictive performance of three learning algorithms on a spam detection task.

**Group assignment:** Max 2 students

**Prerequisite reading:** sections 12.1 - 12.3 in the main literature

**Language:** Python (Existing implementations of supervised learning algorithms and standard libraries may be used. However, it is NOT permitted to use any library or API that directly computes the Friedman and Nemenyi tests.)

**Data:** Spambase Dataset, https://archive.ics.uci.edu/ml/datasets/Spambase

**Algorithms**  
Three supervised classification learning algorithms of your choice.

**Evaluation measures:** perform a comparison between the selected algorithms based on 1) computational performance in terms of training time, 2) predictive performance based on accuracy, and 3) predictive performance based on F-measure.

**Procedure**  
(repeat steps 2, 3, and 4 for each evaluation measure above)

1. Run stratified ten-fold cross-validation tests.
2. Present the results exactly as in the table in example 12.4 of the main literature.
3. Conduct the Friedman test and report the results exactly as in the table in example 12.8 of the main literature.
4. Determine whether the average ranks as a whole display significant differences at the 0.05 alpha level and, if so, use the Nemenyi test to calculate the critical difference in order to determine which algorithms perform significantly differently from each other.
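
Steps 3 and 4 can be summarised by the formulas below. This is a sketch in commonly used notation, which is an assumption here and may differ from the main literature: $n$ is the number of folds, $k$ the number of algorithms, $R_{ij}$ the rank of algorithm $j$ on fold $i$, and $q_\alpha$ a tabulated critical value that depends on $\alpha$ and $k$.

```latex
\begin{align*}
R_j &= \frac{1}{n}\sum_{i=1}^{n} R_{ij}, \qquad \bar{R} = \frac{k+1}{2} \\
\text{Friedman statistic} &= \frac{n\sum_{j=1}^{k}\left(R_j-\bar{R}\right)^2}
  {\frac{1}{n(k-1)}\sum_{i=1}^{n}\sum_{j=1}^{k}\left(R_{ij}-\bar{R}\right)^2} \\
CD &= q_\alpha\sqrt{\frac{k(k+1)}{6n}}
\end{align*}
```

With $k = 3$ and $n = 10$ the square-root factor is $\sqrt{12/60} \approx 0.447$. Two algorithms are considered significantly different when their average ranks differ by more than $CD$.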

**Compute**  
the size of possible instances
the size of hypothesis space (the number of possible extensions)
the number of possible conjunctive concepts according to the descriptions in Section 4.1 of the main literature
Implement the algorithm and verify that it works as expected.
Compute the accuracy of the model and report the generated model, i.e., the conjunctive rule.

**Written report**  
Template: The IEEE conference template and citation style should be followed (templates in MS Word and LaTeX).
Language: English without spelling mistakes.
Style: Clear.
Content: The report should give an overview of the conducted experiments and the obtained results. It should contain (but not be limited to) information about the classifiers used, a brief description of the Friedman and Nemenyi tests along with their formulas, the results of the experiment as stated above, and the results of the comparison, stating for each performance measure whether the algorithms perform significantly differently from each other.
Format: PDF.
Page limit: 2 pages excluding references (no abstract should be included)

**Code**   
Provide meaningful comments for different blocks of the code. 
A README.TXT file must clearly state exactly how to execute the code and any necessary setups.

**Submission**  
Make sure to include your names in the report and the code.
The report must be submitted as a PDF separately (not to be included in the ZIP file).
Code and additional files related to implementation must be archived using ZIP.

### Import modules and dataset

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.preprocessing import KBinsDiscretizer, Normalizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.svm import SVC

### Configure settings

In [4]:
%matplotlib inline
plt.rcParams['figure.figsize'] = (18, 12)
plt.rcParams['figure.constrained_layout.use'] = True

### Load and read dataframe

In [5]:
# Column names are stored in data/names.txt. Read them into a list, one entry per line, without the newline characters.
with open("data/names.txt", "r") as f:
    columns = f.read().splitlines()

In [6]:
df = pd.read_csv("data/spambase.data", names=columns)

In [7]:
df.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_orders,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,is_spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


### Data exploration

In [8]:
df["is_spam"].value_counts(normalize=True)

0    0.605955
1    0.394045
Name: is_spam, dtype: float64

### Data cleaning

#### null-values

In [9]:
df.isna().sum().any()

False

#### duplicates

In [10]:
df.duplicated().sum()

391
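
The check above finds 391 duplicated rows, but the notebook does not state what is done with them. If they were to be removed, a minimal sketch could look like the following. The frame `toy` and its values are made up for illustration, and whether dropping duplicates is appropriate for Spambase is a judgment call, since identical feature vectors may stem from distinct emails.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical.
toy = pd.DataFrame({"word_freq_make": [0.0, 0.21, 0.0], "is_spam": [1, 1, 1]})

# Number of rows that repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
```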

#### negative values

In [11]:
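# Count the columns that contain at least one negative value (.all() would only flag columns that are entirely negative).
(df < 0).any().sum()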

0

### Split data

In [13]:
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [14]:
skf = StratifiedKFold(n_splits=10, shuffle=False, random_state=None)
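
Stratification keeps the class proportions of the full data set in every fold. The sketch below illustrates this on made-up labels with a 60/40 split, close to the proportions reported in the data exploration step. `X_demo`, `y_demo` and `skf_demo` are illustrative names and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 60 non-spam (0) and 40 spam (1); the feature values do not affect the split.
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.zeros((len(y_demo), 1))

skf_demo = StratifiedKFold(n_splits=10)

# Share of spam in each of the ten test folds.
spam_share = [float(y_demo[test_idx].mean()) for _, test_idx in skf_demo.split(X_demo, y_demo)]
```

Every entry of `spam_share` equals the overall spam share of the toy labels, which is what makes the per-fold scores comparable.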

### Data transformation

In [119]:
def fitter(X_: pd.DataFrame, transformer_, params: dict):
    """
    Fits an algorithm to data.
    
    Args:
        X_: pandas dataframe (unlabeled data)
        transformer_: any algorithm used for data transformation
        params: parameters used to initialize transformer function
    
    Returns:
        transformation function object
    """
    return transformer_(**params).fit(X_)

In [16]:
def transformer(X_: pd.DataFrame, fitter_) -> pd.DataFrame:
    """
    Transforms dataframe.
    
    Args:
        X_: pandas data frame (unlabeled data)
        fitter_: fitted transformer object returned by fitter()
    
    Returns:
        pandas dataframe with the transformed values (the original index and column names are not kept)
    """
    return pd.DataFrame(fitter_.transform(X_))
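
`fitter` and `transformer` together make it possible to fit a transformation on the training fold and apply it to both the training and the test fold. scikit-learn's `Pipeline` gives the same guarantee when combined with `cross_val_score`. The sketch below is an alternative, not what the notebook uses, and it runs on synthetic data from `make_classification` rather than on Spambase.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption for illustration only).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([
    ("discretize", KBinsDiscretizer(n_bins=7, encode="ordinal", strategy="kmeans")),
    ("svm", SVC()),
])

# In each fold the discretizer is fitted on the training part only,
# so no information from the test part leaks into the bin edges.
cv = StratifiedKFold(n_splits=10)
fold_accuracy = cross_val_score(pipe, X_syn, y_syn, cv=cv, scoring="accuracy")
```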

### Stratified K-fold

In [64]:
def stratified(X_: pd.DataFrame, y_: pd.Series, clf, transformer_=None, params=None) -> tuple:
    """
    Runs stratified k-fold cross-validation and averages the scores over the folds.
    
    Args:
        X_: pandas data frame (unlabeled data)
        y_: pandas series (label data)
        clf: classifier algorithm
        transformer_: transformer class fitted on each training fold (optional)
        params: dictionary with arguments for the transformer (optional)
    
    Returns:
        tuple with the mean accuracy, precision, recall and F-measure, in that order
    """
    params = {} if params is None else params
    acc, prec, recall, f1 = [], [], [], []
    # Split the data passed in, not the global X and y.
    for train_index, test_index in skf.split(X_, y_):
        X_train, X_test = X_.iloc[train_index], X_.iloc[test_index]
        y_train, y_test = y_.iloc[train_index], y_.iloc[test_index]
        
        if transformer_ is not None:
            # Fit the transformer on the training fold only, then apply it to both folds.
            fit = fitter(X_train, transformer_, params)
            X_train, X_test = transformer(X_train, fit), transformer(X_test, fit)
        
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        
        # scikit-learn metrics expect (y_true, y_pred); swapping them exchanges precision and recall.
        acc.append(accuracy_score(y_test, y_pred))
        prec.append(precision_score(y_test, y_pred))
        recall.append(recall_score(y_test, y_pred))
        f1.append(f1_score(y_test, y_pred))
        
    # Same order as the rows of the metrics table: Accuracy, Precision, Recall, F-measure.
    return np.mean(acc), np.mean(prec), np.mean(recall), np.mean(f1)
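
The assignment asks for results per fold and for training time, while `stratified` only returns means of the predictive scores. A possible variant that keeps one row per fold and times the call to `fit` is sketched below. The helper name `per_fold_scores`, the column names, the decision tree and the synthetic data are all assumptions made for illustration.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def per_fold_scores(X_: pd.DataFrame, y_: pd.Series, clf, n_splits: int = 10) -> pd.DataFrame:
    """Returns one row per fold with training time (seconds), accuracy and F-measure."""
    rows = []
    folds = StratifiedKFold(n_splits=n_splits).split(X_, y_)
    for fold, (train_idx, test_idx) in enumerate(folds, start=1):
        start = time.perf_counter()
        clf.fit(X_.iloc[train_idx], y_.iloc[train_idx])
        train_time = time.perf_counter() - start  # only fitting is timed

        y_true = y_.iloc[test_idx]
        y_pred = clf.predict(X_.iloc[test_idx])
        rows.append({
            "fold": fold,
            "train_time": train_time,
            "accuracy": accuracy_score(y_true, y_pred),
            "f_measure": f1_score(y_true, y_pred),
        })
    return pd.DataFrame(rows).set_index("fold")


# Synthetic stand-in for the real data (assumption for illustration only).
X_arr, y_arr = make_classification(n_samples=300, n_features=8, random_state=0)
fold_table = per_fold_scores(pd.DataFrame(X_arr), pd.Series(y_arr), DecisionTreeClassifier(random_state=0))
```

Calling the helper once per classifier and placing the same column of each result side by side gives a folds-by-algorithms table for one evaluation measure.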

### SVM Classifier

In [65]:
svm_clf = SVC()

#### Without transformation

In [51]:
#stratified(X, y, svm_clf)

#### With discretization

In [66]:
kbins_params = dict(n_bins=7, encode="ordinal", strategy="kmeans")
svm_scores = stratified(X, y, svm_clf, transformer_=KBinsDiscretizer, params=kbins_params)

#### With normalization

In [50]:
#normalizer_params = dict(norm='max')
#stratified(X, y, svm_clf, transformer_=Normalizer)

### Ada Boost Classifier

In [39]:
ada_clf = AdaBoostClassifier()

#### Without transformation

In [68]:
ada_scores = stratified(X, y, ada_clf)

#### With discretization

In [49]:
#stratified(X, y, ada_clf, transformer_=KBinsDiscretizer, params=kbins_params)

#### With normalization

In [48]:
#stratified(X, y, ada_clf, transformer_=Normalizer)

### Random forest classifier

In [69]:
rf_clf = RandomForestClassifier()

In [70]:
rf_scores = stratified(X, y, rf_clf)

In [117]:
#stratified(X, y, rf_clf, transformer_=KBinsDiscretizer, params=kbins_params)

In [47]:
#stratified(X, y, rf_clf, transformer_=Normalizer)

### Friedman test

In [97]:
metrics = pd.DataFrame(columns=["Random Forest", "SVM", "Ada"], index=["Accuracy", "Precision", "Recall", "F-measure"])

In [98]:
metrics

Unnamed: 0,Random Forest,SVM,Ada
Accuracy,,,
Precision,,,
Recall,,,
F-measure,,,


In [99]:
metrics["Random Forest"] = rf_scores
metrics["SVM"] = svm_scores
metrics["Ada"] = ada_scores

In [100]:
metrics

Unnamed: 0,Random Forest,SVM,Ada
Accuracy,0.940663,0.928277,0.928057
Precision,0.934279,0.930148,0.914763
Recall,0.916177,0.885274,0.906247
F-measure,0.924551,0.906767,0.909284


In [115]:
# Rank the algorithms within each row (1 = best); tied values share their average rank, as the Friedman test expects.
scores = metrics.rank(axis=1, method="average", ascending=False)

In [116]:
scores

Unnamed: 0,Random Forest,SVM,Ada
Accuracy,1.0,2.0,3.0
Precision,1.0,2.0,3.0
Recall,1.0,3.0,2.0
F-measure,1.0,3.0,2.0
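
The ranks above are computed on the mean value of each metric, whereas the procedure in the assignment ranks the algorithms on each of the ten folds, one evaluation measure at a time. The following sketch computes the Friedman statistic and the Nemenyi critical difference with the standard library only. The scores in `toy_scores` are made up for illustration and are not results from this notebook. `q_alpha = 2.343` is the tabulated Nemenyi value for three algorithms at an alpha level of 0.05. The critical value for the Friedman statistic is deliberately not hard-coded and should be taken from the main literature.

```python
import math


def rank_row(values, higher_is_better=True):
    """Ranks one row (1 = best); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda j: values[j], reverse=higher_is_better)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for t in range(pos, end + 1):
            ranks[order[t]] = average_rank
        pos = end + 1
    return ranks


def friedman_statistic(scores, higher_is_better=True):
    """scores: one row per fold, one column per algorithm."""
    n, k = len(scores), len(scores[0])
    ranks = [rank_row(row, higher_is_better) for row in scores]
    avg_ranks = [sum(row[j] for row in ranks) / n for j in range(k)]
    mean_rank = (k + 1) / 2
    between = n * sum((r - mean_rank) ** 2 for r in avg_ranks)
    within = sum((r - mean_rank) ** 2 for row in ranks for r in row) / (n * (k - 1))
    return between / within, avg_ranks


def nemenyi_cd(k, n, q_alpha):
    """Critical difference between average ranks."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Made-up accuracies for three algorithms on five folds (illustration only).
toy_scores = [
    [0.95, 0.93, 0.90],
    [0.94, 0.92, 0.91],
    [0.96, 0.91, 0.93],
    [0.93, 0.94, 0.90],
    [0.95, 0.92, 0.91],
]
statistic, avg_ranks = friedman_statistic(toy_scores)
cd = nemenyi_cd(k=3, n=len(toy_scores), q_alpha=2.343)
```

For training time, where lower values are better, `higher_is_better=False` reverses the ranking direction.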
