# Comparison of Original IForestASD, SADWIN IFA, PADWIN IFA, and NDKSWIN

The goal is to see how the four methods behave on different datasets.
* SADWIN IFA (Score-based ADWIN on IForestASD) scores incoming data with the IForestASD model built from the previous windows. ADWIN monitors these scores to check whether the model has drifted. Once drift is detected, the model is retrained on the current window and the old model is discarded entirely.
* PADWIN IFA (Prediction-based ADWIN on IForestASD) classifies incoming data with the IForestASD model built from the previous windows. ADWIN monitors these predictions to check whether the model has drifted. Once drift is detected, the model is retrained on the current window and the old model is discarded entirely.
* NDKSWIN IFA (N-Dimensional KSWIN on IForestASD) uses the data in the current window to detect whether it drifts on at least one column. NDKSWIN adapts KSWIN from scikit-multiflow to n-dimensional data. Once drift is detected, the model is retrained on the current window and the old model is discarded entirely.
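
The per-column test described in the NDKSWIN bullet can be sketched as follows. This is a simplified stand-in, not the notebook's NDKSWIN implementation: it tests every column on every sample (the notebook subsamples both through `n_dimensions` and `n_tested_samples`), it uses only the standard library, and it replaces an exact p-value with the large-sample critical value of the two-sample Kolmogorov-Smirnov statistic. The names `ks_statistic` and `drifting_columns` are hypothetical.

```python
import math
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def drifting_columns(reference, current, alpha=0.01):
    """Return the column indices where the KS test rejects 'same distribution'."""
    n, m = len(reference), len(current)
    # Large-sample critical value: c(alpha) * sqrt((n + m) / (n * m)).
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    drifting = []
    for dim in range(len(reference[0])):
        ref_col = [row[dim] for row in reference]
        cur_col = [row[dim] for row in current]
        if ks_statistic(ref_col, cur_col) > critical:
            drifting.append(dim)
    return drifting


# Illustrative data (not from the notebook): only column 1 shifts its mean.
rng = random.Random(1)
reference = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
current = [[rng.gauss(0, 1), rng.gauss(3, 1)] for _ in range(200)]
drifted = drifting_columns(reference, current)
needs_retraining = len(drifted) > 0  # drift on at least one column triggers a retrain
```

A non-empty result corresponds to the "drift on at least one column" condition, after which the model would be rebuilt from the current window.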

## Install skmultiflow if needed
Git must be installed, because pip fetches the package directly from the GitHub repository.

In [1]:
#print("scikit-multiflow package installation")
#!pip install -U git+https://github.com/scikit-multiflow/scikit-multiflow

In [2]:
try:
    import skmultiflow
except ImportError as e:
    print("scikit-multiflow package installation")
    !pip install -U git+https://github.com/scikit-multiflow/scikit-multiflow

## Importations and configurations

In [1]:
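%matplotlib notebook
import datetime

# Note: `plt` here aliases the top-level matplotlib package, not matplotlib.pyplot.
import matplotlib as plt
plt.interactive(True)  # matplotlib.interactive: turn on interactive plotting

# Helper class defined in the local `source` package.
from source import functions
func = functions.Comparison()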

## General parameters for the evaluation

In [2]:
#************************ Execution settings *******************************
window_sizes = [100, 500]
n_estimators = [30]
execution_number = 1 # Number of runs per configuration (IForest is randomized)
anomaly_threshold = 0.5 # Score threshold used to decide whether a sample is an anomaly
max_sample = 10000 # Total number of samples to process (number of windows = max_sample/window)
n_wait = max_sample # The evaluation step size
# Metrics used in the evaluation. Use only metrics available in skmultiflow
metrics=['accuracy', 'f1', 'precision', 'recall', 'true_vs_predicted', 'kappa', 'kappa_m', 'running_time', 'model_size']

#************************ Stream data settings *******************************
window_save_size = 100
window_number = round((max_sample/window_save_size),0) # Number of windows to save to the .csv file

# Parameters for NDKSWIN IFA
alpha=0.01
n_dimensions=2 # Number of dimensions tested for concept drift (normally 50% of m)
n_tested_samples=0.1 # Fraction of the window's samples drawn to test for concept drift
fixed_checked_dimension = False # If False, the tested dimensions are chosen randomly
fixed_checked_sample=False

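As a quick sanity check, the settings above imply the following per-window quantities. This is a minimal sketch: the helper name `derived_settings` is hypothetical, and it assumes that `n_tested_samples` is applied as a fraction of each evaluation window, which is inferred from the comment rather than from the NDKSWIN implementation.

```python
def derived_settings(max_sample, window_sizes, n_tested_samples):
    """Number of windows and (assumed) number of tested samples per window size."""
    out = {}
    for window in window_sizes:
        out[window] = {
            "n_windows": max_sample // window,
            # Assumption: the fraction is taken of the window length.
            "tested_samples": round(window * n_tested_samples),
        }
    return out


settings = derived_settings(max_sample=10000, window_sizes=[100, 500], n_tested_samples=0.1)
for window, values in settings.items():
    print(window, values)
```

With `max_sample = 10000`, a window of 100 gives 100 windows and a window of 500 gives 20, so the smaller window gives the drift detectors five times as many opportunities to trigger a retrain.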
## Execution Function

In [4]:
def execute_comparision():
    # Relies on the notebook globals `stream`, `test_name` and `drift_rate`,
    # which must be set before this function is called.
    # Save the generated stream once so that every configuration runs on the same data.
    file_path = func.save_stream_data_generated(stream=stream, window=window_save_size,
                                                result_folder=test_name, window_number=window_number)
    for window in window_sizes:
        # Reload the saved data as a fresh stream for this window size.
        stream2 = func.get_file_stream(path=file_path)
        for n_estimator in n_estimators:
            print("")
            print("******************************** Window = "+str(window)+" and n_estimator = "+str(n_estimator)+" ********************************")
            func.run_IForestASDs_comparison2(
                execution_number=execution_number, stream=stream2,
                stream_n_features=stream.n_features, window=window,
                estimators=n_estimator, anomaly=anomaly_threshold, drift_rate=drift_rate,
                result_folder=test_name, max_sample=max_sample, n_wait=n_wait, metrics=metrics,
                #n_estimators_updated=n_estimators_updated, updated_randomly=updated_randomly,
                alpha=alpha, n_dimensions=n_dimensions, n_tested_samples=n_tested_samples,
                fixed_checked_dimension=fixed_checked_dimension, fixed_checked_sample=fixed_checked_sample)

    # Merge the result files written under results/<test_name> into a single CSV.
    directory_path = 'results/'+str(test_name)
    func.merge_file2(folder_path=directory_path, output_file='output_madkour.csv', skiprows=(4 + 4))

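The pattern used by the function above (write the data once, then replay it for each configuration) can be illustrated with the standard library alone. This is a hedged sketch, not the code behind `func`: `save_stream`, `load_stream`, and `run_grid` are hypothetical helpers, the data is random, and the "evaluation" only counts windows. Unlike the function above, which reloads the stream once per window size, this sketch reloads it for every (window, n_estimator) pair.

```python
import csv
import itertools
import random
import tempfile
from pathlib import Path


def save_stream(rows, path):
    """Write the generated samples to a CSV file, one sample per line."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


def load_stream(path):
    """Read the samples back as lists of floats."""
    with open(path, newline="") as f:
        return [[float(v) for v in row] for row in csv.reader(f)]


def run_grid(path, window_sizes, n_estimators):
    """Replay the saved data for each (window, n_estimator) pair."""
    results = {}
    for window, n_estimator in itertools.product(window_sizes, n_estimators):
        data = load_stream(path)  # fresh copy of the same data for every pair
        results[(window, n_estimator)] = len(data) // window  # placeholder for an evaluation
    return results


rng = random.Random(1)
rows = [[rng.random(), rng.random()] for _ in range(1000)]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data_used.csv"
    save_stream(rows, path)
    same_data = load_stream(path) == rows  # the replayed data matches what was generated
    grid = run_grid(path, window_sizes=[100, 500], n_estimators=[30])
```

Replaying from a file keeps the comparison fair: differences between configurations then come from the models and drift detectors, not from different random draws of the stream.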
## Applied On Simple Stream Dataset
### Idea and expected results
### Results
#### Model updating
#### Method performances

### Summary

In [5]:
dataset_name = "Generator"
test_name = dataset_name+'_'+str(datetime.datetime.now())
drift_rate = 0.1
stream = func.get_dataset(dataset_name=dataset_name, classification_function=0,noise_percentage=0.1, random_state=1)
execute_comparision()


Please find the data used on results/Generator_2022-09-23 17:15:32.112100/Generator_2022-09-23 17:15:32.112100_dataUsed.csv

******************************** Window = 100 and n_estimator = 30 ********************************
*************************************** Execution N° 0**********************************


Prequential Evaluation
Evaluating 1 target(s).
Pre-training on 1 sample(s).
Evaluating...

The model was updated by training a new iForest with the version : AnomalyRate



The model was updated by training a new iForest with the version : AnomalyRate

The model was updated by training a new iForest with the version : NDKSWIN

The model was updated by training a new iForest with the version : AnomalyRate

The model was updated by training a new iForest with the version : AnomalyRate
 #------------------- [5%] [14.89s]
The model was updated by training a new iForest with the version : AnomalyRate

The model was updated by training a new iForest with the version : AnomalyRate

The model was updated by training a new iForest with the version : AnomalyRate

The model was updated by training a new iForest with the version : AnomalyRate

The model was updated by training a new iForest with the version : AnomalyRate
 ##------------------ [10%] [33.50s]
The model was updated by training a new iForest with the version : AnomalyRate

The model was updated by training a new iForest with the version : AnomalyRate

The model was updated by training a new iForest with

Prequential Evaluation
Evaluating 1 target(s).
Pre-training on 1 sample(s).
Evaluating...
 #------------------- [5%] [0.01s]


The model was updated by training a new iForest with the version : AnomalyRate
 ##------------------ [10%] [123.10s]
The model was updated by training a new iForest with the version : AnomalyRate
 ###----------------- [15%] [242.31s]
The model was updated by training a new iForest with the version : AnomalyRate

The model was updated by training a new iForest with the version : NDKSWIN
 ####---------------- [20%] [363.04s]
The model was updated by training a new iForest with the version : AnomalyRate
 #####--------------- [25%] [483.26s]
The model was updated by training a new iForest with the version : AnomalyRate
 ######-------------- [30%] [601.83s]
The model was updated by training a new iForest with the version : AnomalyRate
 #######------------- [35%] [722.16s]
The model was updated by training a new iForest with the version : AnomalyRate
 ########------------ [40%] [842.16s]
The model was updated by training a new iForest with the version : AnomalyRate
 #########----------- [45


