# Results analysis

This notebook analyses the dataset produced by several simulations run on
different datasets with three different classifiers: an **SVM**, a
**MultiLayer Perceptron** and a **Random Forest**. No performance benchmarks
were run; only the quality of the results obtained is considered here.


In [1]:
import pandas as pd

# Load the simulation results; the path is relative to the notebook.
df = pd.read_csv("../datasets/test.csv")
# DataFrame.info() prints its summary and returns None,
# so it is called directly instead of being wrapped in display().
df.info()
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   dataset_id       1200 non-null   int64  
 1   simulation_id    1200 non-null   int64  
 2   samples          1200 non-null   int64  
 3   features         1200 non-null   int64  
 4   classes          1200 non-null   int64  
 5   clusters         1200 non-null   int64  
 6   population_size  1200 non-null   int64  
 7   point            1200 non-null   int64  
 8   class            1200 non-null   int64  
 9   target           1200 non-null   int64  
 10  model            1200 non-null   object 
 11  min_fitness      1200 non-null   float64
 12  mean_fitness     1200 non-null   float64
 13  fitness_std      1200 non-null   float64
 14  max_fitness      1200 non-null   float64
 15  accuracy         1200 non-null   float64
dtypes: float64(5), int64(10), object(1)
memory usage: 150.1+ KB

Unnamed: 0,dataset_id,simulation_id,samples,features,classes,clusters,population_size,point,class,target,model,min_fitness,mean_fitness,fitness_std,max_fitness,accuracy
0,0,0,10,2,2,1,1000,0,1,0,SVC,-0.855264,-0.696396,0.092766,-0.547943,1.0
1,0,0,10,2,2,1,1000,0,1,1,SVC,-0.197379,-0.104045,0.057211,-0.000054,1.0
2,0,0,10,2,2,1,1000,1,0,0,SVC,-0.143339,-0.073661,0.039783,-0.000221,1.0
3,0,0,10,2,2,1,1000,1,0,1,SVC,-0.572780,-0.484936,0.047278,-0.416159,1.0
4,0,0,10,2,2,1,1000,2,1,0,SVC,-1.160755,-1.096355,0.035346,-1.031373,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1195,1,4,10,2,2,2,4000,7,0,1,MLPClassifier,-0.685313,-0.562300,0.068521,-0.445267,1.0
1196,1,4,10,2,2,2,4000,8,0,0,MLPClassifier,-0.178795,-0.093622,0.051011,-0.000323,1.0
1197,1,4,10,2,2,2,4000,8,0,1,MLPClassifier,-0.722029,-0.567538,0.083668,-0.448194,1.0
1198,1,4,10,2,2,2,4000,9,0,0,MLPClassifier,-0.127930,-0.065308,0.035431,-0.000132,1.0


Each row of the dataset therefore contains:

- **simulation_id**: identifies a single simulation run with a given set of
  parameters. Since every simulation is repeated 10 times, each repetition has
  an identifier from 0 to 9.
- **dataset_id**: unique ID of each dataset analysed.
- **point**: the points of a dataset are simply numbered from $0$ to $N-1$,
  where $N$ is the total number of points in the dataset.
- **class**: class of the point.
- **target**: target class of the genetic algorithm.
- **model**: the classifier model used.
- **min/mean/max_fitness**: minimum, mean and maximum fitness values extracted
  from the hall of fame produced by each run of the genetic algorithm.
- **fitness_std**: standard deviation of the fitness values of the final
  synthetic population.
- **accuracy**: number of individuals in the hall of fame classified as the
  target class, divided by the total number of individuals in the hall of
  fame.
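The accuracy definition in the list above can be illustrated with a minimal sketch. The function name `hall_of_fame_accuracy`, the `predict` callable and the toy data are assumptions made for illustration only; they are not taken from the simulation code, which is not shown in this notebook.

```python
from typing import Callable, Sequence


def hall_of_fame_accuracy(
    hall_of_fame: Sequence[Sequence[float]],
    predict: Callable[[Sequence[float]], int],
    target: int,
) -> float:
    """Fraction of hall-of-fame individuals that `predict` assigns to `target`."""
    if not hall_of_fame:
        raise ValueError("hall of fame is empty")
    hits = sum(1 for individual in hall_of_fame if predict(individual) == target)
    return hits / len(hall_of_fame)


# Hypothetical stand-in for a trained classifier:
# class 1 if the first feature is positive, class 0 otherwise.
def toy_predict(individual: Sequence[float]) -> int:
    return int(individual[0] > 0)


# Hypothetical hall of fame with four two-dimensional individuals.
toy_hall_of_fame = [(0.5, 1.0), (0.2, -0.3), (-0.1, 0.4), (0.9, 0.0)]
toy_accuracy = hall_of_fame_accuracy(toy_hall_of_fame, toy_predict, target=1)
print(toy_accuracy)  # 3 of the 4 individuals are predicted as class 1
```

With a real model, `predict` would wrap the classifier's own prediction call for a single individual.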

Each row can therefore be seen as a single run of the genetic algorithm on a
specific point and for a specific target class.

The fitness values are simply the distance of each synthetic point from the
point under examination, multiplied by $-1$. We can therefore convert the
three fitness columns into distances by multiplying them by $-1$ again, which
gives values that are easier to interpret.
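One consequence of this sign flip is worth keeping in mind when reading the converted columns: negation reverses the ordering, so the smallest fitness corresponds to the largest distance and vice versa. The short sketch below shows this on made-up fitness values (the numbers are hypothetical, not taken from the dataset).

```python
# Hypothetical fitness values for one hall of fame (fitness = -distance).
fitness = [-0.85, -0.70, -0.55]

# Flipping the sign recovers the distances.
distance = [-f for f in fitness]

# Negation reverses the ordering of the values.
smallest_distance = min(distance)  # comes from the largest (best) fitness
largest_distance = max(distance)   # comes from the smallest (worst) fitness

print(smallest_distance, largest_distance)
```

In other words, a column obtained by negating `min_fitness` holds the largest distance of the hall of fame, and one obtained by negating `max_fitness` holds the smallest.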


In [2]:
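# Fitness is the negated distance, so flipping the sign recovers distances.
# fitness_std is a spread and is unaffected by the sign flip.
df[["min_fitness", "mean_fitness", "max_fitness"]] *= -1.0
# Note: negation reverses the ordering, so the column renamed to min_distance
# (from min_fitness) holds the largest distance and max_distance the smallest.
df = df.rename(
    columns={
        "min_fitness": "min_distance",
        "mean_fitness": "mean_distance",
        "fitness_std": "distance_std",
        "max_fitness": "max_distance",
    }
)
df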

Unnamed: 0,dataset_id,simulation_id,samples,features,classes,clusters,population_size,point,class,target,model,min_distance,mean_distance,distance_std,max_distance,accuracy
0,0,0,10,2,2,1,1000,0,1,0,SVC,0.855264,0.696396,0.092766,0.547943,1.0
1,0,0,10,2,2,1,1000,0,1,1,SVC,0.197379,0.104045,0.057211,0.000054,1.0
2,0,0,10,2,2,1,1000,1,0,0,SVC,0.143339,0.073661,0.039783,0.000221,1.0
3,0,0,10,2,2,1,1000,1,0,1,SVC,0.572780,0.484936,0.047278,0.416159,1.0
4,0,0,10,2,2,1,1000,2,1,0,SVC,1.160755,1.096355,0.035346,1.031373,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1195,1,4,10,2,2,2,4000,7,0,1,MLPClassifier,0.685313,0.562300,0.068521,0.445267,1.0
1196,1,4,10,2,2,2,4000,8,0,0,MLPClassifier,0.178795,0.093622,0.051011,0.000323,1.0
1197,1,4,10,2,2,2,4000,8,0,1,MLPClassifier,0.722029,0.567538,0.083668,0.448194,1.0
1198,1,4,10,2,2,2,4000,9,0,0,MLPClassifier,0.127930,0.065308,0.035431,0.000132,1.0


We now merge the results of the simulations that were run with the same
parameters. Specifically, we drop the _simulation_id_ column and aggregate the
distance and accuracy columns as follows:

- _min_distance_: the minimum over all values is taken.
- _mean_distance_: the mean over all values is computed.
- _max_distance_: the maximum over all values is taken.
- _distance_std_: the mean of the standard deviations is computed.
- _accuracy_: the mean of the accuracies is computed.
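The aggregation rules listed above can be checked on a tiny example before applying them to the full dataset. The sketch below uses a made-up frame with two repetitions of the same configuration and pandas named aggregation; the values and the reduced set of grouping columns (`point`, `target`) are assumptions chosen to keep the example short.

```python
import pandas as pd

# Two hypothetical repetitions (simulation_id 0 and 1) of the same configuration.
toy = pd.DataFrame(
    {
        "point": [0, 0],
        "target": [1, 1],
        "simulation_id": [0, 1],
        "min_distance": [0.2, 0.1],
        "mean_distance": [0.5, 0.3],
        "max_distance": [0.9, 0.7],
        "distance_std": [0.10, 0.20],
        "accuracy": [1.0, 0.5],
    }
)

# Drop the repetition identifier and collapse the repetitions into one row.
merged = (
    toy.drop(columns="simulation_id")
    .groupby(["point", "target"], as_index=False)
    .agg(
        min_distance=("min_distance", "min"),
        mean_distance=("mean_distance", "mean"),
        max_distance=("max_distance", "max"),
        distance_std=("distance_std", "mean"),
        accuracy=("accuracy", "mean"),
    )
)
print(merged)
```

The two repetitions collapse into a single row whose `accuracy` is the average of the two runs, which only works if `accuracy` is aggregated rather than used as a grouping key.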


In [3]:
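df = (
    # Group by every parameter except simulation_id, which is dropped.
    df.groupby(
        [
            "dataset_id",
            "samples",
            "features",
            "classes",
            "clusters",
            "population_size",
            "point",
            "class",
            "target",
            "model",
        ]
    )
    # accuracy is averaged like the other metrics instead of being a grouping key.
    .agg(
        {
            "accuracy": "mean",
            "min_distance": "min",
            "mean_distance": "mean",
            "distance_std": "mean",
            "max_distance": "max",
        }
    )
    .reset_index()
)

df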

Unnamed: 0,dataset_id,samples,features,classes,clusters,population_size,point,class,target,model,accuracy,min_distance,mean_distance,distance_std,max_distance
0,0,10,2,2,1,1000,0,1,0,MLPClassifier,1.0,0.776474,0.610789,0.100287,0.472115
1,0,10,2,2,1,1000,0,1,0,SVC,1.0,0.830058,0.690544,0.087986,0.548037
2,0,10,2,2,1,1000,0,1,1,MLPClassifier,1.0,0.175006,0.099295,0.054942,0.000054
3,0,10,2,2,1,1000,0,1,1,SVC,1.0,0.187042,0.103112,0.056527,0.000054
4,0,10,2,2,1,1000,1,0,0,MLPClassifier,1.0,0.139559,0.077683,0.041085,0.000221
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
235,1,10,2,2,2,4000,8,0,1,SVC,1.0,0.686232,0.561219,0.074872,0.439107
236,1,10,2,2,2,4000,9,0,0,MLPClassifier,1.0,0.127930,0.067002,0.036600,0.000132
237,1,10,2,2,2,4000,9,0,0,SVC,1.0,0.126894,0.067265,0.036705,0.000132
238,1,10,2,2,2,4000,9,0,1,MLPClassifier,1.0,0.683916,0.600463,0.058848,0.542617


The next step could be to aggregate the results of simulations that use
different sizes of the synthetic population. Before doing so, however, these
results need to be analysed.

For this analysis the simulations must be grouped by dataset and by model.


In [10]:
# Summarise accuracy for every dataset, model and population size.
(
    df.groupby(["dataset_id", "model", "population_size"])["accuracy"]
    .agg(["min", "mean", "max", "std"])
    .reset_index()
)

Unnamed: 0,dataset_id,model,population_size,min,mean,max,std
0,0,MLPClassifier,1000,1.0,1.0,1.0,0.0
1,0,MLPClassifier,2000,1.0,1.0,1.0,0.0
2,0,MLPClassifier,4000,1.0,1.0,1.0,0.0
3,0,SVC,1000,1.0,1.0,1.0,0.0
4,0,SVC,2000,1.0,1.0,1.0,0.0
5,0,SVC,4000,1.0,1.0,1.0,0.0
6,1,MLPClassifier,1000,1.0,1.0,1.0,0.0
7,1,MLPClassifier,2000,1.0,1.0,1.0,0.0
8,1,MLPClassifier,4000,1.0,1.0,1.0,0.0
9,1,SVC,1000,1.0,1.0,1.0,0.0
