# Training with ADFA-Netflow

## Objective

The goal is to generate the effect of zero-day attacks in order to enrich a dataset composed of normal data. In theory, if we can model the effect of attacks, then zero-days appearing at test time will lead to system behaviour that was already observed during training. If this can be achieved, the danger posed by unknown attacks would be greatly reduced, because their effects are equivalent to those of known attacks.

## Approach

The basic assumption is that attacks have a visible effect on a target system; for example, they can be classified as anomalies by looking at the collected data. This assumption underlies all work in the field of anomaly detection.

In general, the approach is to model the normal data in a hyperplane in order to discover the manifold of normal data and, from it, identify the data points that lie outside the manifold. Every anomaly is described by these data points: attacks, known or not, are described by them.

Consequently, if the data points on the normal-data manifold and the anomalous ones can be identified, we can train an Intrusion Detection System in a supervised setting without the risk of unknown attacks.
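
As a rough illustration of this idea (not the pipeline used in the rest of the notebook), the sketch below assumes synthetic 2-D data: normal points come from a Gaussian blob, stand-in "generated" anomalies are drawn uniformly from a wider box and kept only if they fall outside the normal region, and a supervised classifier is trained on both. All names, sizes and thresholds are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Normal data: a tight Gaussian blob around the origin (toy stand-in for normal traffic).
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Stand-in for "generated attacks": uniform samples in a wide box,
# keeping only those farther than 4 units from the origin (outside the normal region).
candidates = rng.uniform(low=-10.0, high=10.0, size=(2000, 2))
generated = candidates[np.linalg.norm(candidates, axis=1) > 4.0]

# Supervised training set: normal points labelled 0, generated points labelled 1.
X_train = np.vstack((normal, generated))
y_train = np.concatenate((np.zeros(len(normal)), np.ones(len(generated))))

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# An "unknown attack" never seen during training: a small cluster far from the normal blob.
unknown_attack = rng.normal(loc=[7.0, -7.0], scale=0.3, size=(100, 2))
detected = clf.predict(unknown_attack).mean()
print(f"fraction of unknown attacks flagged: {detected:.2f}")
```

The classifier never sees the unknown cluster, yet it can flag it because the cluster falls in the region already covered by the generated points. Whether real generated attacks cover that region well is exactly what the experiments below probe.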



# We start!

In [13]:
import sys

In [14]:
#!conda  install  --yes --prefix {sys.prefix} -c conda-forge numpy==1.20
#!conda  install  --yes --prefix {sys.prefix} -c conda-forge pandas
#!conda  install  --yes --prefix {sys.prefix} -c conda-forge matplotlib
#!conda  install  --yes --prefix {sys.prefix} -c conda-forge xgboost
#!conda  install  --yes --prefix {sys.prefix} -c conda-forge scikit-learn

#SOME SOLUTIONS REQUIRE TENSORFLOW 1.5, WHILE ONE PYOD AUTOENCODER WANTS 2.0. THIS CREATES SOME COMPATIBILITY PROBLEMS.
#!conda  install  --yes --prefix {sys.prefix} -c conda-forge tensorflow==2
#!conda  install  --yes --prefix {sys.prefix} -c conda-forge keras
#!conda  install  --yes --prefix {sys.prefix} -c conda-forge  copulas
#!conda  install  --yes --prefix {sys.prefix} -c conda-forge mkl-service

In [15]:
#!{sys.executable} -m pip install pyod
#!{sys.executable} -m pip install seaborn
#!{sys.executable} -m pip install tabgan
#!{sys.executable} -m pip install sbo

In [114]:
import tensorflow as tf
print(tf.__version__)

2.1.0


In [115]:
import sys, os, time, glob, joblib, random, itertools, PIL
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [116]:
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef
from sklearn.manifold import TSNE
from sklearn.ensemble import IsolationForest
from sklearn import ensemble
from sklearn import metrics
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
from matplotlib import cm
from time import time

In [117]:
import tensorflow as tf

In [118]:
#import seaborn as sns
#import pylab as plt
from IPython import display
from itertools import combinations
#necessary to avoid growth of jupyter
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [119]:
PATH='/home/polazzi/datasets/adfa/'
adfa_competitors=open(PATH+"adfa_competitors.csv", "a")

In [26]:
adfa_competitors.write('\ndataset, '
                       + 'generation technique, '
                       + 'classifier, '
                       + 'train-test split, ' 
                       + '# normal in training set, '
                       + '# original attack in training set, '
                       + '# generated attack in training set, '
                       + '# normal in test set, ' 
                       + '# attacks in test set, '
                       + ' features (columns) - may be 1-hotted, '
                       + ' generated attack is manipulated, '
                       + 'tp, tn, fp, fn, accuracy, mcc \n')
adfa_competitors.flush()

In [120]:
# scale and move the coordinates so they fit [0; 1] range
def scale_to_01_range(x):
    # compute the distribution range
    value_range = (np.max(x) - np.min(x))

    # move the distribution so that it starts from zero
    # by subtracting the minimal value from all its values
    starts_from_zero = x - np.min(x)

    # make the distribution fit [0; 1] by dividing by its range
    if(value_range!=0):
        return starts_from_zero / value_range
    else:
        return starts_from_zero
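
A quick check of the two branches, using a compact restatement of the function above and made-up inputs:

```python
import numpy as np

def scale_to_01_range(x):
    # Same logic as the cell above: shift to zero, then divide by the range if it is non-zero.
    value_range = np.max(x) - np.min(x)
    starts_from_zero = x - np.min(x)
    return starts_from_zero / value_range if value_range != 0 else starts_from_zero

print(scale_to_01_range(np.array([2.0, 4.0, 6.0])))   # spread values are mapped onto [0, 1]
print(scale_to_01_range(np.array([5.0, 5.0, 5.0])))   # constant input: range is 0, result is all zeros
```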


### Dataset partitioning

First, the training and test partitions are created with a 60:40 ratio.
The label column is binarized to the values {0, 1}: 0 for normal traffic and 1 for any attack.
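
The two steps can be sketched on a toy frame that reuses the ADFANet column names. The rows are illustrative only, and `random_state` is fixed here purely to make the sketch reproducible (the notebook cell below does not set it).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same column names as ADFANet; the rows are illustrative.
toy = pd.DataFrame({
    "packets":  [13, 11, 271, 1, 1, 6, 1, 1, 2, 4],
    "bytes":    [13005, 1292, 13193, 142, 160, 2877, 46, 161, 300, 512],
    "Duration": [0, 0, 13, 0, 0, 0, 0, 0, 1, 2],
    "label":    ["1b", "normal", "neptune", "normal", "normal",
                 "portsweep", "mailbomb", "normal", "other", "normal"],
})

# Binarize: 0 for 'normal', 1 for every attack class.
y = (toy["label"] != "normal").astype(int)
X = toy.drop(columns="label")

# 60:40 split of the rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, shuffle=True, random_state=0
)
print(X_train.shape, X_test.shape)
```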

In [121]:
ADFANet=pd.read_csv(PATH+"ADFANet_Shuffled_LabelOK.csv")
ADFANet_Y=ADFANet[['label']].copy()  # copy, so the .loc assignments below do not write into a view of ADFANet
print(ADFANet_Y['label'].unique())
ADFANet.info()

ADFANet_X=ADFANet.drop('label', axis=1, inplace=False)

ADFANet_Y.loc[ADFANet_Y['label']=='normal', 'label']=0
ADFANet_Y.loc[ADFANet_Y['label']!=0, 'label']=1
ADFANet_Y=ADFANet_Y.astype(int)
print(ADFANet_Y['label'].unique())

['1b' 'normal' 'neptune' 'other' 'portsweep' 'mailbomb']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132006 entries, 0 to 132005
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   packets   132006 non-null  int64 
 1   bytes     132006 non-null  int64 
 2   Duration  132006 non-null  int64 
 3   label     132006 non-null  object
dtypes: int64(3), object(1)
memory usage: 4.0+ MB
[1 0]


In [122]:
ADFANet_X

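| | packets | bytes | Duration |
|---|---|---|---|
| 0 | 13 | 13005 | 0 |
| 1 | 11 | 1292 | 0 |
| 2 | 271 | 13193 | 13 |
| 3 | 1 | 142 | 0 |
| 4 | 1 | 160 | 0 |
| ... | ... | ... | ... |
| 132001 | 1 | 115 | 0 |
| 132002 | 1 | 143 | 0 |
| 132003 | 1 | 160 | 0 |
| 132004 | 1 | 144 | 0 |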


In [124]:
ADFANet_train_X, ADFANet_test_X, ADFANet_train_Y, ADFANet_test_Y = train_test_split(
    ADFANet_X, ADFANet_Y,
    test_size=0.4,
    shuffle=True
)
print('train_size, test_size')
ADFANet_train_X.shape, ADFANet_test_X.shape #,contamination

train_size, test_size


((79203, 3), (52803, 3))

Let's try XGBoost in a supervised setting.

In [26]:
xb=xgb.XGBClassifier(objective='binary:logistic', use_label_encoder=False)
xb.fit(ADFANet_train_X, ADFANet_train_Y)
predicted=xb.predict(ADFANet_test_X)

In [27]:
accuracy=accuracy_score(ADFANet_test_Y, predicted)
tn, fp, fn, tp = confusion_matrix(ADFANet_test_Y, predicted).ravel()
mcc=matthews_corrcoef(ADFANet_test_Y, predicted)
print("tp, tn, fp, fn, accuracy, mcc:")
tp, tn, fp, fn, accuracy, mcc

tp, tn, fp, fn, accuracy, mcc:


(16427, 36320, 41, 15, 0.9989394541976782, 0.997528715968531)
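
The accuracy and MCC above can be re-derived from the four confusion-matrix counts, which is a handy sanity check when reading the CSV log later. The sketch below uses the standard binary formulas; `mcc_from_counts` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import math

def mcc_from_counts(tp, tn, fp, fn):
    # Matthews correlation coefficient for a binary confusion matrix.
    # Returns 0.0 when the denominator is zero (e.g. one class is never predicted).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

tp, tn, fp, fn = 16427, 36320, 41, 15   # counts printed by the cell above
accuracy = (tp + tn) / (tp + tn + fp + fn)
mcc = mcc_from_counts(tp, tn, fp, fn)
print(accuracy, mcc)
```

The zero-denominator branch matters in this notebook: a classifier that never predicts the attack class has `tp = fp = 0`, and its MCC is reported as 0.0.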

In [34]:
normal_in_training_set=np.unique(ADFANet_train_Y, return_counts=True)[1][0]
original_attack_in_training_set=np.unique(ADFANet_train_Y, return_counts=True)[1][1]
normal_in_test_set=np.unique(ADFANet_test_Y, return_counts=True)[1][0]
attacks_in_test_set=np.unique(ADFANet_test_Y, return_counts=True)[1][1]

adfa_competitors.write('ADFANet, '
                       + 'all known, '
                       + 'XGBoost, '
                       + '0.6--0.4, '
                       + str(normal_in_training_set) + ', '
                       + str(original_attack_in_training_set) + ', ' + '0, '
                       + str(normal_in_test_set) + ', '
                       + str(attacks_in_test_set) + ', '
                       + str(ADFANet_test_X.shape[1]) + ', '
                       + ' NO GENERATED ATTACK, '
                       + f'{tp}, {tn}, {fp}, {fn}, {accuracy:.4f}, {mcc:.4f}\n')
adfa_competitors.flush()

**Now suppose we have no knowledge of the attacks. This means there are no attacks in the training set.**

`contamination` is set to 0 (or to a default value if 0 is not accepted), the attacks are removed from the training set, and from now on they appear only in the test set.

In [125]:
contamination=0.0
trainingX_Y=pd.concat([ADFANet_train_X, ADFANet_train_Y], axis=1)
ADFANet_train_X_no_attack=trainingX_Y[trainingX_Y.label == 0]

In [126]:
ADFANet_train_X_no_attack=ADFANet_train_X_no_attack.drop(['label'], axis=1)
ADFANet_train_X_no_attack.shape

(54682, 3)

In [127]:
# test set with all attacks: the attacks removed from the training set are appended to it
missing_attacks=trainingX_Y[trainingX_Y['label']!=0]
missing_attacks=missing_attacks.drop(['label'], axis=1)
frames=[ADFANet_test_X, missing_attacks]
ADFANet_test_X=pd.concat(frames).to_numpy()
ADFANet_test_Y=np.vstack((ADFANet_test_Y.to_numpy(), np.ones((missing_attacks.to_numpy().shape[0], 1))))

In [128]:
ADFANet_test_Y.shape,ADFANet_test_X.shape

((77324, 1), (77324, 3))

In [129]:
trainingX_Y

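| | packets | bytes | Duration | label |
|---|---|---|---|---|
| 18897 | 1 | 160 | 0 | 0 |
| 116347 | 6 | 2877 | 0 | 1 |
| 85241 | 1 | 160 | 17 | 0 |
| 31513 | 1 | 154 | 0 | 0 |
| 121596 | 1 | 162 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 41797 | 1 | 46 | 0 | 1 |
| 97299 | 1 | 161 | 0 | 0 |
| 24042 | 1 | 46 | 0 | 1 |
| 114728 | 1 | 46 | 0 | 1 |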


In [130]:
np.save(PATH+'/train_X_no_attack', ADFANet_train_X_no_attack )
np.save(PATH+'/test_X', ADFANet_test_X)
np.save(PATH+'/test_Y', ADFANet_test_Y )

### ALAD

Now we test **ALAD (eGAN)**.

See the folder "eGAN/Efficient-GAN-Anomaly-Detection" for information on running eGAN and on its anomaly detection.

We run XGBoost and Isolation Forest on the generated data and compare them with eGAN's internal classifier.

The generated data are located at: '/home/polazzi/eGAN/Efficient-GAN-Anomaly-Detection/data/adfa_generated_attacks.npy'.

In [140]:
generated_attacks=np.load('/home/polazzi/eGAN/Efficient-GAN-Anomaly-Detection/data/adfa_generated_attacks.npy')

In [141]:
ADFANet_train_X_no_attack=np.load(PATH+'/train_X_no_attack.npy')
ADFANet_test_X=np.load(PATH+'/test_X.npy')
ADFANet_test_Y=np.load(PATH+'/test_Y.npy')

First, we use the generated attacks without any kind of optimization or improvement of data quality.

In [142]:
X_ood=generated_attacks
X_ood.shape

(50000, 3)

In [143]:
ADFANet_train_X__no_attacks_augmented=np.vstack((ADFANet_train_X_no_attack, X_ood))
ADFANet_train_Y__no_attacks_augmented=np.vstack((np.zeros((ADFANet_train_X_no_attack.shape[0],1)),
                                                 np.ones((X_ood.shape[0], 1))))

In [144]:
ADFANet_train_X__no_attacks_augmented

array([[ 1.00000000e+00,  1.60000000e+02,  0.00000000e+00],
       [ 1.00000000e+00,  1.60000000e+02,  1.70000000e+01],
       [ 1.00000000e+00,  1.54000000e+02,  0.00000000e+00],
       ...,
       [ 9.35663044e-01,  3.36778412e+01, -1.36466458e-01],
       [ 4.02902484e-01,  2.82565460e+01, -1.42906338e-01],
       [ 6.67986989e-01,  2.80663109e+01,  3.75470012e-01]])

In [145]:
ADFANet_test_Y

array([[1.],
       [0.],
       [0.],
       ...,
       [1.],
       [1.],
       [1.]])

In [146]:
from sklearn.model_selection import GridSearchCV
xb=xgb.XGBClassifier(objective='binary:logistic', use_label_encoder=False)

xb.fit(ADFANet_train_X__no_attacks_augmented, ADFANet_train_Y__no_attacks_augmented)
predicted=xb.predict(ADFANet_test_X)

accuracy=accuracy_score(ADFANet_test_Y, predicted)
tn, fp, fn, tp = confusion_matrix(ADFANet_test_Y, predicted).ravel()
mcc=matthews_corrcoef(ADFANet_test_Y, predicted)

print("tp, tn, fp, fn, accuracy, mcc:")
tp, tn, fp, fn, accuracy, mcc

tp, tn, fp, fn, accuracy, mcc:


(0, 36363, 0, 40961, 0.4702679633748901, 0.0)

In [101]:
normal_in_training_set=ADFANet_train_X_no_attack.shape[0]  # plain int; np.unique() would log it as "[54682]"
original_attack_in_training_set=0
normal_in_test_set=np.unique(ADFANet_test_Y, return_counts=True)[1][0]
attacks_in_test_set=np.unique(ADFANet_test_Y, return_counts=True)[1][1]
generated_attacks=X_ood.shape[0]

adfa_competitors.write('ADFANet, '
                       + 'ALAD (eGAN) '
                       + '- 500 epochs, '
                       + 'XGBoost, '
                       + '0.6--0.4, '
                       + str(normal_in_training_set) + ', '
                       + str(original_attack_in_training_set) + ', '
                       + str(generated_attacks) + ', '
                       + str(normal_in_test_set) + ', '
                       + str(attacks_in_test_set) + ', '
                       + str(ADFANet_test_X.shape[1]) + ', '
                       + ' NO AUGMENTATION, '
                       + f'{tp}, {tn}, {fp}, {fn}, {accuracy:.4f}, {mcc:.4f}\n')
adfa_competitors.flush()

In [147]:
contamination=0.2

total=int(ADFANet_train_X_no_attack.shape[0]*contamination)
if(total > X_ood.shape[0]):
    print("contamination is larger than the available attacks.")
    sys.exit(0)

idx = np.random.randint(X_ood.shape[0], size=total)
attacks=X_ood[idx,:]
ADFANet_train_X__no_attacks_augmented=np.vstack((ADFANet_train_X_no_attack, attacks))
ADFANet_train_Y__no_attacks_augmented=np.vstack((np.zeros((ADFANet_train_X_no_attack.shape[0], 1)),
                                                 np.ones((attacks.shape[0],1))))
                                                 
contamination=contamination/(1+contamination)
                                                 
contamination, attacks.shape

(0.16666666666666669, (10936, 3))
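
The rescaling `contamination/(1+contamination)` converts the ratio of attacks to normal samples into the share of attacks in the augmented set: adding `c*N` attacks to `N` normal samples gives `c*N / (N + c*N) = c / (1 + c)`. A small check with the sizes reported above (variable names are illustrative):

```python
n_normal = 54682          # normal samples in the training set (shape printed earlier)
ratio = 0.2               # attacks added per normal sample

n_attacks = int(n_normal * ratio)                 # number of generated attacks sampled
fraction = ratio / (1 + ratio)                    # value passed to IsolationForest as contamination
actual = n_attacks / (n_normal + n_attacks)       # true share of attacks after stacking

print(n_attacks, fraction, actual)
```

The two fractions differ only by the truncation in `int(...)`.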

In [41]:
# Loop to find the best value for the max_samples argument
"""
max_samples_i = 200
for i in range(5):
    isol_1=IsolationForest(n_estimators=800,
                         max_samples=max_samples_i,
                         contamination=contamination,
                         max_features=1.0,
                         bootstrap=False,
                         n_jobs=12,warm_start=False).fit(ADFANet_train_X__no_attacks_augmented)

    predicted=isol_1.predict(ADFANet_test_X)

    Y=ADFANet_test_Y
    predicted=np.where(predicted==-1, 1, 0)
    accuracy=accuracy_score(Y, predicted)
    mcc=matthews_corrcoef(Y, predicted)
    tn, fp, fn, tp= confusion_matrix(Y, predicted).ravel()

    print(f'Testing with a depth of {np.ceil(np.log2(max_samples_i))}: {tp}, {tn}, {fp}, {fn}, {accuracy}, {mcc}')
    max_samples_i +=200
"""

Testing with a depth of 8.0: 17148, 34393, 1923, 23813, 0.6669642972682687, 0.42332806072739787
Testing with a depth of 9.0: 40961, 34376, 1940, 0, 0.9748955057779158, 0.9506709741496733
Testing with a depth of 10.0: 40961, 34309, 2007, 0, 0.9740284948949881, 0.9490033203440049
Testing with a depth of 10.0: 40961, 34309, 2007, 0, 0.9740284948949881, 0.9490033203440049
Testing with a depth of 10.0: 40961, 34309, 2007, 0, 0.9740284948949881, 0.9490033203440049
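
The "depth" printed on these lines is `ceil(log2(max_samples))`, the tree-depth limit that Isolation Forest derives from the subsample size (scikit-learn appears to compute it this way; treat that as an assumption if your version differs). The sketch below only reproduces the five depth values for the `max_samples` settings visited by the loop:

```python
import numpy as np

# max_samples values visited by the loop: 200, 400, 600, 800, 1000
depths = []
for max_samples in range(200, 1001, 200):
    depth = int(np.ceil(np.log2(max_samples)))
    depths.append(depth)
    print(max_samples, depth)
```

The last three settings share the same depth limit, which is consistent with the depth of 10.0 reported on the last three output lines.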


In [148]:
isol_1=IsolationForest(n_estimators=50,
                     max_samples=400,
                     contamination=contamination,
                     max_features=1.0,
                     bootstrap=False,
                     n_jobs=12,
                     warm_start=False).fit(ADFANet_train_X__no_attacks_augmented)

predicted=isol_1.predict(ADFANet_test_X)

Y=ADFANet_test_Y
predicted=np.where(predicted==-1, 1, 0)
accuracy=accuracy_score(Y, predicted)
mcc=matthews_corrcoef(Y, predicted)
tn, fp, fn, tp= confusion_matrix(Y, predicted).ravel()

tp, tn, fp, fn, accuracy, mcc

(40961, 33687, 2676, 0, 0.9653923749418033, 0.9325221318425488)
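
`IsolationForest.predict` returns `+1` for inliers and `-1` for outliers, whereas the labels in this notebook use `0` for normal and `1` for attack; this is why the predictions are remapped with `np.where`. A minimal sketch on toy data (all values hypothetical):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_toy = np.vstack((rng.normal(0, 1, size=(200, 2)),      # bulk of "normal" points
                   rng.uniform(8, 10, size=(20, 2))))     # a few far-away points

forest = IsolationForest(n_estimators=50, contamination=0.1, random_state=0).fit(X_toy)
raw = forest.predict(X_toy)                # values in {-1, +1}
labels = np.where(raw == -1, 1, 0)         # 1 = anomaly/attack, 0 = normal

print(np.unique(raw), np.unique(labels))
```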

In [149]:
original_attack_in_training_set=0
normal_in_test_set=np.unique(ADFANet_test_Y, return_counts=True)[1][0]
attacks_in_test_set=np.unique(ADFANet_test_Y, return_counts=True)[1][1]
generated_attacks=X_ood.shape[0]

adfa_competitors.write('ADFANet, '
                       + 'ALAD (eGAN) '
                       + '- 50 epochs, '
                       + 'Isolation Forest, '
                       + '0.6--0.4, '
                       + str(normal_in_training_set) + ', '
                       + str(original_attack_in_training_set) + ', '
                       + str(generated_attacks) + ', '
                       + str(normal_in_test_set) + ', '
                       + str(attacks_in_test_set) + ', '
                       + str(ADFANet_test_X.shape[1]) + ', '
                       + ' NO AUGMENTATION, '
                       + f'{tp}, {tn}, {fp}, {fn}, {accuracy:.4f}, {mcc:.4f}\n')
adfa_competitors.flush()

### ARN - Adversarial Regularized Reconstruction for Anomaly Detection and Generation

To generate the attacks and obtain the ARN results, run the notebook './arn/GENERATION_KDDCUP99REV/ADFA-ARN_ADFA_REV_Generation.ipynb'; it saves the file with the generated data to '/home/polazzi/arn/adfa/generated_attacks.npy'.

ARN requires one-hot encoding and a data normalization step; this seems to help the algorithm considerably, but not others such as XGBoost.

The train and test sets also need to be created in the same way. For this reason, we import the same creation method used in ARN.
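
The exact preprocessing lives in the ARN notebook and is not reproduced here. As a generic sketch of one-hot encoding followed by min-max normalization, assume a hypothetical categorical column `proto` that is not part of ADFANet:

```python
import pandas as pd

# Toy frame; 'proto' is a hypothetical categorical feature used only for illustration.
toy = pd.DataFrame({
    "packets": [1, 6, 271, 13],
    "bytes":   [160, 2877, 13193, 13005],
    "proto":   ["tcp", "udp", "tcp", "icmp"],
})

# One-hot encode the categorical column: one 0/1 column per distinct value.
encoded = pd.get_dummies(toy, columns=["proto"], dtype=float)

# Min-max normalization of the numeric columns to [0, 1].
for col in ["packets", "bytes"]:
    col_min, col_max = encoded[col].min(), encoded[col].max()
    encoded[col] = (encoded[col] - col_min) / (col_max - col_min)

print(encoded.shape)
print(encoded.columns.tolist())
```

One-hot encoding is what makes the column count grow: each distinct value of an encoded feature becomes its own column.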

In [97]:
generated_attacks=np.load('/home/polazzi/arn/adfa/generated_attacks.npy')

In [98]:
x_train=np.load(PATH+'/ARN_x_train.npy')
x_test=np.load(PATH+'/ARN_x_test.npy')
y_train=np.load(PATH+'/ARN_y_train.npy')
y_test=np.load(PATH+'/ARN_y_test.npy')

In [99]:
x_train.shape, generated_attacks.shape

((54627, 1119), (54272, 1119))

In [100]:
x_train_augmented=np.vstack((x_train,generated_attacks))

y_train_augmented=np.hstack((y_train,
                             np.ones(generated_attacks.shape[0],)))

In [101]:
y_train_augmented

array([0., 0., 0., ..., 1., 1., 1.])

In [112]:
xb=xgb.XGBClassifier(objective='binary:logistic', 
                     use_label_encoder=False )

xb.fit(x_train_augmented, 
       y_train_augmented)

predicted=xb.predict(x_test)

accuracy=accuracy_score(y_test, predicted)
tn, fp, fn, tp = confusion_matrix(y_test, predicted).ravel()
mcc=matthews_corrcoef(y_test, predicted)

print("tp, tn, fp, fn, accuracy, mcc:")
tp, tn, fp, fn, accuracy, mcc

tp, tn, fp, fn, accuracy, mcc:


(75, 36418, 0, 40886, 0.4716137453314207, 0.029369896688479707)

In [113]:
normal_in_training_set=x_train.shape[0]  # plain int; np.unique() would log it as "[54627]"
original_attack_in_training_set=0
normal_in_test_set=np.unique(y_test, return_counts=True)[1][0]
attacks_in_test_set=np.unique(y_test, return_counts=True)[1][1]
ga=generated_attacks.shape[0]

adfa_competitors.write('ADFANet, '
                       + 'ARN '
                       + '- 500 epochs, '
                       + 'XGBoost, '
                       + '0.6--0.4, '
                       + str(normal_in_training_set) + ', '
                       + str(original_attack_in_training_set) + ', '
                       + str(ga) + ', ' 
                       + str(normal_in_test_set) + ', '
                       + str(attacks_in_test_set) + ', '
                       + str(x_test.shape[1]) + ', '  # feature count of the one-hot ARN test set, not of ADFANet_test_X
                       + ' NO AUGMENTATION, '
                       #'{}, {}, {}, {}, {:3f}, {:3f} \n'.format(tp, tn, fp, fn, accuracy, mcc)
                       + f'{tp}, {tn}, {fp}, {fn}, {accuracy:.4f}, {mcc:.4f}\n')
adfa_competitors.flush()

In [103]:
contamination=0.2

total=int(x_train.shape[0]*contamination)
if(total > generated_attacks.shape[0]):
    print("contamination is larger than the available attacks.")
    sys.exit(0)
    
np.random.shuffle(generated_attacks)

idx = np.random.randint(generated_attacks.shape[0], size=total)
attacks=generated_attacks[idx,:]
x_train_augmented=np.vstack((x_train, attacks))
y_train_augmented=np.hstack((y_train,np.ones((attacks.shape[0],))))
                                                 
contamination=contamination/(1+contamination)
                                                 
contamination, attacks.shape

(0.16666666666666669, (10925, 1119))

In [110]:
isol_1=IsolationForest(n_estimators=50,
                         max_samples= 400,
                         contamination=contamination,
                         max_features=1.0,
                         bootstrap=False,
                         n_jobs=12,warm_start=False).fit(x_train_augmented)

predicted=isol_1.predict(x_test)

Y=y_test
predicted=np.where(predicted==-1, 1, 0)
accuracy=accuracy_score(Y, predicted)
mcc=matthews_corrcoef(Y, predicted)
tn, fp, fn, tp= confusion_matrix(Y, predicted).ravel()

print(f'{tp}, {tn}, {fp}, {fn}, {accuracy}, {mcc}')

40961, 34378, 2040, 0, 0.9736362578994301, 0.9482617996607893


In [111]:
normal_in_training_set=x_train.shape[0]  # plain int; np.unique() would log it as "[54627]"
original_attack_in_training_set=0
normal_in_test_set=np.unique(y_test, return_counts=True)[1][0]
attacks_in_test_set=np.unique(y_test, return_counts=True)[1][1]
ga=generated_attacks.shape[0]

adfa_competitors.write('ADFANet, '
                       + 'ARN '
                       + '- 500 epochs, '
                       + 'Isolation Forest, '
                       + '0.6--0.4, '
                       + str(normal_in_training_set) + ', '
                       + str(original_attack_in_training_set) + ', '
                       + str(ga) + ', '
                       + str(normal_in_test_set) + ', '
                       + str(attacks_in_test_set) + ', '
                       + str(x_test.shape[1]) + ', '  # feature count of the one-hot ARN test set, not of ADFANet_test_X
                       + ' NO AUGMENTATION, '
                       + f'{tp}, {tn}, {fp}, {fn}, {accuracy:.4f}, {mcc:.4f}\n')
adfa_competitors.flush()