# Network Attack Detection using Machine Learning: Evasion Attack

## Dataset description

The data is in the NetFlow v9 format defined by Cisco.

* **train_set**: ~2 million flows, used for training the model
* **test_set**: ~4 million flows, used for testing the model

### Columns
* **FLOW_ID**: A unique identifier for the flow
* **PROTOCOL_MAP**: A string representing the protocol used in the flow, possible values include "ICMP", "TCP", "UDP", "IGMP", "GRE", "ESP", "AH", "EIGRP", "OSPF", "PIM", "IPV6-ICMP", "IPV6-IP", "IPV6-ROUTE", "IPV6-FRAG", "IPV6-NONXT", "IPV6-OPTS", and others.
* **L4_SRC_PORT**: The source port number in the flow, possible values range from 0 to 65535.
* **IPV4_SRC_ADDR**: The source IPv4 address in the flow, represented as a string in dotted decimal notation (e.g., "192.168.0.1").
* **L4_DST_PORT**: The destination port number in the flow, possible values range from 0 to 65535.
* **IPV4_DST_ADDR**: The destination IPv4 address in the flow, represented as a string in dotted decimal notation (e.g., "192.168.0.2").
* **FIRST_SWITCHED**: The time at which the flow started, measured in seconds since the epoch (January 1, 1970).
* **FLOW_DURATION_MILLISECONDS**: The duration of the flow in milliseconds.
* **LAST_SWITCHED**: The time at which the flow ended, measured in seconds since the epoch (January 1, 1970).
* **PROTOCOL**: The protocol used in the flow, possible values include 1 (ICMP), 6 (TCP), 17 (UDP), and others.
* **TCP_FLAGS**: The TCP flags set in the flow, represented as a binary string (e.g., "100101").
* **TCP_WIN_MAX_IN**: The maximum advertised window size (in bytes) for incoming traffic.
* **TCP_WIN_MAX_OUT**: The maximum advertised window size (in bytes) for outgoing traffic.
* **TCP_WIN_MIN_IN**: The minimum advertised window size (in bytes) for incoming traffic.
* **TCP_WIN_MIN_OUT**: The minimum advertised window size (in bytes) for outgoing traffic.
* **TCP_WIN_MSS_IN**: The maximum segment size (in bytes) for incoming traffic.
* **TCP_WIN_SCALE_IN**: The window scale factor for incoming traffic.
* **TCP_WIN_SCALE_OUT**: The window scale factor for outgoing traffic.
* **SRC_TOS**: The Type of Service (ToS) value for the source IP address.
* **DST_TOS**: The Type of Service (ToS) value for the destination IP address.
* **TOTAL_FLOWS_EXP**: The total number of expected flows.
* **MIN_IP_PKT_LEN**: The minimum length (in bytes) of IP packets in the flow.
* **MAX_IP_PKT_LEN**: The maximum length (in bytes) of IP packets in the flow.
* **TOTAL_PKTS_EXP**: The total number of expected packets in the flow.
* **TOTAL_BYTES_EXP**: The total number of expected bytes in the flow.
* **IN_BYTES**: The number of bytes received in the flow.
* **IN_PKTS**: The number of packets received in the flow.
* **OUT_BYTES**: The number of bytes sent in the flow.
* **OUT_PKTS**: The number of packets sent in the flow.
* **ANALYSIS_TIMESTAMP**: The time at which the flow was analyzed, measured in seconds since the epoch (January 1, 1970).
* **ANOMALY**: A binary flag indicating whether the flow contains an anomaly (1 = true, 0 = false).
* **ALERT**: (<u>only available in training set</u>) The kind of attack detected on the flow. These are the possible values:
  - **None**: No attack has been detected
  - **Port scanning**: The flow is a port scanning attack
  - **Denial of Service**: The flow is a DoS attack
  - **Malware**: The flow is a malware attack
* **ID**: A unique identifier for the flow.
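
The **PROTOCOL** and **PROTOCOL_MAP** columns carry the same information in two encodings. The following dependency-free sketch shows one way to check that they agree, using the IANA protocol numbers for some of the names listed above. The `sample_flows` pairs and the helper `protocol_name` are hypothetical and only illustrate the idea; the real dataset may spell the names differently.

```python
# IANA-assigned IP protocol numbers for some of the protocol names listed above.
IANA_PROTOCOLS = {
    1: "ICMP", 2: "IGMP", 6: "TCP", 17: "UDP", 47: "GRE",
    50: "ESP", 51: "AH", 88: "EIGRP", 89: "OSPF", 103: "PIM",
}

def protocol_name(number):
    """Return the protocol name for an IP protocol number, or 'OTHER' if unlisted."""
    return IANA_PROTOCOLS.get(number, "OTHER")

# Hypothetical (PROTOCOL, PROTOCOL_MAP) pairs, as they might appear in flow records.
sample_flows = [(6, "TCP"), (17, "udp"), (1, "ICMP")]

# Collect the pairs whose numeric and textual encodings disagree (case-insensitive).
mismatches = [
    (number, name)
    for number, name in sample_flows
    if protocol_name(number) != name.upper()
]
print("Mismatching pairs:", mismatches)
```

If the two columns always agree, one of them is redundant for modelling purposes.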

## Tested models

* K-Nearest Neighbors (KNN)
* Support Vector Machine Classifier (SVC) with RBF (Radial Basis Function) kernel
* Pipeline with Principal Component Analysis (PCA) and Support Vector Machine Classifier (SVC)
* Bagging Classifier (based on SVC with RBF kernel)
* Random Forest Classifier
* Extra Trees Classifier
* Neural Network (MLPClassifier)

## 1. Datasets loading

### 1.1. Importing the basic libraries

In [None]:
# Load data processing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
!pip install datashader
import datashader as ds
import datashader.transfer_functions as tf
import datashader.bundling as bd
import colorcet
import matplotlib.colors
import matplotlib.cm
import bokeh.plotting as bpl
import bokeh.transform as btr
import holoviews as hv
import holoviews.operation.datashader as hd
%matplotlib inline

### 1.2. Importing machine learning libraries

In [None]:
# Load machine learning libraries
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

### 1.3. Development mode flag

In [None]:
# If True, only 3% of the data is used for training and testing the various models
_DEVMODE = True

### 1.4. Loading the datasets

In [None]:
# Loading the data from the train and test files
# train_df = pd.read_csv('/scratch/duh6ae/NetSecProject/train_net.csv')
# test_df = pd.read_csv('/scratch/duh6ae/NetSecProject/test_net.csv')
train_size = 0.75
df = pd.read_csv('/scratch/duh6ae/NetSecProject/train_net.csv')
df = df.sample(frac=1).reset_index(drop=True)

# Split the data
train_df = df[:int(train_size * len(df))]
test_df = df[int(train_size * len(df)):]

In [None]:
test_df

### 1.5. Loaded datasets information

In [None]:
# Print total size
print("Test set size: ", test_df.shape)
print("Train set size: ", train_df.shape)

# Value counts
train_df['ALERT'].value_counts()

### 1.6. Dataset development mode reduction

In [None]:
if _DEVMODE:
    train_df = train_df.sample(frac=0.03, random_state=1)
    test_df = test_df.sample(frac=0.03, random_state=1)

    # Print total size
    print("Test set size: ", test_df.shape)
    print("Train set size: ", train_df.shape)


## 2. Data preprocessing

### 2.1. Print datasets information

In [None]:
train_df.info()

### 2.2. Print datasets shape

In [None]:
# Show information about the data
def printInfo(df):
    print('Dataframe shape: ', df.shape)
    print('Dataframe columns: ', df.columns)

print('==== Train data ====')
printInfo(train_df)
print()
print('==== Test data ====')
printInfo(test_df)

### 2.3. Show training dataset structure

In [None]:
train_df.head()

### 2.4. Check for missing values

In [None]:
# Check for missing values
print('==== Train data ====')
print(train_df.isnull().sum())
print()
print('==== Test data ====')
print(test_df.isnull().sum())
print()

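### 2.5. Fill missing **ANOMALY** and **ALERT** values

In [None]:
# Fill the missing ANOMALY values with 0 (no anomaly) and the missing ALERT values with 'None' (no attack)
# Work on explicit copies: train_df and test_df are slices of df, so in-place fills on them are unreliable
train_df = train_df.copy()
test_df = test_df.copy()
train_df['ANOMALY'] = train_df['ANOMALY'].fillna(0)
test_df['ANOMALY'] = test_df['ANOMALY'].fillna(0)
train_df['ALERT'] = train_df['ALERT'].fillna('None')
test_df['ALERT'] = test_df['ALERT'].fillna('None')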

## 3. Data analysis

### 3.1. Data types

In [None]:
train_df.dtypes

### 3.2. Observing the distribution of the target variable

In [None]:
# Show the distribution of the target variable
sns.countplot(x='ALERT', data=train_df)

In [None]:
# Count the number of unique protocol_maps
train_df['PROTOCOL_MAP'].value_counts()

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(20, 5))

# seaborn countplots
sns.countplot(x='ANOMALY', data=train_df, ax=axs[0]).set(title='ANOMALY')


# Seaborn countplot for the 'PROTOCOL_MAP' column, with enough space for the labels
sns.countplot(x='PROTOCOL_MAP', data=train_df, ax=axs[1]).set(title='PROTOCOL_MAP')

# Boxplot for L4_SRC_PORT to understand the distribution of the data
sns.boxplot(
    x='L4_SRC_PORT', data=train_df, ax=axs[2],
    notch=True, showcaps=True,
    flierprops={"marker": "x"}, # Change the outlier marker
    showmeans=True, # Show the mean
    boxprops={"facecolor": (.4, .6, .8, .5)},
  ).set(title='L4_SRC_PORT')

### 3.3. Protocol distribution in relation to the kind of attack

In [None]:
# Show protocol_map distribution for kind of ALERT
sns.countplot(x='PROTOCOL_MAP', hue='ALERT', data=train_df)

### 3.4. Unique hosts in dataset

In [None]:
# Find unique hosts (IP addresses) in the train data
train_src_hosts = train_df['IPV4_SRC_ADDR'].unique()
train_dst_hosts = train_df['IPV4_DST_ADDR'].unique()
train_hosts = np.union1d(train_src_hosts, train_dst_hosts)

# Print the number of unique hosts in the train data
print('Number of unique hosts in the train data: ', len(train_hosts))

# Find unique hosts (IP addresses) in the test data
test_src_hosts = test_df['IPV4_SRC_ADDR'].unique()
test_dst_hosts = test_df['IPV4_DST_ADDR'].unique()
test_hosts = np.union1d(test_src_hosts, test_dst_hosts)

# Relative difference (floored, in percent) between the number of unique hosts in the train and test data
ratio = math.floor((1.0-len(test_hosts)/len(train_hosts)) * 100)

# Print the number of unique hosts in the test data
print("Number of unique hosts in the test data: {} (~{}% smaller)".format(len(test_hosts), ratio))
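
The ratio printed in this cell compares only the sizes of the two host sets. A complementary question is how many test hosts never appear in the training data. The sketch below answers it with plain Python sets; the helper `unseen_host_share` and the example addresses are hypothetical.

```python
def unseen_host_share(train_hosts, test_hosts):
    """Fraction of distinct test hosts that do not occur among the train hosts."""
    train_set, test_set = set(train_hosts), set(test_hosts)
    if not test_set:
        return 0.0
    unseen = test_set - train_set
    return len(unseen) / len(test_set)

# Hypothetical host lists standing in for train_hosts and test_hosts.
train_hosts_demo = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
test_hosts_demo = ["10.0.0.1", "10.0.0.9"]

share = unseen_host_share(train_hosts_demo, test_hosts_demo)
print(f"{share:.0%} of the test hosts are unseen in the train data")
```

In the notebook the same function could be called with `train_hosts` and `test_hosts`, since NumPy arrays of strings convert to sets directly.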


### 3.5. Distribution analysis using pairplot

In [None]:
# Select the columns to show in the pairplot
train_df_columns = train_df.copy()[['L4_SRC_PORT', 'L4_DST_PORT', 'PROTOCOL', 'ANOMALY', 'ALERT']]

# Distribution analysis using pairplot
sns.pairplot(train_df_columns, hue='ALERT')

### 3.6. Remove useless columns and create dummies

In [None]:
# Columns to drop
revoked_columns = [
  'FLOW_ID', # Completely random
  'ID', # Completely random
  'ANALYSIS_TIMESTAMP', # Completely random
  'IPV4_SRC_ADDR', # Not useful for the model
  'IPV4_DST_ADDR', # Not useful for the model
  'PROTOCOL_MAP', # There is a numerical column for the protocol
  'MIN_IP_PKT_LEN', # Always 0 since it is a minimum value
  'MAX_IP_PKT_LEN', # Always 0 (maybe it means that the packet length is unbounded?)
  'TOTAL_PKTS_EXP', # Always 0
  'TOTAL_BYTES_EXP', # Always 0
]

# Create dummy columns for the ALERT column
alert_dummies = pd.get_dummies(train_df['ALERT'], prefix='ALERT', drop_first=True)

# Copy + drop the revoked columns
train_df = train_df.copy().drop(revoked_columns, axis=1)

### 3.7. Correlation heatmap

In [None]:
# Correlation heatmap using pandas
corr = pd.concat([train_df.drop('ALERT', axis=1), alert_dummies], axis=1).corr(
  numeric_only=True, # Only consider numeric columns
)

# Correlation heatmap using seaborn + make annotations fit the heatmap
plt.figure(figsize=(20, 20))
sns.heatmap(corr, annot=True, fmt=".1f", cmap="YlGnBu")

## 4. Dataset preparation


### 4.1. Splitting the training set

In [None]:
def split_maintain_distribution(X, y):
  sss=StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=9)
  indexes = sss.split(np.zeros(X.shape[0]), y)
  train_indices, test_indices = next(indexes)
  return X.iloc[train_indices], X.iloc[test_indices], y.iloc[train_indices], y.iloc[test_indices]
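
To make the effect of the stratified split concrete, here is a dependency-free sketch of the underlying idea: shuffle the indices of each class separately and cut the same fraction from every class. It is a simplified illustration, not scikit-learn's implementation of `StratifiedShuffleSplit`, and the label counts are made up.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=9):
    """Split indices so that every class contributes about `test_size` of its rows to the test part."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical, heavily imbalanced label column.
labels = ["None"] * 80 + ["Port Scanning"] * 15 + ["Denial of Service"] * 5

train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("Train:", dict(train_counts))
print("Validation:", dict(test_counts))
```

Both parts keep the 80/15/5 proportions, which a plain random split does not guarantee for rare classes.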

In [None]:
X = train_df.drop('ALERT', axis=1)
#print(X.head())
Y = train_df['ALERT']
print(Y.head())
X_train, X_val, y_train, y_val = split_maintain_distribution(X, Y)

In [None]:
X_test = test_df.drop('ALERT', axis=1)
#print(X_test.head())
y_test = test_df['ALERT']
print(y_test.head())

#### 4.1.1. Check if the datasets are balanced

In [None]:
# Print distribution of the target variable in the train and validation sets
print('Train set distribution:')
print(y_train.value_counts(normalize=True))
print()
print('Validation set distribution:')
print(y_val.value_counts(normalize=True))

### 4.2. Data scaling

In [None]:
# Fit scaler on train set
scaler = StandardScaler()
fitter = scaler.fit(X_train)

# Scale train and validation sets
x_train_scaled = fitter.transform(X_train)
x_validation_scaled = fitter.transform(X_val)

# Convert to pandas dataframe
df_feat_train = pd.DataFrame(x_train_scaled, columns=X_train.columns)
df_feat_validation = pd.DataFrame(x_validation_scaled, columns=X_val.columns)
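
The cell fits the scaler on the training split only and reuses its statistics for the validation split. The sketch below reproduces that idea for a single column with the standard library, assuming the population standard deviation that `StandardScaler` uses. The byte counts are made up, and the zero-variance case that scikit-learn handles is left out.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from the training values."""
    return mean(column), pstdev(column)

def transform(column, params):
    """Standardize values with previously learned statistics."""
    mu, sigma = params
    return [(value - mu) / sigma for value in column]

# Hypothetical byte counts for a training and a validation split.
train_bytes = [100.0, 200.0, 300.0]
val_bytes = [200.0, 400.0]

params = fit_scaler(train_bytes)           # statistics come from the train split only
val_scaled = transform(val_bytes, params)  # validation values reuse those statistics
print(val_scaled)
```

A validation value equal to the training mean maps to 0, regardless of how the validation split itself is distributed.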

## 5. Feature selection

### 5.1. Create model and fit it

In [None]:
# Random Forest Classifier
rfc = RandomForestClassifier(n_estimators=100) # 100 trees = default value

# Fit the model
rfc.fit(x_train_scaled, y_train)

### 5.2. Get feature importances

In [None]:
# Print features importance
feature_importances = pd.DataFrame(
    rfc.feature_importances_,
    index=X_train.columns,
    columns=['importance']
).sort_values('importance', ascending=False)
print(feature_importances)

### 5.3. Plot feature importances

In [None]:
# Plot feature importance
plt.figure(figsize=(20, 10))
plt.xticks(rotation=-90)
sns.barplot(x=feature_importances.index, y=feature_importances['importance'])

### 5.4. Select most important features

In [None]:
MIN_IMPORTANCE_THRESHOLD = 0.02

In [None]:
# Select all columns with importance > 0.02
COLUMNS = feature_importances[feature_importances['importance'] > MIN_IMPORTANCE_THRESHOLD].index
COLUMNS
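
As a small illustration of the selection rule, the sketch below applies the same threshold to a plain dictionary of importances. The scores are invented for the example and are not results from this notebook.

```python
MIN_IMPORTANCE_THRESHOLD = 0.02

# Hypothetical importances; random forest importances are normalized to sum to 1.
importances = {
    "L4_DST_PORT": 0.40,
    "TCP_WIN_MAX_IN": 0.30,
    "IN_BYTES": 0.285,
    "SRC_TOS": 0.015,
}

# Keep the features above the threshold, most important first.
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
selected = [name for name, score in ranked if score > MIN_IMPORTANCE_THRESHOLD]
print(selected)
```

Because the scores sum to 1, the threshold is easiest to interpret relative to `1 / number_of_features`, the score each feature would get if all were equally important.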

### 5.5. Reprepare the dataset with the selected features

#### 5.5.1. Split again the training set into training and validation sets (with new features)

In [None]:
X_train, X_val, y_train, y_val = split_maintain_distribution(
  train_df[COLUMNS],
  train_df['ALERT']
)

#### 5.5.2. Scale again the train and validation sets (with new features)

In [None]:
# Fit scaler on train set
scaler = StandardScaler()
fitter = scaler.fit(X_train)

# Scale train and validation sets
x_train_scaled = fitter.transform(X_train)
x_validation_scaled = fitter.transform(X_val)

# Convert to pandas dataframe
df_feat_train = pd.DataFrame(x_train_scaled, columns=X_train.columns)
df_feat_validation = pd.DataFrame(x_validation_scaled, columns=X_val.columns)

#### 5.5.3. Scale also the test set (with new features)

In [None]:
# Reuse the scaler fitted on the training set so the test features share the same scale
x_test_scaled = fitter.transform(test_df[COLUMNS])
# Convert to pandas dataframe
df_feat_test = pd.DataFrame(x_test_scaled, columns=test_df[COLUMNS].columns)

## Evasion attack (Boundary Attack)

In [None]:
!pip install adversarial-robustness-toolbox

In [None]:
from art.estimators.classification import SklearnClassifier
from art.attacks.evasion import BoundaryAttack

def run_targeted_boundary_attack(X_train, y_train, X_test, target_class, classifier):
    """
    Run the targeted Boundary Attack on the provided data using the provided classifier.

    :param X_train: Training feature data
    :param y_train: Training target data
    :param X_test: Test feature data
    :param target_class: Target class for the attack
    :param classifier: Trained sklearn classifier model
    :return: Adversarial test examples
    """
    # Wrap the sklearn model in ART's classifier wrapper
    art_classifier = SklearnClassifier(model=classifier)

    # Initialize the Boundary Attack
    attack = BoundaryAttack(estimator=art_classifier, targeted=True, max_iter=50, num_trial=2500, sample_size=20, init_size=10)

    # Generate adversarial test examples
    X_test_adv = attack.generate(x=X_test, y=np.array([target_class] * len(X_test)))

    return X_test_adv


'''
# Load and prepare the data
data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a RandomForest classifier
classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)

# Apply HopSkipJump attack
X_test_adv = run_hopskipjump_attack(X_train, y_train, X_test, classifier)

# Evaluate the classifier on original and adversarial examples
original_accuracy = accuracy_score(y_test, classifier.predict(X_test))
adversarial_accuracy = accuracy_score(y_test, classifier.predict(X_test_adv))
print(f'Original Accuracy: {original_accuracy:.2f}')
print(f'Adversarial Accuracy: {adversarial_accuracy:.2f}')
'''
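
The Boundary Attack is decision-based: it only queries the labels predicted by the model. It starts from a point that already has the target label and moves it closer to the original sample without leaving the target class. The sketch below illustrates just that "move closer while keeping the label" idea with a binary search along a straight line. It is a deliberately simplified stand-in, not ART's `BoundaryAttack`, which uses random perturbation steps. `toy_classifier` and all points are hypothetical.

```python
import math

def toy_classifier(point):
    """Hypothetical black-box model: flags a flow when its two features sum to 1 or more."""
    return "Attack" if point[0] + point[1] >= 1.0 else "None"

def closest_point_with_label(original, start, target_label, predict, steps=30):
    """Binary-search the segment start -> original for the point nearest to
    `original` that `predict` still assigns to `target_label`."""
    if predict(start) != target_label:
        raise ValueError("start must already be classified as the target label")
    low, high = 0.0, 1.0  # fraction of the way from start (0.0) to original (1.0)
    for _ in range(steps):
        mid = (low + high) / 2
        candidate = [s + mid * (o - s) for s, o in zip(start, original)]
        if predict(candidate) == target_label:
            low = mid   # still in the target class: move closer to the original
        else:
            high = mid  # crossed the decision boundary: back off
    return [s + low * (o - s) for s, o in zip(start, original)]

original = [0.9, 0.8]  # classified as "Attack"
start = [0.1, 0.1]     # classified as "None"
adversarial = closest_point_with_label(original, start, "None", toy_classifier)

print("Adversarial point:", adversarial)
print("Predicted label:", toy_classifier(adversarial))
print("Distance to original:", math.dist(original, adversarial))
```

The returned point sits just on the "None" side of the decision boundary, which is the kind of sample an evasion attack aims for.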

In [None]:
# Inspect the label values: the attack works with class indices, so labels must be mapped to numbers and back to strings
y_train.unique()
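
When a scikit-learn classifier is trained on string labels, its `classes_` attribute holds them in sorted order, and the columns of `predict_proba` follow that order. A hand-written label-to-number mapping therefore only matches the model if it uses the same order. The sketch below derives the indices from the labels themselves; the label spellings are assumed from this notebook and may differ in the actual data.

```python
# Hypothetical label column with the four classes used in this notebook.
labels = ["None", "Port Scanning", "Denial of Service", "Malware", "None"]

# Sorting the distinct labels mirrors the order scikit-learn stores in classes_.
classes = sorted(set(labels))
class_to_index = {name: index for index, name in enumerate(classes)}
index_to_class = {index: name for name, index in class_to_index.items()}

print(classes)
print("Index of the benign class:", class_to_index["None"])
```

With these four labels the benign class `'None'` ends up at index 2, not 1.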

## 6. Model Training

* K-Nearest Neighbors (KNN)
* Support Vector Machine (SVM) with RBF kernel (Radial Basis Function)
  * SVC
  * SVC with PCA (Principal Component Analysis) pipeline
* Bagging Classifier (SVC with RBF kernel)
* Random Forest Classifier
* Extra Trees Classifier
* Neural Network (MLPClassifier)

### 6.1. KNN Classifier training

In [None]:
# Find best K using GridSearchCV
MAX_DEGREE = 30

k_range = list(range(1, MAX_DEGREE+1))
param_grid = dict(n_neighbors=k_range)
knn = KNeighborsClassifier()
grid = GridSearchCV(knn, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid.fit(x_train_scaled, y_train)

# Print information about the model
print(f"Best k: {grid.best_params_}")
print(f"Best score: {grid.best_score_}")

In [None]:
knn_3 = KNeighborsClassifier(n_neighbors=3)
score = cross_val_score(knn_3, x_train_scaled, y_train, cv=3)

print(f"k: 3")
print(f"Score: {np.mean(score)}")

In [None]:
# Plot results
plt.figure(num=0, dpi=96, figsize=(10, 6))
plt.plot(k_range, grid.cv_results_['mean_test_score'])
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.xticks(k_range)
plt.show()

#### 6.1.2. Fit model with best K hyperparameter + make predictions

In [None]:
# Create a KNN classifier with 3 neighbors
knn = KNeighborsClassifier(n_neighbors=3) # k = 3, the value cross-validated above
# Fit the classifier to the data
knn.fit(x_train_scaled, y_train)
# Make predictions on validation set
#predictions = knn.predict(x_validation_scaled)

In [None]:
y_test

#### 6.1.3. Model evaluation on the adversarial test set

In [None]:
classifier = knn # Set the classifier under attack here; the rest of the cell is model-agnostic
# scikit-learn keeps string labels sorted in classes_; the target is assumed to be addressed by that index
target_class = list(classifier.classes_).index('None')
# Store the adversarial examples separately so the clean test set stays available for the other models
x_test_adv = run_targeted_boundary_attack(x_train_scaled, y_train, x_test_scaled, target_class, classifier)

In [None]:
# Make predictions on the adversarial test examples
predictions = knn.predict(x_test_adv)

In [None]:
y_test = test_df['ALERT']

In [None]:
# Print the classification report
print(classification_report(y_test, predictions))
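
For a targeted attack toward the benign class, a useful complement to the classification report is the share of attack flows that the model labels as benign. The sketch below computes that rate from two label lists. The function name `evasion_rate` and the example labels are hypothetical.

```python
def evasion_rate(true_labels, predicted_labels, benign_label="None"):
    """Share of truly malicious flows that the model predicts as benign."""
    attack_predictions = [
        predicted
        for true, predicted in zip(true_labels, predicted_labels)
        if true != benign_label
    ]
    if not attack_predictions:
        return 0.0
    evaded = sum(predicted == benign_label for predicted in attack_predictions)
    return evaded / len(attack_predictions)

# Hypothetical ground truth and predictions on adversarial examples.
y_true_demo = ["Malware", "Port Scanning", "None", "Denial of Service"]
y_pred_demo = ["None", "Port Scanning", "None", "None"]

rate = evasion_rate(y_true_demo, y_pred_demo)
print(f"Evasion rate: {rate:.2f}")
```

Comparing this rate on clean and on adversarial inputs shows how much of the change is due to the attack rather than to ordinary model errors.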

In [None]:
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_test, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])

# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')
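
`confusion_matrix` orders rows (true labels) and columns (predicted labels) by the sorted labels that occur in either input, so hard-coded names only fit when all four classes are present. The dependency-free sketch below builds the same kind of table and returns the labels alongside it. The example labels are made up.

```python
from collections import Counter

def confusion_table(true_labels, predicted_labels):
    """Return the sorted labels and a matrix with rows = true and columns = predicted."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = [[counts[(true, predicted)] for predicted in labels] for true in labels]
    return labels, matrix

# Hypothetical labels in which the 'Denial of Service' class never occurs.
y_true_demo = ["None", "Malware", "None", "Port Scanning"]
y_pred_demo = ["None", "None", "None", "Port Scanning"]

labels, matrix = confusion_table(y_true_demo, y_pred_demo)
print(labels)
for label, row in zip(labels, matrix):
    print(f"{label:>15}: {row}")
```

With a class missing, the table is 3x3 rather than 4x4. This can happen on small samples such as the development-mode subset, so taking the index names from the data is more robust than a fixed list.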

#### 6.1.4. KNN predictions on test set

In [None]:
# Prediction on the test set
predictions = knn.predict(x_test_scaled)

# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class

### 6.2. Support Vector Machine Classifier (SVC) training

#### 6.2.1. Only SVC model training

##### 6.2.1.1. Grid search to find best hyperparameters for SVC

In [None]:
# Create grid search parameters
param_grid = {
  'C': [0.1, 1, 10, 100, 1000],
  'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
}

# Create grid search
svc_grid = GridSearchCV(
  SVC(kernel="rbf"),
  param_grid,
  cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
  n_jobs=-1, # Use all cores
)

# Fit grid search
svc_grid.fit(x_train_scaled, y_train)

# Print information about the model
print(f"Best params: {svc_grid.best_params_}")
print(f"Best score: {svc_grid.best_score_}")
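
To estimate the cost of this search, it helps to enumerate the grid explicitly. The sketch below expands the same `C` and `gamma` values into candidate dictionaries and counts the model fits, assuming `GridSearchCV`'s default of refitting the best candidate once on the full training data.

```python
from itertools import product

param_grid = {
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
}
cv_folds = 2

# Every combination of the listed values is one candidate model.
candidates = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]

cv_fits = len(candidates) * cv_folds  # one fit per candidate and fold
total_fits = cv_fits + 1              # plus the final refit of the best candidate
print(f"{len(candidates)} candidates, {cv_fits} cross-validation fits, {total_fits} fits in total")
```

Adding a third value list multiplies the candidate count, which is why the larger grids in this notebook are paired with only two folds.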

##### 6.2.1.2. Create model with best parameters + fit model

In [None]:
# Create SVM with best parameters
svc = SVC(
  kernel='rbf',
  C=svc_grid.best_params_['C'],
  gamma=svc_grid.best_params_['gamma'],
)
svc.fit(x_train_scaled, y_train)

##### 6.2.1.3. Make predictions

In [None]:
classifier = svc # Set the classifier under attack here; the rest of the cell is model-agnostic
# scikit-learn keeps string labels sorted in classes_; the target is assumed to be addressed by that index
target_class = list(classifier.classes_).index('None')
# Store the adversarial examples separately so the clean test set stays available for the other models
x_test_adv = run_targeted_boundary_attack(x_train_scaled, y_train, x_test_scaled, target_class, classifier)

In [None]:
# Make predictions on the adversarial test examples
predictions = svc.predict(x_test_adv)

##### 6.2.1.4. Model evaluation

In [None]:
# Print the classification report
print(classification_report(y_test, predictions))

In [None]:
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_test, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])

# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')

##### 6.2.1.5. SVC model predictions on test set

In [None]:
# Prediction on the test set
predictions = svc.predict(x_test_scaled)

# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class

#### 6.2.2. PCA + SVC model training

##### 6.2.2.1. Create pipeline

In [None]:
# Create the two pipeline steps
pca = PCA(whiten=True, random_state=42) # PCA (Principal Component Analysis)
svc = SVC(kernel='rbf', class_weight='balanced') # SVC (Support Vector Classification)

# Create pipeline
model = make_pipeline(pca, svc)

##### 6.2.2.2. Grid search to find the best parameters for PCA and SVC

In [None]:
# Generate a valid n_components range (from 5 to maximum number of features)
n_features = x_train_scaled.shape[1]
n_components = np.arange(5, n_features, 3)

param_grid = {
  'pca__n_components': n_components,
  'svc__C': [50, 100, 500, 1000, 5000, 10000],
  'svc__gamma': [0.001, 0.01, 0.1, 1, 10]
}

# Grid search
pipeline_grid = GridSearchCV(
    model,
    param_grid,
    cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
    n_jobs=-1 # Use all cores
)
pipeline_grid.fit(x_train_scaled, y_train)

# Print information about the model
print(f"Best params: {pipeline_grid.best_params_}")
print(f"Best score: {pipeline_grid.best_score_}")
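
The `n_components` range depends on how many features survived the selection step. For integers, `np.arange(5, n_features, 3)` yields the same values as Python's `range`, with the stop value excluded. The sketch below uses that equivalence to show what the grid contains for a few hypothetical feature counts; `component_grid` is an illustrative helper.

```python
def component_grid(n_features, start=5, step=3):
    """Values that np.arange(start, n_features, step) would produce for integer inputs."""
    return list(range(start, n_features, step))

# Hypothetical numbers of selected features.
for n_features in (4, 5, 12, 20):
    print(n_features, component_grid(n_features))
```

With five or fewer features the range is empty and the search has no PCA setting to try, and the full feature count itself is never included. Both are worth checking after changing `MIN_IMPORTANCE_THRESHOLD`.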

##### 6.2.2.3. Create pipeline with best parameters + fit model

In [None]:
# Now, create the desired pipeline
pca = PCA(
  n_components=pipeline_grid.best_params_['pca__n_components'],
  whiten=True,
  random_state=42
)
svc = SVC(kernel='rbf',
  class_weight='balanced',
  # Use the best parameters found by the grid search
  C=pipeline_grid.best_params_['svc__C'],
  gamma=pipeline_grid.best_params_['svc__gamma']
)
model = make_pipeline(pca, svc)
model.fit(x_train_scaled, y_train)

##### 6.2.2.4. Make predictions

In [None]:
# Make predictions on validation set
predictions = model.predict(x_validation_scaled)

##### 6.2.2.5. Model evaluation

In [None]:
# Print the classification report
print(classification_report(y_val, predictions))

In [None]:
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])

# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')

##### 6.2.2.6. SVC+PCA pipeline model predictions on test set

In [None]:
# Prediction on the test set
predictions = model.predict(x_test_scaled)

# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class

### 6.3. Bagging Classifier (SVC based) training

#### 6.3.1. Create model using best SVC parameters + fit model

In [None]:
svc = SVC(kernel='rbf',
  class_weight='balanced',
  C=svc_grid.best_params_['C'],
  gamma=svc_grid.best_params_['gamma']
)

clf = BaggingClassifier(
  svc,
  n_estimators=30,
  n_jobs=-1, # Use all cores
  random_state=42
)
clf.fit(x_train_scaled, y_train)
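
With its default settings, `BaggingClassifier` trains every estimator on a bootstrap sample, drawn with replacement and as large as the training set. For base estimators without `predict_proba`, such as an `SVC` with the default `probability=False`, the ensemble prediction is a majority vote. The sketch below shows both ingredients in plain Python. It is a conceptual illustration with made-up votes, not the scikit-learn implementation.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def majority_vote(votes):
    """Return the label predicted by most estimators."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(42)
sample = bootstrap_indices(10, rng)
print("Bootstrap sample:", sample)
print("Distinct rows used:", len(set(sample)), "of 10")

# Hypothetical votes of five estimators for a single flow.
votes = ["None", "Malware", "None", "None", "Port Scanning"]
print("Ensemble prediction:", majority_vote(votes))
```

Each estimator sees a slightly different subset of the data, so their individual errors tend to differ and partly cancel out in the vote.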

#### 6.3.2. Make predictions

In [None]:
classifier = clf # Set the classifier under attack here; the rest of the cell is model-agnostic
# scikit-learn keeps string labels sorted in classes_; the target is assumed to be addressed by that index
target_class = list(classifier.classes_).index('None')
# Store the adversarial examples separately so the clean test set stays available for the other models
x_test_adv = run_targeted_boundary_attack(x_train_scaled, y_train, x_test_scaled, target_class, classifier)

In [None]:
# Make predictions on the adversarial test examples
predictions = clf.predict(x_test_adv)

#### 6.3.3. Model evaluation

In [None]:
# Print the classification report
print(classification_report(y_test, predictions))

In [None]:
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_test, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])

# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')

#### 6.3.4. Bagging Classifier predictions on test set

In [None]:
# Prediction on the test set
predictions = clf.predict(x_test_scaled)

# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class

### 6.4. Random Forest Classifier training

#### 6.4.1. Grid search to find best hyperparameters for Random Forest

In [None]:
# Create random forest classifier
rfc = RandomForestClassifier()

# Create a dictionary of all values we want to test for n_estimators
parameters = {'n_estimators': [1, 2, 4, 10, 15, 20, 30, 40, 50, 100, 200, 500, 1000]}

# Used to find the best n_estimators value to use to train the model
rfc_grid = GridSearchCV(
  rfc,
  parameters,
  scoring='accuracy',
  cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
  n_jobs=-1 # Use all cores
)

# Fit model to data
rfc_grid.fit(x_train_scaled, y_train)

# Extract best params
print(f"Best params: {rfc_grid.best_params_}")
print(f"Best score: {rfc_grid.best_score_}")

#### 6.4.2. Create model with best parameters + fit model

In [None]:
rfc = RandomForestClassifier(n_estimators=rfc_grid.best_params_['n_estimators'])
rfc.fit(x_train_scaled, y_train)

#### 6.4.3. Make predictions

In [None]:
classifier = rfc # Set the classifier under attack here; the rest of the cell is model-agnostic
# scikit-learn keeps string labels sorted in classes_; the target is assumed to be addressed by that index
target_class = list(classifier.classes_).index('None')
# Store the adversarial examples separately so the clean test set stays available for the other models
x_test_adv = run_targeted_boundary_attack(x_train_scaled, y_train, x_test_scaled, target_class, classifier)

In [None]:
# Make predictions on the adversarial test examples
predictions = rfc.predict(x_test_adv)

#### 6.4.4. Model evaluation

In [None]:
# Print the classification report
print(classification_report(y_test, predictions))

In [None]:
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_test, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])

# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')

#### 6.4.5. Random Forest model predictions on test set

In [None]:
# Prediction on the test set
predictions = rfc.predict(x_test_scaled)

# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class

### 6.5. Extra Trees Classifier training

#### 6.5.1. Grid search to find best hyperparameters for Extra Trees

In [None]:
# Create extra trees classifier
etc = ExtraTreesClassifier()

# Create a dictionary of all values we want to test for n_estimators
parameters = {'n_estimators': [1, 2, 4, 10, 15, 20, 30, 40, 50, 100, 200, 500]}

# Used to find the best n_estimators value to use to train the model
etc_grid = GridSearchCV(
  etc,
  parameters,
  scoring='accuracy',
  cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
  n_jobs=-1 # Use all cores
)

# Fit model to data
etc_grid.fit(x_train_scaled, y_train)

# Extract best params
print(f"Best params: {etc_grid.best_params_}")
print(f"Best score: {etc_grid.best_score_}")

#### 6.5.2. Create model with best parameters + fit model

In [None]:
etc = ExtraTreesClassifier(n_estimators=etc_grid.best_params_['n_estimators'])
etc.fit(x_train_scaled, y_train)

#### 6.5.3. Make predictions

In [None]:
classifier = etc # Set the classifier under attack here; the rest of the cell is model-agnostic
# scikit-learn keeps string labels sorted in classes_; the target is assumed to be addressed by that index
target_class = list(classifier.classes_).index('None')
# Store the adversarial examples separately so the clean test set stays available for the other models
x_test_adv = run_targeted_boundary_attack(x_train_scaled, y_train, x_test_scaled, target_class, classifier)

In [None]:
# Make predictions on the adversarial test examples
predictions = etc.predict(x_test_adv)

#### 6.5.4. Model evaluation

In [None]:
# Print the classification report
print(classification_report(y_test, predictions))

In [None]:
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_test, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])

# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')

#### 6.5.5. Extra Trees model predictions on test set

In [None]:
# Prediction on the test set
predictions = etc.predict(x_test_scaled)

# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class

### 6.6. Neural Network classifier training

#### 6.6.1. Grid search to find best hyperparameters for Neural Network

In [None]:
# Create MLPClassifier
mlp = MLPClassifier(
  max_iter=1000,
  random_state=42
)

# Grid search for MLPClassifier
parameters = {
  'hidden_layer_sizes': [(50,), (100,), (50, 50)],
  'activation': ['relu', 'tanh'],
  'alpha': [0.0001, 0.001],
  'solver': ['adam', 'lbfgs'],
  'learning_rate': ['constant', 'invscaling'],
}

mlp_grid = GridSearchCV(
  mlp,
  parameters,
  cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
  n_jobs=-1, # Use all cores
)

mlp_grid.fit(x_train_scaled, y_train)

In [None]:
# Extract best params
print(f"Best params: {mlp_grid.best_params_}")
print(f"Best score: {mlp_grid.best_score_}")

#### 6.6.2. Create model with best parameters + fit model

In [None]:
# Create MLPClassifier with best parameters
mlp = MLPClassifier(
  hidden_layer_sizes=mlp_grid.best_params_['hidden_layer_sizes'],
  activation=mlp_grid.best_params_['activation'],
  alpha=mlp_grid.best_params_['alpha'],
  solver=mlp_grid.best_params_['solver'],
  learning_rate=mlp_grid.best_params_['learning_rate'],
  max_iter=1000,
  random_state=42
)
mlp.fit(x_train_scaled, y_train)

#### 6.6.3. Make predictions

In [None]:
classifier = mlp # Set the classifier under attack here; the rest of the cell is model-agnostic
# scikit-learn keeps string labels sorted in classes_; the target is assumed to be addressed by that index
target_class = list(classifier.classes_).index('None')
# Store the adversarial examples separately so the clean test set stays available for the other models
x_test_adv = run_targeted_boundary_attack(x_train_scaled, y_train, x_test_scaled, target_class, classifier)

In [None]:
# Make predictions on the adversarial test examples
predictions = mlp.predict(x_test_adv)

#### 6.6.4. Model evaluation

In [None]:
# Print the classification report
print(classification_report(y_test, predictions))

In [None]:
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_test, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])

# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')

#### 6.6.5. MLP classifier model predictions on test set

In [None]:
# Prediction on the test set
predictions = mlp.predict(x_test_scaled)

# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class