# Machine Learning Models on the IDS 2018

In this notebook, random tree and random forest based machine learning algorithms are applied
to the ids2018 dataset. Several methods for resolving the class imbalance are tested. Random
tree algorithms were chosen for their effectiveness and the training time which were better than
other machine learning models. RT and RF based algorithms performed better in the preliminary
experiments


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import os
import xgboost as xgb
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, average_precision_score, make_scorer, precision_score
from notebook_utils import load_sample_dataset_2018
%matplotlib inline
%load_ext autoreload
%autoreload 2

file_path = r"..\CIC-IDS-2018\Processed Traffic Data for ML Algorithms"

df = load_sample_dataset_2018(file_path)

Processed 1/10 files.
Processed 2/10 files.
Processed 3/10 files.
Processed 4/10 files.
Processed 5/10 files.
Processed 6/10 files.
Processed 7/10 files.
Processed 8/10 files.
Processed 9/10 files.
Processed 10/10 files.
Creating is_attack column...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1623303 entries, 0 to 1623302
Data columns (total 81 columns):
 #   Column             Non-Null Count    Dtype   
---  ------             --------------    -----   
 0   dst_port           1623295 non-null  float64 
 1   protocol           1623295 non-null  float64 
 2   timestamp          0 non-null        float64 
 3   flow_duration      1623295 non-null  float64 
 4   tot_fwd_pkts       1623295 non-null  float64 
 5   tot_bwd_pkts       1623295 non-null  float64 
 6   totlen_fwd_pkts    1623295 non-null  float64 
 7   totlen_bwd_pkts    1623295 non-null  float64 
 8   fwd_pkt_len_max    1623295 non-null  float64 
 9   fwd_pkt_len_min    1623295 non-null  float64 
 10  fwd_pkt_len_mean   1

## Preparing the Dataset

### Check for invalid values

In [2]:
# Select only numeric columns
numeric_columns = df.select_dtypes(include=[np.number]).columns
# Identify columns with NaN, infinity, or negative values
nan_columns = df[numeric_columns].columns[df[numeric_columns].isna().any()]
inf_columns = df[numeric_columns].columns[np.isinf(df[numeric_columns]).any()]
neg_columns = df[numeric_columns].columns[(df[numeric_columns] < 0).any()]
print("Columns with NaN values:", nan_columns.tolist())
print("Columns with infinite values:", inf_columns.tolist())
print("Columns with negative values:", neg_columns.tolist())
# Calculate the percentage of NaN, infinite, and negative values
nan_percentage = df[nan_columns].isna().mean() * 100
# nan_percentage = nan_percentage[nan_percentage > 1]
inf_percentage = df[inf_columns].map(lambda x: np.isinf(x)).mean() * 100
neg_percentage = df[neg_columns].map(lambda x: x < 0).mean() * 100
print("Percentage of NaN values in each column:\n", nan_percentage)
print("Percentage of infinite values in each column:\n", inf_percentage)
print("Percentage of negative values in each column:\n", neg_percentage)

Columns with NaN values: ['dst_port', 'protocol', 'timestamp', 'flow_duration', 'tot_fwd_pkts', 'tot_bwd_pkts', 'totlen_fwd_pkts', 'totlen_bwd_pkts', 'fwd_pkt_len_max', 'fwd_pkt_len_min', 'fwd_pkt_len_mean', 'fwd_pkt_len_std', 'bwd_pkt_len_max', 'bwd_pkt_len_min', 'bwd_pkt_len_mean', 'bwd_pkt_len_std', 'flow_byts_s', 'flow_pkts_s', 'flow_iat_mean', 'flow_iat_std', 'flow_iat_max', 'flow_iat_min', 'fwd_iat_tot', 'fwd_iat_mean', 'fwd_iat_std', 'fwd_iat_max', 'fwd_iat_min', 'bwd_iat_tot', 'bwd_iat_mean', 'bwd_iat_std', 'bwd_iat_max', 'bwd_iat_min', 'fwd_psh_flags', 'bwd_psh_flags', 'fwd_urg_flags', 'bwd_urg_flags', 'fwd_header_len', 'bwd_header_len', 'fwd_pkts_s', 'bwd_pkts_s', 'pkt_len_min', 'pkt_len_max', 'pkt_len_mean', 'pkt_len_std', 'pkt_len_var', 'fin_flag_cnt', 'syn_flag_cnt', 'rst_flag_cnt', 'psh_flag_cnt', 'ack_flag_cnt', 'urg_flag_cnt', 'cwe_flag_count', 'ece_flag_cnt', 'down_up_ratio', 'pkt_size_avg', 'fwd_seg_size_avg', 'bwd_seg_size_avg', 'fwd_byts_b_avg', 'fwd_pkts_b_avg', 'f

For negative values, 2 columns have an extremely high percentage of negative values. We choose to drop the features “init_win_bytes_forward” and “init_win_bytes_backward” as the source of the negative sign is unknown. For the rest of relevant features, the percentages of negative, infinite or are low so the rows are dropped.


In [3]:
def replace_invalid(df):
    # Select only numeric columns
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    # Identify columns with NaN, infinite, or negative values
    nan_columns = df[numeric_columns].columns[df[numeric_columns].isna().any()]
    inf_columns = df[numeric_columns].columns[np.isinf(df[numeric_columns]).any()]
    neg_columns = df[numeric_columns].columns[(df[numeric_columns] < 0).any()]
    # Drop rows with NaN values (low percentage of NaN values)
    # df = df.dropna(subset=nan_columns)
    # Drop rows with infinite values (assuming low percentage)
    for col in inf_columns:
        df = df[np.isfinite(df[col])]
    # Drop columns with a high percentage of negative values
    columns_to_drop = ['init_fwd_win_byts', 'init_bwd_win_byts']
    df = df.drop(columns=columns_to_drop)
    # Drop rows with negative values in the remaining columns
    remaining_neg_columns = [col for col in neg_columns if col not in columns_to_drop]
    for col in remaining_neg_columns:
        df = df[df[col] >= 0]
    return df

In [4]:
df = replace_invalid(df)

In [5]:
X = df.iloc[:, 0:76]
Y = df[["label", "is_attack"]]
X.info()
Y.info()
print(Y.label.value_counts())

<class 'pandas.core.frame.DataFrame'>
Index: 1613817 entries, 0 to 1623302
Data columns (total 76 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   dst_port           1613817 non-null  float64
 1   protocol           1613817 non-null  float64
 2   timestamp          0 non-null        float64
 3   flow_duration      1613817 non-null  float64
 4   tot_fwd_pkts       1613817 non-null  float64
 5   tot_bwd_pkts       1613817 non-null  float64
 6   totlen_fwd_pkts    1613817 non-null  float64
 7   totlen_bwd_pkts    1613817 non-null  float64
 8   fwd_pkt_len_max    1613817 non-null  float64
 9   fwd_pkt_len_min    1613817 non-null  float64
 10  fwd_pkt_len_mean   1613817 non-null  float64
 11  fwd_pkt_len_std    1613817 non-null  float64
 12  bwd_pkt_len_max    1613817 non-null  float64
 13  bwd_pkt_len_min    1613817 non-null  float64
 14  bwd_pkt_len_mean   1613817 non-null  float64
 15  bwd_pkt_len_std    1613817 non-null  

## Feature Selection

First, the columns with no variance are dropped as they have no impact on the target variables.

In [6]:
stats = X.describe()
std = stats.loc["std"]
features_no_var = std[std == 0.0].index
# Exclude non-numeric columns (e.g., categorical columns) from the features with zero variance
features_no_var_numeric = [col for col in features_no_var if col in X.select_dtypes(include=[np.number]).columns]
print(features_no_var_numeric)

['bwd_psh_flags', 'bwd_urg_flags', 'fwd_byts_b_avg', 'fwd_pkts_b_avg', 'fwd_blk_rate_avg', 'bwd_byts_b_avg', 'bwd_pkts_b_avg', 'bwd_blk_rate_avg']


In [9]:
X = X.drop(columns=features_no_var)
X = X.drop(columns=['dst_port', 'timestamp'])
X.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1613817 entries, 0 to 1623302
Data columns (total 41 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   protocol          1613817 non-null  float64
 1   flow_duration     1613817 non-null  float64
 2   tot_fwd_pkts      1613817 non-null  float64
 3   tot_bwd_pkts      1613817 non-null  float64
 4   fwd_pkt_len_max   1613817 non-null  float64
 5   fwd_pkt_len_min   1613817 non-null  float64
 6   fwd_pkt_len_mean  1613817 non-null  float64
 7   bwd_pkt_len_max   1613817 non-null  float64
 8   bwd_pkt_len_min   1613817 non-null  float64
 9   bwd_pkt_len_mean  1613817 non-null  float64
 10  flow_byts_s       1613817 non-null  float64
 11  flow_pkts_s       1613817 non-null  float64
 12  flow_iat_mean     1613817 non-null  float64
 13  flow_iat_std      1613817 non-null  float64
 14  flow_iat_max      1613817 non-null  float64
 15  fwd_iat_std       1613817 non-null  float64
 16  bwd_i

### Remove collinear variables

In [10]:
def correlation_feature_selection(df, threshold=0.95):
    corr_matrix = df.corr().abs()
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
    return df.drop(columns=to_drop)
X = correlation_feature_selection(X)
X.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1613817 entries, 0 to 1623302
Data columns (total 41 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   protocol          1613817 non-null  float64
 1   flow_duration     1613817 non-null  float64
 2   tot_fwd_pkts      1613817 non-null  float64
 3   tot_bwd_pkts      1613817 non-null  float64
 4   fwd_pkt_len_max   1613817 non-null  float64
 5   fwd_pkt_len_min   1613817 non-null  float64
 6   fwd_pkt_len_mean  1613817 non-null  float64
 7   bwd_pkt_len_max   1613817 non-null  float64
 8   bwd_pkt_len_min   1613817 non-null  float64
 9   bwd_pkt_len_mean  1613817 non-null  float64
 10  flow_byts_s       1613817 non-null  float64
 11  flow_pkts_s       1613817 non-null  float64
 12  flow_iat_mean     1613817 non-null  float64
 13  flow_iat_std      1613817 non-null  float64
 14  flow_iat_max      1613817 non-null  float64
 15  fwd_iat_std       1613817 non-null  float64
 16  bwd_i

### Information Gain Selection

In [11]:
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
import pandas as pd

def oversample_minority_classes(X, Y, sample_size=1000):
    y = Y["label"]
    ros = RandomOverSampler(random_state=42)
    X_resampled, y_resampled = ros.fit_resample(X, y)
    # Create a subset of the oversampled data
    X_sample, _, y_sample, _ = train_test_split(X_resampled, y_resampled, train_size=sample_size, stratify=y_resampled, random_state=42)
    return X_sample, y_sample

def information_gain_feature_selection(X, Y, sample_size=1000):
    # Create an oversampled subset of the data
    X_sample, y_sample = oversample_minority_classes(X, Y, sample_size)
    # Create is_attack column based on label_code
    y_sample = (y_sample != 0).astype(int)
    # Perform feature selection on the oversampled subset
    info_gain = mutual_info_classif(X_sample, y_sample)
    info_gain_df = pd.DataFrame({'Feature': X.columns, 'Information Gain': info_gain})
    info_gain_df = info_gain_df.sort_values(by='Information Gain', ascending=False)
    print(info_gain_df)
    selected_features = info_gain_df[info_gain_df['Information Gain'] > 0]['Feature'].tolist()
    return selected_features

# Determine the selected features using the oversampled subset
selected_features = information_gain_feature_selection(X, Y)

# Apply the selected features to the main dataset
X = X[selected_features]

# Display information about the selected features
X.info()

             Feature  Information Gain
31      ack_flag_cnt            0.0005
21     fwd_psh_flags            0.0000
23        fwd_pkts_s            0.0000
24        bwd_pkts_s            0.0000
25       pkt_len_min            0.0000
26      pkt_len_mean            0.0000
27       pkt_len_var            0.0000
28      fin_flag_cnt            0.0000
29      rst_flag_cnt            0.0000
30      psh_flag_cnt            0.0000
0           protocol            0.0000
32      urg_flag_cnt            0.0000
33     down_up_ratio            0.0000
34  fwd_seg_size_min            0.0000
35       active_mean            0.0000
36        active_std            0.0000
37        active_max            0.0000
38        active_min            0.0000
39         idle_mean            0.0000
22     fwd_urg_flags            0.0000
20       bwd_iat_min            0.0000
1      flow_duration            0.0000
19       bwd_iat_max            0.0000
2       tot_fwd_pkts            0.0000
3       tot_bwd_pkts     

## Split Dataset

The dataset is split into a training set and a testing set with a ratio of 0.8/0.2. The dataset is stratified according to the label to have an equal representation of all classes in the 2 subsets.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y.label)

In [None]:
Y_train.label.value_counts()

In [None]:
Y_test.label.value_counts()

In [None]:
benign_percentage = len(Y_train.label[Y_train["label"]=="BENIGN"])/len(Y_train)
print('Percentage of benign samples: %.4f' % benign_percentage)
print(Y_train.is_attack.value_counts())

## Hyperparameters Optimization