# Hyperparameter Tuning:

## Self-Tuning Hyperparameters:

Hyperparameter tuning is the process of adjusting the _hyperparameters_ of a machine learning model to improve its performance. Hyperparameters are parameters that control the training process of a model, but are not learned from the data. They are usually set by the programmer and depend heavily on the task at hand.
The _optimal_ hyperparameters are chosen based on a metric that compares the capabilities of the model, trained with a specific set of hyperparameters, to the desired capabilities. Although the choice of metric depends on the specific context and application, common metrics for classification problems include accuracy, precision, recall, and F1 score.
Hyperparameter tuning (or optimization) can either be SOHT (Single Objective Hyperparameters Tuning) or MOHT (Multiple-Objectives Hyperparameters Tuning). 
In SOHT, the focus is on optimizing a single objective, such as model accuracy. However, in multi-objective hyperparameter tuning (MOHT), we optimize the hyperparameters to achieve multiple, often conflicting objectives. MOHT can be achieved in multiple ways, and sometimes merging multiple objectives into a single one is a viable strategy.

MOHT is useful in situations where we want to balance different performance metrics, such as accuracy, precision, recall, and training time. In the context of an IDS, we may want to prioritize reducing the number of false negatives over reducing the number of false positives, since a false alarm does not by itself compromise the security of the network, whereas a missed attack does.

Some examples of MOHT can be:
- Optimizing a machine learning model for both accuracy and fairness.
- Optimizing a machine learning model for both accuracy and speed.
- Optimizing a machine learning model for both accuracy and low CPU usage.
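
As noted above, one way to handle multiple objectives is to merge them into a single score. The following is a minimal sketch of a weighted sum. The metric names, the weights, and the `scalarize` helper are illustrative assumptions and not part of the IDS described here.

```python
def scalarize(metrics, weights):
    """Merge several objectives into one score with a weighted sum.

    Metrics to maximize get a positive weight, metrics to minimize
    (e.g. training time) get a negative one.
    """
    return sum(weights[name] * metrics[name] for name in weights)


# Hypothetical candidates: two hyperparameter settings with their measured metrics
candidates = {
    "config_a": {"accuracy": 0.95, "train_time_s": 120.0},
    "config_b": {"accuracy": 0.93, "train_time_s": 30.0},
}
# Hypothetical trade-off: each second of training time is penalized by 0.001
weights = {"accuracy": 1.0, "train_time_s": -0.001}

best = max(candidates, key=lambda name: scalarize(candidates[name], weights))
print(best)  # config_b: 0.93 - 0.03 beats 0.95 - 0.12
```

The weights encode the priorities of the operator, so changing them changes which configuration wins. A weighted sum is only one possible strategy; it cannot express every trade-off between conflicting objectives.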

### More on Hyperparameters Tuning

In machine learning models like the Naive Bayes classifier and SVM, hyperparameters are settings that are not learned from the data but are crucial for model performance. Manually tuning these hyperparameters can be time-consuming and might not lead to optimal results. Self-tuning hyperparameters involve mechanisms that automatically find the best hyperparameter settings for your models during the planning phase of the MAPE-K loop. Here's how it can be implemented:

- **Bayesian Optimization**: It's a probabilistic model-based approach that builds a surrogate model of the hyperparameter-performance relationship. It then uses an acquisition function to decide which hyperparameters to try next. Over time, Bayesian optimization adapts to discover the hyperparameters that yield the best performance.

- **Grid Search with Evaluation**: Different hyperparameter values are systematically tested within predefined ranges. This is computationally expensive, because every combination in the grid is evaluated. In the planning phase, the system can perform a grid search with occasional model evaluation to find the best hyperparameters.

- **Random Search**: Random combinations of hyperparameters are evaluated. Random search can be less intensive than grid search and may lead to good hyperparameter choices.
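
The difference between grid search and random search can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: `toy_score` is a made-up stand-in for a real validation score, and the grid values for `C` and `gamma` are arbitrary.

```python
import itertools
import math
import random


def grid_search(score_fn, grid):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    best_params, best_score = None, -math.inf
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def random_search(score_fn, grid, n_iter, seed=0):
    """Evaluate n_iter randomly drawn combinations and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, -math.inf
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in grid.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical validation score; it peaks at C=10, gamma=0.01."""
    return -(math.log10(params["C"]) - 1) ** 2 - (math.log10(params["gamma"]) + 2) ** 2


grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
print(grid_search(toy_score, grid)[0])               # all 12 combinations are scored
print(random_search(toy_score, grid, n_iter=5)[0])   # only 5 combinations are scored
```

Grid search is guaranteed to find the best point of the grid, at the cost of scoring every combination. Random search scores a fixed budget of combinations and may miss the best one.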

Hyperparameter tuning and cross-validation (CV) are closely related.

**Cross-Validation**: Cross-validation is a technique used to assess the performance and generalization ability of a machine learning model. It involves dividing the dataset into multiple subsets (known as _folds_) and iteratively training and testing the model using different combinations of these folds.

**Hyperparameter Tuning within Cross-Validation**: To find the best hyperparameters for a model, it's common to perform hyperparameter tuning within each fold of a cross-validation procedure. In this setup, you search for the best hyperparameters while training on the training portion of the current fold and then evaluate the model's performance on the validation portion of the fold.
This process is repeated for each fold, and the hyperparameters that yield the best average performance across the folds are selected as the final hyperparameters for the model. This approach ensures that the selected hyperparameters generalize well to different data partitions.
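
The selection procedure described above can be sketched as follows. This is a simplified illustration, not the implementation used for the IDS: a tiny 1-D k-nearest-neighbours classifier stands in for the real models, `k` plays the role of the hyperparameter, and the data is synthetic.

```python
from statistics import mean


def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation once."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        val_idx = indices[fold::n_folds]
        val_set = set(val_idx)
        train_idx = [i for i in indices if i not in val_set]
        yield train_idx, val_idx


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training points (1-D features, labels 0/1)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0


def cv_score(xs, ys, k, n_folds=5):
    """Average validation accuracy of k-NN over the folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), n_folds):
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        hits = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in val_idx)
        scores.append(hits / len(val_idx))
    return mean(scores)


# Synthetic data: the label is 1 when x >= 2.0
xs = [i / 10 for i in range(40)]
ys = [1 if x >= 2.0 else 0 for x in xs]

candidate_ks = [1, 3, 5]
best_k = max(candidate_ks, key=lambda k: cv_score(xs, ys, k))
print(best_k, cv_score(xs, ys, best_k))
```

Each candidate value is scored on every fold, and the value with the best average validation score is kept. The folds here are interleaved rather than shuffled, which is enough for a sketch but not for ordered real-world data.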

## Monitor Phase:

**Self-monitoring mechanism**: collects data from the IDS's operations on the KDDTest+ dataset. This data should include information on detection results (e.g., true positives, false negatives), response times, and resource utilization. From the collected data, define a set of performance metrics that are meaningful for your specific IDS. Common metrics include detection accuracy, false positive rate, false negative rate, and response time. These have already been identified for this specific IDS, but additional ones can be used to identify new trends in the system.

In the Monitoring phase, the IDS continuously collects data about network traffic, system performance, and security incidents. In particular, it should:

- Collect performance metrics such as detection accuracy, false positives, and false negatives.
- Gather data on the current threat landscape, including emerging threats, attack patterns, and vulnerabilities.
- Monitor the computational resources available.

### What to monitor

In order to detect the actual need for self-adaptation, metrics need to be identified to evaluate how _up-to-date_ the system is. These metrics are simple measurements that will be fed to the _analysis_ component that will identify if changes in trends may require adaptations. The following is a non-exhaustive list of what can be monitored in the system:

1. **Performance metrics**: evaluate how the systems perform in terms of the actual intrusion detection.
    - **Detection accuracy**: overall accuracy of the IDS.
    - **Confusion matrix**: a representation of the positives and negatives [TP FN / FP TN]
    - **False (True) positives**: rate at which the IDS produces false (true) alarms.
    - **False (True) negatives**: rate at which the IDS misses actual threats (correctly recognizes normal traffic).
    - **Response time**: measures the time it takes for the IDS to produce a prediction.
    - **Feedback from analysts**: all the metrics above can be updated fully autonomously in testing, but in real-life applications feedback will come from the classifiers themselves, as well as from analysts who manually label misclassified connections.
    - **Model stability**: ensures that performance remains roughly constant over time. A drop in performance is a warning sign.
2. **Network data and traffic**: evaluate the properties of the incoming traffic.
    - **Data distribution**: monitor the distribution of network data, including traffic patterns, protocols, and attack prevalence. High volumes of data may require shorter response times so that the IDS does not become the bottleneck of the network.
    - **Data quality**: assess the quality of the collected data, looking for inconsistencies or preprocessing problems. This may be useful to identify when the IDS is lacking accuracy or when problems arise before the classification step.
3. **Hardware and software infrastructures**: Keep track of the properties of the infrastructure in which the IDS operates.
    - **Resource availability**: keep track of the computational resources, CPU, GPU etc.
4. **Feature engineering**: periodically evaluate the relevance of the features used for detection and adapt the feature set as needed.
5. **Thresholds**: monitor if thresholds have been changed and thus require adaptation.
6. **Train set**: changes in the structure of the training set can be identified, such as changes in the _number of features_ and the _number of samples_, which can also be linked to changes in feature quality.
7. **General requirements**: more requirements may be specified by operators, assigning a higher priority to some aspects of the IDS.
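
The performance metrics in item 1 can all be derived from the four confusion-matrix counts. The sketch below assumes binary labels (1 for attack, 0 for normal); the helper names and the sample predictions are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, TN for binary labels (1 = attack, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn


def monitoring_metrics(y_true, y_pred):
    """Derive the monitored rates from the confusion counts."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


# Hypothetical ground truth and predictions for five connections
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(confusion_counts(y_true, y_pred))   # (2, 1, 1, 1)
print(monitoring_metrics(y_true, y_pred))
```

In this example one of the three attacks is missed, so the false negative rate is 1/3, and one of the two normal connections raises a false alarm, so the false positive rate is 0.5.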

### Analyze Phase:

**Analyze component**: this is a fundamental component of the self-adaptive loop in the IDS's MAPE-K architecture. In this phase, the system continuously evaluates the performance of the IDS and collects valuable insights from the data and feedback. The primary tasks of this phase include:
- Metrics:
    - Compare _baseline_ values (set during design phase) of the metrics with current values to identify possible trends and evaluate if certain thresholds have been surpassed.
    - Evaluate if changes in the datasets lead to new optimal features.
    - Evaluate if the dataset itself was changed significantly.
    - Compare traffic volumes, patterns or other properties of the incoming traffic with historical patterns to evaluate if a drift in traffic distribution is taking place. 
    - Evaluate the resources utilization of hardware/software components and identify possible critical situations.
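
The first task, comparing current values against the baseline, can be sketched as below. All metric names, baseline values, and tolerances are hypothetical, as is the `find_degraded_metrics` helper. The only idea taken from the text is that a metric is flagged when it drifts past a threshold.

```python
def find_degraded_metrics(baseline, current, tolerance, higher_is_better):
    """Return the metrics that got worse than the baseline by more than the tolerance."""
    degraded = {}
    for name, base_value in baseline.items():
        change = current[name] - base_value
        # Express the change so that a positive value always means "worse"
        worsening = -change if higher_is_better[name] else change
        if worsening > tolerance[name]:
            degraded[name] = worsening
    return degraded


# Hypothetical values: baseline from the design phase, current from monitoring
baseline = {"accuracy": 0.95, "false_negative_rate": 0.04, "response_time_ms": 12.0}
current = {"accuracy": 0.90, "false_negative_rate": 0.05, "response_time_ms": 20.0}
tolerance = {"accuracy": 0.03, "false_negative_rate": 0.02, "response_time_ms": 5.0}
higher_is_better = {"accuracy": True, "false_negative_rate": False, "response_time_ms": False}

flagged = find_degraded_metrics(baseline, current, tolerance, higher_is_better)
print(sorted(flagged))  # ['accuracy', 'response_time_ms']
```

Accuracy dropped by 0.05 and response time grew by 8 ms, both beyond their tolerances, while the false negative rate moved by only 0.01 and is not flagged. The flagged metrics are what the plan phase would receive as input.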

### Plan Phase:

The bulk of the self-adaptation happens here. In this phase, the issues identified in the _analysis_ phase are translated into actions to take on the IDS.

1. **Target identification**:
The system is composed of multiple layers, so the first step is to understand whether both layers need tuning or whether the changes involve only one of the two.
A single-layer adaptation may be needed when a performance drop can be linked to one layer, or when processing time increases in only one of the two. However, in real applications an incorrect classification only becomes visible at the end of the pipeline, which makes it harder to attribute to a specific layer. Another case in which single-layer optimization may be needed is when a single layer uses too many resources or too much time to process the samples.
In general, minimizing the time needed to train the models is always a priority, since once a problem is identified, an updated model is needed quickly.
In case of dataset expansions, target identification will be harder, since new preprocessing will most likely be needed.

2. **Mapping**:
First, a **metric <--> hyperparameter** mapping needs to be done. Assigning each metric to one or more hyperparameters makes it possible to understand how an increase or decrease of a numeric hyperparameter, or a change of a categorical value, will affect the performance of the classifier. For example, a drastic reduction in the number of support vectors in an SVM will most likely lead to a decrease in precision. Identifying **which** hyperparameter to tweak is the first step in performing precise hyperparameter tuning.

3. **Objective(s) identification**:
Understanding the system state that needs to be reached is the key to achieving good self-adaptation. A **metric <--> objective** mapping is required to understand the starting and ending points of the adaptation process. First we need to decide in which direction we want the system to move, and then which metrics need to be improved to reach that specific objective.
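
The two mappings described in steps 2 and 3 can be represented as plain dictionaries. The sketch below is hypothetical. `var_smoothing` (for `GaussianNB`) and `C`, `gamma`, `kernel` (for `SVC`) are real scikit-learn parameters, but which metric each one influences, the objective names, and the assumption that layer 2 is an SVM are made up for illustration.

```python
# Hypothetical metric <--> hyperparameter mapping for the two layers
METRIC_TO_HYPERPARAMETERS = {
    "false_negative_rate": {"layer1": ["var_smoothing"], "layer2": ["C", "gamma"]},
    "precision": {"layer1": ["var_smoothing"], "layer2": ["C", "kernel"]},
    "training_time": {"layer1": [], "layer2": ["kernel", "C"]},
}

# Hypothetical objective <--> metric mapping
OBJECTIVE_TO_METRICS = {
    "miss_fewer_attacks": ["false_negative_rate"],
    "fewer_false_alarms": ["precision"],
    "retrain_faster": ["training_time"],
}


def hyperparameters_to_tune(objectives, layers=("layer1", "layer2")):
    """Collect, per layer, the hyperparameters linked to the requested objectives."""
    plan = {layer: [] for layer in layers}
    for objective in objectives:
        for metric in OBJECTIVE_TO_METRICS[objective]:
            for layer in layers:
                for name in METRIC_TO_HYPERPARAMETERS[metric][layer]:
                    if name not in plan[layer]:
                        plan[layer].append(name)
    return plan


print(hyperparameters_to_tune(["miss_fewer_attacks", "retrain_faster"]))
# {'layer1': ['var_smoothing'], 'layer2': ['C', 'gamma', 'kernel']}
```

Passing a single layer in `layers` restricts the plan to that layer, which corresponds to the single-layer adaptation discussed under target identification.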

### Execute Phase:
**Model reconfiguration**: In the Execute phase, your IDS should reconfigure its models based on the chosen hyperparameters. This may involve retraining the models on the training data (KDDTrain+ in this notebook) with the optimized hyperparameters.

**Response actions**: After the models have been reconfigured, the Execute phase should continue to detect and respond to threats based on the improved models and settings. The response actions may adapt based on changes in model performance.

### Knowledge Phase:
**Hyperparameter update**: The Knowledge phase should record the chosen hyperparameters for both layers, including their performance and changes over time. This knowledge will inform future adaptation.

**Feedback loop**: Implement a feedback loop to ensure that the IDS learns from the outcomes of its adaptation decisions. If certain hyperparameter choices consistently lead to poor performance, the system should adapt its choices.
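
A minimal sketch of such a knowledge store is shown below. The `KnowledgeBase` class, its method names, and the recorded values are all hypothetical. It only illustrates the two ideas above: keeping a history of choices per layer, and spotting settings that consistently score poorly.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    """Keeps the history of tried hyperparameters and their scores, per layer."""
    history: list = field(default_factory=list)

    def record(self, layer, params, score):
        self.history.append({"layer": layer, "params": dict(params), "score": score})

    def best(self, layer):
        """Best-scoring entry recorded for a layer, or None if nothing was recorded."""
        entries = [e for e in self.history if e["layer"] == layer]
        return max(entries, key=lambda e: e["score"]) if entries else None

    def poor_choices(self, layer, min_score):
        """Settings that scored below min_score every time they were tried."""
        by_params = {}
        for e in self.history:
            if e["layer"] == layer:
                key = tuple(sorted(e["params"].items()))
                by_params.setdefault(key, []).append(e["score"])
        return [dict(key) for key, scores in by_params.items() if max(scores) < min_score]


kb = KnowledgeBase()
kb.record("layer2", {"C": 1, "gamma": 0.1}, 0.81)
kb.record("layer2", {"C": 10, "gamma": 0.01}, 0.93)
kb.record("layer2", {"C": 1, "gamma": 0.1}, 0.79)
print(kb.best("layer2")["params"])                 # {'C': 10, 'gamma': 0.01}
print(kb.poor_choices("layer2", min_score=0.85))   # [{'C': 1, 'gamma': 0.1}]
```

The plan phase could use `poor_choices` to exclude settings from the next search and `best` as a starting point.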

### Evaluation:
Continuously evaluate the effectiveness of the self-adaptation mechanism. Use relevant performance metrics to assess whether the IDS is improving over time and whether the self-tuning of hyperparameters is having a positive impact.

### Automation:
Automate the self-adaptation process. Your IDS should be able to perform most of these steps automatically without requiring manual intervention.
By following these steps, your IDS will continuously monitor its performance, adapt its hyperparameters, and improve its detection capabilities over time. Hyperparameter tuning is a crucial component of self-adaptation, ensuring that your IDS remains effective in the face of evolving threats and network conditions.

## When to self-adapt

1. **Initial model training**: when the IDS is built, the best settings need to be found according to the data present.
2. **Adapting to new threats**: as new attacks are documented and added to the knowledge base, new parameters may be necessary to detect these new threats effectively.
3. **Changing network conditions**: fluctuations in the volume of data can impact the IDS's performance; hyperparameter tuning may help the system adapt to new conditions.
4. **Data drift**: over time, the concept or distribution of normal and malicious network traffic may shift. This drift may be gradual but still requires adapting via tuning.
5. **Model updates**: when (and if) the models deployed in the IDS change, new hyperparameter tuning is necessary.
6. **Performance degradation**: a general decrease in performance due to changes in threats or data patterns.
7. **Feature engineering**: when new features are introduced or existing ones are modified, different hyperparameters may perform better.

## Human role

1. Operators annotate intrusions when the system encounters difficulties in classifying specific cases or when it mistakenly labels anomalies. This annotation process helps in creating **new data elements** that can be incorporated into the dataset.
2. Another way to keep humans in the loop is to **specify additional requirements** to take into consideration when performing multi-objective hyperparameter tuning, such as time or accuracy.
3. Operators can **manually trigger** the self-adaptive loop.
4. Adaptation rules such as **thresholds for accuracy** or **confusion matrix** values can be manually tweaked.

In [25]:
import numpy as np 
import pandas as pd
import copy
import pickle

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

from sklearn.decomposition import PCA
from imblearn.under_sampling import RandomUnderSampler as under_sam

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from sklearn.metrics import matthews_corrcoef, precision_score, recall_score
from sklearn.ensemble import RandomForestClassifier

### ICFS function
Takes a dataframe as parameter and saves to file all the features necessary to describe DoS+Probe and U2R+R2L

In [26]:
def pearson_correlated_features(x, y, threshold):
    """Return the columns of x whose absolute Pearson correlation with the target exceeds threshold."""
    # Accept either a Series or a DataFrame with a 'target' column
    target = y if isinstance(y, pd.Series) else y['target']

    # Work on copies so the caller's data is not modified, and reset both indexes:
    # corrwith aligns on index labels, so rows must be paired by position
    target = target.astype(int).reset_index(drop=True)
    x = x.astype(float).reset_index(drop=True)

    # Calculate the Pearson's correlation coefficients between features and the target variable
    corr_matrix = x.corrwith(target)

    # Select features with correlations above the threshold
    selected_features = x.columns[corr_matrix.abs() > threshold].tolist()

    return selected_features

In [27]:
def compute_set_difference(df1, df2):
    # Create a new DataFrame containing the set difference of the two DataFrames.
    df_diff = df1[~df1.index.isin(df2.index)]
    # Return the DataFrame.
    return df_diff

In [28]:
def perform_icfs(x_train):
    # now ICFS only on the numerical features
    num_train = copy.deepcopy(x_train)
    del num_train['protocol_type']
    del num_train['service']
    del num_train['flag']

    target = pd.DataFrame()
    target['target'] = np.array([1 if x != 'normal' else 0 for x in num_train['label']])
    num_train = pd.concat([num_train, target], axis=1)

    # These are how attacks are categorized in the trainset
    dos_list = ['back', 'land', 'neptune', 'pod', 'smurf', 'teardrop']
    probe_list = ['ipsweep', 'portsweep', 'satan', 'nmap']
    u2r_list = ['loadmodule', 'perl', 'rootkit', 'buffer_overflow']
    r2l_list = ['ftp_write', 'guess_passwd', 'imap', 'multihop', 'phf', 'spy', 'warezclient', 'warezmaster']
    normal = ['normal']

    # useful sub-sets
    x_normal = num_train[num_train['label'].isin(normal)]
    x_u2r = num_train[num_train['label'].isin(u2r_list)]
    x_r2l = num_train[num_train['label'].isin(r2l_list)]
    x_dos = num_train[num_train['label'].isin(dos_list)]
    x_probe = num_train[num_train['label'].isin(probe_list)]

    # start the ICFS with l1

    # features for dos
    dos = copy.deepcopy(num_train)
    del dos['target']
    y = np.array([1 if x in dos_list else 0 for x in dos['label']])
    y_dos = pd.DataFrame(y, columns=['target'])
    del dos['label']
    dos_all = pearson_correlated_features(dos, y_dos, 0.1)
    print(dos_all)

    # features for probe
    probe = copy.deepcopy(num_train)
    del probe['target']
    y = np.array([1 if x in probe_list else 0 for x in probe['label']])
    y_probe = pd.DataFrame(y, columns=['target'])
    del probe['label']
    probe_all = pearson_correlated_features(probe, y_probe, 0.1)
    print(probe_all)

    # intersect for the optimal features
    set_dos = set(dos_all)
    set_probe = set(probe_all)

    comm_features_l1 = set_probe & set_dos

    print('common features to train l1: ', comm_features_l1)

    # now l2 needs the features to describe the difference between rare attacks and normal traffic

    # features for u2r
    u2r = pd.concat([x_u2r, x_normal], axis=0)
    del u2r['target']
    y = np.array([1 if x in u2r_list else 0 for x in u2r['label']])
    y_u2r = pd.DataFrame(y, columns=['target'])
    del u2r['label']
    u2r_all = pearson_correlated_features(u2r, y_u2r, 0.01)
    print(u2r_all)

    # features for r2l
    r2l = pd.concat([x_r2l, x_normal], axis=0)
    del r2l['target']
    y = np.array([1 if x in r2l_list else 0 for x in r2l['label']])
    y_r2l = pd.DataFrame(y, columns=['target'])
    del r2l['label']
    r2l_all = pearson_correlated_features(r2l, y_r2l, 0.01)
    print(r2l_all)

    # intersect for the optimal features
    set_r2l = set(r2l_all)
    set_u2r = set(u2r_all)

    comm_features_l2 = set_r2l & set_u2r
    # print('Common features to train l2: ', len(comm_features_l2), comm_features_l2)

    with open('NSL-KDD Files/NSL_features_l1.txt', 'w') as g:
        for a, x in enumerate(comm_features_l1):
            if a < len(comm_features_l1) - 1:
                g.write(x + ',' + '\n')
            else:
                g.write(x)

    # write the common features for layer 2 to file
    with open('NSL-KDD Files/NSL_features_l2.txt', 'w') as g:
        for a, x in enumerate(comm_features_l2):
            if a < len(comm_features_l2) - 1:
                g.write(x + ',' + '\n')
            else:
                g.write(x)

# Main implementation

In [29]:
# loading the train set
df_train = pd.read_csv('NSL-KDD Original Datasets/KDDTrain+.txt', sep=",", header=None)
df_train = df_train[df_train.columns[:-1]]  # drop the last column (the NSL-KDD difficulty score)
titles = pd.read_csv('NSL-KDD Original Datasets/Field Names.csv', header=None)
label = pd.Series(['label'], index=[41])
titles = pd.concat([titles[0], label])
df_train.columns = titles.to_list()
df_train = df_train.drop(['num_outbound_cmds'],axis=1)
df_train_original = df_train
# df_train_original

In [30]:
# load test set
df_test = pd.read_csv('NSL-KDD Original Datasets/KDDTest+.txt', sep=",", header=None)
df_test = df_test[df_test.columns[:-1]]
df_test.columns = titles.to_list()
df_test = df_test.drop(['num_outbound_cmds'],axis=1)
df_test_original = df_test
# df_test_original

### Execution Parameters

In [31]:
EXPORT_MODELS = 0
EXPORT_DATASETS = 0

### Perform ICFS if needed

In [32]:
# It is possible to compute the ICFS again

# perform_icfs(df_train_original)

# DoS + Probe classifier (NBC)

In [33]:
# list of single attacks 
dos_attacks = ['back', 'land', 'neptune', 'pod', 'smurf', 'teardrop', 'worm', 'apache2', 'mailbomb', 'processtable', 'udpstorm']
probe_attacks = ['ipsweep', 'mscan', 'nmap', 'portsweep', 'saint', 'satan']
r2l_attacks = ['guess_passwd', 'ftp_write', 'imap', 'phf', 'multihop', 'warezmaster',
                'snmpguess', 'spy', 'warezclient', 'httptunnel', 'named', 'sendmail', 'snmpgetattack', 'xlock', 'xsnoop']
u2r_attacks = ['buffer_overflow', 'loadmodule', 'perl', 'ps', 'rootkit', 'sqlattack', 'xterm'] 

# list of attack classes split according to detection layer
dos_probe_list = ['back', 'land', 'neptune', 'pod', 'smurf', 'teardrop', 'ipsweep', 'nmap', 'portsweep', 'satan']
dos_probe_test = ['apache2', 'mailbomb', 'processtable', 'udpstorm', 'mscan', 'saint']
u2r_r2l_list = ['guess_passwd', 'ftp_write', 'imap', 'phf', 'multihop', 'warezmaster',
                'snmpguess', 'spy', 'warezclient', 'buffer_overflow', 'loadmodule', 'rootkit', 'perl']
u2r_r2l_test = ['httptunnel', 'named', 'sendmail', 'snmpgetattack', 'xlock', 'xsnoop', 'ps', 'xterm', 'sqlattack']
normal_list = ['normal']
categorical_features = ['protocol_type', 'service', 'flag']

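# load the features obtained with ICFS for both layer 1 and layer 2
# (perform_icfs writes a newline after each comma, so strip every name)
with open('NSL-KDD Files/NSL_features_l1.txt', 'r') as f:
    common_features_l1 = [name.strip() for name in f.read().split(',') if name.strip()]

with open('NSL-KDD Files/NSL_features_l2.txt', 'r') as f:
    common_features_l2 = [name.strip() for name in f.read().split(',') if name.strip()]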
    
df_train = copy.deepcopy(df_train_original)
df_test = copy.deepcopy(df_test_original)
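
`perform_icfs` writes each feature name followed by a comma and a newline, so the names need to be stripped when they are read back. Otherwise every name except the first would start with `\n` and would not match the DataFrame columns. The round trip can be checked with the standalone sketch below; `write_features`, `read_features`, and the temporary file are illustrative and not part of the notebook.

```python
import os
import tempfile


def write_features(path, features):
    """Write the names separated by a comma and a newline (same layout as perform_icfs)."""
    with open(path, "w") as f:
        f.write(",\n".join(features))


def read_features(path):
    """Read the names back, dropping the newline left after each comma."""
    with open(path) as f:
        return [name.strip() for name in f.read().split(",") if name.strip()]


features = ["logged_in", "count", "serror_rate"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.txt")
    write_features(path, features)
    with open(path) as f:
        naive = f.read().split(",")   # what a plain split returns
    cleaned = read_features(path)

print(naive)    # ['logged_in', '\ncount', '\nserror_rate']
print(cleaned)  # ['logged_in', 'count', 'serror_rate']
```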

In [34]:
y_train = np.array([1 if x in (dos_attacks+probe_attacks) else 0 for x in df_train['label']])

df_train = df_train.drop(['label'],axis=1)
df_train = df_train.reset_index().drop(['index'], axis=1)
df_train

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,150,25,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00
1,0,udp,other,SF,146,0,0,0,0,0,...,255,1,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00
2,0,tcp,private,S0,0,0,0,0,0,0,...,255,26,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00
3,0,tcp,http,SF,232,8153,0,0,0,0,...,30,255,1.00,0.00,0.03,0.04,0.03,0.01,0.00,0.01
4,0,tcp,http,SF,199,420,0,0,0,0,...,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125968,0,tcp,private,S0,0,0,0,0,0,0,...,255,25,0.10,0.06,0.00,0.00,1.00,1.00,0.00,0.00
125969,8,udp,private,SF,105,145,0,0,0,0,...,255,244,0.96,0.01,0.01,0.00,0.00,0.00,0.00,0.00
125970,0,tcp,smtp,SF,2231,384,0,0,0,0,...,255,30,0.12,0.06,0.00,0.00,0.72,0.00,0.01,0.00
125971,0,tcp,klogin,S0,0,0,0,0,0,0,...,255,8,0.03,0.05,0.00,0.00,1.00,1.00,0.00,0.00


In [35]:
X_train = df_train[common_features_l1] 

X_train

Unnamed: 0,logged_in,count,serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_srv_rerror_rate,dst_host_rerror_rate
0,0,2,0.0,0.0,0.0,1.00,0.00,0.00,25,0.17,0.03,0.17,0.00,0.00,0.00,0.00,0.05
1,0,13,0.0,0.0,0.0,0.08,0.15,0.00,1,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00
2,0,123,1.0,0.0,0.0,0.05,0.07,0.00,26,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00
3,1,5,0.2,0.0,0.0,1.00,0.00,0.00,255,1.00,0.00,0.03,0.04,0.03,0.01,0.01,0.00
4,1,30,0.0,0.0,0.0,1.00,0.00,0.09,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125968,0,184,1.0,0.0,0.0,0.14,0.06,0.00,25,0.10,0.06,0.00,0.00,1.00,1.00,0.00,0.00
125969,0,2,0.0,0.0,0.0,1.00,0.00,0.00,244,0.96,0.01,0.01,0.00,0.00,0.00,0.00,0.00
125970,1,1,0.0,0.0,0.0,1.00,0.00,0.00,30,0.12,0.06,0.00,0.00,0.72,0.00,0.00,0.01
125971,0,144,1.0,0.0,0.0,0.06,0.05,0.00,8,0.03,0.05,0.00,0.00,1.00,1.00,0.00,0.00


In [36]:
# two one-hot encoders and two scalers: one of each for the features of layer 1 and of layer 2
ohe = OneHotEncoder(handle_unknown='ignore')
ohe2 = OneHotEncoder(handle_unknown='ignore')
scaler1 = MinMaxScaler()
scaler2 = MinMaxScaler()

In [37]:
df_minmax = scaler1.fit_transform(X_train)
X_train = pd.DataFrame(df_minmax, columns=X_train.columns)
X_train

Unnamed: 0,logged_in,count,serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_srv_rerror_rate,dst_host_rerror_rate
0,0.0,0.003914,0.0,0.0,0.0,1.00,0.00,0.00,0.098039,0.17,0.03,0.17,0.00,0.00,0.00,0.00,0.05
1,0.0,0.025440,0.0,0.0,0.0,0.08,0.15,0.00,0.003922,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00
2,0.0,0.240705,1.0,0.0,0.0,0.05,0.07,0.00,0.101961,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00
3,1.0,0.009785,0.2,0.0,0.0,1.00,0.00,0.00,1.000000,1.00,0.00,0.03,0.04,0.03,0.01,0.01,0.00
4,1.0,0.058708,0.0,0.0,0.0,1.00,0.00,0.09,1.000000,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125968,0.0,0.360078,1.0,0.0,0.0,0.14,0.06,0.00,0.098039,0.10,0.06,0.00,0.00,1.00,1.00,0.00,0.00
125969,0.0,0.003914,0.0,0.0,0.0,1.00,0.00,0.00,0.956863,0.96,0.01,0.01,0.00,0.00,0.00,0.00,0.00
125970,1.0,0.001957,0.0,0.0,0.0,1.00,0.00,0.00,0.117647,0.12,0.06,0.00,0.00,0.72,0.00,0.00,0.01
125971,0.0,0.281800,1.0,0.0,0.0,0.06,0.05,0.00,0.031373,0.03,0.05,0.00,0.00,1.00,1.00,0.00,0.00
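
With its default `feature_range=(0, 1)`, `MinMaxScaler` maps each column to `(x - min) / (max - min)`, where `min` and `max` are learned from the data passed to `fit`. The pure-Python sketch below illustrates that behaviour on toy values; the helper names and the numbers are made up. It also shows that data transformed later with the same fitted values can fall outside `[0, 1]`.

```python
def fit_min_max(column):
    """Learn the minimum and the range of a training column."""
    low, high = min(column), max(column)
    return low, (high - low) or 1.0   # avoid dividing by zero for constant columns


def apply_min_max(column, low, span):
    """Scale values with the minimum and range learned on the training column."""
    return [(value - low) / span for value in column]


train_count = [0, 2, 13, 123, 500]   # toy training values
test_count = [1, 229, 600]           # toy values seen after fitting

low, span = fit_min_max(train_count)
print(apply_min_max(train_count, low, span))  # all within [0, 1]
print(apply_min_max(test_count, low, span))   # 600 is above the training max, so it maps above 1
```

Fitting only on the training data and reusing the learned values elsewhere keeps information about unseen data out of the preprocessing.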


In [38]:
# perform one-hot encoding of the categorical features
label_enc = ohe.fit_transform(df_train[categorical_features])
new_labels = ohe.get_feature_names_out(categorical_features)
df_enc = pd.DataFrame(data=label_enc.toarray(), columns=new_labels)
X_train = pd.concat([X_train, df_enc], axis=1)
X_train

Unnamed: 0,logged_in,count,serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_srv_count,dst_host_same_srv_rate,...,flag_REJ,flag_RSTO,flag_RSTOS0,flag_RSTR,flag_S0,flag_S1,flag_S2,flag_S3,flag_SF,flag_SH
0,0.0,0.003914,0.0,0.0,0.0,1.00,0.00,0.00,0.098039,0.17,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.025440,0.0,0.0,0.0,0.08,0.15,0.00,0.003922,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.240705,1.0,0.0,0.0,0.05,0.07,0.00,0.101961,0.10,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.009785,0.2,0.0,0.0,1.00,0.00,0.00,1.000000,1.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1.0,0.058708,0.0,0.0,0.0,1.00,0.00,0.09,1.000000,1.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125968,0.0,0.360078,1.0,0.0,0.0,0.14,0.06,0.00,0.098039,0.10,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
125969,0.0,0.003914,0.0,0.0,0.0,1.00,0.00,0.00,0.956863,0.96,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
125970,1.0,0.001957,0.0,0.0,0.0,1.00,0.00,0.00,0.117647,0.12,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
125971,0.0,0.281800,1.0,0.0,0.0,0.06,0.05,0.00,0.031373,0.03,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [39]:
# do the same for testset
y_test = np.array([1 if x in (dos_attacks+probe_attacks) else 0 for x in df_test['label']])

df_test = df_test.drop(['label'],axis=1)
df_test = df_test.reset_index().drop(['index'], axis=1)
df_test

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
0,0,tcp,private,REJ,0,0,0,0,0,0,...,255,10,0.04,0.06,0.00,0.00,0.00,0.0,1.00,1.00
1,0,tcp,private,REJ,0,0,0,0,0,0,...,255,1,0.00,0.06,0.00,0.00,0.00,0.0,1.00,1.00
2,2,tcp,ftp_data,SF,12983,0,0,0,0,0,...,134,86,0.61,0.04,0.61,0.02,0.00,0.0,0.00,0.00
3,0,icmp,eco_i,SF,20,0,0,0,0,0,...,3,57,1.00,0.00,1.00,0.28,0.00,0.0,0.00,0.00
4,1,tcp,telnet,RSTO,0,15,0,0,0,0,...,29,86,0.31,0.17,0.03,0.02,0.00,0.0,0.83,0.71
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22539,0,tcp,smtp,SF,794,333,0,0,0,0,...,100,141,0.72,0.06,0.01,0.01,0.01,0.0,0.00,0.00
22540,0,tcp,http,SF,317,938,0,0,0,0,...,197,255,1.00,0.00,0.01,0.01,0.01,0.0,0.00,0.00
22541,0,tcp,http,SF,54540,8314,0,0,0,2,...,255,255,1.00,0.00,0.00,0.00,0.00,0.0,0.07,0.07
22542,0,udp,domain_u,SF,42,42,0,0,0,0,...,255,252,0.99,0.01,0.00,0.00,0.00,0.0,0.00,0.00


In [40]:
X_test = df_test[common_features_l1]

X_test

Unnamed: 0,logged_in,count,serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_srv_rerror_rate,dst_host_rerror_rate
0,0,229,0.0,1.0,1.0,0.04,0.06,0.00,10,0.04,0.06,0.00,0.00,0.00,0.0,1.00,1.00
1,0,136,0.0,1.0,1.0,0.01,0.06,0.00,1,0.00,0.06,0.00,0.00,0.00,0.0,1.00,1.00
2,0,1,0.0,0.0,0.0,1.00,0.00,0.00,86,0.61,0.04,0.61,0.02,0.00,0.0,0.00,0.00
3,0,1,0.0,0.0,0.0,1.00,0.00,1.00,57,1.00,0.00,1.00,0.28,0.00,0.0,0.00,0.00
4,0,1,0.0,1.0,0.5,1.00,0.00,0.75,86,0.31,0.17,0.03,0.02,0.00,0.0,0.71,0.83
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22539,1,1,0.0,0.0,0.0,1.00,0.00,0.00,141,0.72,0.06,0.01,0.01,0.01,0.0,0.00,0.00
22540,1,2,0.0,0.0,0.0,1.00,0.00,0.18,255,1.00,0.00,0.01,0.01,0.01,0.0,0.00,0.00
22541,1,5,0.0,0.0,0.0,1.00,0.00,0.20,255,1.00,0.00,0.00,0.00,0.00,0.0,0.07,0.07
22542,0,4,0.0,0.0,0.0,1.00,0.00,0.33,252,0.99,0.01,0.00,0.00,0.00,0.0,0.00,0.00


In [41]:
df_minmax = scaler1.transform(X_test)
X_test = pd.DataFrame(df_minmax, columns=X_test.columns)
X_test

Unnamed: 0,logged_in,count,serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_srv_rerror_rate,dst_host_rerror_rate
0,0.0,0.448141,0.0,1.0,1.0,0.04,0.06,0.00,0.039216,0.04,0.06,0.00,0.00,0.00,0.0,1.00,1.00
1,0.0,0.266145,0.0,1.0,1.0,0.01,0.06,0.00,0.003922,0.00,0.06,0.00,0.00,0.00,0.0,1.00,1.00
2,0.0,0.001957,0.0,0.0,0.0,1.00,0.00,0.00,0.337255,0.61,0.04,0.61,0.02,0.00,0.0,0.00,0.00
3,0.0,0.001957,0.0,0.0,0.0,1.00,0.00,1.00,0.223529,1.00,0.00,1.00,0.28,0.00,0.0,0.00,0.00
4,0.0,0.001957,0.0,1.0,0.5,1.00,0.00,0.75,0.337255,0.31,0.17,0.03,0.02,0.00,0.0,0.71,0.83
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22539,1.0,0.001957,0.0,0.0,0.0,1.00,0.00,0.00,0.552941,0.72,0.06,0.01,0.01,0.01,0.0,0.00,0.00
22540,1.0,0.003914,0.0,0.0,0.0,1.00,0.00,0.18,1.000000,1.00,0.00,0.01,0.01,0.01,0.0,0.00,0.00
22541,1.0,0.009785,0.0,0.0,0.0,1.00,0.00,0.20,1.000000,1.00,0.00,0.00,0.00,0.00,0.0,0.07,0.07
22542,0.0,0.007828,0.0,0.0,0.0,1.00,0.00,0.33,0.988235,0.99,0.01,0.00,0.00,0.00,0.0,0.00,0.00


In [42]:
# apply the encoder fitted on the train set to the test set
label_enc = ohe.transform(df_test[categorical_features])
new_labels = ohe.get_feature_names_out(categorical_features)
df_enc = pd.DataFrame(data=label_enc.toarray(), columns=new_labels)
X_test = pd.concat([X_test, df_enc], axis=1)
X_test

Unnamed: 0,logged_in,count,serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_srv_count,dst_host_same_srv_rate,...,flag_REJ,flag_RSTO,flag_RSTOS0,flag_RSTR,flag_S0,flag_S1,flag_S2,flag_S3,flag_SF,flag_SH
0,0.0,0.448141,0.0,1.0,1.0,0.04,0.06,0.00,0.039216,0.04,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.266145,0.0,1.0,1.0,0.01,0.06,0.00,0.003922,0.00,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.001957,0.0,0.0,0.0,1.00,0.00,0.00,0.337255,0.61,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.001957,0.0,0.0,0.0,1.00,0.00,1.00,0.223529,1.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.001957,0.0,1.0,0.5,1.00,0.00,0.75,0.337255,0.31,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22539,1.0,0.001957,0.0,0.0,0.0,1.00,0.00,0.00,0.552941,0.72,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
22540,1.0,0.003914,0.0,0.0,0.0,1.00,0.00,0.18,1.000000,1.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
22541,1.0,0.009785,0.0,0.0,0.0,1.00,0.00,0.20,1.000000,1.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
22542,0.0,0.007828,0.0,0.0,0.0,1.00,0.00,0.33,0.988235,0.99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [43]:
print('Shape of the whole train set: ', X_train.shape)
print('Shape of its targets: ', y_train.shape)
print('Shape of the whole test set: ', X_test.shape)
print('Shape of its targets: ', y_test.shape)

Shape of the whole train set:  (125973, 101)
Shape of its targets:  (125973,)
Shape of the whole test set:  (22544, 101)
Shape of its targets:  (22544,)


In [44]:
# Export the dataset for training layer 1
if EXPORT_DATASETS:
    X_train.to_csv('NSL-KDD Encoded Datasets/KDDTrain+_l1.txt', index=False)
    np.save('NSL-KDD Encoded Datasets/KDDTrain+_l1_targets', y_train)

### Principal Component Analysis

In [45]:
pca_dos_probe = PCA(n_components=0.95)
X_train_dos_probe = pca_dos_probe.fit_transform(X_train)
X_test_dos_probe = pca_dos_probe.transform(X_test)
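
Passing a fraction as `n_components` asks scikit-learn to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below illustrates this on synthetic data (an assumption, not the NSL-KDD features) and shows how to inspect how many components were kept.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 10 features driven by 3 latent factors plus small noise.
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(500, 10))

pca = PCA(n_components=0.95)      # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X)  # fit on the training data only

kept = pca.n_components_                          # number of components retained
explained = pca.explained_variance_ratio_.sum()   # variance they account for
print(kept, round(explained, 4))
```

As in the cell above, the test set should only go through `transform`, so that it is projected onto the components learned from the training set.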

### Building the classifier for layer 1

In [46]:
# Using Random Forest Classifier
# dos_probe_classifier = RandomForestClassifier(n_estimators=100, criterion='gini')

# Using the Naive Bayes Classifier
dos_probe_classifier = GaussianNB()
dos_probe_classifier.fit(X_train_dos_probe, y_train)
predicted = dos_probe_classifier.predict(X_test_dos_probe)

In [47]:
print('Metrics for layer 1:')
print('Confusion matrix: [TN FP / FN TP]\n', confusion_matrix(y_test,predicted))
print('Accuracy = ', accuracy_score(y_test,predicted))
print('F1 Score = ', f1_score(y_test,predicted))
print('Precision = ', precision_score(y_test,predicted))
print('Recall = ', recall_score(y_test,predicted))
print('Shape of the train set for l1: ', X_train_dos_probe.shape)

Metrics for layer 1:
Confusion matrix: [TN FP / FN TP]
 [[9559 3104]
 [ 913 8968]]
Accuracy =  0.8218151171043293
F1 Score =  0.8170181751924567
Precision =  0.7428760768721008
Recall =  0.9076004452990588
Shape of the train set for l1:  (125973, 28)
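
The printed metrics can be re-derived by hand from the confusion matrix. scikit-learn's `confusion_matrix` puts true labels on rows and predicted labels on columns, so with labels 0 (normal) and 1 (attack) the cells read `[[TN, FP], [FN, TP]]`. The sketch below uses only the four counts printed above.

```python
# Counts taken from the layer 1 confusion matrix printed above,
# read as [[TN, FP], [FN, TP]].
tn, fp = 9559, 3104
fn, tp = 913, 8968

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of flagged samples that are attacks
recall = tp / (tp + fn)      # share of attacks that are flagged
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The values agree with the reported accuracy, precision, recall and F1 score, which confirms the cell layout.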


# R2L+U2R classifier

In [24]:
df_train = copy.deepcopy(df_train_original)
df_test = copy.deepcopy(df_test_original)

# load targeted attacks (Normal + r2l + u2r)
df_train = df_train[df_train['label'].isin(normal_list+u2r_attacks+r2l_attacks)]

y_train = np.array([0 if x=='normal' else 1 for x in df_train['label']])
df_train = df_train.drop(['label'],axis=1)
df_train = df_train.reset_index().drop(['index'], axis=1)
# df_train

In [2230]:
X_train = df_train[common_features_l2] 

# X_train

In [2231]:
df_minmax = scaler2.fit_transform(X_train)
X_train = pd.DataFrame(df_minmax, columns=X_train.columns)
X_train

Unnamed: 0,srv_count,urgent,root_shell,hot,dst_host_srv_diff_host_rate,dst_host_count,logged_in,dst_host_same_src_port_rate,srv_diff_host_rate,num_shells,dst_host_srv_count
0,0.003914,0.0,0.0,0.0,0.00,0.588235,0.0,0.17,0.00,0.0,0.098039
1,0.001957,0.0,0.0,0.0,0.00,1.000000,0.0,0.88,0.00,0.0,0.003922
2,0.009785,0.0,0.0,0.0,0.04,0.117647,1.0,0.03,0.00,0.0,1.000000
3,0.062622,0.0,0.0,0.0,0.00,1.000000,1.0,0.00,0.09,0.0,1.000000
4,0.013699,0.0,0.0,0.0,0.03,0.031373,1.0,0.12,0.43,0.0,0.858824
...,...,...,...,...,...,...,...,...,...,...,...
68385,0.001957,0.0,0.0,0.0,1.00,0.003922,1.0,1.00,0.00,0.0,0.007843
68386,0.021526,0.0,0.0,0.0,0.04,0.011765,1.0,0.33,0.18,0.0,1.000000
68387,0.003914,0.0,0.0,0.0,0.00,1.000000,0.0,0.01,0.00,0.0,0.956863
68388,0.001957,0.0,0.0,0.0,0.00,1.000000,1.0,0.00,0.00,0.0,0.117647


In [2232]:
# perform One-hot encoding
label_enc = ohe2.fit_transform(df_train.iloc[:,1:4])
label_enc.toarray()
new_labels = ohe2.get_feature_names_out(categorical_features)
df_enc = pd.DataFrame(data=label_enc.toarray(), columns=new_labels)
X_train = pd.concat([X_train, df_enc], axis=1)
X_train

Unnamed: 0,srv_count,urgent,root_shell,hot,dst_host_srv_diff_host_rate,dst_host_count,logged_in,dst_host_same_src_port_rate,srv_diff_host_rate,num_shells,...,flag_OTH,flag_REJ,flag_RSTO,flag_RSTR,flag_S0,flag_S1,flag_S2,flag_S3,flag_SF,flag_SH
0,0.003914,0.0,0.0,0.0,0.00,0.588235,0.0,0.17,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.001957,0.0,0.0,0.0,0.00,1.000000,0.0,0.88,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.009785,0.0,0.0,0.0,0.04,0.117647,1.0,0.03,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.062622,0.0,0.0,0.0,0.00,1.000000,1.0,0.00,0.09,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.013699,0.0,0.0,0.0,0.03,0.031373,1.0,0.12,0.43,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68385,0.001957,0.0,0.0,0.0,1.00,0.003922,1.0,1.00,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
68386,0.021526,0.0,0.0,0.0,0.04,0.011765,1.0,0.33,0.18,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
68387,0.003914,0.0,0.0,0.0,0.00,1.000000,0.0,0.01,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
68388,0.001957,0.0,0.0,0.0,0.00,1.000000,1.0,0.00,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [2233]:
# do the same for test set
df_test = df_test[df_test['label'].isin(normal_list+u2r_attacks+r2l_attacks)]

y_test = np.array([0 if x=='normal' else 1 for x in df_test['label']])
df_test = df_test.drop(['label'],axis=1)
df_test = df_test.reset_index().drop(['index'], axis=1)
# df_test

In [2234]:
X_test = df_test[common_features_l2] 

# X_test

In [2235]:
df_minmax = scaler2.transform(X_test)
X_test = pd.DataFrame(df_minmax, columns=X_test.columns)
X_test

Unnamed: 0,srv_count,urgent,root_shell,hot,dst_host_srv_diff_host_rate,dst_host_count,logged_in,dst_host_same_src_port_rate,srv_diff_host_rate,num_shells,dst_host_srv_count
0,0.001957,0.0,0.0,0.0,0.02,0.525490,0.0,0.61,0.00,0.0,0.337255
1,0.007828,0.0,0.0,0.0,0.03,0.607843,1.0,0.01,0.00,0.0,1.000000
2,0.005871,0.0,0.0,0.0,0.00,1.000000,1.0,0.00,1.00,0.0,0.109804
3,0.001957,0.0,0.0,0.0,0.00,1.000000,0.0,0.00,0.00,0.0,1.000000
4,0.091977,0.0,0.0,0.0,0.03,0.592157,1.0,0.01,0.04,0.0,1.000000
...,...,...,...,...,...,...,...,...,...,...,...
12658,0.001957,0.0,0.0,0.0,0.04,0.360784,1.0,0.01,0.00,0.0,1.000000
12659,0.005871,0.0,0.0,0.0,0.04,0.019608,1.0,0.20,0.00,0.0,1.000000
12660,0.001957,0.0,0.0,0.0,0.01,0.392157,1.0,0.01,0.00,0.0,0.552941
12661,0.021526,0.0,0.0,0.0,0.01,0.772549,1.0,0.01,0.18,0.0,1.000000


In [2236]:
label_enc = ohe2.transform(df_test.iloc[:,1:4])
label_enc.toarray()
new_labels = ohe2.get_feature_names_out(categorical_features)
df_enc = pd.DataFrame(data=label_enc.toarray(), columns=new_labels)
X_test = pd.concat([X_test, df_enc], axis=1)
X_test

Unnamed: 0,srv_count,urgent,root_shell,hot,dst_host_srv_diff_host_rate,dst_host_count,logged_in,dst_host_same_src_port_rate,srv_diff_host_rate,num_shells,...,flag_OTH,flag_REJ,flag_RSTO,flag_RSTR,flag_S0,flag_S1,flag_S2,flag_S3,flag_SF,flag_SH
0,0.001957,0.0,0.0,0.0,0.02,0.525490,0.0,0.61,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.007828,0.0,0.0,0.0,0.03,0.607843,1.0,0.01,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.005871,0.0,0.0,0.0,0.00,1.000000,1.0,0.00,1.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.001957,0.0,0.0,0.0,0.00,1.000000,0.0,0.00,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.091977,0.0,0.0,0.0,0.03,0.592157,1.0,0.01,0.04,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12658,0.001957,0.0,0.0,0.0,0.04,0.360784,1.0,0.01,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
12659,0.005871,0.0,0.0,0.0,0.04,0.019608,1.0,0.20,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
12660,0.001957,0.0,0.0,0.0,0.01,0.392157,1.0,0.01,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
12661,0.021526,0.0,0.0,0.0,0.01,0.772549,1.0,0.01,0.18,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [2237]:
print('Shape of the train set: ', X_train.shape)
print('Shape of its target: ', y_train.shape)
print('Shape of the test set: ', X_test.shape)
print('Shape of its target: ', y_test.shape)

Shape of the train set:  (68390, 51)
Shape of its target:  (68390,)
Shape of the test set:  (12663, 51)
Shape of its target:  (12663,)


In [2238]:
# Under sampling the train set for l2
sm = under_sam(sampling_strategy=1)
X_train, y_train = sm.fit_resample(X_train,y_train)

# Export the datasets
The train set has been scaled, one-hot encoded, and undersampled.
The test set has been scaled and one-hot encoded.
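
`under_sam` is an alias defined outside this section (presumably an undersampler such as imbalanced-learn's `RandomUnderSampler`, which is an assumption). With `sampling_strategy=1` the goal is a 1:1 class ratio. The NumPy-only sketch below is a stand-in that illustrates the idea on toy data; `random_undersample` is a hypothetical helper, not the notebook's implementation.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Drop rows at random until every class has as many rows as the smallest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Toy data: 7 samples of class 0 and 3 of class 1.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
X_bal, y_bal = random_undersample(X, y)
```

Only the training set is resampled; the test set keeps its original class distribution so that the reported metrics reflect realistic traffic.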

In [2239]:
# Export the dataset for training layer 2
if EXPORT_DATASETS:
    X_train.to_csv('NSL-KDD Encoded Datasets/KDDTrain+_l2.txt', index=False)
    np.save('NSL-KDD Encoded Datasets/KDDTrain+_l2_targets', y_train)

In [2240]:
from sklearn import tree

# Principal Component Analysis
pca_r2l_u2r = PCA(n_components=0.95)
X_train_r2l_u2r = pca_r2l_u2r.fit_transform(X_train)
X_test_r2l_u2r = pca_r2l_u2r.transform(X_test)

# Try also Decision Trees
# r2l_u2r_classifier = tree.DecisionTreeClassifier()

# Support Vector Machine for layer l2
r2l_u2r_classifier = SVC(C=0.1, gamma=0.01, kernel='rbf')
r2l_u2r_classifier.fit(X_train_r2l_u2r, y_train)
predicted = r2l_u2r_classifier.predict(X_test_r2l_u2r)

X_test_r2l_u2r

array([[ 5.63297231e-01,  4.74581823e-01,  2.22866974e-01, ...,
         1.65923095e-01, -1.40921142e-02, -4.32941940e-01],
       [-7.69292718e-01, -6.58713908e-01,  2.88033406e-01, ...,
        -1.01834752e-02, -1.96293955e-02, -2.60941740e-03],
       [-3.29108769e-01,  6.78369377e-02, -4.98049593e-01, ...,
        -7.31480200e-02,  5.83650707e-03,  7.94045819e-02],
       ...,
       [-2.24052614e-01, -1.24085977e-01, -2.00526146e-01, ...,
         3.44251846e-02,  5.78252186e-02, -5.07823769e-02],
       [-8.40479466e-01, -6.28303602e-01,  2.42356179e-01, ...,
         2.04181569e-04, -3.68087007e-02, -4.52980201e-03],
       [-1.05368199e+00,  1.50181610e+00,  5.92994442e-01, ...,
         1.78882707e-02,  5.89627528e-03,  7.07745257e-03]])

In [2241]:
print('Metrics for layer 2:')
print('Confusion matrix: [TN FP / FN TP]\n', confusion_matrix(y_test,predicted))
print('Accuracy = ', accuracy_score(y_test,predicted))
print('F1 Score = ', f1_score(y_test,predicted))
print('Precision = ', precision_score(y_test,predicted))
print('Recall = ', recall_score(y_test,predicted))
print('Matthews corr = ', matthews_corrcoef(y_test,predicted))
print('Shape of the training set: ', X_train_r2l_u2r.shape)

Metrics for layer 2:
Confusion matrix: [TN FP / FN TP]
 [[9088  623]
 [1427 1525]]
Accuracy =  0.8381110321408829
F1 Score =  0.5980392156862745
Precision =  0.7099627560521415
Recall =  0.5165989159891599
Matthews corr =  0.5097227753691026
Shape of the training set:  (2094, 13)
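
The SVM above uses fixed values `C=0.1` and `gamma=0.01`. One way such values could be searched automatically is a cross-validated grid search. The sketch below is only an illustration under stated assumptions: the data is synthetic, and the grid and the F1 scoring choice are hypothetical, not the settings used to obtain the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic, balanced stand-in for the PCA-reduced layer 2 training set.
X, y = make_classification(n_samples=300, n_features=13,
                           n_informative=6, random_state=0)

# Illustrative grid around the values used in the notebook.
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}

# 3-fold cross-validation, selecting the combination with the best F1 score.
search = GridSearchCV(SVC(kernel='rbf'), param_grid, scoring='f1', cv=3)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The search should see only training data; evaluating candidate settings on the test set would make the reported test metrics optimistic.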


### Export the classifiers

In [2170]:
if EXPORT_MODELS:
    with open('Models/NSL_l1_classifier.pkl', "wb") as f:
        pickle.dump(dos_probe_classifier, f)
    with open('Models/NSL_l2_classifier.pkl', "wb") as f:
        pickle.dump(r2l_u2r_classifier, f)
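
The exported classifiers expect inputs that went through the same fitted preprocessing (scaler, encoder, PCA), so those objects are needed at inference time as well. A possible approach, sketched below on synthetic data, is to pickle the fitted steps together. The `bundle` dictionary and its keys are hypothetical, and an in-memory buffer stands in for the file on disk.

```python
import io
import pickle

import numpy as np
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

pca = PCA(n_components=3).fit(X)
clf = GaussianNB().fit(pca.transform(X), y)

# Bundle every fitted step so inference does not depend on notebook state.
bundle = {'pca': pca, 'classifier': clf}

buffer = io.BytesIO()        # in-memory stand-in for a .pkl file opened with 'wb'
pickle.dump(bundle, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

# The restored pipeline reproduces the original predictions.
same = np.array_equal(
    clf.predict(pca.transform(X)),
    restored['classifier'].predict(restored['pca'].transform(X)),
)
```

Pickle files can execute arbitrary code when loaded, so they should only be loaded from trusted sources.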

# Testing

In [2006]:
df_test1 = copy.deepcopy(df_test_original)
df_test2 = copy.deepcopy(df_test_original)
y_test_real = np.array([0 if x=='normal' else 1 for x in df_test1['label']])
df_test_original

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
0,0,tcp,private,REJ,0,0,0,0,0,0,...,10,0.04,0.06,0.00,0.00,0.00,0.0,1.00,1.00,neptune
1,0,tcp,private,REJ,0,0,0,0,0,0,...,1,0.00,0.06,0.00,0.00,0.00,0.0,1.00,1.00,neptune
2,2,tcp,ftp_data,SF,12983,0,0,0,0,0,...,86,0.61,0.04,0.61,0.02,0.00,0.0,0.00,0.00,normal
3,0,icmp,eco_i,SF,20,0,0,0,0,0,...,57,1.00,0.00,1.00,0.28,0.00,0.0,0.00,0.00,saint
4,1,tcp,telnet,RSTO,0,15,0,0,0,0,...,86,0.31,0.17,0.03,0.02,0.00,0.0,0.83,0.71,mscan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22539,0,tcp,smtp,SF,794,333,0,0,0,0,...,141,0.72,0.06,0.01,0.01,0.01,0.0,0.00,0.00,normal
22540,0,tcp,http,SF,317,938,0,0,0,0,...,255,1.00,0.00,0.01,0.01,0.01,0.0,0.00,0.00,normal
22541,0,tcp,http,SF,54540,8314,0,0,0,2,...,255,1.00,0.00,0.00,0.00,0.00,0.0,0.07,0.07,back
22542,0,udp,domain_u,SF,42,42,0,0,0,0,...,252,0.99,0.01,0.00,0.00,0.00,0.0,0.00,0.00,normal


In [2007]:
X_test = df_test1[common_features_l1]
X_test

Unnamed: 0,logged_in,count,serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_srv_rerror_rate,dst_host_rerror_rate
0,0,229,0.0,1.0,1.0,0.04,0.06,0.00,10,0.04,0.06,0.00,0.00,0.00,0.0,1.00,1.00
1,0,136,0.0,1.0,1.0,0.01,0.06,0.00,1,0.00,0.06,0.00,0.00,0.00,0.0,1.00,1.00
2,0,1,0.0,0.0,0.0,1.00,0.00,0.00,86,0.61,0.04,0.61,0.02,0.00,0.0,0.00,0.00
3,0,1,0.0,0.0,0.0,1.00,0.00,1.00,57,1.00,0.00,1.00,0.28,0.00,0.0,0.00,0.00
4,0,1,0.0,1.0,0.5,1.00,0.00,0.75,86,0.31,0.17,0.03,0.02,0.00,0.0,0.71,0.83
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22539,1,1,0.0,0.0,0.0,1.00,0.00,0.00,141,0.72,0.06,0.01,0.01,0.01,0.0,0.00,0.00
22540,1,2,0.0,0.0,0.0,1.00,0.00,0.18,255,1.00,0.00,0.01,0.01,0.01,0.0,0.00,0.00
22541,1,5,0.0,0.0,0.0,1.00,0.00,0.20,255,1.00,0.00,0.00,0.00,0.00,0.0,0.07,0.07
22542,0,4,0.0,0.0,0.0,1.00,0.00,0.33,252,0.99,0.01,0.00,0.00,0.00,0.0,0.00,0.00


In [2008]:
df_minmax = scaler1.transform(X_test)
X_test = pd.DataFrame(df_minmax, columns=X_test.columns)
label_enc = ohe.transform(df_test1.iloc[:,1:4])
label_enc.toarray()
new_labels = ohe.get_feature_names_out(categorical_features)
df_enc = pd.DataFrame(data=label_enc.toarray(), columns=new_labels)
X_test = pd.concat([X_test, df_enc], axis=1)

X_test_layer1 = pca_dos_probe.transform(X_test)
print('Test set shape for layer 1: ', X_test_layer1.shape)

Test set shape for layer 1:  (22544, 28)


In [2009]:
X_test = df_test2[common_features_l2] 
X_test

Unnamed: 0,srv_count,urgent,root_shell,hot,dst_host_srv_diff_host_rate,dst_host_count,logged_in,dst_host_same_src_port_rate,srv_diff_host_rate,num_shells,dst_host_srv_count
0,10,0,0,0,0.00,255,0,0.00,0.00,0,10
1,1,0,0,0,0.00,255,0,0.00,0.00,0,1
2,1,0,0,0,0.02,134,0,0.61,0.00,0,86
3,65,0,0,0,0.28,3,0,1.00,1.00,0,57
4,8,0,0,0,0.02,29,0,0.03,0.75,0,86
...,...,...,...,...,...,...,...,...,...,...,...
22539,1,0,0,0,0.01,100,1,0.01,0.00,0,141
22540,11,0,0,0,0.01,197,1,0.01,0.18,0,255
22541,10,0,0,2,0.00,255,1,0.00,0.20,0,255
22542,6,0,0,0,0.00,255,0,0.00,0.33,0,252


In [2010]:
df_minmax = scaler2.transform(X_test)
X_test = pd.DataFrame(df_minmax, columns=X_test.columns)
label_enc = ohe2.transform(df_test2.iloc[:,1:4])
label_enc.toarray()
new_labels = ohe2.get_feature_names_out(categorical_features)
df_enc = pd.DataFrame(data=label_enc.toarray(), columns=new_labels)
X_test = pd.concat([X_test, df_enc], axis=1)

X_test_layer2 = pca_r2l_u2r.transform(X_test)
print('Test set shape for layer 2: ', X_test_layer2.shape)
print('Type of X_test_layer1: ', type(X_test_layer1))
print('Type of X_test_layer2: ', type(X_test_layer2))

Test set shape for layer 2:  (22544, 13)
Type of X_test_layer1:  <class 'numpy.ndarray'>
Type of X_test_layer2:  <class 'numpy.ndarray'>


In [2011]:
# same classifiers obtained above
classifier1 = dos_probe_classifier
classifier2 = r2l_u2r_classifier

In [2012]:
result = []
for i in range(X_test_layer2.shape[0]):
    layer1 = classifier1.predict(X_test_layer1[i].reshape(1, -1))[0]
    if layer1 == 1:
        result.append(layer1)
    else:
        layer2 = classifier2.predict(X_test_layer2[i].reshape(1, -1))[0]
        if layer2 == 1:
            result.append(layer2)
        else:
            result.append(0)
            
result = np.array(result)
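
The loop above flags a sample as an attack when layer 1 flags it, and otherwise defers to layer 2. The same rule can be written in vectorized form, sketched below on toy predictions; `cascade` is a hypothetical helper name.

```python
import numpy as np

def cascade(layer1_pred, layer2_pred):
    """Return 1 where layer 1 flags an attack, otherwise the layer 2 prediction."""
    layer1_pred = np.asarray(layer1_pred)
    layer2_pred = np.asarray(layer2_pred)
    return np.where(layer1_pred == 1, 1, layer2_pred)

# Toy predictions for four samples.
layer1_pred = np.array([1, 0, 0, 1])
layer2_pred = np.array([0, 1, 0, 1])
combined = cascade(layer1_pred, layer2_pred)
print(combined)  # [1 1 0 1]
```

In the notebook this would correspond to `cascade(classifier1.predict(X_test_layer1), classifier2.predict(X_test_layer2))`. The trade-off is that layer 2 is evaluated on every sample, including those already flagged by layer 1, in exchange for two batched `predict` calls instead of one call per row.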

In [2013]:
# the results may vary
# SVM hyperparameters for layer 2: C=0.1, gamma=0.01
print('Results for the two-layer pipeline (NB + SVM):')
print(confusion_matrix(y_test_real,result))
print('Accuracy = ', accuracy_score(y_test_real,result))
print('F1 Score = ', f1_score(y_test_real,result))
print('Precision = ', precision_score(y_test_real,result))
print('Recall = ', recall_score(y_test_real,result))
print('Matthews corr = ', matthews_corrcoef(y_test_real,result))

Results for the two-layer pipeline (NB + SVM):
[[ 8133  1578]
 [  851 11982]]
Accuracy =  0.8922551454932577
F1 Score =  0.9079680218239685
Precision =  0.8836283185840708
Recall =  0.9336865892620587
Matthews corr =  0.7799971236137284


### Export the test sets

In [2014]:
if EXPORT_DATASETS:
    column_names = [f'PC{i}' for i in range(1, X_test_layer1.shape[1] + 1)]
    X1_test = pd.DataFrame(data=X_test_layer1, columns=column_names)
    X1_test.to_csv('NSL-KDD Encoded Datasets/X_test_l1.txt', index=False)
    
    column_names = [f'PC{i}' for i in range(1, X_test_layer2.shape[1] + 1)]
    X2_test = pd.DataFrame(data=X_test_layer2, columns=column_names)
    X2_test.to_csv('NSL-KDD Encoded Datasets/X_test_l2.txt', index=False)
    
    np.save('NSL-KDD Encoded Datasets/y_test', y_test_real)

### Evaluate seen and unseen attack categories

In [2015]:
# load testset
df_test = pd.read_csv('NSL-KDD Original Datasets/KDDTest+.txt', sep=",", header=None)
df_test = df_test[df_test.columns[:-1]]
df_test.columns = titles.to_list()
y_test = df_test['label']
df_test = df_test.drop(['num_outbound_cmds'],axis=1)
df_test_original = df_test

In [2016]:
if EXPORT_DATASETS:
    df_test_original.to_csv('NSL-KDD Encoded Datasets/KDDTest+', index=False)
    np.save('NSL-KDD Encoded Datasets/KDDTest+_targets', y_test)
    
df_test_original

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
0,0,tcp,private,REJ,0,0,0,0,0,0,...,10,0.04,0.06,0.00,0.00,0.00,0.0,1.00,1.00,neptune
1,0,tcp,private,REJ,0,0,0,0,0,0,...,1,0.00,0.06,0.00,0.00,0.00,0.0,1.00,1.00,neptune
2,2,tcp,ftp_data,SF,12983,0,0,0,0,0,...,86,0.61,0.04,0.61,0.02,0.00,0.0,0.00,0.00,normal
3,0,icmp,eco_i,SF,20,0,0,0,0,0,...,57,1.00,0.00,1.00,0.28,0.00,0.0,0.00,0.00,saint
4,1,tcp,telnet,RSTO,0,15,0,0,0,0,...,86,0.31,0.17,0.03,0.02,0.00,0.0,0.83,0.71,mscan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22539,0,tcp,smtp,SF,794,333,0,0,0,0,...,141,0.72,0.06,0.01,0.01,0.01,0.0,0.00,0.00,normal
22540,0,tcp,http,SF,317,938,0,0,0,0,...,255,1.00,0.00,0.01,0.01,0.01,0.0,0.00,0.00,normal
22541,0,tcp,http,SF,54540,8314,0,0,0,2,...,255,1.00,0.00,0.00,0.00,0.00,0.0,0.07,0.07,back
22542,0,udp,domain_u,SF,42,42,0,0,0,0,...,252,0.99,0.01,0.00,0.00,0.00,0.0,0.00,0.00,normal


In [2017]:
new_attack = []
for i in df_test_original['label'].value_counts().index.tolist()[1:]:
    if i not in df_train_original['label'].value_counts().index.tolist()[1:]:
        new_attack.append(i)
        
new_attack.sort()
new_attack

['apache2',
 'httptunnel',
 'mailbomb',
 'mscan',
 'named',
 'processtable',
 'ps',
 'saint',
 'sendmail',
 'snmpgetattack',
 'snmpguess',
 'sqlattack',
 'udpstorm',
 'worm',
 'xlock',
 'xsnoop',
 'xterm']
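
The list of unseen attacks is the set of labels that occur in the test set but not in the training set. The loop above relies on `value_counts()` putting `normal` first (the `[1:]` slice drops the most frequent label). A set difference, sketched below on hypothetical label lists, does not depend on that ordering.

```python
# Hypothetical label lists standing in for the 'label' columns.
train_labels = ['normal', 'neptune', 'smurf', 'back']
test_labels = ['normal', 'neptune', 'mscan', 'saint', 'back', 'mscan']

# Labels present in the test set but never seen during training.
new_attack = sorted(set(test_labels) - set(train_labels))
print(new_attack)  # ['mscan', 'saint']
```

Because `normal` appears in both sets, it is excluded automatically.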

In [2018]:
index_of_new_attacks = []

for i in range(len(df_test_original)):
    if df_test_original['label'][i] in new_attack:
        index_of_new_attacks.append(df_test_original.index[i])

In [2019]:
len(index_of_new_attacks)

3750
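
The index lists for unseen and previously seen attacks can also be built with boolean masks instead of row-by-row loops. The sketch below uses a small hypothetical label column.

```python
import pandas as pd

# Hypothetical labels standing in for df_test_original['label'].
labels = pd.Series(['neptune', 'normal', 'saint', 'mscan', 'back'], name='label')
new_attack = ['mscan', 'saint']

is_new = labels.isin(new_attack)
# Seen attacks: not new and not normal traffic.
is_old = ~is_new & (labels != 'normal')

index_of_new_attacks = labels.index[is_new].tolist()
index_of_old_attacks = labels.index[is_old].tolist()
```

Handling `normal` explicitly in the mask avoids having to append it to `new_attack` before selecting the seen attacks.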

In [2020]:
new_attack.append('normal')
new_attack

['apache2',
 'httptunnel',
 'mailbomb',
 'mscan',
 'named',
 'processtable',
 'ps',
 'saint',
 'sendmail',
 'snmpgetattack',
 'snmpguess',
 'sqlattack',
 'udpstorm',
 'worm',
 'xlock',
 'xsnoop',
 'xterm',
 'normal']

In [2021]:
index_of_old_attacks = []

for i in range(len(df_test_original)):
    if df_test_original['label'][i] not in new_attack:
        index_of_old_attacks.append(df_test_original.index[i])

In [2022]:
len(index_of_old_attacks)

9083

In [2023]:
print('Number of new attacks in the test set: ', result[index_of_new_attacks].shape[0])
print('Number of new attacks detected by the classifiers: ', result[index_of_new_attacks].sum())
print('Proportion of new attacks detected: ', result[index_of_new_attacks].sum()/result[index_of_new_attacks].shape[0])

Number of new attacks in the test set:  3750
Number of new attacks detected by the classifiers:  3392
Proportion of new attacks detected:  0.9045333333333333


In [2024]:
print('Number of old attacks in the test set: ', result[index_of_old_attacks].shape[0])
print('Number of old attacks detected by the classifiers: ', result[index_of_old_attacks].sum())
print('Proportion of old attacks detected: ', result[index_of_old_attacks].sum()/result[index_of_old_attacks].shape[0])

Number of old attacks in the test set:  9083
Number of old attacks detected by the classifiers:  8590
Proportion of old attacks detected:  0.9457227788175713
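
As a consistency check, the counts for seen and unseen attacks should add up to the attack row of the combined confusion matrix printed earlier (851 missed, 11982 detected). The sketch below uses only numbers reported in this notebook.

```python
# Counts reported above.
new_total, new_detected = 3750, 3392
old_total, old_detected = 9083, 8590

# Attack row of the combined confusion matrix: [FN, TP].
fn, tp = 851, 11982

total_attacks = new_total + old_total          # 12833
total_detected = new_detected + old_detected   # 11982
overall_recall = total_detected / total_attacks

print(total_attacks == fn + tp, total_detected == tp, overall_recall)
```

The overall ratio matches the recall reported for the combined classifiers.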


### Evaluate single attack types

In [2025]:
# load test set
df_test = pd.read_csv('NSL-KDD Original Datasets/KDDTest+.txt', sep=",", header=None)
df_test = df_test[df_test.columns[:-1]]
df_test.columns = titles.to_list()
y_test = df_test['label']
df_test = df_test.drop(['num_outbound_cmds'],axis=1)
df_test_original = df_test
df = df_test_original

dos_index = df.index[(df['label'].isin(dos_attacks))].tolist()
probe_index = df.index[(df['label'].isin(probe_attacks))].tolist()
r2l_index = df.index[(df['label'].isin(r2l_attacks))].tolist()
u2r_index = df.index[(df['label'].isin(u2r_attacks))].tolist()

print('Evaluation split into single attack type:')
print("Number of dos attacks: ", result[dos_index].shape[0])
print("Number of detected attacks: ", result[dos_index].sum())
print("Ratio of detection: ", result[dos_index].sum()/result[dos_index].shape[0])

print("Number of probe attacks: ", result[probe_index].shape[0])
print("Number of detected attacks: ", result[probe_index].sum())
print("Ratio of detection: ", result[probe_index].sum()/result[probe_index].shape[0])

print("Number of r2l attacks: ", result[r2l_index].shape[0])
print("Number of detected attacks: ", result[r2l_index].sum())
print("Ratio of detection: ", result[r2l_index].sum()/result[r2l_index].shape[0])

print("Number of u2r attacks: ", result[u2r_index].shape[0])
print("Number of detected attacks: ", result[u2r_index].sum())
print("Ratio of detection: ", result[u2r_index].sum()/result[u2r_index].shape[0])

Evaluation split into single attack type:
Number of dos attacks:  7460
Number of detected attacks:  6893
Ratio of detection:  0.923994638069705
Number of probe attacks:  2421
Number of detected attacks:  2246
Ratio of detection:  0.9277158199091284
Number of r2l attacks:  2885
Number of detected attacks:  2776
Ratio of detection:  0.9622183708838822
Number of u2r attacks:  67
Number of detected attacks:  67
Ratio of detection:  1.0
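
The four near-identical print blocks can be expressed as one loop over the attack categories. The sketch below uses toy predictions and indices; `detection_rate` and `category_index` are hypothetical names.

```python
import numpy as np

def detection_rate(result, index):
    """Fraction of the samples at `index` that were flagged as attacks (1)."""
    flagged = result[index]
    return flagged.sum() / flagged.shape[0]

# Toy predictions and per-category row indices.
result = np.array([1, 0, 1, 1, 0, 1])
category_index = {'dos': [0, 1], 'probe': [2, 3], 'r2l': [5]}

rates = {name: detection_rate(result, idx) for name, idx in category_index.items()}
for name, rate in rates.items():
    print(f'Ratio of detection for {name}: {rate}')
```

In the notebook, `category_index` would map each category name to the corresponding `dos_index`, `probe_index`, `r2l_index` or `u2r_index` list.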
