<a href="https://colab.research.google.com/github/Mondirkb/dataset-CIC-IDS-2017/blob/main/data_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML-based Intrusion Detection System
**Author:** Moundir Chemseddine Kebir
**Dataset:** [CIC-IDS-2017](https://www.unb.ca/cic/datasets/ids-2017.html)



## Methodology
The pipeline follows the structure defined in **Section 5.1** of the thesis:
1.  **Preprocessing (Sec 5.1.1):** Cleaning, label encoding, normalization to the range [-1, 0], and balancing.
2.  **Model Architecture (Sec 5.1.2):** Implementation of the Neural Network (NN), CNN, and LSTM models based on Thesis Tables 5.2, 5.3, and 5.6.
3.  **Evaluation (Sec 5.2):** Metrics include Accuracy, Precision, Recall, and F1-Score.
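
As a minimal, library-free sketch of the four evaluation metrics listed above, the function below computes them for a single positive class from true and predicted labels. The function name `binary_metrics` and the toy labels are assumptions made for illustration; they are not thesis results.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard every ratio against an empty denominator
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (illustrative only): 1 = attack, 0 = benign
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On a heavily imbalanced dataset, accuracy alone can look high even when rare attack classes are missed, which is why precision, recall, and F1 are reported alongside it.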

# 1. Dataset Preparation

## 1.0 Prerequisites: Generating the Processed Data
Before running this notebook, the raw CIC-IDS-2017 CSV files must be converted into a machine-learning-ready format.

**Origin of the File:**
The file `Processed_Data.npz` is the output of the **Preprocessing Phase**. It was generated by running the `Preprocessing.py` script, which performs the following steps:
1.  **Cleaning:** Removing noise and non-numeric identifiers.
2.  **Normalization:** Scaling features to the range [-1, 0].
3.  **Balancing:** Oversampling minority classes and undersampling majority classes.
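
The normalization step can be illustrated with a short, library-free sketch of min-max scaling into [-1, 0]. The helper `scale_to_range` and the sample values are hypothetical; the contents of `Preprocessing.py` are not shown here, so this is only one plausible reading of the step. With scikit-learn, `MinMaxScaler(feature_range=(-1, 0))` applies the same formula column by column.

```python
def scale_to_range(values, low=-1.0, high=0.0):
    """Min-max scale a list of numbers into the interval [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Constant feature: map every value to the lower bound
        return [low for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]


# Illustrative flow durations (hypothetical sample, not the full dataset)
flow_duration = [4.0, 1.0, 3.0, 1.0, 609.0]
scaled = scale_to_range(flow_duration)
print(scaled)
```

The smallest value maps to -1 and the largest to 0. In practice the minimum and maximum would normally be computed on the training split only and then reused for the test split.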



This section covers dataset loading. The data has already been preprocessed (cleaned, normalized, balanced) and saved as a compressed NumPy archive (`.npz`) for efficiency and consistency with the thesis methodology (**Section 5.1.1**).
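
For readers unfamiliar with the `.npz` format, here is a minimal sketch of writing and reading a compressed NumPy archive. The array names (`X_train`, `y_train`, `X_test`, `y_test`) and the random data are assumptions for illustration; the actual keys stored in `Processed_Data.npz` are not documented in this notebook. An in-memory buffer stands in for the file on disk.

```python
import io

import numpy as np

# Hypothetical stand-in arrays: 19 features, values already in [-1, 0]
rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 0.0, size=(8, 19)).astype(np.float32)
y_train = rng.integers(0, 2, size=8)
X_test = rng.uniform(-1.0, 0.0, size=(4, 19)).astype(np.float32)
y_test = rng.integers(0, 2, size=4)

# Write a compressed archive (a file path works the same way as a buffer)
buffer = io.BytesIO()
np.savez_compressed(buffer, X_train=X_train, y_train=y_train,
                    X_test=X_test, y_test=y_test)
buffer.seek(0)

# Read it back; data.files lists the array names stored in the archive
with np.load(buffer) as data:
    loaded = {name: data[name] for name in data.files}

print({name: arr.shape for name, arr in loaded.items()})
```

Inspecting `data.files` first is a convenient way to discover which arrays an unfamiliar archive contains.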

## 1.1 Importing the CIC-IDS-2017 Dataset
The cell below mounts Google Drive and reads the raw CIC-IDS-2017 CSV files from `BASE_PATH`, concatenating them into a single DataFrame for inspection.

In [None]:
import pandas as pd
import numpy as np
import glob
import os
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

# --- CONFIGURATION ---
# In Google Colab, __file__ is not defined.
# You need to explicitly set BASE_PATH to the directory where your CSV files are located.
# For example, if your CSVs are in the root of your Colab environment (after uploading):
# BASE_PATH = '/content/'

# Mount Google Drive and set BASE_PATH to your Colab folder if CSVs are there
from google.colab import drive
drive.mount('/content/drive')
BASE_PATH = '/content/drive/My Drive/Colab/' # <--- IMPORTANT: Adjust this path to where your RAW CSV files are located!
# BASE_PATH = os.getcwd() # Using current working directory as a placeholder. Adjust as needed.

def preprocess_cicids2017():
    print(f"Working directory for raw CSVs: {BASE_PATH}")
    print("--- STEP 1: LOAD & CONCATENATE --- Subset (raw data) ---")

    all_files = glob.glob(os.path.join(BASE_PATH, "*.csv"))

    if not all_files:
        print("ERROR: No CSV files found! Please ensure your raw CSVs are in the specified BASE_PATH.")
        return None # Return None if no files are found

    df_list = []
    for filename in all_files:
        print(f"Reading {os.path.basename(filename)}...")
        try:
            # Try standard UTF-8 first
            df = pd.read_csv(filename, index_col=None, header=0, low_memory=False)
            df_list.append(df)
        except UnicodeDecodeError:
            print(f"   -> UTF-8 failed. Trying CP1252 encoding for {os.path.basename(filename)}...")
            try:
                # Fallback for the Thursday file causing issues
                df = pd.read_csv(filename, index_col=None, header=0, low_memory=False, encoding='cp1252')
                df_list.append(df)
            except Exception as e:
                print(f"   -> Failed to read file {os.path.basename(filename)}: {e}")

    if not df_list:
        print("CRITICAL ERROR: No data loaded from CSVs.")
        return None

    df = pd.concat(df_list, axis=0, ignore_index=True)
    print(f"Total Rows Loaded from raw CSVs: {df.shape[0]}")
    return df # Return the concatenated DataFrame

# Call the function and display the head of the resulting DataFrame
df_raw_data = preprocess_cicids2017()
if df_raw_data is not None:
    print("\nHead of the raw concatenated table:")
    display(df_raw_data.head())

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Working directory for raw CSVs: /content/drive/My Drive/Colab/
--- STEP 1: LOAD & CONCATENATE --- Subset (raw data) ---
Reading Monday-WorkingHours.pcap_ISCX.csv...
Reading Wednesday-workingHours.pcap_ISCX.csv...
Reading Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv...
Reading Tuesday-WorkingHours.pcap_ISCX.csv...
Reading Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv...
Reading Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv...
   -> UTF-8 failed. Trying CP1252 encoding for Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv...
Reading Friday-WorkingHours-Morning.pcap_ISCX.csv...
Reading Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv...
Total Rows Loaded from raw CSVs: 3119345

Head of the raw concatenated table:


Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,Total Backward Packets,...,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,192.168.10.5-8.254.250.126-49188-80-6,8.254.250.126,80.0,192.168.10.5,49188.0,6.0,03/07/2017 08:55:58,4.0,2.0,0.0,...,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
1,192.168.10.5-8.254.250.126-49188-80-6,8.254.250.126,80.0,192.168.10.5,49188.0,6.0,03/07/2017 08:55:58,1.0,2.0,0.0,...,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
2,192.168.10.5-8.254.250.126-49188-80-6,8.254.250.126,80.0,192.168.10.5,49188.0,6.0,03/07/2017 08:55:58,1.0,2.0,0.0,...,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
3,192.168.10.5-8.254.250.126-49188-80-6,8.254.250.126,80.0,192.168.10.5,49188.0,6.0,03/07/2017 08:55:58,1.0,2.0,0.0,...,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
4,192.168.10.14-8.253.185.121-49486-80-6,8.253.185.121,80.0,192.168.10.14,49486.0,6.0,03/07/2017 08:56:22,3.0,2.0,0.0,...,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN


## 1.2 Data Preprocessing
The preprocessing phase removes invalid samples: rows containing null values and duplicate rows. In this initial approach, a binary configuration is adopted. The features and the target variable (`Label`) are also established at this stage. No feature engineering is performed, as that is handled within the model itself.
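
One way a binary configuration could be derived from the original labels is sketched below: `BENIGN` maps to 0 and every attack type maps to 1. The helper `to_binary_labels` is hypothetical and is not part of the notebook cells, which keep the original multi-class labels.

```python
def to_binary_labels(labels, benign_label="BENIGN"):
    """Map the benign label to 0 and every other (attack) label to 1."""
    return [0 if label.strip() == benign_label else 1 for label in labels]


# Label names as they appear in CIC-IDS-2017
sample_labels = ["BENIGN", "DoS Hulk", "PortScan", "BENIGN", "DDoS"]
binary = to_binary_labels(sample_labels)
print(binary)  # [0, 1, 1, 0, 1]
```

The `strip()` call guards against stray whitespace around label values.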

In [None]:
# 1.2.1 - Data preprocessing

# Find and handle null values
null_counts = df_raw_data.isnull().sum()
# Print the number of null values
print(f"{null_counts.sum()} null entries have been found in the dataset\n")
# Drop rows that contain any null value
df_raw_data.dropna(inplace=True)          # or df_raw_data = df_raw_data.dropna()

# Find and handle duplicates
duplicate_count = df_raw_data.duplicated().sum()
# Print the number of duplicate entries
print(f"{duplicate_count} duplicate entries have been found in the dataset\n")
# Remove duplicates
df_raw_data.drop_duplicates(inplace=True)  # or df_raw_data = df_raw_data.drop_duplicates()
# Display relative message
print("All duplicates have been removed\n")

# Reset the index after dropping rows
df_raw_data.reset_index(drop=True, inplace=True)

# Inspect the dataset for categorical columns
print("Categorical columns:",df_raw_data.select_dtypes(include=['object']).columns.tolist(),'\n')

# Print the first 5 lines
df_raw_data.head()

24532528 null entries have been found in the dataset

202 duplicate entries have been found in the dataset

All duplicates have been removed

Categorical columns: ['Flow ID', ' Source IP', ' Destination IP', ' Timestamp', ' Label'] 



Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,Total Backward Packets,...,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,192.168.10.5-8.254.250.126-49188-80-6,8.254.250.126,80.0,192.168.10.5,49188.0,6.0,03/07/2017 08:55:58,4.0,2.0,0.0,...,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
1,192.168.10.5-8.254.250.126-49188-80-6,8.254.250.126,80.0,192.168.10.5,49188.0,6.0,03/07/2017 08:55:58,1.0,2.0,0.0,...,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
2,192.168.10.14-8.253.185.121-49486-80-6,8.253.185.121,80.0,192.168.10.14,49486.0,6.0,03/07/2017 08:56:22,3.0,2.0,0.0,...,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
3,192.168.10.14-8.253.185.121-49486-80-6,8.253.185.121,80.0,192.168.10.14,49486.0,6.0,03/07/2017 08:56:22,1.0,2.0,0.0,...,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
4,192.168.10.3-192.168.10.9-88-1031-6,192.168.10.9,1031.0,192.168.10.3,88.0,6.0,03/07/2017 08:56:38,609.0,7.0,4.0,...,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN


In [None]:
# Clean column names by stripping whitespace
df_raw_data.columns = df_raw_data.columns.str.strip()

columns_to_keep = [
    'Protocol',
    'Flow Duration',
    'Total Fwd Packets',
    'Total Backward Packets',
    'Total Length of Fwd Packets', # Corrected from ' Fwd Packets Length Total'
    'Total Length of Bwd Packets', # Corrected from ' Bwd Packets Length Total'
    'Fwd Packet Length Max',
    'Fwd Packet Length Min',
    'Fwd Packet Length Mean',
    'Fwd Packet Length Std',
    'min_seg_size_forward',        # Corrected from 'Fwd Seg Size Min'
    'Active Mean',
    'Active Std',
    'Active Max',
    'Active Min',
    'Idle Mean',
    'Idle Std',
    'Idle Max',
    'Idle Min',
    'Label'
]

# Create a new DataFrame with only the specified columns
df_filtered = df_raw_data[columns_to_keep].copy()

print("DataFrame after keeping only specified columns:")
display(df_filtered.head(5))

DataFrame after keeping only specified columns:


Unnamed: 0,Protocol,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,6.0,4.0,2.0,0.0,12.0,0.0,6.0,6.0,6.0,0.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
1,6.0,1.0,2.0,0.0,12.0,0.0,6.0,6.0,6.0,0.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
2,6.0,3.0,2.0,0.0,12.0,0.0,6.0,6.0,6.0,0.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
3,6.0,1.0,2.0,0.0,12.0,0.0,6.0,6.0,6.0,0.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
4,6.0,609.0,7.0,4.0,484.0,414.0,233.0,0.0,69.142857,111.967895,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN


In [None]:

# 1.2.2 - Inspection of Target Feature

print('Shape of Dataframe: ',df_filtered.shape,'\n')
print('Inspection of Target Feature - y:\n')
# Target feature counts
print(df_filtered['Label'].value_counts())

Shape of Dataframe:  (2829183, 20) 

Inspection of Target Feature - y:

Label
BENIGN                        2272487
DoS Hulk                       230123
PortScan                       158930
DDoS                           128027
DoS GoldenEye                   10293
FTP-Patator                      7938
SSH-Patator                      5897
DoS slowloris                    5796
DoS Slowhttptest                 5499
Bot                              1966
Web Attack – Brute Force         1507
Web Attack – XSS                  652
Infiltration                       36
Web Attack – Sql Injection         21
Heartbleed                         11
Name: count, dtype: int64
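
The counts above are heavily skewed toward `BENIGN`, which is what the balancing step described in Section 1.0 is meant to address. The sketch below shows random under- and oversampling to a fixed number of rows per class using only the standard library. The function `rebalance`, the target of 10 rows per class, and the toy data are assumptions; the actual targets used for the thesis are not stated here. The `RandomOverSampler` and `RandomUnderSampler` classes imported in the first cell offer library versions of the same idea.

```python
import random
from collections import Counter, defaultdict


def rebalance(samples, labels, target_per_class, seed=42):
    """Randomly under- or oversample each class to target_per_class rows."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    out_samples, out_labels = [], []
    for label, rows in by_class.items():
        if len(rows) >= target_per_class:
            # Majority class: draw without replacement
            chosen = rng.sample(rows, target_per_class)
        else:
            # Minority class: keep all rows, then draw extras with replacement
            chosen = rows + rng.choices(rows, k=target_per_class - len(rows))
        out_samples.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_samples, out_labels


# Toy imbalanced data (illustrative only); samples are just row indices
labels = ["BENIGN"] * 50 + ["DoS Hulk"] * 10 + ["Heartbleed"] * 2
samples = list(range(len(labels)))
bal_samples, bal_labels = rebalance(samples, labels, target_per_class=10)
print(Counter(bal_labels))
```

Resampling is usually applied to the training split only, so that test metrics still reflect the original class distribution.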


# 2. Model Architecture: Neural Network (NN)

Following the methodology described in **Section 5.1.2** of the thesis, we now implement the Neural Network model. This section covers the construction, compilation, and initial training setup for the NN. The first step separates the features from the target and encodes the labels as integers.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Separate features (X) and target (y)
X = df_filtered.drop('Label', axis=1)
y = df_filtered['Label']

# Encode the target variable (Label) into numerical format
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Convert to numpy arrays
X = X.values
y = y_encoded

# Determine the input shape for the Neural Network
input_shape = X.shape[1]

print(f"Shape of features (X): {X.shape}")
print(f"Shape of labels (y): {y.shape}")
print(f"Number of input features (input_shape): {input_shape}")
print("Encoded Labels (first 5):", y[:5])
print("Original Labels (first 5):", label_encoder.inverse_transform(y[:5]))
print("All unique labels and their encodings:")
for i, label in enumerate(label_encoder.classes_):
    print(f"  {label}: {i}")

Shape of features (X): (2829183, 19)
Shape of labels (y): (2829183,)
Number of input features (input_shape): 19
Encoded Labels (first 5): [0 0 0 0 0]
Original Labels (first 5): ['BENIGN' 'BENIGN' 'BENIGN' 'BENIGN' 'BENIGN']
All unique labels and their encodings:
  BENIGN: 0
  Bot: 1
  DDoS: 2
  DoS GoldenEye: 3
  DoS Hulk: 4
  DoS Slowhttptest: 5
  DoS slowloris: 6
  FTP-Patator: 7
  Heartbleed: 8
  Infiltration: 9
  PortScan: 10
  SSH-Patator: 11
  Web Attack – Brute Force: 12
  Web Attack – Sql Injection: 13
  Web Attack – XSS: 14
