# Network Traffic Classification using Machine Learning Techniques

# Overview

Develop classification models using Python programming to analyze a network-related dataset. 

The primary goal is to explore the dataset, preprocess it, create and evaluate different classification models, and report your findings. 

This assignment will enhance your understanding of machine learning techniques, data preprocessing, and model evaluation while applying them to a practical problem related to network security.

# Dataset

This is a real-world dataset collected from the network of Universidad del Cauca, Popayán, Colombia, over six days in 2017 (April 26, 27 and 28, and May 9, 11 and 15), using multiple packet-capture and data-extraction tools.

The dataset consists of 3,577,296 instances and 87 features and was originally designed for application classification. Each row represents a traffic flow from a source to a destination, and each column represents a feature of that flow.

The dataset was downloaded from Kaggle, where it is published as "IP Network Traffic Flows, Labeled with 75 Apps."

# Reference data


https://www.rfc-editor.org/

# Purpose

This study uses the dataset to train classification models that predict which application each traffic flow belongs to.


# Literature Review

The Kaggle dataset is structured similarly to session data.

In Chapter 7, "Session Data," of "The Tao of Network Security Monitoring," Richard Bejtlich describes session data as a summary of a conversation between two parties.

The basic elements of session data are:
- Source IP
- Source Port
- Destination IP
- Destination Port
- Timestamp
- Measure of the amount of information exchanged during the session

Session data is used in this analysis because it can track intruder activity in a content-neutral way: it records who communicated with whom, when, and how much, without inspecting the payload.
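The elements listed above can be sketched as a small record type. This is only an illustration: the class name, field names, and example values are assumptions made here (the addresses come from the documentation-only ranges), not something taken from the book or from the dataset.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SessionRecord:
    """Summary of one conversation between two parties (hypothetical field names)."""

    source_ip: str
    source_port: int
    destination_ip: str
    destination_port: int
    timestamp: datetime
    bytes_exchanged: int  # measure of the amount of information exchanged


# Made-up example values, used only to show the shape of a record
record = SessionRecord(
    source_ip="192.0.2.10",
    source_port=52422,
    destination_ip="198.51.100.7",
    destination_port=443,
    timestamp=datetime(2017, 4, 26, 11, 11, 17),
    bytes_exchanged=1500,
)
print(record)
```

Each row of the dataset carries these same elements (Source.IP, Source.Port, Destination.IP, Destination.Port, Timestamp) plus many derived flow statistics.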



# Assumption Made

# Origin of CICFlowMeter

https://www.unb.ca/cic/research/applications.html#CICFlowMeter

https://www.kaggle.com/datasets/jsrojas/ip-network-traffic-flows-labeled-with-87-apps 

https://www.ntop.org/products/traffic-analysis/ntop/

# Environment Setup 

In [75]:
%pip install --upgrade pip
%pip install --upgrade setuptools 
%pip install --upgrade pandas 
%pip install --upgrade scikit-learn
%pip install --upgrade kagglehub 

%pip install --upgrade matplotlib 
%pip install --upgrade seaborn 
%pip install --upgrade ipywidgets 


Note: you may need to restart the kernel to use updated packages.


## Install PyTorch with CUDA (Optional)

This section installs PyTorch built with CUDA 12.4 support.

In [76]:
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

Looking in indexes: https://download.pytorch.org/whl/cu124
Note: you may need to restart the kernel to use updated packages.


## Install PyTorch without CUDA

In [77]:
%pip install --upgrade torch torchvision torchaudio

Note: you may need to restart the kernel to use updated packages.


# Import Packages

In [78]:
# setup automl

In [137]:
import kagglehub

import numpy
import pandas
import matplotlib
import seaborn

import os
import shutil
import torch

from sklearn.model_selection import train_test_split

# Data Retrieval

The data is retrieved with the kagglehub package, which simplifies downloading datasets from Kaggle.

In [80]:
def download_csv_file_from_kaggle_to_project_folder() -> None:
    """Download the dataset with kagglehub and move its CSV file into the project folder."""

    while True:
        path = kagglehub.dataset_download("jsrojas/ip-network-traffic-flows-labeled-with-87-apps")
        downloaded_files = os.listdir(path)

        if len(downloaded_files) == 0:
            # The folder returned by kagglehub is empty (for example, the CSV was
            # moved out by an earlier run), so remove it and download again
            os.removedirs(path)
            continue

        # Move the first downloaded file into the current working directory
        # https://www.freecodecamp.org/news/python-get-current-directory/
        current_project_folder: str = os.getcwd()
        destination_file_path: str = os.path.join(current_project_folder, downloaded_files[0])
        source_file_path: str = os.path.join(path, downloaded_files[0])

        shutil.move(source_file_path, destination_file_path)

        return


In [81]:
download_csv_file_from_kaggle_to_project_folder()

Downloading from https://www.kaggle.com/api/v1/datasets/download/jsrojas/ip-network-traffic-flows-labeled-with-87-apps?dataset_version_number=1...


100%|██████████| 514M/514M [00:24<00:00, 21.8MB/s] 

Extracting files...





# GPU Configuration

In [82]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:30:10_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0


In [83]:
!nvidia-smi

Tue Jan 14 17:08:55 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 561.17                 Driver Version: 561.17         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX A500 Laptop GPU   WDDM  |   00000000:03:00.0 Off |                  N/A |
| N/A   52C    P8              5W /   30W |      69MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [84]:
torch.cuda.empty_cache()

In [85]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Project Configuration

In [86]:
PROJECT_FOLDER_FILE_PATH: str = os.getcwd()
SOURCE_CSV_FILE: str = os.path.join(PROJECT_FOLDER_FILE_PATH, "Dataset-Unicauca-Version2-87Atts.csv")
STACKED_AUTOENCODER_MODEL_FILE_PATH: str = os.path.join(PROJECT_FOLDER_FILE_PATH, "stacked_autoencoder_best_model.pt")
SUPERVISED_STACKED_AUTOENCODER_MODEL_FILE_PATH: str = os.path.join(PROJECT_FOLDER_FILE_PATH, "supervised_stacked_autoencoder_best_model.pt")
CLASSIFICATION_TEST_STACKED_AUTOENCODER_MODEL_FILE_PATH: str = os.path.join(PROJECT_FOLDER_FILE_PATH, "classification_test_stacked_autoencoder_best_model.pt")


# Data Loading into DataFrame

* Load and explore the dataset.

The CSV file is read in chunks of 100,000 rows and concatenated into a single pandas DataFrame for data exploration.

In [87]:
chunk_size: int = 100000
data_chunks: list = []

for chunk in pandas.read_csv(SOURCE_CSV_FILE, chunksize=chunk_size):
    data_chunks.append(chunk)


network_traffic_analysis_dataframe: pandas.DataFrame = pandas.concat(data_chunks, ignore_index=True)

In [88]:
network_traffic_analysis_dataframe

Unnamed: 0,Flow.ID,Source.IP,Source.Port,Destination.IP,Destination.Port,Protocol,Timestamp,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets,...,Active.Std,Active.Max,Active.Min,Idle.Mean,Idle.Std,Idle.Max,Idle.Min,Label,L7Protocol,ProtocolName
0,172.19.1.46-10.200.7.7-52422-3128-6,172.19.1.46,52422,10.200.7.7,3128,6,26/04/201711:11:17,45523,22,55,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY
1,172.19.1.46-10.200.7.7-52422-3128-6,10.200.7.7,3128,172.19.1.46,52422,6,26/04/201711:11:17,1,2,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY
2,10.200.7.217-50.31.185.39-38848-80-6,50.31.185.39,80,10.200.7.217,38848,6,26/04/201711:11:17,1,3,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,7,HTTP
3,10.200.7.217-50.31.185.39-38848-80-6,50.31.185.39,80,10.200.7.217,38848,6,26/04/201711:11:17,217,1,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,7,HTTP
4,192.168.72.43-10.200.7.7-55961-3128-6,192.168.72.43,55961,10.200.7.7,3128,6,26/04/201711:11:17,78068,5,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3577291,10.200.7.199-98.138.79.73-42135-443-6,98.138.79.73,443,10.200.7.199,42135,6,15/05/201705:43:40,2290821,5,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL
3577292,10.200.7.217-98.138.79.73-51546-443-6,98.138.79.73,443,10.200.7.217,51546,6,15/05/201705:46:10,24,5,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL
3577293,10.200.7.218-98.138.79.73-44366-443-6,98.138.79.73,443,10.200.7.218,44366,6,15/05/201705:45:39,2591653,6,5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL
3577294,10.200.7.195-98.138.79.73-52341-443-6,98.138.79.73,443,10.200.7.195,52341,6,15/05/201705:45:59,2622421,4,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL


In [89]:
network_traffic_analysis_dataframe["Label"].unique()

array(['BENIGN'], dtype=object)

# Basic Data Exploration


* Handle missing data and outliers.
* Perform data visualization to gain insights into the dataset.

### Features and their descriptions

- Flow duration
    - Duration of the flow in microseconds
- total Fwd Packet		
    - Total packets in the forward direction
- total Bwd packets		
    - Total packets in the backward direction
- total Length of Fwd Packet	
    - Total size of packet in forward direction
- total Length of Bwd Packet	
    - Total size of packet in backward direction
- Fwd Packet Length Min 		
    - Minimum size of packet in forward direction
- Fwd Packet Length Max 		
    - Maximum size of packet in forward direction
- Fwd Packet Length Mean		
    - Mean size of packet in forward direction
- Fwd Packet Length Std		
    - Standard deviation size of packet in forward direction
- Bwd Packet Length Min		
    - Minimum size of packet in backward direction
- Bwd Packet Length Max		
    - Maximum size of packet in backward direction
- Bwd Packet Length Mean		
    - Mean size of packet in backward direction
- Bwd Packet Length Std		
    - Standard deviation size of packet in backward direction
- Flow Byte/s
    - Number of flow bytes per second
- Flow Packets/s
    - Number of flow packets per second
- Flow IAT Mean			
    - Mean time between two packets sent in the flow
- Flow IAT Std			
    - Standard deviation time between two packets sent in the flow
- Flow IAT Max			
    - Maximum time between two packets sent in the flow
- Flow IAT Min			
    - Minimum time between two packets sent in the flow
- Fwd IAT Min			
    - Minimum time between two packets sent in the forward direction
- Fwd IAT Max			
    - Maximum time between two packets sent in the forward direction
- Fwd IAT Mean			
    - Mean time between two packets sent in the forward direction
- Fwd IAT Std			
    - Standard deviation time between two packets sent in the forward direction
- Fwd IAT Total   		
    - Total time between two packets sent in the forward direction
- Bwd IAT Min			
    - Minimum time between two packets sent in the backward direction
- Bwd IAT Max			
    - Maximum time between two packets sent in the backward direction
- Bwd IAT Mean			
    - Mean time between two packets sent in the backward direction
- Bwd IAT Std			
    - Standard deviation time between two packets sent in the backward direction
- Bwd IAT Total			
    - Total time between two packets sent in the backward direction
- Fwd PSH flag			
    - Number of times the PSH flag was set in packets travelling in the forward direction (0 for UDP)
- Bwd PSH Flag			
    - Number of times the PSH flag was set in packets travelling in the backward direction (0 for UDP)
- Fwd URG Flag			
    - Number of times the URG flag was set in packets travelling in the forward direction (0 for UDP)
- Bwd URG Flag			
    - Number of times the URG flag was set in packets travelling in the backward direction (0 for UDP)
- Fwd Header Length		
    - Total bytes used for headers in the forward direction
- Bwd Header Length		
    - Total bytes used for headers in the backward direction
- FWD Packets/s			
    - Number of forward packets per second
- Bwd Packets/s			
    - Number of backward packets per second
- Min Packet Length 		
    - Minimum length of a packet
- Max Packet Length 		
    - Maximum length of a packet
- Packet Length Mean 		
    - Mean length of a packet
- Packet Length Std		
    - Standard deviation length of a packet
- Packet Length Variance  	
    - Variance length of a packet
- FIN Flag Count 			
    - Number of packets with FIN
- SYN Flag Count 			
    - Number of packets with SYN
- RST Flag Count 			
    - Number of packets with RST
- PSH Flag Count 			
    - Number of packets with PUSH
- ACK Flag Count 			
    - Number of packets with ACK
- URG Flag Count 			
    - Number of packets with URG
- CWR Flag Count
    - Number of packets with CWR (named CWE.Flag.Count in the dataset)
- ECE Flag Count 			
    - Number of packets with ECE
- down/Up Ratio			
    - Download and upload ratio
- Average Packet Size 		
    - Average size of packet
- Avg Fwd Segment Size 		
    - Average size observed in the forward direction
- AVG Bwd Segment Size
    - Average size observed in the backward direction
- Fwd Header Length
    - Length of the forward packet header (this repeated feature appears as Fwd.Header.Length.1 in the dataset)
- Fwd Avg Bytes/Bulk		
    - Average number of bytes bulk rate in the forward direction
- Fwd AVG Packet/Bulk 		
    - Average number of packets bulk rate in the forward direction
- Fwd AVG Bulk Rate 		
    - Average number of bulk rate in the forward direction
- Bwd Avg Bytes/Bulk		
    - Average number of bytes bulk rate in the backward direction
- Bwd AVG Packet/Bulk 		
    - Average number of packets bulk rate in the backward direction
- Bwd AVG Bulk Rate 		
    - Average number of bulk rate in the backward direction
- Subflow Fwd Packets		
    - The average number of packets in a sub flow in the forward direction
- Subflow Fwd Bytes		
    - The average number of bytes in a sub flow in the forward direction
- Subflow Bwd Packets		
    - The average number of packets in a sub flow in the backward direction
- Subflow Bwd Bytes		
    - The average number of bytes in a sub flow in the backward direction
- Init_Win_bytes_forward		
    - The total number of bytes sent in initial window in the forward direction
- Init_Win_bytes_backward		
    - The total number of bytes sent in initial window in the backward direction
- Act_data_pkt_forward		
    - Count of packets with at least 1 byte of TCP data payload in the forward direction
- min_seg_size_forward		
    - Minimum segment size observed in the forward direction
- Active Min			
    - Minimum time a flow was active before becoming idle
- Active Mean			
    - Mean time a flow was active before becoming idle
- Active Max			
    - Maximum time a flow was active before becoming idle
- Active Std			
    - Standard deviation time a flow was active before becoming idle
- Idle Min			
    - Minimum time a flow was idle before becoming active
- Idle Mean			
    - Mean time a flow was idle before becoming active
- Idle Max			
    - Maximum time a flow was idle before becoming active
- Idle Std			
    - Standard deviation time a flow was idle before becoming active
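Because Flow duration is expressed in microseconds, the per-second features in the list above imply a unit conversion. The following is a minimal sketch of that conversion, assuming the rate is a count divided by the flow duration in seconds; the function name and the packet counts are hypothetical and are not taken from CICFlowMeter.

```python
def per_second_rate(count: float, flow_duration_us: float) -> float:
    """Convert a per-flow count into a per-second rate (duration given in microseconds)."""
    if flow_duration_us <= 0:
        raise ValueError("flow duration must be positive")
    # Multiply by 1,000,000 because there are one million microseconds in a second
    return count * 1_000_000 / flow_duration_us


# Hypothetical flows: 2 packets spread over 120 seconds, and 6 packets within 1 microsecond
slowest = per_second_rate(count=2, flow_duration_us=120_000_000)
fastest = per_second_rate(count=6, flow_duration_us=1)
print(slowest, fastest)
```

These two hypothetical flows were chosen because their rates coincide with the minimum (0.01666667) and maximum (6000000.0) of Flow.Packets.s reported in the summary statistics later in this notebook, which is consistent with the assumed formula but does not prove it.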

In [90]:
network_traffic_analysis_dataframe.columns

Index(['Flow.ID', 'Source.IP', 'Source.Port', 'Destination.IP',
       'Destination.Port', 'Protocol', 'Timestamp', 'Flow.Duration',
       'Total.Fwd.Packets', 'Total.Backward.Packets',
       'Total.Length.of.Fwd.Packets', 'Total.Length.of.Bwd.Packets',
       'Fwd.Packet.Length.Max', 'Fwd.Packet.Length.Min',
       'Fwd.Packet.Length.Mean', 'Fwd.Packet.Length.Std',
       'Bwd.Packet.Length.Max', 'Bwd.Packet.Length.Min',
       'Bwd.Packet.Length.Mean', 'Bwd.Packet.Length.Std', 'Flow.Bytes.s',
       'Flow.Packets.s', 'Flow.IAT.Mean', 'Flow.IAT.Std', 'Flow.IAT.Max',
       'Flow.IAT.Min', 'Fwd.IAT.Total', 'Fwd.IAT.Mean', 'Fwd.IAT.Std',
       'Fwd.IAT.Max', 'Fwd.IAT.Min', 'Bwd.IAT.Total', 'Bwd.IAT.Mean',
       'Bwd.IAT.Std', 'Bwd.IAT.Max', 'Bwd.IAT.Min', 'Fwd.PSH.Flags',
       'Bwd.PSH.Flags', 'Fwd.URG.Flags', 'Bwd.URG.Flags', 'Fwd.Header.Length',
       'Bwd.Header.Length', 'Fwd.Packets.s', 'Bwd.Packets.s',
       'Min.Packet.Length', 'Max.Packet.Length', 'Packet.Length.Mean',
  

### Data Type for all columns in the dataset

In [91]:
print(network_traffic_analysis_dataframe.dtypes[:20])

Flow.ID                         object
Source.IP                       object
Source.Port                      int64
Destination.IP                  object
Destination.Port                 int64
Protocol                         int64
Timestamp                       object
Flow.Duration                    int64
Total.Fwd.Packets                int64
Total.Backward.Packets           int64
Total.Length.of.Fwd.Packets      int64
Total.Length.of.Bwd.Packets    float64
Fwd.Packet.Length.Max            int64
Fwd.Packet.Length.Min            int64
Fwd.Packet.Length.Mean         float64
Fwd.Packet.Length.Std          float64
Bwd.Packet.Length.Max            int64
Bwd.Packet.Length.Min            int64
Bwd.Packet.Length.Mean         float64
Bwd.Packet.Length.Std          float64
dtype: object


In [92]:
print(network_traffic_analysis_dataframe.dtypes[21:40])

Flow.Packets.s    float64
Flow.IAT.Mean     float64
Flow.IAT.Std      float64
Flow.IAT.Max      float64
Flow.IAT.Min        int64
Fwd.IAT.Total     float64
Fwd.IAT.Mean      float64
Fwd.IAT.Std       float64
Fwd.IAT.Max       float64
Fwd.IAT.Min       float64
Bwd.IAT.Total     float64
Bwd.IAT.Mean      float64
Bwd.IAT.Std       float64
Bwd.IAT.Max       float64
Bwd.IAT.Min       float64
Fwd.PSH.Flags       int64
Bwd.PSH.Flags       int64
Fwd.URG.Flags       int64
Bwd.URG.Flags       int64
dtype: object


In [93]:
print(network_traffic_analysis_dataframe.dtypes[41:60])

Bwd.Header.Length           int64
Fwd.Packets.s             float64
Bwd.Packets.s             float64
Min.Packet.Length           int64
Max.Packet.Length           int64
Packet.Length.Mean        float64
Packet.Length.Std         float64
Packet.Length.Variance    float64
FIN.Flag.Count              int64
SYN.Flag.Count              int64
RST.Flag.Count              int64
PSH.Flag.Count              int64
ACK.Flag.Count              int64
URG.Flag.Count              int64
CWE.Flag.Count              int64
ECE.Flag.Count              int64
Down.Up.Ratio               int64
Average.Packet.Size       float64
Avg.Fwd.Segment.Size      float64
dtype: object


In [94]:
print(network_traffic_analysis_dataframe.dtypes[71:])

Subflow.Bwd.Bytes            int64
Init_Win_bytes_forward       int64
Init_Win_bytes_backward      int64
act_data_pkt_fwd             int64
min_seg_size_forward         int64
Active.Mean                float64
Active.Std                 float64
Active.Max                 float64
Active.Min                 float64
Idle.Mean                  float64
Idle.Std                   float64
Idle.Max                   float64
Idle.Min                   float64
Label                       object
L7Protocol                   int64
ProtocolName                object
dtype: object
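The slices used above ([:20], [21:40], [41:60] and [71:]) leave out positions 20, 40 and 60 to 70, so some columns are never printed. One possible way to cover every column is to walk through the frame in consecutive chunks. The helper name and the toy DataFrame below are illustrative assumptions, not part of the original notebook.

```python
import pandas


def iter_column_chunks(dataframe: pandas.DataFrame, chunk_size: int = 20):
    """Yield consecutive, non-overlapping column slices that together cover every column."""
    for start in range(0, dataframe.shape[1], chunk_size):
        yield dataframe.iloc[:, start:start + chunk_size]


# Toy frame with 45 columns standing in for the 87-column dataset
toy_dataframe = pandas.DataFrame({f"col_{i}": [i, i + 1] for i in range(45)})

column_chunks = list(iter_column_chunks(toy_dataframe, chunk_size=20))
for column_chunk in column_chunks:
    print(column_chunk.dtypes)
```

The same helper could be reused for the describe() and isnull() calls, which are sliced in a similar way.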


### Dataset Information

In [95]:
network_traffic_analysis_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3577296 entries, 0 to 3577295
Data columns (total 87 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   Flow.ID                      object 
 1   Source.IP                    object 
 2   Source.Port                  int64  
 3   Destination.IP               object 
 4   Destination.Port             int64  
 5   Protocol                     int64  
 6   Timestamp                    object 
 7   Flow.Duration                int64  
 8   Total.Fwd.Packets            int64  
 9   Total.Backward.Packets       int64  
 10  Total.Length.of.Fwd.Packets  int64  
 11  Total.Length.of.Bwd.Packets  float64
 12  Fwd.Packet.Length.Max        int64  
 13  Fwd.Packet.Length.Min        int64  
 14  Fwd.Packet.Length.Mean       float64
 15  Fwd.Packet.Length.Std        float64
 16  Bwd.Packet.Length.Max        int64  
 17  Bwd.Packet.Length.Min        int64  
 18  Bwd.Packet.Length.Mean       float64
 19  

### Describe the dataset

In [96]:
network_traffic_analysis_dataframe.iloc[:,:10].describe()

Unnamed: 0,Source.Port,Destination.Port,Protocol,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,37999.38,12042.46,6.005508,25442470.0,62.37799,65.34083
std,22017.13,20449.16,0.3274574,40144300.0,1094.086,1108.092
min,0.0,0.0,0.0,1.0,1.0,0.0
25%,3697.0,443.0,6.0,628.0,2.0,1.0
50%,49377.0,3128.0,6.0,584729.5,6.0,5.0
75%,53799.0,3128.0,6.0,45001530.0,15.0,15.0
max,65534.0,65534.0,17.0,120000000.0,453190.0,542196.0


In [97]:
network_traffic_analysis_dataframe.iloc[:,11:20].describe()

Unnamed: 0,Total.Length.of.Bwd.Packets,Fwd.Packet.Length.Max,Fwd.Packet.Length.Min,Fwd.Packet.Length.Mean,Fwd.Packet.Length.Std,Bwd.Packet.Length.Max,Bwd.Packet.Length.Min,Bwd.Packet.Length.Mean,Bwd.Packet.Length.Std
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,84457.42,512.3645,9.340408,114.9212,152.0501,1103.231,11.13491,254.7845,289.8878
std,2124319.0,1039.319,82.99983,246.4707,240.4702,2352.374,105.5422,506.0731,485.3004
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,6.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0
50%,208.0,206.0,0.0,46.57143,74.21124,81.0,0.0,30.14286,32.42474
75%,3629.0,613.0,6.0,122.5,207.9035,1366.0,0.0,256.75,423.2105
max,1345796000.0,32832.0,16060.0,16060.0,6225.487,37648.0,13032.0,13032.0,8434.804


In [98]:
network_traffic_analysis_dataframe.iloc[:,21:30].describe()

Unnamed: 0,Flow.Packets.s,Flow.IAT.Mean,Flow.IAT.Std,Flow.IAT.Max,Flow.IAT.Min,Fwd.IAT.Total,Fwd.IAT.Mean,Fwd.IAT.Std,Fwd.IAT.Max
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,88963.38,1422201.0,3365395.0,12850200.0,88702.01,24187960.0,3124467.0,3649620.0,12096240.0
std,402762.0,3550414.0,6260959.0,20765180.0,1605272.0,39625630.0,8358652.0,7390979.0,20491800.0
min,0.01666667,0.2,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,1.128096,415.0,8.485281,570.0,0.0,7.0,5.0,0.0,6.0
50%,33.93752,33202.38,68364.44,281239.5,1.0,389264.5,37006.79,47175.96,207629.0
75%,4214.963,936657.6,3980748.0,23915460.0,33.0,40011610.0,1549711.0,2932647.0,19269760.0
max,6000000.0,120000000.0,84852730.0,120000000.0,120000000.0,120000000.0,120000000.0,84852560.0,120000000.0


In [99]:
network_traffic_analysis_dataframe.iloc[:,41:50].describe()

Unnamed: 0,Bwd.Header.Length,Fwd.Packets.s,Bwd.Packets.s,Min.Packet.Length,Max.Packet.Length,Packet.Length.Mean,Packet.Length.Std,Packet.Length.Variance,FIN.Flag.Count
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,1743.621,77058.16,11905.22,3.043745,1333.25,198.8191,303.519,279273.6,0.007037159
std,30391.9,368315.3,108020.6,41.45472,2453.395,332.7427,432.6083,725860.8,0.0835921
min,0.0,0.008333337,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,32.0,0.5417242,0.1009873,0.0,6.0,6.0,0.0,0.0,0.0
50%,136.0,15.63422,2.951696,0.0,355.0,62.83333,106.9828,11445.31,0.0
75%,420.0,2164.502,83.44459,6.0,1460.0,250.0,481.8125,232143.2,0.0
max,12844400.0,6000000.0,5000000.0,7063.0,37648.0,10708.67,9268.781,85910310.0,1.0


In [100]:
network_traffic_analysis_dataframe.iloc[:,51:60].describe()

Unnamed: 0,RST.Flag.Count,PSH.Flag.Count,ACK.Flag.Count,URG.Flag.Count,CWE.Flag.Count,ECE.Flag.Count,Down.Up.Ratio,Average.Packet.Size,Avg.Fwd.Segment.Size
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,0.0006655865,0.405821,0.5995705,0.2773847,0.0,0.0006566412,0.9085471,207.563,114.9212
std,0.02579038,0.4910503,0.4899855,0.447708,0.0,0.0256166,1.269945,343.227,246.4707
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,6.0
50%,0.0,0.0,1.0,0.0,0.0,0.0,1.0,66.5,46.57143
75%,0.0,1.0,1.0,1.0,0.0,0.0,1.0,263.7184,122.5
max,1.0,1.0,1.0,1.0,0.0,1.0,293.0,16063.0,16060.0


In [101]:
network_traffic_analysis_dataframe.iloc[:,61:70].describe()

Unnamed: 0,Fwd.Header.Length.1,Fwd.Avg.Bytes.Bulk,Fwd.Avg.Packets.Bulk,Fwd.Avg.Bulk.Rate,Bwd.Avg.Bytes.Bulk,Bwd.Avg.Packets.Bulk,Bwd.Avg.Bulk.Rate,Subflow.Fwd.Packets,Subflow.Fwd.Bytes
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,1653.339,0.0,0.0,0.0,0.0,0.0,0.0,62.37799,46833.23
std,30088.9,0.0,0.0,0.0,0.0,0.0,0.0,1094.086,1816196.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,40.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,12.0
50%,152.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,443.0
75%,392.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0,1769.0
max,15439500.0,0.0,0.0,0.0,0.0,0.0,0.0,453190.0,678023600.0


In [102]:
network_traffic_analysis_dataframe.iloc[:,71:80].describe()

Unnamed: 0,Subflow.Bwd.Bytes,Init_Win_bytes_forward,Init_Win_bytes_backward,act_data_pkt_fwd,min_seg_size_forward,Active.Mean,Active.Std,Active.Max,Active.Min
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,84457.42,8984.691,2123.489,45.03535,25.69738,298199.0,183640.6,522937.2,167633.6
std,2124319.0,14101.26,7704.789,974.8192,6.025989,2349390.0,1325838.0,3266508.0,2064219.0
min,0.0,-1.0,-1.0,0.0,-1.0,0.0,0.0,0.0,0.0
25%,0.0,411.0,18.0,0.0,20.0,0.0,0.0,0.0,0.0
50%,208.0,5840.0,262.0,2.0,20.0,0.0,0.0,0.0,0.0
75%,3629.0,14600.0,660.0,9.0,32.0,45.0,0.0,57.0,2.0
max,1345796000.0,65535.0,65535.0,328694.0,523.0,114695000.0,72971360.0,114695000.0,114695000.0


In [103]:
network_traffic_analysis_dataframe.iloc[:,81:].describe()

Unnamed: 0,Idle.Std,Idle.Max,Idle.Min,L7Protocol
count,3577296.0,3577296.0,3577296.0,3577296.0
mean,1370991.0,9743845.0,7252097.0,102.9508
std,4814474.0,18885570.0,16007540.0,51.29198
min,0.0,0.0,0.0,1.0
25%,0.0,0.0,0.0,91.0
50%,0.0,0.0,0.0,126.0
75%,0.0,8034389.0,5369712.0,130.0
max,77387460.0,120000000.0,120000000.0,222.0
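The summary statistics show very heavy tails (for example, Total.Fwd.Packets has a 75th percentile of 15 and a maximum of 453190). As a starting point for the outlier-handling task, the sketch below flags values with the interquartile-range rule. The IQR rule is just one common choice assumed here, and the toy values are made up apart from the maximum quoted above.

```python
import pandas


def iqr_outlier_mask(series: pandas.Series, k: float = 1.5) -> pandas.Series:
    """Return a boolean mask that is True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    first_quartile = series.quantile(0.25)
    third_quartile = series.quantile(0.75)
    interquartile_range = third_quartile - first_quartile
    lower_bound = first_quartile - k * interquartile_range
    upper_bound = third_quartile + k * interquartile_range
    return (series < lower_bound) | (series > upper_bound)


# Toy values; only 453190 is taken from the describe() output
toy_packets = pandas.Series([2, 6, 15, 5, 8, 3, 453190], name="Total.Fwd.Packets")
outlier_mask = iqr_outlier_mask(toy_packets)
print(toy_packets[outlier_mask])
```

For network traffic, large flows are often legitimate, so a flag like this is better treated as a prompt for inspection than as an automatic removal rule.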


# Data Preparation

### Detect null values in the dataset

In [104]:
network_traffic_analysis_dataframe.isnull().sum()[:20]

Flow.ID                        0
Source.IP                      0
Source.Port                    0
Destination.IP                 0
Destination.Port               0
Protocol                       0
Timestamp                      0
Flow.Duration                  0
Total.Fwd.Packets              0
Total.Backward.Packets         0
Total.Length.of.Fwd.Packets    0
Total.Length.of.Bwd.Packets    0
Fwd.Packet.Length.Max          0
Fwd.Packet.Length.Min          0
Fwd.Packet.Length.Mean         0
Fwd.Packet.Length.Std          0
Bwd.Packet.Length.Max          0
Bwd.Packet.Length.Min          0
Bwd.Packet.Length.Mean         0
Bwd.Packet.Length.Std          0
dtype: int64

In [105]:
network_traffic_analysis_dataframe.isnull().sum()[21:40]

Flow.Packets.s    0
Flow.IAT.Mean     0
Flow.IAT.Std      0
Flow.IAT.Max      0
Flow.IAT.Min      0
Fwd.IAT.Total     0
Fwd.IAT.Mean      0
Fwd.IAT.Std       0
Fwd.IAT.Max       0
Fwd.IAT.Min       0
Bwd.IAT.Total     0
Bwd.IAT.Mean      0
Bwd.IAT.Std       0
Bwd.IAT.Max       0
Bwd.IAT.Min       0
Fwd.PSH.Flags     0
Bwd.PSH.Flags     0
Fwd.URG.Flags     0
Bwd.URG.Flags     0
dtype: int64

In [106]:
network_traffic_analysis_dataframe.isnull().sum()[41:60]

Bwd.Header.Length         0
Fwd.Packets.s             0
Bwd.Packets.s             0
Min.Packet.Length         0
Max.Packet.Length         0
Packet.Length.Mean        0
Packet.Length.Std         0
Packet.Length.Variance    0
FIN.Flag.Count            0
SYN.Flag.Count            0
RST.Flag.Count            0
PSH.Flag.Count            0
ACK.Flag.Count            0
URG.Flag.Count            0
CWE.Flag.Count            0
ECE.Flag.Count            0
Down.Up.Ratio             0
Average.Packet.Size       0
Avg.Fwd.Segment.Size      0
dtype: int64

In [107]:
network_traffic_analysis_dataframe.isnull().sum()[61:80]

Fwd.Header.Length.1        0
Fwd.Avg.Bytes.Bulk         0
Fwd.Avg.Packets.Bulk       0
Fwd.Avg.Bulk.Rate          0
Bwd.Avg.Bytes.Bulk         0
Bwd.Avg.Packets.Bulk       0
Bwd.Avg.Bulk.Rate          0
Subflow.Fwd.Packets        0
Subflow.Fwd.Bytes          0
Subflow.Bwd.Packets        0
Subflow.Bwd.Bytes          0
Init_Win_bytes_forward     0
Init_Win_bytes_backward    0
act_data_pkt_fwd           0
min_seg_size_forward       0
Active.Mean                0
Active.Std                 0
Active.Max                 0
Active.Min                 0
dtype: int64

In [108]:
network_traffic_analysis_dataframe.isnull().sum()[81:]

Idle.Std        0
Idle.Max        0
Idle.Min        0
Label           0
L7Protocol      0
ProtocolName    0
dtype: int64

### Detect NA values in the dataset

In pandas, isna() is an alias of isnull(), so the counts in this section match those of the null check.

In [109]:
network_traffic_analysis_dataframe.isna().sum()[:20]

Flow.ID                        0
Source.IP                      0
Source.Port                    0
Destination.IP                 0
Destination.Port               0
Protocol                       0
Timestamp                      0
Flow.Duration                  0
Total.Fwd.Packets              0
Total.Backward.Packets         0
Total.Length.of.Fwd.Packets    0
Total.Length.of.Bwd.Packets    0
Fwd.Packet.Length.Max          0
Fwd.Packet.Length.Min          0
Fwd.Packet.Length.Mean         0
Fwd.Packet.Length.Std          0
Bwd.Packet.Length.Max          0
Bwd.Packet.Length.Min          0
Bwd.Packet.Length.Mean         0
Bwd.Packet.Length.Std          0
dtype: int64

In [110]:
network_traffic_analysis_dataframe.isna().sum()[21:40]

Flow.Packets.s    0
Flow.IAT.Mean     0
Flow.IAT.Std      0
Flow.IAT.Max      0
Flow.IAT.Min      0
Fwd.IAT.Total     0
Fwd.IAT.Mean      0
Fwd.IAT.Std       0
Fwd.IAT.Max       0
Fwd.IAT.Min       0
Bwd.IAT.Total     0
Bwd.IAT.Mean      0
Bwd.IAT.Std       0
Bwd.IAT.Max       0
Bwd.IAT.Min       0
Fwd.PSH.Flags     0
Bwd.PSH.Flags     0
Fwd.URG.Flags     0
Bwd.URG.Flags     0
dtype: int64

In [111]:
network_traffic_analysis_dataframe.isna().sum()[41:60]

Bwd.Header.Length         0
Fwd.Packets.s             0
Bwd.Packets.s             0
Min.Packet.Length         0
Max.Packet.Length         0
Packet.Length.Mean        0
Packet.Length.Std         0
Packet.Length.Variance    0
FIN.Flag.Count            0
SYN.Flag.Count            0
RST.Flag.Count            0
PSH.Flag.Count            0
ACK.Flag.Count            0
URG.Flag.Count            0
CWE.Flag.Count            0
ECE.Flag.Count            0
Down.Up.Ratio             0
Average.Packet.Size       0
Avg.Fwd.Segment.Size      0
dtype: int64

In [112]:
network_traffic_analysis_dataframe.isna().sum()[61:80]

Fwd.Header.Length.1        0
Fwd.Avg.Bytes.Bulk         0
Fwd.Avg.Packets.Bulk       0
Fwd.Avg.Bulk.Rate          0
Bwd.Avg.Bytes.Bulk         0
Bwd.Avg.Packets.Bulk       0
Bwd.Avg.Bulk.Rate          0
Subflow.Fwd.Packets        0
Subflow.Fwd.Bytes          0
Subflow.Bwd.Packets        0
Subflow.Bwd.Bytes          0
Init_Win_bytes_forward     0
Init_Win_bytes_backward    0
act_data_pkt_fwd           0
min_seg_size_forward       0
Active.Mean                0
Active.Std                 0
Active.Max                 0
Active.Min                 0
dtype: int64

In [113]:
network_traffic_analysis_dataframe.isna().sum()[81:]

Idle.Std        0
Idle.Max        0
Idle.Min        0
Label           0
L7Protocol      0
ProtocolName    0
dtype: int64
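The per-slice checks above can be condensed into a single summary. The sketch below also counts infinite values, because isna() does not treat them as missing and division-based rate features could in principle contain them. The function name and the toy DataFrame are assumptions for illustration.

```python
import numpy
import pandas


def summarize_invalid_values(dataframe: pandas.DataFrame) -> dict:
    """Count missing values in all columns and infinite values in the numeric columns."""
    numeric_columns = dataframe.select_dtypes(include="number")
    return {
        "missing": int(dataframe.isna().sum().sum()),
        "infinite": int(numpy.isinf(numeric_columns.to_numpy(dtype="float64")).sum()),
    }


# Toy frame with one NaN, one None and one infinite value
toy_dataframe = pandas.DataFrame({
    "rate": [1.0, numpy.nan, numpy.inf],
    "count": [1, 2, 3],
    "name": ["x", None, "z"],
})
invalid_value_summary = summarize_invalid_values(toy_dataframe)
print(invalid_value_summary)
```

Running the same function on the full DataFrame would give one pair of totals instead of five sliced outputs.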

### Identify the columns with object data type

In [114]:
network_traffic_analysis_dataframe.select_dtypes("object").columns

Index(['Flow.ID', 'Source.IP', 'Destination.IP', 'Timestamp', 'Label',
       'ProtocolName'],
      dtype='object')

### Determine the classes in Label column

In [115]:
network_traffic_analysis_dataframe["Label"].unique()

array(['BENIGN'], dtype=object)

### Mapping of L7Protocol and ProtocolName

In [116]:
# Create dictionaries that map each L7Protocol code to its ProtocolName and vice versa

protocol_name_list: list[str] = network_traffic_analysis_dataframe["ProtocolName"].unique()

number_of_data_to_iterate: int = len(network_traffic_analysis_dataframe["ProtocolName"].unique())

protocol_index_to_name_mapping: dict[int, str] = {}
protocol_name_to_index_mapping: dict[str, int] = {}

for index in range(number_of_data_to_iterate):
    data = network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["ProtocolName"] == protocol_name_list[index]].head(1)[["L7Protocol", "ProtocolName"]]

    protocol_index_to_name_mapping[data["L7Protocol"].values[0]] = data["ProtocolName"].values[0]
    protocol_name_to_index_mapping[data["ProtocolName"].values[0]] = data["L7Protocol"].values[0]
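
The same two dictionaries can probably be built without filtering the dataframe once per protocol name. The sketch below is one possible alternative, not the method used in this notebook: it removes duplicate (`L7Protocol`, `ProtocolName`) pairs and zips the two columns. The `example_dataframe` and the names `index_to_name`, `name_to_index` and `is_one_to_one` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature stand-in for the two columns of interest
example_dataframe = pd.DataFrame({
    "L7Protocol": [131, 7, 131, 126, 7],
    "ProtocolName": ["HTTP_PROXY", "HTTP", "HTTP_PROXY", "GOOGLE", "HTTP"],
})

# One row per distinct (code, name) pair
unique_pairs = example_dataframe[["L7Protocol", "ProtocolName"]].drop_duplicates()

# tolist() converts the numpy integers to plain Python integers
index_to_name = dict(zip(unique_pairs["L7Protocol"].tolist(), unique_pairs["ProtocolName"].tolist()))
name_to_index = {name: code for code, name in index_to_name.items()}

# A one-to-one mapping has as many pairs as distinct codes and distinct names
is_one_to_one = len(unique_pairs) == len(index_to_name) == len(name_to_index)

print(index_to_name)
print(is_one_to_one)
```

The `is_one_to_one` check is worth running on the full data, because a code that appears with two different names would otherwise be silently overwritten in the dictionary.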




In [117]:
protocol_index_to_name_mapping

{np.int64(131): 'HTTP_PROXY',
 np.int64(7): 'HTTP',
 np.int64(130): 'HTTP_CONNECT',
 np.int64(91): 'SSL',
 np.int64(126): 'GOOGLE',
 np.int64(124): 'YOUTUBE',
 np.int64(119): 'FACEBOOK',
 np.int64(40): 'CONTENT_FLASH',
 np.int64(121): 'DROPBOX',
 np.int64(147): 'WINDOWS_UPDATE',
 np.int64(178): 'AMAZON',
 np.int64(212): 'MICROSOFT',
 np.int64(163): 'TOR',
 np.int64(122): 'GMAIL',
 np.int64(70): 'YAHOO',
 np.int64(68): 'MSN',
 np.int64(64): 'SSL_NO_CERT',
 np.int64(125): 'SKYPE',
 np.int64(221): 'MS_ONE_DRIVE',
 np.int64(114): 'MSSQL',
 np.int64(120): 'TWITTER',
 np.int64(143): 'APPLE_ICLOUD',
 np.int64(220): 'CLOUDFLARE',
 np.int64(169): 'UBUNTUONE',
 np.int64(219): 'OFFICE_365',
 np.int64(176): 'WIKIPEDIA',
 np.int64(201): 'OPENSIGNAL',
 np.int64(5): 'DNS',
 np.int64(60): 'HTTP_DOWNLOAD',
 np.int64(142): 'WHATSAPP',
 np.int64(145): 'APPLE_ITUNES',
 np.int64(175): 'FTP_DATA',
 np.int64(132): 'CITRIX',
 np.int64(140): 'APPLE',
 np.int64(222): 'MQTT',
 np.int64(211): 'INSTAGRAM',
 np.int

### Mapping of Protocol and OSI Model

In [118]:
# List the distinct numbers in the Protocol column before mapping them to protocol names
network_traffic_analysis_dataframe["Protocol"].unique()


array([ 6, 17,  0])

https://github.com/CanadianInstituteForCybersecurity/CICFlowMeter/blob/master/src/main/java/cic/cs/unb/ca/jnetpcap/PacketReader.java

In lines 401 to 438, the implementation sets the protocol number to 6 if the protocol is TCP and to 17 if the protocol is UDP.


https://github.com/CanadianInstituteForCybersecurity/CICFlowMeter/blob/master/src/main/java/cic/cs/unb/ca/jnetpcap/FlowFeature.java

Lines 188 to 206 show the same implementation:

- TCP = 6
- UDP = 17
- Others = 0

https://github.com/CanadianInstituteForCybersecurity/CICFlowMeter/blob/master/src/main/java/cic/cs/unb/ca/jnetpcap/BasicFlow.java

Lines 788 to 795 contain a mapping from the protocol number to the protocol string.

Therefore, the mapping of Protocol to code used in this notebook is TCP = 6, UDP = 17 and OTHERS = 0.


In [119]:
TCP: str = "TCP"
UDP: str = "UDP"
OTHERS: str = "OTHERS"

In [120]:
TCP_CODE: int = 6
UDP_CODE: int = 17
OTHER_CODE: int = 0

In [121]:
protocol_to_code_mapping: dict = {
    TCP: TCP_CODE,
    UDP: UDP_CODE,
    OTHERS: OTHER_CODE
}
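
As an illustration of how this mapping could be used, the sketch below inverts it and translates a `Protocol` column into readable names with `Series.map`. The names `code_to_protocol_mapping` and `example_protocol_column` are hypothetical and are not used elsewhere in the notebook.

```python
import pandas as pd

protocol_to_code_mapping = {"TCP": 6, "UDP": 17, "OTHERS": 0}

# Invert the dictionary so that a code can be looked up to get its name
code_to_protocol_mapping = {code: name for name, code in protocol_to_code_mapping.items()}

# Hypothetical sample of the Protocol column
example_protocol_column = pd.Series([6, 17, 0, 6], name="Protocol")

# Series.map replaces each code with the matching name
example_protocol_names = example_protocol_column.map(code_to_protocol_mapping)

print(example_protocol_names.tolist())
```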

### Split TimeStamp into 2 different columns (Date and Time)

Reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html

In [122]:
# https://regexr.com/
network_traffic_analysis_dataframe["Date"] = network_traffic_analysis_dataframe["Timestamp"].str.extract(r"(\d{1,2}\/\d{1,2}\/\d{2,4})", expand=False)


In [123]:
network_traffic_analysis_dataframe["Time"] = network_traffic_analysis_dataframe["Timestamp"].str.extract(r"(\d{1,2}\:\d{1,2}\:\d{1,2})", expand=False)
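
Both parts can also be captured in a single `str.extract` call by using named groups, which become the column names of the result. This is a sketch under the assumption that the date always precedes the time. The sample values are made up, and the optional whitespace (`\s*`) is there because the timestamps shown in later outputs appear without a separator.

```python
import pandas as pd

# Hypothetical sample timestamps, with and without a separator
example_timestamps = pd.Series(["26/04/201711:11:28", "15/05/2017 05:21:22"], name="Timestamp")

# Named groups (?P<Date>...) and (?P<Time>...) become the output column names
pattern = r"(?P<Date>\d{1,2}/\d{1,2}/\d{2,4})\s*(?P<Time>\d{1,2}:\d{1,2}:\d{1,2})"

date_and_time = example_timestamps.str.extract(pattern)

print(date_and_time)
```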


* Preprocess the data for modeling, including feature scaling and encoding categorical variables.

# Further Data Exploration

## Explore the distribution of Application used

The L7Protocol and ProtocolName columns are related: L7Protocol is the unique numerical code that represents the ProtocolName, and ProtocolName is the name of the application used to access the internet.

In [50]:
network_traffic_analysis_dataframe["ProtocolName"].value_counts().head(20)

ProtocolName
GOOGLE            959110
HTTP              683734
HTTP_PROXY        623210
SSL               404883
HTTP_CONNECT      317526
YOUTUBE           170781
AMAZON             86875
MICROSOFT          54710
GMAIL              40260
WINDOWS_UPDATE     34471
SKYPE              30657
FACEBOOK           29033
DROPBOX            25102
YAHOO              21268
TWITTER            18259
CLOUDFLARE         14737
MSN                14478
CONTENT_FLASH       8589
APPLE               7615
OFFICE_365          5941
Name: count, dtype: int64

## Analysis on TCP 3-way handshake in the dataset

### Explore the SYN Flag Count distribution

In [51]:
network_traffic_analysis_dataframe["SYN.Flag.Count"].value_counts()

SYN.Flag.Count
0    2961853
1     615443
Name: count, dtype: int64

In [52]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["SYN.Flag.Count"] == 0]["Protocol"].unique()

array([ 6, 17,  0])

In [53]:
network_traffic_analysis_dataframe[(network_traffic_analysis_dataframe["SYN.Flag.Count"] == 0) & (network_traffic_analysis_dataframe["Protocol"] == 6)]["Protocol"].count()

np.int64(2957532)

In [54]:
network_traffic_analysis_dataframe[(network_traffic_analysis_dataframe["SYN.Flag.Count"] == 0) & (network_traffic_analysis_dataframe["Protocol"] != 6)]["Protocol"].count()

np.int64(4321)

In [55]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["SYN.Flag.Count"] == 1]["Protocol"].unique()

array([6])

https://www.imperva.com/learn/ddos/syn-flood/ 

https://en.wikipedia.org/wiki/SYN_flood

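A TCP connection is initiated with a SYN packet, yet TCP flows without a SYN flag are far more frequent in this dataset: 2,957,532 TCP flows have a SYN flag count of 0, while only 615,443 (about 17% of the TCP flows) have a count of 1.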
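
The four filtered counts above can be read from a single table. The sketch below uses `pandas.crosstab` on a hypothetical miniature dataframe; on the real data the same call would take the `Protocol` and `SYN.Flag.Count` columns of `network_traffic_analysis_dataframe`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the two columns of interest
example_dataframe = pd.DataFrame({
    "Protocol": [6, 6, 6, 17, 0, 6],
    "SYN.Flag.Count": [0, 1, 0, 0, 0, 0],
})

# Rows are protocol numbers, columns are SYN flag counts, cells are flow counts
syn_by_protocol = pd.crosstab(example_dataframe["Protocol"], example_dataframe["SYN.Flag.Count"])

print(syn_by_protocol)
```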

In [56]:
network_traffic_analysis_dataframe.columns

Index(['Flow.ID', 'Source.IP', 'Source.Port', 'Destination.IP',
       'Destination.Port', 'Protocol', 'Timestamp', 'Flow.Duration',
       'Total.Fwd.Packets', 'Total.Backward.Packets',
       'Total.Length.of.Fwd.Packets', 'Total.Length.of.Bwd.Packets',
       'Fwd.Packet.Length.Max', 'Fwd.Packet.Length.Min',
       'Fwd.Packet.Length.Mean', 'Fwd.Packet.Length.Std',
       'Bwd.Packet.Length.Max', 'Bwd.Packet.Length.Min',
       'Bwd.Packet.Length.Mean', 'Bwd.Packet.Length.Std', 'Flow.Bytes.s',
       'Flow.Packets.s', 'Flow.IAT.Mean', 'Flow.IAT.Std', 'Flow.IAT.Max',
       'Flow.IAT.Min', 'Fwd.IAT.Total', 'Fwd.IAT.Mean', 'Fwd.IAT.Std',
       'Fwd.IAT.Max', 'Fwd.IAT.Min', 'Bwd.IAT.Total', 'Bwd.IAT.Mean',
       'Bwd.IAT.Std', 'Bwd.IAT.Max', 'Bwd.IAT.Min', 'Fwd.PSH.Flags',
       'Bwd.PSH.Flags', 'Fwd.URG.Flags', 'Bwd.URG.Flags', 'Fwd.Header.Length',
       'Bwd.Header.Length', 'Fwd.Packets.s', 'Bwd.Packets.s',
       'Min.Packet.Length', 'Max.Packet.Length', 'Packet.Length.Mean',
  

### Explore RST Flag data distribution

https://en.wikipedia.org/wiki/TCP_reset_attack 

https://www.extrahop.com/blog/tcp-resets-rst-prevent-command-and-control-dos-attacks

https://www.rfc-editor.org/info/bcp60 (BCP 60, "Inappropriate TCP Resets Considered Harmful")

In [57]:
network_traffic_analysis_dataframe["RST.Flag.Count"].value_counts()

RST.Flag.Count
0    3574915
1       2381
Name: count, dtype: int64

In [58]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["RST.Flag.Count"] == 1]["Protocol"].value_counts()

Protocol
6    2381
Name: count, dtype: int64

In [59]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["RST.Flag.Count"] == 1]

Unnamed: 0,Flow.ID,Source.IP,Source.Port,Destination.IP,Destination.Port,Protocol,Timestamp,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets,...,Active.Min,Idle.Mean,Idle.Std,Idle.Max,Idle.Min,Label,L7Protocol,ProtocolName,Date,Time
1900,192.168.32.3-10.200.7.8-50687-3128-6,192.168.32.3,50687,10.200.7.8,3128,6,26/04/201711:11:28,118867,8,14,...,0.0,0.000000e+00,0.000000e+00,0.0,0.0,BENIGN,131,HTTP_PROXY,26/04/2017,11:11:28
1943,192.168.32.3-10.200.7.8-50688-3128-6,192.168.32.3,50688,10.200.7.8,3128,6,26/04/201711:11:28,194774,8,15,...,0.0,0.000000e+00,0.000000e+00,0.0,0.0,BENIGN,131,HTTP_PROXY,26/04/2017,11:11:28
2356,192.168.32.3-10.200.7.8-50699-3128-6,192.168.32.3,50699,10.200.7.8,3128,6,26/04/201711:11:29,445551,10,20,...,0.0,0.000000e+00,0.000000e+00,0.0,0.0,BENIGN,131,HTTP_PROXY,26/04/2017,11:11:29
2858,192.168.32.3-10.200.7.8-50704-3128-6,192.168.32.3,50704,10.200.7.8,3128,6,26/04/201711:11:31,245917,8,14,...,0.0,0.000000e+00,0.000000e+00,0.0,0.0,BENIGN,131,HTTP_PROXY,26/04/2017,11:11:31
3579,192.168.32.3-10.200.7.8-50703-3128-6,192.168.32.3,50703,10.200.7.8,3128,6,26/04/201711:11:31,3466473,16,33,...,0.0,0.000000e+00,0.000000e+00,0.0,0.0,BENIGN,130,HTTP_CONNECT,26/04/2017,11:11:31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3549053,192.168.32.93-10.200.7.9-51666-3128-6,192.168.32.93,51666,10.200.7.9,3128,6,15/05/201705:21:22,431352,14,13,...,0.0,0.000000e+00,0.000000e+00,0.0,0.0,BENIGN,140,APPLE,15/05/2017,05:21:22
3549054,192.168.32.93-10.200.7.8-51642-3128-6,192.168.32.93,51642,10.200.7.8,3128,6,15/05/201705:19:19,90338973,29,30,...,80.0,4.514784e+07,6.746577e+04,45195545.0,45100134.0,BENIGN,126,GOOGLE,15/05/2017,05:19:19
3549061,192.168.32.93-10.200.7.8-51645-3128-6,192.168.32.93,51645,10.200.7.8,3128,6,15/05/201705:19:24,90812319,42,29,...,175.0,4.531954e+07,3.367999e+05,45557693.0,45081386.0,BENIGN,126,GOOGLE,15/05/2017,05:19:24
3549080,192.168.32.93-10.200.7.8-51665-3128-6,192.168.32.93,51665,10.200.7.8,3128,6,15/05/201705:21:11,315796,42,26,...,0.0,0.000000e+00,0.000000e+00,0.0,0.0,BENIGN,126,GOOGLE,15/05/2017,05:21:11


In [60]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["RST.Flag.Count"] == 0]["Protocol"].value_counts()

Protocol
6     3570594
17       2684
0        1637
Name: count, dtype: int64

### Explore the FIN Flag Count distribution

Based on the documentation of the dataset, the FIN flag is set once the TCP connection ends.

In [61]:
network_traffic_analysis_dataframe["FIN.Flag.Count"].value_counts()

FIN.Flag.Count
0    3552122
1      25174
Name: count, dtype: int64

In [62]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["FIN.Flag.Count"] == 0]["Protocol"].unique()

array([ 6, 17,  0])

According to the documentation of the tool that was used to generate this dataset [https://www.unb.ca/cic/research/applications.html#CICFlowMeter], TCP flows are usually terminated upon connection teardown by a FIN packet.

UDP flows are terminated by a flow timeout.

The number of flows without a FIN packet is unexpectedly high: 3,552,122 of the 3,577,296 flows have a FIN flag count of 0. Since only 4,321 flows in the dataset are not TCP, nearly all of these are TCP flows.

As mapped in a previous section of the notebook, Protocol 6 = TCP, 17 = UDP and 0 = other protocols.
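
To compare the flags side by side, the share of TCP flows with each flag set can be computed in one step. The sketch below assumes that every flag column only holds 0 or 1, as the `value_counts()` outputs in this section suggest, so the column mean equals the share of flows with the flag set. The sample values are made up.

```python
import pandas as pd

flag_columns = ["SYN.Flag.Count", "FIN.Flag.Count", "RST.Flag.Count"]

# Hypothetical miniature stand-in for the dataframe
example_dataframe = pd.DataFrame({
    "Protocol":       [6, 6, 6, 6, 17, 0],
    "SYN.Flag.Count": [1, 0, 0, 0, 0, 0],
    "FIN.Flag.Count": [1, 1, 0, 0, 0, 0],
    "RST.Flag.Count": [0, 0, 0, 0, 0, 0],
})

# Keep only the TCP flows (Protocol 6)
tcp_flows = example_dataframe[example_dataframe["Protocol"] == 6]

# For 0/1 columns the mean is the fraction of flows with the flag set
share_of_tcp_flows_with_flag = tcp_flows[flag_columns].mean()

print(share_of_tcp_flows_with_flag)
```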



### Explore the Flow Timeout value data

According to the ReadMe.txt of CICFlowMeter [https://github.com/CanadianInstituteForCybersecurity/CICFlowMeter/blob/master/ReadMe.txt], the Flow.Duration column is measured in microseconds.

In [63]:
network_traffic_analysis_dataframe["Flow.Duration"].describe()

count    3.577296e+06
mean     2.544247e+07
std      4.014430e+07
min      1.000000e+00
25%      6.280000e+02
50%      5.847295e+05
75%      4.500153e+07
max      1.200000e+08
Name: Flow.Duration, dtype: float64

In [64]:
def transform_microseconds_to_seconds(data: int) -> float:
    if data == 0:
        return 0.0
    
    return data / 1000000.0


In [65]:

network_traffic_analysis_dataframe["Flow.Duration"].apply(transform_microseconds_to_seconds)

0          0.045523
1          0.000001
2          0.000001
3          0.000217
4          0.078068
             ...   
3577291    2.290821
3577292    0.000024
3577293    2.591653
3577294    2.622421
3577295    2.009138
Name: Flow.Duration, Length: 3577296, dtype: float64
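
The same conversion can be written as a vectorized division, which avoids calling a Python function once per row. This is a sketch on made-up sample values. The last value, 120,000,000 microseconds, matches the maximum reported by `describe()` above and corresponds to 120 seconds.

```python
import pandas as pd

MICROSECONDS_PER_SECOND = 1_000_000

# Hypothetical sample of the Flow.Duration column, in microseconds
example_flow_duration = pd.Series([45523, 1, 120000000], name="Flow.Duration")

# Dividing the whole Series at once applies the conversion to every row
example_flow_duration_seconds = example_flow_duration / MICROSECONDS_PER_SECOND

print(example_flow_duration_seconds.tolist())
```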

### Exploring the TCP PSH Packet Flag distribution

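The TCP PSH flag asks the receiver to pass the received data to the application immediately instead of buffering it. It is commonly associated with real-time applications such as voice and video streaming, where a delay in data transmission can cause a poor user experience.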

In [66]:
network_traffic_analysis_dataframe["PSH.Flag.Count"].value_counts()

PSH.Flag.Count
0    2125554
1    1451742
Name: count, dtype: int64

In [67]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["PSH.Flag.Count"] == 0]["Protocol"].value_counts()

Protocol
6     2121233
17       2684
0        1637
Name: count, dtype: int64

In [68]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["PSH.Flag.Count"] == 1]["Protocol"].value_counts()

Protocol
6    1451742
Name: count, dtype: int64

In [69]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["PSH.Flag.Count"] == 1]["ProtocolName"].value_counts().head(20)

ProtocolName
GOOGLE            407360
HTTP_CONNECT      192516
SSL               184339
HTTP              173905
HTTP_PROXY        167665
YOUTUBE            95905
AMAZON             52442
MICROSOFT          36443
WINDOWS_UPDATE     23998
GMAIL              15260
FACEBOOK           14978
SKYPE              14957
YAHOO              13503
MSN                 9748
TWITTER             9572
CLOUDFLARE          7600
CONTENT_FLASH       7213
DROPBOX             5147
APPLE               4016
OFFICE_365          2514
Name: count, dtype: int64

https://orhanergun.net/understanding-tcp-psh-packet-flag

### Exploring the TCP ACK Flag distribution

In [70]:
network_traffic_analysis_dataframe["ACK.Flag.Count"].value_counts()

ACK.Flag.Count
1    2144841
0    1432455
Name: count, dtype: int64

In [71]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["ACK.Flag.Count"] == 0]["Protocol"].value_counts()

Protocol
6     1428134
17       2684
0        1637
Name: count, dtype: int64

In [72]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["ACK.Flag.Count"] == 1]["Protocol"].value_counts()

Protocol
6    2144841
Name: count, dtype: int64

### Exploring the TCP URG flag packet distribution

According to this blog post about TCP PSH and URG [https://orhanergun.net/tcp-psh-vs-urg-whats-the-difference], the URG flag in TCP indicates that the Urgent Pointer field in the packet is valid. The URG flag highlights the portion of the data that requires the receiver's immediate attention.

The receiver will prioritise processing the urgent data before other data.

A typical use case of the TCP URG flag is data containing control signals or error messages.

In [73]:
network_traffic_analysis_dataframe["URG.Flag.Count"].value_counts()

URG.Flag.Count
0    2585009
1     992287
Name: count, dtype: int64

In [74]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["URG.Flag.Count"] == 0]["Protocol"].value_counts()

Protocol
6     2580688
17       2684
0        1637
Name: count, dtype: int64

In [75]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["URG.Flag.Count"] == 1]["Protocol"].value_counts()

Protocol
6    992287
Name: count, dtype: int64

### Exploring the CWE Flag distribution

https://kb.clavister.com/317180249/explicit-congestion-notification---ecn-ece-cwe-ns-ect-ce 

https://www.catchpoint.com/blog/ece-cwr-tcp

In [76]:
network_traffic_analysis_dataframe["CWE.Flag.Count"].value_counts()

CWE.Flag.Count
0    3577296
Name: count, dtype: int64

### Exploring the ECE flag distribution

In [77]:
network_traffic_analysis_dataframe["ECE.Flag.Count"].value_counts()

ECE.Flag.Count
0    3574947
1       2349
Name: count, dtype: int64

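ECN (Explicit Congestion Notification) is a mechanism in TCP/IP that allows routers to signal that they are close to being overloaded by marking packets instead of dropping them.

ECE (ECN-Echo) is the TCP flag that the receiver sets to tell the sender that it received a packet marked as having experienced congestion.

CWR (Congestion Window Reduced) is the TCP flag that the sender sets to acknowledge the ECE and to signal that it has reduced its congestion window. The CWE.Flag.Count column explored above presumably refers to this flag.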

In [78]:
network_traffic_analysis_dataframe.columns

Index(['Flow.ID', 'Source.IP', 'Source.Port', 'Destination.IP',
       'Destination.Port', 'Protocol', 'Timestamp', 'Flow.Duration',
       'Total.Fwd.Packets', 'Total.Backward.Packets',
       'Total.Length.of.Fwd.Packets', 'Total.Length.of.Bwd.Packets',
       'Fwd.Packet.Length.Max', 'Fwd.Packet.Length.Min',
       'Fwd.Packet.Length.Mean', 'Fwd.Packet.Length.Std',
       'Bwd.Packet.Length.Max', 'Bwd.Packet.Length.Min',
       'Bwd.Packet.Length.Mean', 'Bwd.Packet.Length.Std', 'Flow.Bytes.s',
       'Flow.Packets.s', 'Flow.IAT.Mean', 'Flow.IAT.Std', 'Flow.IAT.Max',
       'Flow.IAT.Min', 'Fwd.IAT.Total', 'Fwd.IAT.Mean', 'Fwd.IAT.Std',
       'Fwd.IAT.Max', 'Fwd.IAT.Min', 'Bwd.IAT.Total', 'Bwd.IAT.Mean',
       'Bwd.IAT.Std', 'Bwd.IAT.Max', 'Bwd.IAT.Min', 'Fwd.PSH.Flags',
       'Bwd.PSH.Flags', 'Fwd.URG.Flags', 'Bwd.URG.Flags', 'Fwd.Header.Length',
       'Bwd.Header.Length', 'Fwd.Packets.s', 'Bwd.Packets.s',
       'Min.Packet.Length', 'Max.Packet.Length', 'Packet.Length.Mean',
  

### Exploring Down Up Ratio distribution

In [79]:
network_traffic_analysis_dataframe["Down.Up.Ratio"].value_counts()

Down.Up.Ratio
0      1573265
1      1410146
2       305292
3       111856
4        72685
5        61585
6        25359
7         8727
8         3471
11        1618
9         1599
10         797
12         419
16          94
13          93
14          78
15          66
17          28
19          21
20          17
18          13
21          12
26           6
22           6
24           5
23           5
29           4
25           4
35           3
40           2
30           2
31           2
62           1
57           1
27           1
95           1
102          1
38           1
106          1
61           1
43           1
39           1
293          1
194          1
33           1
221          1
36           1
32           1
Name: count, dtype: int64

To conclude the EDA, there are too many variables to reduce the dimensionality effectively by hand and to determine which conditions would likely be classified as malicious or benign.

Therefore, deep learning with a Stacked Denoising Autoencoder could help to learn the key features in the dataset.

# Data Columns Removal to prepare for deep learning training

### Create a deep copy of the dataframe containing the network traffic data

In [124]:
network_traffic_analysis_dataframe_deep_learning: pandas.DataFrame = network_traffic_analysis_dataframe.copy(deep=True)

### Remove Label

In [125]:
network_traffic_analysis_dataframe_deep_learning.drop(labels="Label", axis=1, inplace=True)

### Remove Timestamp

In [126]:
network_traffic_analysis_dataframe_deep_learning.drop(labels="Timestamp", axis=1, inplace=True)

### Remove date and time

In [127]:
network_traffic_analysis_dataframe_deep_learning.drop(labels="Date", axis=1, inplace=True)

In [128]:
network_traffic_analysis_dataframe_deep_learning.drop(labels="Time", axis=1,inplace=True)

### Remove ProtocolName

The protocol name is the application type related to the data record. It is the text form of the L7Protocol column, which is kept as the target.

In [129]:
network_traffic_analysis_dataframe_deep_learning.drop(labels="ProtocolName", axis=1, inplace=True)

### Remove Flow.ID column

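The Flow.ID column is an identifier for each flow, built from the source IP, destination IP, source port, destination port and protocol (for example 192.168.32.3-10.200.7.8-50687-3128-6). It carries no additional meaning for the model, therefore it should be removed.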

In [130]:
network_traffic_analysis_dataframe_deep_learning.drop(labels="Flow.ID", axis=1, inplace=True)

### Remove Source.IP and Destination.IP

In [131]:
network_traffic_analysis_dataframe_deep_learning.drop(labels="Source.IP", axis=1, inplace=True)

In [132]:
network_traffic_analysis_dataframe_deep_learning.drop(labels="Destination.IP", axis=1, inplace=True)
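
The individual `drop` calls above can also be expressed as one call that takes the list of column names. The sketch below runs on a hypothetical one-row dataframe; `columns_to_remove` is an illustrative name that does not appear elsewhere in the notebook.

```python
import pandas as pd

columns_to_remove = [
    "Label", "Timestamp", "Date", "Time",
    "ProtocolName", "Flow.ID", "Source.IP", "Destination.IP",
]

# Hypothetical one-row stand-in: text columns to remove plus two numeric columns to keep
example_data = {column: ["x"] for column in columns_to_remove}
example_data["Flow.Duration"] = [45523]
example_data["L7Protocol"] = [131]
example_dataframe = pd.DataFrame(example_data)

# drop(columns=...) returns a new dataframe and leaves the original untouched
reduced = example_dataframe.drop(columns=columns_to_remove)

# After the removal no text columns should remain
remaining_object_columns = reduced.select_dtypes("object").columns.tolist()

print(list(reduced.columns))
print(remaining_object_columns)
```

Returning a new dataframe instead of using `inplace=True` also makes the separate deep copy unnecessary in this sketch.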

In [133]:
network_traffic_analysis_dataframe_deep_learning.select_dtypes(["object"])

0
1
2
3
4
...
3577291
3577292
3577293
3577294
3577295


In [134]:
# What are the distinct Source Ports in this dataset?

network_traffic_analysis_dataframe_deep_learning["Source.Port"].unique()

array([52422,  3128,    80, ...,  6507, 10192, 10182])
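
`unique()` lists the distinct ports but does not rank them. To see the most frequent source ports, `value_counts()` followed by `head()` could be used instead. The sketch below shows this on made-up sample values; on the real column, `head(20)` would give the top 20.

```python
import pandas as pd

# Hypothetical sample of the Source.Port column
example_source_ports = pd.Series([3128, 80, 3128, 52422, 3128, 80], name="Source.Port")

# value_counts() sorts by frequency in descending order, head(2) keeps the top 2
top_source_ports = example_source_ports.value_counts().head(2)

print(top_source_ports)
```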

# Prepare the Train, Validation and Test dataset

### Create Train Dataset and Test Dataset

The dataset is first split into 80% train-validation and 20% test.

In [138]:
train_validation_dataset, test_dataset = train_test_split(network_traffic_analysis_dataframe_deep_learning, train_size = 0.8, test_size = 0.2, random_state=1, shuffle=True)

Dataset distribution after both splits:

- Train dataset: 60% (0.8 * 0.75)
- Test dataset: 20% 
- Validation dataset: 20% (0.8 * 0.25) 

In [139]:
train_dataset, validation_dataset = train_test_split(train_validation_dataset, train_size=0.75, test_size=0.25, random_state=1, shuffle=True)

In [140]:
train_dataset_features = train_dataset.drop(labels="L7Protocol", axis=1)
train_dataset_target = train_dataset["L7Protocol"]

In [141]:
validation_dataset_features = validation_dataset.drop(labels="L7Protocol", axis=1)
validation_dataset_target = validation_dataset["L7Protocol"]

In [142]:
test_dataset_features = test_dataset.drop(labels="L7Protocol", axis=1)
test_dataset_target = test_dataset["L7Protocol"]
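
The split is random and not stratified, and the application classes are heavily imbalanced (for example 959,110 GOOGLE flows against 5,941 OFFICE_365 flows). A rare class could therefore end up in the validation or test set without appearing in the training set. The sketch below shows one way to check for this on hypothetical target values. `classes_missing_from_train` is an illustrative name.

```python
import pandas as pd

# Hypothetical L7Protocol targets of a training and a validation split
example_train_target = pd.Series([131, 7, 126, 7], name="L7Protocol")
example_validation_target = pd.Series([7, 222], name="L7Protocol")

# Classes that the model would be evaluated on without ever having seen them
classes_missing_from_train = sorted(
    set(example_validation_target.tolist()) - set(example_train_target.tolist())
)

print(classes_missing_from_train)
```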

### Perform min-max normalization on all datasets

In [143]:
def min_max_normalization(input_data: float, minimum_value: float, maximum_value: float) -> float:
    
    # Function output range: [0, 1]

    if (maximum_value - minimum_value) == 0:
        return 0.0

    result_min_max_value = (input_data - minimum_value) / (maximum_value - minimum_value)
    
    return result_min_max_value

In [144]:
def apply_min_max_normalization_in_the_dataframe(dataframe: pandas.DataFrame) -> pandas.DataFrame:

    transformed_min_max_normalization_data: dict = {}

    for data_column in dataframe.columns:

        # The minimum and maximum are taken from the dataframe that is passed in
        data_column_minimum_value: float = dataframe[data_column].min()
        data_column_maximum_value: float = dataframe[data_column].max()

        transformed_data_column: pandas.Series = dataframe[data_column].apply(lambda data: min_max_normalization(data, minimum_value=data_column_minimum_value, maximum_value=data_column_maximum_value))

        transformed_min_max_normalization_data[data_column] = transformed_data_column

    
    transformed_dataframe: pandas.DataFrame = pandas.DataFrame(data = transformed_min_max_normalization_data)
    
    # The normalized dataframe must keep exactly the columns of the input dataframe,
    # so the check no longer depends on the global validation_dataset
    if not transformed_dataframe.columns.equals(dataframe.columns):
        raise ValueError("There is a mismatch in the new dataframe.")

    return transformed_dataframe

    


In [145]:
train_dataset_normalized = apply_min_max_normalization_in_the_dataframe(train_dataset_features)
validation_dataset_normalized = apply_min_max_normalization_in_the_dataframe(validation_dataset_features)
test_dataset_normalized = apply_min_max_normalization_in_the_dataframe(test_dataset_features)
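
Note that each dataset is normalized with its own minimum and maximum here, so the same raw value can map to different normalized values in the train, validation and test sets. A common alternative is to compute the statistics on the training set only and reuse them for the other sets. The sketch below illustrates that alternative under stated assumptions; `fit_min_max` and `apply_min_max` are hypothetical helpers, not functions from this notebook.

```python
import pandas as pd


def fit_min_max(train_features: pd.DataFrame) -> tuple[pd.Series, pd.Series]:
    # Column-wise minimum and maximum, learned from the training set only
    return train_features.min(), train_features.max()


def apply_min_max(features: pd.DataFrame, minimum: pd.Series, maximum: pd.Series) -> pd.DataFrame:
    value_range = maximum - minimum
    # Columns that are constant in the training set have a range of 0;
    # dividing by 1 instead avoids a division by zero
    safe_range = value_range.replace(0, 1)
    return (features - minimum) / safe_range


# Hypothetical miniature train and test feature sets
example_train = pd.DataFrame({"a": [0, 10], "b": [5, 5]})
example_test = pd.DataFrame({"a": [5, 20], "b": [5, 5]})

train_minimum, train_maximum = fit_min_max(example_train)

scaled_train = apply_min_max(example_train, train_minimum, train_maximum)
scaled_test = apply_min_max(example_test, train_minimum, train_maximum)

print(scaled_train)
print(scaled_test)
```

With this approach, values in the validation or test set can fall outside [0, 1] when they exceed the range seen during training, as the value 20 does in the sample above.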

In [146]:
train_dataset_normalized

Unnamed: 0,Source.Port,Destination.Port,Protocol,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets,Total.Length.of.Fwd.Packets,Total.Length.of.Bwd.Packets,Fwd.Packet.Length.Max,Fwd.Packet.Length.Min,...,act_data_pkt_fwd,min_seg_size_forward,Active.Mean,Active.Std,Active.Max,Active.Min,Idle.Mean,Idle.Std,Idle.Max,Idle.Min
2829033,0.807840,0.047731,0.352941,0.000004,0.000009,0.000005,4.424625e-08,1.368621e-08,0.000183,0.000691,...,0.000012,0.416667,0.000000e+00,0.000000,0.000000e+00,0.000000e+00,0.000000,0.000000,0.000000,0.000000
1799074,0.047731,0.792886,0.352941,0.054740,0.000015,0.000020,9.797594e-06,1.031028e-06,0.105263,0.000000,...,0.000021,0.416667,0.000000e+00,0.000000,0.000000e+00,0.000000e+00,0.000000,0.000000,0.000000,0.000000
2412808,0.609790,0.001221,0.352941,0.003180,0.000011,0.000010,1.165151e-06,2.657405e-07,0.016661,0.000000,...,0.000006,0.666667,0.000000e+00,0.000000,0.000000e+00,0.000000e+00,0.000000,0.000000,0.000000,0.000000
1408912,0.755272,0.006760,0.352941,0.004185,0.000011,0.000010,8.259300e-07,1.573914e-07,0.015747,0.000000,...,0.000006,0.666667,0.000000e+00,0.000000,0.000000e+00,0.000000e+00,0.000000,0.000000,0.000000,0.000000
1548715,0.556795,0.006760,0.352941,0.078803,0.000026,0.000025,3.218177e-06,6.466733e-07,0.042550,0.000000,...,0.000015,0.666667,0.000000e+00,0.000000,0.000000e+00,0.000000e+00,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2446753,0.614658,0.006760,0.352941,0.000006,0.000000,0.000005,0.000000e+00,0.000000e+00,0.000000,0.000000,...,0.000000,0.666667,0.000000e+00,0.000000,0.000000e+00,0.000000e+00,0.000000,0.000000,0.000000,0.000000
878231,0.630238,0.001221,0.352941,0.000100,0.000004,0.000007,2.477790e-07,0.000000e+00,0.005117,0.000000,...,0.000003,0.666667,0.000000e+00,0.000000,0.000000e+00,0.000000e+00,0.000000,0.000000,0.000000,0.000000
106371,0.047731,0.883206,0.352941,0.867055,0.000088,0.000017,2.598730e-06,3.615440e-07,0.001827,0.000691,...,0.000122,0.416667,1.170951e-03,0.002309,3.353101e-03,4.381446e-04,0.215538,0.023618,0.235963,0.200993
3397935,0.047731,0.790811,0.352941,0.344832,0.000042,0.000010,4.309290e-05,1.368621e-08,0.088938,0.000691,...,0.000058,0.416667,1.569379e-07,0.000000,1.569379e-07,1.569379e-07,0.344762,0.000000,0.344762,0.344762


# Determine the label on this dataset using Stacked Denoising Autoencoder

Despite the data exploration, I am still not sure which labels to assign based on the criteria or conditions in the dataset.

I have found the research paper "Session-Based Network Intrusion Detection Using a Deep Learning Architecture" by Yang Yu, Jun Long and Zhiping Cai (https://www.researchgate.net/publication/319660558_Session-Based_Network_Intrusion_Detection_Using_a_Deep_Learning_Architecture).

According to this paper, the proposed deep learning architecture was able to learn the essential features from raw network packets, distinguish normal from malicious network traffic, and achieve high detection accuracy.

In this section, I will use PyTorch with a GPU to train the model and infer the features and labels.


## Import Packages

In [147]:
import torch

torch.cuda.is_available()


True

In [148]:
import os
import numpy as np
import time
from tempfile import TemporaryDirectory
import torch.version
import pandas as pd




## Preparing dataset and data loader for deep learning model training

### Convert datasets into Tensor 

In [149]:
train_dataset_normalized_tensor: torch.Tensor = torch.tensor(train_dataset_normalized.to_numpy(), dtype=torch.float32)
validation_dataset_normalized_tensor: torch.Tensor = torch.tensor(validation_dataset_normalized.to_numpy(), dtype=torch.float32)
test_dataset_normalized_tensor: torch.Tensor = torch.tensor(test_dataset_normalized.to_numpy(), dtype=torch.float32)


#### Convert target dataset into Tensor

In [150]:
train_dataset_target_tensor: torch.Tensor = torch.tensor(train_dataset_target.to_numpy(), dtype=torch.int64)
validation_dataset_target_tensor: torch.Tensor = torch.tensor(validation_dataset_target.to_numpy(), dtype=torch.int64)
test_dataset_target_tensor: torch.Tensor = torch.tensor(test_dataset_target.to_numpy(), dtype=torch.int64)
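
The L7Protocol codes are not contiguous (the mapping above contains values such as 5, 7, 131 and 222). If these targets are later used with a loss that expects class indices from 0 to C-1, for example `torch.nn.CrossEntropyLoss`, the codes would first need to be re-indexed. The sketch below shows one way to do this with NumPy on made-up codes; `code_to_class_index` is a hypothetical name.

```python
import numpy as np

# Hypothetical sample of L7Protocol codes
example_l7_protocol_codes = np.array([131, 7, 131, 126, 222, 7])

# np.unique returns the sorted distinct codes and, for each element,
# its position in that sorted array
distinct_codes, class_indices = np.unique(example_l7_protocol_codes, return_inverse=True)

# Lookup table that can be reused to encode the validation and test targets
code_to_class_index = {int(code): index for index, code in enumerate(distinct_codes)}

print(distinct_codes.tolist())
print(class_indices.tolist())
print(code_to_class_index)
```

The lookup table should be built once and applied to all three splits, so that the same code always maps to the same class index.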


### Define TensorDataset

Reference

https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset


In [151]:
train_normalized_tensor_dataset: torch.utils.data.TensorDataset = torch.utils.data.TensorDataset(train_dataset_normalized_tensor, train_dataset_target_tensor)
validation_normalized_tensor_dataset: torch.utils.data.TensorDataset = torch.utils.data.TensorDataset(validation_dataset_normalized_tensor, validation_dataset_target_tensor)
test_normalized_tensor_dataset: torch.utils.data.TensorDataset = torch.utils.data.TensorDataset(test_dataset_normalized_tensor, test_dataset_target_tensor)


### Define Dataset size dictionary

In [152]:
dataset_sizes: dict[str, int] = {
    "train": len(train_normalized_tensor_dataset),
    "test": len(test_normalized_tensor_dataset),
    "validation": len(validation_normalized_tensor_dataset)
}

### Define DataLoader

Reference

https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader

In [153]:
train_normalized_tensor_dataloader: torch.utils.data.DataLoader = torch.utils.data.DataLoader(dataset=train_normalized_tensor_dataset, batch_size=64, pin_memory=True)
validation_normalized_tensor_dataloader: torch.utils.data.DataLoader = torch.utils.data.DataLoader(dataset=validation_normalized_tensor_dataset, batch_size=64, pin_memory=True)
test_normalized_tensor_dataloader: torch.utils.data.DataLoader = torch.utils.data.DataLoader(dataset=test_normalized_tensor_dataset, batch_size=64, pin_memory=True)

### Define DataLoader dictionary

In [154]:
dataset_loader_dictionary: dict[str, torch.utils.data.DataLoader] = {
    "train": train_normalized_tensor_dataloader,
    "validation": validation_normalized_tensor_dataloader,
    "test": test_normalized_tensor_dataloader
}

## Autoencoder 

Reference for Stacked Denoising Autoencoder: https://towardsdatascience.com/stacked-autoencoders-f0a4391ae282

Reference research paper used: Session-Based Network Intrusion Detection Using a Deep Learning Architecture

Reference for building autoencoders: https://blog.keras.io/building-autoencoders-in-keras.html

Purpose of using a Stacked Denoising Autoencoder:

The dataset is non-linear, so it will be difficult for PCA and K-means to determine which features are relevant and important.

Therefore, a Stacked Denoising Autoencoder is applied to this dataset, and the model can be used to determine the target variables for this dataset (semi-supervised learning).

### Define the autoencoder

Input size of 1 record = 1 x 81

81 -> 40 -> 20 -> 9 -> 20 -> 40 -> 81

For each record, there will be 1 row of record with 81 columns
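
To make the "denoising" idea concrete, the sketch below shows a single hypothetical denoising autoencoder layer. It is not the `AutoEncoder` class defined in this notebook. It assumes Gaussian noise as the corruption and mean squared error as the reconstruction loss. The input is corrupted during training, and the loss compares the reconstruction with the clean input.

```python
import torch


class DenoisingAutoEncoderSketch(torch.nn.Module):
    """Hypothetical single denoising autoencoder, for illustration only."""

    def __init__(self, input_dimension: int, hidden_dimension: int, noise_std: float = 0.1):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(input_dimension, hidden_dimension),
            torch.nn.Sigmoid(),
        )
        # The final Sigmoid keeps the output in [0, 1], like min-max normalized inputs
        self.decoder = torch.nn.Sequential(
            torch.nn.Linear(hidden_dimension, input_dimension),
            torch.nn.Sigmoid(),
        )

    def forward(self, clean_input: torch.Tensor) -> torch.Tensor:
        # Corrupt the input only while training; evaluation uses the clean input
        if self.training:
            noisy_input = clean_input + self.noise_std * torch.randn_like(clean_input)
        else:
            noisy_input = clean_input
        return self.decoder(self.encoder(noisy_input))


torch.manual_seed(0)

# Hypothetical batch of 4 records with 81 normalized features each
example_batch = torch.rand(4, 81)

model = DenoisingAutoEncoderSketch(input_dimension=81, hidden_dimension=40)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

# One training step: reconstruct the clean input from its corrupted version
model.train()
reconstruction = model(example_batch)
loss = torch.nn.functional.mse_loss(reconstruction, example_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(reconstruction.shape, loss.item())
```

In a stacked setup, several such layers would typically be trained one after another, each on the encoded output of the previous one.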

In [155]:
class AutoEncoder(torch.nn.Module):

    def __init__(self, x_dimension: int, h_dimension_1: int, h_dimension_2: int, h_dimension_3: int):
        super().__init__()

        # According to the documentation for torch.nn.Linear, https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear
        # the layer parameters use the default dtype (torch.float32) when no dtype is specified

        # The input tensors are therefore expected to be torch.float32 as well

        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(x_dimension, h_dimension_1), # input -> first hidden layer
            torch.nn.Linear(h_dimension_1, h_dimension_2), # first -> second hidden layer
            torch.nn.Linear(h_dimension_2, h_dimension_3), # second hidden layer -> latent space
            torch.nn.Sigmoid()
        )

        self.decoder = torch.nn.Sequential(
            torch.nn.Sigmoid(),
            torch.nn.Linear(h_dimension_3, h_dimension_2), # latent space -> second hidden layer
            torch.nn.Linear(h_dimension_2, h_dimension_1), # second -> first hidden layer
            torch.nn.Linear(h_dimension_1, x_dimension) # first hidden layer -> reconstructed input
        )

    def forward(self, data):
        data = self.encoder(data)
        data = self.decoder(data)

        return data



In [156]:
class StackedAutoEncoder(torch.nn.Module):

    autoencoder_name_list: list = []
    autoencoder_dictionary: dict = {}

    def __init__(self, first_layer_dimension: int, second_layer_dimension: int, third_layer_dimension: int, latent_space_dimension: int):
        super().__init__()

        self.stacked_autoencoder = torch.nn.Sequential(
            AutoEncoder(first_layer_dimension, second_layer_dimension, third_layer_dimension, latent_space_dimension),
            AutoEncoder(first_layer_dimension, second_layer_dimension, third_layer_dimension, latent_space_dimension),
            AutoEncoder(first_layer_dimension, second_layer_dimension, third_layer_dimension, latent_space_dimension),
        )
        
        

    def train_model(self, dataset_loader_dictionary: dict[str, torch.utils.data.DataLoader], model, optimizer, scheduler, number_of_epoch: int = 25 ):

        start_time = time.time()

        with TemporaryDirectory() as tempdir:
            best_model_params_path = os.path.join(tempdir, 'best_model_params.pt')

            torch.save(model.state_dict(), best_model_params_path)

            for epoch in range(number_of_epoch):

                print(f"Current Epoch {epoch} / {number_of_epoch - 1}")
                #print("-" * 10)

                for phase in ['train']:

                    if phase == 'train':
                        model.train()

                    running_loss: float = 0.0

                    for inputs, labels in dataset_loader_dictionary[phase]:

                        inputs = inputs.to(device)

                        optimizer.zero_grad()

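                        with torch.set_grad_enabled(phase == 'train'):
                            outputs = model(inputs)

                            # Assumption: mean-squared reconstruction error against the input.
                            # Without a loss and a backward pass, optimizer.step() has no
                            # gradients to apply and the weights never change.
                            loss = torch.nn.functional.mse_loss(outputs, inputs)

                            if phase == 'train':
                                loss.backward()
                                optimizer.step()

                    if phase == 'train':
                        scheduler.step()
                        # Checkpoint the parameters reached at the end of this epoch.
                        torch.save(model.state_dict(), best_model_params_path)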

                print()

            time_elapsed = time.time() - start_time

            print(f"Training complete in {time_elapsed // 60:.0f} minutes {time_elapsed % 60:.0f} seconds")
            
            print("")

            model.load_state_dict(torch.load(best_model_params_path, weights_only=True))

        return model

    
    def _prepare_optimizer(self, autoencoder_parameters):
        #stackedAutoEncoder.autoencoder_decoder.parameters()
        optimizer: torch.optim.Adam = torch.optim.Adam(autoencoder_parameters, lr=0.005)

        return optimizer
    
    def _prepare_learning_scheduler(self, optimizer):
        # Reference: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html#torch.optim.lr_scheduler.StepLR        
        exp_lr_scheduler: torch.optim.lr_scheduler.StepLR = torch.optim.lr_scheduler.StepLR(optimizer, step_size = 7, gamma = 0.1)

        return exp_lr_scheduler

    def start_training_stacked_autoencoder_model(self, dataset_loader_dictionary: dict[str, torch.utils.data.DataLoader], device_to_process: torch.device, number_of_epoch: int = 25) -> torch.nn.Sequential:

        # Move the autoencoder to GPU
        self.stacked_autoencoder.to(device_to_process)

        # Define the adam optimizer on Encoder
        optimizer = self._prepare_optimizer(self.stacked_autoencoder.parameters())

        # Define the learning scheduler
        scheduler = self._prepare_learning_scheduler(optimizer)            

        trained_stacked_autoencoder_model = self.train_model(dataset_loader_dictionary, self.stacked_autoencoder, optimizer, scheduler, number_of_epoch)


        return trained_stacked_autoencoder_model
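
To make the effect of the `StepLR` settings above concrete, the following sketch computes the learning rate in effect during each of the 25 epochs. It reimplements the step-decay rule in plain Python rather than calling PyTorch; the function name `step_lr` is hypothetical, while the values 0.005, `step_size = 7` and `gamma = 0.1` are taken from the helper methods in the class.

```python
def step_lr(initial_lr: float, step_size: int, gamma: float, epoch: int) -> float:
    """Learning rate in effect during `epoch` under a step-decay schedule."""
    return initial_lr * gamma ** (epoch // step_size)

# Values taken from _prepare_optimizer and _prepare_learning_scheduler.
schedule = [step_lr(0.005, step_size=7, gamma=0.1, epoch=e) for e in range(25)]

# Print the rate at the first epoch and around each decay boundary.
for epoch in (0, 6, 7, 14, 21, 24):
    print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.1e}")
```

With these settings the rate drops by a factor of ten at epochs 7, 14 and 21, so the last four epochs run at one thousandth of the initial rate.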


    
    

## Replicating the steps to train SDA-based deep learning architecture

The Stacked Denoising Autoencoder (SDA) is used here to learn a compact, lower-dimensional representation of the flow features for predicting the target variable ("L7Protocol").

L7Protocol is the layer 7 (application-layer) protocol label of each flow.
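
What makes an autoencoder "denoising" is that the input is corrupted before encoding while the loss is still measured against the clean input. The sketch below shows one common corruption, masking noise, in plain Python. The function name `mask_features`, the 30% drop probability and the sample row are assumptions for illustration; the `forward` method shown earlier in this notebook does not apply any corruption itself.

```python
import random

def mask_features(row: list[float], drop_probability: float, rng: random.Random) -> list[float]:
    """Return a copy of `row` with each feature set to 0.0 with the given probability."""
    return [0.0 if rng.random() < drop_probability else value for value in row]

def squared_error(reconstruction: list[float], clean: list[float]) -> float:
    """Mean squared error between a reconstruction and the clean (uncorrupted) row."""
    return sum((r - c) ** 2 for r, c in zip(reconstruction, clean)) / len(clean)

rng = random.Random(0)                      # fixed seed so the sketch is repeatable
clean_row = [0.2, 0.9, 0.4, 0.7, 0.1]       # hypothetical normalized flow features
noisy_row = mask_features(clean_row, drop_probability=0.3, rng=rng)

# A denoising autoencoder would encode and decode `noisy_row`, then score the
# result against `clean_row`. Here the corrupted row stands in for the
# reconstruction just to show which two vectors enter the loss.
loss = squared_error(noisy_row, clean_row)
```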

#### Unsupervised Layer-wise Training

In [121]:
stackedAutoEncoder = StackedAutoEncoder(80, 40 , 20, 10)

##### Train the Stacked AutoEncoder using only Train Dataset

In [122]:
trained_stacked_autoencoder = stackedAutoEncoder.start_training_stacked_autoencoder_model(dataset_loader_dictionary, device)

Current Epoch 0 / 24

Current Epoch 1 / 24

Current Epoch 2 / 24

Current Epoch 3 / 24

Current Epoch 4 / 24

Current Epoch 5 / 24

Current Epoch 6 / 24

Current Epoch 7 / 24

Current Epoch 8 / 24

Current Epoch 9 / 24

Current Epoch 10 / 24

Current Epoch 11 / 24

Current Epoch 12 / 24

Current Epoch 13 / 24

Current Epoch 14 / 24

Current Epoch 15 / 24

Current Epoch 16 / 24

Current Epoch 17 / 24

Current Epoch 18 / 24

Current Epoch 19 / 24

Current Epoch 20 / 24

Current Epoch 21 / 24

Current Epoch 22 / 24

Current Epoch 23 / 24

Current Epoch 24 / 24

Training complete in 33 minutes 41 seconds



##### Save the trained model

In [123]:
torch.save(trained_stacked_autoencoder.state_dict(), STACKED_AUTOENCODER_MODEL_FILE_PATH)

#### Supervised Fine-Tuning

This stage starts from the parameters learned in the unsupervised training stage of the Stacked Auto Encoder model.

The Stacked Auto Encoder is first fine-tuned using the Validation dataset; a Logistic Regression model is then used for further calibration.

In [156]:
class SupervisedStageStackedAutoEncoder(torch.nn.Module):

    def __init__(self, first_layer_dimension: int, second_layer_dimension: int, third_layer_dimension: int, latent_space_dimension: int):
        super().__init__()

        self.stacked_autoencoder = torch.nn.Sequential(
            AutoEncoder(first_layer_dimension, second_layer_dimension, third_layer_dimension, latent_space_dimension),
            AutoEncoder(first_layer_dimension, second_layer_dimension, third_layer_dimension, latent_space_dimension),
            AutoEncoder(first_layer_dimension, second_layer_dimension, third_layer_dimension, latent_space_dimension),
        )

    
    def retrieve_parameters_from_unsupervised_stage_model(self, file_name: str):
        # Load from the path passed in rather than from the global constant.
        self.stacked_autoencoder.load_state_dict(torch.load(file_name, weights_only=True))
        
        
    def train_supervised_stacked_auto_encoder_model(self, dataset_loader_dictionary: dict[str, torch.utils.data.DataLoader], dataset_sizes: dict[str, int] , model, criterion, optimizer, scheduler, number_of_epoch: int = 25 ):

        start_time = time.time()

        with TemporaryDirectory() as tempdir:
            best_model_params_path = os.path.join(tempdir, 'best_supervised_stacked_autoencoder_model_params.pt')

            torch.save(model.state_dict(), best_model_params_path)

            best_accuracy: float = 0.0
            
            for epoch in range(number_of_epoch):

                print(f"Epoch {epoch} / {number_of_epoch - 1}")
                print("-" * 10)

                for phase in ['validation']:

                    if phase == 'validation':
                        model.train()

                    
                    running_loss: float = 0.0
                    running_corrects: int = 0

                    for inputs, labels in dataset_loader_dictionary[phase]:

                        inputs = inputs.to(device)
                        labels = labels.to(device)

                        optimizer.zero_grad()

                        with torch.set_grad_enabled(phase == 'validation'):
                            outputs = model(inputs)
                            _, predictions = torch.max(outputs, 1)

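                            # based on the research paper Section 2.2
                            # NOTE: CrossEntropyLoss expects input=class scores and target=labels.
                            # This call passes the raw features as input and the reconstruction
                            # as target, so the resulting loss is not bounded below.
                            loss = criterion.forward(input=inputs, target=outputs)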

                            if phase == 'validation':
                                loss.backward()
                                optimizer.step()

                        running_loss += loss.item() * inputs.size(0)
                        running_corrects += torch.sum(predictions == labels.data)

                    if phase == 'validation':
                        scheduler.step()

                    
                    epoch_loss = running_loss / dataset_sizes[phase]
                    epoch_accuracy = running_corrects.double() / dataset_sizes[phase]

                    print(f"{phase} Loss: {epoch_loss: .4f} Accuracy: {epoch_accuracy: .4f}")

                    if phase == 'validation' and epoch_accuracy > best_accuracy:
                        best_accuracy = epoch_accuracy
                        torch.save(model.state_dict(), best_model_params_path)
                    
                print()

            time_elapsed = time.time() - start_time

            print(f"Training complete in {time_elapsed // 60:.0f} minutes {time_elapsed % 60:.0f} seconds")
            print(f"Best Validation Accuracy: {best_accuracy:.4f}")
            print("")

            model.load_state_dict(torch.load(best_model_params_path, weights_only=True))

        
        return model


    def _prepare_criterion(self):
        criterion: torch.nn.CrossEntropyLoss = torch.nn.CrossEntropyLoss()
        return criterion
    
    def _prepare_optimizer(self, autoencoder_parameters):
        #stackedAutoEncoder.autoencoder_decoder.parameters()
        optimizer: torch.optim.Adam = torch.optim.Adam(autoencoder_parameters, lr=0.005)

        return optimizer
    
    def _prepare_learning_scheduler(self, optimizer):
        # Reference: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html#torch.optim.lr_scheduler.StepLR        
        exp_lr_scheduler: torch.optim.lr_scheduler.StepLR = torch.optim.lr_scheduler.StepLR(optimizer, step_size = 7, gamma = 0.1)

        return exp_lr_scheduler

    def start_training_stacked_autoencoder_model(self, dataset_loader_dictionary: dict[str, torch.utils.data.DataLoader], dataset_sizes: dict[str, int], device_to_process: torch.device, number_of_epoch: int = 25) -> torch.nn.Sequential:

        # Move the autoencoder to GPU
        self.stacked_autoencoder.to(device_to_process)

        # Define Criterion 
        criterion = self._prepare_criterion()

        # Define the adam optimizer on Encoder
        optimizer = self._prepare_optimizer(self.stacked_autoencoder.parameters())

        # Define the learning scheduler
        scheduler = self._prepare_learning_scheduler(optimizer)            

        trained_stacked_autoencoder_model = self.train_supervised_stacked_auto_encoder_model(dataset_loader_dictionary, dataset_sizes, model=self.stacked_autoencoder,criterion=criterion, optimizer=optimizer, scheduler=scheduler, number_of_epoch=number_of_epoch)


        return trained_stacked_autoencoder_model


    
    

##### Define the Supervised Tuning Stacked Auto Encoder model

In [157]:
supervised_stage_stacked_auto_encoder_model = SupervisedStageStackedAutoEncoder(80, 40, 20, 10)

##### Load parameters from previous model in Unsupervised Stage

In [158]:
supervised_stage_stacked_auto_encoder_model.retrieve_parameters_from_unsupervised_stage_model(STACKED_AUTOENCODER_MODEL_FILE_PATH)


##### Train the Stacked Autoencoder at Supervised Stage

In [161]:
trained_supervised_stage_stacked_auto_encoder_model = supervised_stage_stacked_auto_encoder_model.start_training_stacked_autoencoder_model(dataset_loader_dictionary, dataset_sizes,device)

Epoch 0 / 24
----------
validation Loss: -307076396038.4272 Accuracy:  0.0000

Epoch 1 / 24
----------
validation Loss: -3160592499610.3428 Accuracy:  0.0000

Epoch 2 / 24
----------
validation Loss: -11345278693320.3711 Accuracy:  0.0000

Epoch 3 / 24
----------
validation Loss: -27418308341207.9883 Accuracy:  0.0000

Epoch 4 / 24
----------
validation Loss: -53880082127673.7266 Accuracy:  0.0000

Epoch 5 / 24
----------
validation Loss: -93222402602054.1094 Accuracy:  0.0000

Epoch 6 / 24
----------
validation Loss: -147877445304823.1562 Accuracy:  0.0000

Epoch 7 / 24
----------
validation Loss: -184494819426407.5000 Accuracy:  0.0000

Epoch 8 / 24
----------
validation Loss: -191570838932076.0000 Accuracy:  0.0000

Epoch 9 / 24
----------
validation Loss: -198821444409270.1562 Accuracy:  0.0000

Epoch 10 / 24
----------
validation Loss: -206252720786116.0625 Accuracy:  0.0000

Epoch 11 / 24
----------
validation Loss: -213866888550233.8750 Accuracy:  0.0000

Epoch 12 / 24
---------
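
The negative, steadily growing loss and the zero accuracy logged above are consistent with how the criterion is called in the training cell: `criterion.forward(input=inputs, target=outputs)` passes the raw features as class scores and the 80-dimensional reconstruction as a floating-point target, and a target of that shape is treated as per-class weights that are not constrained to be probabilities. Cross-entropy against integer class labels, by contrast, can never be negative. The plain-Python sketch below shows that calculation for one sample; the logits and label are hypothetical, and a classifier trained this way would need one output per `L7Protocol` class rather than the 80 reconstruction outputs.

```python
import math

def cross_entropy(logits: list[float], label: int) -> float:
    """Negative log-probability that softmax(logits) assigns to class `label`."""
    largest = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[label]

def predict(logits: list[float]) -> int:
    """Index of the highest-scoring class, the role torch.max(outputs, 1) plays."""
    return max(range(len(logits)), key=lambda index: logits[index])

# Hypothetical scores for a 3-class problem; the true class is index 2.
logits = [1.0, -0.5, 3.0]
label = 2

loss = cross_entropy(logits, label)
prediction = predict(logits)
```

In PyTorch the corresponding call is `criterion(logits, labels)`, with `logits` of shape `(batch, number_of_classes)` and integer `labels` in the range `0` to `number_of_classes - 1`, so the `L7Protocol` codes would first need to be re-indexed.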

##### Save the autoencoder model in this stage

In [162]:
torch.save(trained_supervised_stage_stacked_auto_encoder_model.state_dict(), SUPERVISED_STACKED_AUTOENCODER_MODEL_FILE_PATH)

#### Retrieving the features from the Stacked Auto Encoder using Test Dataset

In [None]:
# Determine the features that matter using the autoencoder

In [67]:
class ClassificationTestSetStackedAutoEncoder(torch.nn.Module):

    def __init__(self, first_layer_dimension: int, second_layer_dimension: int, third_layer_dimension: int, latent_space_dimension: int):
        super().__init__()

        self.stacked_autoencoder = torch.nn.Sequential(
            AutoEncoder(first_layer_dimension, second_layer_dimension, third_layer_dimension, latent_space_dimension),
            AutoEncoder(first_layer_dimension, second_layer_dimension, third_layer_dimension, latent_space_dimension),
            AutoEncoder(first_layer_dimension, second_layer_dimension, third_layer_dimension, latent_space_dimension),
        )

    
    def retrieve_parameters_from_supervised_stage_model(self, file_name: str):
        # Load from the path passed in rather than from the global constant.
        self.stacked_autoencoder.load_state_dict(torch.load(file_name, weights_only=True))
        
    def train_supervised_stacked_auto_encoder_model(self, dataset_loader_dictionary: dict[str, torch.utils.data.DataLoader], dataset_sizes: dict[str, int] , model, criterion, optimizer, scheduler, number_of_epoch: int = 25 ):

        start_time = time.time()

        with TemporaryDirectory() as tempdir:
            best_model_params_path = os.path.join(tempdir, 'best_test_stacked_autoencoder_model_params.pt')

            torch.save(model.state_dict(), best_model_params_path)

            best_accuracy: float = 0.0
            
            for epoch in range(number_of_epoch):

                print(f"Epoch {epoch} / {number_of_epoch - 1}")
                print("-" * 10)

                for phase in ['test']:

                    if phase == 'test':
                        model.eval()

                    
                    running_loss: float = 0.0
                    running_corrects: int = 0

                    for inputs, labels in dataset_loader_dictionary[phase]:

                        inputs = inputs.to(device)
                        labels = labels.to(device)

                        optimizer.zero_grad()

                        with torch.set_grad_enabled(False):
                            outputs = model(inputs)
                            _, predictions = torch.max(outputs, 1)

                            # based on the research paper Section 2.2
                            loss = criterion.forward(input=inputs, target=outputs)

                        running_loss += loss.item() * inputs.size(0)
                        running_corrects += torch.sum(predictions == labels.data)
                    
                    epoch_loss = running_loss / dataset_sizes[phase]
                    epoch_accuracy = running_corrects.double() / dataset_sizes[phase]

                    print(f"{phase} Loss: {epoch_loss: .4f} Accuracy: {epoch_accuracy: .4f}")

                    # The only phase in this loop is 'test'; checking for 'validation' never matched.
                    if phase == 'test' and epoch_accuracy > best_accuracy:
                        best_accuracy = epoch_accuracy
                        torch.save(model.state_dict(), best_model_params_path)
                    
                print()

            time_elapsed = time.time() - start_time

            print(f"Training complete in {time_elapsed // 60:.0f} minutes {time_elapsed % 60:.0f} seconds")
            print(f"Best Test Accuracy: {best_accuracy:.4f}")
            print("")

            model.load_state_dict(torch.load(best_model_params_path, weights_only=True))

        
        return model


    def _prepare_criterion(self):
        criterion: torch.nn.CrossEntropyLoss = torch.nn.CrossEntropyLoss()
        return criterion
    
    def _prepare_optimizer(self, autoencoder_parameters):
        #stackedAutoEncoder.autoencoder_decoder.parameters()
        optimizer: torch.optim.Adam = torch.optim.Adam(autoencoder_parameters, lr=0.005)

        return optimizer
    
    def _prepare_learning_scheduler(self, optimizer):
        # Reference: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html#torch.optim.lr_scheduler.StepLR        
        exp_lr_scheduler: torch.optim.lr_scheduler.StepLR = torch.optim.lr_scheduler.StepLR(optimizer, step_size = 7, gamma = 0.1)

        return exp_lr_scheduler

    def start_training_stacked_autoencoder_model(self, dataset_loader_dictionary: dict[str, torch.utils.data.DataLoader], dataset_sizes: dict[str, int], device_to_process: torch.device, number_of_epoch: int = 25) -> torch.nn.Sequential:

        # Move the autoencoder to GPU
        self.stacked_autoencoder.to(device_to_process)

        # Define Criterion 
        criterion = self._prepare_criterion()

        # Define the adam optimizer on Encoder
        optimizer = self._prepare_optimizer(self.stacked_autoencoder.parameters())

        # Define the learning scheduler
        scheduler = self._prepare_learning_scheduler(optimizer)            

        trained_stacked_autoencoder_model = self.train_supervised_stacked_auto_encoder_model(dataset_loader_dictionary, dataset_sizes, model=self.stacked_autoencoder,criterion=criterion, optimizer=optimizer, scheduler=scheduler, number_of_epoch=number_of_epoch)


        return trained_stacked_autoencoder_model
    
    def retrieve_stacked_autoencoder(self):
        return self.stacked_autoencoder


    
    

In [174]:
classification_test_set_stacked_auto_encoder =  ClassificationTestSetStackedAutoEncoder(80, 40, 20, 10)

In [175]:
classification_test_set_stacked_auto_encoder.retrieve_parameters_from_supervised_stage_model(SUPERVISED_STACKED_AUTOENCODER_MODEL_FILE_PATH)

In [176]:
classification_test_set_stacked_auto_encoder.start_training_stacked_autoencoder_model(dataset_loader_dictionary, dataset_sizes,device)

Epoch 0 / 24
----------
test Loss: -9.8632 Accuracy:  0.0000

Epoch 1 / 24
----------
test Loss: -9.8632 Accuracy:  0.0000

Epoch 2 / 24
----------
test Loss: -9.8632 Accuracy:  0.0000

Epoch 3 / 24
----------
test Loss: -9.8632 Accuracy:  0.0000

Epoch 4 / 24
----------
test Loss: -9.8632 Accuracy:  0.0000

Epoch 5 / 24
----------
test Loss: -9.8632 Accuracy:  0.0000

Epoch 6 / 24
----------
test Loss: -9.8632 Accuracy:  0.0000

Epoch 7 / 24
----------
test Loss: -9.8632 Accuracy:  0.0000

Epoch 8 / 24
----------
test Loss: -9.8632 Accuracy:  0.0000

Epoch 9 / 24
----------
test Loss: -9.8632 Accuracy:  0.0000

Epoch 10 / 24
----------
test Loss: -9.8632 Accuracy:  0.0000

Epoch 11 / 24
----------
test Loss: -9.8632 Accuracy:  0.0000

Epoch 12 / 24
----------
test Loss: -9.8632 Accuracy:  0.0000

Epoch 13 / 24
----------
test Loss: -9.8632 Accuracy:  0.0000

Epoch 14 / 24
----------
test Loss: -9.8632 Accuracy:  0.0000

Epoch 15 / 24
----------
test Loss: -9.8632 Accuracy:  0.0000

Ep

Sequential(
  (0): AutoEncoder(
    (encoder): Sequential(
      (0): Linear(in_features=80, out_features=40, bias=True)
      (1): Linear(in_features=40, out_features=20, bias=True)
      (2): Linear(in_features=20, out_features=10, bias=True)
      (3): Sigmoid()
    )
    (decoder): Sequential(
      (0): Sigmoid()
      (1): Linear(in_features=10, out_features=20, bias=True)
      (2): Linear(in_features=20, out_features=40, bias=True)
      (3): Linear(in_features=40, out_features=80, bias=True)
    )
  )
  (1): AutoEncoder(
    (encoder): Sequential(
      (0): Linear(in_features=80, out_features=40, bias=True)
      (1): Linear(in_features=40, out_features=20, bias=True)
      (2): Linear(in_features=20, out_features=10, bias=True)
      (3): Sigmoid()
    )
    (decoder): Sequential(
      (0): Sigmoid()
      (1): Linear(in_features=10, out_features=20, bias=True)
      (2): Linear(in_features=20, out_features=40, bias=True)
      (3): Linear(in_features=40, out_features=80, b

In [None]:
trained_model = classification_test_set_stacked_auto_encoder.retrieve_stacked_autoencoder()

##### Save the model

In [193]:
torch.save(trained_model.state_dict(), CLASSIFICATION_TEST_STACKED_AUTOENCODER_MODEL_FILE_PATH)

##### Post-processing to retrieve the reduced-dimension features

In [157]:
first_layer_dimension: int = 80
second_layer_dimension: int = 40
third_layer_dimension: int = 20
latent_space_dimension: int = 10

stacked_autoencoder = torch.nn.Sequential(
            AutoEncoder(first_layer_dimension, second_layer_dimension, third_layer_dimension, latent_space_dimension),
            AutoEncoder(first_layer_dimension, second_layer_dimension, third_layer_dimension, latent_space_dimension),
            AutoEncoder(first_layer_dimension, second_layer_dimension, third_layer_dimension, latent_space_dimension),
)

In [158]:
stacked_autoencoder.load_state_dict(torch.load(CLASSIFICATION_TEST_STACKED_AUTOENCODER_MODEL_FILE_PATH, weights_only=True))

<All keys matched successfully>

In [161]:
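# Tensor.to() returns a new tensor instead of moving the original in place,
# so its result must be assigned. The model built above is still on the CPU,
# so the input is kept on the CPU as well.
test_dataset_normalized_tensor = test_dataset_normalized_tensor.to(device="cpu")

# Inference only: switch to eval mode and skip gradient tracking.
stacked_autoencoder.eval()

with torch.no_grad():
    sae_1_1 = stacked_autoencoder[0].encoder(test_dataset_normalized_tensor)
    sae_1_2 = stacked_autoencoder[0].decoder(sae_1_1)
    sae_2_1 = stacked_autoencoder[1].encoder(sae_1_2)
    sae_2_2 = stacked_autoencoder[1].decoder(sae_2_1)
    sae_3_1 = stacked_autoencoder[2].encoder(sae_2_2)
    sae_3_2 = stacked_autoencoder[2].decoder(sae_3_1)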


In [170]:
processed_test_dataset = pandas.DataFrame(sae_3_1.detach().numpy())

# Model Building

* Split the dataset into training and testing sets.
* Implement at least three different classification models (e.g., Decision Tree, Random Forest, SVM, etc.).
* Train and fine-tune each model using appropriate techniques.
* Discuss the choice of hyperparameters and the reasoning behind it.
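
Because the class frequencies in this dataset are highly uneven, a purely random split can leave rare classes out of one of the partitions. A minimal sketch of a stratified split in plain Python follows. The function name, the 80/20 ratio and the toy labels are assumptions for illustration; scikit-learn's `train_test_split` with its `stratify` argument is the usual tool in practice.

```python
import random
from collections import defaultdict

def stratified_split(labels: list[int], test_fraction: float, seed: int = 0) -> tuple[list[int], list[int]]:
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    indices_by_class: dict[int, list[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train_indices: list[int] = []
    test_indices: list[int] = []
    for class_indices in indices_by_class.values():
        rng.shuffle(class_indices)
        # Rounding down means very small classes stay entirely in the training set.
        test_count = int(len(class_indices) * test_fraction)
        test_indices.extend(class_indices[:test_count])
        train_indices.extend(class_indices[test_count:])
    return sorted(train_indices), sorted(test_indices)

# Toy labels: class 7 dominates, class 40 is less common, class 48 occurs once.
labels = [7] * 10 + [40] * 5 + [48]
train_indices, test_indices = stratified_split(labels, test_fraction=0.2)
```

Each class contributes roughly the same share of its samples to the test set, which keeps per-class metrics comparable between the two partitions.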

## Comparison of the dimension reduction technique

## Classification Model Training and Evaluation

### Decision Tree

In [83]:
from sklearn import tree

decision_tree_classifier = tree.DecisionTreeClassifier()

decision_tree_classifier.fit(X_train, Y_train)


In [84]:
y_predict_from_x_test = decision_tree_classifier.predict(X_test)

In [85]:
decision_tree_classifier.score(X_test, Y_test)

0.7235498839907193

In [86]:
from sklearn.metrics import classification_report

print(classification_report(Y_test, y_predict_from_x_test))

              precision    recall  f1-score   support

           1       0.86      1.00      0.92         6
           5       0.73      0.65      0.68       353
           7       0.82      0.82      0.82    136949
           9       1.00      1.00      1.00        28
          11       0.00      0.00      0.00         0
          13       0.00      0.00      0.00         3
          36       0.35      0.38      0.36        16
          37       0.60      1.00      0.75         3
          40       0.89      0.90      0.90      1719
          48       0.00      0.00      0.00         1
          51       1.00      0.33      0.50         3
          60       0.40      0.37      0.38       100
          64       0.23      0.28      0.25       170
          67       0.57      1.00      0.73         8
          68       0.46      0.47      0.46      2915
          69       0.00      0.00      0.00         3
          70       0.33      0.36      0.35      4162
          81       1.00    
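
The per-class columns in the report follow directly from counts of true positives, false positives and false negatives. The sketch below recomputes them in plain Python and shows why a class with no true samples in `Y_test` (support 0, like class 11 above) has an undefined recall, which scikit-learn reports as 0.00 by default. The helper name and the toy labels are hypothetical.

```python
def per_class_metrics(y_true: list[int], y_pred: list[int], label: int) -> dict[str, float]:
    """Precision, recall, F1 and support for one class; 0/0 is reported as 0.0."""
    true_positive = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    false_positive = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    false_negative = sum(t == label and p != label for t, p in zip(y_true, y_pred))

    def safe_divide(numerator: float, denominator: float) -> float:
        return numerator / denominator if denominator else 0.0

    precision = safe_divide(true_positive, true_positive + false_positive)
    recall = safe_divide(true_positive, true_positive + false_negative)
    f1 = safe_divide(2 * precision * recall, precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1,
            "support": true_positive + false_negative}

# Toy data: class 11 is predicted once but never occurs in the true labels.
y_true = [7, 7, 7, 40, 40]
y_pred = [7, 7, 11, 40, 7]

metrics_7 = per_class_metrics(y_true, y_pred, label=7)
metrics_11 = per_class_metrics(y_true, y_pred, label=11)
```

Passing `zero_division=0` to `classification_report` keeps the same 0.00 values while making the choice explicit.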



#TODO: Produce two types of model evaluation using test dataset and validation dataset

### SVM

### Random Forest

### Bagged Tree

In [None]:
# Classification Report

In [None]:
# ROC 

In [None]:
# AUC

### Logistic Regression

# Model Evaluation

* Evaluate the models using appropriate classification metrics (accuracy, precision, recall, F1-score, etc.).
* Visualize the model performance using ROC curves and confusion matrices.
* Compare the models and justify your choice of the best-performing model.
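
As a small illustration of the confusion matrix mentioned above, the sketch below counts (true, predicted) pairs in plain Python and derives accuracy from the diagonal. The function name and the toy labels are assumptions; `sklearn.metrics.confusion_matrix` provides the same table for real data.

```python
from collections import Counter

def confusion_matrix(y_true: list[int], y_pred: list[int]) -> tuple[list[int], list[list[int]]]:
    """Return the sorted class labels and a matrix with rows = true, columns = predicted."""
    classes = sorted(set(y_true) | set(y_pred))
    pair_counts = Counter(zip(y_true, y_pred))
    matrix = [[pair_counts[(t, p)] for p in classes] for t in classes]
    return classes, matrix

# Toy labels standing in for Y_test and the model predictions.
y_true = [7, 7, 7, 40, 40, 70]
y_pred = [7, 7, 40, 40, 7, 70]

classes, matrix = confusion_matrix(y_true, y_pred)

# Accuracy is the share of samples that fall on the diagonal.
correct = sum(matrix[i][i] for i in range(len(classes)))
accuracy = correct / len(y_true)
```

Off-diagonal cells show which pairs of applications a model confuses, which a single accuracy figure hides.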

# Conclusion

# Reference