# Network Traffic Classification using Machine Learning Techniques

# Overview

Develop classification models using Python programming to analyze a network-related dataset. 

The primary goal is to explore the dataset, preprocess it, create and evaluate different classification models, and report your findings. 

This assignment will enhance your understanding of machine learning techniques, data preprocessing, and model evaluation while applying them to a practical problem related to network security.

# Dataset

This is a real-world dataset built from network data collected at Universidad Del Cauca, Popayán, Colombia, over six days in 2017 (April 26, 27, 28 and May 9, 11 and 15) using multiple packet-capture and data-extraction tools.

The dataset consists of 3,577,296 instances and 87 features and was originally designed for application classification. Each row represents a traffic flow from a source to a destination, and each column represents a feature of that flow.

The dataset was downloaded from Kaggle: "IP Network Traffic Flows, Labeled with 75 Apps."

# Reference data


https://www.rfc-editor.org/

# Purpose

The purpose of this analysis is to predict, from the flow features in the dataset, which application each traffic flow belongs to.


# Literature Review

The Kaggle dataset is structured much like session data.

Chapter 7, "Session Data", of "The Tao of Network Security Monitoring" by Richard Bejtlich describes session data as a summary of a conversation between two parties.

The basic elements of session data are:
- Source IP
- Source Port
- Destination IP
- Destination Port
- Timestamp
- Measure of the amount of information exchanged during the session

Session data is used in this analysis because it can track intruder activity in a content-neutral way: it records who communicated with whom, when, and how much, without inspecting the payload.
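The elements listed above can be sketched as a small record type. This is only an illustration: the class name, field names, and byte counts below are hypothetical and are not columns or values from the dataset. The sketch sums the bytes exchanged per pair of hosts, which is the kind of content-neutral summary that session data allows.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionRecord:
    """One session summary; field names are illustrative, not dataset columns."""
    source_ip: str
    source_port: int
    destination_ip: str
    destination_port: int
    timestamp: str
    bytes_exchanged: int


# Hypothetical records: the byte counts are made up for the example.
records = [
    SessionRecord("172.19.1.46", 52422, "10.200.7.7", 3128, "26/04/201711:11:17", 1200),
    SessionRecord("10.200.7.7", 3128, "172.19.1.46", 52422, "26/04/201711:11:17", 300),
    SessionRecord("10.200.7.217", 38848, "50.31.185.39", 80, "26/04/201711:11:17", 50),
]

# Sum the bytes per unordered pair of hosts without looking at any payload.
bytes_per_conversation = defaultdict(int)
for record in records:
    hosts = frozenset({record.source_ip, record.destination_ip})
    bytes_per_conversation[hosts] += record.bytes_exchanged
```

Using an unordered pair as the key treats both directions of a conversation as one session, which is a design choice of this sketch rather than something the dataset prescribes.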



# Assumption Made

# Origin of CICFlowMeter

https://www.unb.ca/cic/research/applications.html#CICFlowMeter

https://www.kaggle.com/datasets/jsrojas/ip-network-traffic-flows-labeled-with-87-apps 

https://www.ntop.org/products/traffic-analysis/ntop/

# Environment Setup

In [1]:
%pip install --upgrade pip
%pip install setuptools -U
%pip install pandas -U
%pip install -U scikit-learn
%pip install kagglehub -U

%pip install matplotlib -U
%pip install seaborn -U
%pip install tensorflow -U
%pip install --upgrade keras
%pip install polars


Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install ipywidgets -U

Note: you may need to restart the kernel to use updated packages.


In [17]:
# setup automl

In [20]:
import kagglehub

import numpy
import matplotlib
import seaborn
import tensorflow
import keras

import os
import shutil

import polars

# Data Retrieval

The `kagglehub` package is used to download the dataset, which simplifies data retrieval.

In [4]:
path = kagglehub.dataset_download("jsrojas/ip-network-traffic-flows-labeled-with-87-apps")

In [5]:
print(path)

C:\Users\angko\.cache\kagglehub\datasets\jsrojas\ip-network-traffic-flows-labeled-with-87-apps\versions\1


### List all files under the folder kagglehub has downloaded to

In [14]:
csv_file = os.listdir(path)

In [15]:
csv_file

['Dataset-Unicauca-Version2-87Atts.csv']

### Move the file to the current project folder

In [7]:
# https://www.freecodecamp.org/news/python-get-current-directory/

current_project_folder: str = os.getcwd()

In [17]:
source_file_path: str = os.path.join(path, csv_file[0])
source_file_path

'C:\\Users\\angko\\.cache\\kagglehub\\datasets\\jsrojas\\ip-network-traffic-flows-labeled-with-87-apps\\versions\\1\\Dataset-Unicauca-Version2-87Atts.csv'

In [18]:
shutil.move(source_file_path, current_project_folder)

'c:\\Users\\angko\\Desktop\\Network-Analysis-Project\\Dataset-Unicauca-Version2-87Atts.csv'
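The move step above takes the first entry of the directory listing and moves it once. A sketch of a variant that can be run repeatedly is shown here, assuming only `.csv` files are of interest. The helper name `move_csv_files` and the temporary directories are hypothetical and exist only for the demonstration.

```python
import shutil
import tempfile
from pathlib import Path


def move_csv_files(download_dir: Path, project_dir: Path) -> list[Path]:
    """Move every .csv file from download_dir into project_dir.

    Files that already exist in project_dir are left where they are.
    """
    moved = []
    for source in sorted(download_dir.glob("*.csv")):
        target = project_dir / source.name
        if target.exists():
            continue  # already moved on an earlier run
        shutil.move(str(source), str(target))
        moved.append(target)
    return moved


# Demonstration with throwaway directories instead of the real cache folder.
with tempfile.TemporaryDirectory() as tmp:
    download_dir = Path(tmp) / "download"
    project_dir = Path(tmp) / "project"
    download_dir.mkdir()
    project_dir.mkdir()
    (download_dir / "example.csv").write_text("a,b\n1,2\n")

    first_run = [p.name for p in move_csv_files(download_dir, project_dir)]
    second_run = [p.name for p in move_csv_files(download_dir, project_dir)]
```

In the notebook, the call would presumably be `move_csv_files(Path(path), Path(os.getcwd()))`, where `path` is the value returned by `kagglehub.dataset_download`.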

# Data Loading into DataFrame

* Load and explore the dataset.

Load the data into a DataFrame for data exploration. The cell below uses polars, while the outputs that follow were produced with a pandas DataFrame.

In [22]:
network_traffic_analysis_dataframe: polars.DataFrame = polars.read_csv("Dataset-Unicauca-Version2-87Atts.csv")

ComputeError: could not parse `325090.5` as dtype `i64` at column 'Active.Mean' (column number 77)

The current offset in the file is 1746894 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `schema_overrides` argument
- setting `ignore_errors` to `True`,
- adding `325090.5` to the `null_values` list.

Original error: ```remaining bytes non-empty```

In [22]:
network_traffic_analysis_dataframe

Unnamed: 0,Flow.ID,Source.IP,Source.Port,Destination.IP,Destination.Port,Protocol,Timestamp,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets,...,Active.Std,Active.Max,Active.Min,Idle.Mean,Idle.Std,Idle.Max,Idle.Min,Label,L7Protocol,ProtocolName
0,172.19.1.46-10.200.7.7-52422-3128-6,172.19.1.46,52422,10.200.7.7,3128,6,26/04/201711:11:17,45523,22,55,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY
1,172.19.1.46-10.200.7.7-52422-3128-6,10.200.7.7,3128,172.19.1.46,52422,6,26/04/201711:11:17,1,2,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY
2,10.200.7.217-50.31.185.39-38848-80-6,50.31.185.39,80,10.200.7.217,38848,6,26/04/201711:11:17,1,3,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,7,HTTP
3,10.200.7.217-50.31.185.39-38848-80-6,50.31.185.39,80,10.200.7.217,38848,6,26/04/201711:11:17,217,1,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,7,HTTP
4,192.168.72.43-10.200.7.7-55961-3128-6,192.168.72.43,55961,10.200.7.7,3128,6,26/04/201711:11:17,78068,5,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3577291,10.200.7.199-98.138.79.73-42135-443-6,98.138.79.73,443,10.200.7.199,42135,6,15/05/201705:43:40,2290821,5,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL
3577292,10.200.7.217-98.138.79.73-51546-443-6,98.138.79.73,443,10.200.7.217,51546,6,15/05/201705:46:10,24,5,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL
3577293,10.200.7.218-98.138.79.73-44366-443-6,98.138.79.73,443,10.200.7.218,44366,6,15/05/201705:45:39,2591653,6,5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL
3577294,10.200.7.195-98.138.79.73-52341-443-6,98.138.79.73,443,10.200.7.195,52341,6,15/05/201705:45:59,2622421,4,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL


In [23]:
network_traffic_analysis_dataframe.columns

Index(['Flow.ID', 'Source.IP', 'Source.Port', 'Destination.IP',
       'Destination.Port', 'Protocol', 'Timestamp', 'Flow.Duration',
       'Total.Fwd.Packets', 'Total.Backward.Packets',
       'Total.Length.of.Fwd.Packets', 'Total.Length.of.Bwd.Packets',
       'Fwd.Packet.Length.Max', 'Fwd.Packet.Length.Min',
       'Fwd.Packet.Length.Mean', 'Fwd.Packet.Length.Std',
       'Bwd.Packet.Length.Max', 'Bwd.Packet.Length.Min',
       'Bwd.Packet.Length.Mean', 'Bwd.Packet.Length.Std', 'Flow.Bytes.s',
       'Flow.Packets.s', 'Flow.IAT.Mean', 'Flow.IAT.Std', 'Flow.IAT.Max',
       'Flow.IAT.Min', 'Fwd.IAT.Total', 'Fwd.IAT.Mean', 'Fwd.IAT.Std',
       'Fwd.IAT.Max', 'Fwd.IAT.Min', 'Bwd.IAT.Total', 'Bwd.IAT.Mean',
       'Bwd.IAT.Std', 'Bwd.IAT.Max', 'Bwd.IAT.Min', 'Fwd.PSH.Flags',
       'Bwd.PSH.Flags', 'Fwd.URG.Flags', 'Bwd.URG.Flags', 'Fwd.Header.Length',
       'Bwd.Header.Length', 'Fwd.Packets.s', 'Bwd.Packets.s',
       'Min.Packet.Length', 'Max.Packet.Length', 'Packet.Length.Mean',
  

In [24]:
network_traffic_analysis_dataframe["Label"].unique()

array(['BENIGN'], dtype=object)

# Basic Data Exploration


* Handle missing data and outliers.
* Perform data visualization to gain insights into the dataset.

### Features and their descriptions

- Flow duration		
    - Duration of the flow in microseconds
- total Fwd Packet		
    - Total packets in the forward direction
- total Bwd packets		
    - Total packets in the backward direction
- total Length of Fwd Packet	
    - Total size of packet in forward direction
- total Length of Bwd Packet	
    - Total size of packet in backward direction
- Fwd Packet Length Min 		
    - Minimum size of packet in forward direction
- Fwd Packet Length Max 		
    - Maximum size of packet in forward direction
- Fwd Packet Length Mean		
    - Mean size of packet in forward direction
- Fwd Packet Length Std		
    - Standard deviation size of packet in forward direction
- Bwd Packet Length Min		
    - Minimum size of packet in backward direction
- Bwd Packet Length Max		
    - Maximum size of packet in backward direction
- Bwd Packet Length Mean		
    - Mean size of packet in backward direction
- Bwd Packet Length Std		
    - Standard deviation size of packet in backward direction
- Flow Byte/s
    - Number of flow bytes per second
- Flow Packets/s
    - Number of flow packets per second
- Flow IAT Mean			
    - Mean time between two packets sent in the flow
- Flow IAT Std			
    - Standard deviation time between two packets sent in the flow
- Flow IAT Max			
    - Maximum time between two packets sent in the flow
- Flow IAT Min			
    - Minimum time between two packets sent in the flow
- Fwd IAT Min			
    - Minimum time between two packets sent in the forward direction
- Fwd IAT Max			
    - Maximum time between two packets sent in the forward direction
- Fwd IAT Mean			
    - Mean time between two packets sent in the forward direction
- Fwd IAT Std			
    - Standard deviation time between two packets sent in the forward direction
- Fwd IAT Total   		
    - Total time between two packets sent in the forward direction
- Bwd IAT Min			
    - Minimum time between two packets sent in the backward direction
- Bwd IAT Max			
    - Maximum time between two packets sent in the backward direction
- Bwd IAT Mean			
    - Mean time between two packets sent in the backward direction
- Bwd IAT Std			
    - Standard deviation time between two packets sent in the backward direction
- Bwd IAT Total			
    - Total time between two packets sent in the backward direction
- Fwd PSH flag			
    - Number of times the PSH flag was set in packets travelling in the forward direction (0 for UDP)
- Bwd PSH Flag			
    - Number of times the PSH flag was set in packets travelling in the backward direction (0 for UDP)
- Fwd URG Flag			
    - Number of times the URG flag was set in packets travelling in the forward direction (0 for UDP)
- Bwd URG Flag			
    - Number of times the URG flag was set in packets travelling in the backward direction (0 for UDP)
- Fwd Header Length		
    - Total bytes used for headers in the forward direction
- Bwd Header Length		
    - Total bytes used for headers in the backward direction
- FWD Packets/s			
    - Number of forward packets per second
- Bwd Packets/s			
    - Number of backward packets per second
- Min Packet Length 		
    - Minimum length of a packet
- Max Packet Length 		
    - Maximum length of a packet
- Packet Length Mean 		
    - Mean length of a packet
- Packet Length Std		
    - Standard deviation length of a packet
- Packet Length Variance  	
    - Variance length of a packet
- FIN Flag Count 			
    - Number of packets with FIN
- SYN Flag Count 			
    - Number of packets with SYN
- RST Flag Count 			
    - Number of packets with RST
- PSH Flag Count 			
    - Number of packets with PUSH
- ACK Flag Count 			
    - Number of packets with ACK
- URG Flag Count 			
    - Number of packets with URG
- CWR Flag Count 			
    - Number of packets with CWR (the dataset column is named `CWE.Flag.Count`)
- ECE Flag Count 			
    - Number of packets with ECE
- down/Up Ratio			
    - Download and upload ratio
- Average Packet Size 		
    - Average size of packet
- Avg Fwd Segment Size 		
    - Average size observed in the forward direction
- AVG Bwd Segment Size 		
    - Average size observed in the backward direction
- Fwd Header Length		
    - Length of the forward packet header
- Fwd Avg Bytes/Bulk		
    - Average number of bytes bulk rate in the forward direction
- Fwd AVG Packet/Bulk 		
    - Average number of packets bulk rate in the forward direction
- Fwd AVG Bulk Rate 		
    - Average number of bulk rate in the forward direction
- Bwd Avg Bytes/Bulk		
    - Average number of bytes bulk rate in the backward direction
- Bwd AVG Packet/Bulk 		
    - Average number of packets bulk rate in the backward direction
- Bwd AVG Bulk Rate 		
    - Average number of bulk rate in the backward direction
- Subflow Fwd Packets		
    - The average number of packets in a sub flow in the forward direction
- Subflow Fwd Bytes		
    - The average number of bytes in a sub flow in the forward direction
- Subflow Bwd Packets		
    - The average number of packets in a sub flow in the backward direction
- Subflow Bwd Bytes		
    - The average number of bytes in a sub flow in the backward direction
- Init_Win_bytes_forward		
    - The total number of bytes sent in initial window in the forward direction
- Init_Win_bytes_backward		
    - The total number of bytes sent in initial window in the backward direction
- Act_data_pkt_forward		
    - Count of packets with at least 1 byte of TCP data payload in the forward direction
- min_seg_size_forward		
    - Minimum segment size observed in the forward direction
- Active Min			
    - Minimum time a flow was active before becoming idle
- Active Mean			
    - Mean time a flow was active before becoming idle
- Active Max			
    - Maximum time a flow was active before becoming idle
- Active Std			
    - Standard deviation time a flow was active before becoming idle
- Idle Min			
    - Minimum time a flow was idle before becoming active
- Idle Mean			
    - Mean time a flow was idle before becoming active
- Idle Max			
    - Maximum time a flow was idle before becoming active
- Idle Std			
    - Standard deviation time a flow was idle before becoming active

### Data Type for all columns in the dataset

In [25]:
print(network_traffic_analysis_dataframe.dtypes[:20])

Flow.ID                         object
Source.IP                       object
Source.Port                      int64
Destination.IP                  object
Destination.Port                 int64
Protocol                         int64
Timestamp                       object
Flow.Duration                    int64
Total.Fwd.Packets                int64
Total.Backward.Packets           int64
Total.Length.of.Fwd.Packets      int64
Total.Length.of.Bwd.Packets    float64
Fwd.Packet.Length.Max            int64
Fwd.Packet.Length.Min            int64
Fwd.Packet.Length.Mean         float64
Fwd.Packet.Length.Std          float64
Bwd.Packet.Length.Max            int64
Bwd.Packet.Length.Min            int64
Bwd.Packet.Length.Mean         float64
Bwd.Packet.Length.Std          float64
dtype: object


In [26]:
print(network_traffic_analysis_dataframe.dtypes[21:40])

Flow.Packets.s    float64
Flow.IAT.Mean     float64
Flow.IAT.Std      float64
Flow.IAT.Max      float64
Flow.IAT.Min        int64
Fwd.IAT.Total     float64
Fwd.IAT.Mean      float64
Fwd.IAT.Std       float64
Fwd.IAT.Max       float64
Fwd.IAT.Min       float64
Bwd.IAT.Total     float64
Bwd.IAT.Mean      float64
Bwd.IAT.Std       float64
Bwd.IAT.Max       float64
Bwd.IAT.Min       float64
Fwd.PSH.Flags       int64
Bwd.PSH.Flags       int64
Fwd.URG.Flags       int64
Bwd.URG.Flags       int64
dtype: object


In [27]:
print(network_traffic_analysis_dataframe.dtypes[41:60])

Bwd.Header.Length           int64
Fwd.Packets.s             float64
Bwd.Packets.s             float64
Min.Packet.Length           int64
Max.Packet.Length           int64
Packet.Length.Mean        float64
Packet.Length.Std         float64
Packet.Length.Variance    float64
FIN.Flag.Count              int64
SYN.Flag.Count              int64
RST.Flag.Count              int64
PSH.Flag.Count              int64
ACK.Flag.Count              int64
URG.Flag.Count              int64
CWE.Flag.Count              int64
ECE.Flag.Count              int64
Down.Up.Ratio               int64
Average.Packet.Size       float64
Avg.Fwd.Segment.Size      float64
dtype: object


In [28]:
print(network_traffic_analysis_dataframe.dtypes[71:])

Subflow.Bwd.Bytes            int64
Init_Win_bytes_forward       int64
Init_Win_bytes_backward      int64
act_data_pkt_fwd             int64
min_seg_size_forward         int64
Active.Mean                float64
Active.Std                 float64
Active.Max                 float64
Active.Min                 float64
Idle.Mean                  float64
Idle.Std                   float64
Idle.Max                   float64
Idle.Min                   float64
Label                       object
L7Protocol                   int64
ProtocolName                object
dtype: object
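The slices used above start at 21, 41, and 71, so the columns at positions 20, 40, and 60 to 70 are never printed. A sketch of a helper that produces contiguous slices is shown here; the name `contiguous_slices` and the chunk size of 20 are assumptions made for the example.

```python
def contiguous_slices(total: int, chunk_size: int) -> list[slice]:
    """Return slices that cover positions 0..total-1 with no gaps or overlaps."""
    return [
        slice(start, min(start + chunk_size, total))
        for start in range(0, total, chunk_size)
    ]


column_count = 87  # number of columns reported for this dataset
slices = contiguous_slices(column_count, 20)

# Flatten the slices back into positions to confirm nothing is skipped.
covered = [position for s in slices for position in range(s.start, s.stop)]
```

Each slice can then be applied in turn, for example `network_traffic_analysis_dataframe.dtypes[s]` for every `s` in `slices`.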


### Dataset Information

In [29]:
network_traffic_analysis_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3577296 entries, 0 to 3577295
Data columns (total 87 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   Flow.ID                      object 
 1   Source.IP                    object 
 2   Source.Port                  int64  
 3   Destination.IP               object 
 4   Destination.Port             int64  
 5   Protocol                     int64  
 6   Timestamp                    object 
 7   Flow.Duration                int64  
 8   Total.Fwd.Packets            int64  
 9   Total.Backward.Packets       int64  
 10  Total.Length.of.Fwd.Packets  int64  
 11  Total.Length.of.Bwd.Packets  float64
 12  Fwd.Packet.Length.Max        int64  
 13  Fwd.Packet.Length.Min        int64  
 14  Fwd.Packet.Length.Mean       float64
 15  Fwd.Packet.Length.Std        float64
 16  Bwd.Packet.Length.Max        int64  
 17  Bwd.Packet.Length.Min        int64  
 18  Bwd.Packet.Length.Mean       float64
 19  

### Describe the dataset

In [30]:
network_traffic_analysis_dataframe.iloc[:,:10].describe()

Unnamed: 0,Source.Port,Destination.Port,Protocol,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,37999.38,12042.46,6.005508,25442470.0,62.37799,65.34083
std,22017.13,20449.16,0.3274574,40144300.0,1094.086,1108.092
min,0.0,0.0,0.0,1.0,1.0,0.0
25%,3697.0,443.0,6.0,628.0,2.0,1.0
50%,49377.0,3128.0,6.0,584729.5,6.0,5.0
75%,53799.0,3128.0,6.0,45001530.0,15.0,15.0
max,65534.0,65534.0,17.0,120000000.0,453190.0,542196.0


In [31]:
network_traffic_analysis_dataframe.iloc[:,11:20].describe()

Unnamed: 0,Total.Length.of.Bwd.Packets,Fwd.Packet.Length.Max,Fwd.Packet.Length.Min,Fwd.Packet.Length.Mean,Fwd.Packet.Length.Std,Bwd.Packet.Length.Max,Bwd.Packet.Length.Min,Bwd.Packet.Length.Mean,Bwd.Packet.Length.Std
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,84457.42,512.3645,9.340408,114.9212,152.0501,1103.231,11.13491,254.7845,289.8878
std,2124319.0,1039.319,82.99983,246.4707,240.4702,2352.374,105.5422,506.0731,485.3004
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,6.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0
50%,208.0,206.0,0.0,46.57143,74.21124,81.0,0.0,30.14286,32.42474
75%,3629.0,613.0,6.0,122.5,207.9035,1366.0,0.0,256.75,423.2105
max,1345796000.0,32832.0,16060.0,16060.0,6225.487,37648.0,13032.0,13032.0,8434.804


In [32]:
network_traffic_analysis_dataframe.iloc[:,21:30].describe()

Unnamed: 0,Flow.Packets.s,Flow.IAT.Mean,Flow.IAT.Std,Flow.IAT.Max,Flow.IAT.Min,Fwd.IAT.Total,Fwd.IAT.Mean,Fwd.IAT.Std,Fwd.IAT.Max
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,88963.38,1422201.0,3365395.0,12850200.0,88702.01,24187960.0,3124467.0,3649620.0,12096240.0
std,402762.0,3550414.0,6260959.0,20765180.0,1605272.0,39625630.0,8358652.0,7390979.0,20491800.0
min,0.01666667,0.2,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,1.128096,415.0,8.485281,570.0,0.0,7.0,5.0,0.0,6.0
50%,33.93752,33202.38,68364.44,281239.5,1.0,389264.5,37006.79,47175.96,207629.0
75%,4214.963,936657.6,3980748.0,23915460.0,33.0,40011610.0,1549711.0,2932647.0,19269760.0
max,6000000.0,120000000.0,84852730.0,120000000.0,120000000.0,120000000.0,120000000.0,84852560.0,120000000.0


In [33]:
network_traffic_analysis_dataframe.iloc[:,31:40].describe()

Unnamed: 0,Bwd.IAT.Total,Bwd.IAT.Mean,Bwd.IAT.Std,Bwd.IAT.Max,Bwd.IAT.Min,Fwd.PSH.Flags,Bwd.PSH.Flags,Fwd.URG.Flags,Bwd.URG.Flags
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,21104510.0,2476877.0,2932460.0,9830803.0,888999.1,0.1720414,0.0,0.0,0.0
std,38626340.0,7578111.0,6666650.0,18835210.0,6231082.0,0.3774165,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,181562.5,15587.65,26175.95,95181.5,0.0,0.0,0.0,0.0,0.0
75%,14976530.0,334214.2,752634.2,7508778.0,1.0,0.0,0.0,0.0,0.0
max,120000000.0,119999900.0,84852750.0,119999900.0,119999900.0,1.0,0.0,0.0,0.0


In [34]:
network_traffic_analysis_dataframe.iloc[:,41:50].describe()

Unnamed: 0,Bwd.Header.Length,Fwd.Packets.s,Bwd.Packets.s,Min.Packet.Length,Max.Packet.Length,Packet.Length.Mean,Packet.Length.Std,Packet.Length.Variance,FIN.Flag.Count
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,1743.621,77058.16,11905.22,3.043745,1333.25,198.8191,303.519,279273.6,0.007037159
std,30391.9,368315.3,108020.6,41.45472,2453.395,332.7427,432.6083,725860.8,0.0835921
min,0.0,0.008333337,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,32.0,0.5417242,0.1009873,0.0,6.0,6.0,0.0,0.0,0.0
50%,136.0,15.63422,2.951696,0.0,355.0,62.83333,106.9828,11445.31,0.0
75%,420.0,2164.502,83.44459,6.0,1460.0,250.0,481.8125,232143.2,0.0
max,12844400.0,6000000.0,5000000.0,7063.0,37648.0,10708.67,9268.781,85910310.0,1.0


In [35]:
network_traffic_analysis_dataframe.iloc[:,51:60].describe()

Unnamed: 0,RST.Flag.Count,PSH.Flag.Count,ACK.Flag.Count,URG.Flag.Count,CWE.Flag.Count,ECE.Flag.Count,Down.Up.Ratio,Average.Packet.Size,Avg.Fwd.Segment.Size
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,0.0006655865,0.405821,0.5995705,0.2773847,0.0,0.0006566412,0.9085471,207.563,114.9212
std,0.02579038,0.4910503,0.4899855,0.447708,0.0,0.0256166,1.269945,343.227,246.4707
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,6.0
50%,0.0,0.0,1.0,0.0,0.0,0.0,1.0,66.5,46.57143
75%,0.0,1.0,1.0,1.0,0.0,0.0,1.0,263.7184,122.5
max,1.0,1.0,1.0,1.0,0.0,1.0,293.0,16063.0,16060.0


In [36]:
network_traffic_analysis_dataframe.iloc[:,61:70].describe()

Unnamed: 0,Fwd.Header.Length.1,Fwd.Avg.Bytes.Bulk,Fwd.Avg.Packets.Bulk,Fwd.Avg.Bulk.Rate,Bwd.Avg.Bytes.Bulk,Bwd.Avg.Packets.Bulk,Bwd.Avg.Bulk.Rate,Subflow.Fwd.Packets,Subflow.Fwd.Bytes
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,1653.339,0.0,0.0,0.0,0.0,0.0,0.0,62.37799,46833.23
std,30088.9,0.0,0.0,0.0,0.0,0.0,0.0,1094.086,1816196.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,40.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,12.0
50%,152.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,443.0
75%,392.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0,1769.0
max,15439500.0,0.0,0.0,0.0,0.0,0.0,0.0,453190.0,678023600.0


In [37]:
network_traffic_analysis_dataframe.iloc[:,71:80].describe()

Unnamed: 0,Subflow.Bwd.Bytes,Init_Win_bytes_forward,Init_Win_bytes_backward,act_data_pkt_fwd,min_seg_size_forward,Active.Mean,Active.Std,Active.Max,Active.Min
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,84457.42,8984.691,2123.489,45.03535,25.69738,298199.0,183640.6,522937.2,167633.6
std,2124319.0,14101.26,7704.789,974.8192,6.025989,2349390.0,1325838.0,3266508.0,2064219.0
min,0.0,-1.0,-1.0,0.0,-1.0,0.0,0.0,0.0,0.0
25%,0.0,411.0,18.0,0.0,20.0,0.0,0.0,0.0,0.0
50%,208.0,5840.0,262.0,2.0,20.0,0.0,0.0,0.0,0.0
75%,3629.0,14600.0,660.0,9.0,32.0,45.0,0.0,57.0,2.0
max,1345796000.0,65535.0,65535.0,328694.0,523.0,114695000.0,72971360.0,114695000.0,114695000.0


In [38]:
network_traffic_analysis_dataframe.iloc[:,81:].describe()

Unnamed: 0,Idle.Std,Idle.Max,Idle.Min,L7Protocol
count,3577296.0,3577296.0,3577296.0,3577296.0
mean,1370991.0,9743845.0,7252097.0,102.9508
std,4814474.0,18885570.0,16007540.0,51.29198
min,0.0,0.0,0.0,1.0
25%,0.0,0.0,0.0,91.0
50%,0.0,0.0,0.0,126.0
75%,0.0,8034389.0,5369712.0,130.0
max,77387460.0,120000000.0,120000000.0,222.0


# Data Preparation

### Detect any null values in the dataset

In [39]:
network_traffic_analysis_dataframe.isnull().sum()[:20]

Flow.ID                        0
Source.IP                      0
Source.Port                    0
Destination.IP                 0
Destination.Port               0
Protocol                       0
Timestamp                      0
Flow.Duration                  0
Total.Fwd.Packets              0
Total.Backward.Packets         0
Total.Length.of.Fwd.Packets    0
Total.Length.of.Bwd.Packets    0
Fwd.Packet.Length.Max          0
Fwd.Packet.Length.Min          0
Fwd.Packet.Length.Mean         0
Fwd.Packet.Length.Std          0
Bwd.Packet.Length.Max          0
Bwd.Packet.Length.Min          0
Bwd.Packet.Length.Mean         0
Bwd.Packet.Length.Std          0
dtype: int64

In [40]:
network_traffic_analysis_dataframe.isnull().sum()[21:40]

Flow.Packets.s    0
Flow.IAT.Mean     0
Flow.IAT.Std      0
Flow.IAT.Max      0
Flow.IAT.Min      0
Fwd.IAT.Total     0
Fwd.IAT.Mean      0
Fwd.IAT.Std       0
Fwd.IAT.Max       0
Fwd.IAT.Min       0
Bwd.IAT.Total     0
Bwd.IAT.Mean      0
Bwd.IAT.Std       0
Bwd.IAT.Max       0
Bwd.IAT.Min       0
Fwd.PSH.Flags     0
Bwd.PSH.Flags     0
Fwd.URG.Flags     0
Bwd.URG.Flags     0
dtype: int64

In [41]:
network_traffic_analysis_dataframe.isnull().sum()[41:60]

Bwd.Header.Length         0
Fwd.Packets.s             0
Bwd.Packets.s             0
Min.Packet.Length         0
Max.Packet.Length         0
Packet.Length.Mean        0
Packet.Length.Std         0
Packet.Length.Variance    0
FIN.Flag.Count            0
SYN.Flag.Count            0
RST.Flag.Count            0
PSH.Flag.Count            0
ACK.Flag.Count            0
URG.Flag.Count            0
CWE.Flag.Count            0
ECE.Flag.Count            0
Down.Up.Ratio             0
Average.Packet.Size       0
Avg.Fwd.Segment.Size      0
dtype: int64

In [42]:
network_traffic_analysis_dataframe.isnull().sum()[61:80]

Fwd.Header.Length.1        0
Fwd.Avg.Bytes.Bulk         0
Fwd.Avg.Packets.Bulk       0
Fwd.Avg.Bulk.Rate          0
Bwd.Avg.Bytes.Bulk         0
Bwd.Avg.Packets.Bulk       0
Bwd.Avg.Bulk.Rate          0
Subflow.Fwd.Packets        0
Subflow.Fwd.Bytes          0
Subflow.Bwd.Packets        0
Subflow.Bwd.Bytes          0
Init_Win_bytes_forward     0
Init_Win_bytes_backward    0
act_data_pkt_fwd           0
min_seg_size_forward       0
Active.Mean                0
Active.Std                 0
Active.Max                 0
Active.Min                 0
dtype: int64

In [43]:
network_traffic_analysis_dataframe.isnull().sum()[81:]

Idle.Std        0
Idle.Max        0
Idle.Min        0
Label           0
L7Protocol      0
ProtocolName    0
dtype: int64

### Detect any NA values in the dataset

In [44]:
network_traffic_analysis_dataframe.isna().sum()[:20]

Flow.ID                        0
Source.IP                      0
Source.Port                    0
Destination.IP                 0
Destination.Port               0
Protocol                       0
Timestamp                      0
Flow.Duration                  0
Total.Fwd.Packets              0
Total.Backward.Packets         0
Total.Length.of.Fwd.Packets    0
Total.Length.of.Bwd.Packets    0
Fwd.Packet.Length.Max          0
Fwd.Packet.Length.Min          0
Fwd.Packet.Length.Mean         0
Fwd.Packet.Length.Std          0
Bwd.Packet.Length.Max          0
Bwd.Packet.Length.Min          0
Bwd.Packet.Length.Mean         0
Bwd.Packet.Length.Std          0
dtype: int64

In [45]:
network_traffic_analysis_dataframe.isna().sum()[21:40]

Flow.Packets.s    0
Flow.IAT.Mean     0
Flow.IAT.Std      0
Flow.IAT.Max      0
Flow.IAT.Min      0
Fwd.IAT.Total     0
Fwd.IAT.Mean      0
Fwd.IAT.Std       0
Fwd.IAT.Max       0
Fwd.IAT.Min       0
Bwd.IAT.Total     0
Bwd.IAT.Mean      0
Bwd.IAT.Std       0
Bwd.IAT.Max       0
Bwd.IAT.Min       0
Fwd.PSH.Flags     0
Bwd.PSH.Flags     0
Fwd.URG.Flags     0
Bwd.URG.Flags     0
dtype: int64

In [46]:
network_traffic_analysis_dataframe.isna().sum()[41:60]

Bwd.Header.Length         0
Fwd.Packets.s             0
Bwd.Packets.s             0
Min.Packet.Length         0
Max.Packet.Length         0
Packet.Length.Mean        0
Packet.Length.Std         0
Packet.Length.Variance    0
FIN.Flag.Count            0
SYN.Flag.Count            0
RST.Flag.Count            0
PSH.Flag.Count            0
ACK.Flag.Count            0
URG.Flag.Count            0
CWE.Flag.Count            0
ECE.Flag.Count            0
Down.Up.Ratio             0
Average.Packet.Size       0
Avg.Fwd.Segment.Size      0
dtype: int64

In [47]:
network_traffic_analysis_dataframe.isna().sum()[61:80]

Fwd.Header.Length.1        0
Fwd.Avg.Bytes.Bulk         0
Fwd.Avg.Packets.Bulk       0
Fwd.Avg.Bulk.Rate          0
Bwd.Avg.Bytes.Bulk         0
Bwd.Avg.Packets.Bulk       0
Bwd.Avg.Bulk.Rate          0
Subflow.Fwd.Packets        0
Subflow.Fwd.Bytes          0
Subflow.Bwd.Packets        0
Subflow.Bwd.Bytes          0
Init_Win_bytes_forward     0
Init_Win_bytes_backward    0
act_data_pkt_fwd           0
min_seg_size_forward       0
Active.Mean                0
Active.Std                 0
Active.Max                 0
Active.Min                 0
dtype: int64

In [48]:
network_traffic_analysis_dataframe.isna().sum()[81:]

Idle.Std        0
Idle.Max        0
Idle.Min        0
Label           0
L7Protocol      0
ProtocolName    0
dtype: int64
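In pandas, `isnull` is an alias of `isna`, so the two sets of checks above report the same counts. A more compact check, sketched here on a small made-up frame, lists only the columns that contain missing values and gives a single total.

```python
import pandas as pd

# Made-up frame with one missing value, standing in for the real dataset.
frame = pd.DataFrame(
    {
        "Flow.Duration": [45523, 1, None],
        "ProtocolName": ["HTTP_PROXY", "HTTP", "SSL"],
    }
)

null_counts = frame.isna().sum()                     # missing values per column
columns_with_nulls = null_counts[null_counts > 0]    # keep only affected columns
total_nulls = int(null_counts.sum())                 # one number for the whole frame
```

On the real DataFrame, an empty `columns_with_nulls` and a `total_nulls` of 0 would confirm in one cell what the sliced outputs show piece by piece.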

### Determine which columns have the object data type

In [49]:
network_traffic_analysis_dataframe.select_dtypes("object").columns

Index(['Flow.ID', 'Source.IP', 'Destination.IP', 'Timestamp', 'Label',
       'ProtocolName'],
      dtype='object')

### Determine the classes in Label column

In [50]:
network_traffic_analysis_dataframe["Label"].unique()

array(['BENIGN'], dtype=object)
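`Label` holds the single value `BENIGN`, so it cannot distinguish one class from another. The summary statistics earlier also show columns with a standard deviation of 0, such as `Bwd.PSH.Flags`. A sketch of one way to find such single-valued columns follows, using a small made-up frame; whether to drop them is a modelling decision that this notebook has not yet made.

```python
import pandas as pd

# Made-up frame that mimics two constant columns and one informative column.
frame = pd.DataFrame(
    {
        "Label": ["BENIGN", "BENIGN", "BENIGN"],
        "Bwd.PSH.Flags": [0, 0, 0],
        "ProtocolName": ["HTTP_PROXY", "HTTP", "SSL"],
    }
)

unique_counts = frame.nunique()  # number of distinct values per column
constant_columns = unique_counts[unique_counts <= 1].index.tolist()
informative = frame.drop(columns=constant_columns)
```

Because the dataset was designed for application classification, `ProtocolName` or `L7Protocol` is the more plausible prediction target than `Label`.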

### Mapping of L7Protocol and ProtocolName

In [51]:
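# Build lookup dictionaries between the L7Protocol code and the ProtocolName label

# unique() returns a NumPy array of the distinct protocol names
protocol_name_list = network_traffic_analysis_dataframe["ProtocolName"].unique()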

number_of_data_to_iterate: int = len(network_traffic_analysis_dataframe["ProtocolName"].unique())

protocol_index_to_name_mapping: dict[int, str] = {}
protocol_name_to_index_mapping: dict[str, int] = {}

for index in range(number_of_data_to_iterate):
    data = network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["ProtocolName"] == protocol_name_list[index]].head(1)[["L7Protocol", "ProtocolName"]]

    protocol_index_to_name_mapping[data["L7Protocol"].values[0]] = data["ProtocolName"].values[0]
    protocol_name_to_index_mapping[data["ProtocolName"].values[0]] = data["L7Protocol"].values[0]




In [52]:
protocol_index_to_name_mapping

{np.int64(131): 'HTTP_PROXY',
 np.int64(7): 'HTTP',
 np.int64(130): 'HTTP_CONNECT',
 np.int64(91): 'SSL',
 np.int64(126): 'GOOGLE',
 np.int64(124): 'YOUTUBE',
 np.int64(119): 'FACEBOOK',
 np.int64(40): 'CONTENT_FLASH',
 np.int64(121): 'DROPBOX',
 np.int64(147): 'WINDOWS_UPDATE',
 np.int64(178): 'AMAZON',
 np.int64(212): 'MICROSOFT',
 np.int64(163): 'TOR',
 np.int64(122): 'GMAIL',
 np.int64(70): 'YAHOO',
 np.int64(68): 'MSN',
 np.int64(64): 'SSL_NO_CERT',
 np.int64(125): 'SKYPE',
 np.int64(221): 'MS_ONE_DRIVE',
 np.int64(114): 'MSSQL',
 np.int64(120): 'TWITTER',
 np.int64(143): 'APPLE_ICLOUD',
 np.int64(220): 'CLOUDFLARE',
 np.int64(169): 'UBUNTUONE',
 np.int64(219): 'OFFICE_365',
 np.int64(176): 'WIKIPEDIA',
 np.int64(201): 'OPENSIGNAL',
 np.int64(5): 'DNS',
 np.int64(60): 'HTTP_DOWNLOAD',
 np.int64(142): 'WHATSAPP',
 np.int64(145): 'APPLE_ITUNES',
 np.int64(175): 'FTP_DATA',
 np.int64(132): 'CITRIX',
 np.int64(140): 'APPLE',
 np.int64(222): 'MQTT',
 np.int64(211): 'INSTAGRAM',
 np.int

### Mapping of Protocol and OSI Model

In [53]:
# Create a mapping of the number in Protocol column and the OSI model
network_traffic_analysis_dataframe["Protocol"].unique()


array([ 6, 17,  0])

https://github.com/CanadianInstituteForCybersecurity/CICFlowMeter/blob/master/src/main/java/cic/cs/unb/ca/jnetpcap/PacketReader.java

Lines 401 to 438 of this file show that the protocol number is set to 6 when the protocol is TCP and to 17 when the protocol is UDP.


https://github.com/CanadianInstituteForCybersecurity/CICFlowMeter/blob/master/src/main/java/cic/cs/unb/ca/jnetpcap/FlowFeature.java

Lines 188 to 206 of this file show the same implementation:

TCP = 6

UDP = 17

Others = 0

https://github.com/CanadianInstituteForCybersecurity/CICFlowMeter/blob/master/src/main/java/cic/cs/unb/ca/jnetpcap/BasicFlow.java

Lines 788 to 795 of this file map each protocol number to its protocol name.

The Protocol column is therefore decoded with the following mapping:

TCP = 6

UDP = 17

OTHERS = 0


In [54]:
TCP: str = "TCP"
UDP: str = "UDP"
OTHERS: str = "OTHERS"

In [55]:
TCP_CODE: int = 6
UDP_CODE: int = 17
OTHER_CODE: int = 0

In [56]:
protocol_to_code_mapping: dict = {
    TCP: TCP_CODE,
    UDP: UDP_CODE,
    OTHERS: OTHER_CODE
}
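
The mapping above goes from protocol name to code, while the Protocol column stores codes. A minimal sketch of the reverse lookup, assuming the three codes documented above; `code_to_protocol_mapping` is a hypothetical name introduced for illustration.

```python
# Constants copied from the cells above
protocol_to_code_mapping = {"TCP": 6, "UDP": 17, "OTHERS": 0}

# Invert the mapping so numeric codes can be translated back to names
code_to_protocol_mapping = {
    code: name for name, code in protocol_to_code_mapping.items()
}

# Values seen in the Protocol column
observed_codes = [6, 17, 0]
decoded = [code_to_protocol_mapping[code] for code in observed_codes]
print(decoded)
```

With pandas, the same dictionary could be passed to `Series.map` to add a readable protocol column.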

### Split TimeStamp into 2 different columns (Date and Time)

In [57]:
# https://regexr.com/
network_traffic_analysis_dataframe["Date"] = network_traffic_analysis_dataframe["Timestamp"].str.extract(r"(\d{1,2}\/\d{1,2}\/\d{2,4})")


In [58]:
network_traffic_analysis_dataframe["Time"] = network_traffic_analysis_dataframe["Timestamp"].str.extract(r"(\d{1,2}\:\d{1,2}\:\d{1,2})")

In [59]:
network_traffic_analysis_dataframe["Time"]

0          11:11:17
1          11:11:17
2          11:11:17
3          11:11:17
4          11:11:17
             ...   
3577291    05:43:40
3577292    05:46:10
3577293    05:45:39
3577294    05:45:59
3577295    05:46:05
Name: Time, Length: 3577296, dtype: object
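
The extracted Date and Time strings remain of object type. If a real datetime value is needed, the two parts can be combined and parsed. The sketch below uses only the standard library and the same two patterns. It assumes a day/month/year order and a raw value shaped like the Timestamp values displayed in this notebook; `parse_timestamp` is a hypothetical helper.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"(\d{1,2}/\d{1,2}/\d{2,4})")
TIME_PATTERN = re.compile(r"(\d{1,2}:\d{1,2}:\d{1,2})")


def parse_timestamp(raw: str) -> datetime:
    """Extract the date and time parts and combine them into one datetime."""
    date_part = DATE_PATTERN.search(raw).group(1)
    time_part = TIME_PATTERN.search(raw).group(1)
    # Assumption: day/month/four-digit-year order
    return datetime.strptime(f"{date_part} {time_part}", "%d/%m/%Y %H:%M:%S")


parsed = parse_timestamp("26/04/201711:11:17")
print(parsed.isoformat())
```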


* Preprocess the data for modeling, including feature scaling and encoding categorical variables.

# Further Data Exploration

## Explore the distribution of Application used

L7Protocol and ProtocolName carry the same information: L7Protocol is the numeric code that identifies an application, and ProtocolName is the name of the application that generated the flow.

In [60]:
network_traffic_analysis_dataframe["ProtocolName"].value_counts().head(20)

ProtocolName
GOOGLE            959110
HTTP              683734
HTTP_PROXY        623210
SSL               404883
HTTP_CONNECT      317526
YOUTUBE           170781
AMAZON             86875
MICROSOFT          54710
GMAIL              40260
WINDOWS_UPDATE     34471
SKYPE              30657
FACEBOOK           29033
DROPBOX            25102
YAHOO              21268
TWITTER            18259
CLOUDFLARE         14737
MSN                14478
CONTENT_FLASH       8589
APPLE               7615
OFFICE_365          5941
Name: count, dtype: int64

## Analysis on TCP 3-way handshake in the dataset

### Explore the SYN Flag Count distribution

In [61]:
network_traffic_analysis_dataframe["SYN.Flag.Count"].value_counts()

SYN.Flag.Count
0    2961853
1     615443
Name: count, dtype: int64

In [62]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["SYN.Flag.Count"] == 0]["Protocol"].unique()

array([ 6, 17,  0])

In [63]:
network_traffic_analysis_dataframe[(network_traffic_analysis_dataframe["SYN.Flag.Count"] == 0) & (network_traffic_analysis_dataframe["Protocol"] == 6)]["Protocol"].count()

np.int64(2957532)

In [64]:
network_traffic_analysis_dataframe[(network_traffic_analysis_dataframe["SYN.Flag.Count"] == 0) & (network_traffic_analysis_dataframe["Protocol"] != 6)]["Protocol"].count()

np.int64(4321)

In [65]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["SYN.Flag.Count"] == 1]["Protocol"].unique()

array([6])

https://www.imperva.com/learn/ddos/syn-flood/ 

https://en.wikipedia.org/wiki/SYN_flood

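A TCP connection is initiated with a SYN packet, yet most TCP flows in this dataset have no SYN flag recorded: 2,957,532 TCP flows have a SYN.Flag.Count of 0, against 615,443 flows with a SYN.Flag.Count of 1.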
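
The share of TCP flows without a SYN flag follows from the counts printed above. A small arithmetic check, using only those counts:

```python
# Counts copied from the cell outputs above
tcp_without_syn = 2_957_532   # SYN.Flag.Count == 0 and Protocol == 6
flows_with_syn = 615_443      # SYN.Flag.Count == 1 (all of them TCP)

tcp_total = tcp_without_syn + flows_with_syn
share_without_syn = tcp_without_syn / tcp_total

print(f"TCP flows: {tcp_total}")
print(f"Share without SYN: {share_without_syn:.1%}")
```

Roughly four out of five TCP flows therefore carry no SYN flag in this dataset.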

In [66]:
network_traffic_analysis_dataframe.columns

Index(['Flow.ID', 'Source.IP', 'Source.Port', 'Destination.IP',
       'Destination.Port', 'Protocol', 'Timestamp', 'Flow.Duration',
       'Total.Fwd.Packets', 'Total.Backward.Packets',
       'Total.Length.of.Fwd.Packets', 'Total.Length.of.Bwd.Packets',
       'Fwd.Packet.Length.Max', 'Fwd.Packet.Length.Min',
       'Fwd.Packet.Length.Mean', 'Fwd.Packet.Length.Std',
       'Bwd.Packet.Length.Max', 'Bwd.Packet.Length.Min',
       'Bwd.Packet.Length.Mean', 'Bwd.Packet.Length.Std', 'Flow.Bytes.s',
       'Flow.Packets.s', 'Flow.IAT.Mean', 'Flow.IAT.Std', 'Flow.IAT.Max',
       'Flow.IAT.Min', 'Fwd.IAT.Total', 'Fwd.IAT.Mean', 'Fwd.IAT.Std',
       'Fwd.IAT.Max', 'Fwd.IAT.Min', 'Bwd.IAT.Total', 'Bwd.IAT.Mean',
       'Bwd.IAT.Std', 'Bwd.IAT.Max', 'Bwd.IAT.Min', 'Fwd.PSH.Flags',
       'Bwd.PSH.Flags', 'Fwd.URG.Flags', 'Bwd.URG.Flags', 'Fwd.Header.Length',
       'Bwd.Header.Length', 'Fwd.Packets.s', 'Bwd.Packets.s',
       'Min.Packet.Length', 'Max.Packet.Length', 'Packet.Length.Mean',
  

### Explore RST Flag data distribution

https://en.wikipedia.org/wiki/TCP_reset_attack 

https://www.extrahop.com/blog/tcp-resets-rst-prevent-command-and-control-dos-attacks

https://www.rfc-editor.org/info/bcp60 (BCP 60, "Inappropriate TCP Resets Considered Harmful")

In [67]:
network_traffic_analysis_dataframe["RST.Flag.Count"].value_counts()

RST.Flag.Count
0    3574915
1       2381
Name: count, dtype: int64

In [68]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["RST.Flag.Count"] == 1]["Protocol"].value_counts()

Protocol
6    2381
Name: count, dtype: int64

In [69]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["RST.Flag.Count"] == 1]

Unnamed: 0,Flow.ID,Source.IP,Source.Port,Destination.IP,Destination.Port,Protocol,Timestamp,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets,...,Active.Min,Idle.Mean,Idle.Std,Idle.Max,Idle.Min,Label,L7Protocol,ProtocolName,Date,Time
1900,192.168.32.3-10.200.7.8-50687-3128-6,192.168.32.3,50687,10.200.7.8,3128,6,26/04/201711:11:28,118867,8,14,...,0.0,0.000000e+00,0.000000e+00,0.0,0.0,BENIGN,131,HTTP_PROXY,26/04/2017,11:11:28
1943,192.168.32.3-10.200.7.8-50688-3128-6,192.168.32.3,50688,10.200.7.8,3128,6,26/04/201711:11:28,194774,8,15,...,0.0,0.000000e+00,0.000000e+00,0.0,0.0,BENIGN,131,HTTP_PROXY,26/04/2017,11:11:28
2356,192.168.32.3-10.200.7.8-50699-3128-6,192.168.32.3,50699,10.200.7.8,3128,6,26/04/201711:11:29,445551,10,20,...,0.0,0.000000e+00,0.000000e+00,0.0,0.0,BENIGN,131,HTTP_PROXY,26/04/2017,11:11:29
2858,192.168.32.3-10.200.7.8-50704-3128-6,192.168.32.3,50704,10.200.7.8,3128,6,26/04/201711:11:31,245917,8,14,...,0.0,0.000000e+00,0.000000e+00,0.0,0.0,BENIGN,131,HTTP_PROXY,26/04/2017,11:11:31
3579,192.168.32.3-10.200.7.8-50703-3128-6,192.168.32.3,50703,10.200.7.8,3128,6,26/04/201711:11:31,3466473,16,33,...,0.0,0.000000e+00,0.000000e+00,0.0,0.0,BENIGN,130,HTTP_CONNECT,26/04/2017,11:11:31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3549053,192.168.32.93-10.200.7.9-51666-3128-6,192.168.32.93,51666,10.200.7.9,3128,6,15/05/201705:21:22,431352,14,13,...,0.0,0.000000e+00,0.000000e+00,0.0,0.0,BENIGN,140,APPLE,15/05/2017,05:21:22
3549054,192.168.32.93-10.200.7.8-51642-3128-6,192.168.32.93,51642,10.200.7.8,3128,6,15/05/201705:19:19,90338973,29,30,...,80.0,4.514784e+07,6.746577e+04,45195545.0,45100134.0,BENIGN,126,GOOGLE,15/05/2017,05:19:19
3549061,192.168.32.93-10.200.7.8-51645-3128-6,192.168.32.93,51645,10.200.7.8,3128,6,15/05/201705:19:24,90812319,42,29,...,175.0,4.531954e+07,3.367999e+05,45557693.0,45081386.0,BENIGN,126,GOOGLE,15/05/2017,05:19:24
3549080,192.168.32.93-10.200.7.8-51665-3128-6,192.168.32.93,51665,10.200.7.8,3128,6,15/05/201705:21:11,315796,42,26,...,0.0,0.000000e+00,0.000000e+00,0.0,0.0,BENIGN,126,GOOGLE,15/05/2017,05:21:11


In [70]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["RST.Flag.Count"] == 0]["Protocol"].value_counts()

Protocol
6     3570594
17       2684
0        1637
Name: count, dtype: int64

### Explore the FIN Flag Count distribution

Based on the documentation of the dataset, the FIN flag is set once the TCP connection ends.

In [71]:
network_traffic_analysis_dataframe["FIN.Flag.Count"].value_counts()

FIN.Flag.Count
0    3552122
1      25174
Name: count, dtype: int64

In [72]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["FIN.Flag.Count"] == 0]["Protocol"].unique()

array([ 6, 17,  0])

According to the documentation of CICFlowMeter, the tool used to generate this dataset [https://www.unb.ca/cic/research/applications.html#CICFlowMeter], a TCP flow is usually terminated by the connection teardown, that is, by a FIN packet.

UDP flows are terminated by a flow timeout.

The number of flows without a FIN flag is unusually high: 3,552,122 of the 3,577,296 flows (about 99.3%) have a FIN.Flag.Count of 0, and these include TCP flows.

As mapped in a previous section of the notebook, Protocol 6 = TCP, 17 = UDP, and 0 = other protocols.
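
The filter-then-count cells used for each flag in this section could probably be condensed into one cross-tabulation per flag. A minimal sketch on a made-up six-row sample; the `flows` frame is hypothetical and only reuses the dataset's column names.

```python
import pandas as pd

# Hypothetical sample; the real frame has about 3.5 million rows
flows = pd.DataFrame({
    "Protocol": [6, 6, 6, 17, 0, 6],
    "FIN.Flag.Count": [0, 1, 0, 0, 0, 0],
})

# Rows are protocol codes, columns are FIN flag values, cells are flow counts
fin_by_protocol = pd.crosstab(flows["Protocol"], flows["FIN.Flag.Count"])
print(fin_by_protocol)
```

Each row of the result shows, for one protocol code, how many flows have the flag set and how many do not.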



### Explore the Flow Timeout value data

According to the ReadMe.txt of CICFlowMeter [https://github.com/CanadianInstituteForCybersecurity/CICFlowMeter/blob/master/ReadMe.txt], the Flow.Duration column is measured in microseconds.

In [73]:
network_traffic_analysis_dataframe["Flow.Duration"].describe()

count    3.577296e+06
mean     2.544247e+07
std      4.014430e+07
min      1.000000e+00
25%      6.280000e+02
50%      5.847295e+05
75%      4.500153e+07
max      1.200000e+08
Name: Flow.Duration, dtype: float64

In [74]:
def transform_microseconds_to_seconds(data: int) -> float:
    if data == 0:
        return 0.0
    
    return data / 1000000.0


In [75]:

network_traffic_analysis_dataframe["Flow.Duration"].apply(transform_microseconds_to_seconds)

0          0.045523
1          0.000001
2          0.000001
3          0.000217
4          0.078068
             ...   
3577291    2.290821
3577292    0.000024
3577293    2.591653
3577294    2.622421
3577295    2.009138
Name: Flow.Duration, Length: 3577296, dtype: float64
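
Converting the summary statistics as well makes the scale of the durations easier to read. The sketch below uses the values as printed by `describe()` above, so they are rounded. The longest flow lasts 120 seconds, which would be consistent with a 120-second flow timeout, although that setting is an assumption and is not confirmed here.

```python
MICROSECONDS_PER_SECOND = 1_000_000

# Summary statistics copied from the describe() output above (microseconds)
flow_duration_summary = {
    "min": 1.0,
    "25%": 628.0,
    "50%": 584_729.5,
    "75%": 45_001_530.0,
    "max": 120_000_000.0,
}

summary_in_seconds = {
    name: value / MICROSECONDS_PER_SECOND
    for name, value in flow_duration_summary.items()
}

for name, seconds in summary_in_seconds.items():
    print(f"{name}: {seconds:.6f} s")
```

On the full column, dividing the Series by `1_000_000` directly gives the same result as `apply` and avoids a Python-level call per row.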

### Exploring the TCP PSH Packet Flag distribution

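The TCP PSH flag asks the receiver to pass the buffered data to the application immediately instead of waiting for more segments. It matters for interactive, latency-sensitive applications, where a delay in data delivery causes a poor user experience.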

In [76]:
network_traffic_analysis_dataframe["PSH.Flag.Count"].value_counts()

PSH.Flag.Count
0    2125554
1    1451742
Name: count, dtype: int64

In [77]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["PSH.Flag.Count"] == 0]["Protocol"].value_counts()

Protocol
6     2121233
17       2684
0        1637
Name: count, dtype: int64

In [78]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["PSH.Flag.Count"] == 1]["Protocol"].value_counts()

Protocol
6    1451742
Name: count, dtype: int64

In [79]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["PSH.Flag.Count"] == 1]["ProtocolName"].value_counts().head(20)

ProtocolName
GOOGLE            407360
HTTP_CONNECT      192516
SSL               184339
HTTP              173905
HTTP_PROXY        167665
YOUTUBE            95905
AMAZON             52442
MICROSOFT          36443
WINDOWS_UPDATE     23998
GMAIL              15260
FACEBOOK           14978
SKYPE              14957
YAHOO              13503
MSN                 9748
TWITTER             9572
CLOUDFLARE          7600
CONTENT_FLASH       7213
DROPBOX             5147
APPLE               4016
OFFICE_365          2514
Name: count, dtype: int64

https://orhanergun.net/understanding-tcp-psh-packet-flag

### Exploring the TCP ACK Flag distribution

In [80]:
network_traffic_analysis_dataframe["ACK.Flag.Count"].value_counts()

ACK.Flag.Count
1    2144841
0    1432455
Name: count, dtype: int64

In [81]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["ACK.Flag.Count"] == 0]["Protocol"].value_counts()

Protocol
6     1428134
17       2684
0        1637
Name: count, dtype: int64

In [82]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["ACK.Flag.Count"] == 1]["Protocol"].value_counts()

Protocol
6    2144841
Name: count, dtype: int64

### Exploring the TCP URG flag packet distribution

According to this blog post comparing TCP PSH and URG [https://orhanergun.net/tcp-psh-vs-urg-whats-the-difference], the URG flag indicates that the Urgent Pointer field in the TCP header is valid. The flag marks the portion of the data that requires immediate attention from the receiver.

The receiver prioritises the urgent data and processes it before other data.

A typical use case of the TCP URG flag is data that contains control signals or error messages.

In [83]:
network_traffic_analysis_dataframe["URG.Flag.Count"].value_counts()

URG.Flag.Count
0    2585009
1     992287
Name: count, dtype: int64

In [84]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["URG.Flag.Count"] == 0]["Protocol"].value_counts()

Protocol
6     2580688
17       2684
0        1637
Name: count, dtype: int64

In [85]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["URG.Flag.Count"] == 1]["Protocol"].value_counts()

Protocol
6    992287
Name: count, dtype: int64

### Exploring the CWE Flag distribution

https://kb.clavister.com/317180249/explicit-congestion-notification---ecn-ece-cwe-ns-ect-ce 

https://www.catchpoint.com/blog/ece-cwr-tcp

In [86]:
network_traffic_analysis_dataframe["CWE.Flag.Count"].value_counts()

CWE.Flag.Count
0    3577296
Name: count, dtype: int64

### Exploring the ECE flag distribution

In [87]:
network_traffic_analysis_dataframe["ECE.Flag.Count"].value_counts()

ECE.Flag.Count
0    3574947
1       2349
Name: count, dtype: int64

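ECN (Explicit Congestion Notification) is a mechanism in TCP/IP that lets routers signal that they are close to overload by marking packets instead of dropping them.

ECE (ECN-Echo) is the TCP flag that the receiver sets to tell the sender that it received a packet marked as having experienced congestion.

CWR (Congestion Window Reduced) is the TCP flag that the sender sets to acknowledge the ECE signal and to indicate that it has reduced its congestion window. The CWE.Flag.Count column appears to count this flag.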
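
To make the flag names used in this section concrete, the sketch below decodes the flags byte of a TCP header into flag names. The bit values are the standard TCP header bit assignments. `TCP_FLAGS` and `decode_tcp_flags` are hypothetical names introduced for illustration and are not part of CICFlowMeter.

```python
# Bit masks of the TCP control flags within the flags byte
TCP_FLAGS = {
    "FIN": 0x01,
    "SYN": 0x02,
    "RST": 0x04,
    "PSH": 0x08,
    "ACK": 0x10,
    "URG": 0x20,
    "ECE": 0x40,
    "CWR": 0x80,
}


def decode_tcp_flags(flags_byte: int) -> list[str]:
    """Return the names of the flags that are set in a TCP flags byte."""
    return [name for name, mask in TCP_FLAGS.items() if flags_byte & mask]


syn_ack = decode_tcp_flags(0x12)    # second step of the 3-way handshake
ecn_setup = decode_tcp_flags(0xC2)  # SYN that also offers ECN support
print(syn_ack)
print(ecn_setup)
```

A flow-level column such as SYN.Flag.Count summarises these per-packet bits over all packets of the flow.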

In [88]:
network_traffic_analysis_dataframe.columns

Index(['Flow.ID', 'Source.IP', 'Source.Port', 'Destination.IP',
       'Destination.Port', 'Protocol', 'Timestamp', 'Flow.Duration',
       'Total.Fwd.Packets', 'Total.Backward.Packets',
       'Total.Length.of.Fwd.Packets', 'Total.Length.of.Bwd.Packets',
       'Fwd.Packet.Length.Max', 'Fwd.Packet.Length.Min',
       'Fwd.Packet.Length.Mean', 'Fwd.Packet.Length.Std',
       'Bwd.Packet.Length.Max', 'Bwd.Packet.Length.Min',
       'Bwd.Packet.Length.Mean', 'Bwd.Packet.Length.Std', 'Flow.Bytes.s',
       'Flow.Packets.s', 'Flow.IAT.Mean', 'Flow.IAT.Std', 'Flow.IAT.Max',
       'Flow.IAT.Min', 'Fwd.IAT.Total', 'Fwd.IAT.Mean', 'Fwd.IAT.Std',
       'Fwd.IAT.Max', 'Fwd.IAT.Min', 'Bwd.IAT.Total', 'Bwd.IAT.Mean',
       'Bwd.IAT.Std', 'Bwd.IAT.Max', 'Bwd.IAT.Min', 'Fwd.PSH.Flags',
       'Bwd.PSH.Flags', 'Fwd.URG.Flags', 'Bwd.URG.Flags', 'Fwd.Header.Length',
       'Bwd.Header.Length', 'Fwd.Packets.s', 'Bwd.Packets.s',
       'Min.Packet.Length', 'Max.Packet.Length', 'Packet.Length.Mean',
  

### Exploring Down Up Ratio distribution

In [89]:
network_traffic_analysis_dataframe["Down.Up.Ratio"].value_counts()

Down.Up.Ratio
0      1573265
1      1410146
2       305292
3       111856
4        72685
5        61585
6        25359
7         8727
8         3471
11        1618
9         1599
10         797
12         419
16          94
13          93
14          78
15          66
17          28
19          21
20          17
18          13
21          12
26           6
22           6
23           5
24           5
25           4
29           4
35           3
40           2
30           2
31           2
27           1
57           1
62           1
95           1
61           1
106          1
102          1
38           1
39           1
43           1
293          1
194          1
33           1
221          1
36           1
32           1
Name: count, dtype: int64

The purpose of using the KDD 99 dataset is to provide a reliable method for labelling this dataset, indicating whether each record is malicious.

In [90]:
network_traffic_analysis_dataframe

Unnamed: 0,Flow.ID,Source.IP,Source.Port,Destination.IP,Destination.Port,Protocol,Timestamp,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets,...,Active.Min,Idle.Mean,Idle.Std,Idle.Max,Idle.Min,Label,L7Protocol,ProtocolName,Date,Time
0,172.19.1.46-10.200.7.7-52422-3128-6,172.19.1.46,52422,10.200.7.7,3128,6,26/04/201711:11:17,45523,22,55,...,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY,26/04/2017,11:11:17
1,172.19.1.46-10.200.7.7-52422-3128-6,10.200.7.7,3128,172.19.1.46,52422,6,26/04/201711:11:17,1,2,0,...,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY,26/04/2017,11:11:17
2,10.200.7.217-50.31.185.39-38848-80-6,50.31.185.39,80,10.200.7.217,38848,6,26/04/201711:11:17,1,3,0,...,0.0,0.0,0.0,0.0,0.0,BENIGN,7,HTTP,26/04/2017,11:11:17
3,10.200.7.217-50.31.185.39-38848-80-6,50.31.185.39,80,10.200.7.217,38848,6,26/04/201711:11:17,217,1,3,...,0.0,0.0,0.0,0.0,0.0,BENIGN,7,HTTP,26/04/2017,11:11:17
4,192.168.72.43-10.200.7.7-55961-3128-6,192.168.72.43,55961,10.200.7.7,3128,6,26/04/201711:11:17,78068,5,0,...,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY,26/04/2017,11:11:17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3577291,10.200.7.199-98.138.79.73-42135-443-6,98.138.79.73,443,10.200.7.199,42135,6,15/05/201705:43:40,2290821,5,4,...,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL,15/05/2017,05:43:40
3577292,10.200.7.217-98.138.79.73-51546-443-6,98.138.79.73,443,10.200.7.217,51546,6,15/05/201705:46:10,24,5,0,...,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL,15/05/2017,05:46:10
3577293,10.200.7.218-98.138.79.73-44366-443-6,98.138.79.73,443,10.200.7.218,44366,6,15/05/201705:45:39,2591653,6,5,...,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL,15/05/2017,05:45:39
3577294,10.200.7.195-98.138.79.73-52341-443-6,98.138.79.73,443,10.200.7.195,52341,6,15/05/201705:45:59,2622421,4,3,...,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL,15/05/2017,05:45:59


# Data Columns Removal

### Remove Label

In [91]:
network_traffic_analysis_dataframe.drop(labels="Label", axis=1, inplace=True)

### Remove Timestamp

In [92]:
network_traffic_analysis_dataframe.drop(labels="Timestamp", axis=1, inplace=True)

### Remove date and time

In [93]:
network_traffic_analysis_dataframe.drop(labels="Date", axis=1, inplace=True)

In [94]:
network_traffic_analysis_dataframe.drop(labels="Time", axis=1,inplace=True)

### Remove ProtocolName

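ProtocolName is the application name for each record. It carries the same information as L7Protocol, which is kept as the prediction target, so leaving it among the features would leak the label.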

In [95]:
network_traffic_analysis_dataframe.drop(labels="ProtocolName", axis=1, inplace=True)

### Remove Flow.ID column

Flow.ID is an identifier built from the source and destination IP addresses, the ports, and the protocol of the flow. It adds no information that the model can learn from, so it is removed.

In [96]:
network_traffic_analysis_dataframe.drop(labels="Flow.ID", axis=1, inplace=True)

### Remove Source.IP and Destination.IP

In [97]:
network_traffic_analysis_dataframe.drop(labels="Source.IP", axis=1, inplace=True)

In [98]:
network_traffic_analysis_dataframe.drop(labels="Destination.IP", axis=1, inplace=True)
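
The separate drop cells above could also be written as a single call. A minimal sketch on a made-up two-row frame that reuses the dataset's column names; the `flows` and `features` names and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical two-row sample with the same column names as the dataset
flows = pd.DataFrame({
    "Flow.ID": ["a-b-1-2-6", "c-d-3-4-17"],
    "Source.IP": ["10.0.0.1", "10.0.0.2"],
    "Destination.IP": ["10.0.0.3", "10.0.0.4"],
    "Timestamp": ["26/04/201711:11:17", "26/04/201711:11:18"],
    "Label": ["BENIGN", "BENIGN"],
    "ProtocolName": ["HTTP", "DNS"],
    "Flow.Duration": [45523, 1],
    "L7Protocol": [7, 5],
})

columns_to_drop = [
    "Label", "Timestamp", "Date", "Time",
    "ProtocolName", "Flow.ID", "Source.IP", "Destination.IP",
]

# drop returns a new frame; errors="ignore" skips names that are not present
features = flows.drop(columns=columns_to_drop, errors="ignore")
print(list(features.columns))
```

Returning a new frame instead of using `inplace=True` keeps the original data available if a removed column is needed again later.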

In [99]:
network_traffic_analysis_dataframe.select_dtypes(["object"])

0
1
2
3
4
...
3577291
3577292
3577293
3577294
3577295


In [51]:
# Which distinct Source Ports appear in this dataset? (unique() only lists them; value_counts().head(20) would rank the top 20)

network_traffic_analysis_dataframe["Source.Port"].unique()

array([52422,  3128,    80, ...,  6507, 10192, 10182])

# Label the current dataset using Autoencoder

In [None]:
#https://www.kaggle.com/datasets/chethuhn/network-intrusion-dataset?select=Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv

# use the dataset to place a label in the main dataset

In [100]:
%pip install tensorflow

Defaulting to user installation because normal site-packages is not writeable
Collecting tensorflow
  Using cached tensorflow-2.18.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Using cached tensorflow-2.18.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (615.5 MB)
Installing collected packages: tensorflow
ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device: '/home/vscode/.local/lib/python3.12/site-packages/tensorflow/include/tensorflow/compiler/mlir/tensorflow/ir/host_runtime'
Note: you may need to restart the kernel to use updated packages.


In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

ModuleNotFoundError: No module named 'tensorflow.python'

In [None]:


# The dataset has no attack labels, so the autoencoder is trained to reconstruct
# its input and the reconstruction error is used as an anomaly score.
# Assumes every remaining column of the dataframe is numeric.

# Split into train/test sets
X_train, X_test = train_test_split(network_traffic_analysis_dataframe, test_size=0.2, random_state=42)

# Standardize the data (the scaler is fitted on the training split only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Add Gaussian noise to the input data (simulating real-world noisy flows)
def add_noise(x, noise_factor=0.5):
    # Standardized features are not limited to [0, 1], so the result is not clipped
    return x + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x.shape)

X_train_noisy = add_noise(X_train)
X_test_noisy = add_noise(X_test)

# Build one denoising autoencoder and its encoder half
def build_autoencoder(input_shape):
    input_layer = layers.Input(shape=input_shape)

    # Encoder part
    encoded = layers.Dense(256, activation='relu')(input_layer)
    encoded = layers.Dense(128, activation='relu')(encoded)
    encoded = layers.Dense(64, activation='relu')(encoded)

    # Decoder part (linear output because standardized features are unbounded)
    decoded = layers.Dense(128, activation='relu')(encoded)
    decoded = layers.Dense(256, activation='relu')(decoded)
    decoded = layers.Dense(input_shape[0], activation='linear')(decoded)

    # Autoencoder model (including both encoder and decoder)
    autoencoder = models.Model(input_layer, decoded)
    encoder = models.Model(input_layer, encoded)  # Encoder model for feature extraction

    return autoencoder, encoder

# Pre-train the autoencoders one after another
def pretrain_autoencoders(train_data, test_data, num_layers=3):
    encoders = []
    for _ in range(num_layers):
        # Build a new autoencoder sized for the current representation
        autoencoder, encoder = build_autoencoder((train_data.shape[1],))
        autoencoder.compile(optimizer=Adam(), loss='mse')

        # Train on noisy inputs with the clean data as the reconstruction target
        autoencoder.fit(add_noise(train_data), train_data, epochs=10, batch_size=128, shuffle=True, validation_data=(add_noise(test_data), test_data))

        # Freeze the encoder (do not train it in later stages)
        encoder.trainable = False

        # Store the encoder for stacking
        encoders.append(encoder)

        # The encoded representation becomes the training data for the next autoencoder
        train_data = encoder.predict(train_data)
        test_data = encoder.predict(test_data)

    return encoders

# Pre-train the autoencoders
encoders = pretrain_autoencoders(X_train, X_test, num_layers=3)

# Stack the encoders and add a reconstruction head
def build_stacked_autoencoder(encoders):
    input_layer = layers.Input(shape=(X_train.shape[1],))
    x = input_layer

    # Pass the input through each encoder sequentially
    for encoder in encoders:
        x = encoder(x)

    # Output layer (same dimension as input for reconstruction)
    reconstruction = layers.Dense(X_train.shape[1], activation='linear')(x)

    return models.Model(input_layer, reconstruction)

stacked_autoencoder = build_stacked_autoencoder(encoders)

# Compile the model
stacked_autoencoder.compile(optimizer=Adam(), loss='mse')

# Fine-tune the reconstruction head (the encoders stay frozen)
stacked_autoencoder.fit(X_train_noisy, X_train, epochs=20, batch_size=128, validation_data=(X_test_noisy, X_test))

# Evaluate the reconstruction loss on the test set
test_loss = stacked_autoencoder.evaluate(X_test_noisy, X_test)
print(f"Test Loss: {test_loss}")

# Score each flow by its mean squared reconstruction error
def reconstruction_error(data):
    return np.mean(np.square(data - stacked_autoencoder.predict(data)), axis=1)

# Assumption: flows above the 95th percentile of the training error are flagged (1)
threshold = np.percentile(reconstruction_error(X_train), 95)
predictions = (reconstruction_error(X_test) > threshold).astype(int)

print("Predictions on test set:")
print(predictions)

# Model Building

* Split the dataset into training and testing sets.
* Implement at least three different classification models (e.g., Decision Tree, Random Forest, SVM, etc.).
* Train and fine-tune each model using appropriate techniques.
* Discuss the choice of hyperparameters and the reasoning behind it.

In [52]:
sample_dataframe: pandas.DataFrame = network_traffic_analysis_dataframe

In [53]:
sample_dataframe.columns

Index(['Source.Port', 'Destination.Port', 'Protocol', 'Flow.Duration',
       'Total.Fwd.Packets', 'Total.Backward.Packets',
       'Total.Length.of.Fwd.Packets', 'Total.Length.of.Bwd.Packets',
       'Fwd.Packet.Length.Max', 'Fwd.Packet.Length.Min',
       'Fwd.Packet.Length.Mean', 'Fwd.Packet.Length.Std',
       'Bwd.Packet.Length.Max', 'Bwd.Packet.Length.Min',
       'Bwd.Packet.Length.Mean', 'Bwd.Packet.Length.Std', 'Flow.Bytes.s',
       'Flow.Packets.s', 'Flow.IAT.Mean', 'Flow.IAT.Std', 'Flow.IAT.Max',
       'Flow.IAT.Min', 'Fwd.IAT.Total', 'Fwd.IAT.Mean', 'Fwd.IAT.Std',
       'Fwd.IAT.Max', 'Fwd.IAT.Min', 'Bwd.IAT.Total', 'Bwd.IAT.Mean',
       'Bwd.IAT.Std', 'Bwd.IAT.Max', 'Bwd.IAT.Min', 'Fwd.PSH.Flags',
       'Bwd.PSH.Flags', 'Fwd.URG.Flags', 'Bwd.URG.Flags', 'Fwd.Header.Length',
       'Bwd.Header.Length', 'Fwd.Packets.s', 'Bwd.Packets.s',
       'Min.Packet.Length', 'Max.Packet.Length', 'Packet.Length.Mean',
       'Packet.Length.Std', 'Packet.Length.Variance', 'FIN.Flag.

## Using KNN to produce the grouping in the dataset 

Unnamed: 0,Source.Port,Destination.Port,Protocol,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets,Total.Length.of.Fwd.Packets,Total.Length.of.Bwd.Packets,Fwd.Packet.Length.Max,Fwd.Packet.Length.Min,...,min_seg_size_forward,Active.Mean,Active.Std,Active.Max,Active.Min,Idle.Mean,Idle.Std,Idle.Max,Idle.Min,L7Protocol
0,52422,3128,6,45523,22,55,132,110414.0,6,6,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,131
1,3128,52422,6,1,2,0,12,0.0,6,6,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,131
2,80,38848,6,1,3,0,674,0.0,337,0,...,32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7
3,80,38848,6,217,1,3,0,0.0,0,0,...,32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7
4,55961,3128,6,78068,5,0,1076,0.0,529,6,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,131
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3577291,443,42135,6,2290821,5,4,599,2159.0,599,0,...,32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,91
3577292,443,51546,6,24,5,0,1448,0.0,1448,0,...,32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,91
3577293,443,44366,6,2591653,6,5,1202,4184.0,601,0,...,32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,91
3577294,443,52341,6,2622421,4,3,632,2352.0,352,0,...,32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,91


## Creating target column

In [87]:
target = sample_dataframe["L7Protocol"]

In [88]:
train = sample_dataframe.drop(labels="L7Protocol", axis=1)

## Prepare Training dataset and Testing dataset

In [81]:
from sklearn.model_selection import train_test_split

# Training dataset: 80%
# Test Dataset: 20%

X_train, X_test, Y_train, Y_test = train_test_split(train, target, test_size=0.20, train_size=0.80, random_state=11, shuffle=True)

## Create Validation Dataset from training dataset

In [82]:
# Training dataset: 60%
# Validation Dataset: 20% (0.25 Current test size to obtain validation dataset * 0.80 Percentage of Train data from original dataset)
# Test Dataset: 20%

X_train, X_validate, Y_train, Y_validate = train_test_split(X_train, Y_train, test_size=0.25, train_size=0.75, random_state=11, shuffle=True)
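
The 60/20/20 proportions stated in the comment can be checked on a toy input. A minimal sketch, assuming 100 dummy samples so that the split sizes read directly as percentages; the variable names are hypothetical.

```python
from sklearn.model_selection import train_test_split

# 100 dummy samples so the split sizes read directly as percentages
samples = list(range(100))

# First split: 80% train+validation, 20% test
train_val, test = train_test_split(samples, test_size=0.20, random_state=11)

# Second split: 25% of the remaining 80% becomes validation (0.25 * 0.80 = 0.20)
train, validate = train_test_split(train_val, test_size=0.25, random_state=11)

print(len(train), len(validate), len(test))
```

The three subsets do not overlap, so the validation set can be used for tuning while the test set stays untouched until the final evaluation.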


## Decision Tree

In [83]:
from sklearn import tree

decision_tree_classifier = tree.DecisionTreeClassifier()

decision_tree_classifier.fit(X_train, Y_train)


In [84]:
y_predict_from_x_test = decision_tree_classifier.predict(X_test)

In [85]:
decision_tree_classifier.score(X_test, Y_test)

0.7235498839907193

In [86]:
from sklearn.metrics import classification_report

print(classification_report(Y_test, y_predict_from_x_test))

              precision    recall  f1-score   support

           1       0.86      1.00      0.92         6
           5       0.73      0.65      0.68       353
           7       0.82      0.82      0.82    136949
           9       1.00      1.00      1.00        28
          11       0.00      0.00      0.00         0
          13       0.00      0.00      0.00         3
          36       0.35      0.38      0.36        16
          37       0.60      1.00      0.75         3
          40       0.89      0.90      0.90      1719
          48       0.00      0.00      0.00         1
          51       1.00      0.33      0.50         3
          60       0.40      0.37      0.38       100
          64       0.23      0.28      0.25       170
          67       0.57      1.00      0.73         8
          68       0.46      0.47      0.46      2915
          69       0.00      0.00      0.00         3
          70       0.33      0.36      0.35      4162
          81       1.00    

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
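
The `_warn_prf` lines above most likely come from scikit-learn's warning about undefined metrics, which is raised when a class is never predicted or, like class 11 in the report, has no samples in the test split. The sketch below uses made-up labels to show how `zero_division=0` silences that warning and how macro and weighted F1 differ on imbalanced classes. The label values are hypothetical.

```python
from sklearn.metrics import classification_report, f1_score

# Made-up labels: class 7 dominates, class 13 is rare and never predicted
y_true = [7, 7, 7, 7, 7, 7, 5, 5, 13]
y_pred = [7, 7, 7, 7, 7, 7, 5, 7, 7]

# zero_division=0 reports 0.0 for undefined metrics without emitting a warning
print(classification_report(y_true, y_pred, zero_division=0))

# Macro F1 weights every class equally; weighted F1 weights by class support
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
print(f"macro F1: {macro_f1:.3f}, weighted F1: {weighted_f1:.3f}")
```

Because a few applications account for most flows in this dataset, accuracy and weighted F1 are dominated by those classes, so macro F1 is worth reporting next to them.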


#TODO: Produce two types of model evaluation using test dataset and validation dataset

## SVM

## Random Forest
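
These two models are not implemented in the notebook yet. The following is a sketch of how they might be trained, not the notebook's own method. It assumes synthetic data from `make_classification` in place of the flow features and swaps in `LinearSVC` for a kernel SVM, because kernel SVMs are generally slow on millions of rows. The hyperparameter values are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the flow features and the L7Protocol target
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=11
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=11
)

candidate_models = {
    # Tree ensembles do not need feature scaling
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=11),
    # Linear SVMs are sensitive to feature scale, so standardize first
    "linear_svm": make_pipeline(
        StandardScaler(), LinearSVC(random_state=11, max_iter=5000)
    ),
}

scores = {}
for name, model in candidate_models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking test statistics into the model.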

# Model Evaluation

* Evaluate the models using appropriate classification metrics (accuracy, precision, recall, F1-score, etc.).
* Visualize the model performance using ROC curves and confusion matrices.
* Compare the models and justify your choice of the best-performing model.
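
A confusion matrix for the evaluation step could be computed as follows. This is a minimal sketch on made-up predictions that reuse three L7Protocol codes from the mapping earlier in the notebook (5 = DNS, 7 = HTTP, 91 = SSL). The label lists are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted L7Protocol codes
y_true = [5, 7, 91, 91, 7]
y_pred = [5, 91, 91, 91, 7]
labels = [5, 7, 91]

# Rows are true classes, columns are predicted classes, in the order of labels
matrix = confusion_matrix(y_true, y_pred, labels=labels)
print(matrix)
```

Passing `labels` explicitly fixes the row and column order, which helps when some classes are missing from a split. ROC curves are defined for binary problems, so with this many classes they would need a one-vs-rest treatment per class.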

# Conclusion

# Reference