# Network Traffic Classification using Machine Learning Techniques

# Overview

Develop classification models in Python to analyze a network traffic dataset.

The primary goal is to explore the dataset, preprocess it, create and evaluate different classification models, and report your findings. 

This assignment will enhance your understanding of machine learning techniques, data preprocessing, and model evaluation while applying them to a practical problem related to network security.

# Dataset

This real-world dataset was built from network traffic captured at Universidad del Cauca, Popayán, Colombia, over six days in 2017 (April 26, 27, 28 and May 9, 11, 15) using several packet-capture and data-extraction tools.

The dataset consists of 3,577,296 instances and 87 features and was originally designed for application classification. Each row represents a traffic flow from a source to a destination, and each column represents a feature of that flow.

The dataset was downloaded from Kaggle: "IP Network Traffic Flows, Labeled with 75 Apps."

# Purpose

# Literature Review

# Assumptions Made

# Origin of CICFlowMeter

https://www.unb.ca/cic/research/applications.html#CICFlowMeter

https://www.kaggle.com/datasets/jsrojas/ip-network-traffic-flows-labeled-with-87-apps 

https://www.ntop.org/products/traffic-analysis/ntop/

# Environment Setup

In [1]:
!pip install --upgrade pip
!pip install setuptools -U
!pip install pandas -U
!pip install -U scikit-learn
!pip install kagglehub -U

!pip install matplotlib -U
!pip install seaborn -U

Defaulting to user installation because normal site-packages is not writeable
Collecting matplotlib
  Downloading matplotlib-3.9.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting contourpy>=1.0.1 (from matplotlib)
  Downloading contourpy-1.3.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.4 kB)
Collecting cycler>=0.10 (from matplotlib)
  Downloading cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib)
  Downloading fonttools-4.54.1-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x

In [2]:
%pip install ipywidgets

Defaulting to user installation because normal site-packages is not writeable
Collecting ipywidgets
  Downloading ipywidgets-8.1.5-py3-none-any.whl.metadata (2.3 kB)
Collecting widgetsnbextension~=4.0.12 (from ipywidgets)
  Downloading widgetsnbextension-4.0.13-py3-none-any.whl.metadata (1.6 kB)
Collecting jupyterlab-widgets~=3.0.12 (from ipywidgets)
  Downloading jupyterlab_widgets-3.0.13-py3-none-any.whl.metadata (4.1 kB)
Downloading ipywidgets-8.1.5-py3-none-any.whl (139 kB)
Downloading jupyterlab_widgets-3.0.13-py3-none-any.whl (214 kB)
Downloading widgetsnbextension-4.0.13-py3-none-any.whl (2.3 MB)
Installing collected packages: widgetsnbextension, jupyterlab-widgets, ipywidgets
Successfully installed ipywidgets-8.1.5 jupyterlab-widgets-3.0.13 widgetsnbextension-4.0.13
Note: you may need to restart the kernel to use updated packages.


In [None]:
%pip install fireducks

In [5]:
import kagglehub
import pandas
import numpy

import matplotlib
import seaborn
import fireducks.pandas

# Data Retrieval

Retrieve the data with the kagglehub package, which simplifies the download.

In [2]:
path = kagglehub.dataset_download("jsrojas/ip-network-traffic-flows-labeled-with-87-apps")

In [3]:
print(path)

/home/vscode/.cache/kagglehub/datasets/jsrojas/ip-network-traffic-flows-labeled-with-87-apps/versions/1


# Data Loading into DataFrame

Load the data into a pandas-compatible DataFrame (via FireDucks) for data exploration.
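
The download step returns a directory rather than a file, so the CSV has to be located inside it before it can be read. The sketch below is one hedged way to do that with the standard library only. The helper name `find_dataset_csv` is hypothetical, and it assumes the CSV sits somewhere under the directory returned by `kagglehub.dataset_download`.

```python
from pathlib import Path


def find_dataset_csv(download_dir: str, filename: str) -> Path:
    """Return the first file called `filename` found under `download_dir`."""
    # rglob searches the directory tree recursively; sorting makes the pick deterministic.
    matches = sorted(Path(download_dir).rglob(filename))
    if not matches:
        raise FileNotFoundError(f"{filename} not found under {download_dir}")
    return matches[0]
```

With the notebook's variables this could be called as `find_dataset_csv(path, "Dataset-Unicauca-Version2-87Atts.csv")`, and the result passed to `read_csv`, so the load does not depend on the current working directory.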

In [10]:
network_traffic_analysis_dataframe: fireducks.pandas.frame.DataFrame = fireducks.pandas.read_csv("Dataset-Unicauca-Version2-87Atts.csv")

In [11]:
network_traffic_analysis_dataframe

Unnamed: 0,Flow.ID,Source.IP,Source.Port,Destination.IP,Destination.Port,Protocol,Timestamp,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets,...,Active.Std,Active.Max,Active.Min,Idle.Mean,Idle.Std,Idle.Max,Idle.Min,Label,L7Protocol,ProtocolName
0,172.19.1.46-10.200.7.7-52422-3128-6,172.19.1.46,52422,10.200.7.7,3128,6,26/04/201711:11:17,45523,22,55,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY
1,172.19.1.46-10.200.7.7-52422-3128-6,10.200.7.7,3128,172.19.1.46,52422,6,26/04/201711:11:17,1,2,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY
2,10.200.7.217-50.31.185.39-38848-80-6,50.31.185.39,80,10.200.7.217,38848,6,26/04/201711:11:17,1,3,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,7,HTTP
3,10.200.7.217-50.31.185.39-38848-80-6,50.31.185.39,80,10.200.7.217,38848,6,26/04/201711:11:17,217,1,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,7,HTTP
4,192.168.72.43-10.200.7.7-55961-3128-6,192.168.72.43,55961,10.200.7.7,3128,6,26/04/201711:11:17,78068,5,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3577291,10.200.7.199-98.138.79.73-42135-443-6,98.138.79.73,443,10.200.7.199,42135,6,15/05/201705:43:40,2290821,5,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL
3577292,10.200.7.217-98.138.79.73-51546-443-6,98.138.79.73,443,10.200.7.217,51546,6,15/05/201705:46:10,24,5,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL
3577293,10.200.7.218-98.138.79.73-44366-443-6,98.138.79.73,443,10.200.7.218,44366,6,15/05/201705:45:39,2591653,6,5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL
3577294,10.200.7.195-98.138.79.73-52341-443-6,98.138.79.73,443,10.200.7.195,52341,6,15/05/201705:45:59,2622421,4,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL


# Data Exploration

### Data types of all columns in the dataset

In [25]:
print(network_traffic_analysis_dataframe.dtypes[:20])

Source.IP                       object
Source.Port                      int64
Destination.IP                  object
Destination.Port                 int64
Protocol                         int64
Timestamp                       object
Flow.Duration                    int64
Total.Fwd.Packets                int64
Total.Backward.Packets           int64
Total.Length.of.Fwd.Packets      int64
Total.Length.of.Bwd.Packets    float64
Fwd.Packet.Length.Max            int64
Fwd.Packet.Length.Min            int64
Fwd.Packet.Length.Mean         float64
Fwd.Packet.Length.Std          float64
Bwd.Packet.Length.Max            int64
Bwd.Packet.Length.Min            int64
Bwd.Packet.Length.Mean         float64
Bwd.Packet.Length.Std          float64
Flow.Bytes.s                   float64
dtype: object


In [26]:
print(network_traffic_analysis_dataframe.dtypes[21:40])

Flow.IAT.Mean        float64
Flow.IAT.Std         float64
Flow.IAT.Max         float64
Flow.IAT.Min           int64
Fwd.IAT.Total        float64
Fwd.IAT.Mean         float64
Fwd.IAT.Std          float64
Fwd.IAT.Max          float64
Fwd.IAT.Min          float64
Bwd.IAT.Total        float64
Bwd.IAT.Mean         float64
Bwd.IAT.Std          float64
Bwd.IAT.Max          float64
Bwd.IAT.Min          float64
Fwd.PSH.Flags          int64
Bwd.PSH.Flags          int64
Fwd.URG.Flags          int64
Bwd.URG.Flags          int64
Fwd.Header.Length      int64
dtype: object


In [27]:
print(network_traffic_analysis_dataframe.dtypes[41:60])

Fwd.Packets.s             float64
Bwd.Packets.s             float64
Min.Packet.Length           int64
Max.Packet.Length           int64
Packet.Length.Mean        float64
Packet.Length.Std         float64
Packet.Length.Variance    float64
FIN.Flag.Count              int64
SYN.Flag.Count              int64
RST.Flag.Count              int64
PSH.Flag.Count              int64
ACK.Flag.Count              int64
URG.Flag.Count              int64
CWE.Flag.Count              int64
ECE.Flag.Count              int64
Down.Up.Ratio               int64
Average.Packet.Size       float64
Avg.Fwd.Segment.Size      float64
Avg.Bwd.Segment.Size      float64
dtype: object


In [28]:
print(network_traffic_analysis_dataframe.dtypes[71:])

Init_Win_bytes_forward       int64
Init_Win_bytes_backward      int64
act_data_pkt_fwd             int64
min_seg_size_forward         int64
Active.Mean                float64
Active.Std                 float64
Active.Max                 float64
Active.Min                 float64
Idle.Mean                  float64
Idle.Std                   float64
Idle.Max                   float64
Idle.Min                   float64
Label                       object
L7Protocol                   int64
ProtocolName                object
dtype: object
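
The slices used above (`[:20]`, `[21:40]`, `[41:60]`, `[71:]`) leave out the columns at positions 20, 40, and 60 to 70. A minimal sketch of one way to walk every column in consecutive chunks follows. The helper name `iter_column_chunks` and the toy frame `demo` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def iter_column_chunks(frame: pd.DataFrame, chunk_size: int = 20):
    """Yield consecutive, non-overlapping column slices that cover every column."""
    for start in range(0, frame.shape[1], chunk_size):
        yield frame.iloc[:, start:start + chunk_size]


# Toy frame with 45 columns standing in for the real dataset.
demo = pd.DataFrame({f"col_{i}": [i, i + 1] for i in range(45)})

for chunk in iter_column_chunks(demo, chunk_size=20):
    print(chunk.dtypes)
```

The same generator could be reused for `describe()` or `isnull().sum()`, which avoids hand-written slice boundaries.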


### Features and their descriptions

- Flow duration		
    - Duration of the flow in microseconds
- total Fwd Packet		
    - Total packets in the forward direction
- total Bwd packets		
    - Total packets in the backward direction
- total Length of Fwd Packet	
    - Total size of packet in forward direction
- total Length of Bwd Packet	
    - Total size of packet in backward direction
- Fwd Packet Length Min 		
    - Minimum size of packet in forward direction
- Fwd Packet Length Max 		
    - Maximum size of packet in forward direction
- Fwd Packet Length Mean		
    - Mean size of packet in forward direction
- Fwd Packet Length Std		
    - Standard deviation size of packet in forward direction
- Bwd Packet Length Min		
    - Minimum size of packet in backward direction
- Bwd Packet Length Max		
    - Maximum size of packet in backward direction
- Bwd Packet Length Mean		
    - Mean size of packet in backward direction
- Bwd Packet Length Std		
    - Standard deviation size of packet in backward direction
- Flow Byte/s
    - Number of flow bytes per second
- Flow Packets/s
    - Number of flow packets per second
- Flow IAT Mean			
    - Mean time between two packets sent in the flow
- Flow IAT Std			
    - Standard deviation time between two packets sent in the flow
- Flow IAT Max			
    - Maximum time between two packets sent in the flow
- Flow IAT Min			
    - Minimum time between two packets sent in the flow
- Fwd IAT Min			
    - Minimum time between two packets sent in the forward direction
- Fwd IAT Max			
    - Maximum time between two packets sent in the forward direction
- Fwd IAT Mean			
    - Mean time between two packets sent in the forward direction
- Fwd IAT Std			
    - Standard deviation time between two packets sent in the forward direction
- Fwd IAT Total   		
    - Total time between two packets sent in the forward direction
- Bwd IAT Min			
    - Minimum time between two packets sent in the backward direction
- Bwd IAT Max			
    - Maximum time between two packets sent in the backward direction
- Bwd IAT Mean			
    - Mean time between two packets sent in the backward direction
- Bwd IAT Std			
    - Standard deviation time between two packets sent in the backward direction
- Bwd IAT Total			
    - Total time between two packets sent in the backward direction
- Fwd PSH flag			
    - Number of times the PSH flag was set in packets travelling in the forward direction (0 for UDP)
- Bwd PSH Flag			
    - Number of times the PSH flag was set in packets travelling in the backward direction (0 for UDP)
- Fwd URG Flag			
    - Number of times the URG flag was set in packets travelling in the forward direction (0 for UDP)
- Bwd URG Flag			
    - Number of times the URG flag was set in packets travelling in the backward direction (0 for UDP)
- Fwd Header Length		
    - Total bytes used for headers in the forward direction
- Bwd Header Length		
    - Total bytes used for headers in the backward direction
- FWD Packets/s			
    - Number of forward packets per second
- Bwd Packets/s			
    - Number of backward packets per second
- Min Packet Length 		
    - Minimum length of a packet
- Max Packet Length 		
    - Maximum length of a packet
- Packet Length Mean 		
    - Mean length of a packet
- Packet Length Std		
    - Standard deviation length of a packet
- Packet Length Variance  	
    - Variance length of a packet
- FIN Flag Count 			
    - Number of packets with FIN
- SYN Flag Count 			
    - Number of packets with SYN
- RST Flag Count 			
    - Number of packets with RST
- PSH Flag Count 			
    - Number of packets with PUSH
- ACK Flag Count 			
    - Number of packets with ACK
- URG Flag Count 			
    - Number of packets with URG
- CWR Flag Count
    - Number of packets with CWR (the dataset column is named CWE.Flag.Count)
- ECE Flag Count 			
    - Number of packets with ECE
- down/Up Ratio			
    - Download and upload ratio
- Average Packet Size 		
    - Average size of packet
- Avg Fwd Segment Size
    - Average segment size observed in the forward direction
- AVG Bwd Segment Size
    - Average segment size observed in the backward direction
- Fwd Header Length		
    - Length of the forward packet header
- Fwd Avg Bytes/Bulk		
    - Average number of bytes bulk rate in the forward direction
- Fwd AVG Packet/Bulk 		
    - Average number of packets bulk rate in the forward direction
- Fwd AVG Bulk Rate 		
    - Average number of bulk rate in the forward direction
- Bwd Avg Bytes/Bulk		
    - Average number of bytes bulk rate in the backward direction
- Bwd AVG Packet/Bulk 		
    - Average number of packets bulk rate in the backward direction
- Bwd AVG Bulk Rate 		
    - Average number of bulk rate in the backward direction
- Subflow Fwd Packets		
    - The average number of packets in a sub flow in the forward direction
- Subflow Fwd Bytes		
    - The average number of bytes in a sub flow in the forward direction
- Subflow Bwd Packets		
    - The average number of packets in a sub flow in the backward direction
- Subflow Bwd Bytes		
    - The average number of bytes in a sub flow in the backward direction
- Init_Win_bytes_forward		
    - The total number of bytes sent in initial window in the forward direction
- Init_Win_bytes_backward		
    - The total number of bytes sent in initial window in the backward direction
- Act_data_pkt_forward		
    - Count of packets with at least 1 byte of TCP data payload in the forward direction
- min_seg_size_forward		
    - Minimum segment size observed in the forward direction
- Active Min			
    - Minimum time a flow was active before becoming idle
- Active Mean			
    - Mean time a flow was active before becoming idle
- Active Max			
    - Maximum time a flow was active before becoming idle
- Active Std			
    - Standard deviation time a flow was active before becoming idle
- Idle Min			
    - Minimum time a flow was idle before becoming active
- Idle Mean			
    - Mean time a flow was idle before becoming active
- Idle Max			
    - Maximum time a flow was idle before becoming active
- Idle Std			
    - Standard deviation time a flow was idle before becoming active
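
To make the per-second features concrete, the sketch below recomputes byte and packet rates from the flow totals. It assumes, without confirmation from the source, that the rates are the forward plus backward totals divided by the flow duration converted from microseconds to seconds. The function name `flow_rates` and the toy values are hypothetical.

```python
import pandas as pd

MICROSECONDS_PER_SECOND = 1_000_000


def flow_rates(frame: pd.DataFrame) -> pd.DataFrame:
    """Recompute per-second rates from flow totals (assumed formula, see lead-in)."""
    seconds = frame["Flow.Duration"] / MICROSECONDS_PER_SECOND
    total_bytes = (
        frame["Total.Length.of.Fwd.Packets"] + frame["Total.Length.of.Bwd.Packets"]
    )
    total_packets = frame["Total.Fwd.Packets"] + frame["Total.Backward.Packets"]
    return pd.DataFrame(
        {
            "bytes_per_second": total_bytes / seconds,
            "packets_per_second": total_packets / seconds,
        }
    )


# Toy flows: a 2-second flow and a 0.5-second flow.
demo = pd.DataFrame(
    {
        "Flow.Duration": [2_000_000, 500_000],
        "Total.Fwd.Packets": [6, 1],
        "Total.Backward.Packets": [4, 1],
        "Total.Length.of.Fwd.Packets": [600, 100],
        "Total.Length.of.Bwd.Packets": [400.0, 0.0],
    }
)
print(flow_rates(demo))
```

Comparing such recomputed values with Flow.Bytes.s and Flow.Packets.s on a few rows would be a quick way to confirm or reject the assumed formula.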

### Dataset Information

In [32]:
network_traffic_analysis_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3577296 entries, 0 to 3577295
Data columns (total 86 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   Source.IP                    object 
 1   Source.Port                  int64  
 2   Destination.IP               object 
 3   Destination.Port             int64  
 4   Protocol                     int64  
 5   Timestamp                    object 
 6   Flow.Duration                int64  
 7   Total.Fwd.Packets            int64  
 8   Total.Backward.Packets       int64  
 9   Total.Length.of.Fwd.Packets  int64  
 10  Total.Length.of.Bwd.Packets  float64
 11  Fwd.Packet.Length.Max        int64  
 12  Fwd.Packet.Length.Min        int64  
 13  Fwd.Packet.Length.Mean       float64
 14  Fwd.Packet.Length.Std        float64
 15  Bwd.Packet.Length.Max        int64  
 16  Bwd.Packet.Length.Min        int64  
 17  Bwd.Packet.Length.Mean       float64
 18  Bwd.Packet.Length.Std        float64
 19  

### Describe the dataset

In [41]:
network_traffic_analysis_dataframe.iloc[:,:10].describe()

Unnamed: 0,Source.Port,Destination.Port,Protocol,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets,Total.Length.of.Fwd.Packets
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,37999.38,12042.46,6.005508,25442470.0,62.37799,65.34083,46833.23
std,22017.13,20449.16,0.3274574,40144300.0,1094.086,1108.092,1816196.0
min,0.0,0.0,0.0,1.0,1.0,0.0,0.0
25%,3697.0,443.0,6.0,628.0,2.0,1.0,12.0
50%,49377.0,3128.0,6.0,584729.5,6.0,5.0,443.0
75%,53799.0,3128.0,6.0,45001530.0,15.0,15.0,1769.0
max,65534.0,65534.0,17.0,120000000.0,453190.0,542196.0,678023600.0


In [42]:
network_traffic_analysis_dataframe.iloc[:,11:20].describe()

Unnamed: 0,Fwd.Packet.Length.Max,Fwd.Packet.Length.Min,Fwd.Packet.Length.Mean,Fwd.Packet.Length.Std,Bwd.Packet.Length.Max,Bwd.Packet.Length.Min,Bwd.Packet.Length.Mean,Bwd.Packet.Length.Std,Flow.Bytes.s
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,512.3645,9.340408,114.9212,152.0501,1103.231,11.13491,254.7845,289.8878,4048709.0
std,1039.319,82.99983,246.4707,240.4702,2352.374,105.5422,506.0731,485.3004,75510400.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,6.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,18.82429
50%,206.0,0.0,46.57143,74.21124,81.0,0.0,30.14286,32.42474,1140.944
75%,613.0,6.0,122.5,207.9035,1366.0,0.0,256.75,423.2105,23437.5
max,32832.0,16060.0,16060.0,6225.487,37648.0,13032.0,13032.0,8434.804,14396000000.0


In [43]:
network_traffic_analysis_dataframe.iloc[:,21:30].describe()

Unnamed: 0,Flow.IAT.Mean,Flow.IAT.Std,Flow.IAT.Max,Flow.IAT.Min,Fwd.IAT.Total,Fwd.IAT.Mean,Fwd.IAT.Std,Fwd.IAT.Max,Fwd.IAT.Min
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,1422201.0,3365395.0,12850200.0,88702.01,24187960.0,3124467.0,3649620.0,12096240.0,1271532.0
std,3550414.0,6260959.0,20765180.0,1605272.0,39625630.0,8358652.0,7390979.0,20491800.0,7279117.0
min,0.2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,415.0,8.485281,570.0,0.0,7.0,5.0,0.0,6.0,0.0
50%,33202.38,68364.44,281239.5,1.0,389264.5,37006.79,47175.96,207629.0,0.0
75%,936657.6,3980748.0,23915460.0,33.0,40011610.0,1549711.0,2932647.0,19269760.0,92.0
max,120000000.0,84852730.0,120000000.0,120000000.0,120000000.0,120000000.0,84852560.0,120000000.0,120000000.0


In [44]:
network_traffic_analysis_dataframe.iloc[:,31:40].describe()

Unnamed: 0,Bwd.IAT.Mean,Bwd.IAT.Std,Bwd.IAT.Max,Bwd.IAT.Min,Fwd.PSH.Flags,Bwd.PSH.Flags,Fwd.URG.Flags,Bwd.URG.Flags,Fwd.Header.Length
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,2476877.0,2932460.0,9830803.0,888999.1,0.1720414,0.0,0.0,0.0,1653.339
std,7578111.0,6666650.0,18835210.0,6231082.0,0.3774165,0.0,0.0,0.0,30088.9
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.0
50%,15587.65,26175.95,95181.5,0.0,0.0,0.0,0.0,0.0,152.0
75%,334214.2,752634.2,7508778.0,1.0,0.0,0.0,0.0,0.0,392.0
max,119999900.0,84852750.0,119999900.0,119999900.0,1.0,0.0,0.0,0.0,15439500.0


In [45]:
network_traffic_analysis_dataframe.iloc[:,41:50].describe()

Unnamed: 0,Fwd.Packets.s,Bwd.Packets.s,Min.Packet.Length,Max.Packet.Length,Packet.Length.Mean,Packet.Length.Std,Packet.Length.Variance,FIN.Flag.Count,SYN.Flag.Count
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,77058.16,11905.22,3.043745,1333.25,198.8191,303.519,279273.6,0.007037159,0.1720414
std,368315.3,108020.6,41.45472,2453.395,332.7427,432.6083,725860.8,0.0835921,0.3774165
min,0.008333337,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.5417242,0.1009873,0.0,6.0,6.0,0.0,0.0,0.0,0.0
50%,15.63422,2.951696,0.0,355.0,62.83333,106.9828,11445.31,0.0,0.0
75%,2164.502,83.44459,6.0,1460.0,250.0,481.8125,232143.2,0.0,0.0
max,6000000.0,5000000.0,7063.0,37648.0,10708.67,9268.781,85910310.0,1.0,1.0


In [46]:
network_traffic_analysis_dataframe.iloc[:,51:60].describe()

Unnamed: 0,PSH.Flag.Count,ACK.Flag.Count,URG.Flag.Count,CWE.Flag.Count,ECE.Flag.Count,Down.Up.Ratio,Average.Packet.Size,Avg.Fwd.Segment.Size,Avg.Bwd.Segment.Size
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,0.405821,0.5995705,0.2773847,0.0,0.0006566412,0.9085471,207.563,114.9212,254.7845
std,0.4910503,0.4899855,0.447708,0.0,0.0256166,1.269945,343.227,246.4707,506.0731
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,9.0,6.0,0.0
50%,0.0,1.0,0.0,0.0,0.0,1.0,66.5,46.57143,30.14286
75%,1.0,1.0,1.0,0.0,0.0,1.0,263.7184,122.5,256.75
max,1.0,1.0,1.0,0.0,1.0,293.0,16063.0,16060.0,13032.0


In [47]:
network_traffic_analysis_dataframe.iloc[:,61:70].describe()

Unnamed: 0,Fwd.Avg.Bytes.Bulk,Fwd.Avg.Packets.Bulk,Fwd.Avg.Bulk.Rate,Bwd.Avg.Bytes.Bulk,Bwd.Avg.Packets.Bulk,Bwd.Avg.Bulk.Rate,Subflow.Fwd.Packets,Subflow.Fwd.Bytes,Subflow.Bwd.Packets
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,0.0,0.0,0.0,0.0,0.0,0.0,62.37799,46833.23,65.34083
std,0.0,0.0,0.0,0.0,0.0,0.0,1094.086,1816196.0,1108.092
min,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,2.0,12.0,1.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,6.0,443.0,5.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,15.0,1769.0,15.0
max,0.0,0.0,0.0,0.0,0.0,0.0,453190.0,678023600.0,542196.0


In [50]:
network_traffic_analysis_dataframe.iloc[:,71:80].describe()

Unnamed: 0,Init_Win_bytes_forward,Init_Win_bytes_backward,act_data_pkt_fwd,min_seg_size_forward,Active.Mean,Active.Std,Active.Max,Active.Min,Idle.Mean
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,8984.691,2123.489,45.03535,25.69738,298199.0,183640.6,522937.2,167633.6,8524211.0
std,14101.26,7704.789,974.8192,6.025989,2349390.0,1325838.0,3266508.0,2064219.0,17065680.0
min,-1.0,-1.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0
25%,411.0,18.0,0.0,20.0,0.0,0.0,0.0,0.0,0.0
50%,5840.0,262.0,2.0,20.0,0.0,0.0,0.0,0.0,0.0
75%,14600.0,660.0,9.0,32.0,45.0,0.0,57.0,2.0,7506747.0
max,65535.0,65535.0,328694.0,523.0,114695000.0,72971360.0,114695000.0,114695000.0,120000000.0


In [51]:
network_traffic_analysis_dataframe.iloc[:,81:].describe()

Unnamed: 0,Idle.Max,Idle.Min,L7Protocol
count,3577296.0,3577296.0,3577296.0
mean,9743845.0,7252097.0,102.9508
std,18885570.0,16007540.0,51.29198
min,0.0,0.0,1.0
25%,0.0,0.0,91.0
50%,0.0,0.0,126.0
75%,8034389.0,5369712.0,130.0
max,120000000.0,120000000.0,222.0
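
Several columns in the summaries above have a standard deviation and maximum of 0 (for example Bwd.PSH.Flags, Fwd.URG.Flags, CWE.Flag.Count, and the Fwd/Bwd bulk columns), so they cannot help a classifier. A minimal sketch for finding such columns automatically is given here. The helper name `constant_columns` and the toy frame are assumptions for illustration.

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list[str]:
    """Return the names of columns that hold a single distinct value."""
    return [name for name in frame.columns if frame[name].nunique(dropna=False) <= 1]


# Toy frame: one varying column and two constant ones.
demo = pd.DataFrame(
    {
        "Flow.Duration": [45523, 1, 217],
        "Bwd.PSH.Flags": [0, 0, 0],
        "Label": ["BENIGN", "BENIGN", "BENIGN"],
    }
)
print(constant_columns(demo))
```

The returned list could be passed to `drop(columns=...)` during data preparation.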


### Detect any null values in the dataset

In [55]:
network_traffic_analysis_dataframe.isnull().sum()[:20]

Source.IP                      0
Source.Port                    0
Destination.IP                 0
Destination.Port               0
Protocol                       0
Timestamp                      0
Flow.Duration                  0
Total.Fwd.Packets              0
Total.Backward.Packets         0
Total.Length.of.Fwd.Packets    0
Total.Length.of.Bwd.Packets    0
Fwd.Packet.Length.Max          0
Fwd.Packet.Length.Min          0
Fwd.Packet.Length.Mean         0
Fwd.Packet.Length.Std          0
Bwd.Packet.Length.Max          0
Bwd.Packet.Length.Min          0
Bwd.Packet.Length.Mean         0
Bwd.Packet.Length.Std          0
Flow.Bytes.s                   0
dtype: int64

In [56]:
network_traffic_analysis_dataframe.isnull().sum()[21:40]

Flow.IAT.Mean        0
Flow.IAT.Std         0
Flow.IAT.Max         0
Flow.IAT.Min         0
Fwd.IAT.Total        0
Fwd.IAT.Mean         0
Fwd.IAT.Std          0
Fwd.IAT.Max          0
Fwd.IAT.Min          0
Bwd.IAT.Total        0
Bwd.IAT.Mean         0
Bwd.IAT.Std          0
Bwd.IAT.Max          0
Bwd.IAT.Min          0
Fwd.PSH.Flags        0
Bwd.PSH.Flags        0
Fwd.URG.Flags        0
Bwd.URG.Flags        0
Fwd.Header.Length    0
dtype: int64

In [57]:
network_traffic_analysis_dataframe.isnull().sum()[41:60]

Fwd.Packets.s             0
Bwd.Packets.s             0
Min.Packet.Length         0
Max.Packet.Length         0
Packet.Length.Mean        0
Packet.Length.Std         0
Packet.Length.Variance    0
FIN.Flag.Count            0
SYN.Flag.Count            0
RST.Flag.Count            0
PSH.Flag.Count            0
ACK.Flag.Count            0
URG.Flag.Count            0
CWE.Flag.Count            0
ECE.Flag.Count            0
Down.Up.Ratio             0
Average.Packet.Size       0
Avg.Fwd.Segment.Size      0
Avg.Bwd.Segment.Size      0
dtype: int64

In [58]:
network_traffic_analysis_dataframe.isnull().sum()[61:80]

Fwd.Avg.Bytes.Bulk         0
Fwd.Avg.Packets.Bulk       0
Fwd.Avg.Bulk.Rate          0
Bwd.Avg.Bytes.Bulk         0
Bwd.Avg.Packets.Bulk       0
Bwd.Avg.Bulk.Rate          0
Subflow.Fwd.Packets        0
Subflow.Fwd.Bytes          0
Subflow.Bwd.Packets        0
Subflow.Bwd.Bytes          0
Init_Win_bytes_forward     0
Init_Win_bytes_backward    0
act_data_pkt_fwd           0
min_seg_size_forward       0
Active.Mean                0
Active.Std                 0
Active.Max                 0
Active.Min                 0
Idle.Mean                  0
dtype: int64

In [59]:
network_traffic_analysis_dataframe.isnull().sum()[81:]

Idle.Max        0
Idle.Min        0
Label           0
L7Protocol      0
ProtocolName    0
dtype: int64

### Detect any NA values in the dataset

In [61]:
network_traffic_analysis_dataframe.isna().sum()[:20]

Source.IP                      0
Source.Port                    0
Destination.IP                 0
Destination.Port               0
Protocol                       0
Timestamp                      0
Flow.Duration                  0
Total.Fwd.Packets              0
Total.Backward.Packets         0
Total.Length.of.Fwd.Packets    0
Total.Length.of.Bwd.Packets    0
Fwd.Packet.Length.Max          0
Fwd.Packet.Length.Min          0
Fwd.Packet.Length.Mean         0
Fwd.Packet.Length.Std          0
Bwd.Packet.Length.Max          0
Bwd.Packet.Length.Min          0
Bwd.Packet.Length.Mean         0
Bwd.Packet.Length.Std          0
Flow.Bytes.s                   0
dtype: int64

In [62]:
network_traffic_analysis_dataframe.isna().sum()[21:40]

Flow.IAT.Mean        0
Flow.IAT.Std         0
Flow.IAT.Max         0
Flow.IAT.Min         0
Fwd.IAT.Total        0
Fwd.IAT.Mean         0
Fwd.IAT.Std          0
Fwd.IAT.Max          0
Fwd.IAT.Min          0
Bwd.IAT.Total        0
Bwd.IAT.Mean         0
Bwd.IAT.Std          0
Bwd.IAT.Max          0
Bwd.IAT.Min          0
Fwd.PSH.Flags        0
Bwd.PSH.Flags        0
Fwd.URG.Flags        0
Bwd.URG.Flags        0
Fwd.Header.Length    0
dtype: int64

In [63]:
network_traffic_analysis_dataframe.isna().sum()[41:60]

Fwd.Packets.s             0
Bwd.Packets.s             0
Min.Packet.Length         0
Max.Packet.Length         0
Packet.Length.Mean        0
Packet.Length.Std         0
Packet.Length.Variance    0
FIN.Flag.Count            0
SYN.Flag.Count            0
RST.Flag.Count            0
PSH.Flag.Count            0
ACK.Flag.Count            0
URG.Flag.Count            0
CWE.Flag.Count            0
ECE.Flag.Count            0
Down.Up.Ratio             0
Average.Packet.Size       0
Avg.Fwd.Segment.Size      0
Avg.Bwd.Segment.Size      0
dtype: int64

In [64]:
network_traffic_analysis_dataframe.isna().sum()[61:80]

Fwd.Avg.Bytes.Bulk         0
Fwd.Avg.Packets.Bulk       0
Fwd.Avg.Bulk.Rate          0
Bwd.Avg.Bytes.Bulk         0
Bwd.Avg.Packets.Bulk       0
Bwd.Avg.Bulk.Rate          0
Subflow.Fwd.Packets        0
Subflow.Fwd.Bytes          0
Subflow.Bwd.Packets        0
Subflow.Bwd.Bytes          0
Init_Win_bytes_forward     0
Init_Win_bytes_backward    0
act_data_pkt_fwd           0
min_seg_size_forward       0
Active.Mean                0
Active.Std                 0
Active.Max                 0
Active.Min                 0
Idle.Mean                  0
dtype: int64

In [65]:
network_traffic_analysis_dataframe.isna().sum()[81:]

Idle.Max        0
Idle.Min        0
Label           0
L7Protocol      0
ProtocolName    0
dtype: int64
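
In pandas, `isna()` and `isnull()` are aliases, so the two checks above report the same thing. Neither counts infinite values, which per-second rate features can in principle contain. The sketch below is one hedged way to check every column at once and list only the problematic ones. The helper name `missing_and_infinite_summary` and the toy frame are assumptions.

```python
import numpy as np
import pandas as pd


def missing_and_infinite_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count missing and infinite values per column, keeping only affected columns."""
    missing = frame.isna().sum()
    # Infinity only makes sense for numeric columns; other columns get a count of 0.
    numeric = frame.select_dtypes(include="number")
    infinite = np.isinf(numeric).sum().reindex(frame.columns, fill_value=0)
    summary = pd.DataFrame({"missing": missing, "infinite": infinite})
    return summary[(summary["missing"] > 0) | (summary["infinite"] > 0)]


# Toy frame with one infinite rate and one missing label.
demo = pd.DataFrame(
    {
        "Flow.Bytes.s": [1140.944, np.inf, 18.82429],
        "Flow.IAT.Min": [0, 1, 33],
        "ProtocolName": ["HTTP", None, "SSL"],
    }
)
print(missing_and_infinite_summary(demo))
```

An empty result means no column needs imputation or clipping, which is what the zero counts above suggest for missing values in this dataset.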

### Determine the classes in columns of object data type

In [67]:
network_traffic_analysis_dataframe.select_dtypes("object")

Unnamed: 0,Source.IP,Destination.IP,Timestamp,Label,ProtocolName
0,172.19.1.46,10.200.7.7,26/04/201711:11:17,BENIGN,HTTP_PROXY
1,10.200.7.7,172.19.1.46,26/04/201711:11:17,BENIGN,HTTP_PROXY
2,50.31.185.39,10.200.7.217,26/04/201711:11:17,BENIGN,HTTP
3,50.31.185.39,10.200.7.217,26/04/201711:11:17,BENIGN,HTTP
4,192.168.72.43,10.200.7.7,26/04/201711:11:17,BENIGN,HTTP_PROXY
...,...,...,...,...,...
3577291,98.138.79.73,10.200.7.199,15/05/201705:43:40,BENIGN,SSL
3577292,98.138.79.73,10.200.7.217,15/05/201705:46:10,BENIGN,SSL
3577293,98.138.79.73,10.200.7.218,15/05/201705:45:39,BENIGN,SSL
3577294,98.138.79.73,10.200.7.195,15/05/201705:45:59,BENIGN,SSL


In [68]:
network_traffic_analysis_dataframe["Label"].unique()

array(['BENIGN'], dtype=object)

In [70]:
network_traffic_analysis_dataframe["L7Protocol"].unique()

array([131,   7, 130,  91, 126, 124, 119,  40, 121, 147, 178, 212, 163,
       122,  70,  68,  64, 125, 221, 114, 120, 143, 220, 169, 219, 176,
       201,   5,  60, 142, 145, 175, 132, 140, 222, 211, 179, 123,  81,
         9, 148, 156, 203,  51, 195, 133,  92, 200,  67, 135, 153,  36,
        69, 167, 210, 159, 170, 164, 213,  11, 174, 162,  14, 202,  48,
       185,   1, 150, 158, 139, 134,  85, 180,  13, 146, 172,  37, 191])

In [74]:
network_traffic_analysis_dataframe[network_traffic_analysis_dataframe["ProtocolName"] == "HTTP_CONNECT"].head(1)

Unnamed: 0,Source.IP,Source.Port,Destination.IP,Destination.Port,Protocol,Timestamp,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets,Total.Length.of.Fwd.Packets,...,Active.Std,Active.Max,Active.Min,Idle.Mean,Idle.Std,Idle.Max,Idle.Min,Label,L7Protocol,ProtocolName
7,192.168.10.47,51848,10.200.7.6,3128,6,26/04/201711:11:17,11002,3,12,232,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,130,HTTP_CONNECT


In [69]:
network_traffic_analysis_dataframe["ProtocolName"].unique()

array(['HTTP_PROXY', 'HTTP', 'HTTP_CONNECT', 'SSL', 'GOOGLE', 'YOUTUBE',
       'FACEBOOK', 'CONTENT_FLASH', 'DROPBOX', 'WINDOWS_UPDATE', 'AMAZON',
       'MICROSOFT', 'TOR', 'GMAIL', 'YAHOO', 'MSN', 'SSL_NO_CERT',
       'SKYPE', 'MS_ONE_DRIVE', 'MSSQL', 'TWITTER', 'APPLE_ICLOUD',
       'CLOUDFLARE', 'UBUNTUONE', 'OFFICE_365', 'WIKIPEDIA', 'OPENSIGNAL',
       'DNS', 'HTTP_DOWNLOAD', 'WHATSAPP', 'APPLE_ITUNES', 'FTP_DATA',
       'CITRIX', 'APPLE', 'MQTT', 'INSTAGRAM', 'EBAY', 'GOOGLE_MAPS',
       'IP_ICMP', 'NTP', 'TEAMVIEWER', 'SPOTIFY', 'EASYTAXI',
       'MAIL_IMAPS', 'TWITCH', 'NETFLIX', 'SSH', 'SIMET',
       'UNENCRYPED_JABBER', 'WAZE', 'UPNP', 'EDONKEY', 'OSCAR', 'ORACLE',
       'DEEZER', 'OPENVPN', 'WHOIS_DAS', 'SKINNY', 'STARCRAFT', 'NFS',
       'RTMP', 'TEAMSPEAK', 'SNMP', '99TAXI', 'QQ', 'TELEGRAM',
       'FTP_CONTROL', 'LOTUS_NOTES', 'H323', 'CITRIX_ONLINE', 'LASTFM',
       'IP_OSPF', 'CNN', 'BGP', 'RADIUS', 'SOCKS', 'BITTORRENT', 'TIMMEU'],
      dtype=object)

### Mapping of L7Protocol and ProtocolName

In [None]:
# Create a dictionary based on
## - network_traffic_analysis_dataframe["L7Protocol"].unique()
## - network_traffic_analysis_dataframe["ProtocolName"].unique()
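
One possible way to build this dictionary is to take the distinct (L7Protocol, ProtocolName) pairs directly from the rows instead of zipping the two `unique()` arrays, which would silently rely on both arrays being in matching order. The sketch assumes each code maps to exactly one name and raises an error otherwise. The helper name `build_protocol_lookup` is hypothetical, and the toy rows reuse pairs visible in the output above.

```python
import pandas as pd


def build_protocol_lookup(frame: pd.DataFrame) -> dict:
    """Map each L7Protocol code to its ProtocolName, checking the mapping is unique."""
    pairs = frame[["L7Protocol", "ProtocolName"]].drop_duplicates()
    if pairs["L7Protocol"].duplicated().any():
        raise ValueError("At least one L7Protocol code maps to several names")
    return dict(zip(pairs["L7Protocol"].tolist(), pairs["ProtocolName"].tolist()))


# Toy rows using code/name pairs that appear in the dataset preview.
demo = pd.DataFrame(
    {
        "L7Protocol": [131, 131, 7, 130, 91, 91],
        "ProtocolName": ["HTTP_PROXY", "HTTP_PROXY", "HTTP", "HTTP_CONNECT", "SSL", "SSL"],
    }
)
print(build_protocol_lookup(demo))
```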

### Mapping of Protocol and OSI Model

In [None]:
# Create a mapping of the number in Protocol column and the OSI model
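
The values in the Protocol column (minimum 0, median 6, maximum 17 in the summary statistics) look like IANA-assigned IP protocol numbers, where 6 is TCP and 17 is UDP, both transport-layer (OSI layer 4) protocols. Under that assumption, a small lookup could be sketched as follows. The names `IP_PROTOCOL_NAMES` and `ProtocolLabel` are hypothetical.

```python
import pandas as pd

# Assumption: the Protocol column holds IANA IP protocol numbers.
IP_PROTOCOL_NAMES = {
    0: "HOPOPT",  # IPv6 Hop-by-Hop Option
    6: "TCP",
    17: "UDP",
}

demo = pd.DataFrame({"Protocol": [6, 17, 0, 6]})
# Any number missing from the lookup is labelled OTHER instead of NaN.
demo["ProtocolLabel"] = demo["Protocol"].map(IP_PROTOCOL_NAMES).fillna("OTHER")
print(demo)
```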

### Find a method to create new MALIGNANT records and add them to the dataset

In [None]:
#TODO

### Split Timestamp into two columns (Date and Time)

In [None]:
#TODO
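
A minimal sketch of this step, assuming every Timestamp value follows the pattern seen in the preview (day/month/year immediately followed by the time, with no separator), could look like this. The column names `Date` and `Time` are illustrative choices.

```python
import pandas as pd

# Sample values copied from the dataset preview.
demo = pd.DataFrame({"Timestamp": ["26/04/201711:11:17", "15/05/201705:43:40"]})

# The format string has no space between the year and the hour, matching the data.
parsed = pd.to_datetime(demo["Timestamp"], format="%d/%m/%Y%H:%M:%S")
demo["Date"] = parsed.dt.date
demo["Time"] = parsed.dt.time
print(demo)
```

Keeping the parsed datetime column as well may be useful, since features such as hour of day or day of week can be derived from it directly.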

# Data Preparation for ML Training

* Load and explore the dataset.
* Handle missing data and outliers.
* Perform data visualization to gain insights into the dataset.
* Preprocess the data for modeling, including feature scaling and encoding categorical variables.
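
The scaling and encoding step listed above can be sketched with scikit-learn as follows. This is an illustration on a toy frame, not the notebook's final pipeline: the choice of columns, the use of ProtocolName as the target, and the variable names are all assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Toy rows; the real dataset has many more columns and classes.
demo = pd.DataFrame(
    {
        "Flow.Duration": [45523, 1, 217, 78068],
        "Total.Fwd.Packets": [22, 2, 1, 5],
        "Protocol": [6, 6, 6, 17],
        "ProtocolName": ["HTTP_PROXY", "HTTP_PROXY", "HTTP", "HTTP_PROXY"],
    }
)

numeric_features = ["Flow.Duration", "Total.Fwd.Packets"]
categorical_features = ["Protocol"]

preprocessor = ColumnTransformer(
    [
        ("scale", StandardScaler(), numeric_features),
        ("encode", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ],
    sparse_threshold=0,  # always return a dense array in this small example
)

X = preprocessor.fit_transform(demo[numeric_features + categorical_features])
y = LabelEncoder().fit_transform(demo["ProtocolName"])
print(X.shape, y)
```

In a real run the transformer should be fitted on the training split only and then applied to the test split, so that test statistics do not leak into training.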

### Remove Flow.ID column

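The Flow.ID column is only an identifier for each flow, built from the source and destination IP addresses, ports, and protocol (for example, 172.19.1.46-10.200.7.7-52422-3128-6). It carries no information beyond those columns, so it is removed.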

In [None]:
network_traffic_analysis_dataframe.drop(labels="Flow.ID", axis=1, inplace=True)

### TEST

In [66]:
network_traffic_analysis_dataframe

Unnamed: 0,Source.IP,Source.Port,Destination.IP,Destination.Port,Protocol,Timestamp,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets,Total.Length.of.Fwd.Packets,...,Active.Std,Active.Max,Active.Min,Idle.Mean,Idle.Std,Idle.Max,Idle.Min,Label,L7Protocol,ProtocolName
0,172.19.1.46,52422,10.200.7.7,3128,6,26/04/201711:11:17,45523,22,55,132,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY
1,10.200.7.7,3128,172.19.1.46,52422,6,26/04/201711:11:17,1,2,0,12,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY
2,50.31.185.39,80,10.200.7.217,38848,6,26/04/201711:11:17,1,3,0,674,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,7,HTTP
3,50.31.185.39,80,10.200.7.217,38848,6,26/04/201711:11:17,217,1,3,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,7,HTTP
4,192.168.72.43,55961,10.200.7.7,3128,6,26/04/201711:11:17,78068,5,0,1076,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3577291,98.138.79.73,443,10.200.7.199,42135,6,15/05/201705:43:40,2290821,5,4,599,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL
3577292,98.138.79.73,443,10.200.7.217,51546,6,15/05/201705:46:10,24,5,0,1448,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL
3577293,98.138.79.73,443,10.200.7.218,44366,6,15/05/201705:45:39,2591653,6,5,1202,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL
3577294,98.138.79.73,443,10.200.7.195,52341,6,15/05/201705:45:59,2622421,4,3,632,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,91,SSL


In [None]:
network_traffic_analysis_dataframe.head()

# Model Building

* Split the dataset into training and testing sets.
* Implement at least three different classification models (e.g., Decision Tree, Random Forest, SVM, etc.).
* Train and fine-tune each model using appropriate techniques.
* Discuss the choice of hyperparameters and the reasoning behind it.
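
The steps above can be sketched on synthetic data as follows. The data comes from `make_classification` as a stand-in for the preprocessed flows, the hyperparameter values are placeholders rather than tuned choices, and logistic regression is used here as an illustrative third model in place of an SVM, which may be slow on several million rows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the real feature matrix.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3, random_state=0
)

# Stratify so each class keeps the same share in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=8, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out split
    print(f"{name}: {scores[name]:.3f}")
```

For tuning, the same dictionary of estimators could be wrapped in `GridSearchCV` or `RandomizedSearchCV` with cross-validation on the training split.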

# Model Evaluation

* Evaluate the models using appropriate classification metrics (accuracy, precision, recall, F1-score, etc.).
* Visualize the model performance using ROC curves and confusion matrices.
* Compare the models and justify your choice of the best-performing model.
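
A hedged sketch of the evaluation step on synthetic data is shown here. It reports per-class precision, recall, and F1, a confusion matrix, and a one-vs-rest ROC AUC, which is one common way to extend ROC analysis to more than two classes. The data, model, and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for the real feature matrix.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
# One-vs-rest AUC computed from the predicted class probabilities.
auc = roc_auc_score(y_test, model.predict_proba(X_test), multi_class="ovr")

print(classification_report(y_test, y_pred))
print(matrix)
print(f"macro F1: {macro_f1:.3f}, one-vs-rest ROC AUC: {auc:.3f}")
```

With dozens of imbalanced application classes, macro-averaged metrics are usually more informative than plain accuracy, because accuracy can be dominated by the largest classes.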

# Recommendation and Action Plan

# Conclusion

# Reference