# Network Traffic Classification using Machine Learning Techniques

## Overview

Develop classification models in Python to analyze a network traffic dataset.

The primary goal is to explore the dataset, preprocess it, create and evaluate different classification models, and report your findings. 

This assignment will enhance your understanding of machine learning techniques, data preprocessing, and model evaluation while applying them to a practical problem related to network security.

## Dataset

This is a real-world dataset collected at Universidad del Cauca, Popayán, Colombia, over six days in 2017 (April 26, 27, and 28 and May 9, 11, and 15) using multiple packet capture and data extraction tools.

The dataset consists of 3,577,296 instances and 87 features and was originally designed for application classification. Each row represents a traffic flow from a source to a destination, and each column represents one feature of that flow.

The dataset was downloaded from Kaggle, where it is titled "IP Network Traffic Flows, Labeled with 75 Apps."

## Purpose

## Assumptions Made

## Environment Setup

In [1]:
!pip install --upgrade pip
!pip install setuptools -U
!pip install pandas -U
!pip install -U scikit-learn
!pip install kagglehub -U

!pip install matplotlib -U
!pip install seaborn -U

Defaulting to user installation because normal site-packages is not writeable
Collecting matplotlib
  Downloading matplotlib-3.9.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting contourpy>=1.0.1 (from matplotlib)
  Downloading contourpy-1.3.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.4 kB)
Collecting cycler>=0.10 (from matplotlib)
  Downloading cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib)
  Downloading fonttools-4.54.1-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x

In [2]:
import kagglehub
import pandas
import numpy

import matplotlib
import seaborn



## Data Retrieval

The `kagglehub` package is used to simplify data retrieval.

In [3]:
path = kagglehub.dataset_download("jsrojas/ip-network-traffic-flows-labeled-with-87-apps")

Downloading from https://www.kaggle.com/api/v1/datasets/download/jsrojas/ip-network-traffic-flows-labeled-with-87-apps?dataset_version_number=1...


100%|██████████| 514M/514M [00:25<00:00, 20.9MB/s] 

Extracting files...





In [4]:
print(path)

/home/vscode/.cache/kagglehub/datasets/jsrojas/ip-network-traffic-flows-labeled-with-87-apps/versions/1
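
The directory printed above holds the downloaded files. A minimal sketch for listing the CSV files inside such a directory follows; `find_csv_files` is a hypothetical helper introduced only for illustration, and the demonstration runs on a temporary directory rather than the real download.

```python
import tempfile
from pathlib import Path


def find_csv_files(directory):
    """Return every CSV file below `directory`, sorted by path (hypothetical helper)."""
    return sorted(Path(directory).rglob("*.csv"))


# Demonstration on a throwaway directory so the sketch runs without a download.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "flows.csv").write_text("a,b\n1,2\n")
    (Path(demo_dir) / "notes.txt").write_text("not a csv")
    found_names = [csv_path.name for csv_path in find_csv_files(demo_dir)]

print(found_names)
```

In the notebook, calling `find_csv_files(path)` would list the files that `pandas.read_csv` can be pointed at.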


## Data Preparation

Load the data into a pandas DataFrame for exploration. Only the first 100,000 rows are read.

In [5]:
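# Read the CSV from the directory returned by kagglehub.dataset_download (assumes the file sits at its top level).
network_traffic_analysis_dataframe: pandas.DataFrame = pandas.read_csv(f"{path}/Dataset-Unicauca-Version2-87Atts.csv", nrows=100000)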

In [6]:
network_traffic_analysis_dataframe

Unnamed: 0,Flow.ID,Source.IP,Source.Port,Destination.IP,Destination.Port,Protocol,Timestamp,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets,...,Active.Std,Active.Max,Active.Min,Idle.Mean,Idle.Std,Idle.Max,Idle.Min,Label,L7Protocol,ProtocolName
0,172.19.1.46-10.200.7.7-52422-3128-6,172.19.1.46,52422,10.200.7.7,3128,6,26/04/201711:11:17,45523,22,55,...,0.0,0,0,0.0,0.0,0,0,BENIGN,131,HTTP_PROXY
1,172.19.1.46-10.200.7.7-52422-3128-6,10.200.7.7,3128,172.19.1.46,52422,6,26/04/201711:11:17,1,2,0,...,0.0,0,0,0.0,0.0,0,0,BENIGN,131,HTTP_PROXY
2,10.200.7.217-50.31.185.39-38848-80-6,50.31.185.39,80,10.200.7.217,38848,6,26/04/201711:11:17,1,3,0,...,0.0,0,0,0.0,0.0,0,0,BENIGN,7,HTTP
3,10.200.7.217-50.31.185.39-38848-80-6,50.31.185.39,80,10.200.7.217,38848,6,26/04/201711:11:17,217,1,3,...,0.0,0,0,0.0,0.0,0,0,BENIGN,7,HTTP
4,192.168.72.43-10.200.7.7-55961-3128-6,192.168.72.43,55961,10.200.7.7,3128,6,26/04/201711:11:17,78068,5,0,...,0.0,0,0,0.0,0.0,0,0,BENIGN,131,HTTP_PROXY
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,192.168.205.131-10.200.7.6-54082-3128-6,10.200.7.6,3128,192.168.205.131,54082,6,26/04/201711:14:24,404,1,1,...,0.0,0,0,0.0,0.0,0,0,BENIGN,7,HTTP
99996,10.200.7.217-91.216.63.241-37416-80-6,10.200.7.217,37416,91.216.63.241,80,6,26/04/201711:14:42,309,1,2,...,0.0,0,0,0.0,0.0,0,0,BENIGN,7,HTTP
99997,172.217.29.69-10.200.7.194-443-35657-6,172.217.29.69,443,10.200.7.194,35657,6,26/04/201711:16:37,148,1,1,...,0.0,0,0,0.0,0.0,0,0,BENIGN,126,GOOGLE
99998,192.168.29.55-10.200.7.8-61690-3128-6,192.168.29.55,61690,10.200.7.8,3128,6,26/04/201711:15:31,82,1,2,...,0.0,0,0,0.0,0.0,0,0,BENIGN,126,GOOGLE


## Data Exploration

* Load and explore the dataset.
* Handle missing data and outliers.
* Perform data visualization to gain insights into the dataset.
* Preprocess the data for modeling, including feature scaling and encoding categorical variables.
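
The first two steps above can be sketched as follows. This is an illustration under stated assumptions, not the analysis itself: the frame `flows` is a tiny synthetic stand-in with made-up values, only the column names are taken from the dataset, and median filling and the 1.5 × IQR rule are just one common choice for handling missing values and flagging outliers.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the flow table; the values are made up.
flows = pd.DataFrame({
    "Flow.Duration": [45523, 1, 217, 78068, 404, 5_000_000, np.nan, 309],
    "Total.Fwd.Packets": [22, 2, 1, 5, 1, 900, 3, 1],
    "ProtocolName": ["HTTP_PROXY", "HTTP_PROXY", "HTTP", "HTTP_PROXY",
                     "HTTP", "GOOGLE", "HTTP", "GOOGLE"],
})

# 1. Column types and missing values per column.
print(flows.dtypes)
missing_per_column = flows.isna().sum()
print(missing_per_column)

# 2. Class balance of the target column.
class_counts = flows["ProtocolName"].value_counts()
print(class_counts)

# 3. Fill missing numeric values with the column median (one simple option).
numeric_columns = flows.select_dtypes(include="number").columns
flows[numeric_columns] = flows[numeric_columns].fillna(flows[numeric_columns].median())

# 4. Flag outliers with the 1.5 * IQR rule, column by column.
q1 = flows[numeric_columns].quantile(0.25)
q3 = flows[numeric_columns].quantile(0.75)
iqr = q3 - q1
is_outlier = (flows[numeric_columns] < q1 - 1.5 * iqr) | (flows[numeric_columns] > q3 + 1.5 * iqr)
outlier_rows = flows[is_outlier.any(axis=1)]
print(outlier_rows)
```

For the visualization step, `class_counts.plot.bar()` or a seaborn histogram of a numeric column would be a natural starting point. Whether flagged rows should be dropped, clipped, or kept depends on the traffic being modeled, since very long flows may be legitimate.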

## Data Preprocessing
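
The feature scaling and categorical encoding listed under Data Exploration might look like the sketch below. It assumes scikit-learn's `ColumnTransformer` is an acceptable way to combine both steps; the sample rows are synthetic, and treating `Protocol` as categorical and `ProtocolName` as the target is an assumption made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Synthetic rows; only the column names come from the dataset.
flows = pd.DataFrame({
    "Flow.Duration": [45523, 1, 217, 78068, 404, 309],
    "Total.Fwd.Packets": [22, 2, 1, 5, 1, 1],
    "Protocol": [6, 6, 6, 6, 17, 6],
    "ProtocolName": ["HTTP_PROXY", "HTTP_PROXY", "HTTP", "HTTP_PROXY", "GOOGLE", "HTTP"],
})

numeric_features = ["Flow.Duration", "Total.Fwd.Packets"]
categorical_features = ["Protocol"]

# Scale numeric columns to zero mean and unit variance; one-hot encode categorical ones.
# sparse_threshold=0 forces a dense array so the result is easy to inspect.
preprocessor = ColumnTransformer(
    [
        ("scale", StandardScaler(), numeric_features),
        ("encode", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ],
    sparse_threshold=0,
)
X = preprocessor.fit_transform(flows[numeric_features + categorical_features])

# Encode the string labels as integers for the classifiers.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(flows["ProtocolName"])

print(X.shape)
print(dict(zip(label_encoder.classes_, range(len(label_encoder.classes_)))))
```

On real data the transformer should be fitted on the training split only and then applied to the test split, so that test statistics do not leak into the scaling.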

## Model Building

* Split the dataset into training and testing sets.
* Implement at least three different classification models (e.g., Decision Tree, Random Forest, SVM, etc.).
* Train and fine-tune each model using appropriate techniques.
* Discuss the choice of hyperparameters and the reasoning behind it.
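
One possible shape for these steps is sketched below. It uses synthetic data from `make_classification` in place of the flow features, and the three models and their small hyperparameter grids are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the preprocessed flow features.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Stratified split keeps the class proportions equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Each candidate pairs an estimator with a small, illustrative search grid.
candidates = {
    "decision_tree": (DecisionTreeClassifier(random_state=42), {"max_depth": [3, 6, None]}),
    "random_forest": (RandomForestClassifier(random_state=42), {"n_estimators": [50, 100]}),
    # SVMs are sensitive to feature scale, so scaling is part of the pipeline.
    "svm": (make_pipeline(StandardScaler(), SVC()), {"svc__C": [0.1, 1, 10]}),
}

fitted_models = {}
for name, (estimator, grid) in candidates.items():
    # 3-fold cross-validated grid search, scored by macro-averaged F1.
    search = GridSearchCV(estimator, grid, cv=3, scoring="f1_macro")
    search.fit(X_train, y_train)
    fitted_models[name] = search.best_estimator_
    print(name, search.best_params_, round(search.best_score_, 3))
```

Macro-averaged F1 is used for model selection here because application labels in traffic data are usually imbalanced; plain accuracy would favor the dominant classes.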

## Model Evaluation

* Evaluate the models using appropriate classification metrics (accuracy, precision, recall, F1-score, etc.).
* Visualize the model performance using ROC curves and confusion matrices.
* Compare the models and justify your choice of the best-performing model.
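
The metrics above could be computed along the lines of this sketch. It trains a random forest on synthetic data purely so that the example is self-contained; the model choice, data, and one-vs-rest treatment of ROC for the multiclass case are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the flow features.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)  # one probability column per class

# Per-class precision, recall, and F1, plus overall accuracy.
print(classification_report(y_test, y_pred, digits=3))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Multiclass ROC AUC: one-vs-rest, macro-averaged over classes.
macro_auc = roc_auc_score(y_test, y_score, multi_class="ovr", average="macro")
print(round(macro_auc, 3))

# ROC curve points for class 0 versus the rest, ready for plotting.
fpr, tpr, thresholds = roc_curve((y_test == 0).astype(int), y_score[:, 0])
```

For the plots, `sklearn.metrics.ConfusionMatrixDisplay.from_predictions(y_test, y_pred)` draws the confusion matrix, and `matplotlib.pyplot.plot(fpr, tpr)` draws the per-class ROC curve.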

## Recommendation and Action Plan

## Conclusion

## References