# 1. Data Collection

## Dataset Description
The dataset used in this project is the DDOS attack SDN Dataset from the Mendeley Data repository. It is designed for traffic classification with machine learning and deep learning algorithms in Software-Defined Networking (SDN) environments. An overview follows:

## Overview
- Source: Mendeley Data - DDOS attack SDN Dataset
- Network Simulation Tool: Mininet Emulator
- Controller: Single Ryu Controller
- Total Data Entries: 104,345
- Simulation Duration: 250 minutes

## Traffic Types
- Benign Traffic: TCP, UDP, ICMP
- Malicious Traffic: TCP SYN Flood, UDP Flood, ICMP Attack

## Features
- The dataset comprises 23 features, which can be categorized into extracted and calculated features:
## Extracted Features
- Switch-id: Identifier for the switch
- Packet_count: Total number of packets
- Byte_count: Total number of bytes
- Duration_sec: Duration in seconds
- Duration_nsec: Duration in nanoseconds
- Total Duration: Sum of Duration_sec and Duration_nsec
- Source IP: IP address of the source
- Destination IP: IP address of the destination
- Port number: Port number
- tx_bytes: Number of bytes transferred from the switch port
- rx_bytes: Number of bytes received on the switch port
- dt: Date and time converted into a number
## Calculated Features
- Packets per Flow: Packet count during a single flow
- Bytes per Flow: Byte count during a single flow
- Packet Rate: Number of packets sent per second, calculated by dividing packets per flow by the monitoring interval
- Packet_ins Messages: Number of Packet_in messages
- Total Flow Entries: Total flow entries in the switch
- tx_kbps: Data transfer rate in kilobits per second
- rx_kbps: Data receiving rate in kilobits per second
- Port Bandwidth: Sum of tx_kbps and rx_kbps
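
The relationships among the calculated features can be checked on a few rows. The sketch below is a minimal illustration on hypothetical toy data, not the dataset authors' code. It assumes a 30-second monitoring interval and integer truncation, both inferred from the sample output in Section 2 (a pktperflow of 13535 alongside a pktrate of 451) rather than stated here.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the dataset's CSV header.
sample = pd.DataFrame({
    "pktperflow": [13535, 13531, 13534],
    "tx_kbps": [0, 12, 40],
    "rx_kbps": [0.0, 8.0, 2.0],
})

# Assumption: 30-second monitoring interval, inferred from the sample output.
MONITORING_INTERVAL_SEC = 30

# Packet rate: packets per flow divided by the monitoring interval.
sample["pktrate"] = sample["pktperflow"] // MONITORING_INTERVAL_SEC

# Port bandwidth: sum of the transmit and receive rates.
sample["tot_kbps"] = sample["tx_kbps"] + sample["rx_kbps"]

print(sample)
```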

## Class Label
- 0: Benign Traffic
- 1: Malicious Traffic

## Usage
The dataset is used to train and evaluate machine learning models that detect network anomalies. The goal is a Network Anomaly Detection System that uses hybrid machine learning models to improve security in SDN environments.

# 2. Data Preprocessing

In [1]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/ajmal/network_anomaly_using_hybrid_ml/dataset_sdn.csv')
df.head()

Unnamed: 0,dt,switch,src,dst,pktcount,bytecount,dur,dur_nsec,tot_dur,flows,...,pktrate,Pairflow,Protocol,port_no,tx_bytes,rx_bytes,tx_kbps,rx_kbps,tot_kbps,label
0,11425,1,10.0.0.1,10.0.0.8,45304,48294064,100,716000000,101000000000.0,3,...,451,0,UDP,3,143928631,3917,0,0.0,0.0,0
1,11605,1,10.0.0.1,10.0.0.8,126395,134737070,280,734000000,281000000000.0,2,...,451,0,UDP,4,3842,3520,0,0.0,0.0,0
2,11425,1,10.0.0.2,10.0.0.8,90333,96294978,200,744000000,201000000000.0,3,...,451,0,UDP,1,3795,1242,0,0.0,0.0,0
3,11425,1,10.0.0.2,10.0.0.8,90333,96294978,200,744000000,201000000000.0,3,...,451,0,UDP,2,3688,1492,0,0.0,0.0,0
4,11425,1,10.0.0.2,10.0.0.8,90333,96294978,200,744000000,201000000000.0,3,...,451,0,UDP,3,3413,3665,0,0.0,0.0,0


In [2]:
df.shape

(104345, 23)

- The dataset has 104345 rows and 23 columns

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104345 entries, 0 to 104344
Data columns (total 23 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   dt           104345 non-null  int64  
 1   switch       104345 non-null  int64  
 2   src          104345 non-null  object 
 3   dst          104345 non-null  object 
 4   pktcount     104345 non-null  int64  
 5   bytecount    104345 non-null  int64  
 6   dur          104345 non-null  int64  
 7   dur_nsec     104345 non-null  int64  
 8   tot_dur      104345 non-null  float64
 9   flows        104345 non-null  int64  
 10  packetins    104345 non-null  int64  
 11  pktperflow   104345 non-null  int64  
 12  byteperflow  104345 non-null  int64  
 13  pktrate      104345 non-null  int64  
 14  Pairflow     104345 non-null  int64  
 15  Protocol     104345 non-null  object 
 16  port_no      104345 non-null  int64  
 17  tx_bytes     104345 non-null  int64  
 18  rx_bytes     104345 non-

- The dataset has 3 non-numeric (object) columns (src, dst, Protocol) and 20 numeric columns
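
Column types can also be counted programmatically instead of being read off the `df.info()` listing. A minimal sketch using `select_dtypes` on a hypothetical toy frame (the rows are made up; only the column names come from the dataset):

```python
import pandas as pd

# Hypothetical toy frame with a few of the dataset's column names.
toy = pd.DataFrame({
    "src": ["10.0.0.1", "10.0.0.2"],
    "dst": ["10.0.0.8", "10.0.0.8"],
    "Protocol": ["UDP", "TCP"],
    "pktcount": [45304, 90333],
    "tot_dur": [1.01e11, 2.01e11],
})

# Split the columns by dtype family.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print("non-numeric:", non_numeric_cols)
print("numeric:", numeric_cols)
```

Running the same two `select_dtypes` calls on `df` would give the counts for the full dataset.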

In [4]:
df.isna().sum()

dt               0
switch           0
src              0
dst              0
pktcount         0
bytecount        0
dur              0
dur_nsec         0
tot_dur          0
flows            0
packetins        0
pktperflow       0
byteperflow      0
pktrate          0
Pairflow         0
Protocol         0
port_no          0
tx_bytes         0
rx_bytes         0
tx_kbps          0
rx_kbps        506
tot_kbps       506
label            0
dtype: int64

- The features rx_kbps and tot_kbps each have 506 null values.
- Drop the affected rows, since they make up only about 0.5% of the dataset.
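
Before dropping rows, it can help to quantify how much data would be lost. The sketch below shows one way to do that on a hypothetical toy frame. The values are illustrative and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with nulls in the same two columns as the dataset.
toy = pd.DataFrame({
    "tx_kbps": [0, 5, 7, 9],
    "rx_kbps": [0.0, np.nan, 3.0, 1.0],
    "tot_kbps": [0.0, np.nan, 10.0, 10.0],
})

# Per-column null counts, as in df.isna().sum().
null_counts = toy.isna().sum()

# Number and share of rows that contain at least one null.
rows_with_nulls = int(toy.isna().any(axis=1).sum())
fraction_lost = rows_with_nulls / len(toy)

# Non-mutating alternative to dropna(inplace=True).
cleaned = toy.dropna()

print(null_counts)
print(f"rows dropped: {rows_with_nulls} ({fraction_lost:.1%})")
```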


In [5]:
df.dropna(inplace=True)
df.isna().sum()

dt             0
switch         0
src            0
dst            0
pktcount       0
bytecount      0
dur            0
dur_nsec       0
tot_dur        0
flows          0
packetins      0
pktperflow     0
byteperflow    0
pktrate        0
Pairflow       0
Protocol       0
port_no        0
tx_bytes       0
rx_bytes       0
tx_kbps        0
rx_kbps        0
tot_kbps       0
label          0
dtype: int64

In [6]:
df.duplicated().sum()

5091

- The dataset has 5091 duplicated rows
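
By default, `duplicated()` flags every repeat of a row except its first occurrence, so the count equals the number of rows that `drop_duplicates()` removes. A small illustration on hypothetical data:

```python
import pandas as pd

# Hypothetical toy frame: the first three rows are identical.
toy = pd.DataFrame({
    "switch": [1, 1, 1, 2],
    "pktcount": [90333, 90333, 90333, 45304],
})

# keep="first" is the default: the first occurrence is not flagged.
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# Non-mutating alternative to drop_duplicates(inplace=True).
deduped = toy.drop_duplicates()

print(n_duplicates, len(deduped))
```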

In [7]:
df.drop_duplicates(inplace=True)
df.duplicated().sum()

0

In [8]:
df.shape

(98748, 23)
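
The new row count is consistent with the two cleaning steps. As a quick arithmetic check using the figures reported above:

```python
# Figures taken from the cell outputs above.
initial_rows = 104345
rows_with_nulls = 506
duplicate_rows = 5091

# Rows left after dropping nulls, then duplicates.
remaining_rows = initial_rows - rows_with_nulls - duplicate_rows
print(remaining_rows)  # 98748
```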

# Features to Consider Dropping

## Identifiers and Date/Time Information:

- dt: This is likely just a timestamp and doesn’t contribute to predicting anomalies.

## Highly Correlated or Redundant Duration Information:

- dur and dur_nsec: Since the dataset already has tot_dur, these two columns are redundant.

## IP Addresses:

- src: Source IP addresses are often not useful for machine learning models and can introduce high cardinality.
- dst: Destination IP addresses have the same issue as source IPs.

## Port Number:

- port_no: Depending on the context, it might not be directly useful. However, if certain port numbers are more prone to attacks, this feature might be useful.
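
These judgments can be backed by quick checks before dropping anything: cardinality for the IP columns and correlation for the duration columns. The sketch below uses hypothetical toy rows and assumes tot_dur is expressed in nanoseconds (dur × 10⁹ + dur_nsec), which matches the magnitudes in the sample output but is not confirmed by the dataset description.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "src": ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"],
    "dur": [100, 200, 280, 50],
    "dur_nsec": [716000000, 744000000, 734000000, 100000000],
})

# Assumption: tot_dur is the total duration in nanoseconds.
toy["tot_dur"] = toy["dur"] * 1_000_000_000 + toy["dur_nsec"]

# Cardinality: number of distinct source addresses.
cardinality = toy["src"].nunique()

# Redundancy: Pearson correlation between dur and tot_dur.
corr_dur_tot = toy["dur"].corr(toy["tot_dur"])

print(cardinality, round(corr_dur_tot, 4))
```

A correlation close to 1 supports treating dur as redundant with tot_dur. Running `df['port_no'].nunique()` and comparing attack rates per port would be a similar check for port_no.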

In [9]:
# List of columns to drop
columns_to_drop = ['dt', 'src', 'dst', 'dur', 'dur_nsec', 'port_no']

# Drop the columns
df_cleaned = df.drop(columns=columns_to_drop)

# Verify the columns are dropped
print(df_cleaned.columns)


Index(['switch', 'pktcount', 'bytecount', 'tot_dur', 'flows', 'packetins',
       'pktperflow', 'byteperflow', 'pktrate', 'Pairflow', 'Protocol',
       'tx_bytes', 'rx_bytes', 'tx_kbps', 'rx_kbps', 'tot_kbps', 'label'],
      dtype='object')


In [10]:
df_cleaned.head()

Unnamed: 0,switch,pktcount,bytecount,tot_dur,flows,packetins,pktperflow,byteperflow,pktrate,Pairflow,Protocol,tx_bytes,rx_bytes,tx_kbps,rx_kbps,tot_kbps,label
0,1,45304,48294064,101000000000.0,3,1943,13535,14428310,451,0,UDP,143928631,3917,0,0.0,0.0,0
1,1,126395,134737070,281000000000.0,2,1943,13531,14424046,451,0,UDP,3842,3520,0,0.0,0.0,0
2,1,90333,96294978,201000000000.0,3,1943,13534,14427244,451,0,UDP,3795,1242,0,0.0,0.0,0
3,1,90333,96294978,201000000000.0,3,1943,13534,14427244,451,0,UDP,3688,1492,0,0.0,0.0,0
4,1,90333,96294978,201000000000.0,3,1943,13534,14427244,451,0,UDP,3413,3665,0,0.0,0.0,0


In [12]:
df_cleaned.duplicated().sum()

16058

- Dropping columns makes some previously distinct rows identical, so 16058 new duplicate rows appear. Drop them as well.
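
New duplicates appear because rows that differed only in a dropped column become identical once that column is gone. A minimal sketch with hypothetical rows that differ only in port_no:

```python
import pandas as pd

# Hypothetical rows that are identical except for port_no.
toy = pd.DataFrame({
    "pktcount": [90333, 90333, 90333],
    "port_no": [1, 2, 3],
    "label": [0, 0, 0],
})

# No duplicates while port_no is present.
before = int(toy.duplicated().sum())

# After dropping port_no, the three rows collapse into one distinct row.
reduced = toy.drop(columns=["port_no"])
after = int(reduced.duplicated().sum())

print(before, after)
```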

In [13]:
df_cleaned.drop_duplicates(inplace=True)
df_cleaned.duplicated().sum()

0

In [14]:
# Save the cleaned DataFrame to CSV; index=False omits the row index

df_cleaned.to_csv("/content/drive/MyDrive/ajmal/network_anomaly_using_hybrid_ml/df_cleaned.csv", index=False)
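
Passing `index=False` keeps the row index out of the file, so reloading the CSV does not produce an extra `Unnamed: 0` column. A round-trip sketch on a hypothetical toy frame, written to a temporary directory instead of the Drive path above:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame standing in for df_cleaned.
toy = pd.DataFrame({
    "switch": [1, 2],
    "Protocol": ["UDP", "TCP"],
    "label": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "df_cleaned.csv"
    # Write without the index, then read the file back.
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
```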
