<a href="https://colab.research.google.com/github/SiddharthDNathan/RT-IoT---Intrusion-Detection-Systems-IDS-/blob/main/RT_IoT2022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RT-IoT - Intrusion Detection Systems (IDS)

RT-IoT2022 is a proprietary dataset derived from a real-time IoT infrastructure. It combines traffic from a diverse range of IoT devices with sophisticated network attack methodologies, covering both normal and adversarial network behaviour. The normal traffic comes from IoT devices such as ThingSpeak-LED, Wipro-Bulb, and MQTT-Temp; the attack traffic comes from simulated scenarios involving Brute-Force SSH attacks, DDoS attacks using Hping and Slowloris, and Nmap scan patterns. The bidirectional attributes of the network traffic are captured using the Zeek network monitoring tool and the Flowmeter plugin. Researchers can use RT-IoT2022 to develop and evaluate Intrusion Detection Systems (IDS) and to build robust, adaptive security solutions for real-time IoT networks.

**Dataset link** - https://archive.ics.uci.edu/dataset/942/rt-iot2022

The dataset is tabular, sequential, and multivariate, with 123,117 instances and 83 features describing network flow behaviour. The features are a mix of real-valued and categorical types. The dataset can be used for classification, regression, and clustering tasks such as anomaly detection and the analysis of network performance. The loaded CSV reports 85 columns, which is consistent with the 83 features plus the `Attack_type` label and a saved row index (`Unnamed: 0`).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import RobustScaler
import warnings
warnings.filterwarnings('ignore')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Data Loading and Cleaning

In [None]:
df = pd.read_csv('/content/drive/MyDrive/CSV Files/RT_IOT2022')


In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,id.orig_p,id.resp_p,proto,service,flow_duration,fwd_pkts_tot,bwd_pkts_tot,fwd_data_pkts_tot,bwd_data_pkts_tot,...,active.std,idle.min,idle.max,idle.tot,idle.avg,idle.std,fwd_init_window_size,bwd_init_window_size,fwd_last_window_size,Attack_type
0,0,38667,1883,tcp,mqtt,32.011598,9,5,3,3,...,0.0,29729180.0,29729180.0,29729180.0,29729180.0,0.0,64240,26847,502,MQTT_Publish
1,1,51143,1883,tcp,mqtt,31.883584,9,5,3,3,...,0.0,29855280.0,29855280.0,29855280.0,29855280.0,0.0,64240,26847,502,MQTT_Publish
2,2,44761,1883,tcp,mqtt,32.124053,9,5,3,3,...,0.0,29842150.0,29842150.0,29842150.0,29842150.0,0.0,64240,26847,502,MQTT_Publish
3,3,60893,1883,tcp,mqtt,31.961063,9,5,3,3,...,0.0,29913770.0,29913770.0,29913770.0,29913770.0,0.0,64240,26847,502,MQTT_Publish
4,4,51087,1883,tcp,mqtt,31.902362,9,5,3,3,...,0.0,29814700.0,29814700.0,29814700.0,29814700.0,0.0,64240,26847,502,MQTT_Publish


In [None]:
df.tail()

Unnamed: 0.1,Unnamed: 0,id.orig_p,id.resp_p,proto,service,flow_duration,fwd_pkts_tot,bwd_pkts_tot,fwd_data_pkts_tot,bwd_data_pkts_tot,...,active.std,idle.min,idle.max,idle.tot,idle.avg,idle.std,fwd_init_window_size,bwd_init_window_size,fwd_last_window_size,Attack_type
123112,2005,59247,63331,tcp,-,6e-06,1,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,1024,0,1024,NMAP_XMAS_TREE_SCAN
123113,2006,59247,64623,tcp,-,7e-06,1,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,1024,0,1024,NMAP_XMAS_TREE_SCAN
123114,2007,59247,64680,tcp,-,6e-06,1,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,1024,0,1024,NMAP_XMAS_TREE_SCAN
123115,2008,59247,65000,tcp,-,6e-06,1,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,1024,0,1024,NMAP_XMAS_TREE_SCAN
123116,2009,59247,65129,tcp,-,6e-06,1,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,1024,0,1024,NMAP_XMAS_TREE_SCAN


In [None]:
df.shape

(123117, 85)

In [None]:
df.columns

Index(['Unnamed: 0', 'id.orig_p', 'id.resp_p', 'proto', 'service',
       'flow_duration', 'fwd_pkts_tot', 'bwd_pkts_tot', 'fwd_data_pkts_tot',
       'bwd_data_pkts_tot', 'fwd_pkts_per_sec', 'bwd_pkts_per_sec',
       'flow_pkts_per_sec', 'down_up_ratio', 'fwd_header_size_tot',
       'fwd_header_size_min', 'fwd_header_size_max', 'bwd_header_size_tot',
       'bwd_header_size_min', 'bwd_header_size_max', 'flow_FIN_flag_count',
       'flow_SYN_flag_count', 'flow_RST_flag_count', 'fwd_PSH_flag_count',
       'bwd_PSH_flag_count', 'flow_ACK_flag_count', 'fwd_URG_flag_count',
       'bwd_URG_flag_count', 'flow_CWR_flag_count', 'flow_ECE_flag_count',
       'fwd_pkts_payload.min', 'fwd_pkts_payload.max', 'fwd_pkts_payload.tot',
       'fwd_pkts_payload.avg', 'fwd_pkts_payload.std', 'bwd_pkts_payload.min',
       'bwd_pkts_payload.max', 'bwd_pkts_payload.tot', 'bwd_pkts_payload.avg',
       'bwd_pkts_payload.std', 'flow_pkts_payload.min',
       'flow_pkts_payload.max', 'flow_pkts_payload.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123117 entries, 0 to 123116
Data columns (total 85 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Unnamed: 0                123117 non-null  int64  
 1   id.orig_p                 123117 non-null  int64  
 2   id.resp_p                 123117 non-null  int64  
 3   proto                     123117 non-null  object 
 4   service                   123117 non-null  object 
 5   flow_duration             123117 non-null  float64
 6   fwd_pkts_tot              123117 non-null  int64  
 7   bwd_pkts_tot              123117 non-null  int64  
 8   fwd_data_pkts_tot         123117 non-null  int64  
 9   bwd_data_pkts_tot         123117 non-null  int64  
 10  fwd_pkts_per_sec          123117 non-null  float64
 11  bwd_pkts_per_sec          123117 non-null  float64
 12  flow_pkts_per_sec         123117 non-null  float64
 13  down_up_ratio             123117 non-null  f

In [None]:
df.describe()

Unnamed: 0.1,Unnamed: 0,id.orig_p,id.resp_p,flow_duration,fwd_pkts_tot,bwd_pkts_tot,fwd_data_pkts_tot,bwd_data_pkts_tot,fwd_pkts_per_sec,bwd_pkts_per_sec,...,active.avg,active.std,idle.min,idle.max,idle.tot,idle.avg,idle.std,fwd_init_window_size,bwd_init_window_size,fwd_last_window_size
count,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,...,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0
mean,37035.089248,34639.258738,1014.305092,3.809566,2.268826,1.909509,1.471218,0.82026,351806.3,351762.0,...,148135.4,23535.99,1616655.0,1701956.0,3517644.0,1664985.0,45501.83,6118.905123,2739.776018,751.647514
std,30459.106367,19070.620354,5256.371994,130.005408,22.336565,33.018311,19.635196,32.293948,370764.5,370801.5,...,1613007.0,1477935.0,8809396.0,9252337.0,122950800.0,9007064.0,1091361.0,18716.313861,10018.848534,6310.183843
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,6059.0,17702.0,21.0,1e-06,1.0,1.0,1.0,0.0,74.54354,72.88927,...,0.953674,0.0,0.0,0.0,0.0,0.0,0.0,64.0,0.0,64.0
50%,33100.0,37221.0,21.0,4e-06,1.0,1.0,1.0,0.0,246723.8,246723.8,...,4.053116,0.0,0.0,0.0,0.0,0.0,0.0,64.0,0.0,64.0
75%,63879.0,50971.0,21.0,5e-06,1.0,1.0,1.0,0.0,524288.0,524288.0,...,5.00679,0.0,0.0,0.0,0.0,0.0,0.0,64.0,0.0,64.0
max,94658.0,65535.0,65389.0,21728.335578,4345.0,10112.0,4345.0,10105.0,1048576.0,1048576.0,...,437493100.0,477486200.0,300000000.0,300000000.0,20967770000.0,300000000.0,120802900.0,65535.0,65535.0,65535.0


In [None]:
# 'Unnamed: 0' is a leftover row index saved in the CSV, not a flow feature
df.drop(columns=['Unnamed: 0'], inplace=True)
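
A column named `Unnamed: 0` usually appears when a CSV was written with its row index and no header for that column. As an alternative to dropping it after loading, the column can be skipped at read time. The sketch below uses a tiny in-memory CSV (`csv_text`, a made-up stand-in for the real file) and the callable form of `usecols`; it is an illustration, not the loading step this notebook actually ran.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the real CSV. The empty first header
# cell mimics a saved row index, which pandas names 'Unnamed: 0'.
csv_text = (
    ",id.orig_p,id.resp_p,proto\n"
    "0,38667,1883,tcp\n"
    "1,51143,1883,tcp\n"
    "2,44761,1883,tcp\n"
)

# Default read: the header-less first column is kept as a regular column.
raw = pd.read_csv(io.StringIO(csv_text))

# usecols accepts a callable that is evaluated against each column name;
# columns for which it returns False are never loaded.
clean = pd.read_csv(
    io.StringIO(csv_text),
    usecols=lambda name: not name.startswith("Unnamed"),
)

print(list(raw.columns))
print(list(clean.columns))
```

Filtering at read time keeps the index column out of every later step, so it cannot leak into scaling or modelling by accident.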

In [None]:
len(df['Attack_type'].unique())

12

In [None]:
len(df['proto'].unique())

3

In [None]:
len(df['service'].unique())

10

In [None]:
df['Attack_type'].value_counts()



Attack_type
DOS_SYN_Hping                 94659
Thing_Speak                    8108
ARP_poisioning                 7750
MQTT_Publish                   4146
NMAP_UDP_SCAN                  2590
NMAP_XMAS_TREE_SCAN            2010
NMAP_OS_DETECTION              2000
NMAP_TCP_scan                  1002
DDOS_Slowloris                  534
Wipro_bulb                      253
Metasploit_Brute_Force_SSH       37
NMAP_FIN_SCAN                    28
Name: count, dtype: int64
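
The counts above are heavily skewed towards `DOS_SYN_Hping`. A minimal sketch for quantifying that skew is shown below; it re-enters the printed counts by hand into a Series (`counts` is an illustrative name, not a variable from the notebook) and computes each class's share and the ratio between the largest and smallest class.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
# (these are the counts before duplicate rows are removed).
counts = pd.Series({
    "DOS_SYN_Hping": 94659,
    "Thing_Speak": 8108,
    "ARP_poisioning": 7750,
    "MQTT_Publish": 4146,
    "NMAP_UDP_SCAN": 2590,
    "NMAP_XMAS_TREE_SCAN": 2010,
    "NMAP_OS_DETECTION": 2000,
    "NMAP_TCP_scan": 1002,
    "DDOS_Slowloris": 534,
    "Wipro_bulb": 253,
    "Metasploit_Brute_Force_SSH": 37,
    "NMAP_FIN_SCAN": 28,
})

# Percentage of all rows that belong to each class.
share_pct = (counts / counts.sum() * 100).round(2)

# Ratio between the most and least frequent class.
imbalance_ratio = counts.max() / counts.min()

print(share_pct)
print(f"largest / smallest class: {imbalance_ratio:.0f}x")
```

On the live DataFrame the same shares can be obtained directly with `df['Attack_type'].value_counts(normalize=True)`. With roughly three quarters of the rows in one class, plain accuracy is likely to be a misleading metric for any classifier trained on this data.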

In [None]:
df['proto'].value_counts()

proto
tcp     110427
udp      12633
icmp        57
Name: count, dtype: int64

In [None]:
df['service'].value_counts()

service
-         102861
dns         9753
mqtt        4132
http        3464
ssl         2663
ntp          121
dhcp          50
irc           43
ssh           28
radius         2
Name: count, dtype: int64

In [None]:
df.isna().sum().sum()

0

In [None]:
df_duplicates = df[df.duplicated()]
df_duplicates

Unnamed: 0,id.orig_p,id.resp_p,proto,service,flow_duration,fwd_pkts_tot,bwd_pkts_tot,fwd_data_pkts_tot,bwd_data_pkts_tot,fwd_pkts_per_sec,...,active.std,idle.min,idle.max,idle.tot,idle.avg,idle.std,fwd_init_window_size,bwd_init_window_size,fwd_last_window_size,Attack_type
512,36685,1883,tcp,-,0.0,1,0,0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,502,0,502,MQTT_Publish
513,36685,1883,tcp,-,0.0,1,0,0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,502,0,502,MQTT_Publish
514,36685,1883,tcp,-,0.0,1,0,0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,502,0,502,MQTT_Publish
515,36685,1883,tcp,-,0.0,1,0,0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,502,0,502,MQTT_Publish
4324,5353,5353,udp,dns,0.0,1,0,1,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,Thing_Speak
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119217,5353,5353,udp,dns,0.0,1,0,1,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,NMAP_UDP_SCAN
119267,59342,80,tcp,-,0.0,1,0,0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,64240,0,64240,NMAP_UDP_SCAN
119706,5353,5353,udp,dns,0.0,1,0,1,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,NMAP_UDP_SCAN
119833,5353,5353,udp,dns,0.0,1,0,1,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,NMAP_UDP_SCAN


In [None]:
df.drop_duplicates(inplace = True)

In [None]:
df.duplicated().sum()

0
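
Note that `drop_duplicates` keeps the original row labels, so the index has gaps after rows are removed. The sketch below, on a made-up four-row frame (`toy`), shows how the duplicates are counted and how `reset_index(drop=True)` restores a contiguous index; it assumes the default `keep='first'` behaviour, which is also what the cells above use.

```python
import pandas as pd

# Hypothetical miniature frame: rows 1 and 3 repeat row 0 exactly.
toy = pd.DataFrame({
    "id.orig_p": [36685, 36685, 5353, 36685],
    "proto": ["tcp", "tcp", "udp", "tcp"],
    "Attack_type": ["MQTT_Publish", "MQTT_Publish", "Thing_Speak", "MQTT_Publish"],
})

# duplicated() marks every repeat after the first occurrence.
n_dupes = int(toy.duplicated().sum())

# Dropping leaves index labels 0 and 2; reset_index makes them 0 and 1.
deduped = toy.drop_duplicates()
deduped_reset = deduped.reset_index(drop=True)

print(n_dupes)
print(list(deduped.index), list(deduped_reset.index))
```

Resetting the index matters mainly when rows are later aligned by label, for example when joining scaled features back to labels.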

## Label Encoding Categorical Features

In [None]:
# There are 3 categorical columns in total: proto, service, and Attack_type (the target).

In [None]:
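# Use one encoder per column so each fitted mapping stays available for decoding later
le_attack = LabelEncoder()
le_proto = LabelEncoder()
le_service = LabelEncoder()
df['Attack_type'] = le_attack.fit_transform(df['Attack_type'])
df['proto'] = le_proto.fit_transform(df['proto'])
df['service'] = le_service.fit_transform(df['service'])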
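
`LabelEncoder` assigns integer codes in sorted order of the class names, and a fitted encoder can translate codes back to names with `inverse_transform`. Refitting a single encoder on several columns overwrites its mapping each time, so only the last column can be decoded. The sketch below keeps one encoder per column in a dictionary; the `toy` frame and the `encoders` name are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature frame with the three categorical columns.
toy = pd.DataFrame({
    "proto": ["tcp", "udp", "icmp", "tcp"],
    "service": ["mqtt", "dns", "-", "-"],
    "Attack_type": ["MQTT_Publish", "Thing_Speak", "DOS_SYN_Hping", "DOS_SYN_Hping"],
})
original_labels = toy["Attack_type"].tolist()

# Fit a separate encoder for each column and keep it for later decoding.
encoders = {}
for col in ["proto", "service", "Attack_type"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ lists the names in code order: position i is the name for code i.
mapping = dict(enumerate(encoders["Attack_type"].classes_))

# inverse_transform turns integer codes (e.g. model predictions) back into names.
decoded = encoders["Attack_type"].inverse_transform(toy["Attack_type"])

print(mapping)
print(list(decoded))
```

Keeping the target encoder around is what makes a confusion matrix or classification report readable in terms of attack names rather than integers.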

## Normalization and Scaling

In [None]:
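# RobustScaler centers each column on its median and divides by its interquartile range (IQR).
# Note: it is fitted on every column here, including the encoded proto, service and Attack_type (target).
scaler = RobustScaler()
df_robust = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)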

In [None]:
df_robust

Unnamed: 0,id.orig_p,id.resp_p,proto,service,flow_duration,fwd_pkts_tot,bwd_pkts_tot,fwd_data_pkts_tot,bwd_data_pkts_tot,fwd_pkts_per_sec,...,active.std,idle.min,idle.max,idle.tot,idle.avg,idle.std,fwd_init_window_size,bwd_init_window_size,fwd_last_window_size,Attack_type
0,0.034666,1862.0,0.0,5.0,8002898.50,8.0,4.0,2.0,3.0,-0.470654,...,0.0,2.972918e+07,2.972918e+07,2.972918e+07,2.972918e+07,0.0,64176.0,26847.0,438.0,1.0
1,0.416384,1862.0,0.0,5.0,7970895.00,8.0,4.0,2.0,3.0,-0.470654,...,0.0,2.985528e+07,2.985528e+07,2.985528e+07,2.985528e+07,0.0,64176.0,26847.0,438.0,1.0
2,0.221119,1862.0,0.0,5.0,8031012.25,8.0,4.0,2.0,3.0,-0.470654,...,0.0,2.984215e+07,2.984215e+07,2.984215e+07,2.984215e+07,0.0,64176.0,26847.0,438.0,1.0
3,0.714698,1862.0,0.0,5.0,7990264.75,8.0,4.0,2.0,3.0,-0.470654,...,0.0,2.991377e+07,2.991377e+07,2.991377e+07,2.991377e+07,0.0,64176.0,26847.0,438.0,1.0
4,0.414671,1862.0,0.0,5.0,7975589.50,8.0,4.0,2.0,3.0,-0.470654,...,0.0,2.981470e+07,2.981470e+07,2.981470e+07,2.981470e+07,0.0,64176.0,26847.0,438.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
117917,0.664336,63310.0,0.0,0.0,0.50,0.0,0.0,-1.0,0.0,-0.150609,...,0.0,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.0,960.0,0.0,960.0,7.0
117918,0.664336,64602.0,0.0,0.0,0.75,0.0,0.0,-1.0,0.0,-0.194753,...,0.0,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.0,960.0,0.0,960.0,7.0
117919,0.664336,64659.0,0.0,0.0,0.50,0.0,0.0,-1.0,0.0,-0.150609,...,0.0,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.0,960.0,0.0,960.0,7.0
117920,0.664336,64979.0,0.0,0.0,0.50,0.0,0.0,-1.0,0.0,-0.150609,...,0.0,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.0,960.0,0.0,960.0,7.0
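
Two things stand out in the scaled frame above. First, the `Attack_type` column has been shifted along with the features, so its values no longer match the label-encoder codes. Second, columns whose interquartile range is zero (such as the `idle.*` columns, whose quartiles are all 0) are only centered, because `RobustScaler` falls back to a scale of 1 in that case; their large raw values therefore survive. The sketch below shows one possible way to scale only the feature columns and carry the target through untouched. The `toy` frame and `feature_cols` name are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical miniature frame: one well-spread feature, one feature with
# zero IQR, and an already label-encoded target.
toy = pd.DataFrame({
    "flow_duration": [0.0, 1.0, 2.0, 3.0, 100.0],
    "idle.max": [0.0, 0.0, 0.0, 0.0, 3.0e7],
    "Attack_type": [2, 2, 2, 3, 9],
})

# Every column except the target is treated as a feature.
feature_cols = toy.columns.drop("Attack_type")

scaler = RobustScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy[feature_cols]),
    columns=feature_cols,
    index=toy.index,
)

# Re-attach the target unchanged so its codes still match the label encoder.
scaled["Attack_type"] = toy["Attack_type"]

print(scaled)
```

In a modelling workflow the scaler would normally be fitted on the training split only and then applied to the test split, to avoid leaking test statistics into training.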


## Feature Engineering