
# RT-IoT - Intrusion Detection Systems (IDS)

RT-IoT2022 is a proprietary dataset derived from a real-time IoT infrastructure. It integrates a diverse range of IoT devices with sophisticated network attack methodologies and covers both normal and adversarial network behaviour. It incorporates data from IoT devices such as ThingSpeak-LED, Wipro-Bulb, and MQTT-Temp, as well as simulated attack scenarios involving Brute-Force SSH attacks, DDoS attacks using Hping and Slowloris, and Nmap scan patterns. The bidirectional attributes of the network traffic are captured with the Zeek network monitoring tool and the Flowmeter plugin. Researchers can use RT-IoT2022 to advance Intrusion Detection Systems (IDS) and to develop robust, adaptive security solutions for real-time IoT networks.

**DataSet Link** - https://archive.ics.uci.edu/dataset/942/rt-iot2022

The dataset is tabular, sequential, and multivariate. It contains 123,117 instances and 83 features that describe network flow behaviour, with both real-valued and categorical types. The CSV loaded below has 85 columns: the 83 features, an unnamed index column, and the `Attack_type` target. The data can be used for classification, regression, and clustering, which supports work on anomaly detection, network performance analysis, and model development in the network security domain.

In [189]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from sklearn.model_selection import KFold, cross_val_score
import warnings
warnings.filterwarnings('ignore')

In [190]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Data Loading and Cleaning

In [191]:
df = pd.read_csv('/content/drive/MyDrive/CSV Files/RT_IOT2022')


In [192]:
df.head()

Unnamed: 0.1,Unnamed: 0,id.orig_p,id.resp_p,proto,service,flow_duration,fwd_pkts_tot,bwd_pkts_tot,fwd_data_pkts_tot,bwd_data_pkts_tot,...,active.std,idle.min,idle.max,idle.tot,idle.avg,idle.std,fwd_init_window_size,bwd_init_window_size,fwd_last_window_size,Attack_type
0,0,38667,1883,tcp,mqtt,32.011598,9,5,3,3,...,0.0,29729180.0,29729180.0,29729180.0,29729180.0,0.0,64240,26847,502,MQTT_Publish
1,1,51143,1883,tcp,mqtt,31.883584,9,5,3,3,...,0.0,29855280.0,29855280.0,29855280.0,29855280.0,0.0,64240,26847,502,MQTT_Publish
2,2,44761,1883,tcp,mqtt,32.124053,9,5,3,3,...,0.0,29842150.0,29842150.0,29842150.0,29842150.0,0.0,64240,26847,502,MQTT_Publish
3,3,60893,1883,tcp,mqtt,31.961063,9,5,3,3,...,0.0,29913770.0,29913770.0,29913770.0,29913770.0,0.0,64240,26847,502,MQTT_Publish
4,4,51087,1883,tcp,mqtt,31.902362,9,5,3,3,...,0.0,29814700.0,29814700.0,29814700.0,29814700.0,0.0,64240,26847,502,MQTT_Publish


In [193]:
df.tail()

Unnamed: 0.1,Unnamed: 0,id.orig_p,id.resp_p,proto,service,flow_duration,fwd_pkts_tot,bwd_pkts_tot,fwd_data_pkts_tot,bwd_data_pkts_tot,...,active.std,idle.min,idle.max,idle.tot,idle.avg,idle.std,fwd_init_window_size,bwd_init_window_size,fwd_last_window_size,Attack_type
123112,2005,59247,63331,tcp,-,6e-06,1,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,1024,0,1024,NMAP_XMAS_TREE_SCAN
123113,2006,59247,64623,tcp,-,7e-06,1,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,1024,0,1024,NMAP_XMAS_TREE_SCAN
123114,2007,59247,64680,tcp,-,6e-06,1,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,1024,0,1024,NMAP_XMAS_TREE_SCAN
123115,2008,59247,65000,tcp,-,6e-06,1,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,1024,0,1024,NMAP_XMAS_TREE_SCAN
123116,2009,59247,65129,tcp,-,6e-06,1,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,1024,0,1024,NMAP_XMAS_TREE_SCAN


In [194]:
df.shape

(123117, 85)

In [195]:
df.columns

Index(['Unnamed: 0', 'id.orig_p', 'id.resp_p', 'proto', 'service',
       'flow_duration', 'fwd_pkts_tot', 'bwd_pkts_tot', 'fwd_data_pkts_tot',
       'bwd_data_pkts_tot', 'fwd_pkts_per_sec', 'bwd_pkts_per_sec',
       'flow_pkts_per_sec', 'down_up_ratio', 'fwd_header_size_tot',
       'fwd_header_size_min', 'fwd_header_size_max', 'bwd_header_size_tot',
       'bwd_header_size_min', 'bwd_header_size_max', 'flow_FIN_flag_count',
       'flow_SYN_flag_count', 'flow_RST_flag_count', 'fwd_PSH_flag_count',
       'bwd_PSH_flag_count', 'flow_ACK_flag_count', 'fwd_URG_flag_count',
       'bwd_URG_flag_count', 'flow_CWR_flag_count', 'flow_ECE_flag_count',
       'fwd_pkts_payload.min', 'fwd_pkts_payload.max', 'fwd_pkts_payload.tot',
       'fwd_pkts_payload.avg', 'fwd_pkts_payload.std', 'bwd_pkts_payload.min',
       'bwd_pkts_payload.max', 'bwd_pkts_payload.tot', 'bwd_pkts_payload.avg',
       'bwd_pkts_payload.std', 'flow_pkts_payload.min',
       'flow_pkts_payload.max', 'flow_pkts_payload.

In [196]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123117 entries, 0 to 123116
Data columns (total 85 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Unnamed: 0                123117 non-null  int64  
 1   id.orig_p                 123117 non-null  int64  
 2   id.resp_p                 123117 non-null  int64  
 3   proto                     123117 non-null  object 
 4   service                   123117 non-null  object 
 5   flow_duration             123117 non-null  float64
 6   fwd_pkts_tot              123117 non-null  int64  
 7   bwd_pkts_tot              123117 non-null  int64  
 8   fwd_data_pkts_tot         123117 non-null  int64  
 9   bwd_data_pkts_tot         123117 non-null  int64  
 10  fwd_pkts_per_sec          123117 non-null  float64
 11  bwd_pkts_per_sec          123117 non-null  float64
 12  flow_pkts_per_sec         123117 non-null  float64
 13  down_up_ratio             123117 non-null  f

In [197]:
df.describe()

Unnamed: 0.1,Unnamed: 0,id.orig_p,id.resp_p,flow_duration,fwd_pkts_tot,bwd_pkts_tot,fwd_data_pkts_tot,bwd_data_pkts_tot,fwd_pkts_per_sec,bwd_pkts_per_sec,...,active.avg,active.std,idle.min,idle.max,idle.tot,idle.avg,idle.std,fwd_init_window_size,bwd_init_window_size,fwd_last_window_size
count,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,...,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0,123117.0
mean,37035.089248,34639.258738,1014.305092,3.809566,2.268826,1.909509,1.471218,0.82026,351806.3,351762.0,...,148135.4,23535.99,1616655.0,1701956.0,3517644.0,1664985.0,45501.83,6118.905123,2739.776018,751.647514
std,30459.106367,19070.620354,5256.371994,130.005408,22.336565,33.018311,19.635196,32.293948,370764.5,370801.5,...,1613007.0,1477935.0,8809396.0,9252337.0,122950800.0,9007064.0,1091361.0,18716.313861,10018.848534,6310.183843
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,6059.0,17702.0,21.0,1e-06,1.0,1.0,1.0,0.0,74.54354,72.88927,...,0.953674,0.0,0.0,0.0,0.0,0.0,0.0,64.0,0.0,64.0
50%,33100.0,37221.0,21.0,4e-06,1.0,1.0,1.0,0.0,246723.8,246723.8,...,4.053116,0.0,0.0,0.0,0.0,0.0,0.0,64.0,0.0,64.0
75%,63879.0,50971.0,21.0,5e-06,1.0,1.0,1.0,0.0,524288.0,524288.0,...,5.00679,0.0,0.0,0.0,0.0,0.0,0.0,64.0,0.0,64.0
max,94658.0,65535.0,65389.0,21728.335578,4345.0,10112.0,4345.0,10105.0,1048576.0,1048576.0,...,437493100.0,477486200.0,300000000.0,300000000.0,20967770000.0,300000000.0,120802900.0,65535.0,65535.0,65535.0


In [198]:
df.drop(['Unnamed: 0'], axis = 1 , inplace = True)

In [199]:
len(df['Attack_type'].unique())

12

In [200]:
len(df['proto'].unique())

3

In [201]:
len(df['service'].unique())

10

In [202]:
df['Attack_type'].value_counts()



Attack_type
DOS_SYN_Hping                 94659
Thing_Speak                    8108
ARP_poisioning                 7750
MQTT_Publish                   4146
NMAP_UDP_SCAN                  2590
NMAP_XMAS_TREE_SCAN            2010
NMAP_OS_DETECTION              2000
NMAP_TCP_scan                  1002
DDOS_Slowloris                  534
Wipro_bulb                      253
Metasploit_Brute_Force_SSH       37
NMAP_FIN_SCAN                    28
Name: count, dtype: int64
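
The counts above can be turned into class shares and an imbalance ratio. The following is a minimal sketch; the `counts` Series is typed in by hand from the printed output rather than computed from `df`, and the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above (before de-duplication).
counts = pd.Series({
    "DOS_SYN_Hping": 94659,
    "Thing_Speak": 8108,
    "ARP_poisioning": 7750,
    "MQTT_Publish": 4146,
    "NMAP_UDP_SCAN": 2590,
    "NMAP_XMAS_TREE_SCAN": 2010,
    "NMAP_OS_DETECTION": 2000,
    "NMAP_TCP_scan": 1002,
    "DDOS_Slowloris": 534,
    "Wipro_bulb": 253,
    "Metasploit_Brute_Force_SSH": 37,
    "NMAP_FIN_SCAN": 28,
})

shares = (counts / counts.sum()).round(4)       # fraction of all flows per class
imbalance_ratio = counts.max() / counts.min()   # largest class vs. smallest class

print(shares.head(3))
print(f"Imbalance ratio: {imbalance_ratio:.0f}:1")
```

The largest class accounts for roughly three quarters of all rows, which is why accuracy alone says little about the rare attack types.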

In [203]:
df['proto'].value_counts()

proto
tcp     110427
udp      12633
icmp        57
Name: count, dtype: int64

In [204]:
df['service'].value_counts()

service
-         102861
dns         9753
mqtt        4132
http        3464
ssl         2663
ntp          121
dhcp          50
irc           43
ssh           28
radius         2
Name: count, dtype: int64

In [205]:
df.isna().sum().sum()

0

In [206]:
df_duplicates = df[df.duplicated()]
df_duplicates

Unnamed: 0,id.orig_p,id.resp_p,proto,service,flow_duration,fwd_pkts_tot,bwd_pkts_tot,fwd_data_pkts_tot,bwd_data_pkts_tot,fwd_pkts_per_sec,...,active.std,idle.min,idle.max,idle.tot,idle.avg,idle.std,fwd_init_window_size,bwd_init_window_size,fwd_last_window_size,Attack_type
512,36685,1883,tcp,-,0.0,1,0,0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,502,0,502,MQTT_Publish
513,36685,1883,tcp,-,0.0,1,0,0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,502,0,502,MQTT_Publish
514,36685,1883,tcp,-,0.0,1,0,0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,502,0,502,MQTT_Publish
515,36685,1883,tcp,-,0.0,1,0,0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,502,0,502,MQTT_Publish
4324,5353,5353,udp,dns,0.0,1,0,1,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,Thing_Speak
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119217,5353,5353,udp,dns,0.0,1,0,1,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,NMAP_UDP_SCAN
119267,59342,80,tcp,-,0.0,1,0,0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,64240,0,64240,NMAP_UDP_SCAN
119706,5353,5353,udp,dns,0.0,1,0,1,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,NMAP_UDP_SCAN
119833,5353,5353,udp,dns,0.0,1,0,1,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,NMAP_UDP_SCAN


In [207]:
df.drop_duplicates(inplace = True)

In [208]:
df.duplicated().sum()

0

## Label Encoding Categorical Features

In [209]:
# There are three categorical columns in total, including the target: proto, service, and Attack_type (target).

In [210]:
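# Use one encoder per column so that each mapping can be inverted later.
attack_le, proto_le, service_le = LabelEncoder(), LabelEncoder(), LabelEncoder()
df['Attack_type'] = attack_le.fit_transform(df['Attack_type'])
df['proto'] = proto_le.fit_transform(df['proto'])
df['service'] = service_le.fit_transform(df['service'])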
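
`LabelEncoder` assigns integer codes in sorted order of the class names, so the numeric labels in the classification reports further down can be mapped back to attack names. A minimal sketch, with the label list typed in from the `value_counts()` output and `code_to_label` as an illustrative name:

```python
from sklearn.preprocessing import LabelEncoder

# The 12 attack labels as printed by value_counts() earlier in the notebook.
attack_labels = [
    "DOS_SYN_Hping", "Thing_Speak", "ARP_poisioning", "MQTT_Publish",
    "NMAP_UDP_SCAN", "NMAP_XMAS_TREE_SCAN", "NMAP_OS_DETECTION",
    "NMAP_TCP_scan", "DDOS_Slowloris", "Wipro_bulb",
    "Metasploit_Brute_Force_SSH", "NMAP_FIN_SCAN",
]

attack_le = LabelEncoder().fit(attack_labels)

# classes_ is sorted, so its position is the integer code used in the reports.
code_to_label = {code: str(label) for code, label in enumerate(attack_le.classes_)}
for code, label in code_to_label.items():
    print(code, label)
```

Under this mapping, class 2 is `DOS_SYN_Hping` (the majority class), and classes 6 and 9 are `NMAP_OS_DETECTION` and `NMAP_XMAS_TREE_SCAN`.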

## Normalization and Scaling

In [211]:
X = df.drop(['Attack_type'], axis = 1)
y = df['Attack_type']

In [212]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
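
The split above is not stratified, so classes with only a few dozen rows (for example the 28 `NMAP_FIN_SCAN` flows) can end up unevenly divided between train and test. A sketch of a stratified split on toy data follows; `X_toy` and `y_toy` are made-up stand-ins, and whether stratification would change the results reported in this notebook has not been tested.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with one rare class (illustrative data, not RT-IoT2022).
y_toy = np.repeat([0, 1, 2], [900, 80, 20])
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# With stratify=, every class contributes 30% of its rows to the test split.
print("train:", np.bincount(y_tr))
print("test: ", np.bincount(y_te))
```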

In [213]:
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
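
With its default settings, `RobustScaler` subtracts the median of each column and divides by the interquartile range (75th minus 25th percentile), which makes it less sensitive to outliers than mean/variance scaling. A small check of that formula on a toy column (the values are made up):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # toy column with one outlier

# Manual version: (x - median) / IQR
median = np.median(x)
q25, q75 = np.percentile(x, [25, 75])
manual = (x - median) / (q75 - q25)

scaled = RobustScaler().fit_transform(x)

print(manual.ravel())
print("matches RobustScaler:", np.allclose(manual, scaled))
```

The outlier stays large after scaling, but it does not distort the scale of the other values, because the median and IQR ignore it.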

## ML Model Development and Evaluation without Feature Engineering


In [214]:
log_model = LogisticRegression()
log_model.fit(X_train_scaled, y_train)

In [215]:
y_pred = log_model.predict(X_test_scaled)

In [216]:
accuracy = accuracy_score(y_test,y_pred)
print('Accuracy =', accuracy)

Accuracy = 0.5037736382395341


In [233]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.88      0.16      0.26      2285
           1       0.02      0.43      0.05       158
           2       1.00      0.49      0.66     27077
           3       0.98      0.94      0.96      1263
           4       0.00      0.00      0.00        12
           5       0.00      0.00      0.00         8
           6       0.00      0.00      0.00       577
           7       0.00      0.00      0.00       309
           8       0.35      0.91      0.50       760
           9       0.00      0.00      0.00       579
          10       0.14      0.98      0.25      2287
          11       0.13      0.05      0.07        62

    accuracy                           0.50     35377
   macro avg       0.29      0.33      0.23     35377
weighted avg       0.87      0.50      0.58     35377



Results:

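  Accuracy: ~50.38%

  The model struggles with the imbalanced classes: five classes (4, 5, 6, 7, and 9) have zero recall, and the macro-averaged F1 is only 0.23. Even the majority class (2) reaches a recall of just 0.49.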
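
The gap between the macro and weighted averages in the report comes from how they are computed: the macro average treats every class equally, while the weighted average weights each class by its support. A short sketch that recomputes both from the per-class F1 values printed above (numbers copied by hand from the report):

```python
import numpy as np

# Per-class F1 and support copied from the logistic regression report above.
f1 = np.array([0.26, 0.05, 0.66, 0.96, 0.00, 0.00,
               0.00, 0.00, 0.50, 0.00, 0.25, 0.07])
support = np.array([2285, 158, 27077, 1263, 12, 8,
                    577, 309, 760, 579, 2287, 62])

macro_f1 = f1.mean()                            # every class counts equally
weighted_f1 = np.average(f1, weights=support)   # large classes dominate

print(f"macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}")
```

For imbalanced intrusion data, the macro average is usually the more informative of the two, because a rare attack class counts as much as the dominant one.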

### KNN

In [217]:
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train_scaled,y_train)
y_pred_knn = knn_model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred_knn)
print('Accuracy =', accuracy)

Accuracy = 0.9754360177516466


In [234]:
print(classification_report(y_test,y_pred_knn))

              precision    recall  f1-score   support

           0       0.94      0.94      0.94      2285
           1       0.88      0.92      0.90       158
           2       1.00      1.00      1.00     27077
           3       0.99      0.99      0.99      1263
           4       0.64      0.75      0.69        12
           5       1.00      0.75      0.86         8
           6       0.52      0.52      0.52       577
           7       1.00      1.00      1.00       309
           8       0.98      0.96      0.97       760
           9       0.53      0.54      0.53       579
          10       0.94      0.95      0.95      2287
          11       0.75      0.44      0.55        62

    accuracy                           0.98     35377
   macro avg       0.85      0.81      0.83     35377
weighted avg       0.98      0.98      0.98     35377



Results:

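  Accuracy: ~97.54%

  KNN performs significantly better (macro F1 of 0.83 versus 0.23 for logistic regression). It is not uniformly strong, however: classes 6 and 9 stay near 0.52 F1, and class 11 has a recall of only 0.44.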
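
In the KNN report above, classes 6 and 9 score about 0.52 F1 while most other classes exceed 0.9. A confusion matrix shows where such errors go. The sketch below uses toy labels in which classes 6 and 9 are deliberately mixed up with each other; it illustrates the tool and makes no claim about what the real matrix looks like.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative only): classes 6 and 9 are confused with each other.
y_true_toy = np.array([2, 2, 2, 6, 6, 6, 6, 9, 9, 9, 9])
y_pred_toy = np.array([2, 2, 2, 6, 6, 9, 9, 9, 9, 6, 6])

labels = [2, 6, 9]
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)

# Rows are true classes, columns are predicted classes.
print(cm)
```

In the notebook itself, the equivalent call would be `confusion_matrix(y_test, y_pred_knn)`.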

## With Feature Engineering

In [218]:
df['Attack_type'].value_counts()

Attack_type
2     90089
10     7654
0      7625
3      4142
8      2584
9      2010
6      2000
7      1002
1       533
11      219
4        36
5        28
Name: count, dtype: int64

We can see a clear imbalance in the class distribution. Let's handle that.

### SMOTE

In [219]:
smote = SMOTE(sampling_strategy='auto', k_neighbors=5, random_state=42)
X_smote, y_smote = smote.fit_resample(X,y)
print('Original dataset shape', len(df))
print('Resampled dataset shape', len(y_smote))

Original dataset shape 117922
Resampled dataset shape 1081068
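
With `sampling_strategy='auto'`, SMOTE raises every class to the size of the largest one, which explains the resampled size: 12 classes × 90,089 rows = 1,081,068. Because the resampling here happens before `train_test_split`, synthetic points interpolated from the same originals can land in both the training and the test split. A sketch of the alternative order (split first, then oversample only the training part) follows. To stay dependency-light it uses a hand-written interpolation helper, `smote_like_oversample`, which is hypothetical and only a simplified stand-in for imblearn's `SMOTE`; the data is a toy example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X, y, k=5, seed=0):
    """Simplified SMOTE-style oversampling (hypothetical helper).

    New points are interpolated between a sample and one of its k nearest
    same-class neighbours until every class matches the largest class.
    Assumes each class has at least two distinct samples.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for cls, n in zip(classes, counts):
        n_new = target - n
        if n_new == 0:
            continue
        X_cls = X[y == cls]
        n_nbrs = min(k, len(X_cls) - 1)
        nn = NearestNeighbors(n_neighbors=n_nbrs + 1).fit(X_cls)
        # Column 0 is the point itself, so keep columns 1..n_nbrs.
        nbr_idx = nn.kneighbors(X_cls, return_distance=False)[:, 1:]
        base = rng.integers(0, len(X_cls), size=n_new)
        pick = nbr_idx[base, rng.integers(0, n_nbrs, size=n_new)]
        gap = rng.random((n_new, 1))  # position along the connecting segment
        X_parts.append(X_cls[base] + gap * (X_cls[pick] - X_cls[base]))
        y_parts.append(np.full(n_new, cls))
    return np.vstack(X_parts), np.concatenate(y_parts)


# Toy two-class data: 200 majority rows, 20 minority rows.
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (20, 2))])
y_toy = np.array([0] * 200 + [1] * 20)

# 1) Split first, so the test set contains only real samples.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
# 2) Oversample the training part only.
X_tr_res, y_tr_res = smote_like_oversample(X_tr, y_tr)

print("train before/after:", np.bincount(y_tr), "->", np.bincount(y_tr_res))
print("test (untouched):  ", np.bincount(y_te))
```

With this order, the test metrics describe performance on the original class distribution, so they remain comparable with the runs that use no resampling.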


In [220]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X_smote, y_smote, test_size = 0.3, random_state = 42)

In [221]:
X1_train_scaled = scaler.fit_transform(X1_train)
X1_test_scaled = scaler.transform(X1_test)

In [222]:
log_model = LogisticRegression()
log_model.fit(X1_train_scaled, y1_train)

In [223]:
y1_pred = log_model.predict(X1_test_scaled)

In [224]:
accuracy = accuracy_score(y1_test,y1_pred)
print('Accuracy =', accuracy)

Accuracy = 0.34728247631204884


In [235]:
print(classification_report(y1_test,y1_pred))

              precision    recall  f1-score   support

           0       0.43      0.37      0.40     27249
           1       0.75      0.45      0.56     27009
           2       0.26      0.88      0.40     26713
           3       0.37      0.95      0.53     27089
           4       0.04      0.03      0.04     27174
           5       0.00      0.00      0.00     26938
           6       0.00      0.00      0.00     27135
           7       0.43      1.00      0.60     27216
           8       0.11      0.02      0.03     26999
           9       0.00      0.00      0.00     26918
          10       0.72      0.07      0.14     27107
          11       0.35      0.40      0.37     26774

    accuracy                           0.35    324321
   macro avg       0.29      0.35      0.26    324321
weighted avg       0.29      0.35      0.26    324321



Results:

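  Accuracy: ~34.73%

  With SMOTE, recall improves for some minority classes (class 11 rises from 0.05 to 0.40) but stays at or near zero for others (classes 4 and 5), and overall accuracy decreases. This test set is itself resampled and balanced, so its accuracy is not directly comparable with the runs on the original class distribution.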

### TOMEK LINKS

In [225]:
tl = TomekLinks(sampling_strategy='majority')
X_tl, y_tl = tl.fit_resample(X,y)
print('Original dataset shape', len(df))
print('Resampled dataset shape', len(y_tl))

Original dataset shape 117922
Resampled dataset shape 117921
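
A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. With `sampling_strategy='majority'`, only the majority-class member of each link is removed, and here that amounted to a single row (117,922 to 117,921). A small sketch of the definition on toy points, with `find_tomek_links` as a hypothetical helper rather than imblearn's implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def find_tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    with different labels. Illustrative helper; assumes no duplicate points."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # Column 0 is the point itself, column 1 its nearest other point.
    nearest = nn.kneighbors(X, return_distance=False)[:, 1]
    links = []
    for i, j in enumerate(nearest):
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links


# Toy 1-D data: points 2 and 3 sit right next to each other across the class boundary.
X_toy = np.array([[0.0], [0.1], [1.0], [1.05], [3.0], [3.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(find_tomek_links(X_toy, y_toy))
```

Only the pair at the class boundary qualifies; the other mutual neighbours share a label and are left alone.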


In [239]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X_tl, y_tl, test_size = 0.3, random_state = 42)

In [240]:
X2_train_scaled = scaler.fit_transform(X2_train)
X2_test_scaled = scaler.transform(X2_test)

In [241]:
log_model = LogisticRegression()
log_model.fit(X2_train_scaled, y2_train)

In [242]:
y2_pred = log_model.predict(X2_test_scaled)

In [243]:
accuracy = accuracy_score(y2_test, y2_pred)
print('Accuracy =', accuracy)

Accuracy = 0.5763631738134947


In [244]:
print(classification_report(y2_test,y2_pred))

              precision    recall  f1-score   support

           0       0.83      0.16      0.26      2285
           1       0.92      0.22      0.36       158
           2       1.00      0.59      0.74     27060
           3       0.97      0.94      0.96      1263
           4       0.00      0.00      0.00        10
           5       0.00      0.00      0.00         5
           6       0.00      0.00      0.00       574
           7       0.00      0.00      0.00       330
           8       0.35      0.91      0.51       774
           9       0.00      0.00      0.00       569
          10       0.14      0.98      0.25      2287
          11       0.22      0.08      0.12        62

    accuracy                           0.58     35377
   macro avg       0.37      0.32      0.27     35377
weighted avg       0.87      0.58      0.65     35377



Results:

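  Accuracy: ~57.64%

  Tomek Links removed only one majority-class sample (117,922 to 117,921 rows), so the class imbalance is essentially unchanged. Accuracy is higher than the ~50% baseline, which may reflect the slightly different train/test split more than the resampling itself. Recall is still zero for classes 4, 5, 6, 7, and 9.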

### With Validation

In [231]:
log_model = LogisticRegression( penalty='l2')

k_fold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(log_model, X2_train_scaled, y2_train, cv=k_fold, scoring='accuracy')

avg_score = np.mean(cv_scores)

print("Cross-validation scores on training set:", cv_scores)
print("Average Score on training set:", avg_score)

log_model.fit(X2_train_scaled, y2_train)
test_score = log_model.score(X2_test_scaled, y2_test)
print("Test set accuracy:", test_score)

Cross-validation scores on training set: [0.57405052 0.58022897 0.57913865 0.20612999 0.57729586]
Average Score on training set: 0.5033687963638809
Test set accuracy: 0.5763631738134947
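
One fold above scores about 0.21 while the others are near 0.58. Two things may contribute, though neither is confirmed here: plain `KFold` does not preserve class proportions in each fold, and the `warnings.filterwarnings('ignore')` call at the top of the notebook would hide any convergence warning from `LogisticRegression`. The sketch below, on synthetic data, uses `StratifiedKFold`, a larger `max_iter`, and a pipeline so that each fold fits its own scaler. The data and the `max_iter=1000` value are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic imbalanced 3-class problem (stand-in for the real data).
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=42,
)

# Scaling inside the pipeline is refit on the training part of every fold.
model = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="accuracy")

print("fold scores:", np.round(scores, 3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Reporting the standard deviation next to the mean makes an unstable fold visible at a glance.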


Results:

  Cross-validation scores vary widely: four folds score about 57-58%, but one drops to ~20.61%, which pulls the average down to ~50.34%.

  Test set accuracy is unchanged at ~57.64%.

### KNN

In [237]:
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X2_train_scaled,y2_train)
y2_pred_knn = knn_model.predict(X2_test_scaled)
accuracy = accuracy_score(y2_test, y2_pred_knn)
print('Accuracy =', accuracy)

Accuracy = 0.9753794838454363


In [238]:
print(classification_report(y2_test,y2_pred_knn))

              precision    recall  f1-score   support

           0       0.94      0.94      0.94      2285
           1       0.85      0.92      0.88       158
           2       1.00      1.00      1.00     27060
           3       0.99      0.99      0.99      1263
           4       0.62      0.80      0.70        10
           5       1.00      0.80      0.89         5
           6       0.53      0.51      0.52       574
           7       0.99      1.00      0.99       330
           8       0.98      0.96      0.97       774
           9       0.52      0.54      0.53       569
          10       0.94      0.95      0.95      2287
          11       0.75      0.44      0.55        62

    accuracy                           0.98     35377
   macro avg       0.84      0.82      0.83     35377
weighted avg       0.98      0.98      0.98     35377



Results:

  Accuracy: ~97.54%

  KNN performance is essentially unchanged with Tomek Links, as expected given that only one sample was removed.

**Conclusion**

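  **KNN** is **highly effective** for this dataset, consistently achieving **high accuracy** (~97.5%), although a few classes (6, 9, and 11) remain hard to separate.

  **Logistic Regression** struggles with **imbalanced data**. In these runs, **SMOTE** lowered accuracy to ~34.7% on a resampled test set, and **Tomek Links** removed only one sample, so neither clearly **improved performance**.

  **Feature engineering** and **data balancing techniques** remain **important** for building robust IDS models in IoT networks, but they need to be applied and evaluated carefully, for example by resampling only the training split.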

**Next steps** could involve exploring more sophisticated models like **Random Forests**, **Gradient Boosting**, or **Deep Learning techniques** to further enhance detection accuracy.