# Introduction

Modern computer networks are continuously exposed to a wide range of cyber threats, making intrusion detection a critical component of network security. To develop and evaluate robust Intrusion Detection Systems (IDS), high-quality and up-to-date benchmark datasets are essential. The **UNSW-NB15** dataset, created by the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) in 2015, is one of the most comprehensive datasets designed for research in network traffic analysis and intrusion detection.

The dataset is a **hybrid of real modern normal activity and synthetic contemporary attack behaviour**, generated using the IXIA PerfectStorm tool. It includes both benign traffic and **nine categories of contemporary attacks**: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. Each network flow is described by **49 features** (including the two label columns), covering packet-level characteristics, protocol behavior, temporal patterns, and content-based attributes. These attributes enable detailed exploration of network behavior and serve as predictors for machine learning models.

The objective of this project is to perform **Exploratory Data Analysis (EDA)** and develop a **machine learning–based intrusion detection model** using the UNSW-NB15 dataset. The analysis will focus on understanding traffic patterns, identifying anomalies, evaluating feature importance, and building classification algorithms capable of distinguishing between normal and malicious network activity. This project also aims to gain deeper insights into the dataset’s structure, the nature of different attack categories, and the challenges associated with real-world intrusion detection.

---

## Dataset Overview

The UNSW-NB15 dataset is a modern intrusion detection benchmark created by the Australian Centre for Cyber Security (ACCS). It contains a mix of normal network traffic and nine contemporary attack types. Each network flow is represented by 49 features covering protocol behavior, packet-level statistics, timing information, and connection patterns.

The dataset is widely used for evaluating ML-based intrusion detection systems because it includes realistic traffic, updated attack families, and detailed flow-level attributes.
Attack categories include: Fuzzers, Analysis, Backdoor, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms.

---

## Feature Description

The 49 features in UNSW-NB15 belong to several logical groups:

### Flow Attributes
Basic connection information such as source/destination IP, ports, protocol, service, and connection state.

### Content & Packet-Level Features
Statistics describing packets and bytes exchanged, flow rates, TTL values, and TCP handshake timings (e.g., ```sbytes```, ```dbytes```, ```Spkts```, ```ackdat```).

### Time-Based Features

Connection start/end times and flow duration (```Stime```, ```Ltime```, ```dur```).

### Host-Based Behavioral Features
Count-based indicators capturing how a flow relates to previous flows (e.g., ```ct_srv_src```, ```ct_dst_ltm```, ```ct_src_dport_ltm```). These features help detect scanning, brute-force attempts, and repeated malicious behavior.
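
To make the idea of a count-based feature concrete, here is a minimal sketch of how a `ct_dst_ltm`-style value could be derived. It assumes the usual reading of that feature (the number of flows sharing a destination address within the most recent 100 flows, ordered by last time). The function name `rolling_same_key_count`, the decision to count the current flow itself, and the toy addresses are illustrative assumptions, not the dataset authors' implementation.

```python
from collections import Counter, deque


def rolling_same_key_count(keys, window=100):
    """For each flow, count how many of the most recent `window` flows
    (including the current one) share the same key, e.g. the destination IP.

    `keys` must already be ordered by time. Illustrative sketch only.
    """
    recent = deque()      # keys currently inside the window
    counts = Counter()    # occurrences of each key inside the window
    result = []
    for key in keys:
        recent.append(key)
        counts[key] += 1
        if len(recent) > window:
            # Evict the oldest flow so the window never exceeds `window` items
            oldest = recent.popleft()
            counts[oldest] -= 1
        result.append(counts[key])
    return result


# Toy example: one destination is contacted repeatedly (scan-like pattern)
dst_ips = ["10.0.0.5", "10.0.0.9", "10.0.0.5", "10.0.0.5", "10.0.0.7", "10.0.0.5"]
ct_dst = rolling_same_key_count(dst_ips, window=3)
print(ct_dst)
```

A source that hammers one destination keeps this count high, which is why such features tend to help with scans and brute-force attempts.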

### Labels
```Label```: 0 = Normal, 1 = Attack

```attack_cat```: Type of attack

# *Library Imports and Data Loading*
<div style="
    margin: 16px 0;
    text-align: center;
    font-size: 80%;
    font-weight: bold;
    color: #2c3e50;
    border-bottom: 4px solid #6c63ff;
">
</div>

## **Import Libraries**

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# cleaner output
import warnings
warnings.filterwarnings("ignore")


# Visuals
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
import plotly.express as px

# Models and Utils
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split


from xgboost import XGBClassifier

from sklearn import metrics

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


## *Load and Inspect Data*

In [None]:
# Load data

# Feature names
feature_names = pd.read_csv("/kaggle/input/unsw-nb15/NUSW-NB15_features.csv" , encoding = 'cp1252')

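# Total dataset - about 2.54M rows, split across four files.
# The raw part files carry no header row, so header=None keeps the first record
# of each file as data; low_memory=False avoids chunk-wise mixed-dtype guessing.
first_part = pd.read_csv("/kaggle/input/unsw-nb15/UNSW-NB15_1.csv", header=None, low_memory=False)
sec_part = pd.read_csv("/kaggle/input/unsw-nb15/UNSW-NB15_2.csv", header=None, low_memory=False)
third_part = pd.read_csv("/kaggle/input/unsw-nb15/UNSW-NB15_3.csv", header=None, low_memory=False)
fourth_part = pd.read_csv("/kaggle/input/unsw-nb15/UNSW-NB15_4.csv", header=None, low_memory=False)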

# Attack Events
events_list = pd.read_csv("/kaggle/input/unsw-nb15/UNSW-NB15_LIST_EVENTS.csv")

In [None]:
# Feature names and description 
feature_names

In [None]:
feature_names["Name"] = feature_names["Name"].replace("ct_src_ ltm" , "ct_src_ltm")

In [None]:
# Shape of each part
print("Shape : " , first_part.shape)
print("Shape : " , sec_part.shape)
print("Shape : " , third_part.shape)
print("Shape : " , fourth_part.shape)

In [None]:
# preview of data (one part)
first_part.head()

In [None]:
# preview of data (one part)
sec_part.head()

In [None]:
# preview of data (one part)
third_part.head()

In [None]:
# preview of data (one part)
fourth_part.head()

In [None]:
# LIST_EVENTS (list of all attack events)
events_list.head()

In [None]:
# Setting column names for the dataset
first_part.columns = feature_names["Name"]
sec_part.columns = feature_names["Name"]
third_part.columns = feature_names["Name"]
fourth_part.columns = feature_names["Name"]

In [None]:
# Combining data
df = pd.concat([first_part , sec_part , third_part , fourth_part] , ignore_index = True)

df

# *Data Quality Checks*
<div style="
    margin: 16px 0;
    text-align: center;
    font-size: 180%;
    font-weight: bold;
    color: #2c3e50;
    border-bottom: 4px solid #6c63ff;
">
</div>

In [None]:
# Shape of data (after combining all parts)
df.shape

In [None]:
# characteristics of the dataset
df.info(memory_usage = "deep")

In [None]:
# Summary Metrics
df.describe().T

In [None]:
# Categorical and numerical features in the data.
print("No.of categorical columns :" , len(df.select_dtypes(["object"]).columns))
print("No.of numerical features in the data :", len(df.select_dtypes(["int" , "float"]).columns))

## *Missing Value analysis*
<div style="
    margin: 20px 0;
    text-align: center;
    font-size: 180%;
    font-weight: bold;
    color: #e5e7eb;
    border-bottom: 3px solid #22d3ee;
">
</div>

In [None]:
# Null values on columns 
print("Columns with missing values and count: ")
missing_counts = df.isna().sum()  # computed once instead of scanning the frame twice
missing_counts[missing_counts > 0]

### 1. **Missing values on ct_flw_http_mthd**

In [None]:
df["ct_flw_http_mthd"].value_counts(dropna = False)

In [None]:
# Rows with a missing HTTP-method count: inspect destination port and protocol
df.loc[df["ct_flw_http_mthd"].isna() , ["dsport" , "proto"]]

In [None]:
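# Missing HTTP-method counts on flows whose application service is HTTP.
# `proto` holds transport/network protocols (tcp, udp, ...), so HTTP is matched on `service`.
df.loc[df["ct_flw_http_mthd"].isna() & (df["service"] == "http"), "proto"].value_counts()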

In [None]:
df.loc[df["ct_flw_http_mthd"].isna() , ["Label"]].value_counts()

In [None]:
# Fill missing ct_flw_http_mthd values with 0 (plain assignment instead of a chained inplace fillna)
df["ct_flw_http_mthd"] = df["ct_flw_http_mthd"].fillna(0)
# Convert from float to int
df["ct_flw_http_mthd"] = df["ct_flw_http_mthd"].astype(int)

### 2. **Missing values on is_ftp_login** 

In [None]:
df["is_ftp_login"].value_counts(dropna = False)

In [None]:
df[df["is_ftp_login"].isna()]["dsport"].value_counts().head(20)

In [None]:
df[df["dsport"] == 21]["is_ftp_login"].value_counts(dropna=False)

In [None]:
df.loc[df["is_ftp_login"].isna() , ["Label"]].value_counts()

In [None]:
# Fill missing values with 0, then binarize (any value > 0 counts as a login)
df["is_ftp_login"] = df["is_ftp_login"].fillna(0)
df["is_ftp_login"] = (df["is_ftp_login"] > 0).astype(int)

### 3. **Missing values on attack_cat** 

In [None]:
df["attack_cat"].value_counts(dropna = False)

In [None]:
print( "Count :" , df[(df["attack_cat"].isna()) & (df["Label"] == 0)].shape[0]) 

In [None]:
# Fill missing attack_cat with "Normal" (these rows all have Label == 0), then lowercase
df["attack_cat"] = df["attack_cat"].fillna("Normal")
df["attack_cat"] = df["attack_cat"].str.lower()

In [None]:
df.head()

In [None]:
# Verifying that the missing values are filled
df.isna().sum()

### Inference about missing values:

##### Number of missing values per column:
- `ct_flw_http_mthd`: 1,146,790 missing values.
- `is_ftp_login`: 1,227,022 missing values.
- `attack_cat`: 1,959,771 missing values.

---

**ct_flw_http_mthd**
    
- Cross-checked the missing rows against the destination port and protocol columns (```dsport```, ```proto```) to see whether they belong to HTTP traffic.
- Filled missing values as **0**.
    
---

**is_ftp_login**

- Verified with ```dsport``` to check the port number **21** (```service - ftp```).
- Filled missing values as **0**.

---

**attack_cat**

- Verified with ```Label``` that the missing values in the ```attack_cat``` are only for ```Label = 0```.
- Filled missing values with **normal** (the code fills `Normal` and then lowercases the column).
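
The per-column checks above can be repeated for any column by cross-tabulating missingness against the class label. The helper below is a sketch under stated assumptions: `missing_by_label` and the toy frame are hypothetical, and only the column names `attack_cat` and `Label` come from the notebook.

```python
import numpy as np
import pandas as pd


def missing_by_label(frame, column, label_col="Label"):
    """Count missing vs. present values of `column` for each label value."""
    status = (
        frame[column].isna()
        .map({True: "missing", False: "present"})
        .rename("status")
    )
    return pd.crosstab(frame[label_col], status)


# Toy data that mimics the pattern seen in `attack_cat`:
# the category is missing only for normal traffic (Label == 0).
toy = pd.DataFrame({
    "attack_cat": [np.nan, np.nan, "dos", "generic", np.nan],
    "Label": [0, 0, 1, 1, 0],
})
table = missing_by_label(toy, "attack_cat")
print(table)
```

If the `missing` column is zero for every attack row, filling the gaps with a single "normal" category does not hide any attack information.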

## *Categorical & Discrete Feature Analysis: Unique Value Exploration*
<div style="
    margin: 20px 0;
    text-align: center;
    font-size: 180%;
    font-weight: bold;
    color: #e5e7eb;
    border-bottom: 3px solid #22d3ee;
">
</div>




In [None]:
print("No.of categorical or discrete Features: " , len(df.select_dtypes(["object" , "int"]).columns))

In [None]:
df.select_dtypes(["object" , "int"]).columns

In [None]:
for col in df.select_dtypes(["object" , "int"]).columns:
    print(f"\nUnique values of {col}.")
    print(df[col].unique())
    print(f"\n Total no.of unique values in {col} = {df[col].nunique()}.")
    print("\n","-"*45)

### Inference about unique value analysis: 
#### **srcip and dstip:**

The IP address features (```srcip```, ```dstip```) exhibit extremely high cardinality and function as identifier-like variables rather than behavioral features; therefore, they are excluded from further analysis and modeling to prevent overfitting and ensure better generalization.

---

#### **sport and dsport:**

- The `sport` and `dsport` features exhibit very high cardinality (>100k unique values), so using them as raw numeric values would invite overfitting and impose a false numeric ordering on port numbers.

- One option is to map ports into **semantically meaningful categories** (e.g., well-known, registered, dynamic), which preserves behavioral patterns while reducing dimensionality and noise. In this notebook the raw port columns are dropped instead.
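
The port-category idea mentioned above could look roughly like the sketch below. It uses the standard IANA ranges (0-1023 well-known, 1024-49151 registered, 49152-65535 dynamic). The function name `port_category`, the `"unknown"` bucket, and the assumption that raw port values may arrive as integers, decimal strings, hexadecimal strings, or placeholders such as `-` are illustrative choices, not part of the original notebook.

```python
def port_category(port):
    """Map a raw port value to an IANA range name.

    Accepts ints or strings; hexadecimal strings such as "0x000b" are
    parsed too. Anything that cannot be parsed becomes "unknown".
    """
    text = str(port).strip().lower()
    try:
        number = int(text, 16) if text.startswith("0x") else int(float(text))
    except (ValueError, OverflowError):
        return "unknown"
    if 0 <= number <= 1023:
        return "well_known"
    if 1024 <= number <= 49151:
        return "registered"
    if 49152 <= number <= 65535:
        return "dynamic"
    return "unknown"


examples = [80, "443", 8080, 51234, "0x000b", "-"]
categories = [port_category(p) for p in examples]
print(categories)
```

On the DataFrame this would have to run before the port columns are dropped, for example `df["dsport"].map(port_category)`.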

---

#### **proto:**

The `proto` column represents the **network protocol** used for each connection (e.g., TCP, UDP, ICMP). Unlike IP addresses or port numbers, the protocol feature has **low cardinality** and **well-defined categorical values**, making it suitable for encoding.

---

#### **state:**

- The `state` feature captures protocol-level connection behavior and has low cardinality. As the values are nominal rather than ordinal, no hierarchical mapping is applied. The feature is retained and encoded as a categorical variable due to its relevance in distinguishing normal and malicious traffic.

- Either label encoding or one-hot encoding can be used.

---

#### **service:**

- The `service` column has low cardinality and captures application-level behavior. It is retained as a categorical feature and encoded appropriately.

- `'-'` denotes flows where no application service could be identified and is retained as a valid category.

- Either label encoding or one-hot encoding can be used.
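
As a rough illustration of the two encoding options named for `state` and `service`, the sketch below applies both to a toy column. The state values are placeholders and the pandas-only approach is an assumption; scikit-learn's `LabelEncoder`, imported earlier, would be an alternative for the label-encoding part.

```python
import pandas as pd

toy = pd.DataFrame({"state": ["FIN", "CON", "INT", "FIN"]})

# Label encoding: one integer code per category (codes follow sorted order)
toy["state_code"] = toy["state"].astype("category").cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy["state"], prefix="state", dtype=int)

print(toy)
print(one_hot)
```

Label encoding keeps a single column but implies an artificial order, which tree-based models usually tolerate. One-hot encoding avoids that order at the cost of one column per category.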

---

#### **stcpb & dtcpb:**

`stcpb` and `dtcpb` exhibit extremely high cardinality and represent randomized TCP sequence numbers. As identifier-like features that do not encode meaningful network behavior and pose a high risk of overfitting, they are excluded from further analysis and modeling.


---

#### **Stime and Ltime:**

Since flow duration (`dur`) is already available, the raw timestamps (`Stime`, `Ltime`) are excluded: they provide limited behavioral insight and may introduce temporal bias in this historical dataset.

---

#### **ct_ftp_cmd**:

- Integers: `0, 1, 2, 3, 4, 5, 6, 8`
- Strings: `'0', '1', '2', '4'`
- Blank string: `' '`

`ct_ftp_cmd` is a discrete count feature indicating FTP command activity. Formatting inconsistencies are present, and the feature is to be retained as a numeric variable.

---

#### **attack_cat:**

```c

['normal' 'generic' 'dos' ' fuzzers ' 'exploits' 'reconnaissance'
 ' reconnaissance ' 'backdoor' ' fuzzers' 'backdoors' ' shellcode '
 'analysis' 'shellcode' 'worms']
 
 ```

#### The problems are:
1. **Leading/trailing spaces**
    - `' fuzzers '`, `' reconnaissance '`, `' shellcode '`
2. **Plural vs singular**
    - `'backdoor'` vs `'backdoors'`
3. **Duplicate semantic labels**
    - `'fuzzers'` appears multiple times with spacing issues
    - `'shellcode'` appears with and without spaces

So although the column shows **14 unique values**, there are only **10 meaningful classes** (nine attack categories plus normal traffic).

**Apparent extra categories in `attack_cat` are caused by formatting inconsistencies. After cleaning, the column contains the standard UNSW-NB15 attack classes and is used as the target label.**
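
The formatting problems listed above can be handled generically rather than value by value. The snippet below is a minimal sketch on a toy series; the sample values mirror the inconsistencies described in this section, and the plural-to-singular mapping is an assumption limited to the one case observed here.

```python
import pandas as pd

raw = pd.Series([" fuzzers ", " fuzzers", "backdoors", "backdoor",
                 " shellcode ", "shellcode", " reconnaissance ", "dos"])

cleaned = (
    raw.str.strip()                            # remove leading/trailing spaces
       .str.lower()                            # unify case
       .replace({"backdoors": "backdoor"})     # merge plural and singular
)
print(sorted(cleaned.unique()))
```

Stripping whitespace first means new spacing variants are covered without having to list each one explicitly.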

---

#### **Other Features:**

The remaining numerical features are grouped by their semantic role in capturing network behavior, and each group is retained and processed accordingly:

- Discrete count-based variables: `sloss`, `dloss`, `Spkts`, `Dpkts`, `trans_depth`, and all `ct_*` features.
- Continuous magnitude-based variables: `sbytes`, `dbytes`, `smeansz`, `dmeansz`, `res_bdy_len`.
- Protocol-specific numeric variables: `sttl`, `dttl`, `swin`, `dwin`.
- Binary features: `is_sm_ips_ports`, `is_ftp_login`, `Label`.

In [None]:
# Columns to be dropped
df.drop(columns = ["srcip" , "dstip" , "sport" , "dsport", "stcpb" , "dtcpb" , "Stime" , "Ltime"] , inplace = True)

df.head()

**Based on the analysis, the following columns were removed:  `srcip` , `dstip` , `stcpb` , `dtcpb` , `Stime` & `Ltime`**

- `sport` and `dsport` were dropped due to high cardinality. Additionally, since attack traffic spans a wide range of ports, behavioral and volumetric features are more informative than raw port identifiers for intrusion detection.

In [None]:
# Verifying the case of "-" with label = 0

mask = (df["service"] == "-") & (df["Label"] == 0)
print(df[mask][["service" , "proto" , "Label"]]["proto"].value_counts())

In [None]:
# Verifying the case of "-" with label = 1

mask = (df["service"] == "-") & (df["Label"] == 1)
print(df[mask][["service" , "proto" , "Label"]]["proto"].value_counts())

In [None]:
df["service"] = df["service"].replace("-" , "other")

In [None]:
df["service"].value_counts()

**After verification, the `-` value in `service` is replaced with `other`.**

In [None]:
# Coerce mixed int/str values to numbers; blank strings become NaN and are set to 0
df["ct_ftp_cmd"] = pd.to_numeric(df["ct_ftp_cmd"], errors="coerce").fillna(0).astype(int)

In [None]:
df["ct_ftp_cmd"].value_counts()

**Fixed  anomalies in the column.**

In [None]:
df["attack_cat"].value_counts()

```c

['normal' 'generic' 'dos' ' fuzzers ' 'exploits' 'reconnaissance'
 ' reconnaissance ' 'backdoor' ' fuzzers' 'backdoors' ' shellcode '
 'analysis' 'shellcode' 'worms']
 
 ```

In [None]:
df["attack_cat"] = df["attack_cat"].replace([' fuzzers ' , ' fuzzers' ,' reconnaissance ', 'backdoors' , ' shellcode ' ],
                                            ['fuzzers' , 'fuzzers' , 'reconnaissance' , 'backdoor' , 'shellcode' ])


In [None]:
print("No.of uniques : ",df["attack_cat"].nunique())
print("Uniques: " , df["attack_cat"].unique())

**Fixed anomalies in the values.**

## Numerical - Continuous Features Analysis
<div style="
    margin: 20px 0;
    text-align: center;
    font-size: 180%;
    font-weight: bold;
    color: #e5e7eb;
    border-bottom: 3px solid #22d3ee;
">
</div>

In [None]:
df.select_dtypes(["float64"]).describe().T

**Inconsistency found: `is_ftp_login` has the wrong dtype.**

In [None]:
# Value counts
df["is_ftp_login"].value_counts()

In [None]:
df["is_ftp_login"] = df["is_ftp_login"].astype(int)

df["is_ftp_login"].value_counts()

**`is_ftp_login` is treated as a categorical (binary) feature.**

# ***Univariate Analysis***
<div style="
    margin: 16px 0;
    text-align: center;
    font-size: 180%;
    font-weight: bold;
    color: #2c3e50;
    border-bottom: 4px solid #6c63ff;
">
</div>

## 1. **Visualization based on the Numerical / Magnitude Features**

In [None]:
# Numerical columns (int and float): discrete and continuous
selected_columns = np.concatenate((["sbytes" , "dbytes" , "smeansz" , "dmeansz" , "res_bdy_len" , "sloss" , "dloss"] ,  
                                   df.select_dtypes("float").columns))
selected_columns

In [None]:
plt.figure(figsize = (20 , 16))
for  i , col in enumerate(selected_columns , 1):
    
    plt.subplot(6 , 3 , i)
    sns.boxplot(data = df , x = col)
    
plt.tight_layout()
plt.show()

### Inferences from visuals:
- The box plots show extreme right skew, heavy tails, and many outliers.
- This is expected for network traffic: most flows are small, while a few are very large.
- A transformation is required before these features can be compared meaningfully.
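
One way to put a number on the outliers seen in the box plots is to measure the share of values beyond the 1.5 × IQR whiskers, which is the default whisker rule in seaborn box plots. The helper `iqr_outlier_share` and the toy sample below are illustrative assumptions, not results from the dataset.

```python
import numpy as np


def iqr_outlier_share(values):
    """Fraction of values outside the 1.5 * IQR whiskers of a box plot."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return float(np.mean((values < lower) | (values > upper)))


# Toy heavy-tailed sample: 95 small flows plus 5 very large ones
sizes = np.array(list(range(1, 96)) + [10_000] * 5)
share = iqr_outlier_share(sizes)
print(share)
```

Applied per column, for example `df[col].pipe(iqr_outlier_share)`, this gives a quick ranking of which features are most affected.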

## 2. **Categorical, Discrete and Count-Based Feature Distribution Analysis**

### 2.1 Network Identifiers
- **Based on**: `proto`, `state`, `service`
- **Excluded (high cardinality, and not behavioral measurements)**: `sport`, `dsport`, `swin`, `dwin`, `sttl`, `dttl`
- `proto` has many unique values, so only the 10 most frequent protocols are plotted.

In [None]:
# State column   
count_df = df['state'].value_counts().reset_index()
count_df.columns = ['state', 'count']
fig = px.bar(count_df, x="state", y='count', text='count' , title = "Countplot of States",  color = "state")
fig.show()

In [None]:
# service

count_df = df['service'].value_counts().reset_index()
count_df.columns = ['service', 'count']
fig = px.bar(count_df, x="service", y='count', text='count' , title = "Countplot of Service",  color = "service")
fig.show()

In [None]:
# Top 10 protocol in the data

count_df = df['proto'].value_counts().reset_index()[:10]
count_df.columns = ['protocol', 'count']

fig = px.bar(count_df, x='protocol', y='count', text='count' , title = "Countplot of top-10 protocols",  color = "protocol")

fig.show()


### 2.2 Count-based numerical features

```c
['Spkts', 'Dpkts', 'trans_depth',
 'ct_state_ttl', 'ct_flw_http_mthd', 'ct_ftp_cmd',
 'ct_srv_src', 'ct_srv_dst',
 'ct_dst_ltm', 'ct_src_ltm',
 'ct_src_dport_ltm', 'ct_dst_sport_ltm',
 'ct_dst_src_ltm']
```

In [None]:
# numerical feature columns 
cb_numerical_features = ['Spkts', 'Dpkts', 'trans_depth','ct_state_ttl', 'ct_flw_http_mthd', 'ct_ftp_cmd',
                         'ct_srv_src', 'ct_srv_dst','ct_dst_ltm', 'ct_src_ltm','ct_src_dport_ltm', 'ct_dst_sport_ltm','ct_dst_src_ltm']

plt.figure(figsize = (20 , 16))
for  i , col in enumerate(cb_numerical_features, 1):
    
    plt.subplot(5 , 3 , i)
    sns.kdeplot(data = df , x = col , fill = True)
    
plt.tight_layout()
plt.show()


### Inferences from visual:

- The raw (untransformed) distributions are too compressed to reveal much structure.
- The data needs to be transformed.

---

#### About both the box plots and the distribution plots:

**Both discrete and continuous network traffic features exhibit strong right skewness due to the heavy-tailed nature of network flows, where the majority of connections are short and low-volume, while a small fraction corresponds to long-lasting or high-volume flows often associated with anomalous or attack behavior.**

### 2.3 Binary Features

```c
 'is_ftp_login',
 'is_sm_ips_ports'

```

Excluding the target variable `Label`.

In [None]:
# is_ftp_login
count_df = df['is_ftp_login'].value_counts().reset_index()
count_df.columns = ['is_ftp_login', 'count']
fig = px.bar(count_df, x="is_ftp_login", y='count', text='count' , title = "Countplot of is_ftp_login",  color = "is_ftp_login" , width = 800 , height = 500)
fig.update_coloraxes(showscale=False)
fig.show()


In [None]:
# is_sm_ips_ports
count_df = df['is_sm_ips_ports'].value_counts().reset_index()
count_df.columns = ['is_sm_ips_ports', 'count']
fig = px.bar(count_df, x="is_sm_ips_ports", y='count', text='count' , title = "Countplot of is_sm_ips_ports",  color = "is_sm_ips_ports" , width = 800 , height = 500)
fig.update_coloraxes(showscale=False)
fig.show()

### Inferences 

#### About `is_sm_ips_ports`:
- It is a binary flag that indicates whether the source and destination IP addresses are equal and the port numbers are equal.
- Only a small number of flows have identical IPs and ports.

### Target Distribution : `attack_cat` , `Label`

In [None]:
# attack category

count_df = df['attack_cat'].value_counts().reset_index()
count_df.columns = ['attack_cat', 'count']
fig = px.bar(count_df, x="attack_cat", y='count', text='count' , title = "Countplot of attack_cat",  color = "attack_cat")
fig.show()

### Inference about attack category: 
- `normal`, `generic`, `exploits`, `fuzzers`, `dos`, and `reconnaissance` are the most frequent categories; the remaining attack categories are comparatively rare.

In [None]:
# Label Distribution pie chart
label_map = {0: "Normal", 1: "Attack"}
# mapped for visualization purpose
df['Label'] = df['Label'].map(label_map)


fig = px.pie(
    df,
    names='Label',
    title='Distribution of Network Traffic (Normal vs Attack)',
    hole=0.4,
    width = 900 ,
    height = 600
)

fig.update_traces(
    textinfo='percent+label',
    pull=[0, 0.05]
)

fig.update_layout(
    legend_title_text='Traffic Type'
)

fig.show()

### Inference from pie chart: 
 - Normal (non-threat) traffic accounts for the larger share of flows.
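
Because one class dominates, a classifier can score well simply by favoring it. A common mitigation for gradient-boosted trees is to weight the positive class, and a typical starting value for XGBoost's `scale_pos_weight` parameter is the ratio of negative to positive samples. The sketch below computes that ratio; the helper name `positive_class_weight` and the toy proportions are made up for illustration and are not the dataset's real counts.

```python
from collections import Counter


def positive_class_weight(labels, positive="Attack"):
    """Ratio of negative to positive samples, a common starting value
    for XGBoost's `scale_pos_weight` parameter."""
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = sum(counts.values()) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples found")
    return n_neg / n_pos


# Toy labels (made-up proportions, not the dataset's real counts)
toy_labels = ["Normal"] * 8 + ["Attack"] * 2
weight = positive_class_weight(toy_labels)
print(weight)
```

The resulting value could then be passed as `XGBClassifier(scale_pos_weight=weight)` and tuned from there.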

# *Feature Preprocessing and Transformation*
<div style="
    margin: 16px 0;
    text-align: center;
    font-size: 180%;
    font-weight: bold;
    color: #2c3e50;
    border-bottom: 4px solid #6c63ff;
">
</div>

## **Transforming Features**

In [None]:
transformed_df = df.copy()

In [None]:
# Columns to be transformed
log_cols = [
    'sbytes','dbytes','Spkts','Dpkts','smeansz','dmeansz','res_bdy_len',
    'sloss','dloss','dur',
    'Sload','Dload','Sjit','Djit',
    'Sintpkt','Dintpkt',
    'tcprtt','synack','ackdat',
    'ct_srv_src','ct_srv_dst','ct_dst_ltm','ct_src_ltm',
    'ct_src_dport_ltm','ct_dst_sport_ltm','ct_dst_src_ltm'
]

for col in log_cols:
    transformed_df[col] = np.log1p(transformed_df[col])

In [None]:
transformed_df[log_cols].head()

### Inference:

#### Why was the natural logarithm of 1 + x applied?

- Many numerical features in the dataset exhibited heavy right skew and extreme outliers, typical of network traffic behavior. To improve interpretability and enable meaningful bivariate analysis, a log1p transformation was applied to count-, size-, and rate-based features.
- Discrete indicators, protocol identifiers, and binary features were left untransformed.
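
The effect of `log1p` on a heavy-tailed feature can be checked with a small experiment. The sketch below uses synthetic log-normal values as a stand-in for byte counts, so the data, the seed, and the helper `skewness` are assumptions for illustration rather than dataset results.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic heavy-tailed "byte counts" (log-normal), plus some zero-byte flows
raw = np.concatenate([rng.lognormal(mean=6.0, sigma=2.0, size=10_000), np.zeros(100)])


def skewness(x):
    """Sample skewness: third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered ** 3) / np.std(x) ** 3)


transformed = np.log1p(raw)          # log(1 + x), defined at x = 0

print("skew before:", skewness(raw))
print("skew after: ", skewness(transformed))

# The transform is invertible, so original magnitudes can be recovered
recovered = np.expm1(transformed)
```

Unlike a plain logarithm, `log1p` maps zero to zero, which matters here because many flows have zero loss or zero response bytes.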

## **Traffic Composition & Distribution : Protocol**

In [None]:
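# Take the top 5 protocols; naming the columns explicitly keeps "proto" as the
# protocol column regardless of the pandas version
top_protocols = (transformed_df["proto"].value_counts().head(5)
                 .rename_axis("proto").reset_index(name="count"))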

top_protocols

In [None]:
transformed_df["proto"].value_counts()

In [None]:
selected_protocols = transformed_df['proto'].where( transformed_df['proto'].isin(top_protocols["proto"]) , 'other')

transformed_df["proto"] = selected_protocols

transformed_df["proto"].value_counts()

In [None]:
count_df = transformed_df['proto'].value_counts().reset_index()
count_df.columns = ['protocol', 'count']

fig = px.bar(count_df, x='protocol', y='count', text='count' , title = "Countplot of top-5 protocols (remaining protocols grouped as 'other')",  color = "protocol")

fig.show()


### Inference: 

- The `proto` feature has high cardinality.
- `tcp`, `udp`, `unas`, `arp`, and `ospf` are the five most frequent protocols.
- `other` groups all remaining protocols, each of which occurs less often than these five.

# *Bivariate Analysis*
<div style="
    margin: 16px 0;
    text-align: center;
    font-size: 180%;
    font-weight: bold;
    color: #2c3e50;
    border-bottom: 4px solid #6c63ff;
">
</div>

## **A. Traffic volume & intensity**

In [None]:
selected_columns = ["Spkts" , "Dpkts" , "Sload" , "Dload"]

plt.figure(figsize = (15 , 10))
for  i , col in enumerate(selected_columns , 1):
    
    plt.subplot(2, 2 , i)
    plt.title(f"{col} vs Label")
    # Label holds the strings "Normal"/"Attack" at this point, so fix the box order explicitly
    sns.boxplot(data = transformed_df , x = "Label" , y = col , order = ["Normal" , "Attack"])

    
plt.tight_layout(pad = 3)
plt.show()

### Insights:

- Attacks are characterized by high source-side activity (Spkts, Sload) combined with irregular or suppressed destination-side responses (Dpkts, Dload), highlighting strong traffic asymmetry.

- Normal traffic outliers represent rare but legitimate high-usage events, not malicious behavior.

## **B. Behavioral aggregation**

In [None]:
selected_columns = ['ct_srv_src', 'ct_dst_ltm', 'ct_src_ltm', 'ct_dst_src_ltm']

plt.figure(figsize = (15 , 10))
for  i , col in enumerate(selected_columns , 1):
    
    plt.subplot(2, 2 , i)
    plt.title(f"{col} vs Label" , fontsize = 12)
    # Label holds the strings "Normal"/"Attack" at this point, so fix the box order explicitly
    sns.boxplot(data = transformed_df , x = "Label" , y = col , order = ["Normal" , "Attack"])

    
plt.tight_layout(pad = 3)
plt.show()

### Inference:

#### About ct_srv_src Vs Label:
Attack traffic repeatedly targets the same service from a single source, indicating automated probing or flooding behavior.

---
#### About ct_dst_ltm Vs Label:
Attack flows frequently interact with the same destination in short time windows, reflecting concentrated targeting of specific hosts.

---
#### About ct_src_ltm Vs Label:
Malicious sources generate a large number of connections in short periods, consistent with automated attack tools rather than human behavior.

---
#### About ct_dst_src_ltm Vs Label:
Attack traffic shows persistent communication between the same source and destination, revealing sustained and focused attack attempts.

## **C. Timing behavior**

In [None]:
selected_columns = ["dur" , "Sintpkt" , "Dintpkt"]

plt.figure(figsize = (15 , 6))
for  i , col in enumerate(selected_columns , 1):
    
    plt.subplot(1, 3 , i)
    plt.title(f"{col} vs Label" , fontsize = 12)
    # Label holds the strings "Normal"/"Attack" at this point, so fix the box order explicitly
    sns.boxplot(data = transformed_df , x = "Label" , y = col , order = ["Normal" , "Attack"])

    
plt.tight_layout(pad = 3)
plt.show()

### Inference:

#### About dur vs Label:

Attack flows are often short-lived, reflecting scans, probes, and failed connection attempts rather than sustained sessions.

---

#### About Sintpkt vs Label:

Attack traffic exhibits highly regular and rapid packet emission, with very small inter-packet intervals, consistent with automated tools.

---

#### About Dintpkt vs Label:

Destination-side timing during attacks is irregular and often compressed, reflecting incomplete handshakes or suppressed responses.


>#### **Cross-feature insight**:
>Attacks are characterized by short-lived flows and tightly spaced packet transmissions, whereas normal traffic shows more variable, human-driven timing patterns.

## **Binary vs Target (Label)**

In [None]:
plt.figure(figsize=(10, 5))

# is_sm_ips_ports
plt.subplot(1, 2, 1)
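# hue_order pins the bar order so the legend labels set below match the data
ax1 = sns.countplot(data=transformed_df, x="is_sm_ips_ports", hue="Label", hue_order=["Normal", "Attack"])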
plt.title("is_sm_ips_ports vs Label")

handles, labels = ax1.get_legend_handles_labels()
ax1.legend(handles, ["Normal", "Attack"], title="Traffic Type")

# is_ftp_login
plt.subplot(1, 2, 2)
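# hue_order pins the bar order so the legend labels set below match the data
ax2 = sns.countplot(data=transformed_df, x="is_ftp_login", hue="Label", hue_order=["Normal", "Attack"])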
plt.title("is_ftp_login vs Label")

handles, labels = ax2.get_legend_handles_labels()
ax2.legend(handles, ["Normal", "Attack"], title="Traffic Type")

plt.subplots_adjust(wspace=0.4)
plt.show()


### Inference:

#### About is_sm_ips_ports vs Label: 

While the `is_sm_ips_ports` flag (set when the source and destination IP addresses are equal and the port numbers are equal) is rarely active, its activation is disproportionately associated with attack traffic, indicating that flows with identical source and destination endpoints are a strong indicator of malicious behavior.

---

#### About is_ftp_login vs Label:

Although the absolute count of flows with `is_ftp_login` set is higher in normal traffic due to class imbalance, the flag's relative occurrence rate within attack traffic is elevated, so it acts as a risk indicator rather than a standalone discriminator.
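
The difference between absolute counts and relative rates can be made explicit with a row-normalized cross-tabulation. The toy frame below is invented to illustrate the pattern (more flagged flows in the majority class, but a higher flag rate in the minority class); it does not reproduce the dataset's numbers.

```python
import pandas as pd

toy = pd.DataFrame({
    "Label": ["Normal"] * 8 + ["Attack"] * 2,
    "is_ftp_login": [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

# Absolute counts: the majority class dominates
counts = pd.crosstab(toy["Label"], toy["is_ftp_login"])

# Row-normalized rates: share of flows with the flag set within each class
rates = pd.crosstab(toy["Label"], toy["is_ftp_login"], normalize="index")

print(counts)
print(rates)
```

Reading the `rates` table row by row removes the effect of class imbalance, which a count plot alone cannot do.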


# *Multivariate Analysis*
<div style="
    margin: 16px 0;
    text-align: center;
    font-size: 180%;
    font-weight: bold;
    color: #2c3e50;
    border-bottom: 4px solid #6c63ff;
">
</div>

In [None]:
features = ["Spkts", "Sload", "dur", "ct_srv_src", "ct_dst_ltm"]

profile = transformed_df.groupby("attack_cat")[features].median()

plt.figure(figsize=(10, 6))
sns.heatmap(profile,annot=True,fmt=".2f",cmap="viridis",linewidths=0.5)

plt.title("Behavioral Profiles by Attack Category (Log-Transformed Medians)")
plt.xlabel("Features")
plt.ylabel("Attack Category")
plt.tight_layout()
plt.show()

### Inference:
Log-transformed median behavioral profiles reveal that attack categories exhibit distinct traffic load and service access patterns. Volumetric attacks are characterized by high source load and repeated service interactions, while reconnaissance and exploit-based attacks demonstrate lower, stealth-oriented behaviors. Significant overlap with benign traffic highlights the necessity of multi-feature and non-linear classification approaches.

In [None]:
# Draw a 50,000-row random sample to keep the scatter plots responsive
sample_df = transformed_df.sample(50000 , random_state = 42)

In [None]:

# Plot
fig = px.scatter(
    sample_df,
    x="smeansz",
    y="Sload",
    color="Label",   
    title="Mean Source Packet Size vs Source Load",
    labels={
        "smeansz": "Mean Source Packet Size (log1p)",
        "Sload": "Source Load (log1p)",
        "Label": "Traffic Class"
    }
)

fig.update_layout(
    template="plotly_white",
    legend_title_text="Traffic Class"
)

fig.show()


### Inference:

The relationship between mean source packet size and source load reveals clear behavioral differences between normal and attack traffic. Attack flows consistently operate at higher load levels and exhibit more structured packet sizing patterns, indicative of automated transmission mechanisms. In contrast, normal traffic shows greater variability in packet size and operates predominantly at moderate load levels. The presence of distinct attack clusters further suggests multiple attack strategies with differing packet-size and rate characteristics.

In [None]:
fig = px.scatter(
    sample_df,
    x="Sintpkt",
    y="Spkts",
    color="Label",
    title="Source Inter-Packet Time vs Packet Count by Traffic Class",
    labels={
        "Sintpkt": "Source inter-packet arrival time (log1p)",
        "Spkts": "Source to Destination packet count (log1p)",
        "Label": "Traffic Class"
    }
)

fig.update_layout(
    template="plotly_white",
    legend_title_text="Traffic Class"
)

fig.show()


### Inference:

The relationship between source inter-packet time and packet count reveals clear behavioral differences between normal and attack traffic. Attack flows are heavily concentrated at low inter-packet intervals while maintaining elevated packet volumes, indicating automated and sustained transmission patterns. In contrast, normal traffic exhibits greater variability in timing and a rapid decline in packet volume as inter-packet intervals increase, consistent with human-driven communication behavior.

In [None]:
fig = px.scatter(
    sample_df,
    x="sbytes",
    y="dbytes",
    color="Label",
    title="Source vs Destination Byte Volume by Traffic Class",
    labels={
        "sbytes": "Source Bytes",
        "dbytes": "Destination Bytes",
        "Label": "Traffic Class"
    }
)

fig.update_layout(
    template="plotly_white",
    legend_title_text="Traffic Class"
)

fig.show()

### Inference:

Attack flows are characterized by higher packet loss at elevated packet volumes, suggesting aggressive transmission that overwhelms network capacity. Normal traffic maintains low loss levels, reflecting controlled and adaptive communication behavior.

In [None]:
fig = px.scatter(
    sample_df,
    x="sbytes",
    y="dbytes",
    size="Spkts",
    color="attack_cat",
    title="Directional Byte Asymmetry and Volume by Traffic Class",
    labels={
        "sbytes": "Source Bytes",
        "dbytes": "Destination Bytes"
    }
)
fig.update_layout(template="plotly_white")
fig.show()

### Inference:

Directional byte analysis reveals that normal traffic maintains a relatively balanced source–destination byte exchange, consistent with legitimate bidirectional communication. In contrast, multiple attack categories exhibit pronounced asymmetry, with either source- or destination-dominated byte volumes. Reconnaissance and fuzzing attacks show minimal response traffic, while DoS and exploit attacks demonstrate high-volume, asymmetric exchange, reflecting fundamentally different attack strategies.
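
Directional asymmetry can also be expressed as a single number per flow. The sketch below uses a normalized difference, `(sbytes - dbytes) / (sbytes + dbytes)`, which is one common choice rather than anything defined by the dataset; the helper name `byte_asymmetry` and the toy flows are hypothetical.

```python
import pandas as pd

def byte_asymmetry(sbytes: pd.Series, dbytes: pd.Series) -> pd.Series:
    """Normalized byte asymmetry in [-1, 1].

    +1: only the source sent data, -1: only the destination sent data,
     0: balanced exchange (also returned for flows with no bytes at all).
    """
    total = sbytes + dbytes
    diff = sbytes - dbytes
    # Mask zero totals to NaN to avoid division by zero, then report them as 0.
    return (diff / total.where(total != 0)).fillna(0.0)

# Illustrative flows (assumption: made-up numbers, not taken from UNSW-NB15).
toy = pd.DataFrame({
    "attack_cat": ["Normal", "Reconnaissance", "DoS", "Normal"],
    "sbytes": [1000, 200, 50000, 0],
    "dbytes": [1000, 0, 500, 0],
})
toy["asymmetry"] = byte_asymmetry(toy["sbytes"], toy["dbytes"])
print(toy)
```

Grouping this column by `attack_cat` (for example with a median) would give a compact summary of which categories are source-dominated and which are balanced.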

In [None]:
fig = px.scatter(
    sample_df,
    x="Sintpkt",
    y="Sload",
    size="Spkts",
    color="Label",
    title="Timing, Load, and Volume Characteristics by Traffic Class",
    labels={
        "Sintpkt": "Inter-packet Time",
        "Sload": "Source Load",
        "Label": "Traffic Class"
    }
)
fig.update_layout(template="plotly_white")
fig.show()

### Inference:
Attack traffic exhibits a strong concentration at low inter-packet intervals combined with elevated source load and higher traffic volume, indicating automated and sustained transmission behavior. Normal traffic, in contrast, demonstrates adaptive timing and decreasing load as inter-packet intervals increase, consistent with congestion-aware and human-driven communication.

# *Categorical Feature Analysis and Preprocessing*
<div style="
    margin: 16px 0;
    text-align: center;
    font-size: 180%;
    font-weight: bold;
    color: #2c3e50;
    border-bottom: 4px solid #6c63ff;
">
</div>

In [None]:
transformed_df[["state" , "service" , "proto", "attack_cat" , "Label"]]

In [None]:
# Map Label back to its binary integer encoding (Normal = 0, Attack = 1)
# Transformed df
transformed_df["Label"] = transformed_df["Label"].map({
    "Normal":0,
    "Attack":1
}).astype(int)

# Original df
df["Label"] = df["Label"].map({
    "Normal":0,
    "Attack":1
}).astype(int)

In [None]:
# Encoding categorical features

# Dictionary to store the label encoders
label_encoders = {}

# categorical features 
categorical_features = ["service", "state", "proto"]

for col in categorical_features:
    le = LabelEncoder()
    transformed_df[col] = le.fit_transform(transformed_df[col])
    label_encoders[col] = le


# attack_cat
attack_le = LabelEncoder()
transformed_df["attack_cat"] = attack_le.fit_transform(transformed_df["attack_cat"])
label_encoders["attack_cat"] = attack_le

In [None]:
# encoded classes  
print("Service: " , label_encoders["service"].classes_)
print("\nState: " , label_encoders["state"].classes_)
print("\nProtocol: " , label_encoders["proto"].classes_)
print("\nAttack Category: " , label_encoders["attack_cat"].classes_)

In [None]:
# Attack classes: map each encoded label back to its category name
attack_classes = dict(enumerate(label_encoders["attack_cat"].classes_))

attack_classes

### Inference: 
Categorical features are label-encoded to obtain numerical representations suitable for machine learning models, while the attack category is encoded separately as the target variable.
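
One property of label encoding worth keeping in mind: `LabelEncoder` sorts the classes, so the integer codes follow lexicographic order and carry no semantic ranking. Tree-based models such as XGBoost can still split on them, while distance-based or linear models would usually call for one-hot encoding instead. A minimal sketch with made-up service values (the list below is illustrative, not the dataset's full set of services):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only (assumption), in arbitrary order.
services = ["http", "dns", "-", "ftp", "dns"]

le = LabelEncoder()
codes = le.fit_transform(services)

# classes_ is sorted, so code i corresponds to the i-th class in sorted order.
mapping = dict(enumerate(le.classes_))
print(mapping)

# inverse_transform recovers the original strings, which is why the fitted
# encoders are worth keeping around in a dictionary.
decoded = le.inverse_transform(codes)
print(list(decoded))
```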

# *Feature Correlation Analysis*
<div style="
    margin: 16px 0;
    text-align: center;
    font-size: 180%;
    font-weight: bold;
    color: #2c3e50;
    border-bottom: 4px solid #6c63ff;
">
</div>

In [None]:
def plot_spearman_corr(df, cols, title):
    plt.figure(figsize=(12, 10))
    corr = df[cols].corr(method='spearman')
    sns.heatmap(
        corr,
        cmap='mako',
        center=0,
        annot=True,
        linewidths=0.5
    )
    plt.title(title)
    plt.show()
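
Spearman correlation is Pearson correlation computed on ranks, so it measures monotonic rather than linear association and is insensitive to extreme values. That makes it a reasonable fit for heavy-tailed traffic features. A small sketch on toy data (the numbers are arbitrary) illustrates the equivalence:

```python
import pandas as pd

# y grows monotonically with x but contains one extreme value.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 1000]})

spearman = toy.corr(method="spearman").loc["x", "y"]
pearson_on_ranks = toy.rank().corr(method="pearson").loc["x", "y"]
pearson = toy.corr(method="pearson").loc["x", "y"]

print(f"Spearman:         {spearman:.3f}")
print(f"Pearson on ranks: {pearson_on_ranks:.3f}")
print(f"Pearson (raw):    {pearson:.3f}")
```

The raw Pearson coefficient is pulled down by the outlier, while the rank-based coefficient reports a perfect monotonic relationship.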

## BLOCK 1 — Volume & Packet Statistics (Traffic Magnitude)

In [None]:
block_volume = [
    'sbytes', 'dbytes',
    'Spkts', 'Dpkts',
    'Sload', 'Dload',
    'smeansz', 'dmeansz',
    'sloss', 'dloss'
]

plot_spearman_corr(
    transformed_df,
    block_volume + ['Label'],
    "Spearman Correlation – Traffic Volume Features"
)

### Inference — Traffic Volume Features:

- Strong internal redundancy is observed among byte- and packet-level features (sbytes, dbytes, Spkts, Dpkts, sloss, dloss), with correlations consistently above 0.85, indicating they capture closely related aspects of traffic volume.

- Load features behave differently: Sload shows negative association with packet and byte counts, while Dload remains moderately aligned with volume, suggesting asymmetric source–destination behavior.

- Mean packet size features (smeansz, dmeansz) exhibit weaker correlations, implying they contribute complementary information rather than pure volume effects.

- The Label variable is negatively correlated with most volume metrics, indicating that attack traffic tends to exhibit distinct volume patterns compared to normal traffic, rather than simply higher raw counts.

## BLOCK 2 — Time, Latency & Flow Dynamics

In [None]:
block_timing = [
    'dur',
    'Sjit', 'Djit',
    'Sintpkt', 'Dintpkt',
    'tcprtt', 'synack', 'ackdat',
    'trans_depth', 'res_bdy_len'
]

plot_spearman_corr(
    transformed_df,
    block_timing + ['Label'],
    "Spearman Correlation – Timing & Latency Features"
)

### Inference — Timing & Latency Features:

- Strong internal coherence is observed among duration, inter-packet times, and jitter metrics (dur, Sjit, Djit, Sintpkt, Dintpkt), with correlations exceeding 0.85, indicating these features jointly describe temporal flow behavior.

- TCP handshake timing features (tcprtt, synack, ackdat) form a tightly coupled subgroup, reflecting shared TCP latency dynamics.

- Transaction-level features (trans_depth, res_bdy_len) are highly correlated with each other but show only moderate association with core timing variables, suggesting they capture application-level behavior rather than pure network timing.

- The Label variable exhibits weak-to-moderate negative correlations with most timing metrics, implying that attack traffic tends to follow distinct temporal patterns, such as shorter, more structured, or burst-oriented flows.

## BLOCK 3 — Behavioral Aggregation (ct_* features)

In [None]:
block_behavior = [
    'ct_srv_src', 'ct_srv_dst',
    'ct_src_ltm', 'ct_dst_ltm',
    'ct_src_dport_ltm',
    'ct_dst_sport_ltm',
    'ct_dst_src_ltm',
    'ct_state_ttl',
    'ct_flw_http_mthd',
    'ct_ftp_cmd'
]

plot_spearman_corr(
    transformed_df,
    block_behavior + ['Label'],
    "Spearman Correlation – Behavioral Aggregation Features"
)

### Inference — Behavioral Aggregation Features:
- Strong positive correlations are observed among repetition-based features (ct_srv_src, ct_srv_dst, ct_src_ltm, ct_dst_ltm, ct_dst_src_ltm), indicating these variables collectively capture connection recurrence and host interaction intensity.

- Port-based locality features (ct_src_dport_ltm, ct_dst_sport_ltm) form a tightly coupled subgroup, reflecting focused probing or service-targeting behavior, which is characteristic of scanning and brute-force attacks.

- The TTL aggregation feature (ct_state_ttl) shows an exceptionally strong correlation with the Label, suggesting that state–TTL interaction patterns are highly discriminative between normal and attack traffic.

- Protocol-specific command counters (ct_flw_http_mthd, ct_ftp_cmd) exhibit weak correlations with both behavioral aggregates and the Label, implying that high-level behavioral repetition is more informative than application command counts.

- Overall, behavioral aggregation features demonstrate stronger alignment with the target compared to raw volume or timing metrics.

## BLOCK 4 — Protocol, State & Binary Indicators

In [None]:
block_proto = [
    'proto', 'state', 'service',
    'sttl', 'dttl',
    'swin', 'dwin',
    'is_sm_ips_ports',
    'is_ftp_login'
]

plot_spearman_corr(
    transformed_df,
    block_proto + ['Label'],
    "Spearman Correlation – Protocol & State Features"
)

### Inference — Protocol & State Features

- TTL-related features (sttl, dttl) show the strongest association with the Label, particularly sttl, indicating that state–TTL behavior is a highly discriminative signal for separating normal and attack traffic.

- Protocol identifiers (proto, service) exhibit moderate correlations with the Label, suggesting that attack traffic tends to concentrate around specific protocol/service patterns, but these features alone are not decisive.

- Window size features (swin, dwin) are perfectly correlated with each other, indicating complete redundancy and implying that only one of them is necessary for modeling.

- Binary indicators (is_sm_ips_ports, is_ftp_login) show near-zero correlation with the Label, confirming that they act as rare event flags rather than strong standalone predictors.

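- Strong negative correlations between proto and the window-size features reflect structural protocol constraints (TCP window values exist only for TCP flows) rather than behavioral differences. Because proto is label-encoded, the sign of this correlation depends on the arbitrary integer order of the classes.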

### Conclusion

- Taken together, the correlation blocks show that most features are inter-related (multicollinearity).
- Features such as `trans_depth`, `res_bdy_len`, `ct_flw_http_mthd`, `ct_ftp_cmd`, `is_sm_ips_ports`, and `is_ftp_login` show near-zero correlation with the target.
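
If the multicollinearity needs to be reduced before modeling, one common approach is to flag every feature whose absolute correlation with an earlier feature exceeds a threshold. The sketch below is an assumption-laden illustration, not a step the notebook performs: the helper `redundant_features` is hypothetical, the 0.85 threshold simply echoes the level observed in the heatmaps, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def redundant_features(df: pd.DataFrame, threshold: float = 0.85) -> list:
    """Return columns whose |Spearman rho| with an earlier column exceeds threshold."""
    corr = df.corr(method="spearman").abs()
    # Keep only the strict upper triangle so each pair is inspected once
    # and a column is never compared with itself.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Synthetic example: b is a monotonic function of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": a ** 3, "c": rng.normal(size=200)})

to_drop = redundant_features(toy)
print(to_drop)
```

Which member of a correlated pair is dropped depends on column order, so it is worth ordering columns so that the preferred feature comes first. Tree ensembles tolerate redundant inputs fairly well, so this kind of pruning mainly helps interpretability and training cost.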

## Correlation of Label and attack_cat

In [None]:
plt.figure(figsize=(8, 5))
corr = transformed_df[["Label" , "attack_cat"]].corr(method = "spearman")
sns.heatmap(
    corr,
    cmap='mako',
    center=0,
    annot=True,
    linewidths=0.5
)
plt.title("Spearman Correlation - Label & attack_cat")
plt.show()

### Inference:

The strong Spearman correlation (**|ρ| ≈ 0.90**) confirms that `attack_cat` is a fine-grained extension of the binary `Label`, with the negative sign arising solely from label encoding rather than any inverse semantic relationship.
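
Since `attack_cat` is nominal, a contingency table gives a more direct check of this relationship than a rank correlation, whose value depends on how the categories happen to be numbered. A minimal sketch with invented rows (the category names and counts are illustrative assumptions):

```python
import pandas as pd

# Illustrative rows only; not taken from the dataset.
toy = pd.DataFrame({
    "attack_cat": ["Normal", "Normal", "DoS", "Exploits", "Exploits", "Fuzzers"],
    "Label": [0, 0, 1, 1, 1, 1],
})

# Rows: category, columns: binary label, cells: flow counts.
table = pd.crosstab(toy["attack_cat"], toy["Label"])
print(table)

# attack_cat refines Label if every category maps to exactly one Label value.
labels_per_category = toy.groupby("attack_cat")["Label"].nunique()
is_refinement = bool((labels_per_category == 1).all())
print("attack_cat determines Label:", is_refinement)
```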

# *Model Building*
<div style="
    margin: 16px 0;
    text-align: center;
    font-size: 180%;
    font-weight: bold;
    color: #2c3e50;
    border-bottom: 4px solid #6c63ff;
">
</div>

## **Stage 1: Binary Classification (Label)**

In [None]:
# Data Splitting  
features = transformed_df.drop(columns = ["Label" , "attack_cat"])
X = features
y_label = transformed_df["Label"]

X_train , X_test , y_train , y_test = train_test_split(X , y_label , test_size = 0.2 ,
                                                       stratify = y_label ,  random_state = 42)

In [None]:
# Shape of Train - Test data
print("Shape of X-Train: ",X_train.shape)
print("Shape of X-Test: " , X_test.shape)
print("Shape of y-train: ",  y_train.shape)
print("Shape of y-test: " , y_test.shape)
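
The `stratify` argument keeps the class ratio identical in both partitions, which matters when one class is much rarer than the other. A small sketch on synthetic labels (a 90/10 split chosen purely for illustration) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 900 negatives, 100 positives (assumption).
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# With stratification, the positive rate matches in train and test.
print("Train positive rate:", y_tr.mean())
print("Test positive rate: ", y_te.mean())
```

The same check on the real split would be `y_train.mean()` against `y_test.mean()`.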

In [None]:
# Model evaluation metrics
def ReportEncapsulator(y_test , y_pred):
    """Plot the confusion matrix and print accuracy, macro-averaged scores, and the per-class report."""

    accuracy = metrics.accuracy_score(y_test , y_pred) # Accuracy score
    precision = metrics.precision_score(y_test , y_pred , average="macro") # Macro-averaged precision
    recall = metrics.recall_score(y_test , y_pred , average="macro") # Macro-averaged recall
    f1 = metrics.f1_score(y_test , y_pred , average="macro") # Macro-averaged F1
    report = metrics.classification_report(y_test , y_pred) # Per-class report

    # Confusion matrix: one label list drives the matrix rows/columns and both axes,
    # so tick labels stay aligned even if a class appears only in the predictions
    class_labels = np.unique(np.concatenate([np.asarray(y_test) , np.asarray(y_pred)]))
    cf_matrix = metrics.confusion_matrix(y_test , y_pred , labels=class_labels)
    sns.heatmap(cf_matrix, annot=True, fmt="d", cbar=False, cmap="Blues",
                xticklabels=class_labels,
                yticklabels=class_labels)
    plt.ylabel("Actual")
    plt.xlabel("Predicted")
    plt.title("Confusion Matrix")
    plt.show()

    print("\n======================================================")
    print(f"\nAccuracy Score : {accuracy:.3f}")
    print(f"Precision Score : {precision:.3f}")
    print(f"Recall Score : {recall:.3f}")
    print(f"F1 Score : {f1:.3f}")
    print("\nClassification Report : \n", report)
    print("\n======================================================\n")
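
Macro averaging computes each metric per class and then takes the unweighted mean, so a rare class counts as much as a frequent one. The hand-rolled sketch below (the helper `macro_scores` is written for illustration and is not part of scikit-learn) shows why macro recall can sit well below accuracy on imbalanced data:

```python
def macro_scores(y_true, y_pred):
    """Return (precision, recall, f1), each averaged over classes with equal weight."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy labels: 8 normal flows, 2 attacks, one attack missed by the model.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here a single missed attack costs only ten points of accuracy but halves the recall of the attack class, which is the behavior the macro-averaged scores are meant to expose.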

### XGBoost (Binary)

In [None]:
xgb_bin = XGBClassifier(
    objective="binary:logistic",
    n_estimators=300,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="logloss",
    random_state=42
)

xgb_bin.fit(X_train, y_train)

# Evaluation 
y_pred = xgb_bin.predict(X_test)
ReportEncapsulator(y_test , y_pred)

## **Stage 2: Multiclass Classification (attack_cat)**

In [None]:
# Prepare attack-only data
# Filter attack-only samples
df_attack = transformed_df[transformed_df["Label"] == 1].copy()

# Remove "nothing" class if present
# (6 is an encoded index; confirm which category it maps to in attack_classes)
df_attack = df_attack[df_attack["attack_cat"] != 6]

# Re-encode attack_cat so the remaining classes are contiguous integers (0..k-1),
# which XGBoost expects for a multiclass target
attack_le = LabelEncoder()
df_attack["attack_cat_enc"] = attack_le.fit_transform(df_attack["attack_cat"])

In [None]:
X_attack = df_attack.drop(columns=["Label", "attack_cat", "attack_cat_enc"])
y_attack = df_attack["attack_cat_enc"]

# Splitting data
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_attack,
    y_attack,
    test_size=0.2,
    stratify=y_attack,
    random_state=42
)

In [None]:
# Shape
print("Shape of Xa-Train: ",Xa_train.shape)
print("Shape of Xa-Test: " , Xa_test.shape)
print("Shape of ya-train: ",  ya_train.shape)
print("Shape of ya-test: " , ya_test.shape)

### XGBoost (Multi-Class)

In [None]:
xgb_multi = XGBClassifier(
    objective="multi:softprob",
    num_class=ya_train.nunique(),
    n_estimators=400,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="mlogloss",
    random_state=42
)

xgb_multi.fit(Xa_train, ya_train)

y_pred = xgb_multi.predict(Xa_test)

In [None]:
ReportEncapsulator(ya_test , y_pred)
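
The two stages can be chained at inference time: the binary model decides whether a flow is an attack, and only flagged flows are passed to the multiclass model. The sketch below is one possible way to wire this up, not the notebook's own pipeline. The function `two_stage_predict`, the stub models, and the `"Normal"` placeholder name are all hypothetical; any object with a `predict` method can be plugged in.

```python
import numpy as np

def two_stage_predict(binary_model, multi_model, class_names, X, normal_name="Normal"):
    """Return one category name per row: normal_name, or the Stage 2 class for flagged rows."""
    is_attack = np.asarray(binary_model.predict(X)).astype(bool)
    labels = np.full(len(X), normal_name, dtype=object)
    if is_attack.any():
        # Stage 2 only sees rows that Stage 1 flagged as attacks.
        codes = np.asarray(multi_model.predict(X[is_attack])).astype(int)
        labels[is_attack] = np.asarray(class_names, dtype=object)[codes]
    return labels

# Stub models standing in for xgb_bin and xgb_multi (assumption: toy decision rules).
class BinaryStub:
    def predict(self, X):
        return (X[:, 0] > 0.5).astype(int)

class MultiStub:
    def predict(self, X):
        return (X[:, 1] > 0).astype(int)

X_demo = np.array([[0.1, 1.0], [0.9, 0.0], [0.8, 3.0]])
result = two_stage_predict(BinaryStub(), MultiStub(), ["DoS", "Exploits"], X_demo)
print(list(result))
```

In the notebook, the class names for Stage 2 could presumably be recovered with `label_encoders["attack_cat"].inverse_transform(attack_le.classes_)`, since the Stage 2 encoder was fitted on the already-encoded integers. Note that errors compound in this design: an attack missed by Stage 1 never reaches Stage 2.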