# Notebook 01 - Dataset Understanding

## Objective
The purpose of this notebook is to explore and understand the structure, meaning, and characteristics of the features contained in the TabularIoTAttack-2024 dataset.  
Here, we will:

- Load sample CSV files from the Bronze layer  
- Inspect all feature names and data types  
- Understand the meaning of each variable generated by CICFlowMeter  
- Verify column consistency across all available CSV files  
- Build a concise data dictionary to guide later cleaning, preprocessing, and modeling steps  

This notebook does *not* modify the data. It focuses solely on understanding the dataset before any cleaning or transformation.

# Libraries

In [None]:
import pandas as pd
from pathlib import Path

# Load data

In [5]:
# Path to Bronze layer
bronze_path = Path("../data/bronze")

# List all CSV files in Bronze
csv_files = sorted(list(bronze_path.glob("*.csv")))

print(f"Found {len(csv_files)} CSV files in Bronze:")
for f in csv_files:
    print(" -", f.name)

# Dictionary to store each loaded CSV
datasets = {}

for file_path in csv_files:
    df = pd.read_csv(file_path)
    datasets[file_path.stem] = df
    
    print(f"\nLoaded: {file_path.name}")
    print(f"Shape: {df.shape}")
    print(f"Columns ({len(df.columns)} total):")
    print(df.columns.tolist())

Found 15 CSV files in Bronze:
 - Benign Traffic.csv
 - DDoS ICMP Flood.csv
 - DDoS UDP Flood.csv
 - DoS ICMP Flood.csv
 - DoS TCP Flood.csv
 - DoS UDP Flood.csv
 - MITM ARP Spoofing.csv
 - MQTT DDoS Publish Flood.csv
 - MQTT DoS Connect Flood.csv
 - MQTT DoS Publish Flood.csv
 - MQTT Malformed.csv
 - Recon OS Scan.csv
 - Recon Ping Sweep.csv
 - Recon Port Scan.csv
 - Recon Vulnerability Scan.csv

Loaded: Benign Traffic.csv
Shape: (32620, 85)
Columns (85 total):
['Flow ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Protocol', 'Timestamp', 'Flow Duration', 'Total Fwd Packet', 'Total Bwd packets', 'Total Length of Fwd Packet', 'Total Length of Bwd Packet', 'Fwd Packet Length Max', 'Fwd Packet Length Min', 'Fwd Packet Length Mean', 'Fwd Packet Length Std', 'Bwd Packet Length Max', 'Bwd Packet Length Min', 'Bwd Packet Length Mean', 'Bwd Packet Length Std', 'Flow Bytes/s', 'Flow Packets/s', 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max', 'Flow IAT Min', 'Fwd IAT Total', 'Fwd IAT Mean', '
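
The column-consistency check listed in the Objective could be sketched as follows. This is a minimal illustration, not the notebook's own code: `compare_columns` is a hypothetical helper, and the toy frames only stand in for the `datasets` dictionary built above. It compares column names but not column order.

```python
import pandas as pd


def compare_columns(datasets):
    """Return, per file, the columns missing from or extra to the first file."""
    reference_name = next(iter(datasets))
    reference_cols = list(datasets[reference_name].columns)
    report = {}
    for name, frame in datasets.items():
        cols = list(frame.columns)
        missing = [c for c in reference_cols if c not in cols]
        extra = [c for c in cols if c not in reference_cols]
        if missing or extra:
            report[name] = {"missing": missing, "extra": extra}
    return report


# Toy frames standing in for the Bronze CSVs (hypothetical contents)
toy_datasets = {
    "Benign Traffic": pd.DataFrame(columns=["Flow ID", "Protocol", "Label"]),
    "DoS TCP Flood": pd.DataFrame(columns=["Flow ID", "Protocol", "Label"]),
    "Recon Ping Sweep": pd.DataFrame(columns=["Flow ID", "Label", "Extra Col"]),
}

mismatches = compare_columns(toy_datasets)
print(mismatches)
```

An empty report would mean every file shares the reference file's column set; passing the real `datasets` dictionary instead of the toy one applies the same check to the Bronze files.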

# Understanding Attacks

In [9]:
attack_names = set() # To store unique attack names
attack_counts = {}  # To store class counts

csv_files = list(bronze_path.glob("*.csv"))

for file in csv_files:
    df = pd.read_csv(file)
    
    # Drop NaN attack names if present
    names = df['Attack Name'].dropna().unique()
    
    for n in names:
        attack_names.add(n)
    
    # Count occurrences
    counts = df['Attack Name'].value_counts()
    for cls, count in counts.items():
        if cls not in attack_counts:
            attack_counts[cls] = count
        else:
            attack_counts[cls] += count

print("Unique Attack Names:")
print(attack_names)

print("\nNumber of unique attack names:")
print(len(attack_names))

Unique Attack Names:
{'DDoS ICMP Flood', 'Benign Traffic', 'DoS TCP Flood', 'Recon Port Scan', 'MQTT Malformed', 'MQTT DoS Connect Flood', 'Recon Ping Sweep', 'Recon OS Scan', 'DoS UDP Flood', 'MQTT DoS Publish Flood', 'Recon Vulnerability Scan', 'DoS ICMP Flood', 'DDoS UDP Flood', 'MQTT DDoS Publish Flood', 'MITM ARP Spoofing'}

Number of unique attack names:
15


## Warning: Possible Class Imbalance

In [8]:
print("\nAttack Counts:")
print(attack_counts)


Attack Counts:
{'Benign Traffic': 32620, 'DDoS ICMP Flood': 2552, 'DDoS UDP Flood': 2576, 'DoS ICMP Flood': 2107, 'DoS TCP Flood': 2106916, 'DoS UDP Flood': 3115, 'MITM ARP Spoofing': 1053, 'MQTT DDoS Publish Flood': 413913, 'MQTT DoS Connect Flood': 238031, 'MQTT DoS Publish Flood': 953, 'MQTT Malformed': 2246, 'Recon OS Scan': 85317, 'Recon Ping Sweep': 71, 'Recon Port Scan': 485522, 'Recon Vulnerability Scan': 8321}
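
To quantify the imbalance, the counts printed above can be turned into class shares and a largest-to-smallest ratio. The sketch below copies the counts from that output. The variable names `shares` and `imbalance_ratio` are illustrative, and the ratio is only one of several possible imbalance measures.

```python
import pandas as pd

# Counts copied from the output above
attack_counts = {
    "Benign Traffic": 32620, "DDoS ICMP Flood": 2552, "DDoS UDP Flood": 2576,
    "DoS ICMP Flood": 2107, "DoS TCP Flood": 2106916, "DoS UDP Flood": 3115,
    "MITM ARP Spoofing": 1053, "MQTT DDoS Publish Flood": 413913,
    "MQTT DoS Connect Flood": 238031, "MQTT DoS Publish Flood": 953,
    "MQTT Malformed": 2246, "Recon OS Scan": 85317, "Recon Ping Sweep": 71,
    "Recon Port Scan": 485522, "Recon Vulnerability Scan": 8321,
}

counts = pd.Series(attack_counts).sort_values(ascending=False)
shares = counts / counts.sum()                  # fraction of all rows per class
imbalance_ratio = counts.max() / counts.min()   # largest class vs smallest class

print(shares.round(4))
print(f"Largest-to-smallest class ratio: {imbalance_ratio:,.0f}")
```

With these counts, `DoS TCP Flood` makes up well over half of all rows while `Recon Ping Sweep` has only 71, so any later train/test split or resampling strategy will likely need to account for the rare classes.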


## Attack Types Summary (15 Classes)

This table summarizes all 15 traffic classes in the dataset (14 attack types plus benign traffic), including each class's objective, behavior, and detectable flow-level patterns.

| Attack Name | Objective | How It Works | Detectable Patterns (Flow Features) |
|-------------|-----------|--------------|-------------------------------------|
| **Benign Traffic** | Normal IoT communication | Regular IoT device activity; periodic and lightweight traffic | Stable IAT, balanced Fwd/Bwd packets, low packet sizes, predictable intervals |
| **DDoS ICMP Flood** | Overwhelm target with ICMP echo requests from multiple sources | Sends massive distributed ICMP ping floods | Protocol = ICMP, extremely high Fwd packets, almost no Bwd response, very low IAT Min, repetitive packet length |
| **DDoS UDP Flood** | Overload target with high-rate distributed UDP traffic | Multiple hosts send continuous UDP packets | High `Flow Bytes/s`, high `Total Fwd Packet`, unidirectional traffic, low IAT, irregular packet length |
| **DoS TCP Flood** | Exhaust server resources using TCP packets | Single attacker sends a large number of TCP SYN/ACK/PSH packets | High `SYN Flag Count`, massive Fwd packets, minimal Bwd packets, high throughput, IAT Min near zero |
| **DoS UDP Flood** | Overload host with UDP datagrams | Single attacker sends continuous UDP traffic | Large packet sizes, high `Flow Packets/s`, unidirectional traffic, almost no Bwd packets |
| **DoS ICMP Flood** | Overload host with ICMP pings | Sends ICMP Echo Requests repeatedly | Protocol = ICMP, uniform packet sizes, near-zero IAT, high Fwd packet rate |
| **MQTT DoS Connect Flood** | Overload broker by opening too many connections | Sends thousands of MQTT CONNECT packets | Very short flows, low Flow Duration, repetitive packet length, high number of flows/sec |
| **MQTT DoS Publish Flood** | Flood broker with PUBLISH messages | Sends high-rate MQTT PUBLISH frames | High `Total Fwd Packet`, repetitive payload sizes, elevated `Flow Bytes/s` |
| **MQTT DDoS Publish Flood** | Distributed publish flooding | Multiple devices send PUBLISH floods simultaneously | Extremely high packet rate, very low IAT Min, repetitive Fwd packet sizes |
| **MQTT Malformed** | Crash or destabilize broker with malformed packets | Sends invalid MQTT frames or corrupted headers | Abnormal packet sizes (min/max/std), high RST flags, short failed flows |
| **Recon Port Scan** | Discover open ports on a host | Attacker probes many ports sequentially | Very short flows, `Total Fwd Packet` often = 1, minimal/no Bwd traffic, low Flow Duration, repetitive packet lengths |
| **Recon Ping Sweep** | Identify active hosts on a network | Sends ping requests to many IPs | Protocol = ICMP, consistent packet lengths, very low Bwd packets, very short flows |
| **Recon OS Scan** | Identify target OS | Sends specially crafted TCP packets to detect OS fingerprints | Rare TCP flags (FIN, URG, CWR, ECE), stable IAT patterns, low packet counts |
| **Recon Vulnerability Scan** | Identify services and weaknesses | Uses advanced scanners (Nmap/Nessus) to probe services | More complex flows, greater Bwd responses, structured Fwd packet lengths, mid-range Flow Duration |
| **MITM ARP Spoofing** | Intercept or modify victim traffic | Attacker poisons ARP tables to redirect traffic through the attacker | Abnormal bidirectional traffic, long flows, low idle time, inconsistent header lengths, unusual Fwd/Bwd packet ratios |
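
The patterns in the table are expectations rather than measured results. One way to check them against the data is a per-class summary of a few flow features. The sketch below uses a small toy frame with made-up values; in practice the concatenated Bronze frames would take its place.

```python
import pandas as pd

# Toy rows with made-up values, only to show the shape of the check
toy_flows = pd.DataFrame({
    "Attack Name": ["Benign Traffic", "Benign Traffic",
                    "Recon Port Scan", "Recon Port Scan"],
    "Total Fwd Packet": [10, 12, 1, 1],
    "Total Bwd packets": [9, 11, 0, 0],
    "Flow Duration": [5_000_000, 6_000_000, 40, 60],
})

features = ["Total Fwd Packet", "Total Bwd packets", "Flow Duration"]

# Median per class is less sensitive to outliers than the mean
profile = toy_flows.groupby("Attack Name")[features].median()
print(profile)
```

If the real per-class medians disagree with a row of the table (for example, port-scan flows with many backward packets), that row should be revised before it guides feature selection.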



# Understanding the Features

In [6]:
df.columns

Index(['Flow ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Protocol',
       'Timestamp', 'Flow Duration', 'Total Fwd Packet', 'Total Bwd packets',
       'Total Length of Fwd Packet', 'Total Length of Bwd Packet',
       'Fwd Packet Length Max', 'Fwd Packet Length Min',
       'Fwd Packet Length Mean', 'Fwd Packet Length Std',
       'Bwd Packet Length Max', 'Bwd Packet Length Min',
       'Bwd Packet Length Mean', 'Bwd Packet Length Std', 'Flow Bytes/s',
       'Flow Packets/s', 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max',
       'Flow IAT Min', 'Fwd IAT Total', 'Fwd IAT Mean', 'Fwd IAT Std',
       'Fwd IAT Max', 'Fwd IAT Min', 'Bwd IAT Total', 'Bwd IAT Mean',
       'Bwd IAT Std', 'Bwd IAT Max', 'Bwd IAT Min', 'Fwd PSH Flags',
       'Bwd PSH Flags', 'Fwd URG Flags', 'Bwd URG Flags', 'Fwd Header Length',
       'Bwd Header Length', 'Fwd Packets/s', 'Bwd Packets/s',
       'Packet Length Min', 'Packet Length Max', 'Packet Length Mean',
       'Packet Length Std', 'Packet Len
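
Beyond column names, the Objective calls for inspecting data types and building a data dictionary. A minimal sketch of such a dictionary is shown below. `build_data_dictionary` is a hypothetical helper and the toy frame uses made-up values; calling it on any frame in `datasets` would give the real picture.

```python
import pandas as pd


def build_data_dictionary(frame):
    """Summarize each column: dtype, missing values, and distinct values."""
    summary = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),  # NaN is not counted as a value
    })
    return summary.rename_axis("column").reset_index()


# Toy frame with made-up values
toy = pd.DataFrame({
    "Protocol": [6, 17, 6],
    "Flow Bytes/s": [1200.5, None, 88.0],
    "Label": ["a", "b", "b"],
})

data_dictionary = build_data_dictionary(toy)
print(data_dictionary)
```

Columns with many missing values or a single distinct value stand out immediately in this view, which helps when deciding what to keep.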

# Overview of Features and Their Importance in Attack Detection

The TabularIoTAttack-2024 dataset contains **85 columns**, most of them flow-based features generated by CICFlowMeter.  
These columns fall into seven groups:

---

## **A. Flow Metadata (Identifiers and Addresses)**  
These include Flow ID, IPs, Ports, and Timestamp.  
They are **not behavioral features**: they describe *where* a flow comes from, not *how* it behaves.  
Models trained on them tend to **overfit** by memorizing specific IPs or ports instead of learning attack patterns.  
Therefore, they are generally **removed**.

---

## **B. Traffic Volume and Packet Statistics**  
These features quantify how many packets and bytes moved in each direction (forward/backward).  
Examples:  
- Total Fwd Packet  
- Total Bwd packets  
- Packet Length Mean / Max / Std  

These features are **highly related to flooding attacks**, where packet counts skyrocket.  
They are also effective in detecting scanning and malformed traffic.  
These are among the **most important features**.

---

## **C. Inter-Arrival Times (IAT)**  
IAT features measure the time elapsed between consecutive packets in a flow, overall and per direction.  
Examples:  
- Flow IAT Mean  
- Fwd IAT Std  
- Bwd IAT Max  

These features are essential for detecting:  
- **DoS/DDoS** (very short IAT)  
- **Port scans** (very regular IAT)  
- **Benign IoT traffic** (periodic patterns)  

Therefore, they are **kept**.
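
To make the IAT statistics concrete, the sketch below derives mean, standard deviation, maximum, and minimum from a list of packet timestamps. It is an illustration under stated assumptions, not CICFlowMeter's implementation: `iat_stats` is a hypothetical helper, the timestamps are made up and expressed in seconds, and `numpy`'s default population standard deviation is used.

```python
import numpy as np


def iat_stats(timestamps):
    """Inter-arrival statistics for one flow's packet timestamps."""
    ts = np.sort(np.asarray(timestamps, dtype=float))
    gaps = np.diff(ts)  # time between consecutive packets
    if gaps.size == 0:  # a single-packet flow has no gaps
        return {"mean": 0.0, "std": 0.0, "max": 0.0, "min": 0.0}
    return {
        "mean": float(gaps.mean()),
        "std": float(gaps.std()),
        "max": float(gaps.max()),
        "min": float(gaps.min()),
    }


# Made-up flows: one packet per second vs. one packet per millisecond
periodic = iat_stats([0.0, 1.0, 2.0, 3.0])
burst = iat_stats([0.0, 0.001, 0.002, 0.003])

print("periodic:", periodic)
print("burst:   ", burst)
```

Both toy flows are perfectly regular, so their standard deviation is essentially zero; what separates the flood-like burst from the periodic flow is the much smaller mean and minimum gap.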

---

## **D. TCP Flag-Based Features**  
Examples:  
- SYN Flag Count  
- FIN Flag Count  
- PSH Flag Count  

These are critical for detecting:  
- TCP SYN floods  
- Half-open scans  
- Connection-reset attacks  

These features are always **kept**.

---

## **E. Flow Timing: Active / Idle Intervals**  
These describe how long a flow is active vs idle.  
Attacks often have extremely short or repetitive active periods.  
They strongly contribute to distinguishing benign periodic IoT traffic from attack bursts.

---

## **F. Bulk Features and Subflows**  
Bulk averages often remain **zero** across IoT flows.  
Subflow features tend to correlate highly with other packet-count features.  
Thus, some may be redundant and removed unless they show unique information during EDA.
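
Both claims (bulk features stuck at zero, subflow features duplicating packet counts) can be checked during EDA. The sketch below shows one possible approach on a toy frame with made-up values. The helper names and the 0.95 correlation threshold are assumptions, not values taken from this notebook.

```python
import pandas as pd


def constant_columns(frame):
    """Numeric columns that hold a single value across all rows."""
    numeric = frame.select_dtypes("number")
    return [c for c in numeric.columns if numeric[c].nunique(dropna=False) <= 1]


def correlated_pairs(frame, threshold=0.95):
    """Pairs of numeric columns whose absolute Pearson correlation >= threshold."""
    corr = frame.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            # Constant columns yield NaN correlations, which never pass the test
            if corr.loc[a, b] >= threshold:
                pairs.append((a, b, float(corr.loc[a, b])))
    return pairs


# Toy frame with made-up values
toy = pd.DataFrame({
    "Total Fwd Packet": [1, 5, 9, 20],
    "Subflow Fwd Packets": [1, 5, 9, 20],
    "Fwd Bytes/Bulk Avg": [0, 0, 0, 0],
    "Flow Duration": [30, 10, 50, 20],
})

zero_variance = constant_columns(toy)
redundant = correlated_pairs(toy)
print(zero_variance)
print(redundant)
```

Columns reported by `constant_columns` carry no information for a model, while each pair from `correlated_pairs` is a candidate for keeping only one member.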

---

## **G. Behavioral Labels**  
- Attack Name  
- Label  

These are **targets**, not predictive features.  
"Label" is the prediction target.  
"Attack Name" is not needed as an input for binary classification; it would only serve as the target in a multi-class setting.

---

# Feature Retention Table (85 Features)

Below is a complete table listing all 85 features, whether to keep or eliminate them, and the justification for each decision.

---

## **Feature Retention Summary Table**

| Feature | Keep / Drop | Justification |
|--------|--------------|---------------|
| Flow ID | Drop | Unique identifier; no behavioral value. |
| Src IP | Drop | Causes overfitting; environment-specific. |
| Src Port | Drop | Not a behavioral indicator; risk of leakage. |
| Dst IP | Drop | Same as Src IP; environment-dependent. |
| Dst Port | Drop | Often correlated to service, not attack behavior. |
| Protocol | Keep | Critical for distinguishing ICMP/TCP/UDP attacks. |
| Timestamp | Drop | Not useful; induces temporal leakage. |
| Flow Duration | Keep | Strongly tied to DoS and scan behaviors. |
| Total Fwd Packet | Keep | Key metric for floods/scans. |
| Total Bwd packets | Keep | Indicates server responses; useful for scans. |
| Total Length of Fwd Packet | Keep | Important for volume-based attacks. |
| Total Length of Bwd Packet | Keep | Same rationale as above. |
| Fwd Packet Length Max | Keep | Detects abnormal payload structures. |
| Fwd Packet Length Min | Keep | Behavioral fingerprints. |
| Fwd Packet Length Mean | Keep | Critical for distinguishing benign IoT patterns. |
| Fwd Packet Length Std | Keep | High variance indicates anomalies. |
| Bwd Packet Length Max | Keep | Useful for malformed packet detection. |
| Bwd Packet Length Min | Keep | Behavioral pattern. |
| Bwd Packet Length Mean | Keep | Important for flow characterization. |
| Bwd Packet Length Std | Keep | Helps detect jittery attacks. |
| Flow Bytes/s | Keep | Strong indicator of DDoS volume. |
| Flow Packets/s | Keep | Crucial for flood attack detection. |
| Flow IAT Mean | Keep | Distinguishes scanning vs benign periodic IoT. |
| Flow IAT Std | Keep | Measures irregularity typical in attacks. |
| Flow IAT Max | Keep | Useful for detecting idle behavior. |
| Flow IAT Min | Keep | Detects high-speed bursts. |
| Fwd IAT Total | Keep | Helps characterize directional behavior. |
| Fwd IAT Mean | Keep | Very strong attack indicator. |
| Fwd IAT Std | Keep | Detects disruptions in packet timing. |
| Fwd IAT Max | Keep | Distinguishes flooding vs scanning. |
| Fwd IAT Min | Keep | Shows extremely fast forward traffic. |
| Bwd IAT Total | Keep | Captures server-side timing. |
| Bwd IAT Mean | Keep | Important for response-based attacks. |
| Bwd IAT Std | Keep | Detects unstable back-and-forth flows. |
| Bwd IAT Max | Keep | High values usually benign; helpful. |
| Bwd IAT Min | Keep | Very small during DoS floods. |
| Fwd PSH Flags | Keep | Used for certain TCP attacks. |
| Bwd PSH Flags | Keep | Behavioral indicator. |
| Fwd URG Flags | Keep | Rare → anomaly detection. |
| Bwd URG Flags | Keep | Same rationale. |
| Fwd Header Length | Keep | Reflects packet structure. |
| Bwd Header Length | Keep | Same rationale. |
| Fwd Packets/s | Keep | Critical for floods. |
| Bwd Packets/s | Keep | Important for scans. |
| Packet Length Min | Keep | Encodes traffic pattern. |
| Packet Length Max | Keep | Useful for malformed packets. |
| Packet Length Mean | Keep | Summarizes flow characteristics. |
| Packet Length Std | Keep | Measures stability of packet sizes. |
| Packet Length Variance | Drop | Redundant with Std. |
| FIN Flag Count | Keep | Good for detecting half-closed scans. |
| SYN Flag Count | Keep | Essential for SYN flood detection. |
| RST Flag Count | Keep | Detects reset-based attacks. |
| PSH Flag Count | Keep | Identifies forced data push attempts. |
| ACK Flag Count | Keep | Reflects normal vs abnormal TCP behavior. |
| URG Flag Count | Keep | Rare → anomaly cue. |
| CWR Flag Count | Keep | Useful but rare; can mark anomalies. |
| ECE Flag Count | Keep | Signals congestion notification; rare, so it can mark anomalies. |
| Down/Up Ratio | Drop | Derived; redundant with forward/backward counts. |
| Average Packet Size | Drop | Duplicate information from packet length features. |
| Fwd Segment Size Avg | Keep | Reflects segment structure. |
| Bwd Segment Size Avg | Keep | Same rationale. |
| Fwd Bytes/Bulk Avg | Drop | Almost always 0; IoT rarely uses bulk transfers. |
| Fwd Packet/Bulk Avg | Drop | Same as above. |
| Fwd Bulk Rate Avg | Drop | Mostly zeros while adding noise. |
| Bwd Bytes/Bulk Avg | Drop | Same as above. |
| Bwd Packet/Bulk Avg | Drop | Same as above. |
| Bwd Bulk Rate Avg | Drop | Same as above. |
| Subflow Fwd Packets | Keep (tentative) | Sometimes redundant; evaluated during EDA. |
| Subflow Fwd Bytes | Keep (tentative) | Same rationale. |
| Subflow Bwd Packets | Keep (tentative) | Same rationale. |
| Subflow Bwd Bytes | Keep (tentative) | Same rationale. |
| Fwd Init Win Bytes | Drop | Varies by OS; noisy and environment-specific. |
| Bwd Init Win Bytes | Drop | Same rationale. |
| Fwd Act Data Pkts | Keep | Helps detect incomplete TCP handshakes. |
| Fwd Seg Size Min | Keep | Structural packet feature. |
| Active Mean | Keep | Captures periodic IoT behavior. |
| Active Std | Keep | Detects jittery attack patterns. |
| Active Max | Keep | Characterizes maximum flow activity. |
| Active Min | Keep | Distinguishes active vs idle attacks. |
| Idle Mean | Keep | Reflects flow quiet periods. |
| Idle Std | Keep | Helps detect irregular idle behavior. |
| Idle Max | Keep | Important for long flows. |
| Idle Min | Keep | Useful for burst attacks. |
| Attack Name | Drop | Not used for prediction; only metadata. |
| Label | Keep | Ground truth target. |
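
For the later cleaning notebook, the Drop decisions in the table could be applied as sketched below. The list is transcribed from the table; `select_features` is a hypothetical helper, and the column names in the actual CSV files may differ slightly in spelling or case, so names that are absent are simply skipped. The function returns a new frame and leaves its input untouched.

```python
import pandas as pd

# Columns marked "Drop" in the retention table (names as written there)
DROP_COLUMNS = [
    "Flow ID", "Src IP", "Src Port", "Dst IP", "Dst Port", "Timestamp",
    "Packet Length Variance", "Down/Up Ratio", "Average Packet Size",
    "Fwd Bytes/Bulk Avg", "Fwd Packet/Bulk Avg", "Fwd Bulk Rate Avg",
    "Bwd Bytes/Bulk Avg", "Bwd Packet/Bulk Avg", "Bwd Bulk Rate Avg",
    "Fwd Init Win Bytes", "Bwd Init Win Bytes", "Attack Name",
]


def select_features(frame):
    """Return a copy of the frame without the columns marked Drop."""
    present = [c for c in DROP_COLUMNS if c in frame.columns]
    return frame.drop(columns=present)


# Toy frame with made-up values
toy = pd.DataFrame({
    "Flow ID": ["f1", "f2"],
    "Protocol": [6, 17],
    "Attack Name": ["Benign Traffic", "DoS UDP Flood"],
    "Label": ["a", "b"],
})

selected = select_features(toy)
print(selected.columns.tolist())
```

The table marks 18 columns as Drop, which leaves 67 of the 85 columns, including the `Label` target and the four tentative subflow features.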



# Summary of this stage

In this notebook, we thoroughly explored the structure and content of the TabularIoTAttack-2024 dataset. We reviewed all CSV files from the Bronze layer, examined their feature names, data types, and ensured consistency across files. We also studied the meaning of each variable generated by CICFlowMeter, identifying how these features represent network-flow behavior relevant to intrusion detection. This stage allowed us to build a clear understanding of the dataset's composition before performing any cleaning or transformations. With this foundation, the data is now ready for the detailed cleaning and preprocessing performed in the next notebook.