In [1]:
"""# 📊 Data Acquisition & Sampling  
This notebook loads **lightweight network traffic datasets** and prepares **small, reproducible subsets** for hybrid flow- and packet-level analysis.  

We support three data modes:
1. **Public datasets** (NSL-KDD, UNSW-NB15, CICIDS2017, N-BaIoT, TON-IoT)  
2. **Synthetic dataset generation** (using a Scapy recipe)  
3. **Live capture** (optional, for real-time experiments)  

By the end of this notebook, the following files exist:
- `data/sample/flows.csv` → cleaned flow-level data  
- `data/sample/packets.pcap` → small packet-level traces  
- `data/sample/dataset_summary.json` → reproducibility metadata  
"""

In [2]:
"""## 🔁 Inputs & Outputs

**Inputs:**
- Dataset URLs (KaggleHub / GitHub links)  
- `extras/live_capture_config.yaml` (for optional live capture)  
- Random seed from `reproducibility_info.json`

**Outputs:**
- `data/sample/flows.csv`  
- `data/sample/packets.pcap`  
- `data/sample/dataset_summary.json`  
"""

In [4]:
%pip install kagglehub

Collecting kagglehub
  Downloading kagglehub-0.3.13-py3-none-any.whl.metadata (38 kB)
Downloading kagglehub-0.3.13-py3-none-any.whl (68 kB)
Installing collected packages: kagglehub
Successfully installed kagglehub-0.3.13
Note: you may need to restart the kernel to use updated packages.


In [5]:
import json
import os
import random
from datetime import datetime
from pathlib import Path

import numpy as np
import pandas as pd

# For Scapy synthetic generation and PCAP parsing
from scapy.all import IP, TCP, UDP, DNS, DNSQR, Raw, wrpcap

# KaggleHub for dataset imports
import kagglehub

# Directory setup
Path("data/sample").mkdir(parents=True, exist_ok=True)
Path("data/raw").mkdir(parents=True, exist_ok=True)

print("✅ Directories ready: data/raw, data/sample")


✅ Directories ready: data/raw, data/sample


In [6]:
# 🎲 Load reproducibility config
with open("reproducibility_info.json", "r") as f:
    repro_info = json.load(f)
random.seed(repro_info["seed"])
np.random.seed(repro_info["seed"])
print("Using seed:", repro_info["seed"])


Using seed: 42
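
The cell above assumes that `reproducibility_info.json` exists and contains a `seed` key. A minimal sketch of a more forgiving loader might look like this; the helper name `load_seed` and the fallback value of 42 are assumptions made for illustration, not part of the project.

```python
import json
import random
from pathlib import Path


def load_seed(path="reproducibility_info.json", default=42):
    """Return the seed stored in `path`, or `default` if the file or key is missing.

    `load_seed` and `default` are hypothetical; the notebook reads the file directly.
    """
    p = Path(path)
    if not p.exists():
        return default
    with p.open("r", encoding="utf-8") as f:
        return int(json.load(f).get("seed", default))


seed = load_seed()

# Re-seeding with the same value reproduces the same draws
random.seed(seed)
first = [random.randint(0, 99) for _ in range(3)]
random.seed(seed)
second = [random.randint(0, 99) for _ in range(3)]
print("Draws repeat after re-seeding:", first == second)
```

NumPy would be seeded the same way with `np.random.seed(seed)`, as the cell above already does.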


In [7]:
## 🌐 Public Dataset Options
"""Below are lightweight datasets suitable for this project.  
Each can be downloaded directly via KaggleHub (if credentials available) or manually from GitHub/official sources.

| Dataset | KaggleHub Import | Size | Use |
|----------|------------------|------|-----|
| NSL-KDD | `kagglehub.dataset_download("hassan06/nslkdd")` | ~18 MB | Flow-level baseline |
| UNSW-NB15 | `kagglehub.dataset_download("mrwellsdavid/unsw-nb15")` | <100 MB | Modern labeled flows |
| CICIDS2017 | `kagglehub.dataset_download("hcavsi/cicids2017-dataset")` | variable | Hybrid flow + packet |
| N-BaIoT | `kagglehub.dataset_download("mkashifn/nbaiot-dataset")` | few MB | IoT anomaly analysis |
| TON-IoT | `kagglehub.dataset_download("programmer3/ton-iot-network-intrusion-dataset")` | <50 MB | Agentic orchestration |
"""

In [8]:
# 🔧 Choose dataset: 'NSL-KDD', 'UNSW-NB15', 'CICIDS2017', 'N-BaIoT', 'TON-IoT', or 'synthetic'
DATASET_CHOICE = "NSL-KDD"

print(f"Dataset choice set to: {DATASET_CHOICE}")


Dataset choice set to: NSL-KDD


In [15]:
def download_dataset(choice, target_dir="data/raw"):
    """
    Download a dataset from KaggleHub and save it to the target directory.
    
    Args:
        choice (str): Dataset name (e.g., "NSL-KDD", "UNSW-NB15", etc.)
        target_dir (str): Directory to save the downloaded dataset (default: "data/raw")
        
    Returns:
        str: Path to the downloaded dataset directory
    """
    import shutil
    from pathlib import Path
    
    # Create target directory if it doesn't exist
    target_path = Path(target_dir) / choice
    target_path.mkdir(parents=True, exist_ok=True)
    
    try:
        print(f"⏳ Downloading {choice} dataset...")
        
        # Download the dataset
        if choice == "NSL-KDD":
            src_path = Path(kagglehub.dataset_download("hassan06/nslkdd"))
        elif choice == "UNSW-NB15":
            src_path = Path(kagglehub.dataset_download("mrwellsdavid/unsw-nb15"))
        elif choice == "CICIDS2017":
            src_path = Path(kagglehub.dataset_download("hcavsi/cicids2017-dataset"))
        elif choice == "N-BaIoT":
            src_path = Path(kagglehub.dataset_download("mkashifn/nbaiot-dataset"))
        elif choice == "TON-IoT":
            src_path = Path(kagglehub.dataset_download("programmer3/ton-iot-network-intrusion-dataset"))
        else:
            print(f"❌ Unsupported dataset: {choice}")
            return None
        
        # Copy all files from source to target directory
        for item in src_path.glob('*'):
            if item.is_file():
                shutil.copy2(item, target_path / item.name)
            elif item.is_dir():
                shutil.copytree(item, target_path / item.name, dirs_exist_ok=True)
        
        print(f"✅ Dataset {choice} downloaded and saved to: {target_path}")
        return str(target_path)
        
    except Exception as e:
        print(f"❌ Error downloading {choice}: {str(e)}")
        return None

# Download the dataset
data_path = download_dataset(DATASET_CHOICE)
if data_path:
    print(f"📁 Dataset available at: {data_path}")
    
    # List the downloaded files
    print("\n📂 Downloaded files:")
    for f in Path(data_path).glob('*'):
        print(f"  - {f.name}")
else:
    print("❌ Failed to download the dataset. Please check your internet connection and Kaggle credentials.")

⏳ Downloading NSL-KDD dataset...
✅ Dataset NSL-KDD downloaded and saved to: data\raw\NSL-KDD
📁 Dataset available at: data\raw\NSL-KDD

📂 Downloaded files:
  - index.html
  - KDDTest+.arff
  - KDDTest+.txt
  - KDDTest-21.arff
  - KDDTest-21.txt
  - KDDTest1.jpg
  - KDDTrain+.arff
  - KDDTrain+.txt
  - KDDTrain+_20Percent.arff
  - KDDTrain+_20Percent.txt
  - KDDTrain1.jpg
  - nsl-kdd


In [16]:
## 🧮 Downsampling & Sampling Strategy
"""
To keep datasets small and reproducible:
- Use **random_state = 42** for deterministic sampling.
- Limit to **≤ 10,000 flows** for training.
- Select a balanced subset across classes (normal vs. attack).
- Save the subset to `data/sample/flows.csv`.

For large CSVs, load with the pandas `chunksize` option to avoid running out of memory.
"""

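
The strategy calls for a balanced subset across classes. One possible way to do that with pandas is sketched below, assuming the class labels live in a single column. The helper name `balanced_sample`, the `per_class` parameter, and the toy data are illustrative assumptions. For NSL-KDD the raw labels are attack names, so they would first need to be mapped to normal vs. attack.

```python
import pandas as pd


def balanced_sample(df, label_col, per_class, seed=42):
    """Draw up to `per_class` rows from each class, then shuffle the result.

    Hypothetical helper: classes with fewer than `per_class` rows are kept whole.
    """
    parts = []
    for _, group in df.groupby(label_col):
        n = min(len(group), per_class)
        parts.append(group.sample(n=n, random_state=seed))
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data: 90 normal rows and 10 attack rows
toy = pd.DataFrame({"x": range(100), "label": ["normal"] * 90 + ["attack"] * 10})
subset = balanced_sample(toy, "label", per_class=20)
print(subset["label"].value_counts())
```

Capping each class at `per_class` keeps the majority class from dominating; it does not oversample a minority class that has fewer rows than the cap.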

In [17]:
# 📊 Example: sample up to 10,000 records deterministically
if DATASET_CHOICE == "NSL-KDD" and data_path:
    full_csv = Path(data_path) / "KDDTrain+.txt"
    if full_csv.exists():
        # KDDTrain+.txt has no header row
        df = pd.read_csv(full_csv, header=None)
        # Cap n at the number of available rows; reuse the notebook-wide seed
        sample_df = df.sample(n=min(10000, len(df)), random_state=repro_info["seed"])
        sample_df.to_csv("data/sample/flows.csv", index=False)
        print("✅ Saved sample to data/sample/flows.csv, rows:", sample_df.shape[0])
    else:
        print("⚠️ Dataset file not found. Check path:", full_csv)


✅ Saved sample to data/sample/flows.csv, rows: 10000
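
The cell above reads the whole file into memory before sampling, which is fine for NSL-KDD. For CSVs too large to load at once, a rough sketch of chunked sampling with the pandas `chunksize` option follows. The helper name `sample_csv_in_chunks` and its parameters are assumptions for illustration, and the approach keeps a fixed fraction of each chunk rather than an exact row count.

```python
import io

import pandas as pd


def sample_csv_in_chunks(source, frac=0.1, chunksize=1000, seed=42):
    """Read a headerless CSV in chunks and keep a random fraction of each chunk.

    Hypothetical helper: `source` may be a path or a file-like object.
    """
    kept = []
    reader = pd.read_csv(source, header=None, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Offset the seed per chunk so chunks do not all pick the same row positions
        kept.append(chunk.sample(frac=frac, random_state=seed + i))
    return pd.concat(kept, ignore_index=True)


# Toy demonstration on an in-memory CSV with 1,000 rows
toy_csv = "\n".join(f"{i},{i % 3}" for i in range(1000))
sample = sample_csv_in_chunks(io.StringIO(toy_csv), frac=0.1, chunksize=100)
print("Rows kept:", len(sample))
```

Only one chunk plus the rows already kept are held in memory at a time, so peak usage scales with `chunksize` and the sample size instead of the full file.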


In [18]:
"""## 🧪 Synthetic Dataset (Scapy Recipe)
If no public dataset is chosen or for quick tests, generate small, labeled synthetic flows.

Traffic templates:
1. Normal web browsing (HTTP GETs)
2. DNS queries
3. IoT telemetry (UDP)
4. SYN flood bursts (DoS)
5. Suspicious payload injections

Each sample includes:
- Flow ID, packet count, byte count, duration, avg packet size, flags, label.
"""

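
The per-flow fields listed above (packet count, byte count, duration, average packet size, flags, label) can be derived from per-packet records. The sketch below shows one way to aggregate them using only the standard library. The record layout (`flow_id`, `ts`, `size`, `flags`, `label`) and the helper name `summarize_flows` are assumptions for illustration, not the notebook's actual schema.

```python
from collections import defaultdict


def summarize_flows(packets):
    """Aggregate per-packet dicts into one summary row per flow.

    Hypothetical record layout: flow_id, ts (seconds), size (bytes),
    flags (string of TCP flag letters), label.
    """
    grouped = defaultdict(list)
    for pkt in packets:
        grouped[pkt["flow_id"]].append(pkt)

    rows = []
    for flow_id, pkts in sorted(grouped.items()):
        sizes = [p["size"] for p in pkts]
        times = [p["ts"] for p in pkts]
        # Union of all flag letters seen in the flow, in sorted order
        flags = sorted({flag for p in pkts for flag in p["flags"]})
        rows.append({
            "flow_id": flow_id,
            "pkt_count": len(pkts),
            "byte_count": sum(sizes),
            "duration": max(times) - min(times),
            "avg_pkt_size": sum(sizes) / len(pkts),
            "flags": "".join(flags),
            "label": pkts[0]["label"],
        })
    return rows


toy_packets = [
    {"flow_id": 0, "ts": 0.0, "size": 60, "flags": "S", "label": "attack"},
    {"flow_id": 0, "ts": 0.5, "size": 40, "flags": "A", "label": "attack"},
    {"flow_id": 1, "ts": 1.0, "size": 55, "flags": "PA", "label": "normal"},
]
for row in summarize_flows(toy_packets):
    print(row)
```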

In [19]:
def generate_synthetic_pcaps(num_flows=50, packets_per_flow=20):
    flows_meta = []
    packets = []
    for i in range(num_flows):
        label = "normal" if i % 5 != 0 else "attack"
        src_ip = f"192.168.0.{i%255}"
        dst_ip = f"10.0.0.{(i*2)%255}"
        # One source port and payload per flow, so every packet shares the same 5-tuple
        sport = random.randint(1000, 60000)
        payload = b"GET /index.html" if label == "normal" else b"GET /cmd?malicious=1"
        for j in range(packets_per_flow):
            pkt = IP(src=src_ip, dst=dst_ip)/TCP(dport=80, sport=sport)/Raw(load=payload)
            packets.append(pkt)
        flows_meta.append({
            "flow_id": i,
            "src_ip": src_ip,
            "dst_ip": dst_ip,
            "pkt_count": packets_per_flow,
            "avg_pkt_size": np.mean([len(p) for p in packets[-packets_per_flow:]]),
            "label": label
        })
    wrpcap("data/sample/packets.pcap", packets)
    pd.DataFrame(flows_meta).to_csv("data/sample/flows.csv", index=False)
    print("✅ Synthetic PCAP + flow CSV generated.")

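# Note: this call runs regardless of DATASET_CHOICE and overwrites the
# data/sample/flows.csv written by the sampling cell above.
generate_synthetic_pcaps()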


✅ Synthetic PCAP + flow CSV generated.


In [20]:
## ✅ Verification of Sample Data
"""Check the structure and preview a few rows of the sampled dataset."""


In [21]:
df = pd.read_csv("data/sample/flows.csv")
print("Rows:", len(df))
df.head(5)


Rows: 50


Unnamed: 0,flow_id,src_ip,dst_ip,pkt_count,avg_pkt_size,label
0,0,192.168.0.0,10.0.0.0,20,60.0,attack
1,1,192.168.0.1,10.0.0.2,20,55.0,normal
2,2,192.168.0.2,10.0.0.4,20,55.0,normal
3,3,192.168.0.3,10.0.0.6,20,55.0,normal
4,4,192.168.0.4,10.0.0.8,20,55.0,normal
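
Beyond eyeballing the preview, a few structural checks can be automated. The following is a minimal sketch, assuming the column names shown in the preview above. The helper name `check_flows` and the specific checks are illustrative assumptions.

```python
import pandas as pd

# Column names taken from the synthetic flows preview; adjust for other datasets
REQUIRED_COLUMNS = ["flow_id", "src_ip", "dst_ip", "pkt_count", "avg_pkt_size", "label"]


def check_flows(df, required=REQUIRED_COLUMNS):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    missing = [c for c in required if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if df.empty:
        problems.append("no rows")
    if "label" in df.columns and df["label"].isna().any():
        problems.append("unlabeled rows present")
    return problems


toy = pd.DataFrame({
    "flow_id": [0, 1],
    "src_ip": ["192.168.0.0", "192.168.0.1"],
    "dst_ip": ["10.0.0.0", "10.0.0.2"],
    "pkt_count": [20, 20],
    "avg_pkt_size": [60.0, 55.0],
    "label": ["attack", "normal"],
})
print(check_flows(toy))
```

Returning a list of problems, instead of raising on the first one, lets the notebook report every issue in a single pass.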


In [22]:
summary = {
    "dataset": DATASET_CHOICE,
    "rows": len(df),
    "columns": list(df.columns),
    "timestamp": datetime.now().isoformat(),
    "source": str(data_path) if data_path else "synthetic",
}
with open("data/sample/dataset_summary.json", "w") as f:
    json.dump(summary, f, indent=4)
print("📄 Summary saved → data/sample/dataset_summary.json")


📄 Summary saved → data/sample/dataset_summary.json
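
One way to make the summary more useful for reproducibility is to record a checksum for each artifact, so later notebooks can confirm they are reading the same files. This is a sketch only: the helper names `sha256_of` and `artifact_checksums` are assumptions, and the notebook itself does not currently store checksums.

```python
import hashlib
from pathlib import Path


def sha256_of(path, block_size=65536):
    """Return the hex SHA-256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def artifact_checksums(paths):
    """Map each existing path to its SHA-256 digest; missing files are skipped."""
    return {str(p): sha256_of(p) for p in paths if Path(p).exists()}


checksums = artifact_checksums([
    "data/sample/flows.csv",
    "data/sample/packets.pcap",
])
print(checksums)
```

The resulting dictionary could be added to `summary` under a key such as `"checksums"` before the JSON file is written.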


In [23]:
"""## 🛰️ Live Capture Integration (Optional)
If configured in `extras/live_capture_config.yaml`,  
run `extras/live_capture.py` to collect short live traces for testing.
The capture will:
- Store packets to `/data/live_capture/`
- Append summary flows to `/data/sample/flows_live.csv`
"""

In [24]:
"""## 🏁 Notebook Summary
You have now prepared small datasets ready for preprocessing and feature extraction.

**Next:**  
➡️ Proceed to [02_preprocessing_and_feature_engineering.ipynb](02_preprocessing_and_feature_engineering.ipynb)

**Artifacts Created:**
- `data/sample/flows.csv`
- `data/sample/packets.pcap`
- `data/sample/dataset_summary.json`
"""

In [25]:
print("✅ Data Acquisition Complete. Proceed to 02_preprocessing_and_feature_engineering.ipynb.")


✅ Data Acquisition Complete. Proceed to 02_preprocessing_and_feature_engineering.ipynb.
