In [1]:
# 🧠 Network Traffic Analysis & Packet Inspection using ML & Agentic AI
# Welcome to **CompactNetTrace**, a lightweight, reproducible research project for **network traffic analysis**, **intrusion detection**, **anomaly localization**, and **policy enforcement** using compact ML models and an **agentic finite-state controller**.
#
# This notebook initializes the environment, defines reproducibility parameters, and provides setup and run instructions for the full project pipeline.


In [6]:
## 🎯 Project Goals
"""
This project aims to:
1. Implement lightweight ML models for flow- and packet-level intrusion detection.  
2. Integrate a simple, deterministic **Agentic FSM** for autonomous responses.  
3. Enable **explainability** for each detection (feature importances, payload segments).  
4. Support both **offline dataset** and **live packet capture** using Scapy/PyShark.  
5. Provide a **Streamlit Cloud demo** showcasing the detection + agentic workflow.  
6. Maintain full reproducibility under modest compute (≤ 8 GB RAM, dual-core CPU).  """


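Goal 2 mentions a deterministic finite-state agent. The sketch below is only an illustration of that idea, not the project's controller: the state names (`MONITOR`, `ALERT`, `BLOCK`), the event labels (`"anomaly"`, `"benign"`), and the transition table are all hypothetical.

```python
from enum import Enum


class State(Enum):
    MONITOR = "monitor"
    ALERT = "alert"
    BLOCK = "block"


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (State.MONITOR, "anomaly"): State.ALERT,
    (State.ALERT, "anomaly"): State.BLOCK,
    (State.ALERT, "benign"): State.MONITOR,
    (State.BLOCK, "benign"): State.MONITOR,
}


def step(state, event):
    """Return the next state; unknown (state, event) pairs keep the current state."""
    return TRANSITIONS.get((state, event), state)


def run(events, start=State.MONITOR):
    """Feed a sequence of events through the FSM and return every visited state."""
    state = start
    trace = [state]
    for event in events:
        state = step(state, event)
        trace.append(state)
    return trace
```

Because the table is a plain dictionary, the same event sequence always produces the same trace, which is what makes the agent's behavior reproducible and easy to log.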

In [7]:
## 📘 Notebook Execution Order
"""
| Notebook | Description |
|-----------|--------------|
| **00_environment_and_instructions** | Set up environment, define seeds, install dependencies |
| **01_data_acquisition_and_sampling** | Load/synthesize data, sample lightweight subsets |
| **02_preprocessing_and_feature_engineering** | Feature extraction, payload slicing, normalization |
| **03_baseline_models_training** | Train flow-level models (LogReg, RandomForest, IsolationForest) |
| **04_packet_level_model_training** | Train tiny 1D-CNN on packet payloads |
| **05_evaluation_and_explainability** | Evaluate models and generate SHAP-lite explanations |
| **06_agentic_orchestration_and_simulation** | Implement finite-state agent and run simulations |
| **07_deployment_and_demo_setup** | Prepare Streamlit demo app & artifacts |
| **08_short_paper_and_results** | Auto-generate research paper with results and figures |"""


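The execution order in the table follows the two-digit filename prefix. A minimal sketch of how that order could be derived and checked from a folder listing is shown here; the helper names (`ordered_notebooks`, `missing_steps`, `notebooks_in`) are hypothetical and not part of the project.

```python
import re
from pathlib import Path

# Matches names such as "03_baseline_models_training.ipynb".
NOTEBOOK_PATTERN = re.compile(r"^(\d{2})_.+\.ipynb$")


def ordered_notebooks(names):
    """Return notebook file names sorted by their two-digit numeric prefix."""
    numbered = []
    for name in names:
        match = NOTEBOOK_PATTERN.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


def missing_steps(names, last_step=8):
    """Return the step numbers in 0..last_step that have no matching notebook."""
    present = {int(name[:2]) for name in ordered_notebooks(names)}
    return [step for step in range(last_step + 1) if step not in present]


def notebooks_in(folder):
    """List the numbered notebooks found directly inside `folder`, in run order."""
    return ordered_notebooks(p.name for p in Path(folder).glob("*.ipynb"))
```

Calling `missing_steps(notebooks_in("notebooks"))` before a full run would flag any gap in the `00_` to `08_` sequence.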

In [8]:
## 💻 System Requirements
"""- Minimum: 8 GB RAM, Dual-Core CPU  
- OS: Windows / macOS / Linux  
- Recommended Python: 3.10 or higher  
- Runtime: Jupyter Notebook / VSCode / Google Colab

## 📦 Dependencies Overview
| Category | Libraries |
|-----------|------------|
| Core | `numpy`, `pandas`, `scikit-learn` |
| Deep Learning | `tensorflow-cpu` or `torch` (optional, lightweight CNN) |
| Packet Parsing | `scapy`, `pyshark`, `dpkt` |
| Visualization | `matplotlib`, `seaborn` |
| Explainability | `shap`, `lime` (optional: SHAP-lite local version) |
| Agentic Logic | `transitions` or custom FSM in Python |
| Web UI | `streamlit` |
| Utilities | `joblib`, `onnx`, `pyyaml`, `tqdm` |
"""


In [9]:
# 📦 Environment Setup
import sys, os, platform, random, json
import numpy as np
import pandas as pd
from datetime import datetime

print("Python version:", sys.version)
print("Platform:", platform.platform())
print("Working Directory:", os.getcwd())

# Ensure project folders exist
for folder in ["data", "artifacts", "notebooks", "extras"]:
    os.makedirs(folder, exist_ok=True)
print("✅ Folder structure verified.")


Python version: 3.12.10 (tags/v3.12.10:0cc8128, Apr  8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)]
Platform: Windows-11-10.0.26200-SP0
Working Directory: d:\project\Network Traffic Analysis and Packet Inspection using ML and Agentic AI\notebooks
✅ Folder structure verified.
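The recorded output shows the working directory is the `notebooks` folder, so relative names such as `data` and `artifacts` resolve inside `notebooks/` rather than at the project root. One possible way to anchor paths at the root is sketched below; `find_project_root` and `ensure_folders` are hypothetical helpers, and the rule "the parent of a folder named `notebooks` is the root" is an assumption about the layout.

```python
import os
from pathlib import Path


def find_project_root(start=None, notebooks_dirname="notebooks"):
    """Return the parent of the notebooks folder if we are inside it, else `start` itself."""
    start = Path(start or os.getcwd()).resolve()
    return start.parent if start.name == notebooks_dirname else start


def ensure_folders(root, folders=("data", "artifacts", "notebooks", "extras")):
    """Create the project folders under `root` and return their paths."""
    paths = [Path(root) / name for name in folders]
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

With this in place, `ensure_folders(find_project_root())` gives the same result whether the notebook is launched from the project root or from `notebooks/`.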


In [10]:
# 🎲 Set random seeds for reproducibility
SEED = 42
np.random.seed(SEED)
random.seed(SEED)

# Save reproducibility metadata
reproducibility_info = {
    "seed": SEED,
    "timestamp": datetime.now().isoformat(),
    "python_version": sys.version,
    "platform": platform.platform()
}

with open("reproducibility_info.json", "w") as f:
    json.dump(reproducibility_info, f, indent=4)

print("✅ Reproducibility config saved to reproducibility_info.json")


✅ Reproducibility config saved to reproducibility_info.json
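The seeding cell above covers `random` and NumPy's global generator. A slightly more defensive variant might look like the following sketch; `set_global_seed` is a hypothetical helper, and it deliberately leaves out deep learning frameworks, which have their own seeding calls (for example `tf.random.set_seed` or `torch.manual_seed`).

```python
import os
import random


def set_global_seed(seed=42):
    """Seed the standard library RNG and, if installed, NumPy's global RNG.

    Returns the list of generators that were seeded.
    """
    # PYTHONHASHSEED only affects interpreters started after this point,
    # such as subprocesses; the current process keeps its hash seed.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    seeded = ["random"]
    try:
        import numpy as np
    except ImportError:
        return seeded
    np.random.seed(seed)
    seeded.append("numpy")
    return seeded
```

Calling the helper at the top of every notebook, rather than only in this one, keeps each notebook reproducible when it is run on its own.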


In [12]:
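# ⚙️ Install required packages (run once; comment out the %pip line after a successful install)
# `pyyaml` is included because a later cell imports `yaml`.
# On Windows, a "[WinError 206] The filename or extension is too long" failure while installing `onnx`
# points to the path-length limit; enabling long-path support or using a virtual environment in a short path may help.
%pip install numpy pandas scikit-learn scapy pyshark matplotlib seaborn streamlit joblib onnx tqdm tensorflow-cpu shap lime pyyaml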


Defaulting to user installation because normal site-packages is not writeable
Collecting onnx
  Using cached onnx-1.19.1-cp312-cp312-win_amd64.whl.metadata (7.2 kB)
Collecting tensorflow-cpu
  Downloading tensorflow_cpu-2.20.0-cp312-cp312-win_amd64.whl.metadata (4.6 kB)
Collecting lime
  Downloading lime-0.2.0.1.tar.gz (275 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting scikit-image>=0.12 (from lime)
  Downloading scikit_image-0.25.2-cp312-cp312-win_amd64.whl.metadata (14 kB)
Collecting imageio!=2.35.0,>=2.33 (from scikit-image>=0.12->lime)
  Downloading imageio-2.37.2-py3-none-any.whl.metadata (9.7 kB)
Collecting tifffile>=2022.8.12 (from scikit-image>=0.12->lime)
  Downloading tifffile-2025.10.16-py3-none-any.whl.metadata (31 kB)
Collecting lazy-loader>=0.4 (from scikit-image>=0.12->lime)
  Downloading lazy_loader-0.4-py3-none-any.whl.metadata (7.6 kB)
Using cached onnx-1.19.1-cp312-cp312-win_amd64.whl (16.5 MB)
Do

  DEPRECATION: Building 'lime' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'lime'. Discussion can be found at https://github.com/pypa/pip/issues/6334
ERROR: Could not install packages due to an OSError: [WinError 206] The filename or extension is too long: 'C:\\Users\\HP\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python312\\site-packages\\onnx\\backend\\test\\data\\node\\test_attention_4d_with_past_and_present_qk_matmul_bias_3d_mask_causal\\test_data_set_0'


[notice] A new release of pip is available: 25.2 -> 25.3
[notice] To update, run: C:\Users\HP\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.12_qbz
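Since the install above ended with an error, it can help to check which packages are actually importable before moving on. The sketch below uses the standard library's `importlib.util.find_spec`; the `REQUIRED_MODULES` list is an assumed subset of this project's dependencies, written as import names rather than pip names (for example `sklearn` for `scikit-learn`, `yaml` for `pyyaml`).

```python
import importlib.util

# Import names (not pip names) for some of the packages this project lists; adjust as needed.
REQUIRED_MODULES = ["numpy", "pandas", "sklearn", "scapy", "yaml", "joblib"]


def missing_modules(module_names):
    """Return the top-level module names that the import system cannot locate."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages.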

In [13]:
## 📊 Datasets Overview
"""
This project supports small, reproducible network traffic datasets:

| Dataset | Description | Size | Usage |
|----------|-------------|------|--------|
| NSL-KDD | Classic flow-level IDS dataset | ~18 MB | Quick baseline training |
| UNSW-NB15 (subset) | Modern dataset with labeled flows | < 100 MB | Flow-level ML |
| CICIDS2017 (sampled) | Real traffic with attack types | variable | Hybrid flow + packet |
| N-BaIoT (IoT botnet) | IoT device traffic | few MB | IoT anomaly detection |
| Synthetic (Scapy) | Programmatic pcap generation | Custom | Live-capture simulation |
"""


In [22]:
# 🔍 Check if sample datasets are available
data_dir = "data/sample"
if os.path.exists(data_dir) and len(os.listdir(data_dir)) > 0:
    print(f"✅ Sample datasets found in {data_dir}")
else:
    print("⚠️ No sample data found. Run notebook '01_data_acquisition_and_sampling.ipynb' next.")


✅ Sample datasets found in data/sample


In [23]:
"""## 🛰️ Live Packet Capture Option

If you wish to enable **live capture** via Scapy/PyShark:
- Ensure you have administrative privileges (run as Administrator on Windows, or use `sudo` on Linux/macOS).  
- Capture duration and interface can be configured in `extras/live_capture_config.yaml`.  
- Captured packets are stored in `data/live_capture/` (relative to the working directory) and can be analyzed in later notebooks.  

**⚠️ Safety Tip:** Do not run live capture on production or public networks without permission.
"""


In [24]:
# ✍️ Create default live capture config
import yaml

live_capture_config = {
    "interface": "eth0",        # change per system
    "capture_duration": 30,     # seconds
    "output_dir": "data/live_capture",
    "file_prefix": "capture_",
    "max_packets": 1000
}

os.makedirs(live_capture_config["output_dir"], exist_ok=True)

with open("extras/live_capture_config.yaml", "w") as f:
    yaml.dump(live_capture_config, f)

print("✅ Default live capture config written to extras/live_capture_config.yaml")


✅ Default live capture config written to extras/live_capture_config.yaml
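Because the interface name and limits in this config are meant to be edited per system, a later notebook may want to validate the values before starting a capture. The following is a minimal sketch using a standard library dataclass; `LiveCaptureConfig` and `parse_live_capture_config` are hypothetical names, and the validation rules are assumptions rather than project requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LiveCaptureConfig:
    interface: str
    capture_duration: int  # seconds
    output_dir: str
    file_prefix: str
    max_packets: int

    def __post_init__(self):
        # Reject values that would make a capture meaningless.
        if not self.interface:
            raise ValueError("interface must be a non-empty string")
        if self.capture_duration <= 0:
            raise ValueError("capture_duration must be positive")
        if self.max_packets <= 0:
            raise ValueError("max_packets must be positive")


def parse_live_capture_config(raw):
    """Build a validated config from a dict; unknown or missing keys raise TypeError."""
    return LiveCaptureConfig(**raw)
```

The dictionary passed in would typically come from `yaml.safe_load` applied to the opened `extras/live_capture_config.yaml` file.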


In [25]:
"""## 🔁 Reproducibility & How to Run the Project

1. Run all notebooks **in order** from `00_` to `08_`.  
2. Each notebook saves artifacts (datasets, models, logs) in `artifacts/`.  
3. To reproduce experiments exactly, use the same `reproducibility_info.json` and `extras/live_capture_config.yaml`.  
4. The entire pipeline can be executed offline (no network access required once datasets are cached).  
5. Use **joblib.load()** or **onnxruntime.InferenceSession()** to load trained models later.  
"""

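Step 3 relies on reusing the saved `reproducibility_info.json`. A small sketch of how a later notebook could read the seed back and re-apply it is given below; `load_seed` and `reseed_from_file` are hypothetical helpers, and falling back to 42 when the file is missing is an assumption.

```python
import json
import random
from pathlib import Path


def load_seed(path="reproducibility_info.json", default=42):
    """Read the saved seed from the JSON file, or return `default` if the file is absent."""
    file = Path(path)
    if not file.exists():
        return default
    with file.open() as f:
        return int(json.load(f)["seed"])


def reseed_from_file(path="reproducibility_info.json"):
    """Re-apply the saved seed to the standard library RNG and return it."""
    seed = load_seed(path)
    random.seed(seed)
    return seed
```

NumPy and any deep learning framework in use would need to be reseeded with the returned value as well.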

In [26]:
"""## 🗂️ Project Folder Structure

project-root/
├── notebooks/
│   ├── 00_environment_and_instructions.ipynb
│   ├── 01_data_acquisition_and_sampling.ipynb
│   └── ...
├── data/
│   ├── sample/
│   └── live_capture/
├── artifacts/
│   ├── models/
│   └── eval/
├── extras/
│   ├── live_capture_config.yaml
│   ├── pcap_to_flow.py
│   └── synth_gen_config.yaml
├── streamlit_app.py
├── requirements.txt
├── README.md
└── short_paper.md
"""

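The layout above can be checked programmatically before running the later notebooks. The sketch below is one way to do that; `EXPECTED_LAYOUT` lists only a subset of the tree, and `missing_paths` is a hypothetical helper.

```python
from pathlib import Path

# Relative paths taken from the project tree; entries ending in "/" are directories.
EXPECTED_LAYOUT = [
    "notebooks/",
    "data/sample/",
    "data/live_capture/",
    "artifacts/models/",
    "artifacts/eval/",
    "extras/live_capture_config.yaml",
    "requirements.txt",
    "README.md",
]


def missing_paths(root, expected=EXPECTED_LAYOUT):
    """Return the expected entries that are absent, or of the wrong type, under `root`."""
    root = Path(root)
    missing = []
    for entry in expected:
        path = root / entry
        found = path.is_dir() if entry.endswith("/") else path.is_file()
        if not found:
            missing.append(entry)
    return missing
```

An empty result from `missing_paths("..")`, when called from inside `notebooks/`, would indicate that the checked part of the tree is in place.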

In [27]:
# ✅ Final summary cell
print("✅ Environment setup complete.")
print("Next step → Run `01_data_acquisition_and_sampling.ipynb` to load or generate dataset samples.")


✅ Environment setup complete.
Next step → Run `01_data_acquisition_and_sampling.ipynb` to load or generate dataset samples.


In [28]:
"""---
**Notebook Runtime Info:**  
Generated: {{current timestamp}}  
Python version: {{sys.version}}  
Seed: 42  
All results reproducible with provided configs.

**Next:** Proceed to [01_data_acquisition_and_sampling.ipynb](01_data_acquisition_and_sampling.ipynb)
"""
