# üåê Phase-0.2: IoT-23 Dataset Understanding
## Quantum-RAG IDS ‚Äî Multi-Dataset Generalization Study

---

### üìã Phase-0.2 Objective

**Strict Read-Only Exploratory Data Analysis on the IoT-23 Dataset.**

This notebook observes, counts, measures, and documents. It makes **zero modifications** to the data.

| Rule | Status |
|------|--------|
| ‚ùå No feature dropping | Strict |
| ‚ùå No encoding or transformation | Strict |
| ‚ùå No model training | Strict |
| ‚úÖ Read-only structural analysis | Required |
| ‚úÖ Statistical observation only | Required |

### üîë IoT-23 Key Differences from TON-IoT

| Property | TON-IoT | IoT-23 |
|----------|---------|--------|
| **Format** | Standard CSV | Zeek/Bro TSV log files |
| **Column Headers** | Row 0 | `#fields` line (line 7) |
| **Metadata Lines** | None | 8 header lines starting with `#` |
| **Footer** | None | `#close` line at end of file |
| **Label Structure** | Single column | Dual: `label` + `detailed-label` |
| **Null Sentinel** | `"-"` placeholder | `"-"` placeholder (same) |
| **Structure** | Flat CSVs | 23 scenario subfolders with `bro/conn.log.labeled` |

---

## üì¶ Cell 1: Imports and Data Path Setup

In [26]:
# ============================================================
# CELL 1: Imports and Data Path Setup
# Phase 0 ‚Äî Read-Only. No transformations.
# ============================================================

import os
import warnings
from pathlib import Path

import numpy as np
import pandas as pd

warnings.filterwarnings("ignore")

# ‚îÄ‚îÄ Configure pandas display ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)
pd.set_option("display.width", None)
pd.set_option("display.max_colwidth", 60)
pd.set_option("display.float_format", "{:,.2f}".format)

# ‚îÄ‚îÄ Data path ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
# Resolve relative to this notebook's location
NOTEBOOK_DIR = Path.cwd()
BASE_DATA_PATH = NOTEBOOK_DIR.parent.parent / "main_folder" / "data" / "iot_23"

# Fallback: try common root structures
if not BASE_DATA_PATH.exists():
    BASE_DATA_PATH = NOTEBOOK_DIR.parent / "data" / "iot_23"
if not BASE_DATA_PATH.exists():
    BASE_DATA_PATH = NOTEBOOK_DIR / "data" / "iot_23"

# ‚îÄ‚îÄ Verify path ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("=" * 65)
print("üåê Phase-0.2: IoT-23 Dataset Understanding")
print("=" * 65)
print(f"\nüìÇ Notebook directory : {NOTEBOOK_DIR}")
print(f"üìÇ IoT-23 data path   : {BASE_DATA_PATH}")

if BASE_DATA_PATH.exists():
    scenario_dirs = [d for d in BASE_DATA_PATH.iterdir() if d.is_dir()]
    print(f"\n‚úÖ Path exists!")
    print(f"   Scenario folders found : {len(scenario_dirs)}")
    for d in sorted(scenario_dirs):
        print(f"     ‚Ä¢ {d.name}")
else:
    print(f"\n‚ùå Path does NOT exist: {BASE_DATA_PATH}")
    print("   Please update BASE_DATA_PATH to point to the iot_23/ folder.")

üåê Phase-0.2: IoT-23 Dataset Understanding

üìÇ Notebook directory : c:\Users\suhas\OneDrive\Desktop\Capstone\RAG-IDS-Knowledge-Augmented-IoT-Threat-Detection\main_folder\Phase_0
üìÇ IoT-23 data path   : c:\Users\suhas\OneDrive\Desktop\Capstone\RAG-IDS-Knowledge-Augmented-IoT-Threat-Detection\main_folder\data\iot_23

‚úÖ Path exists!
   Scenario folders found : 23
     ‚Ä¢ CTU-Honeypot-Capture-4-1
     ‚Ä¢ CTU-Honeypot-Capture-5-1
     ‚Ä¢ CTU-Honeypot-Capture-7-1
     ‚Ä¢ CTU-IoT-Malware-Capture-1-1
     ‚Ä¢ CTU-IoT-Malware-Capture-17-1
     ‚Ä¢ CTU-IoT-Malware-Capture-20-1
     ‚Ä¢ CTU-IoT-Malware-Capture-21-1
     ‚Ä¢ CTU-IoT-Malware-Capture-3-1
     ‚Ä¢ CTU-IoT-Malware-Capture-33-1
     ‚Ä¢ CTU-IoT-Malware-Capture-34-1
     ‚Ä¢ CTU-IoT-Malware-Capture-35-1
     ‚Ä¢ CTU-IoT-Malware-Capture-36-1
     ‚Ä¢ CTU-IoT-Malware-Capture-39-1
     ‚Ä¢ CTU-IoT-Malware-Capture-42-1
     ‚Ä¢ CTU-IoT-Malware-Capture-43-1
     ‚Ä¢ CTU-IoT-Malware-Capture-44-1
     ‚Ä¢ CTU-IoT-Malware-Capture-48

---

## üóÇÔ∏è Cell 2: Directory Traversal & File Inventory

**Goal:** Locate all `conn.log.labeled` files across the 23 scenario subfolders.

**Strategy:**
- Use `Path.rglob()` to recursively find all target files
- Peek at the first 10 lines using standard `open()` (no pandas) to extract the `#fields` column names
- Record file size in MB

**‚ö†Ô∏è Why peek instead of load?** Reading 10 lines costs ~microseconds. Loading entire files into RAM before knowing the schema would risk OOM crashes.

In [27]:
# ============================================================
# CELL 2: Directory Traversal & File Inventory
# Read-Only: open() peek only ‚Äî no pandas loading here
# ============================================================

def extract_columns_from_bro_header(filepath):
    """
    Read only the first 10 lines of a Zeek/Bro log file using
    standard Python open() to extract column names from the
    '#fields' metadata line.

    Bro log header structure (lines 0-7):
        #separator \\x09
        #set_separator ,
        #empty_field (empty)
        #unset_field -
        #path conn
        #open YYYY-MM-DD-HH-MM-SS
        #fields ts uid ...   <-- LINE 6 (index 6)
        #types  time string ...

    Returns:
        list[str] : column names (empty list if not found)
    """
    columns = []
    try:
        with open(filepath, "r", encoding="utf-8", errors="replace") as f:
            for i, line in enumerate(f):
                if i >= 10:
                    break
                if line.startswith("#fields"):
                    # Split by tab, strip the '#fields' token itself
                    parts = line.strip().split("\t")
                    columns = [c.strip() for c in parts[1:]]  # skip '#fields'
                    break
    except Exception as e:
        print(f"    ‚ö†Ô∏è  Could not read header from {filepath}: {e}")
    return columns


def build_file_inventory(base_path):
    """
    Recursively find all conn.log.labeled files under base_path.
    For each file:
      - Extract column names from the #fields header line
      - Compute file size in MB

    Returns:
        pd.DataFrame with columns:
            scenario_name, file_path, file_size_mb,
            num_columns, column_names
    """
    target_filename = "conn.log.labeled"
    records = []

    print(f"üîç Scanning: {base_path}")
    print(f"   Looking for: '{target_filename}'\n")

    found_files = sorted(base_path.rglob(target_filename))

    if not found_files:
        print("‚ùå No conn.log.labeled files found!")
        print("   Check that BASE_DATA_PATH is correct and files exist.")
        return pd.DataFrame()

    for fpath in found_files:
        scenario_name = fpath.parts[-3]          # e.g. CTU-Honeypot-Capture-4-1
        file_size_mb  = os.path.getsize(fpath) / (1024 ** 2)
        columns       = extract_columns_from_bro_header(fpath)

        records.append({
            "scenario_name" : scenario_name,
            "file_path"     : str(fpath),
            "file_size_mb"  : round(file_size_mb, 2),
            "num_columns"   : len(columns),
            "column_names"  : columns,
        })
        print(f"   ‚úÖ {scenario_name:<35s}  {file_size_mb:>8.1f} MB  |  {len(columns)} cols")

    return pd.DataFrame(records)


# ‚îÄ‚îÄ Run inventory ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("=" * 65)
print("üìÅ FILE INVENTORY")
print("=" * 65)

inventory_df = build_file_inventory(BASE_DATA_PATH)

print(f"\n{'='*65}")
print(f"‚úÖ Total files found  : {len(inventory_df)}")
if not inventory_df.empty:
    print(f"‚úÖ Total dataset size : {inventory_df['file_size_mb'].sum():.1f} MB  "
          f"({inventory_df['file_size_mb'].sum()/1024:.2f} GB)")
    print(f"‚úÖ Column count check : "
          f"{inventory_df['num_columns'].nunique()} unique schema(s) detected")
    if inventory_df["num_columns"].nunique() == 1:
        print(f"   ‚Üí All files share the same {inventory_df['num_columns'].iloc[0]}-column schema ‚úÖ")
    else:
        print("   ‚ö†Ô∏è  WARNING: files have different schemas ‚Äî investigate before loading!")
    print()
    display(inventory_df[["scenario_name", "file_size_mb", "num_columns"]].reset_index(drop=True))

    # ‚îÄ‚îÄ Print the column list from the first file ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    first_cols = inventory_df["column_names"].iloc[0]
    print(f"\nüìã Column names ({len(first_cols)} total):")
    for i, col in enumerate(first_cols, 1):
        print(f"   {i:>2}. {col}")

üìÅ FILE INVENTORY
üîç Scanning: c:\Users\suhas\OneDrive\Desktop\Capstone\RAG-IDS-Knowledge-Augmented-IoT-Threat-Detection\main_folder\data\iot_23
   Looking for: 'conn.log.labeled'

   ‚úÖ CTU-Honeypot-Capture-4-1                  0.1 MB  |  21 cols
   ‚úÖ CTU-Honeypot-Capture-5-1                  0.2 MB  |  21 cols
   ‚úÖ Somfy-01                                  0.0 MB  |  21 cols
   ‚úÖ CTU-IoT-Malware-Capture-1-1             141.4 MB  |  21 cols
   ‚úÖ CTU-IoT-Malware-Capture-17-1           7762.3 MB  |  21 cols
   ‚úÖ CTU-IoT-Malware-Capture-20-1              0.4 MB  |  21 cols
   ‚úÖ CTU-IoT-Malware-Capture-21-1              0.4 MB  |  21 cols
   ‚úÖ CTU-IoT-Malware-Capture-3-1              23.3 MB  |  21 cols
   ‚úÖ CTU-IoT-Malware-Capture-33-1           7503.4 MB  |  21 cols
   ‚úÖ CTU-IoT-Malware-Capture-34-1              2.9 MB  |  21 cols
   ‚úÖ CTU-IoT-Malware-Capture-35-1           1278.1 MB  |  21 cols
   ‚úÖ CTU-IoT-Malware-Capture-36-1           1705.0 MB  |  21 cols

Unnamed: 0,scenario_name,file_size_mb,num_columns
0,CTU-Honeypot-Capture-4-1,0.06,21
1,CTU-Honeypot-Capture-5-1,0.17,21
2,Somfy-01,0.02,21
3,CTU-IoT-Malware-Capture-1-1,141.44,21
4,CTU-IoT-Malware-Capture-17-1,7762.31,21
5,CTU-IoT-Malware-Capture-20-1,0.4,21
6,CTU-IoT-Malware-Capture-21-1,0.41,21
7,CTU-IoT-Malware-Capture-3-1,23.25,21
8,CTU-IoT-Malware-Capture-33-1,7503.43,21
9,CTU-IoT-Malware-Capture-34-1,2.88,21



üìã Column names (21 total):
    1. ts
    2. uid
    3. id.orig_h
    4. id.orig_p
    5. id.resp_h
    6. id.resp_p
    7. proto
    8. service
    9. duration
   10. orig_bytes
   11. resp_bytes
   12. conn_state
   13. local_orig
   14. local_resp
   15. missed_bytes
   16. history
   17. orig_pkts
   18. orig_ip_bytes
   19. resp_pkts
   20. resp_ip_bytes
   21. tunnel_parents   label   detailed-label


---

## üîß Cell 3: Robust TSV Loader Function

**Critical edge cases handled:**

| Edge Case | Handling Strategy |
|-----------|-------------------|
| 8 metadata header lines | `skiprows=8` in `pd.read_csv` |
| `#fields` column names | Passed explicitly as `names=columns` |
| `#close` footer line | Drop any row where first column starts with `#` |
| Memory crash (OOM) | `nrows=max_rows` caps rows per file |
| Encoding errors | `encoding_errors='replace'` |
| Malformed rows | `on_bad_lines='skip'` |

In [28]:
# ============================================================
# CELL 3: Robust TSV Loader Function
# Phase 0 ‚Äî Read-Only. No encoding, no dropping features.
# ============================================================

def load_conn_log(filepath, columns, max_rows=100_000):
    """
    Load a Zeek/Bro conn.log.labeled file safely into a DataFrame.

    Parameters
    ----------
    filepath : str | Path
        Path to the conn.log.labeled file.
    columns : list[str]
        Column names extracted from the #fields metadata header.
    max_rows : int
        Maximum rows to read per file (memory safety cap).
        Default: 100,000 rows.

    Returns
    -------
    pd.DataFrame
        Cleaned DataFrame with proper column names and a
        'source_scenario' column. Returns empty DataFrame on failure.

    Edge Cases Handled
    ------------------
    - skiprows=8  : skips the 8 Zeek metadata header lines
    - header=None : prevents pandas from treating row 8 as header
    - names=columns : applies the #fields column names explicitly
    - nrows=max_rows : caps memory usage
    - Drops rows where first column starts with '#' (catches #close footer)
    - encoding_errors='replace' : handles non-UTF8 characters
    - on_bad_lines='skip' : skips malformed/truncated rows
    """
    filepath = Path(filepath)
    scenario_name = filepath.parts[-3]

    print(f"  üìÇ Loading: {scenario_name:<38s}", end="", flush=True)

    try:
        df = pd.read_csv(
            filepath,
            sep="\t",
            skiprows=8,           # skip 8 Zeek metadata lines
            header=None,          # row 8 is DATA not a header
            names=columns,        # apply our extracted column names
            nrows=max_rows,       # OOM safety cap
            low_memory=False,
            encoding="utf-8",
            encoding_errors="replace",
            on_bad_lines="skip",  # skip malformed rows silently
        )

        # ‚îÄ‚îÄ Drop #close footer and any stray metadata rows ‚îÄ‚îÄ‚îÄ
        # The last line of a Bro file is:   #close YYYY-MM-DD...
        # If nrows didn't truncate before it, it appears as a row
        # where the first column value starts with '#'
        first_col = df.columns[0]
        before = len(df)
        df = df[~df[first_col].astype(str).str.startswith("#")].copy()
        dropped = before - len(df)

        # ‚îÄ‚îÄ Add provenance column ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
        df["source_scenario"] = scenario_name

        print(f"  {len(df):>7,} rows  (dropped {dropped} metadata rows)")
        return df

    except Exception as e:
        print(f"  ‚ùå FAILED ‚Äî {e}")
        return pd.DataFrame()


# ‚îÄ‚îÄ Quick sanity test on the first file ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("=" * 65)
print("üîß LOADER SANITY CHECK (first file only)")
print("=" * 65)

if not inventory_df.empty:
    first_row   = inventory_df.iloc[0]
    test_df     = load_conn_log(
        filepath  = first_row["file_path"],
        columns   = first_row["column_names"],
        max_rows  = 5_000,          # tiny sample for sanity check
    )
    print(f"\n   Shape        : {test_df.shape}")
    print(f"   Columns      : {list(test_df.columns)}")
    print(f"   dtype sample : {dict(list(test_df.dtypes.items())[:5])}")
    print("\nüìã First 3 rows:")
    display(test_df.head(3))
    del test_df   # free memory
else:
    print("‚ö†Ô∏è  Skipping sanity check ‚Äî inventory is empty.")

üîß LOADER SANITY CHECK (first file only)
  üìÇ Loading: CTU-Honeypot-Capture-4-1                    452 rows  (dropped 1 metadata rows)

   Shape        : (452, 22)
   Columns      : ['ts', 'uid', 'id.orig_h', 'id.orig_p', 'id.resp_h', 'id.resp_p', 'proto', 'service', 'duration', 'orig_bytes', 'resp_bytes', 'conn_state', 'local_orig', 'local_resp', 'missed_bytes', 'history', 'orig_pkts', 'orig_ip_bytes', 'resp_pkts', 'resp_ip_bytes', 'tunnel_parents   label   detailed-label', 'source_scenario']
   dtype sample : {'ts': dtype('O'), 'uid': dtype('O'), 'id.orig_h': dtype('O'), 'id.orig_p': dtype('float64'), 'id.resp_h': dtype('O')}

üìã First 3 rows:


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,proto,service,duration,orig_bytes,resp_bytes,conn_state,local_orig,local_resp,missed_bytes,history,orig_pkts,orig_ip_bytes,resp_pkts,resp_ip_bytes,tunnel_parents label detailed-label,source_scenario
0,1540469302.53864,CGm6jB4dXK71ZDWUDh,192.168.1.132,58687.0,216.239.35.4,123.0,udp,-,0.114184,48,48,SF,-,-,0.0,Dd,1.0,76.0,1.0,76.0,- benign -,CTU-Honeypot-Capture-4-1
1,1540469197.400159,CnaDAG3n5r8eiG4su2,192.168.1.132,1900.0,239.255.255.250,1900.0,udp,-,160.367579,7536,0,S0,-,-,0.0,D,24.0,8208.0,0.0,0.0,- benign -,CTU-Honeypot-Capture-4-1
2,1540469385.734089,CUrxU238nt0m6yTgKf,192.168.1.132,32893.0,216.239.35.8,123.0,udp,-,0.016986,48,48,SF,-,-,0.0,Dd,1.0,76.0,1.0,76.0,- benign -,CTU-Honeypot-Capture-4-1


---
## Cell 4: Load and Combine All Scenarios

We iterate over every row in `inventory_df` and call `load_conn_log()` with
`max_rows=100_000`. Results are collected as a list and concatenated once at the
end ‚Äî a single `pd.concat` is much more efficient than repeatedly growing a
DataFrame in a loop.

### Why 100k rows per file?
IoT-23 files vary wildly in size (some > 4 GB). Loading even 10 files fully
would exhaust typical laptop RAM. 100k rows per file gives us ~2.3M rows total
at the cost of ‚âà 1‚Äì2 GB ‚Äî manageable and still representative for Phase 0 EDA.

In [29]:
# ============================================================
# CELL 4: Load and Combine All Scenarios
# ============================================================

MAX_ROWS_PER_FILE = 100_000   # ‚Üê OOM safety cap (tunable)

print("=" * 65)
print(f"üöÄ Loading {len(inventory_df)} IoT-23 scenario files")
print(f"   (max {MAX_ROWS_PER_FILE:,} rows per file)")
print("=" * 65)

chunks = []

for i, row in inventory_df.iterrows():
    chunk = load_conn_log(
        filepath  = row["file_path"],
        columns   = row["column_names"],
        max_rows  = MAX_ROWS_PER_FILE,
    )
    if not chunk.empty:
        chunks.append(chunk)

# ‚îÄ‚îÄ Concatenate all chunks into one master DataFrame ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
if chunks:
    df = pd.concat(chunks, axis=0, ignore_index=True)
    del chunks   # free intermediate memory immediately
else:
    df = pd.DataFrame()
    print("‚ö†Ô∏è  No data loaded ‚Äî check IoT-23 base path and file structure.")

# ‚îÄ‚îÄ Summary ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\n" + "=" * 65)
print("üìä COMBINED DATASET SUMMARY")
print("=" * 65)
print(f"  Total rows          : {len(df):,}")
print(f"  Total columns       : {len(df.columns)}")
print(f"  Scenarios loaded    : {df['source_scenario'].nunique() if 'source_scenario' in df.columns else 'N/A'}")

mem_mb = df.memory_usage(deep=True).sum() / 1024 / 1024
print(f"  Memory usage        : {mem_mb:.1f} MB")

print("\nüìã Rows per scenario:")
if "source_scenario" in df.columns:
    scenario_counts = (
        df["source_scenario"]
        .value_counts()
        .rename_axis("Scenario")
        .reset_index(name="Row Count")
    )
    display(scenario_counts)

üöÄ Loading 23 IoT-23 scenario files
   (max 100,000 rows per file)
  üìÇ Loading: CTU-Honeypot-Capture-4-1                    452 rows  (dropped 1 metadata rows)
  üìÇ Loading: CTU-Honeypot-Capture-5-1                  1,374 rows  (dropped 1 metadata rows)
  üìÇ Loading: Somfy-01                                    130 rows  (dropped 1 metadata rows)
  üìÇ Loading: CTU-IoT-Malware-Capture-1-1             100,000 rows  (dropped 0 metadata rows)
  üìÇ Loading: CTU-IoT-Malware-Capture-17-1            100,000 rows  (dropped 0 metadata rows)
  üìÇ Loading: CTU-IoT-Malware-Capture-20-1              3,209 rows  (dropped 1 metadata rows)
  üìÇ Loading: CTU-IoT-Malware-Capture-21-1              3,286 rows  (dropped 1 metadata rows)
  üìÇ Loading: CTU-IoT-Malware-Capture-3-1             100,000 rows  (dropped 0 metadata rows)
  üìÇ Loading: CTU-IoT-Malware-Capture-33-1            100,000 rows  (dropped 0 metadata rows)
  üìÇ Loading: CTU-IoT-Malware-Capture-34-1             23,145 row

Unnamed: 0,Scenario,Row Count
0,CTU-IoT-Malware-Capture-1-1,100000
1,CTU-IoT-Malware-Capture-17-1,100000
2,CTU-IoT-Malware-Capture-43-1,100000
3,CTU-IoT-Malware-Capture-36-1,100000
4,CTU-IoT-Malware-Capture-35-1,100000
5,CTU-IoT-Malware-Capture-33-1,100000
6,CTU-IoT-Malware-Capture-3-1,100000
7,CTU-IoT-Malware-Capture-9-1,100000
8,CTU-IoT-Malware-Capture-7-1,100000
9,CTU-IoT-Malware-Capture-52-1,100000


---
## Cell 5: Master Column Inventory

For every column we record:

| Metric | Meaning |
|--------|---------|
| `dtype` | Pandas inferred type (object = mixed/string) |
| `non_null_count` | Rows with a non-NaN value |
| `null_count` | Rows that are NaN (true Python/pandas null) |
| `null_pct` | Percentage of true nulls |
| `unique_count` | Cardinality ‚Äî distinguishes IDs vs categoricals |

> **Note:** Zeek uses `-` (dash) as its "not applicable" sentinel ‚Äî this is
> **not** a pandas null. It will appear as `unique_count` variants.
> We analyse it explicitly in Cell 6.

In [30]:
# ============================================================
# NEW CELL: IoT-23 Feature Semantics (The Generalization Bridge)
# ============================================================

IOT23_FEATURE_SEMANTICS = {
    # TEMPORAL
    'ts': {'maps_to': 'ts', 'role': 'Contextual'},
    'duration': {'maps_to': 'duration', 'role': 'Behavioral'},
    
    # IDENTIFIERS
    'id.orig_h': {'maps_to': 'src_ip', 'role': 'Identifier'},
    'id.orig_p': {'maps_to': 'src_port', 'role': 'Contextual'},
    'id.resp_h': {'maps_to': 'dst_ip', 'role': 'Identifier'},
    'id.resp_p': {'maps_to': 'dst_port', 'role': 'Contextual'},
    'uid': {'maps_to': 'uid', 'role': 'Identifier'},
    
    # PROTOCOL & STATE
    'proto': {'maps_to': 'proto', 'role': 'Contextual'},
    'service': {'maps_to': 'service', 'role': 'Contextual'},
    'conn_state': {'maps_to': 'conn_state', 'role': 'Behavioral'},
    'history': {'maps_to': 'history', 'role': 'Behavioral'},
    
    # VOLUME & BEHAVIOR
    'orig_bytes': {'maps_to': 'src_bytes', 'role': 'Behavioral'},
    'resp_bytes': {'maps_to': 'dst_bytes', 'role': 'Behavioral'},
    'orig_pkts': {'maps_to': 'src_pkts', 'role': 'Behavioral'},
    'resp_pkts': {'maps_to': 'dst_pkts', 'role': 'Behavioral'},
    'orig_ip_bytes': {'maps_to': 'src_ip_bytes', 'role': 'Behavioral'},
    'resp_ip_bytes': {'maps_to': 'dst_ip_bytes', 'role': 'Behavioral'},
    
    # METADATA
    'local_orig': {'maps_to': 'local_orig', 'role': 'Metadata'},
    'local_resp': {'maps_to': 'local_resp', 'role': 'Metadata'},
    'missed_bytes': {'maps_to': 'missed_bytes', 'role': 'Metadata'},
    'tunnel_parents': {'maps_to': 'tunnel', 'role': 'Metadata'}
}

print("=" * 65)
print("üåâ SEMANTIC MAPPING TO TON-IoT (Generalization Prep)")
print("=" * 65)
mapping_df = pd.DataFrame.from_dict(IOT23_FEATURE_SEMANTICS, orient='index')
display(mapping_df)

üåâ SEMANTIC MAPPING TO TON-IoT (Generalization Prep)


Unnamed: 0,maps_to,role
ts,ts,Contextual
duration,duration,Behavioral
id.orig_h,src_ip,Identifier
id.orig_p,src_port,Contextual
id.resp_h,dst_ip,Identifier
id.resp_p,dst_port,Contextual
uid,uid,Identifier
proto,proto,Contextual
service,service,Contextual
conn_state,conn_state,Behavioral


In [31]:
# ============================================================
# CELL 5: Master Column Inventory
# ============================================================

if df.empty:
    print("‚ö†Ô∏è  df is empty ‚Äî skipping column inventory.")
else:
    total_rows = len(df)

    col_inventory = pd.DataFrame({
        "column":       df.columns,
        "dtype":        [str(df[c].dtype) for c in df.columns],
        "non_null_count": [int(df[c].notna().sum()) for c in df.columns],
        "null_count":   [int(df[c].isna().sum())   for c in df.columns],
        "null_pct":     [round(df[c].isna().mean() * 100, 2) for c in df.columns],
        "unique_count": [int(df[c].nunique(dropna=False)) for c in df.columns],
    })

    col_inventory = col_inventory.sort_values("null_pct", ascending=False).reset_index(drop=True)

    print("=" * 65)
    print("üìã MASTER COLUMN INVENTORY")
    print(f"   Total rows: {total_rows:,}  |  Total columns: {len(df.columns)}")
    print("=" * 65)
    display(col_inventory)

    # ‚îÄ‚îÄ Quick summary ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    all_null_cols  = col_inventory[col_inventory["null_pct"] == 100]["column"].tolist()
    high_null_cols = col_inventory[col_inventory["null_pct"] > 50]["column"].tolist()
    zero_null_cols = col_inventory[col_inventory["null_pct"] == 0]["column"].tolist()

    print(f"\n  Columns with 0% true nulls      : {len(zero_null_cols)}")
    print(f"  Columns with >50% true nulls    : {len(high_null_cols)}")
    print(f"  Columns with 100% true nulls    : {len(all_null_cols)}")
    if all_null_cols:
        print(f"    ‚Üí All-null columns: {all_null_cols}")

üìã MASTER COLUMN INVENTORY
   Total rows: 1,446,662  |  Total columns: 22


Unnamed: 0,column,dtype,non_null_count,null_count,null_pct,unique_count
0,ts,object,1446662,0,0.0,1446662
1,uid,object,1446662,0,0.0,1446662
2,id.orig_h,object,1446662,0,0.0,3297
3,id.orig_p,float64,1446662,0,0.0,62295
4,id.resp_h,object,1446662,0,0.0,1096204
5,id.resp_p,float64,1446662,0,0.0,29265
6,proto,object,1446662,0,0.0,3
7,service,object,1446662,0,0.0,7
8,duration,object,1446662,0,0.0,60004
9,orig_bytes,object,1446662,0,0.0,480



  Columns with 0% true nulls      : 22
  Columns with >50% true nulls    : 0
  Columns with 100% true nulls    : 0


---
## Section 0.3 ‚Äî Categorical Column Value Analysis

### Objectives
1. Separate columns into **categorical** (object dtype) vs **numerical** (int/float) groups
2. Extract all distinct values with frequency counts and percentages per categorical column
3. Identify all sentinel/placeholder types: `-`, `?`, `(empty)`, `F`, `T`, `null`
4. Determine semantic meaning ‚Äî these are **not** missing data, they carry information

> ‚ö†Ô∏è **Critical:** Zeek uses multiple sentinel patterns. Treating them as NaN would destroy
> semantics. Count them here; decide handling strategy in Phase 1.

In [43]:
# ============================================================
# SECTION 0.3: Categorical Column Value Analysis
# ============================================================

# ‚îÄ‚îÄ Separate columns by inferred dtype ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
categorical_columns = [c for c in df.columns if df[c].dtype == object]
numerical_columns   = [c for c in df.columns if df[c].dtype in ['int64','float64','int32','float32']]
other_columns       = [c for c in df.columns if c not in categorical_columns + numerical_columns]

print("=" * 65)
print("üìã COLUMN TYPE BREAKDOWN")
print("=" * 65)
print(f"  Categorical (object)  : {len(categorical_columns)}  ‚Üí {categorical_columns}")
print(f"  Numerical (int/float) : {len(numerical_columns)}  ‚Üí {numerical_columns}")
print(f"  Other                 : {len(other_columns)}  ‚Üí {other_columns}")

# ‚îÄ‚îÄ Per-column value analysis function ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
def analyze_categorical_column(df_in, col, max_display=30):
    """Return value counts with semantic interpretation."""
    col_data = df_in[col]
    total    = len(col_data)
    vc       = col_data.value_counts(dropna=False)
    rows     = []
    for val, cnt in vc.items():
        pct = cnt / total * 100
        v_str = str(val) if not pd.isna(val) else "<NaN>"
        if v_str == "-":
            interpretation = "PLACEHOLDER: Feature not applicable (Zeek sentinel)"
        elif v_str == "?":
            interpretation = "PLACEHOLDER: Unknown / could not be determined"
        elif v_str in ("", "None", "null", "N/A"):
            interpretation = "PLACEHOLDER: Empty / null string"
        elif v_str == "(empty)":
            interpretation = "PLACEHOLDER: Empty field (Zeek sentinel)"
        elif v_str == "F":
            interpretation = "BOOLEAN: False"
        elif v_str == "T":
            interpretation = "BOOLEAN: True"
        elif v_str == "<NaN>":
            interpretation = "NULL: pandas NaN (true missing)"
        else:
            interpretation = "DATA: Actual value"
        rows.append({"Value": v_str, "Count": int(cnt), "Pct": round(pct, 2), "Interpretation": interpretation})
    result = pd.DataFrame(rows)
    return result if len(result) <= max_display else result.head(max_display), len(vc)

# ‚îÄ‚îÄ Analyse all categorical columns ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\n" + "=" * 65)
print("üî§ CATEGORICAL COLUMN VALUE ANALYSIS")
print("=" * 65)

cat_analyses = {}
for col in categorical_columns:
    print(f"\n{'‚îÄ'*65}")
    print(f"üìä  COLUMN: {col}")
    print(f"{'‚îÄ'*65}")
    analysis_df, n_unique = analyze_categorical_column(df, col)
    cat_analyses[col] = {"df": analysis_df, "n_unique": n_unique}
    if n_unique > 30:
        print(f"   ‚ö†Ô∏è High cardinality: {n_unique} unique values (showing top 30)")
    display(analysis_df)

# ‚îÄ‚îÄ Comprehensive sentinel summary across all categoricals ‚îÄ‚îÄ
print("\n" + "=" * 65)
print("üìä COMPREHENSIVE SENTINEL SUMMARY ACROSS ALL CATEGORICAL COLUMNS")
print("    (counts of each placeholder/sentinel pattern per column)")
print("=" * 65)

sent_rows = []
for col in categorical_columns:
    ser  = df[col]
    s    = ser.astype(str)
    row  = {"Column": col, "Total": len(ser)}
    row["dash (-)"]  = int((s == "-").sum())
    row["quest (?)"] = int((s == "?").sum())
    row["empty"]     = int((s == "").sum())
    row["(empty)"]   = int((s == "(empty)").sum())
    row["F"]         = int((s == "F").sum())
    row["T"]         = int((s == "T").sum())
    row["NaN"]       = int(ser.isna().sum())
    sent_rows.append(row)

sentinel_df = pd.DataFrame(sent_rows)
sentinel_df["any_sentinel"] = sentinel_df[["dash (-)", "quest (?)", "empty", "(empty)", "F", "T", "NaN"]].sum(axis=1)
sentinel_df = sentinel_df.sort_values("any_sentinel", ascending=False).reset_index(drop=True)
display(sentinel_df)

print("\nüìù KEY INSIGHTS:")
print("   ‚Ä¢ dash '-'  = Zeek 'not applicable' (NOT missing ‚Äî carries semantic meaning)")
print("   ‚Ä¢ quest '?' = Zeek 'unknown' (NOT missing ‚Äî indicates uncertainty)")
print("   ‚Ä¢ F / T     = Zeek boolean flags (False / True)")
print("   ‚Ä¢ NaN       = true pandas null (rare in Zeek logs; possible parse artifact)")
print("   ‚Ä¢ (empty)   = Zeek explicit empty-field sentinel (distinct from dash and NaN)")
print("   ‚Ä¢ NEVER impute these without per-column domain reasoning in Phase 1!")

üìã COLUMN TYPE BREAKDOWN
  Categorical (object)  : 15  ‚Üí ['ts', 'uid', 'id.orig_h', 'id.resp_h', 'proto', 'service', 'duration', 'orig_bytes', 'resp_bytes', 'conn_state', 'local_orig', 'local_resp', 'history', 'tunnel_parents   label   detailed-label', 'source_scenario']
  Numerical (int/float) : 7  ‚Üí ['id.orig_p', 'id.resp_p', 'missed_bytes', 'orig_pkts', 'orig_ip_bytes', 'resp_pkts', 'resp_ip_bytes']
  Other                 : 0  ‚Üí []

üî§ CATEGORICAL COLUMN VALUE ANALYSIS

‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
üìä  COLUMN: ts
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
   ‚ö†Ô∏è High cardinality: 1446662 unique values (showing top 30)


Unnamed: 0,Value,Count,Pct,Interpretation
0,1532526102.004511,1,0.0,DATA: Actual value
1,1540469302.53864,1,0.0,DATA: Actual value
2,1540469197.400159,1,0.0,DATA: Actual value
3,1540469385.734089,1,0.0,DATA: Actual value
4,1540469831.302625,1,0.0,DATA: Actual value
5,1540469831.265405,1,0.0,DATA: Actual value
6,1532526102.003759,1,0.0,DATA: Actual value
7,1532526102.003758,1,0.0,DATA: Actual value
8,1532526102.003754,1,0.0,DATA: Actual value
9,1532526102.003512,1,0.0,DATA: Actual value



‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
üìä  COLUMN: uid
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
   ‚ö†Ô∏è High cardinality: 1446662 unique values (showing top 30)


Unnamed: 0,Value,Count,Pct,Interpretation
0,CC83RoUd9RLFuTL81,1,0.0,DATA: Actual value
1,CGm6jB4dXK71ZDWUDh,1,0.0,DATA: Actual value
2,CnaDAG3n5r8eiG4su2,1,0.0,DATA: Actual value
3,CUrxU238nt0m6yTgKf,1,0.0,DATA: Actual value
4,CGQf8t1kjdxB5PHXL4,1,0.0,DATA: Actual value
5,CUo9DH2QDnCaBIGjkg,1,0.0,DATA: Actual value
6,C4zhdf1Z9vWKrJdxW8,1,0.0,DATA: Actual value
7,C2rOxT16GHf55b3qJ8,1,0.0,DATA: Actual value
8,C4KPubzbZiw1vmZRe,1,0.0,DATA: Actual value
9,CybRGc3uj46VtUsqSb,1,0.0,DATA: Actual value



‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
üìä  COLUMN: id.orig_h
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
   ‚ö†Ô∏è High cardinality: 3297 unique values (showing top 30)


Unnamed: 0,Value,Count,Pct,Interpretation
0,192.168.1.197,204398,14.13,DATA: Actual value
1,192.168.1.198,199989,13.82,DATA: Actual value
2,192.168.100.111,199935,13.82,DATA: Actual value
3,192.168.1.195,123110,8.51,DATA: Actual value
4,192.168.100.103,100651,6.96,DATA: Actual value
5,192.168.1.194,100000,6.91,DATA: Actual value
6,192.168.1.196,99982,6.91,DATA: Actual value
7,192.168.1.193,99974,6.91,DATA: Actual value
8,192.168.100.108,99925,6.91,DATA: Actual value
9,192.168.1.200,99916,6.91,DATA: Actual value



‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
üìä  COLUMN: id.resp_h
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
   ‚ö†Ô∏è High cardinality: 1096204 unique values (showing top 30)


Unnamed: 0,Value,Count,Pct,Interpretation
0,162.248.88.215,99401,6.87,DATA: Actual value
1,200.168.87.203,14665,1.01,DATA: Actual value
2,123.59.209.185,14260,0.99,DATA: Actual value
3,185.244.25.235,6771,0.47,DATA: Actual value
4,178.128.185.250,4112,0.28,DATA: Actual value
5,128.185.250.50,4110,0.28,DATA: Actual value
6,147.231.100.5,3034,0.21,DATA: Actual value
7,192.168.100.1,2564,0.18,DATA: Actual value
8,192.168.100.103,2558,0.18,DATA: Actual value
9,192.168.1.1,2235,0.15,DATA: Actual value



‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
üìä  COLUMN: proto
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ


Unnamed: 0,Value,Count,Pct,Interpretation
0,tcp,1388045,95.95,DATA: Actual value
1,udp,54831,3.79,DATA: Actual value
2,icmp,3786,0.26,DATA: Actual value



‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
üìä  COLUMN: service
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ


Unnamed: 0,Value,Count,Pct,Interpretation
0,-,1434765,99.18,PLACEHOLDER: Feature not applicable (Zeek sentinel)
1,dns,5447,0.38,DATA: Actual value
2,ssh,3794,0.26,DATA: Actual value
3,irc,1652,0.11,DATA: Actual value
4,http,728,0.05,DATA: Actual value
5,dhcp,181,0.01,DATA: Actual value
6,ssl,95,0.01,DATA: Actual value



‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
üìä  COLUMN: duration
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
   ‚ö†Ô∏è High cardinality: 60004 unique values (showing top 30)


Unnamed: 0,Value,Count,Pct,Interpretation
0,-,900408,62.24,PLACEHOLDER: Feature not applicable (Zeek sentinel)
1,2e-06,166973,11.54,DATA: Actual value
2,5e-06,68344,4.72,DATA: Actual value
3,6e-06,37533,2.59,DATA: Actual value
4,1e-06,26748,1.85,DATA: Actual value
5,0.000255,9757,0.67,DATA: Actual value
6,4e-06,6040,0.42,DATA: Actual value
7,9e-06,4042,0.28,DATA: Actual value
8,0.000254,3444,0.24,DATA: Actual value
9,3e-06,2849,0.2,DATA: Actual value



‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
üìä  COLUMN: orig_bytes
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
   ‚ö†Ô∏è High cardinality: 480 unique values (showing top 30)


Unnamed: 0,Value,Count,Pct,Interpretation
0,-,900408,62.24,PLACEHOLDER: Feature not applicable (Zeek sentinel)
1,0.0,399789,27.64,DATA: Actual value
2,0,123625,8.55,DATA: Actual value
3,48,7164,0.5,DATA: Actual value
4,589,3756,0.26,DATA: Actual value
5,29,1463,0.1,DATA: Actual value
6,45,1352,0.09,DATA: Actual value
7,78,1182,0.08,DATA: Actual value
8,75,947,0.07,DATA: Actual value
9,67,850,0.06,DATA: Actual value



‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
üìä  COLUMN: resp_bytes
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
   ‚ö†Ô∏è High cardinality: 598 unique values (showing top 30)


Unnamed: 0,Value,Count,Pct,Interpretation
0,-,900408,62.24,PLACEHOLDER: Feature not applicable (Zeek sentinel)
1,0.0,400009,27.65,DATA: Actual value
2,0,126235,8.73,DATA: Actual value
3,48,7166,0.5,DATA: Actual value
4,45,2808,0.19,DATA: Actual value
5,1801,1057,0.07,DATA: Actual value
6,243,896,0.06,DATA: Actual value
7,233,805,0.06,DATA: Actual value
8,269,565,0.04,DATA: Actual value
9,96,299,0.02,DATA: Actual value



‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
üìä  COLUMN: conn_state
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ


Unnamed: 0,Value,Count,Pct,Interpretation
0,S0,1305757,90.26,DATA: Actual value
1,OTH,117521,8.12,DATA: Actual value
2,SF,17178,1.19,DATA: Actual value
3,REJ,2718,0.19,DATA: Actual value
4,S3,2484,0.17,DATA: Actual value
5,RSTR,607,0.04,DATA: Actual value
6,RSTO,155,0.01,DATA: Actual value
7,SH,109,0.01,DATA: Actual value
8,S1,53,0.0,DATA: Actual value
9,RSTOS0,30,0.0,DATA: Actual value



‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
üìä  COLUMN: local_orig
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ


Unnamed: 0,Value,Count,Pct,Interpretation
0,-,1446662,100.0,PLACEHOLDER: Feature not applicable (Zeek sentinel)



‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
üìä  COLUMN: local_resp
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ


Unnamed: 0,Value,Count,Pct,Interpretation
0,-,1446662,100.0,PLACEHOLDER: Feature not applicable (Zeek sentinel)



‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
üìä  COLUMN: history
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
   ‚ö†Ô∏è High cardinality: 171 unique values (showing top 30)


Unnamed: 0,Value,Count,Pct,Interpretation
0,S,1262553,87.27,DATA: Actual value
1,C,113653,7.86,DATA: Actual value
2,D,43198,2.99,DATA: Actual value
3,Dd,11621,0.8,DATA: Actual value
4,ShAdDaFf,3796,0.26,DATA: Actual value
5,-,3786,0.26,PLACEHOLDER: Feature not applicable (Zeek sentinel)
6,Sr,2716,0.19,DATA: Actual value
7,ShAdDaf,2263,0.16,DATA: Actual value
8,ShAdDafF,437,0.03,DATA: Actual value
9,ShADadfF,247,0.02,DATA: Actual value



‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
üìä  COLUMN: tunnel_parents   label   detailed-label
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ


Unnamed: 0,Value,Count,Pct,Interpretation
0,- Malicious PartOfAHorizontalPortScan,578907,40.02,DATA: Actual value
1,(empty) Malicious PartOfAHorizontalPortScan,247024,17.08,DATA: Actual value
2,- Malicious Okiru,163015,11.27,DATA: Actual value
3,- Benign -,146292,10.11,DATA: Actual value
4,- Malicious DDoS,138775,9.59,DATA: Actual value
5,(empty) Malicious Okiru,99672,6.89,DATA: Actual value
6,(empty) Benign -,51542,3.56,DATA: Actual value
7,(empty) Malicious C&C,8229,0.57,DATA: Actual value
8,- Malicious C&C,6878,0.48,DATA: Actual value
9,(empty) Malicious Attack,3814,0.26,DATA: Actual value



‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
üìä  COLUMN: source_scenario
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ


Unnamed: 0,Value,Count,Pct,Interpretation
0,CTU-IoT-Malware-Capture-1-1,100000,6.91,DATA: Actual value
1,CTU-IoT-Malware-Capture-17-1,100000,6.91,DATA: Actual value
2,CTU-IoT-Malware-Capture-43-1,100000,6.91,DATA: Actual value
3,CTU-IoT-Malware-Capture-36-1,100000,6.91,DATA: Actual value
4,CTU-IoT-Malware-Capture-35-1,100000,6.91,DATA: Actual value
5,CTU-IoT-Malware-Capture-33-1,100000,6.91,DATA: Actual value
6,CTU-IoT-Malware-Capture-3-1,100000,6.91,DATA: Actual value
7,CTU-IoT-Malware-Capture-9-1,100000,6.91,DATA: Actual value
8,CTU-IoT-Malware-Capture-7-1,100000,6.91,DATA: Actual value
9,CTU-IoT-Malware-Capture-52-1,100000,6.91,DATA: Actual value



üìä COMPREHENSIVE SENTINEL SUMMARY ACROSS ALL CATEGORICAL COLUMNS
    (counts of each placeholder/sentinel pattern per column)


Unnamed: 0,Column,Total,dash (-),quest (?),empty,(empty),F,T,NaN,any_sentinel
0,local_orig,1446662,1446662,0,0,0,0,0,0,1446662
1,local_resp,1446662,1446662,0,0,0,0,0,0,1446662
2,service,1446662,1434765,0,0,0,0,0,0,1434765
3,orig_bytes,1446662,900408,0,0,0,0,0,0,900408
4,resp_bytes,1446662,900408,0,0,0,0,0,0,900408
5,duration,1446662,900408,0,0,0,0,0,0,900408
6,history,1446662,3786,0,0,0,109,0,0,3895
7,id.resp_h,1446662,0,0,0,0,0,0,0,0
8,proto,1446662,0,0,0,0,0,0,0,0
9,uid,1446662,0,0,0,0,0,0,0,0



üìù KEY INSIGHTS:
   ‚Ä¢ dash '-'  = Zeek 'not applicable' (NOT missing ‚Äî carries semantic meaning)
   ‚Ä¢ quest '?' = Zeek 'unknown' (NOT missing ‚Äî indicates uncertainty)
   ‚Ä¢ F / T     = Zeek boolean flags (False / True)
   ‚Ä¢ NaN       = true pandas null (rare in Zeek logs; possible parse artifact)
   ‚Ä¢ (empty)   = Zeek explicit empty-field sentinel (distinct from dash and NaN)
   ‚Ä¢ NEVER impute these without per-column domain reasoning in Phase 1!


---
## Section 0.4 ‚Äî Numerical Column Semantics

### Objectives
1. Compute standard statistics: min, max, mean, median, std, zero%, skewness
2. Explain what each Zeek field measures in **network behavior context**
3. Determine if zero values are valid (no response) vs anomalies
4. Identify high-skew features that will need log-transform in Phase 1

> Most IoT-23 numeric Zeek fields arrive as **`object` dtype** (strings) because
> all columns come from a TSV with no type hints. This section will cast
> numeric-looking columns for the statistics pass only ‚Äî **no permanent changes**.

In [33]:
# ============================================================
# SECTION 0.4: Numerical Column Statistics
# ============================================================

# Zeek/Bro fields that should be numeric in IoT-23 (all arrive as object/string)
ZEEK_NUMERIC_FIELDS = [
    "ts", "duration",
    "orig_bytes", "resp_bytes",
    "orig_pkts",  "resp_pkts",
    "orig_ip_bytes", "resp_ip_bytes",
    "missed_bytes",
    "id.orig_p", "id.resp_p",   # ports
]

# ‚îÄ‚îÄ Temporary numeric cast (read-only analysis only) ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
num_cast = {}
for col in ZEEK_NUMERIC_FIELDS:
    if col in df.columns:
        # Replace Zeek sentinels with NaN for statistics only
        num_cast[col] = pd.to_numeric(
            df[col].replace(["-", "?", "", "T", "F"], pd.NA),
            errors="coerce"
        )

# Also include any columns already inferred as numeric
for col in numerical_columns:
    if col not in num_cast:
        num_cast[col] = df[col]

print("=" * 65)
print("üî¢ NUMERICAL COLUMN STATISTICS (temporary cast for EDA only)")
print("=" * 65)

stats_rows = []
for col, series in num_cast.items():
    s = series.dropna()
    if len(s) == 0:
        continue
    stats_rows.append({
        "Column":       col,
        "Non-null":     len(s),
        "Min":          round(s.min(), 4),
        "Max":          round(s.max(), 4),
        "Mean":         round(s.mean(), 4),
        "Median":       round(s.median(), 4),
        "Std Dev":      round(s.std(), 4),
        "Zero count":   int((s == 0).sum()),
        "Zero %":       f"{(s==0).sum()/len(s)*100:.1f}%",
        "Negative cnt": int((s < 0).sum()),
        "Skewness":     round(s.skew(), 2),
    })

num_stats_df = pd.DataFrame(stats_rows)
display(num_stats_df)

# ‚îÄ‚îÄ Zeek-specific semantic annotations ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
ZEEK_FIELD_SEMANTICS = {
    "ts": {
        "meaning":      "Unix timestamp (float): connection start time",
        "behavior":     "Temporal context ‚Äî when the flow occurred",
        "zero_meaning": "Invalid; timestamps should never be 0",
        "extreme_note": "Large values are normal (seconds since epoch 1970)",
        "relevance":    "Context ‚Äî useful for temporal train/test splitting",
    },
    "duration": {
        "meaning":      "Connection duration in seconds",
        "behavior":     "Short = scan/SYN probe; long = data transfer or C2 persistence",
        "zero_meaning": "Single packet or instant teardown (e.g., ICMP, UDP)",
        "extreme_note": "Multi-hour connections ‚Üí legitimate (streaming) or C2 keep-alive",
        "relevance":    "VERY HIGH ‚Äî key discriminator between scan and data flows",
    },
    "orig_bytes": {
        "meaning":      "Bytes sent by the originator (client ‚Üí server)",
        "behavior":     "Upload volume ‚Äî data exfil potential, payload size",
        "zero_meaning": "SYN with no payload (connection probe / scan)",
        "extreme_note": "Large values ‚Üí file upload or exfiltration",
        "relevance":    "VERY HIGH ‚Äî primary behavioral fingerprint",
    },
    "resp_bytes": {
        "meaning":      "Bytes sent by the responder (server ‚Üí client)",
        "behavior":     "Download / response volume",
        "zero_meaning": "No data returned: blocked, rejected, one-way scan",
        "extreme_note": "Large values ‚Üí file download, streaming, DDos amplification",
        "relevance":    "VERY HIGH ‚Äî asymmetry ratio orig_bytes/resp_bytes is informative",
    },
    "orig_pkts": {
        "meaning":      "Packet count from originator",
        "behavior":     "Packet flood potential; scanning pattern (many pkts, 0 bytes)",
        "zero_meaning": "Should not be 0 ‚Äî would mean no packets sent",
        "extreme_note": "Very high ‚Üí DoS / flood",
        "relevance":    "HIGH ‚Äî volume metric for flooding detection",
    },
    "resp_pkts": {
        "meaning":      "Packet count from responder",
        "behavior":     "Response volume; asymmetry with orig_pkts reveals reflective/amplification attacks",
        "zero_meaning": "No response received ‚Äî unidirectional flow, dropped packets",
        "extreme_note": "resp_pkts >> orig_pkts ‚Üí amplification attack (DNS, SSDP, NTP)",
        "relevance":    "VERY HIGH ‚Äî amplification attack detection",
    },
    "orig_ip_bytes": {
        "meaning":      "IP-layer bytes from originator (includes headers)",
        "behavior":     "Gross network usage from sender; slightly larger than orig_bytes",
        "zero_meaning": "Uncommon; would imply no IP traffic",
        "extreme_note": "Correlated with orig_bytes but useful for header-overhead analysis",
        "relevance":    "HIGH ‚Äî redundant with orig_bytes but captures overhead",
    },
    "resp_ip_bytes": {
        "meaning":      "IP-layer bytes from responder",
        "behavior":     "Gross network usage from receiver side",
        "zero_meaning": "No IP response",
        "extreme_note": "DNS amplification: resp_ip_bytes >> orig_ip_bytes",
        "relevance":    "HIGH ‚Äî useful for amplification ratio calculation",
    },
    "missed_bytes": {
        "meaning":      "Bytes Zeek missed due to capture loss",
        "behavior":     "Capture quality indicator; not a traffic behavioral feature",
        "zero_meaning": "Perfect capture ‚Äî no gaps",
        "extreme_note": "High values ‚Üí unreliable flow metrics for that row",
        "relevance":    "LOW behavioral / HIGH data quality signal",
    },
    "id.orig_p": {
        "meaning":      "Source port number (originator)",
        "behavior":     "Usually ephemeral (1024‚Äì65535); low values = server-side",
        "zero_meaning": "ICMP or other portless protocol (no port concept)",
        "extreme_note": "Well-known port as source ‚Üí reverse shell or misuse",
        "relevance":    "MEDIUM ‚Äî service context; ephemeral port patterns signal scanning",
    },
    "id.resp_p": {
        "meaning":      "Destination port number (responder = target service)",
        "behavior":     "Service identifier: 80=HTTP, 443=HTTPS, 22=SSH, 53=DNS, etc.",
        "zero_meaning": "ICMP / portless protocols",
        "extreme_note": "Targeting unusual high ports ‚Üí lateral movement or custom backdoors",
        "relevance":    "HIGH ‚Äî service identification is critical for attack classification",
    },
}

print("\n" + "=" * 65)
print("üî¨ DETAILED NUMERICAL COLUMN SEMANTICS (IoT-23 / Zeek)")
print("=" * 65)

for col, series in num_cast.items():
    s = series.dropna()
    print(f"\n{'‚îÄ'*65}")
    print(f"üìä  {col}  (dtype in df: {df[col].dtype if col in df.columns else 'derived'})")
    print(f"{'‚îÄ'*65}")
    if len(s) > 0:
        print(f"   Non-null : {len(s):,}  |  Min: {s.min()}  |  Max: {s.max()}")
        print(f"   Mean     : {s.mean():.4f}  |  Median: {s.median():.4f}  |  Std: {s.std():.4f}")
        zero_pct = (s == 0).sum() / len(s) * 100
        print(f"   Zero     : {(s==0).sum():,} ({zero_pct:.1f}%)  |  Negative: {(s<0).sum():,}")
    else:
        print("   [all values were sentinels / NaN after cast]")

    if col in ZEEK_FIELD_SEMANTICS:
        sem = ZEEK_FIELD_SEMANTICS[col]
        print(f"\n   üìù Meaning       : {sem['meaning']}")
        print(f"      Behavior      : {sem['behavior']}")
        print(f"      Zero means    : {sem['zero_meaning']}")
        print(f"      Extreme note  : {sem['extreme_note']}")
        print(f"      Relevance     : {sem['relevance']}")
    else:
        print("   üìù Semantic note: Pattern-based inference only ‚Äî verify domain meaning")

print(f"\n‚úÖ {len(num_cast)} numeric fields analysed (analysis only ‚Äî df unchanged)")

üî¢ NUMERICAL COLUMN STATISTICS (temporary cast for EDA only)


Unnamed: 0,Column,Non-null,Min,Max,Mean,Median,Std Dev,Zero count,Zero %,Negative cnt,Skewness
0,ts,1446662,1525879831.02,1569018214.47,1543353007.54,1545403478.76,11130125.37,0,0.0%,0,0.29
1,duration,546254,0.0,78840.33,2.27,0.0,202.37,0,0.0%,0,274.52
2,orig_bytes,546254,0.0,1744830458.0,7223.02,0.0,2788636.18,523414,95.8%,0,531.45
3,resp_bytes,546254,0.0,336516351.0,686.33,0.0,455624.02,526244,96.3%,0,737.6
4,orig_pkts,1446662,0.0,66027354.0,177.92,1.0,72254.58,113721,7.9%,0,658.02
5,resp_pkts,1446662,0.0,239484.0,0.31,0.0,199.84,1423309,98.4%,0,1189.84
6,orig_ip_bytes,1446662,0.0,1914793266.0,7746.12,40.0,2847457.65,113721,7.9%,0,515.04
7,resp_ip_bytes,1446662,0.0,349618679.0,277.92,0.0,290880.57,1423309,98.4%,0,1200.3
8,missed_bytes,1446662,0.0,20272.0,0.22,0.0,44.33,1446598,100.0%,0,273.18
9,id.orig_p,1446662,0.0,65535.0,37108.82,38114.0,17320.1,35,0.0%,0,-0.32



üî¨ DETAILED NUMERICAL COLUMN SEMANTICS (IoT-23 / Zeek)

‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
üìä  ts  (dtype in df: object)
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
   Non-null : 1,446,662  |  Min: 1525879831.015073  |  Max: 1569018214.466954
   Mean     : 1543353007.5429  |  Median: 1545403478.7612  |  Std: 11130125.3675
   Zero     : 0 (0.0%)  |  Negative: 0

   üìù Meaning       : Unix timestamp (float): connection start time
      Behavior      : Temporal context ‚Äî when the flow occurred
      Zero means    : Invalid; timestamps should never be 0
      Extreme note  : Large values are normal (seconds since epoch 1970)
      Relevance     : Context ‚Äî useful for temp

---
## Section 0.5 ‚Äî Subjective Feature Meaning (Full IoT-23 Dictionary)

For **every column** in the IoT-23 `conn.log.labeled`, this section documents:

| Field | Meaning |
|-------|---------|
| **Description** | What the column records |
| **Protocol/Context** | Which protocol(s) populate it |
| **Populated when** | Conditions that fill the field |
| **Empty/Dash when** | Conditions that trigger the `-` sentinel |
| **Captures** | Category: Behavior / Identity / Context / Label / Metadata |
| **Behavioral Relevance** | Importance for IDS model features |

Zeek column naming differs from TON-IoT: `orig_bytes` ‚Üî `src_bytes`, `id.orig_h` ‚Üî `src_ip`, etc.
This dictionary is the **cross-dataset alignment bridge** needed in Phase 1.

In [34]:
# ============================================================
# SECTION 0.5: Comprehensive IoT-23 Feature Meaning Dictionary
# ============================================================

IOT23_FEATURE_MEANINGS = {
    # ‚îÄ‚îÄ TEMPORAL ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    "ts": {
        "description": "Timestamp of the first packet of the connection (Unix epoch float)",
        "protocol": "All protocols",
        "populated_when": "Always ‚Äî every Zeek connection gets a start timestamp",
        "empty_when": "Never (Zeek always records ts)",
        "captures": "Context",
        "behavioral_relevance": "MEDIUM ‚Äî useful for temporal splits and time-of-day features",
        "ton_iot_equiv": "ts",
    },
    "duration": {
        "description": "Total duration of the network connection in seconds",
        "protocol": "All protocols",
        "populated_when": "Connection has defined start and end (Zeek saw both SYN and FIN/RST or timeout)",
        "empty_when": "'-' for single packets, ICMP, or flows terminated by capture end",
        "captures": "Behavior",
        "behavioral_relevance": "VERY HIGH ‚Äî short = scan/probe; long = data transfer, C2 keep-alive",
        "ton_iot_equiv": "duration",
    },
    # ‚îÄ‚îÄ IDENTIFIERS ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    "uid": {
        "description": "Zeek unique connection ID (random base62 string ‚Äî e.g., CqO5kl1SJ3fdmkM94)",
        "protocol": "All protocols",
        "populated_when": "Always ‚Äî Zeek assigns a UID to every connection",
        "empty_when": "Never",
        "captures": "Identity",
        "behavioral_relevance": "NONE ‚Äî just an internal key; not a behavioral feature",
        "ton_iot_equiv": "N/A",
    },
    "id.orig_h": {
        "description": "Source IP address (originator host)",
        "protocol": "All IP-based protocols",
        "populated_when": "Always",
        "empty_when": "Never",
        "captures": "Identity",
        "behavioral_relevance": "LOW for generalization ‚Äî network-specific; causes topology overfitting",
        "ton_iot_equiv": "src_ip",
    },
    "id.orig_p": {
        "description": "Source port number (originator)",
        "protocol": "TCP, UDP",
        "populated_when": "TCP/UDP connections",
        "empty_when": "0 for ICMP or portless protocols",
        "captures": "Context",
        "behavioral_relevance": "MEDIUM ‚Äî ephemeral port patterns; low ports as source = suspicious",
        "ton_iot_equiv": "src_port",
    },
    "id.resp_h": {
        "description": "Destination IP address (responder host)",
        "protocol": "All IP-based protocols",
        "populated_when": "Always",
        "empty_when": "Never",
        "captures": "Identity",
        "behavioral_relevance": "LOW ‚Äî same topology-overfitting risk as id.orig_h",
        "ton_iot_equiv": "dst_ip",
    },
    "id.resp_p": {
        "description": "Destination port number (service being targeted)",
        "protocol": "TCP, UDP",
        "populated_when": "TCP/UDP connections",
        "empty_when": "0 for ICMP",
        "captures": "Context",
        "behavioral_relevance": "HIGH ‚Äî service identifier: 80=HTTP, 443=HTTPS, 22=SSH, 53=DNS, 8883=MQTT",
        "ton_iot_equiv": "dst_port",
    },
    # ‚îÄ‚îÄ PROTOCOL / STATE ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    "proto": {
        "description": "Transport layer protocol (tcp, udp, icmp, icmp6)",
        "protocol": "Meta ‚Äî identifies the protocol",
        "populated_when": "Always",
        "empty_when": "Never",
        "captures": "Context",
        "behavioral_relevance": "HIGH ‚Äî different protocols = different attack vectors",
        "ton_iot_equiv": "proto",
    },
    "service": {
        "description": "Application-layer service detected (http, dns, ssl, smtp, ssh, etc.)",
        "protocol": "Application layer (L7)",
        "populated_when": "Zeek successfully identifies the application protocol",
        "empty_when": "'-' when service is unknown, encrypted without SNI, or not recognized",
        "captures": "Context",
        "behavioral_relevance": "HIGH ‚Äî '-' itself is informative (encrypted/evasive traffic)",
        "ton_iot_equiv": "service",
    },
    "conn_state": {
        "description": "Zeek TCP connection state code indicating handshake/teardown outcome",
        "protocol": "TCP primarily; abbreviated codes for UDP/ICMP",
        "populated_when": "Always ‚Äî Zeek assigns a state to every connection",
        "empty_when": "Never (Zeek always records conn_state)",
        "captures": "Behavior",
        "behavioral_relevance": "VERY HIGH ‚Äî state codes (S0, REJ, RSTOS0, OTH) are attack signatures",
        "ton_iot_equiv": "conn_state",
    },
    "history": {
        "description": "TCP flag history string (sequence of events: S=SYN, A=ACK, D=data, F=FIN, R=RST, ‚Ä¶)",
        "protocol": "TCP",
        "populated_when": "TCP connections with activity",
        "empty_when": "'-' for UDP, ICMP, or no TCP activity logged",
        "captures": "Behavior",
        "behavioral_relevance": "VERY HIGH ‚Äî encodes entire handshake pattern; e.g., 'S' alone = SYN scan",
        "ton_iot_equiv": "history",
    },
    # ‚îÄ‚îÄ VOLUME ‚Äî BYTES ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    "orig_bytes": {
        "description": "Payload bytes sent from originator (client) to responder (server)",
        "protocol": "All protocols",
        "populated_when": "Originator transmitted payload data",
        "empty_when": "'-' for SYN-only probes, blocks, or RST without payload",
        "captures": "Behavior",
        "behavioral_relevance": "VERY HIGH ‚Äî exfiltration detection; 0/'-' = scan; large = data transfer",
        "ton_iot_equiv": "src_bytes",
    },
    "resp_bytes": {
        "description": "Payload bytes sent from responder (server) to originator (client)",
        "protocol": "All protocols",
        "populated_when": "Responder sent back payload data",
        "empty_when": "'-' for unanswered or rejected connections",
        "captures": "Behavior",
        "behavioral_relevance": "VERY HIGH ‚Äî asymmetry ratio orig_bytes/resp_bytes reveals attack type",
        "ton_iot_equiv": "dst_bytes",
    },
    "orig_ip_bytes": {
        "description": "Total IP-layer bytes from originator (payload + headers)",
        "protocol": "All IP protocols",
        "populated_when": "Always when originator sent packets",
        "empty_when": "Rarely '-'",
        "captures": "Behavior",
        "behavioral_relevance": "HIGH ‚Äî includes overhead; useful for header-overhead fingerprinting",
        "ton_iot_equiv": "N/A (not in TON-IoT)",
    },
    "resp_ip_bytes": {
        "description": "Total IP-layer bytes from responder (payload + headers)",
        "protocol": "All IP protocols",
        "populated_when": "Always when responder sent packets",
        "empty_when": "0 or '-' if no response",
        "captures": "Behavior",
        "behavioral_relevance": "HIGH ‚Äî resp_ip_bytes / orig_ip_bytes ratio detects amplification attacks",
        "ton_iot_equiv": "N/A (not in TON-IoT)",
    },
    # ‚îÄ‚îÄ VOLUME ‚Äî PACKETS ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    "orig_pkts": {
        "description": "Number of packets sent by originator",
        "protocol": "All protocols",
        "populated_when": "Always (at least 1 for any logged connection)",
        "empty_when": "Should not be 0 or '-'",
        "captures": "Behavior",
        "behavioral_relevance": "HIGH ‚Äî packet flood detection; many pkts + 0 bytes = SYN scan",
        "ton_iot_equiv": "src_pkts",
    },
    "resp_pkts": {
        "description": "Number of packets sent by responder",
        "protocol": "All protocols",
        "populated_when": "Responder replied with packets",
        "empty_when": "0 for unanswered flows; '-' for some protocols",
        "captures": "Behavior",
        "behavioral_relevance": "HIGH ‚Äî resp_pkts=0 = port close or firewall block",
        "ton_iot_equiv": "dst_pkts",
    },
    # ‚îÄ‚îÄ METADATA ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    "missed_bytes": {
        "description": "Bytes Zeek missed due to packet loss or capture limitation",
        "protocol": "Any",
        "populated_when": "Always (0 when no packets missed)",
        "empty_when": "Never (defaults to 0)",
        "captures": "Metadata",
        "behavioral_relevance": "LOW behavioral ‚Äî HIGH data quality signal; non-zero = unreliable row",
        "ton_iot_equiv": "missed_bytes",
    },
    "local_orig": {
        "description": "Boolean ‚Äî whether the originator is local to the monitored network",
        "protocol": "All",
        "populated_when": "Zeek could determine locality from configured local subnets",
        "empty_when": "'-' when locality is indeterminate",
        "captures": "Context",
        "behavioral_relevance": "MEDIUM ‚Äî inbound vs. outbound distinction; IoT devices always local",
        "ton_iot_equiv": "local_orig",
    },
    "local_resp": {
        "description": "Boolean ‚Äî whether the responder is local to the monitored network",
        "protocol": "All",
        "populated_when": "Zeek could determine responder locality",
        "empty_when": "'-' when indeterminate",
        "captures": "Context",
        "behavioral_relevance": "MEDIUM ‚Äî lateral movement detection (local‚Üílocal vs. exfil local‚Üíexternal)",
        "ton_iot_equiv": "local_resp",
    },
    "tunnel_parents": {
        "description": "UID of any encapsulating tunnel connection (e.g., GRE, IPinIP, VXLAN)",
        "protocol": "Tunnel-bearing protocols",
        "populated_when": "Connection is carried inside a tunnel",
        "empty_when": "'-' for direct (non-tunnelled) connections (vast majority)",
        "captures": "Metadata",
        "behavioral_relevance": "MEDIUM ‚Äî tunnelling is evasion technique; most flows have '-' (benign)",
        "ton_iot_equiv": "N/A",
    },
    # ‚îÄ‚îÄ LABELS ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    "label": {
        "description": "Coarse binary-ish label: 'Malicious' or 'Benign'",
        "protocol": "N/A ‚Äî ground truth annotation",
        "populated_when": "Always ‚Äî every row is labelled",
        "empty_when": "Never",
        "captures": "Label",
        "behavioral_relevance": "N/A ‚Äî target variable (do NOT use as input feature)",
        "ton_iot_equiv": "label (binary: 0/1 in TON-IoT)",
    },
    "detailed-label": {
        "description": "Fine-grained attack taxonomy (e.g., PartOfAHorizontalPortScan, C&C-HeartBeat, DDoS)",
        "protocol": "N/A ‚Äî ground truth annotation",
        "populated_when": "Always ‚Äî '-' for Benign rows, attack name for Malicious rows",
        "empty_when": "'-' is used for Benign; NOT a missing value ‚Äî means 'no specific attack'",
        "captures": "Label",
        "behavioral_relevance": "N/A ‚Äî multi-class target; used for attack taxonomy analysis",
        "ton_iot_equiv": "type (attack category in TON-IoT)",
    },
    "source_scenario": {
        "description": "Added by the loader ‚Äî CTU-* folder name from which the row was loaded",
        "protocol": "N/A ‚Äî provenance column",
        "populated_when": "Always (added by load_conn_log)",
        "empty_when": "Never",
        "captures": "Metadata",
        "behavioral_relevance": "NONE for model ‚Äî useful for stratification and scenario-aware splits",
        "ton_iot_equiv": "N/A",
    },
}

# ‚îÄ‚îÄ Generate the Feature Meaning Table ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("=" * 65)
print("üìñ IOT-23 FEATURE MEANING TABLE")
print("=" * 65)

meaning_rows = []
for col in df.columns:
    if col in IOT23_FEATURE_MEANINGS:
        info = IOT23_FEATURE_MEANINGS[col]
    else:
        # Pattern-based fallback
        col_l = col.lower()
        if "byte" in col_l:        role, relevance = "Behavior",  "HIGH ‚Äî volume metric"
        elif "pkt"  in col_l:      role, relevance = "Behavior",  "HIGH ‚Äî packet count"
        elif "port" in col_l or "id." in col_l: role, relevance = "Context", "MEDIUM"
        elif "ip"   in col_l:      role, relevance = "Identity",  "LOW ‚Äî topology-specific"
        elif "label" in col_l:     role, relevance = "Label",     "N/A ‚Äî target column"
        elif "ts" == col_l:        role, relevance = "Context",   "MEDIUM ‚Äî temporal"
        else:                      role, relevance = "Unknown",   "Needs manual review"
        info = {
            "description": f"Pattern-inferred: {col} ‚Äî verify manually",
            "protocol": "Unknown",
            "populated_when": "Unknown",
            "empty_when": "Unknown",
            "captures": role,
            "behavioral_relevance": relevance,
            "ton_iot_equiv": "Unknown",
        }
    meaning_rows.append({
        "Column":               col,
        "Description":          info["description"],
        "Protocol/Context":     info["protocol"],
        "Populated When":       info["populated_when"],
        "Empty/'-' When":       info["empty_when"],
        "Captures":             info["captures"],
        "Behavioral Relevance": info["behavioral_relevance"],
        "TON-IoT Equiv":        info["ton_iot_equiv"],
    })

meaning_df = pd.DataFrame(meaning_rows)
display(meaning_df)

print("\nüìà SUMMARY BY CATEGORY:")
for cat, cnt in meaning_df["Captures"].value_counts().items():
    pct = cnt / len(meaning_df) * 100
    print(f"   {cat:<22}: {cnt} columns ({pct:.0f}%)")

üìñ IOT-23 FEATURE MEANING TABLE


Unnamed: 0,Column,Description,Protocol/Context,Populated When,Empty/'-' When,Captures,Behavioral Relevance,TON-IoT Equiv
0,ts,Timestamp of the first packet of the connection (Unix ep...,All protocols,Always ‚Äî every Zeek connection gets a start timestamp,Never (Zeek always records ts),Context,MEDIUM ‚Äî useful for temporal splits and time-of-day feat...,ts
1,uid,"Zeek unique connection ID (random base62 string ‚Äî e.g., ...",All protocols,Always ‚Äî Zeek assigns a UID to every connection,Never,Identity,NONE ‚Äî just an internal key; not a behavioral feature,
2,id.orig_h,Source IP address (originator host),All IP-based protocols,Always,Never,Identity,LOW for generalization ‚Äî network-specific; causes topolo...,src_ip
3,id.orig_p,Source port number (originator),"TCP, UDP",TCP/UDP connections,0 for ICMP or portless protocols,Context,MEDIUM ‚Äî ephemeral port patterns; low ports as source = ...,src_port
4,id.resp_h,Destination IP address (responder host),All IP-based protocols,Always,Never,Identity,LOW ‚Äî same topology-overfitting risk as id.orig_h,dst_ip
5,id.resp_p,Destination port number (service being targeted),"TCP, UDP",TCP/UDP connections,0 for ICMP,Context,"HIGH ‚Äî service identifier: 80=HTTP, 443=HTTPS, 22=SSH, 5...",dst_port
6,proto,"Transport layer protocol (tcp, udp, icmp, icmp6)",Meta ‚Äî identifies the protocol,Always,Never,Context,HIGH ‚Äî different protocols = different attack vectors,proto
7,service,"Application-layer service detected (http, dns, ssl, smtp...",Application layer (L7),Zeek successfully identifies the application protocol,"'-' when service is unknown, encrypted without SNI, or n...",Context,HIGH ‚Äî '-' itself is informative (encrypted/evasive traf...,service
8,duration,Total duration of the network connection in seconds,All protocols,Connection has defined start and end (Zeek saw both SYN ...,"'-' for single packets, ICMP, or flows terminated by cap...",Behavior,"VERY HIGH ‚Äî short = scan/probe; long = data transfer, C2...",duration
9,orig_bytes,Payload bytes sent from originator (client) to responder...,All protocols,Originator transmitted payload data,"'-' for SYN-only probes, blocks, or RST without payload",Behavior,VERY HIGH ‚Äî exfiltration detection; 0/'-' = scan; large ...,src_bytes



üìà SUMMARY BY CATEGORY:
   Behavior              : 9 columns (41%)
   Context               : 7 columns (32%)
   Identity              : 3 columns (14%)
   Metadata              : 2 columns (9%)
   Label                 : 1 columns (5%)


---
## Section 0.6 ‚Äî Preliminary Feature Role Classification

Classify each column into one of five roles ‚Äî **NO dropping, NO encoding**:

| Role | Description |
|------|-------------|
| **Behavioral** | Captures flow behavior: bytes, packets, duration, states, history |
| **Identifier** | Network-specific identity: IPs, UIDs ‚Äî overfitting risk |
| **Contextual** | Situational info: protocol, service, ports, timestamps |
| **Label/Ground Truth** | Classification targets ‚Äî **never use as input** |
| **Metadata/Auxiliary** | Capture artifacts: missed bytes, tunnel, scenario origin |

> ‚ö†Ô∏è This is **tagging only**. All columns remain in `df`.

In [35]:
# ============================================================
# SECTION 0.6: Feature Role Classification (IoT-23)
# ============================================================

# Explicit column ‚Üí role mapping for Zeek conn.log.labeled schema
IOT23_ROLE_MAP = {
    # ‚îÄ‚îÄ Behavioral ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    "duration":      "Behavioral",
    "orig_bytes":    "Behavioral",
    "resp_bytes":    "Behavioral",
    "orig_pkts":     "Behavioral",
    "resp_pkts":     "Behavioral",
    "orig_ip_bytes": "Behavioral",
    "resp_ip_bytes": "Behavioral",
    "conn_state":    "Behavioral",
    "history":       "Behavioral",
    # ‚îÄ‚îÄ Identifier ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    "uid":           "Identifier",
    "id.orig_h":     "Identifier",
    "id.resp_h":     "Identifier",
    # ‚îÄ‚îÄ Contextual ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    "ts":            "Contextual",
    "id.orig_p":     "Contextual",
    "id.resp_p":     "Contextual",
    "proto":         "Contextual",
    "service":       "Contextual",
    "local_orig":    "Contextual",
    "local_resp":    "Contextual",
    # ‚îÄ‚îÄ Label ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    "label":         "Label/Ground Truth",
    "detailed-label":"Label/Ground Truth",
    # ‚îÄ‚îÄ Metadata ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    "missed_bytes":  "Metadata/Auxiliary",
    "tunnel_parents":"Metadata/Auxiliary",
    "source_scenario":"Metadata/Auxiliary",
}

# Pattern-based fallback for columns not listed above
ROLE_PATTERNS = {
    "Behavioral":       ["byte", "pkt", "packet", "rate", "duration",
                         "state", "history", "flag", "load", "loss"],
    "Identifier":       ["id.", ".h", "uid", "mac", "addr", "saddr", "daddr"],
    "Contextual":       ["proto", "service", "port", ".p", "ts", "local_", "tunnel"],
    "Label/Ground Truth": ["label", "type", "attack", "detailed", "class", "target"],
    "Metadata/Auxiliary": ["missed", "scenario", "source_", "peer", "gap"],
}

def classify_role(col):
    if col in IOT23_ROLE_MAP:
        return IOT23_ROLE_MAP[col], "HIGH"
    col_l = col.lower()
    for role, patterns in ROLE_PATTERNS.items():
        for pat in patterns:
            if pat in col_l:
                return role, "MEDIUM"
    return "Unknown ‚Äî Needs Review", "LOW"

# ‚îÄ‚îÄ Build role classification table ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
role_rows = []
for col in df.columns:
    role, confidence = classify_role(col)
    role_rows.append({
        "Column":           col,
        "Role":             role,
        "Confidence":       confidence,
        "dtype":            str(df[col].dtype),
        "unique_count":     int(df[col].nunique(dropna=False)),
        "null_count":       int(df[col].isna().sum()),
        "Cardinality Note": "‚ö†Ô∏è High" if df[col].nunique() > 1000 and role != "Identifier" else "OK",
    })

role_df = pd.DataFrame(role_rows)

print("=" * 65)
print("üè∑Ô∏è  FEATURE ROLE CLASSIFICATION TABLE (Read-Only ‚Äî No Dropping)")
print("=" * 65)
display(role_df)

# ‚îÄ‚îÄ Summary by role ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\nüìà ROLE DISTRIBUTION:")
role_counts = role_df["Role"].value_counts()
for role, cnt in role_counts.items():
    pct = cnt / len(role_df) * 100
    print(f"   {role:<25} : {cnt} columns ({pct:.0f}%)")

# ‚îÄ‚îÄ IDS model implications ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\n" + "=" * 65)
print("üìã COLUMNS GROUPED BY ROLE:")
print("=" * 65)

ROLE_NOTES = {
    "Behavioral":         "‚úÖ PRIMARY inputs for behavioral embedding and IDS detection",
    "Identifier":         "‚ö†Ô∏è  EXCLUDE or transform ‚Äî IP-based features cause network overfitting",
    "Contextual":         "‚úÖ  Use for stratification, conditional modeling, or as aux inputs",
    "Label/Ground Truth": "‚ùå  TARGET variables ‚Äî never feed as input features",
    "Metadata/Auxiliary": "‚ÑπÔ∏è  DROP from model inputs; retain for provenance/analysis",
    "Unknown ‚Äî Needs Review": "‚ùì  Manual inspection required before Phase 1 decision",
}

for role in role_df["Role"].unique():
    cols_in_role = role_df[role_df["Role"] == role]["Column"].tolist()
    print(f"\nüîπ {role.upper()} ({len(cols_in_role)} columns)")
    print(f"   {ROLE_NOTES.get(role, '')}")
    for c in cols_in_role:
        row = role_df[role_df["Column"] == c].iloc[0]
        print(f"      ‚Ä¢ {c:<22}  dtype={row['dtype']:<8}  unique={row['unique_count']}")

# ‚îÄ‚îÄ High cardinality / concern list ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
concerns = role_df[
    (role_df["Cardinality Note"] == "‚ö†Ô∏è High") &
    (role_df["Role"] != "Identifier")
]
if not concerns.empty:
    print(f"\n‚ö†Ô∏è  HIGH-CARDINALITY CONCERNS (>1000 unique, not Identifier):")
    for _, r in concerns.iterrows():
        print(f"   {r['Column']} ‚Äî {r['unique_count']} unique values ‚Üí may need hashing/grouping")

üè∑Ô∏è  FEATURE ROLE CLASSIFICATION TABLE (Read-Only ‚Äî No Dropping)


Unnamed: 0,Column,Role,Confidence,dtype,unique_count,null_count,Cardinality Note
0,ts,Contextual,HIGH,object,1446662,0,‚ö†Ô∏è High
1,uid,Identifier,HIGH,object,1446662,0,OK
2,id.orig_h,Identifier,HIGH,object,3297,0,OK
3,id.orig_p,Contextual,HIGH,float64,62295,0,‚ö†Ô∏è High
4,id.resp_h,Identifier,HIGH,object,1096204,0,OK
5,id.resp_p,Contextual,HIGH,float64,29265,0,‚ö†Ô∏è High
6,proto,Contextual,HIGH,object,3,0,OK
7,service,Contextual,HIGH,object,7,0,OK
8,duration,Behavioral,HIGH,object,60004,0,‚ö†Ô∏è High
9,orig_bytes,Behavioral,HIGH,object,480,0,OK



üìà ROLE DISTRIBUTION:
   Behavioral                : 9 columns (41%)
   Contextual                : 8 columns (36%)
   Identifier                : 3 columns (14%)
   Metadata/Auxiliary        : 2 columns (9%)

üìã COLUMNS GROUPED BY ROLE:

üîπ CONTEXTUAL (8 columns)
   ‚úÖ  Use for stratification, conditional modeling, or as aux inputs
      ‚Ä¢ ts                      dtype=object    unique=1446662
      ‚Ä¢ id.orig_p               dtype=float64   unique=62295
      ‚Ä¢ id.resp_p               dtype=float64   unique=29265
      ‚Ä¢ proto                   dtype=object    unique=3
      ‚Ä¢ service                 dtype=object    unique=7
      ‚Ä¢ local_orig              dtype=object    unique=1
      ‚Ä¢ local_resp              dtype=object    unique=1
      ‚Ä¢ tunnel_parents   label   detailed-label  dtype=object    unique=19

üîπ IDENTIFIER (3 columns)
   ‚ö†Ô∏è  EXCLUDE or transform ‚Äî IP-based features cause network overfitting
      ‚Ä¢ uid                     dtype=obje

---
## Cell 6: Placeholder (`-`) Analysis

Zeek writes `-` (a single hyphen) whenever a field is not applicable (e.g.,
`resp_bytes` during a SYN scan where no response was ever sent).  
These are **not** NaN ‚Äî pandas will see them as a valid string value `"-"`.

### Why this matters
- Imputing `-` as 0 or mean would be **semantically wrong** (absence ‚â† zero).
- A column with 95% `-` is effectively unusable as a numeric feature without
  domain-specific handling ‚Äî flag it now so Phase 1 preprocessing can decide.
- The `-` rate is also a Zeek-specific data quality signal: high `-` in
  `resp_pkts` or `resp_bytes` is expected for scanning/DoS traffic.

In [36]:
# ============================================================
# CELL 6: Placeholder ("-") Analysis
# ============================================================

PLACEHOLDER = "-"

if df.empty:
    print("‚ö†Ô∏è  df is empty ‚Äî skipping placeholder analysis.")
else:
    total_rows = len(df)

    # ‚îÄ‚îÄ Count placeholder occurrences per column ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    placeholder_counts = {
        col: int((df[col].astype(str) == PLACEHOLDER).sum())
        for col in df.columns
    }

    placeholder_df = (
        pd.DataFrame.from_dict(
            placeholder_counts, orient="index", columns=["placeholder_count"]
        )
        .assign(placeholder_pct=lambda x: (x["placeholder_count"] / total_rows * 100).round(2))
        .sort_values("placeholder_pct", ascending=False)
        .reset_index()
        .rename(columns={"index": "column"})
    )

    # ‚îÄ‚îÄ Filter to only columns that have at least one '-' ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    has_placeholder = placeholder_df[placeholder_df["placeholder_count"] > 0]
    zero_placeholder = placeholder_df[placeholder_df["placeholder_count"] == 0]

    print("=" * 65)
    print(f"üìä PLACEHOLDER ('-') ANALYSIS  (total rows: {total_rows:,})")
    print("=" * 65)
    print(f"  Columns with NO '-'      : {len(zero_placeholder)}")
    print(f"  Columns with some '-'    : {len(has_placeholder)}")
    print(f"\n  ‚Üí Columns sorted by '-' rate:\n")
    display(has_placeholder)

    # ‚îÄ‚îÄ Flag high-placeholder columns (>50%) ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    high_ph = has_placeholder[has_placeholder["placeholder_pct"] > 50]
    if not high_ph.empty:
        print(f"\n‚ö†Ô∏è  HIGH-PLACEHOLDER COLUMNS (>50%):  {high_ph['column'].tolist()}")
        print("   These may be effectively unusable as numeric features without")
        print("   domain-specific handling in Phase 1.")
    else:
        print("\n‚úÖ No columns exceed 50% '-' placeholder rate.")

üìä PLACEHOLDER ('-') ANALYSIS  (total rows: 1,446,662)
  Columns with NO '-'      : 15
  Columns with some '-'    : 7

  ‚Üí Columns sorted by '-' rate:



Unnamed: 0,column,placeholder_count,placeholder_pct
0,local_resp,1446662,100.0
1,local_orig,1446662,100.0
2,service,1434765,99.18
3,duration,900408,62.24
4,orig_bytes,900408,62.24
5,resp_bytes,900408,62.24
6,history,3786,0.26



‚ö†Ô∏è  HIGH-PLACEHOLDER COLUMNS (>50%):  ['local_resp', 'local_orig', 'service', 'duration', 'orig_bytes', 'resp_bytes']
   These may be effectively unusable as numeric features without
   domain-specific handling in Phase 1.


---
## Cell 7: Dual-Label Taxonomy

IoT-23 provides **two** label columns per connection record:

| Column | Example Values | Granularity |
|--------|---------------|-------------|
| `label` | `Malicious`, `Benign` | Binary / coarse |
| `detailed-label` | `PartOfAHorizontalPortScan`, `C&C-HeartBeat`, `-` | Fine-grained |

A cross-tabulation (`pd.crosstab`) reveals:
- How many attack categories exist
- Whether any `Benign` rows have a non-`-` detailed-label (data-quality check)
- Class imbalance at both levels ‚Äî critical input for Phase 1 sampling strategy

In [37]:
# ============================================================
# CELL 7: Dual-Label Taxonomy
# ============================================================

LABEL_COL    = "label"
DET_LABEL_COL = "detailed-label"

if df.empty:
    print("‚ö†Ô∏è  df is empty ‚Äî skipping dual-label taxonomy.")
elif LABEL_COL not in df.columns or DET_LABEL_COL not in df.columns:
    print(f"‚ö†Ô∏è  Expected columns '{LABEL_COL}' and/or '{DET_LABEL_COL}' not found.")
    print(f"   Available columns: {list(df.columns)}")
else:
    # ‚îÄ‚îÄ Top-level label distribution ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    label_counts = (
        df[LABEL_COL]
        .value_counts(dropna=False)
        .rename_axis("label")
        .reset_index(name="count")
        .assign(pct=lambda x: (x["count"] / len(df) * 100).round(2))
    )

    print("=" * 65)
    print("üè∑Ô∏è  TOP-LEVEL LABEL DISTRIBUTION")
    print("=" * 65)
    display(label_counts)

    # ‚îÄ‚îÄ Detailed-label distribution ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    det_label_counts = (
        df[DET_LABEL_COL]
        .value_counts(dropna=False)
        .rename_axis("detailed-label")
        .reset_index(name="count")
        .assign(pct=lambda x: (x["count"] / len(df) * 100).round(2))
    )

    print(f"\nüìã UNIQUE detailed-labels  : {df[DET_LABEL_COL].nunique(dropna=False)}")
    print("=" * 65)
    print("üè∑Ô∏è  DETAILED-LABEL DISTRIBUTION")
    print("=" * 65)
    display(det_label_counts)

    # ‚îÄ‚îÄ Cross-tabulation: label √ó detailed-label ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    print("\n" + "=" * 65)
    print("üîÄ CROSS-TABULATION: label √ó detailed-label")
    print("    (rows = label, columns = detailed-label)")
    print("=" * 65)

    crosstab = pd.crosstab(
        df[LABEL_COL],
        df[DET_LABEL_COL],
        margins=True,
        margins_name="TOTAL",
    )
    display(crosstab)

    # ‚îÄ‚îÄ Data quality check ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    benign_with_detail = df[
        (df[LABEL_COL].str.strip().str.lower() == "benign") &
        (df[DET_LABEL_COL].astype(str).str.strip() != PLACEHOLDER)
    ]
    if len(benign_with_detail) > 0:
        print(f"\n‚ö†Ô∏è  DATA QUALITY: {len(benign_with_detail):,} Benign rows have a non-'-' detailed-label.")
        print("   This may indicate mislabelling or inconsistent annotation.")
    else:
        print("\n‚úÖ Data quality OK: All Benign rows have detailed-label = '-'.")

    # ‚îÄ‚îÄ Per-scenario label breakdown ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
    print("\n" + "=" * 65)
    print("üìÇ LABEL DISTRIBUTION PER SCENARIO")
    print("=" * 65)
    scenario_label = (
        df.groupby(["source_scenario", LABEL_COL])
        .size()
        .unstack(fill_value=0)
        .assign(total=lambda x: x.sum(axis=1))
    )
    display(scenario_label)

‚ö†Ô∏è  Expected columns 'label' and/or 'detailed-label' not found.
   Available columns: ['ts', 'uid', 'id.orig_h', 'id.orig_p', 'id.resp_h', 'id.resp_p', 'proto', 'service', 'duration', 'orig_bytes', 'resp_bytes', 'conn_state', 'local_orig', 'local_resp', 'missed_bytes', 'history', 'orig_pkts', 'orig_ip_bytes', 'resp_pkts', 'resp_ip_bytes', 'tunnel_parents   label   detailed-label', 'source_scenario']


---
## Section 0.7 ‚Äî Phase-0 Summary, Challenges & Artifacts

### Final objectives
1. Consolidate all Phase-0 findings into a comprehensive summary
2. Identify **potential challenges** for Phase-1 preprocessing
3. Export key analysis tables as CSV artifacts
4. Generate full markdown report covering all 7 sections

In [38]:
# ============================================================
# SECTION 0.7 ‚Äî Potential Challenges for Phase-1
# ============================================================

print("=" * 65)
print("‚ö†Ô∏è  POTENTIAL CHALLENGES IDENTIFIED FOR PHASE-1")
print("=" * 65)

# ‚îÄ‚îÄ 1. String-typed numeric columns ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\nüî∏ 1. ALL NUMERIC FIELDS ARRIVE AS DTYPE=OBJECT (STRING)")
print("   Every column is `object` because TSV has no type hints.")
print("   Phase 1 must cast these explicitly:")
object_numeric = [c for c in df.columns if c in ZEEK_NUMERIC_FIELDS and c in df.columns]
for c in object_numeric:
    print(f"      ‚Üí {c}: dtype={df[c].dtype}  (will need pd.to_numeric + sentinel handling)")

# ‚îÄ‚îÄ 2. Sentinel-heavy columns ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\nüî∏ 2. HIGH-PLACEHOLDER COLUMNS (>50% dash '-')")
if "placeholder_df" not in dir() and "has_placeholder" not in dir():
    print("   [Run Cell 6 first to populate placeholder_df]")
else:
    try:
        high_dash = has_placeholder[has_placeholder["placeholder_pct"] > 50]
        if not high_dash.empty:
            for _, r in high_dash.iterrows():
                print(f"      ‚ö†Ô∏è  {r['column']}: {r['placeholder_pct']}% dash '-'")
                print(f"           ‚Üí numeric cast impossible without semantic decision")
        else:
            print("   ‚úÖ No column exceeds 50% dash rate")
    except NameError:
        print("   [placeholder analysis not yet run]")

# ‚îÄ‚îÄ 3. Class imbalance ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\nüî∏ 3. CLASS IMBALANCE")
if LABEL_COL in df.columns and DET_LABEL_COL in df.columns:
    lc = df[LABEL_COL].value_counts()
    if len(lc) > 1:
        ratio = lc.iloc[0] / lc.iloc[-1]
        print(f"   label      : majority/minority ratio = {ratio:.1f}:1")
        if ratio > 10:
            print(f"      ‚ö†Ô∏è  Severe imbalance ‚Äî SMOTE or class_weight='balanced' needed")
    dlc = df[DET_LABEL_COL].value_counts()
    rare = dlc[dlc < 1000]
    print(f"   detailed-label: {len(rare)} categories have < 1,000 rows ‚Üí risk of rare-class exclusion")

# ‚îÄ‚îÄ 4. Identity leakage risk ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\nüî∏ 4. IDENTITY LEAKAGE RISK (network-specific overfitting)")
id_cols = role_df[role_df["Role"] == "Identifier"]["Column"].tolist() if "role_df" in dir() else ["id.orig_h", "id.resp_h", "uid"]
for c in id_cols:
    n_unique = df[c].nunique() if c in df.columns else "?"
    print(f"      ‚ö†Ô∏è  {c}: {n_unique} unique values ‚Äî will cause topology overfitting if used raw")

# ‚îÄ‚îÄ 5. Cross-dataset schema mismatch ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\nüî∏ 5. CROSS-DATASET SCHEMA MISMATCH (IoT-23 vs TON-IoT)")
print("   IOT-23 column ‚Üí TON-IoT equivalent:")
for col, info in IOT23_FEATURE_MEANINGS.items():
    equiv = info.get("ton_iot_equiv", "Unknown")
    if equiv not in ("N/A", "N/A (not in TON-IoT)", "Unknown", "N/A (not in TON-IoT)"):
        if equiv != col and col in df.columns:
            print(f"      {col:<22} ‚Üí {equiv}")

print("\n   Columns with NO TON-IoT equivalent:")
for col, info in IOT23_FEATURE_MEANINGS.items():
    equiv = info.get("ton_iot_equiv", "Unknown")
    if "N/A" in str(equiv):
        print(f"      {col:<22} ‚Äî IoT-23 only")

# ‚îÄ‚îÄ 6. Scenario-aware split ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\nüî∏ 6. TRAIN/TEST SPLIT STRATEGY")
print("   Random split = data leakage (same CTU scenario in train AND test).")
print("   Required: scenario-aware split ‚Äî hold out 3-5 entire CTU_* folders for test.")
if "source_scenario" in df.columns:
    scenarios = sorted(df["source_scenario"].unique())
    print(f"   Available scenarios ({len(scenarios)}): {scenarios}")

# ‚îÄ‚îÄ 7. History column high cardinality ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\nüî∏ 7. HIGH-CARDINALITY FIELDS")
if "role_df" in dir():
    hc = role_df[(role_df["unique_count"] > 1000) & (~role_df["Role"].isin(["Identifier", "Metadata/Auxiliary"]))]
    if not hc.empty:
        for _, r in hc.iterrows():
            print(f"      ‚ö†Ô∏è  {r['Column']}: {r['unique_count']} unique values ‚Üí needs grouping or hashing in Phase 1")
    else:
        print("   ‚úÖ No high-cardinality non-identifier columns")

‚ö†Ô∏è  POTENTIAL CHALLENGES IDENTIFIED FOR PHASE-1

üî∏ 1. ALL NUMERIC FIELDS ARRIVE AS DTYPE=OBJECT (STRING)
   Every column is `object` because TSV has no type hints.
   Phase 1 must cast these explicitly:
      ‚Üí ts: dtype=object  (will need pd.to_numeric + sentinel handling)
      ‚Üí id.orig_p: dtype=float64  (will need pd.to_numeric + sentinel handling)
      ‚Üí id.resp_p: dtype=float64  (will need pd.to_numeric + sentinel handling)
      ‚Üí duration: dtype=object  (will need pd.to_numeric + sentinel handling)
      ‚Üí orig_bytes: dtype=object  (will need pd.to_numeric + sentinel handling)
      ‚Üí resp_bytes: dtype=object  (will need pd.to_numeric + sentinel handling)
      ‚Üí missed_bytes: dtype=float64  (will need pd.to_numeric + sentinel handling)
      ‚Üí orig_pkts: dtype=float64  (will need pd.to_numeric + sentinel handling)
      ‚Üí orig_ip_bytes: dtype=float64  (will need pd.to_numeric + sentinel handling)
      ‚Üí resp_pkts: dtype=float64  (will need pd.to_nu

In [44]:
# ============================================================
# SECTION 0.7 ‚Äî Export Artifacts + Markdown Report
# ============================================================

import os

# ‚îÄ‚îÄ Create artifacts directory ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
# NOTEBOOK_DIR = ‚Ä¶/main_folder/Phase_0  ‚Üí go up one level to main_folder
artifacts_dir = NOTEBOOK_DIR.parent / "artifacts"
artifacts_dir.mkdir(parents=True, exist_ok=True)
print(f"üìÅ Artifacts directory: {artifacts_dir}")

# ‚îÄ‚îÄ Save CSV artifacts ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
exports = {}

# 1. Master column inventory
if "col_inventory" in dir() and col_inventory is not None:
    path = artifacts_dir / "iot23_phase0_column_inventory.csv"
    col_inventory.to_csv(path, index=False)
    exports["column_inventory"] = str(path)

# 2. Sentinel summary
if "sentinel_df" in dir() and sentinel_df is not None:
    path = artifacts_dir / "iot23_phase0_sentinel_analysis.csv"
    sentinel_df.to_csv(path, index=False)
    exports["sentinel_analysis"] = str(path)

# 3. Feature meaning table
if "meaning_df" in dir() and meaning_df is not None:
    path = artifacts_dir / "iot23_phase0_feature_meanings.csv"
    meaning_df.to_csv(path, index=False)
    exports["feature_meanings"] = str(path)

# 4. Role classification
if "role_df" in dir() and role_df is not None:
    path = artifacts_dir / "iot23_phase0_role_classification.csv"
    role_df.to_csv(path, index=False)
    exports["role_classification"] = str(path)

# 5. File inventory
if "inventory_df" in dir() and inventory_df is not None:
    inv_save = inventory_df.copy()
    inv_save["column_names"] = inv_save["column_names"].apply(str)  # make serialisable
    path = artifacts_dir / "iot23_phase0_file_inventory.csv"
    inv_save.to_csv(path, index=False)
    exports["file_inventory"] = str(path)

# 6. Numerical stats
if "num_stats_df" in dir() and num_stats_df is not None:
    path = artifacts_dir / "iot23_phase0_numerical_stats.csv"
    num_stats_df.to_csv(path, index=False)
    exports["numerical_stats"] = str(path)

print("\n‚úÖ CSV Artifacts saved:")
for name, path in exports.items():
    print(f"   ‚Ä¢ {name}: {Path(path).name}")

# ‚îÄ‚îÄ Generate Markdown Report ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
def generate_iot23_markdown_report():
    mem = df.memory_usage(deep=True).sum() / 1024 / 1024
    n_scenarios = df["source_scenario"].nunique() if "source_scenario" in df.columns else "?"
    n_uniq_labels = df[LABEL_COL].nunique(dropna=False) if LABEL_COL in df.columns else "?"
    n_uniq_det    = df[DET_LABEL_COL].nunique(dropna=False) if DET_LABEL_COL in df.columns else "?"

    report = f"""# üìä Phase-0.2: Deep Data Understanding Report ‚Äî IoT-23

**Dataset:** CTU-IoT-Malware-Capture-23 (IoT-23)  
**Format:** Zeek/Bro TSV conn.log.labeled  
**Generated by:** Phase_0_2_IoT23_Data_Understanding.ipynb

---

# Table of Contents
1. [Section 0.1 ‚Äî File Inventory & Loading](#section-01)
2. [Section 0.2 ‚Äî Master Column Inventory](#section-02)
3. [Section 0.3 ‚Äî Categorical Column Value Analysis](#section-03)
4. [Section 0.4 ‚Äî Numerical Column Semantics](#section-04)
5. [Section 0.5 ‚Äî Full Feature Meaning Dictionary](#section-05)
6. [Section 0.6 ‚Äî Feature Role Classification](#section-06)
7. [Section 0.7 ‚Äî Summary, Challenges & Artifacts](#section-07)

---

## Section 0.1 ‚Äî File Inventory & Loading {{#section-01}}

### Dataset Structure
- **Format:** TSV (tab-separated), NOT CSV
- **Header:** 8 Zeek metadata lines starting with `#` (lines 0-7)
- **Column names:** Line 6 (`#fields` tab-separated)
- **Footer:** `#close` on the last line (must be dropped)
- **Null sentinel:** `-` (dash) = "not applicable" ‚Äî Zeek-specific, NOT pandas NaN
- **Dual labels:** `label` (binary) + `detailed-label` (fine-grained)

### File Inventory Summary
"""

    if "inventory_df" in dir() and inventory_df is not None:
        report += "\n| # | Scenario | File Size (MB) | Columns |\n|---|----------|---------------|--------|\n"
        for i, row in inventory_df.iterrows():
            report += f"| {i+1} | {row['scenario_name']} | {row['file_size_mb']:.1f} | {row['num_columns']} |\n"

    report += f"""
### Combined Dataset Stats
| Metric | Value |
|--------|-------|
| Scenarios loaded | {n_scenarios} / {len(inventory_df) if 'inventory_df' in dir() else '?'} |
| Total rows sampled | {len(df):,} (max {MAX_ROWS_PER_FILE:,}/file) |
| Total columns | {len(df.columns)} |
| Memory footprint | {mem:.1f} MB |

---

## Section 0.2 ‚Äî Master Column Inventory {{#section-02}}

Full column inventory with dtype, null%, and unique count.

"""
    if "col_inventory" in dir() and col_inventory is not None:
        report += "| Column | dtype | non_null | null_count | null_pct | unique |\n"
        report += "|--------|-------|----------|------------|----------|--------|\n"
        for _, r in col_inventory.iterrows():
            report += f"| {r['column']} | {r['dtype']} | {r['non_null_count']:,} | {r['null_count']:,} | {r['null_pct']}% | {r['unique_count']:,} |\n"

    report += f"""
---

## Section 0.3 ‚Äî Categorical Column Value Analysis {{#section-03}}

### Column Type Split
- **Categorical (object):** {len(categorical_columns)} columns
- **Numerical (int/float):** {len(numerical_columns)} columns

### Sentinel Summary (all categorical columns)

> `dash (-)` = Zeek "not applicable"; `quest (?)` = Zeek "unknown"; `F/T` = boolean; `NaN` = true pandas null

"""
    if "sentinel_df" in dir() and sentinel_df is not None:
        report += "| Column | Total | dash(-) | quest(?) | empty | F | T | NaN | any_sentinel |\n"
        report += "|--------|-------|---------|----------|-------|---|---|-----|-------------|\n"
        for _, r in sentinel_df.iterrows():
            report += (f"| {r['Column']} | {r['Total']:,} | {r['dash (-)']} | "
                       f"{r['quest (?)']} | {r['empty']} | {r['F']} | {r['T']} | "
                       f"{r['NaN']} | {r['any_sentinel']} |\n")

    report += f"""
---

## Section 0.4 ‚Äî Numerical Column Semantics {{#section-04}}

"""
    if "num_stats_df" in dir() and num_stats_df is not None:
        report += "| Column | Non-null | Min | Max | Mean | Median | Std Dev | Zero% | Skewness |\n"
        report += "|--------|----------|-----|-----|------|--------|---------|-------|----------|\n"
        for _, r in num_stats_df.iterrows():
            report += (f"| {r['Column']} | {r['Non-null']:,} | {r['Min']} | {r['Max']} | "
                       f"{r['Mean']} | {r['Median']} | {r['Std Dev']} | {r['Zero %']} | {r['Skewness']} |\n")

    report += f"""
---

## Section 0.5 ‚Äî Full Feature Meaning Dictionary {{#section-05}}

"""
    if "meaning_df" in dir() and meaning_df is not None:
        report += "| Column | Description | Captures | Behavioral Relevance | TON-IoT Equiv |\n"
        report += "|--------|-------------|----------|---------------------|---------------|\n"
        for _, r in meaning_df.iterrows():
            desc_short = str(r["Description"])[:60] + "‚Ä¶" if len(str(r["Description"])) > 60 else str(r["Description"])
            report += f"| {r['Column']} | {desc_short} | {r['Captures']} | {r['Behavioral Relevance']} | {r['TON-IoT Equiv']} |\n"

    report += f"""
---

## Section 0.6 ‚Äî Feature Role Classification {{#section-06}}

> ‚ö†Ô∏è Classification ONLY ‚Äî all columns preserved in `df`.

"""
    if "role_df" in dir() and role_df is not None:
        report += "| Column | Role | Confidence | dtype | Unique |\n"
        report += "|--------|------|------------|-------|--------|\n"
        for _, r in role_df.iterrows():
            report += f"| {r['Column']} | {r['Role']} | {r['Confidence']} | {r['dtype']} | {r['unique_count']:,} |\n"

        report += "\n### Role Distribution\n\n"
        for role, cnt in role_df["Role"].value_counts().items():
            report += f"- **{role}**: {cnt} columns\n"

    report += f"""
---

## Section 0.7 ‚Äî Summary, Challenges & Artifacts {{#section-07}}

### Dataset Overview
| Metric | Value |
|--------|-------|
| Scenarios | {n_scenarios} |
| Rows sampled | {len(df):,} |
| Columns | {len(df.columns)} |
| Memory | {mem:.1f} MB |
| Unique top-level labels | {n_uniq_labels} |
| Unique detailed-labels | {n_uniq_det} |

### Label Distribution

**Top-level (`label`):**
"""
    if LABEL_COL in df.columns:
        lc = df[LABEL_COL].value_counts(dropna=False)
        report += "| Label | Count | Pct |\n|-------|-------|-----|\n"
        for val, cnt in lc.items():
            report += f"| {val} | {cnt:,} | {cnt/len(df)*100:.2f}% |\n"

    report += "\n**Detailed-label (top 15):**\n"
    if DET_LABEL_COL in df.columns:
        dlc = df[DET_LABEL_COL].value_counts(dropna=False).head(15)
        report += "| detailed-label | Count | Pct |\n|----------------|-------|-----|\n"
        for val, cnt in dlc.items():
            report += f"| {val} | {cnt:,} | {cnt/len(df)*100:.2f}% |\n"

    report += """
### Key Challenges for Phase-1
1. **All columns are `object` dtype** ‚Äî numeric cast required per field
2. **`-` is a semantic sentinel** ‚Äî do NOT replace blindly with NaN/0
3. **Class imbalance** ‚Äî severe skew; requires SMOTE or class_weight
4. **Identity columns** (id.orig_h, id.resp_h, uid) ‚Äî must be excluded or transformed
5. **Scenario-aware split** ‚Äî random split leaks; hold out full CTU-* scenarios
6. **Cross-dataset alignment** ‚Äî IoT-23 column names differ from TON-IoT/BoT-IoT
7. **history column** ‚Äî high cardinality string; needs aggregation or embedding

### Artifacts Generated
| File | Description |
|------|-------------|
| `iot23_phase0_column_inventory.csv` | dtype, null%, unique counts per column |
| `iot23_phase0_sentinel_analysis.csv` | All placeholder/sentinel counts per column |
| `iot23_phase0_feature_meanings.csv` | Full semantic meaning + TON-IoT cross-reference |
| `iot23_phase0_role_classification.csv` | Behavioral/Identifier/Contextual/Label/Metadata tags |
| `iot23_phase0_file_inventory.csv` | Per-file scenario info and column lists |
| `iot23_phase0_numerical_stats.csv` | Min/max/mean/skew for numeric fields |

---

## Phase-0.2 Compliance Checklist
| Rule | Status |
|------|--------|
| No feature dropping | ‚úÖ All columns preserved |
| No encoding | ‚úÖ Raw data only; role-tagged not transformed |
| No scaling | ‚úÖ Statistics computed but df unchanged |
| Complete data understanding | ‚úÖ All 7 sections completed |

---
**End of Phase-0.2: IoT-23 Deep Data Understanding**  
*Proceed to Phase-0.3 (BoT-IoT) or Phase-1 (Preprocessing)*
"""
    return report

# Generate and save report
print("\nüìù Generating Phase-0.2 Markdown Report...")
md_report = generate_iot23_markdown_report()
report_path = artifacts_dir / "Phase_0_2_IoT23_Data_Understanding_Report.md"
with open(report_path, "w", encoding="utf-8") as f:
    f.write(md_report)

print(f"‚úÖ Markdown report saved: {report_path.name}")
print(f"   Size: {len(md_report):,} characters / {len(md_report.splitlines())} lines")
print("\n" + "=" * 65)
print("‚úÖ PHASE-0.2 COMPLETE ‚Äî All artifacts exported")
print("=" * 65)

üìÅ Artifacts directory: c:\Users\suhas\OneDrive\Desktop\Capstone\RAG-IDS-Knowledge-Augmented-IoT-Threat-Detection\main_folder\artifacts

‚úÖ CSV Artifacts saved:
   ‚Ä¢ column_inventory: iot23_phase0_column_inventory.csv
   ‚Ä¢ sentinel_analysis: iot23_phase0_sentinel_analysis.csv
   ‚Ä¢ feature_meanings: iot23_phase0_feature_meanings.csv
   ‚Ä¢ role_classification: iot23_phase0_role_classification.csv
   ‚Ä¢ file_inventory: iot23_phase0_file_inventory.csv
   ‚Ä¢ numerical_stats: iot23_phase0_numerical_stats.csv

üìù Generating Phase-0.2 Markdown Report...
‚úÖ Markdown report saved: Phase_0_2_IoT23_Data_Understanding_Report.md
   Size: 4,008 characters / 128 lines

‚úÖ PHASE-0.2 COMPLETE ‚Äî All artifacts exported


In [40]:
# ============================================================
# CELL 8: Open Questions & Phase 0 Summary
# ============================================================

import textwrap

print("=" * 65)
print("üìù PHASE 0 SUMMARY ‚Äî IoT-23 Dataset")
print("=" * 65)

if not df.empty:
    total_rows   = len(df)
    total_cols   = len(df.columns)
    num_scenarios = df["source_scenario"].nunique() if "source_scenario" in df.columns else "N/A"
    mem_mb       = df.memory_usage(deep=True).sum() / 1024 / 1024

    print(f"\n  Scenarios loaded        : {num_scenarios} / {len(inventory_df)}")
    print(f"  Total rows sampled      : {total_rows:,}  (max {MAX_ROWS_PER_FILE:,} per file)")
    print(f"  Total columns           : {total_cols}")
    print(f"  In-memory footprint     : {mem_mb:.1f} MB")

    if LABEL_COL in df.columns:
        n_unique_labels = df[LABEL_COL].nunique(dropna=False)
        print(f"  Unique top-level labels : {n_unique_labels}")
    if DET_LABEL_COL in df.columns:
        n_unique_det = df[DET_LABEL_COL].nunique(dropna=False)
        print(f"  Unique detailed-labels  : {n_unique_det}")

summary_text = """
OPEN QUESTIONS FOR PHASE 1
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ

1. FEATURE TYPE CASTING
   All columns arrive as dtype=object (string). Phase 1 must
   cast numeric Zeek fields (duration, orig_bytes, resp_bytes,
   orig_pkts, resp_pkts, ‚Ä¶) to float/int, handling '-' and 'C'
   (Zeek connection-state codes) appropriately.

2. PLACEHOLDER STRATEGY
   '-' is a Zeek semantic sentinel ‚Äî NOT a missing value.
   Options:
     a) Replace with NaN ‚Üí enables sklearn imputers
     b) Replace with 0   ‚Üí valid for counts (orig_pkts), wrong for others
     c) Add binary flag column (e.g., resp_bytes_missing=1)
     d) Drop the column if >X% are '-'
   Decision must be made per-column based on domain knowledge.

3. LABEL ENCODING
   - Binary IDS: Benign vs Malicious (use label)
   - Multi-class IDS: use detailed-label (collapse rare classes?)
   - Phase 1 must decide the target and encode it.

4. TRAIN / TEST SPLIT STRATEGY
   - Random split risks data leakage (same scenario in train & test).
   - Scenario-aware split: hold out entire CTU-* scenarios for test.
   - Stratify on detailed-label to preserve rare attack types.

5. CLASS IMBALANCE
   IoT-23 is heavily skewed (some attacks are rare).
   Candidate strategies: SMOTE, class_weight='balanced',
   undersampling benign traffic, or scenario-level reweighting.

6. CROSS-DATASET ALIGNMENT
   When combining with TON-IoT or BoT-IoT, column names differ.
   A canonical feature schema must be established in Phase 1.

7. TIMESTAMP / TEMPORAL FEATURES
   Zeek `ts` is a Unix float timestamp. Useful derived features:
   hour-of-day, day-of-week, inter-arrival time. Not extracted here.

‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
Phase 0 is READ-ONLY EDA. No features were altered or dropped.
Proceed to Phase 0.3 (BoT-IoT) or Phase 1 (Preprocessing).
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
"""

print(summary_text)

üìù PHASE 0 SUMMARY ‚Äî IoT-23 Dataset

  Scenarios loaded        : 23 / 23
  Total rows sampled      : 1,446,662  (max 100,000 per file)
  Total columns           : 22
  In-memory footprint     : 1215.0 MB

OPEN QUESTIONS FOR PHASE 1
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ

1. FEATURE TYPE CASTING
   All columns arrive as dtype=object (string). Phase 1 must
   cast numeric Zeek fields (duration, orig_bytes, resp_bytes,
   orig_pkts, resp_pkts, ‚Ä¶) to float/int, handling '-' and 'C'
   (Zeek connection-state codes) appropriately.

2. PLACEHOLDER STRATEGY
   '-' is a Zeek semantic sentinel ‚Äî NOT a missing value.
   Options:
     a) Replace with NaN ‚Üí enables sklearn imputers
     b) Replace with 0   ‚Üí valid for counts (orig_pkts), wrong for others
     c) Add binary flag column (e.g., resp_bytes_missing=1)
     d) Drop the column if >

---
## Full Dataset Statistics ‚Äî ALL Rows (IoT-23, all 23 scenarios)

> **Why we need this:** Sections 0.1‚Äì0.7 above used `MAX_ROWS_PER_FILE = 100,000`, which reads only ~100k rows
> from each scenario file. Because each scenario focuses on a single attack type, the first 100k rows may
> be predominantly one label class, misrepresenting the real label balance and protocol distribution.
>
> This section re-reads **every row in every scenario file** using chunked I/O (500k rows per chunk)
> so we never load the full ~170M-row dataset into RAM at once. Only aggregated counts are stored.

**Aggregations computed (full dataset):**
- Exact row count per scenario file and grand total
- Full `label` and `detailed-label` distributions
- Full `proto` and `conn_state` distributions
- Per-column numerical min / max / mean (running-sum method)
- Sentinel (`-`, `?`, `(empty)`) counts across all rows

In [45]:
# ============================================================
# FULL DATASET SCAN ‚Äî ALL ROWS (IoT-23, chunked, no RAM limit)
# Iterates over all 23 scenario files; reads in 500k-row chunks.
# Accumulates counts ‚Äî never loads the full dataset into RAM.
# READ-ONLY: df/chunk from Phase 0 EDA above is not touched.
# ============================================================
import time
import pathlib

FULL_CHUNKSIZE = 500_000

# ‚îÄ‚îÄ Accumulators ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
full23_file_rows     = {}        # scenario_name ‚Üí row count
full23_total_rows    = 0

full23_label_counts  = {}        # {"Benign": n, "Attack": n, ...}
full23_det_label_counts = {}     # per detailed-label value counts
full23_proto_counts  = {}
full23_conn_counts   = {}        # conn_state

# Zeek numerical columns (as stored in the log files)
NUM_COLS_IOT23 = [
    "id.orig_p", "id.resp_p", "missed_bytes",
    "orig_pkts", "orig_ip_bytes", "resp_pkts", "resp_ip_bytes",
    "duration", "orig_bytes", "resp_bytes",
]
full23_num_stats = {
    c: {"count": 0, "sum": 0.0, "min": float("inf"), "max": float("-inf")}
    for c in NUM_COLS_IOT23
}

# Sentinel tracking
SENTINELS = ["-", "?", "(empty)"]
full23_sentinel_counts = {s: 0 for s in SENTINELS}

def _accum_vc23(acc_dict, series):
    for val, n in series.value_counts(dropna=False).items():
        k = str(val) if not (isinstance(val, float) and pd.isna(val)) else "<NaN>"
        acc_dict[k] = acc_dict.get(k, 0) + int(n)

def _accum_num23(stats_dict, col_name, series):
    s = pd.to_numeric(series, errors="coerce").dropna()
    if len(s) == 0:
        return
    st = stats_dict[col_name]
    st["count"] += len(s)
    st["sum"]   += float(s.sum())
    st["min"]    = min(st["min"], float(s.min()))
    st["max"]    = max(st["max"], float(s.max()))

def _accum_sentinels(chunk, sent_dict):
    """Count sentinel values (-, ?, (empty)) across all columns."""
    for s_val in SENTINELS:
        mask = (chunk == s_val)
        sent_dict[s_val] += int(mask.values.sum())

# ‚îÄ‚îÄ Main scan loop ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("=" * 65)
print("üîç FULL DATASET SCAN ‚Äî ALL ROWS (IoT-23, 500k-row chunks)")
print(f"   Scanning {len(inventory_df)} scenario files...")
print("=" * 65)

t0 = time.time()
errors_list = []

for _, inv_row in inventory_df.iterrows():
    fpath    = pathlib.Path(inv_row["file_path"])
    columns  = inv_row["column_names"]          # list already extracted
    scenario = inv_row["scenario_name"]
    f_rows   = 0

    print(f"\n  üìÇ {scenario}")
    if not columns:
        # Fallback: re-extract from file header
        columns = extract_columns_from_bro_header(fpath)
    if not columns:
        print(f"     ‚ö†Ô∏è  Cannot determine columns ‚Äî skipping.")
        errors_list.append(scenario)
        continue

    try:
        reader = pd.read_csv(
            fpath, sep="\t",
            skiprows=8,                   # skip the 8 Zeek metadata lines
            header=None,
            names=columns,
            chunksize=FULL_CHUNKSIZE,
            low_memory=False,
            dtype=str,                    # read everything as str first (safe for sentinels)
        )
        for chunk_i, chunk in enumerate(reader):
            # Drop Zeek footer lines that start with '#' (e.g. #close)
            first_col = columns[0]
            chunk = chunk[~chunk[first_col].astype(str).str.startswith("#")]
            n = len(chunk)
            f_rows += n

            # ‚îÄ‚îÄ Label columns ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
            if "label" in chunk.columns:
                _accum_vc23(full23_label_counts, chunk["label"])
            if "detailed-label" in chunk.columns:
                _accum_vc23(full23_det_label_counts, chunk["detailed-label"])

            # ‚îÄ‚îÄ Protocol + conn_state ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
            if "proto" in chunk.columns:
                _accum_vc23(full23_proto_counts, chunk["proto"])
            if "conn_state" in chunk.columns:
                _accum_vc23(full23_conn_counts, chunk["conn_state"])

            # ‚îÄ‚îÄ Sentinels (all string cols) ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
            _accum_sentinels(chunk, full23_sentinel_counts)

            # ‚îÄ‚îÄ Numerical stats ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
            for col in NUM_COLS_IOT23:
                if col in chunk.columns:
                    _accum_num23(full23_num_stats, col, chunk[col])

            print(f"     chunk {chunk_i+1:>4d} ‚Äî {f_rows:>12,} rows", end="\r")

    except Exception as e:
        print(f"\n     ‚ö†Ô∏è  Error in {scenario}: {e}")
        errors_list.append(f"{scenario}: {e}")
        continue

    full23_file_rows[scenario] = f_rows
    full23_total_rows          += f_rows
    elapsed = time.time() - t0
    print(f"     ‚úÖ {f_rows:>12,} rows  (elapsed {elapsed:.1f}s)           ")

elapsed_total = time.time() - t0
print(f"\n{'‚îÄ'*65}")
print(f"  ‚úÖ Full scan complete in {elapsed_total:.1f}s")
print(f"  Grand total rows: {full23_total_rows:,}")
if errors_list:
    print(f"  ‚ö†Ô∏è  Errors in {len(errors_list)} file(s): {errors_list}")
print(f"{'‚îÄ'*65}")

# ‚îÄ‚îÄ Per-scenario row counts ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\nüìã PER-SCENARIO ROW COUNTS (all rows):")
for scen, rows in full23_file_rows.items():
    pct = rows / max(full23_total_rows, 1) * 100
    print(f"   {scen:45s} : {rows:>12,}  ({pct:.2f}%)")
print(f"   {'TOTAL':45s} : {full23_total_rows:>12,}")

# ‚îÄ‚îÄ Label distributions ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\n" + "=" * 65)
print("üè∑Ô∏è  FULL LABEL DISTRIBUTIONS (all rows)")
print("=" * 65)

print("\nüìä label (binary / high-level):")
for val, cnt in sorted(full23_label_counts.items(), key=lambda x: -x[1]):
    pct = cnt / max(full23_total_rows, 1) * 100
    print(f"   {str(val):30s} : {cnt:>12,}  ({pct:.4f}%)")

print(f"\nüìä detailed-label ({len(full23_det_label_counts)} unique values):")
for val, cnt in sorted(full23_det_label_counts.items(), key=lambda x: -x[1])[:40]:
    pct = cnt / max(full23_total_rows, 1) * 100
    print(f"   {str(val):40s} : {cnt:>12,}  ({pct:.4f}%)")
if len(full23_det_label_counts) > 40:
    print(f"   ... ({len(full23_det_label_counts)-40} more values not shown)")

# ‚îÄ‚îÄ Proto + conn_state ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\n" + "=" * 65)
print("üîå FULL PROTOCOL & CONNECTION STATE DISTRIBUTIONS")
print("=" * 65)

print("\nüìä proto:")
for val, cnt in sorted(full23_proto_counts.items(), key=lambda x: -x[1]):
    pct = cnt / max(full23_total_rows, 1) * 100
    print(f"   {str(val):15s} : {cnt:>12,}  ({pct:.4f}%)")

print("\nüìä conn_state:")
for val, cnt in sorted(full23_conn_counts.items(), key=lambda x: -x[1]):
    pct = cnt / max(full23_total_rows, 1) * 100
    print(f"   {str(val):15s} : {cnt:>12,}  ({pct:.4f}%)")

# ‚îÄ‚îÄ Sentinel counts ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\n" + "=" * 65)
print("üîë SENTINEL VALUE COUNTS (across all cols, all rows)")
print("=" * 65)
for s_val, cnt in full23_sentinel_counts.items():
    pct = cnt / max(full23_total_rows * len(inventory_df.iloc[0]["column_names"]), 1) * 100
    print(f"   '{s_val:8s}' : {cnt:>14,} cell occurrences")

# ‚îÄ‚îÄ Full numerical stats ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\n" + "=" * 65)
print("üìä FULL NUMERICAL STATISTICS (all rows, running method)")
print("=" * 65)

full23_stats_rows = []
for col, st in full23_num_stats.items():
    if st["count"] > 0:
        mean_val = st["sum"] / st["count"]
        full23_stats_rows.append({
            "column": col,
            "count":  st["count"],
            "mean":   round(mean_val, 6),
            "min":    round(st["min"], 6),
            "max":    round(st["max"], 6),
        })
full23_stats_df = pd.DataFrame(full23_stats_rows)
display(full23_stats_df)

# ‚îÄ‚îÄ Save artifacts ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
full23_stats_df.to_csv(artifacts_dir / "iot23_fullscan_numerical_stats.csv", index=False)

full23_label_rows = [
    {"label_level": "label",          "value": k, "count": v,
     "pct": round(v/max(full23_total_rows,1)*100, 4)}
    for k, v in full23_label_counts.items()
] + [
    {"label_level": "detailed-label", "value": k, "count": v,
     "pct": round(v/max(full23_total_rows,1)*100, 4)}
    for k, v in full23_det_label_counts.items()
]
full23_label_df = pd.DataFrame(full23_label_rows).sort_values(["label_level","count"], ascending=[True,False])
full23_label_df.to_csv(artifacts_dir / "iot23_fullscan_label_distribution.csv", index=False)

full23_row_df = pd.DataFrame([{"scenario": k, "total_rows": v} for k, v in full23_file_rows.items()])
full23_row_df.loc[len(full23_row_df)] = {"scenario": "TOTAL", "total_rows": full23_total_rows}
full23_row_df.to_csv(artifacts_dir / "iot23_fullscan_row_counts.csv", index=False)

print(f"\n‚úÖ Full-scan artifacts saved:")
print(f"   ‚Ä¢ iot23_fullscan_numerical_stats.csv")
print(f"   ‚Ä¢ iot23_fullscan_label_distribution.csv")
print(f"   ‚Ä¢ iot23_fullscan_row_counts.csv")
print(f"\nüìù NOTE: df in memory still holds the 100k-per-file Phase 0 sample.")
print(f"         Phase 1 will load data using chunked processing.")

üîç FULL DATASET SCAN ‚Äî ALL ROWS (IoT-23, 500k-row chunks)
   Scanning 23 scenario files...

  üìÇ CTU-Honeypot-Capture-4-1
     ‚úÖ          452 rows  (elapsed 0.0s)           

  üìÇ CTU-Honeypot-Capture-5-1
     ‚úÖ        1,374 rows  (elapsed 0.0s)           

  üìÇ Somfy-01
     ‚úÖ          130 rows  (elapsed 0.0s)           

  üìÇ CTU-IoT-Malware-Capture-1-1
     ‚úÖ    1,008,748 rows  (elapsed 7.4s)           

  üìÇ CTU-IoT-Malware-Capture-17-1
     ‚úÖ   54,659,855 rows  (elapsed 354.8s)           

  üìÇ CTU-IoT-Malware-Capture-20-1
     ‚úÖ        3,209 rows  (elapsed 354.9s)           

  üìÇ CTU-IoT-Malware-Capture-21-1
     ‚úÖ        3,286 rows  (elapsed 354.9s)           

  üìÇ CTU-IoT-Malware-Capture-3-1
     ‚úÖ      156,103 rows  (elapsed 355.9s)           

  üìÇ CTU-IoT-Malware-Capture-33-1
     ‚úÖ   54,454,591 rows  (elapsed 726.1s)           

  üìÇ CTU-IoT-Malware-Capture-34-1
     ‚úÖ       23,145 rows  (elapsed 726.3s)           

  üìÇ CTU-I

Unnamed: 0,column,count,mean,min,max
0,id.orig_p,325309946,36687.67,0.0,65535.0
1,id.resp_p,325309946,19234.04,0.0,65535.0
2,missed_bytes,325309946,12.47,0.0,1908819480.0
3,orig_pkts,325309946,2.27,0.0,66027354.0
4,orig_ip_bytes,325309946,95.47,0.0,1914793266.0
5,resp_pkts,325309946,0.0,0.0,239484.0
6,resp_ip_bytes,325309946,1.34,0.0,349618679.0
7,duration,89732261,0.59,0.0,93280.03
8,orig_bytes,89732261,23865201.43,0.0,66205578295.0
9,resp_bytes,89732261,388.11,0.0,31720511878.0


KeyError: 'label_level'

In [46]:
# ============================================================
# FIX: Export full-scan artifacts (handles empty label dicts)
# Scan already ran above ‚Äî we reuse the accumulators in memory.
# ============================================================

# ‚îÄ‚îÄ Diagnose label accumulation ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("=" * 65)
print("üîç DIAGNOSING LABEL ACCUMULATION FROM FULL SCAN")
print("=" * 65)
print(f"\n  full23_total_rows          : {full23_total_rows:,}")
print(f"  full23_label_counts        : {len(full23_label_counts)} unique values")
print(f"  full23_det_label_counts    : {len(full23_det_label_counts)} unique values")
print(f"  full23_proto_counts        : {len(full23_proto_counts)} unique values")
print(f"  full23_conn_counts         : {len(full23_conn_counts)} unique values")

if full23_label_counts:
    print(f"\n  üìä label top values:")
    for k, v in sorted(full23_label_counts.items(), key=lambda x: -x[1])[:10]:
        print(f"     {str(k):35s}: {v:>12,}  ({v/full23_total_rows*100:.4f}%)")
else:
    print(f"\n  ‚ö†Ô∏è  full23_label_counts is EMPTY ‚Äî 'label' column was not found in chunks.")
    print(f"      Check: what columns were actually in the Zeek files?")
    # Peek at first inventory file columns
    import pathlib
    first_path = pathlib.Path(inventory_df["file_path"].iloc[0])
    first_cols = inventory_df["column_names"].iloc[0]
    print(f"\n  Columns in first file ({inventory_df['scenario_name'].iloc[0]}):")
    for i, c in enumerate(first_cols):
        print(f"     {i:>2}. '{c}'")

if full23_det_label_counts:
    print(f"\n  üìä detailed-label top values (showing up to 20):")
    for k, v in sorted(full23_det_label_counts.items(), key=lambda x: -x[1])[:20]:
        print(f"     {str(k):40s}: {v:>12,}  ({v/full23_total_rows*100:.4f}%)")
else:
    print(f"\n  ‚ö†Ô∏è  full23_det_label_counts is EMPTY ‚Äî 'detailed-label' column not found.")

# ‚îÄ‚îÄ Proto + conn_state summary ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
if full23_proto_counts:
    print(f"\n  üìä proto distribution:")
    for k, v in sorted(full23_proto_counts.items(), key=lambda x: -x[1])[:10]:
        print(f"     {str(k):15s}: {v:>12,}  ({v/full23_total_rows*100:.4f}%)")

if full23_conn_counts:
    print(f"\n  üìä conn_state distribution:")
    for k, v in sorted(full23_conn_counts.items(), key=lambda x: -x[1])[:10]:
        print(f"     {str(k):15s}: {v:>12,}  ({v/full23_total_rows*100:.4f}%)")

# ‚îÄ‚îÄ Save numerical stats (always present) ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
full23_stats_df.to_csv(artifacts_dir / "iot23_fullscan_numerical_stats.csv", index=False)
print(f"\n‚úÖ iot23_fullscan_numerical_stats.csv saved ({len(full23_stats_df)} rows)")

# ‚îÄ‚îÄ Save label distributions (graceful: handle empty dicts) ‚îÄ‚îÄ‚îÄ
label_rows_safe = []
for k, v in full23_label_counts.items():
    label_rows_safe.append({"label_level": "label", "value": k, "count": v,
                             "pct": round(v/max(full23_total_rows,1)*100, 4)})
for k, v in full23_det_label_counts.items():
    label_rows_safe.append({"label_level": "detailed-label", "value": k, "count": v,
                             "pct": round(v/max(full23_total_rows,1)*100, 4)})

if label_rows_safe:
    safe_label_df = pd.DataFrame(label_rows_safe)
    # Only sort if columns exist
    sort_cols = [c for c in ["label_level","count"] if c in safe_label_df.columns]
    asc_flags  = [True, False][:len(sort_cols)]
    if sort_cols:
        safe_label_df = safe_label_df.sort_values(sort_cols, ascending=asc_flags)
    safe_label_df.to_csv(artifacts_dir / "iot23_fullscan_label_distribution.csv", index=False)
    print(f"‚úÖ iot23_fullscan_label_distribution.csv saved ({len(safe_label_df)} rows)")
else:
    print("‚ö†Ô∏è  Label distribution CSV NOT saved ‚Äî both label dicts are empty.")
    print("   This means the 'label' and 'detailed-label' column names were not found.")
    print("   Check inventory_df column_names to see the actual column names used.")

# ‚îÄ‚îÄ Save row counts per scenario ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
full23_row_df = pd.DataFrame([{"scenario": k, "total_rows": v}
                               for k, v in full23_file_rows.items()])
grand_row = pd.DataFrame([{"scenario": "TOTAL", "total_rows": full23_total_rows}])
full23_row_df = pd.concat([full23_row_df, grand_row], ignore_index=True)
full23_row_df.to_csv(artifacts_dir / "iot23_fullscan_row_counts.csv", index=False)
print(f"‚úÖ iot23_fullscan_row_counts.csv saved ({len(full23_row_df)-1} scenarios + TOTAL)")

# ‚îÄ‚îÄ Sentinel summary ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print(f"\nüìã SENTINEL VALUE COUNTS (all rows, all columns):")
for s_val, cnt in full23_sentinel_counts.items():
    print(f"   '{s_val}' : {cnt:>14,} total cell occurrences")

print(f"\nüìù Grand total rows scanned: {full23_total_rows:,}")

üîç DIAGNOSING LABEL ACCUMULATION FROM FULL SCAN

  full23_total_rows          : 325,309,946
  full23_label_counts        : 0 unique values
  full23_det_label_counts    : 0 unique values
  full23_proto_counts        : 3 unique values
  full23_conn_counts         : 13 unique values

  ‚ö†Ô∏è  full23_label_counts is EMPTY ‚Äî 'label' column was not found in chunks.
      Check: what columns were actually in the Zeek files?

  Columns in first file (CTU-Honeypot-Capture-4-1):
      0. 'ts'
      1. 'uid'
      2. 'id.orig_h'
      3. 'id.orig_p'
      4. 'id.resp_h'
      5. 'id.resp_p'
      6. 'proto'
      7. 'service'
      8. 'duration'
      9. 'orig_bytes'
     10. 'resp_bytes'
     11. 'conn_state'
     12. 'local_orig'
     13. 'local_resp'
     14. 'missed_bytes'
     15. 'history'
     16. 'orig_pkts'
     17. 'orig_ip_bytes'
     18. 'resp_pkts'
     19. 'resp_ip_bytes'
     20. 'tunnel_parents   label   detailed-label'

  ‚ö†Ô∏è  full23_det_label_counts is EMPTY ‚Äî 'detaile

In [47]:
# ============================================================
# FIX: Label-only targeted scan with corrected column names
# The IoT-23 #fields header uses spaces (not tabs) to separate
# tunnel_parents / label / detailed-label. expand_compound_cols()
# splits such compound tokens so pandas gets the right column count.
# This scan reads only the label columns ‚Äî much faster than the full scan.
# ============================================================
import time
import pathlib

FULL_CHUNKSIZE = 500_000

def expand_compound_cols(col_list):
    """
    Expand any column name that contains spaces into multiple names.
    e.g. ['tunnel_parents   label   detailed-label']
      -> ['tunnel_parents', 'label', 'detailed-label']
    """
    expanded = []
    for c in col_list:
        if ' ' in c and not c.strip().startswith('#'):
            # Split by whitespace ‚Üí multiple actual column names
            parts = [p.strip() for p in c.split() if p.strip()]
            expanded.extend(parts)
        else:
            expanded.append(c)
    return expanded

# ‚îÄ‚îÄ Verify fix on first file ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
first_raw   = inventory_df["column_names"].iloc[0]
first_fixed = expand_compound_cols(first_raw)
print(f"Original column count : {len(first_raw)}")
print(f"Expanded column count : {len(first_fixed)}")
print(f"Columns 19-22        : {first_fixed[19:]}")

# ‚îÄ‚îÄ Targeted label scan (label + detailed-label only) ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\n" + "=" * 65)
print("üè∑Ô∏è  TARGETED LABEL SCAN ‚Äî all 23 scenarios, chunks of 500k")
print("   (uses corrected column names to capture label/detailed-label)")
print("=" * 65)

full23_label_counts_v2      = {}
full23_det_label_counts_v2  = {}
full23_rows_v2              = 0

t0 = time.time()
errors_v2 = []

for _, inv_row in inventory_df.iterrows():
    fpath    = pathlib.Path(inv_row["file_path"])
    scenario = inv_row["scenario_name"]
    raw_cols = inv_row["column_names"]
    columns  = expand_compound_cols(raw_cols)     # ‚Üê FIXED column list

    f_rows = 0
    try:
        reader = pd.read_csv(
            fpath, sep="\t",
            skiprows=8,
            header=None,
            names=columns,
            chunksize=FULL_CHUNKSIZE,
            low_memory=False,
            dtype=str,
        )
        for chunk_i, chunk in enumerate(reader):
            # Drop Zeek footer / metadata rows
            first_col = columns[0]
            chunk = chunk[~chunk[first_col].astype(str).str.startswith("#")]
            n = len(chunk)
            f_rows         += n
            full23_rows_v2 += n

            # Accumulate label distributions
            if "label" in chunk.columns:
                for val, cnt in chunk["label"].value_counts(dropna=False).items():
                    k = str(val)
                    full23_label_counts_v2[k] = full23_label_counts_v2.get(k, 0) + int(cnt)
            if "detailed-label" in chunk.columns:
                for val, cnt in chunk["detailed-label"].value_counts(dropna=False).items():
                    k = str(val)
                    full23_det_label_counts_v2[k] = full23_det_label_counts_v2.get(k, 0) + int(cnt)

            print(f"     chunk {chunk_i+1:>4d} ‚Äî {f_rows:>10,} rows", end="\r")

    except Exception as e:
        print(f"\n     ‚ö†Ô∏è  Error in {scenario}: {e}")
        errors_v2.append(f"{scenario}: {e}")
        continue

    elapsed = time.time() - t0
    print(f"  ‚úÖ {scenario:45s}: {f_rows:>12,} rows  ({elapsed:.1f}s)  ")

elapsed_total = time.time() - t0
print(f"\n{'‚îÄ'*65}")
print(f"  Scan complete in {elapsed_total:.1f}s")
print(f"  Total rows verified: {full23_rows_v2:,}")
print(f"  label unique values: {len(full23_label_counts_v2)}")
print(f"  detailed-label unique: {len(full23_det_label_counts_v2)}")
print(f"{'‚îÄ'*65}")

# ‚îÄ‚îÄ Display results ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
if full23_label_counts_v2:
    total = full23_rows_v2
    print("\nüìä label FULL DISTRIBUTION:")
    for val, cnt in sorted(full23_label_counts_v2.items(), key=lambda x: -x[1]):
        pct = cnt / max(total, 1) * 100
        print(f"   {str(val):30s} : {cnt:>12,}  ({pct:.4f}%)")

    print(f"\nüìä detailed-label FULL DISTRIBUTION ({len(full23_det_label_counts_v2)} unique):")
    for val, cnt in sorted(full23_det_label_counts_v2.items(), key=lambda x: -x[1]):
        pct = cnt / max(total, 1) * 100
        print(f"   {str(val):45s} : {cnt:>12,}  ({pct:.4f}%)")
else:
    print("\n‚ö†Ô∏è  Still empty ‚Äî check column names:")
    print(f"   Fixed columns: {first_fixed}")

# ‚îÄ‚îÄ Save label artifacts ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
if full23_label_counts_v2:
    combined_rows = []
    for k, v in full23_label_counts_v2.items():
        combined_rows.append({"label_level":"label", "value": k, "count": v,
                               "pct": round(v/max(full23_rows_v2,1)*100, 4)})
    for k, v in full23_det_label_counts_v2.items():
        combined_rows.append({"label_level":"detailed-label", "value": k, "count": v,
                               "pct": round(v/max(full23_rows_v2,1)*100, 4)})
    ldf = (pd.DataFrame(combined_rows)
             .sort_values(["label_level","count"], ascending=[True,False])
             .reset_index(drop=True))
    ldf.to_csv(artifacts_dir / "iot23_fullscan_label_distribution.csv", index=False)
    print(f"\n‚úÖ iot23_fullscan_label_distribution.csv saved ({len(ldf)} rows)")

Original column count : 21
Expanded column count : 23
Columns 19-22        : ['resp_ip_bytes', 'tunnel_parents', 'label', 'detailed-label']

üè∑Ô∏è  TARGETED LABEL SCAN ‚Äî all 23 scenarios, chunks of 500k
   (uses corrected column names to capture label/detailed-label)
  ‚úÖ CTU-Honeypot-Capture-4-1                     :          452 rows  (0.0s)  
  ‚úÖ CTU-Honeypot-Capture-5-1                     :        1,374 rows  (0.1s)  
  ‚úÖ Somfy-01                                     :          130 rows  (0.1s)  
  ‚úÖ CTU-IoT-Malware-Capture-1-1                  :    1,008,748 rows  (3.3s)  
  ‚úÖ CTU-IoT-Malware-Capture-17-1                 :   54,659,855 rows  (182.5s)  
  ‚úÖ CTU-IoT-Malware-Capture-20-1                 :        3,209 rows  (182.5s)  
  ‚úÖ CTU-IoT-Malware-Capture-21-1                 :        3,286 rows  (182.6s)  
  ‚úÖ CTU-IoT-Malware-Capture-3-1                  :      156,103 rows  (183.0s)  
  ‚úÖ CTU-IoT-Malware-Capture-33-1                 :   54,454,591 rows  

In [48]:
# ============================================================
# DIAGNOSTIC: peek at actual raw data lines in an IoT-23 file
# to understand the separator used in the data rows for labels
# ============================================================
import pathlib

first_path = pathlib.Path(inventory_df["file_path"].iloc[1])   # use scenario 2 (larger)
print(f"Peeking at: {first_path.name}")
print(f"Scenario  : {inventory_df['scenario_name'].iloc[1]}\n")

with open(first_path, "r", encoding="utf-8", errors="replace") as f:
    for i, line in enumerate(f):
        if i < 10:
            # Show header lines
            repr_line = repr(line[:200])
            print(f"  Line {i:>2}: {repr_line}")
        elif i < 15:
            # Show first few data lines (after the 8 metadata lines)
            repr_line = repr(line[:300])
            print(f"  Line {i:>2}: {repr_line}")
        else:
            break

print("\nColumn list (21 raw):")
for idx, c in enumerate(inventory_df["column_names"].iloc[1]):
    print(f"   {idx:>2}. {repr(c)}")

Peeking at: conn.log.labeled
Scenario  : CTU-Honeypot-Capture-5-1

  Line  0: '#separator \\x09\n'
  Line  1: '#set_separator\t,\n'
  Line  2: '#empty_field\t(empty)\n'
  Line  3: '#unset_field\t-\n'
  Line  4: '#path\tconn\n'
  Line  5: '#open\t2019-01-03-20-02-04\n'
  Line  6: '#fields\tts\tuid\tid.orig_h\tid.orig_p\tid.resp_h\tid.resp_p\tproto\tservice\tduration\torig_bytes\tresp_bytes\tconn_state\tlocal_orig\tlocal_resp\tmissed_bytes\thistory\torig_pkts\torig_ip_bytes\tresp_pkts\tresp_ip_byte'
  Line  7: '#types\ttime\tstring\taddr\tport\taddr\tport\tenum\tstring\tinterval\tcount\tcount\tstring\tbool\tbool\tcount\tstring\tcount\tcount\tcount\tcount\tset[string]   string   string\n'
  Line  8: '1537522822.965530\tCJAF5z3MDFg4XVDXB\t0.0.0.0\t68\t255.255.255.255\t67\tudp\tdhcp\t8.322388\t600\t0\tS0\t-\t-\t0\tD\t2\t656\t0\t0\t-   benign   -\n'
  Line  9: '1537522897.732295\tCcYEFX3Qj9xdNX1ZCa\t192.168.2.1\t5353\t224.0.0.251\t5353\tudp\tdns\t-\t-\t-\tS0\t-\t-\t0\tD\t1\t391\t0\t0\t-   be

In [49]:
# ============================================================
# FINAL LABEL SCAN ‚Äî split the compound last column correctly
# Format: 21 tab-separated columns; column 20 = "tp   label   det-label"
# The 3 sub-values are separated by exactly three spaces "   "
# ============================================================
import time, pathlib

FULL_CHUNKSIZE = 500_000
COMPOUND_SEP   = "   "          # 3 spaces as used in the file

full23_label_counts_v3     = {}
full23_det_label_counts_v3 = {}
full23_rows_v3             = 0

COMPOUND_COL = inventory_df["column_names"].iloc[0][-1]   # 'tunnel_parents   label   detailed-label'
print(f"Compound column name: {repr(COMPOUND_COL)}")
print(f"Split separator     : {repr(COMPOUND_SEP)}")
print(f"Expected sub-cols   : {COMPOUND_COL.split(COMPOUND_SEP)}")

# ‚îÄ‚îÄ Quick sanity test on 5 rows ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
test_path = pathlib.Path(inventory_df["file_path"].iloc[1])
test_cols  = inventory_df["column_names"].iloc[1]          # 21-element list
test_chunk = pd.read_csv(
    test_path, sep="\t", skiprows=8, header=None,
    names=test_cols, nrows=5, dtype=str, low_memory=False
)
print(f"\nSanity check ‚Äî last col values in first 5 rows:")
for val in test_chunk[COMPOUND_COL]:
    parts = str(val).split(COMPOUND_SEP)
    print(f"   raw: {repr(str(val)[:60])}  ‚Üí  split: {parts}")

del test_chunk

# ‚îÄ‚îÄ Full scan ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\n" + "=" * 65)
print("üè∑Ô∏è  FINAL LABEL SCAN ‚Äî splitting compound column correctly")
print("=" * 65)

t0 = time.time()
errors_v3 = []

for _, inv_row in inventory_df.iterrows():
    fpath    = pathlib.Path(inv_row["file_path"])
    scenario = inv_row["scenario_name"]
    columns  = inv_row["column_names"]         # 21-element list ‚Äî use as-is
    comp_col = columns[-1]                     # the compound col

    f_rows = 0
    try:
        reader = pd.read_csv(
            fpath, sep="\t", skiprows=8, header=None,
            names=columns, chunksize=FULL_CHUNKSIZE,
            low_memory=False, dtype=str,
        )
        for chunk_i, chunk in enumerate(reader):
            # Drop Zeek footer lines
            chunk = chunk[~chunk[columns[0]].astype(str).str.startswith("#")]
            n = len(chunk)
            f_rows         += n
            full23_rows_v3 += n

            # Split compound column ‚Üí extract label + detailed-label
            if comp_col in chunk.columns:
                split_df = (
                    chunk[comp_col]
                    .astype(str)
                    .str.split(COMPOUND_SEP, expand=True)
                )
                # split_df columns: 0=tunnel_parents, 1=label, 2=detailed-label
                if split_df.shape[1] >= 2:
                    label_series = split_df[1].str.strip()
                    for val, cnt in label_series.value_counts(dropna=False).items():
                        k = str(val)
                        full23_label_counts_v3[k] = full23_label_counts_v3.get(k,0) + int(cnt)
                if split_df.shape[1] >= 3:
                    det_series = split_df[2].str.strip()
                    for val, cnt in det_series.value_counts(dropna=False).items():
                        k = str(val)
                        full23_det_label_counts_v3[k] = full23_det_label_counts_v3.get(k,0) + int(cnt)

            print(f"     chunk {chunk_i+1:>4d} ‚Äî {f_rows:>10,} rows  ({len(full23_label_counts_v3)} labels so far)", end="\r")

    except Exception as e:
        print(f"\n     ‚ö†Ô∏è  Error {scenario}: {e}")
        errors_v3.append(f"{scenario}: {e}")
        continue

    elapsed = time.time() - t0
    print(f"  ‚úÖ {scenario:45s}: {f_rows:>12,} rows  ({elapsed:.1f}s)   ")

elapsed_total = time.time() - t0
print(f"\n{'‚îÄ'*65}")
print(f"  Scan done: {elapsed_total:.1f}s  |  {full23_rows_v3:,} rows  |  {len(full23_label_counts_v3)} label values")
print(f"{'‚îÄ'*65}")

# ‚îÄ‚îÄ Display results ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
total = full23_rows_v3
print(f"\nüìä label FULL DISTRIBUTION ({len(full23_label_counts_v3)} unique values):")
for val, cnt in sorted(full23_label_counts_v3.items(), key=lambda x: -x[1]):
    pct = cnt / max(total,1) * 100
    print(f"   {str(val):30s} : {cnt:>12,}  ({pct:.4f}%)")

print(f"\nüìä detailed-label FULL DISTRIBUTION ({len(full23_det_label_counts_v3)} unique values):")
for val, cnt in sorted(full23_det_label_counts_v3.items(), key=lambda x: -x[1]):
    pct = cnt / max(total,1) * 100
    print(f"   {str(val):45s} : {cnt:>12,}  ({pct:.4f}%)")

# ‚îÄ‚îÄ Save artifact ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
combined_v3 = []
for k, v in full23_label_counts_v3.items():
    combined_v3.append({"label_level":"label","value":k,"count":v,
                         "pct":round(v/max(total,1)*100,4)})
for k, v in full23_det_label_counts_v3.items():
    combined_v3.append({"label_level":"detailed-label","value":k,"count":v,
                         "pct":round(v/max(total,1)*100,4)})
ldf_v3 = (pd.DataFrame(combined_v3)
             .sort_values(["label_level","count"], ascending=[True,False])
             .reset_index(drop=True))
ldf_v3.to_csv(artifacts_dir / "iot23_fullscan_label_distribution.csv", index=False)
print(f"\n‚úÖ iot23_fullscan_label_distribution.csv saved ({len(ldf_v3)} rows)")

Compound column name: 'tunnel_parents   label   detailed-label'
Split separator     : '   '
Expected sub-cols   : ['tunnel_parents', 'label', 'detailed-label']

Sanity check ‚Äî last col values in first 5 rows:
   raw: '-   benign   -'  ‚Üí  split: ['-', 'benign', '-']
   raw: '-   benign   -'  ‚Üí  split: ['-', 'benign', '-']
   raw: '-   benign   -'  ‚Üí  split: ['-', 'benign', '-']
   raw: '-   benign   -'  ‚Üí  split: ['-', 'benign', '-']
   raw: '-   benign   -'  ‚Üí  split: ['-', 'benign', '-']

üè∑Ô∏è  FINAL LABEL SCAN ‚Äî splitting compound column correctly
  ‚úÖ CTU-Honeypot-Capture-4-1                     :          452 rows  (0.1s)   
  ‚úÖ CTU-Honeypot-Capture-5-1                     :        1,374 rows  (0.1s)   
  ‚úÖ Somfy-01                                     :          130 rows  (0.1s)   
  ‚úÖ CTU-IoT-Malware-Capture-1-1                  :    1,008,748 rows  (12.8s)   
  ‚úÖ CTU-IoT-Malware-Capture-17-1                 :   54,659,855 rows  (778.8s)   
  ‚úÖ CTU-IoT-

---
## Phase 0.2 Complete ‚úÖ

### What this notebook established (read-only)

| Section | Content | Status |
|---------|---------|--------|
| 0.1 ‚Äî File Inventory | 23 CTU-* scenarios, TSV format, 8 metadata header lines, `#close` footer | ‚úÖ |
| 0.2 ‚Äî Column Inventory | dtype, null%, unique counts for all columns | ‚úÖ |
| 0.3 ‚Äî Categorical Analysis | Value distributions; dash/`?`/F/T/NaN sentinel counts per column | ‚úÖ |
| 0.4 ‚Äî Numerical Semantics | Stats table (min/max/mean/skew) + IoT-23 Zeek field semantic annotations | ‚úÖ |
| 0.5 ‚Äî Feature Meaning | Full description, protocol context, populated/empty conditions, TON-IoT cross-reference | ‚úÖ |
| 0.6 ‚Äî Role Classification | Behavioral / Identifier / Contextual / Label / Metadata tags (no dropping) | ‚úÖ |
| 0.7 ‚Äî Summary + Artifacts | 7 challenge flags, 6 CSV artifacts, full markdown report | ‚úÖ |

### Key findings
- **Format** ‚Äî Zeek/Bro TSV; 8 metadata headers; `-` is the null sentinel; `#close` is the footer
- **Scale** ‚Äî 23 CTU-* scenarios; 100k rows sampled per file to stay within RAM
- **Schema** ‚Äî All columns arrive as `object` (string); numeric cast is a Phase 1 task
- **Sentinels** ‚Äî `-` (not applicable), `?` (unknown), `F`/`T` (boolean) ‚Äî NOT missing data
- **Labels** ‚Äî Dual-label structure (`label` + `detailed-label`) confirmed; severe class imbalance
- **Cross-dataset** ‚Äî ToN-IoT ‚Üî IoT-23 column mapping documented in Section 0.5

### Next steps

| Phase | Notebook | Purpose |
|-------|----------|---------|
| Phase 0.3 | `Phase_0_3_BoT_IoT_Data_Understanding.ipynb` | EDA on BoT-IoT dataset |
| Phase 1 | `Phase_1_Preprocessing.ipynb` | Feature casting, encoding, imputation strategy |