## Step 1: train_model.ipynb

We completed this notebook to:
1. Generate synthetic normal sensor/network data.
2. Preprocess it for numerical input.
3. Train an Isolation Forest anomaly detection model.
4. Save the trained model for use in `client.py`.

These requirements come from the project specification (Step 1: Complete train_model.ipynb).

In [1]:
# Imports
import os
import json
import pandas as pd
import numpy as np
import random
from sklearn.ensemble import IsolationForest
import joblib

### Setup directories
- Create the `dataset/` and `models/` folders so the paths used below exist in the Colab environment.

In [2]:
# Ensure dataset and models directories exist
os.makedirs("dataset", exist_ok=True)
os.makedirs("models", exist_ok=True)
print("Directories 'dataset/' and 'models/' are ready.")

Directories 'dataset/' and 'models/' are ready.


### Generate Synthetic Normal Data

Create synthetic normal network/sensor data samples. Each sample includes:
- `src_port`: randomly chosen from the common ports 80, 443, 22, and 8080.
- `dst_port`: random high port (1024 to 65535).
- `packet_size`: random integer from 100 to 1500.
- `duration_ms`: random integer from 50 to 500.
- `protocol`: either "TCP" or "UDP".

Save to `dataset/training_data.json`.

In [6]:
COMMON_PORTS = [80, 443, 22, 8080]

def generate_normal_data_entry():
    return {
        "src_port": random.choice(COMMON_PORTS),
        "dst_port": random.randint(1024, 65535),
        "packet_size": random.randint(100, 1500),
        "duration_ms": random.randint(50, 500),
        "protocol": random.choice(["TCP", "UDP"])
    }

# Number of samples
n_samples = 1000000
dataset = [generate_normal_data_entry() for _ in range(n_samples)]

# Save synthetic data
training_data_path = "dataset/training_data.json"
with open(training_data_path, "w") as f:
    json.dump(dataset, f, indent=2)
print(f"Generated {n_samples} synthetic normal samples and saved to {training_data_path}.")

Generated 1000000 synthetic normal samples and saved to dataset/training_data.json.
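The generator above draws from Python's global `random` state, so each run produces a different dataset. A minimal sketch of a reproducible variant follows, assuming a dedicated `random.Random` instance is acceptable. The `generate_normal_dataset` name and the `seed` parameter are illustrative and not part of the original notebook or the spec:

```python
import random

COMMON_PORTS = [80, 443, 22, 8080]

def generate_normal_dataset(n_samples: int, seed: int = 42) -> list[dict]:
    """Generate n_samples normal entries from a private, seeded RNG."""
    rng = random.Random(seed)  # isolated from the global random state
    return [
        {
            "src_port": rng.choice(COMMON_PORTS),
            "dst_port": rng.randint(1024, 65535),
            "packet_size": rng.randint(100, 1500),
            "duration_ms": rng.randint(50, 500),
            "protocol": rng.choice(["TCP", "UDP"]),
        }
        for _ in range(n_samples)
    ]

# Same seed, same data: useful when comparing training runs.
sample_a = generate_normal_dataset(1000, seed=42)
sample_b = generate_normal_dataset(1000, seed=42)
print(sample_a == sample_b)
```

Using a private RNG keeps the dataset stable even if other cells call `random` in between.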


### Load and Inspect Raw Data
- Load from `dataset/training_data.json`.
- Display a sample and summary.

In [7]:
# Load synthetic data
with open("dataset/training_data.json") as f:
    raw_data = json.load(f)

df = pd.DataFrame(raw_data)
print("Raw data sample:")
display(df.head())
print("DataFrame info:")
display(df.info())
print("DataFrame descriptive statistics:")
display(df.describe(include='all'))

Raw data sample:


| | src_port | dst_port | packet_size | duration_ms | protocol |
|---|---|---|---|---|---|
| 0 | 80 | 42620 | 1136 | 355 | UDP |
| 1 | 8080 | 4051 | 972 | 394 | TCP |
| 2 | 8080 | 28568 | 150 | 418 | UDP |
| 3 | 80 | 7081 | 922 | 386 | UDP |
| 4 | 443 | 28675 | 467 | 410 | UDP |


DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 5 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   src_port     1000000 non-null  int64 
 1   dst_port     1000000 non-null  int64 
 2   packet_size  1000000 non-null  int64 
 3   duration_ms  1000000 non-null  int64 
 4   protocol     1000000 non-null  object
dtypes: int64(4), object(1)
memory usage: 38.1+ MB


None

DataFrame descriptive statistics:


| | src_port | dst_port | packet_size | duration_ms | protocol |
|---|---|---|---|---|---|
| count | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000 |
| unique | | | | | 2 |
| top | | | | | TCP |
| freq | | | | | 500514 |
| mean | 2157.051925 | 33307.841194 | 800.216515 | 275.037956 | |
| std | 3424.045698 | 18613.927309 | 404.815326 | 130.152793 | |
| min | 22.0 | 1024.0 | 100.0 | 50.0 | |
| 25% | 80.0 | 17214.0 | 449.0 | 162.0 | |
| 50% | 443.0 | 33314.0 | 801.0 | 275.0 | |
| 75% | 8080.0 | 49415.0 | 1151.0 | 388.0 | |


### Preprocessing Function

Convert categorical columns (e.g., `protocol`) via one-hot encoding and return numerical array.

In [8]:
def preprocess_data(df: pd.DataFrame) -> np.ndarray:
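    """
    One-hot encode the 'protocol' column (drop_first=True) and return the features as a NumPy array.

    Caveat: with drop_first=True the output columns depend on which protocol
    values appear in `df`, so every batch must contain both "TCP" and "UDP"
    to produce the same five columns.
    """
    df_copy = df.copy()
    if 'protocol' in df_copy.columns:
        # Drop the first category (TCP) so a single protocol_UDP column encodes the protocol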
        df_encoded = pd.get_dummies(df_copy, columns=['protocol'], drop_first=True)
    else:
        df_encoded = df_copy
    # Return NumPy array
    return df_encoded.values
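
Because `drop_first=True` derives the output columns from whichever protocol values are present, a batch containing only one protocol (for example, a single streamed sample) would yield four columns instead of five. A minimal sketch of a fixed-layout alternative follows, assuming `"TCP"` and `"UDP"` are the only protocols. `preprocess_fixed` and `NUMERIC_COLUMNS` are hypothetical names, not part of the notebook or the spec:

```python
import numpy as np
import pandas as pd

NUMERIC_COLUMNS = ["src_port", "dst_port", "packet_size", "duration_ms"]

def preprocess_fixed(df: pd.DataFrame) -> np.ndarray:
    """Return float features in a fixed column order, ending with protocol_UDP."""
    features = df[NUMERIC_COLUMNS].astype(float)
    # Encode the protocol explicitly so the column exists for every batch.
    features["protocol_UDP"] = (df["protocol"] == "UDP").astype(float)
    return features.to_numpy()

# A single TCP-only sample still produces five columns.
single = pd.DataFrame([{
    "src_port": 443, "dst_port": 50000, "packet_size": 512,
    "duration_ms": 120, "protocol": "TCP",
}])
X_single = preprocess_fixed(single)
print(X_single.shape, X_single.dtype)
```

A model trained on this layout would need the same function at inference time, so the two would have to be swapped together.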

### Preprocess Dataset
- Apply `preprocess_data` to the loaded DataFrame.
- Inspect resulting shape and a few rows.

In [9]:
# Preprocess features
X = preprocess_data(df)
# Determine column names after encoding for display
encoded_df = pd.get_dummies(df, columns=['protocol'], drop_first=True)
print("Preprocessed feature shape:", X.shape)
print("Feature columns after encoding:", list(encoded_df.columns))
display(pd.DataFrame(X, columns=encoded_df.columns).head())

Preprocessed feature shape: (1000000, 5)
Feature columns after encoding: ['src_port', 'dst_port', 'packet_size', 'duration_ms', 'protocol_UDP']


| | src_port | dst_port | packet_size | duration_ms | protocol_UDP |
|---|---|---|---|---|---|
| 0 | 80 | 42620 | 1136 | 355 | True |
| 1 | 8080 | 4051 | 972 | 394 | False |
| 2 | 8080 | 28568 | 150 | 418 | True |
| 3 | 80 | 7081 | 922 | 386 | True |
| 4 | 443 | 28675 | 467 | 410 | True |


### Train Isolation Forest

- Use `IsolationForest(n_estimators=100, contamination=0.01, random_state=42)` per spec.
- Fit on preprocessed data.

In [10]:
# Initialize Isolation Forest
model = IsolationForest(
    n_estimators=100,
    contamination=0.01,
    random_state=42
)

# Train model
model.fit(X)
print("Isolation Forest trained on synthetic normal data.")

Isolation Forest trained on synthetic normal data.


### Sanity Check on Training Set
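- Predict on the training data to see how many points are flagged as anomalies. About 1% is expected, because `contamination=0.01` sets the decision threshold at the 1st percentile of the training scores.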

In [11]:
# Sanity check: how many anomalies in training data?
preds_train = model.predict(X)  # 1 for normal, -1 for anomaly
n_anomalies = np.sum(preds_train == -1)
print(f"Flagged anomalies in training data: {n_anomalies} out of {len(X)} (~{n_anomalies/len(X)*100:.2f}%)")

Flagged anomalies in training data: 10000 out of 1000000 (~1.00%)
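The 1.00% figure follows almost directly from the configuration: with `contamination=0.01`, scikit-learn places the decision threshold at the 1st percentile of the training scores, so roughly 1% of the training set is flagged regardless of how clean the data is. A small sketch on hypothetical random data (not the notebook's dataset) illustrates this:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 5))  # hypothetical stand-in data

demo_model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
demo_model.fit(X_demo)

scores = demo_model.decision_function(X_demo)  # negative means anomaly
preds = demo_model.predict(X_demo)             # -1 means anomaly
flagged_fraction = np.mean(preds == -1)

print(f"Flagged fraction: {flagged_fraction:.4f}")
print("Labels agree with score sign:", np.array_equal(preds == -1, scores < 0))
```

The check therefore confirms that the model was fitted with the intended `contamination`, but it says little about detection quality on unseen data.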


### Save Trained Model

Serialize the trained model to `models/anomaly_model.joblib` for later use in `client.py`.

In [12]:
# Save model
model_path = "models/anomaly_model.joblib"
joblib.dump(model, model_path)
print(f"Model saved to {model_path}.")

Model saved to models/anomaly_model.joblib.


### Load and Test Saved Model
- Load from disk and make sample predictions to verify.

In [13]:
# Load the saved model
loaded_model = joblib.load("models/anomaly_model.joblib")
print("Loaded model from disk.")

# Sample prediction on first few training samples
sample_preds = loaded_model.predict(X[:5])
print("Sample predictions (1=normal, -1=anomaly):", sample_preds)

Loaded model from disk.
Sample predictions (1=normal, -1=anomaly): [1 1 1 1 1]
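Five samples give only a weak check. A slightly stronger round-trip test compares the reloaded model against the in-memory one on a whole batch. The sketch below uses hypothetical random data and a temporary directory. Storing the feature column order next to the model is an assumption about what `client.py` might need, not something the spec requires:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))  # hypothetical stand-in data
demo_model = IsolationForest(
    n_estimators=100, contamination=0.01, random_state=42
).fit(X_demo)

# Hypothetical bundle: the model plus the column order it was trained on.
bundle = {
    "model": demo_model,
    "feature_columns": [
        "src_port", "dst_port", "packet_size", "duration_ms", "protocol_UDP",
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "anomaly_bundle.joblib")
    joblib.dump(bundle, bundle_path)
    restored = joblib.load(bundle_path)

# The reloaded model should label every sample exactly as the original does.
same_predictions = np.array_equal(
    demo_model.predict(X_demo), restored["model"].predict(X_demo)
)
print("Round trip preserves predictions:", same_predictions)
```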


### Optional: Test on Synthetic Anomalies
- Generate out-of-distribution samples and check detection.

In [14]:
def generate_synthetic_anomalies(n=20):
    """Generate samples that differ significantly from the normal training data."""
    data = []
    for _ in range(n):
        data.append({
            # uncommon src_port values
            "src_port": random.choice([9999, 12345, 54321]),
            "dst_port": random.randint(1024, 65535),
            # much larger packet size
            "packet_size": random.randint(2000, 5000),
            # much longer duration
            "duration_ms": random.randint(1000, 5000),
            "protocol": random.choice(["TCP", "UDP"])
        })
    return pd.DataFrame(data)

# Generate and preprocess anomalies
df_anom = generate_synthetic_anomalies(20)
print("Synthetic anomaly samples:")
display(df_anom.head())
X_anom = preprocess_data(df_anom)

# Predict anomalies
preds_anom = loaded_model.predict(X_anom)
n_flagged = np.sum(preds_anom == -1)
print(f"Out of {len(X_anom)} synthetic anomalies, flagged as anomalies: {n_flagged}")
print("Predictions:", preds_anom)

Synthetic anomaly samples:


| | src_port | dst_port | packet_size | duration_ms | protocol |
|---|---|---|---|---|---|
| 0 | 9999 | 56978 | 3471 | 4958 | TCP |
| 1 | 54321 | 14508 | 3602 | 4647 | TCP |
| 2 | 12345 | 35849 | 2591 | 4836 | UDP |
| 3 | 9999 | 48596 | 3770 | 3229 | TCP |
| 4 | 12345 | 3295 | 2348 | 2566 | TCP |


Out of 20 synthetic anomalies, flagged as anomalies: 20
Predictions: [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
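Binary labels hide how far each sample sits from the threshold. The sketch below uses `decision_function`, where lower values are more anomalous, on small hypothetical batches drawn from the same ranges as above. The helper `make_batch` and the sample counts are illustrative, not part of the notebook:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

def make_batch(n, src_ports, size_range, duration_range):
    """Build an (n, 5) array: src_port, dst_port, size, duration, protocol_UDP."""
    return np.column_stack([
        rng.choice(src_ports, size=n),
        rng.integers(1024, 65536, size=n),
        rng.integers(size_range[0], size_range[1] + 1, size=n),
        rng.integers(duration_range[0], duration_range[1] + 1, size=n),
        rng.integers(0, 2, size=n),
    ]).astype(float)

X_normal = make_batch(5000, [80, 443, 22, 8080], (100, 1500), (50, 500))
X_odd = make_batch(20, [9999, 12345, 54321], (2000, 5000), (1000, 5000))

demo_model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
demo_model.fit(X_normal)

# Compare average scores of the two batches instead of only counting labels.
normal_scores = demo_model.decision_function(X_normal)
odd_scores = demo_model.decision_function(X_odd)
print(f"Mean score, normal batch:    {normal_scores.mean():.3f}")
print(f"Mean score, anomalous batch: {odd_scores.mean():.3f}")
```

Scores like these could also help pick an alerting threshold stricter or looser than the one implied by `contamination`.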




---
Step 1 is now complete: the notebook covers synthetic data generation, preprocessing, model training, sanity checks, and model saving, and the saved model is ready for `client.py` to use for streaming anomaly detection with LLM alerts.