### 🧩 Imports

Import the libraries used throughout the notebook:
- `json` for reading/writing JSON files.
- `pandas` and `numpy` for data manipulation.
- `random` for generating random synthetic data.
- `datetime` for reading the current hour when generating time-dependent traffic.
- `IsolationForest` from `sklearn` for anomaly detection.
- `joblib` for saving the trained model.

In [27]:
import json
import random
from datetime import datetime

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

### 📊 Generate Dataset

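Create synthetic "normal" network traffic data to train an anomaly detection model. Each call to `generate_normal_data()` returns one sample with these fields:

- `src_port`: randomly selected from common service ports (80, 443, 22, 8080).
- `dst_port`: a random high port number (1024 to 65535 for web ports, 49152 to 65535 for SSH).
- `packet_size`: drawn from a protocol-specific normal distribution, clamped to a realistic range, then adjusted per port.
- `duration_ms`: duration of the communication, shorter when the generator runs during business hours.
- `protocol`: randomly selected between TCP and UDP.

The generated samples are saved to `training_data.json` for future use.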

In [28]:
def generate_normal_data():
    """Generate synthetic normal network traffic"""
    COMMON_PORTS = [80, 443, 22, 8080]
    current_hour = datetime.now().hour
    is_business_hours = 9 <= current_hour <= 17
    
    # Protocol-specific packet size distributions
    protocol = random.choice(["TCP", "UDP"])
    if protocol == "TCP":
        packet_size = int(random.gauss(1000, 200))  # TCP: larger packets, normal distribution
        packet_size = max(500, min(1500, packet_size))  # Constrain to realistic range
    else:
        packet_size = int(random.gauss(400, 100))  # UDP: smaller packets
        packet_size = max(100, min(800, packet_size))
    
    # Time-based traffic patterns
    if is_business_hours:
        duration_ms = int(random.gauss(150, 50))  # Shorter durations during business hours
        duration_ms = max(50, min(300, duration_ms))
    else:
        duration_ms = int(random.gauss(300, 100))  # Longer durations off-hours
        duration_ms = max(100, min(500, duration_ms))
    
    # Port-specific behaviors
    src_port = random.choice(COMMON_PORTS)
    if src_port == 80:  # HTTP traffic
        packet_size = max(300, packet_size)  # Minimum size for HTTP
        duration_ms = max(100, duration_ms)  # Slightly longer for HTTP
    elif src_port == 443:  # HTTPS traffic
        packet_size = max(400, packet_size)  # Minimum size for HTTPS
    elif src_port == 22:  # SSH traffic
        packet_size = min(packet_size, 600)  # Smaller packets for SSH
        duration_ms = max(200, duration_ms)  # Longer duration for SSH
    elif src_port == 8080:  # Alternative HTTP
        packet_size = max(350, packet_size)
    
    # Destination port varies by source port
    if src_port in [80, 443, 8080]:
        dst_port = random.randint(1024, 65535)  # High ports for client connections
    else:  # SSH (port 22)
        dst_port = random.randint(49152, 65535)  # Ephemeral ports
    
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "packet_size": packet_size,
        "duration_ms": duration_ms,
        "protocol": protocol
    }
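
The function above returns a single sample, while the cells that follow load a whole file of them. A minimal sketch of the missing step, assuming 1000 samples (the row count reported in the training output), could look like this. `save_dataset` and `toy_generator` are hypothetical names; `toy_generator` is only a stand-in so the block runs on its own, and `generate_normal_data` would be passed instead.

```python
import json
import random
import tempfile
from pathlib import Path


def toy_generator():
    """Hypothetical stand-in for generate_normal_data(); returns one sample."""
    return {
        "src_port": random.choice([80, 443, 22, 8080]),
        "dst_port": random.randint(1024, 65535),
        "packet_size": random.randint(100, 1500),
        "duration_ms": random.randint(50, 500),
        "protocol": random.choice(["TCP", "UDP"]),
    }


def save_dataset(generator, n_samples, path):
    """Call `generator` n_samples times and write the records as a JSON list."""
    records = [generator() for _ in range(n_samples)]
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as f:
        json.dump(records, f, indent=2)
    return records


# Written to a temporary directory here; the notebook itself reads
# ../dataset/training_data.json.
out_path = Path(tempfile.mkdtemp()) / "training_data.json"
records = save_dataset(toy_generator, n_samples=1000, path=out_path)
print(len(records), "samples written to", out_path)
```

Storing the records as a JSON list of objects keeps the file directly loadable with `pd.DataFrame(json.load(f))`, which is how the notebook reads it.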

In [29]:
with open("../dataset/training_data.json") as f:
    raw_data = json.load(f)

df = pd.DataFrame(raw_data)
display(df)

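|     | src_port | dst_port | packet_size | duration_ms | protocol |
|-----|----------|----------|-------------|-------------|----------|
| 0   | 22       | 42597    | 1086        | 101         | UDP      |
| 1   | 443      | 31622    | 164         | 350         | TCP      |
| 2   | 80       | 40056    | 270         | 422         | TCP      |
| 3   | 80       | 10894    | 787         | 452         | UDP      |
| 4   | 443      | 63261    | 786         | 159         | TCP      |
| ... | ...      | ...      | ...         | ...         | ...      |
| 995 | 8080     | 45137    | 554         | 306         | TCP      |
| 996 | 443      | 44194    | 1216        | 448         | UDP      |
| 997 | 443      | 5546     | 411         | 438         | TCP      |
| 998 | 80       | 55060    | 687         | 378         | TCP      |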


### 🧼 Preprocessing Function

Machine learning models like Isolation Forest require **numerical input only**. Any categorical variables, such as the `protocol` column (`TCP`, `UDP`), must be converted into numbers.

We handle this with **one-hot encoding**, using `pd.get_dummies`.

#### 🛠️ Preprocessing Steps:

1. **Identify categorical columns**:
   - In our case, the `protocol` column is categorical (`TCP`, `UDP`).

2. **Use `pd.get_dummies`**:
   - This creates a new binary column for each category.
   - For instance:
     ```
     protocol
     ---------
     TCP   →   protocol_UDP = 0
     UDP   →   protocol_UDP = 1
     ```
   - Setting `drop_first=True` drops the first category (`TCP` here), since it can be inferred from the remaining column. This avoids feeding the model a redundant, perfectly correlated feature.

3. **Return a DataFrame with all numerical values**:
   - This is ready for model input.

> ✅ This preprocessing is essential to avoid errors during training and ensure the model can learn from categorical variables.
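
To make the encoding concrete, here is a small sketch on a toy DataFrame. The rows are made up for illustration and are not taken from the training file; `dtype=int` is passed only so the output prints as 0/1 instead of booleans.

```python
import pandas as pd

# Four toy rows covering both protocol values (illustrative data only)
toy = pd.DataFrame({
    "src_port": [80, 443, 22, 8080],
    "packet_size": [900, 1100, 450, 380],
    "protocol": ["TCP", "UDP", "TCP", "UDP"],
})

# drop_first=True removes the first category (TCP), leaving one indicator column
encoded = pd.get_dummies(toy, columns=["protocol"], drop_first=True, dtype=int)

print(encoded.columns.tolist())         # ['src_port', 'packet_size', 'protocol_UDP']
print(encoded["protocol_UDP"].tolist()) # [0, 1, 0, 1]
```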

In [30]:
def preprocess_data(df):
    """
    Preprocess the data for machine learning model input.
    Convert categorical variables to numerical using one-hot encoding.
    """
    # One-hot encode the protocol column (categorical variable)
    df_encoded = pd.get_dummies(df, columns=['protocol'], drop_first=True)
    
    # Convert to a float NumPy array for model input
    # (avoids a mixed int/bool object array on recent pandas versions)
    return df_encoded.to_numpy(dtype=float)

In [31]:
# Load and preprocess the training data
with open("../dataset/training_data.json") as f:
    raw_data = json.load(f)

df = pd.DataFrame(raw_data)
preprocessed_data = preprocess_data(df)
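
One caveat with `pd.get_dummies(..., drop_first=True)`: the resulting columns depend on which categories appear in the batch. A frame that contains only one protocol gets no `protocol_UDP` column at all, so its shape no longer matches the training data. A possible workaround, sketched here with a hypothetical `preprocess_fixed` helper and an assumed category list, is to declare the categories up front.

```python
import pandas as pd

PROTOCOLS = ["TCP", "UDP"]  # assumption: the full set of protocol values


def preprocess_fixed(df):
    """One-hot encode `protocol` against a fixed category list."""
    df = df.copy()
    df["protocol"] = pd.Categorical(df["protocol"], categories=PROTOCOLS)
    encoded = pd.get_dummies(df, columns=["protocol"], drop_first=True)
    return encoded.to_numpy(dtype=float)


# A single new observation, as it might arrive at inference time
single = pd.DataFrame([{
    "src_port": 443, "dst_port": 51000,
    "packet_size": 900, "duration_ms": 120, "protocol": "TCP",
}])

naive = pd.get_dummies(single, columns=["protocol"], drop_first=True)
fixed = preprocess_fixed(single)

print(naive.shape)  # (1, 4): the protocol column vanished
print(fixed.shape)  # (1, 5): protocol_UDP is kept and set to 0
```

With the categories fixed, a single-row batch produces the same five features the model was trained on.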

### 🤖 Train Isolation Forest

The `IsolationForest` algorithm is an unsupervised model used to detect anomalies. Each tree isolates observations by repeatedly picking a random feature and a random split value; anomalies tend to be isolated in fewer splits than normal points.

- `n_estimators=100`: number of trees in the forest.
- `contamination=0.1`: assumes about 10% of the training data is anomalous, which sets the decision threshold.
- `random_state=42`: ensures reproducibility.

The model is trained on the preprocessed numerical dataset.

In [32]:
model = IsolationForest(contamination=0.1, random_state=42, n_estimators=100)
model.fit(preprocessed_data)
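
Because `contamination` sets the score threshold, it largely determines how many training rows end up labelled as anomalies. The sketch below illustrates this on random Gaussian data, which is a stand-in and not the notebook's dataset; the names `X` and `flagged_share` are introduced here for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))  # synthetic stand-in for the preprocessed features

flagged_share = {}
for contamination in (0.01, 0.1):
    forest = IsolationForest(
        n_estimators=100, contamination=contamination, random_state=42
    ).fit(X)
    # Fraction of training rows predicted as anomalies (-1)
    flagged_share[contamination] = float(np.mean(forest.predict(X) == -1))
    print(f"contamination={contamination}: "
          f"{flagged_share[contamination]:.1%} of training rows flagged")
```

The flagged fraction tracks the `contamination` value closely, so when the training set is meant to contain only normal traffic, a smaller value flags fewer of those rows.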

### 💾 Save Trained Model

Save the trained model using `joblib`, which allows for efficient serialization and deserialization. This saved model can be reused later for inference or deployment.


In [33]:
joblib.dump(model, "anomaly_model.joblib")

['anomaly_model.joblib']
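
A quick way to confirm the saved file is usable is to load it back and compare predictions. This is a minimal round-trip sketch on placeholder data, writing to a temporary directory instead of the notebook's working directory.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # placeholder features, not the real dataset
model = IsolationForest(n_estimators=100, contamination=0.1, random_state=42).fit(X)

# Save, then reload the model from disk
model_path = str(Path(tempfile.mkdtemp()) / "anomaly_model.joblib")
joblib.dump(model, model_path)
restored = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly
same = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions identical after reload:", same)
```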

### 🔍 Predict on Sample Data

In [34]:
sample_predictions = model.predict(preprocessed_data[:10])
print("Sample predictions:", sample_predictions)
print("Legend: 1 = Normal, -1 = Anomaly")

Sample predictions: [1 1 1 1 1 1 1 1 1 1]
Legend: 1 = Normal, -1 = Anomaly


In [35]:
# Sanity-check the fitted model: decision_function scores are
# positive for inliers and negative for outliers
anomaly_scores = model.decision_function(preprocessed_data)
print("Model trained successfully!")
print(f"Training data shape: {preprocessed_data.shape}")
print(f"Average anomaly score: {np.mean(anomaly_scores):.3f}")

Model trained successfully!
Training data shape: (1000, 5)
Average anomaly score: 0.049
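
The average score on its own says little; it helps to know how the scores relate to the labels. The sketch below, again on placeholder data, checks two properties of scikit-learn's `IsolationForest`: `predict` returns -1 exactly where `decision_function` is negative, and `decision_function` is `score_samples` shifted by the fitted `offset_`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))  # placeholder features, not the real dataset
model = IsolationForest(n_estimators=100, contamination=0.1, random_state=42).fit(X)

scores = model.decision_function(X)
labels = model.predict(X)

# predict() returns -1 exactly where the decision score is negative
consistent = np.array_equal(labels == -1, scores < 0)
# decision_function is score_samples shifted by the fitted offset_
shifted = np.allclose(scores, model.score_samples(X) - model.offset_)

print("Labels match score sign:", consistent)
print("Scores equal score_samples - offset_:", shifted)
```

Inspecting the lowest-scoring rows, not only the mean, is usually a more informative check of what the model considers anomalous.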
