### 🧩 Imports

Import all the necessary libraries:
- `json` for reading/writing JSON files.
- `pandas` and `numpy` for data manipulation.
- `random` for generating random synthetic data.
- `IsolationForest` from `sklearn` for anomaly detection.
- `joblib` for saving the trained model.

In [35]:
import json
import pandas as pd
from sklearn.ensemble import IsolationForest
import numpy as np
import random
import joblib

### 📊 Generate Dataset

Create synthetic "normal" network traffic data to train an anomaly detection model. Each sample includes:

- `src_port`: randomly selected from common service ports.
- `dst_port`: a random high port number.
- `packet_size`: typical packet sizes.
- `duration_ms`: duration of the communication.
- `protocol`: randomly selected between TCP and UDP.

This data is saved to `training_data.json` for future use.

In [36]:
COMMON_PORTS = [80, 443, 22, 8080]

def generate_normal_data():
    return {
        "src_port": random.choice(COMMON_PORTS),
        "dst_port": random.randint(1024, 65535),
        "packet_size": random.randint(100, 1500),
        "duration_ms": random.randint(50, 500),
        "protocol": random.choice(["TCP", "UDP"])
    }

dataset = [generate_normal_data() for _ in range(1000)]

with open("training_data.json", "w") as f:
    json.dump(dataset, f, indent=2)

print("Generated and saved training_data.json")

Generated and saved training_data.json
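
The generator above draws from Python's global `random` state, so every run writes a different `training_data.json`. If a repeatable dataset is wanted, one option is to pass an explicitly seeded `random.Random` instance. This is a minimal sketch: `generate_normal_data_seeded` and `generate_dataset` are hypothetical names that are not part of the notebook, and the seed value is arbitrary.

```python
import random

COMMON_PORTS = [80, 443, 22, 8080]

def generate_normal_data_seeded(rng):
    """Same fields and ranges as generate_normal_data, drawn from an explicit RNG."""
    return {
        "src_port": rng.choice(COMMON_PORTS),
        "dst_port": rng.randint(1024, 65535),
        "packet_size": rng.randint(100, 1500),
        "duration_ms": rng.randint(50, 500),
        "protocol": rng.choice(["TCP", "UDP"]),
    }

def generate_dataset(n, seed):
    # A dedicated Random instance keeps the global random state untouched
    rng = random.Random(seed)
    return [generate_normal_data_seeded(rng) for _ in range(n)]

first = generate_dataset(5, seed=42)
second = generate_dataset(5, seed=42)
print(first == second)  # True: same seed, same samples
```

With a fixed seed, the saved JSON file and any model trained on it can be regenerated exactly.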


In [37]:
with open("training_data.json") as f:
    raw_data = json.load(f)

df = pd.DataFrame(raw_data)
display(df.head())

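|   | src_port | dst_port | packet_size | duration_ms | protocol |
|---|----------|----------|-------------|-------------|----------|
| 0 | 22       | 23646    | 1447        | 392         | UDP      |
| 1 | 443      | 51361    | 601         | 468         | TCP      |
| 2 | 80       | 59470    | 1236        | 391         | TCP      |
| 3 | 8080     | 64832    | 257         | 302         | TCP      |
| 4 | 443      | 54966    | 136         | 350         | TCP      |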


### 🧼 Preprocessing Function

Machine learning models like Isolation Forest require **numerical input only**. Any categorical variables, such as the `protocol` column (`TCP`, `UDP`), must be converted into numbers.

We handle this with **one-hot encoding**, using `pd.get_dummies`.

#### 🛠️ Preprocessing Steps:

1. **Identify categorical columns**:
   - In our case, the `protocol` column is categorical (`TCP`, `UDP`).

2. **Use `pd.get_dummies`**:
   - This creates a new binary column for each category.
   - For instance:
     ```
     protocol
     ---------
     TCP   →   protocol_UDP = 0
     UDP   →   protocol_UDP = 1
     ```
   - Setting `drop_first=True` drops the first category in sorted order (`TCP` here), since it can be inferred from the remaining column. This leaves a single `protocol_UDP` indicator instead of two redundant columns.

3. **Return a DataFrame with all numerical values**:
   - This is ready for model input.

> ✅ This preprocessing is essential to avoid errors during training and ensure the model can learn from categorical variables.

In [38]:
def preprocess_data(df):
    df_processed = pd.get_dummies(df, columns=['protocol'], drop_first=True, dtype=int)

    required_cols = ['src_port', 'dst_port', 'packet_size', 'duration_ms', 'protocol_UDP']
    for col in required_cols:
        if col not in df_processed.columns:
            df_processed[col] = 0

    return df_processed[required_cols]

df_processed = preprocess_data(df.copy())
display(df_processed.head())

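|   | src_port | dst_port | packet_size | duration_ms | protocol_UDP |
|---|----------|----------|-------------|-------------|--------------|
| 0 | 22       | 23646    | 1447        | 392         | 1            |
| 1 | 443      | 51361    | 601         | 468         | 0            |
| 2 | 80       | 59470    | 1236        | 391         | 0            |
| 3 | 8080     | 64832    | 257         | 302         | 0            |
| 4 | 443      | 54966    | 136         | 350         | 0            |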
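
One edge case is worth noting for inference. With `drop_first=True`, a batch that contains only `UDP` rows has its single category dropped, so `protocol_UDP` is never created and the fallback in `preprocess_data` fills it with `0`, which encodes those rows as TCP. A possible alternative is to build the indicator with a direct comparison, so the result does not depend on which protocols appear in the batch. The sketch below uses a hypothetical helper, `preprocess_explicit`, that is not part of the notebook.

```python
import pandas as pd

FEATURE_COLS = ['src_port', 'dst_port', 'packet_size', 'duration_ms', 'protocol_UDP']

def preprocess_explicit(df):
    """Hypothetical variant of preprocess_data that avoids get_dummies."""
    out = df.copy()
    # 1 for UDP, 0 for any other protocol, regardless of what the batch contains
    out['protocol_UDP'] = (out['protocol'] == 'UDP').astype(int)
    return out[FEATURE_COLS]

# A single-row batch with only UDP traffic (illustrative values)
udp_only = pd.DataFrame([
    {"src_port": 443, "dst_port": 50000, "packet_size": 512,
     "duration_ms": 120, "protocol": "UDP"},
])

encoded = preprocess_explicit(udp_only)
print(encoded['protocol_UDP'].tolist())  # [1]
```

The column order stays fixed through `FEATURE_COLS`, so the output lines up with the features the model was trained on.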


### 🤖 Train Isolation Forest

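The `IsolationForest` algorithm is an unsupervised model used to detect anomalies. It builds random trees that repeatedly pick a feature and a split value. Anomalies tend to be isolated in fewer splits than normal points, so a short average path length signals an outlier.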

- `n_estimators=100`: number of trees in the forest.
- `contamination=0.2`: the expected fraction of anomalies; the decision threshold is set so that roughly 20% of the training samples are flagged.
- `random_state=42`: ensures reproducibility.

The model is trained on the preprocessed numerical dataset.

In [39]:
model = IsolationForest(n_estimators=100, contamination=0.2, random_state=42)

model.fit(df_processed)

print("Isolation Forest model trained successfully.")

Isolation Forest model trained successfully.
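
To see how `contamination` moves the decision threshold, the sketch below fits two forests on the same stand-in data and counts the flagged rows. The data here is random Gaussian noise chosen only for illustration, not the notebook's traffic features, and `flagged_counts` is a hypothetical name.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))  # stand-in data, not the traffic dataset

flagged_counts = {}
for contamination in (0.01, 0.2):
    clf = IsolationForest(n_estimators=100, contamination=contamination, random_state=42)
    preds = clf.fit_predict(X)  # -1 marks an anomaly, 1 marks an inlier
    flagged_counts[contamination] = int((preds == -1).sum())
    print(f"contamination={contamination}: {flagged_counts[contamination]} of {len(X)} flagged")
```

The number of flagged rows should track `contamination * len(X)` closely, because the threshold is taken from a percentile of the training scores. The parameter therefore fixes how many training points are labelled anomalous, whether or not they really are.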


### 💾 Save Trained Model

Save the trained model using `joblib`, which allows for efficient serialization and deserialization. This saved model can be reused later for inference or deployment.


In [40]:
joblib.dump(model, "anomaly_model.joblib")

print("Model saved to anomaly_model.joblib")

Model saved to anomaly_model.joblib
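
Reloading works the same way in reverse with `joblib.load`. The sketch below is a self-contained round trip on stand-in data: it trains a small forest, saves it to a temporary directory, loads it back, and checks that both objects give the same predictions. The names `demo_model` and `restored` are illustrative only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # stand-in for the preprocessed feature matrix

demo_model = IsolationForest(n_estimators=50, random_state=42).fit(X)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "anomaly_model.joblib")
    joblib.dump(demo_model, path)
    restored = joblib.load(path)  # rebuilds the fitted estimator from disk

# The restored model should reproduce the original predictions
same = np.array_equal(demo_model.predict(X), restored.predict(X))
print(same)
```

Because joblib files are pickle-based, they should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.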


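### 🔍 Predict on Training Data

`predict` returns `1` for inliers and `-1` for anomalies. Run on the training set, the model flags 200 of the 1000 rows, which matches `contamination=0.2`.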

In [41]:
preds = model.predict(df_processed)
print(np.unique(preds, return_counts=True))

(array([-1,  1]), array([200, 800]))
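
Scoring unseen traffic follows the same pattern. The sketch below rebuilds a small training set with the same value ranges as the generator, fits a forest, and scores one hypothetical record whose values fall well outside those ranges. The record, the seed, and the names `demo_model` and `odd` are assumptions made for illustration.

```python
import random

import pandas as pd
from sklearn.ensemble import IsolationForest

FEATURES = ["src_port", "dst_port", "packet_size", "duration_ms", "protocol_UDP"]
rng = random.Random(0)

# Rebuild "normal" traffic with the same ranges as the generator
train = pd.DataFrame(
    [
        {
            "src_port": rng.choice([80, 443, 22, 8080]),
            "dst_port": rng.randint(1024, 65535),
            "packet_size": rng.randint(100, 1500),
            "duration_ms": rng.randint(50, 500),
            "protocol_UDP": rng.randint(0, 1),
        }
        for _ in range(1000)
    ],
    columns=FEATURES,
)

demo_model = IsolationForest(n_estimators=100, contamination=0.2, random_state=42).fit(train)

# Hypothetical suspicious record: most values lie outside the training ranges
odd = pd.DataFrame(
    [{"src_port": 31337, "dst_port": 23, "packet_size": 9000,
      "duration_ms": 5000, "protocol_UDP": 1}],
    columns=FEATURES,
)

odd_score = float(demo_model.decision_function(odd)[0])
train_scores = demo_model.decision_function(train)
print("label:", demo_model.predict(odd)[0])
print("score:", round(odd_score, 3))
```

`decision_function` returns negative values for points the model treats as outliers and positive values for inliers, so it gives a graded view that the `-1`/`1` labels from `predict` do not.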
