### 🧩 Imports

Import all the necessary libraries:
- `json` for reading/writing JSON files.
- `pandas` and `numpy` for data manipulation.
- `random` for generating random synthetic data.
- `IsolationForest` from `sklearn` for anomaly detection.
- `joblib` for saving the trained model.

In [3]:
import json
import pandas as pd
from sklearn.ensemble import IsolationForest
import numpy as np
import random
import joblib

### 📊 Generate Dataset

Create synthetic "normal" network traffic data to train an anomaly detection model. Each sample includes:

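- `src_port`: the port of a randomly chosen common service (HTTP 80, HTTPS 443, SSH 22, or a web API on 8080).
- `dst_port`: a random client-side port (1024-65535 for web services, 49152-65535 for SSH).
- `packet_size`: a typical packet size in bytes, clamped to the range 100-1500.
- `duration_ms`: duration of the communication in milliseconds, clamped to the range 50-500.
- `protocol`: `TCP` or `UDP`. SSH is always TCP; other services are weighted 80% TCP and 20% UDP.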

This data is saved to `training_data.json` for future use.

In [4]:
import random
import json
from datetime import datetime

COMMON_SERVICES = {
    "HTTP": 80,
    "HTTPS": 443,
    "SSH": 22,
    "WEB_API": 8080
}

def generate_realistic_network_sample():
    """Generate a realistic synthetic normal network traffic sample"""

    service_name = random.choice(list(COMMON_SERVICES.keys()))
    src_port = COMMON_SERVICES[service_name]

    if service_name == "SSH":
        protocol = "TCP"
    else:
        protocol = random.choices(["TCP", "UDP"], weights=[0.8, 0.2])[0]

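    # Note: the hour comes from the wall clock at generation time, so every
    # sample produced in one run shares the same day/night regime.
    hour = datetime.now().hour
    is_night = hour < 6 or hour > 22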

    if protocol == "TCP":
        packet_size = int(random.gauss(1000, 180))
    else:
        packet_size = int(random.gauss(400, 100))

    packet_size = max(100, min(1500, packet_size))

    if is_night:
        duration_ms = int(random.gauss(250, 80))
    else:
        duration_ms = int(random.gauss(120, 40))

    duration_ms = max(50, min(500, duration_ms))

    if src_port in [80, 443, 8080]:
        dst_port = random.randint(1024, 65535)  # Client-side ports
    else:  # SSH
        dst_port = random.randint(49152, 65535)  # Ephemeral client port

    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "packet_size": packet_size,
        "duration_ms": duration_ms,
        "protocol": protocol
    }


dataset = [generate_realistic_network_sample() for _ in range(1000)]

with open("training_data.json", "w") as f:
    json.dump(dataset, f, indent=2)




In [5]:
with open("training_data.json") as f:  # same relative path the file was written to
    raw_data = json.load(f)

df = pd.DataFrame(raw_data)
display(df)

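|     | src_port | dst_port | packet_size | duration_ms | protocol |
|-----|----------|----------|-------------|-------------|----------|
| 0   | 443      | 20395    | 1027        | 102         | TCP      |
| 1   | 443      | 10401    | 1110        | 101         | TCP      |
| 2   | 80       | 41435    | 920         | 138         | TCP      |
| 3   | 22       | 52716    | 802         | 91          | TCP      |
| 4   | 22       | 50021    | 865         | 134         | TCP      |
| ... | ...      | ...      | ...         | ...         | ...      |
| 995 | 8080     | 29831    | 1097        | 87          | TCP      |
| 996 | 80       | 53738    | 1012        | 105         | TCP      |
| 997 | 443      | 14182    | 480         | 125         | UDP      |
| 998 | 80       | 19227    | 1175        | 98          | TCP      |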


### 🧼 Preprocessing Function

Machine learning models like Isolation Forest require **numerical input only**. Any categorical variables, such as the `protocol` column (`TCP`, `UDP`), must be converted into numbers.

We handle this with **one-hot encoding**, using `pd.get_dummies`.

#### 🛠️ Preprocessing Steps:

1. **Identify categorical columns**:
   - In our case, the `protocol` column is categorical (`TCP`, `UDP`).

2. **Use `pd.get_dummies`**:
   - This creates a new binary column for each category.
   - For instance:
     ```
     protocol
     ---------
     TCP   →   protocol_UDP = 0
     UDP   →   protocol_UDP = 1
     ```
   - Setting `drop_first=True` prevents multicollinearity by dropping the first category (`TCP` here), as it can be inferred from the others.

3. **Return a DataFrame with all numerical values**:
   - This is ready for model input.

> ✅ This preprocessing is essential to avoid errors during training and ensure the model can learn from categorical variables.
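One caveat worth illustrating: with `drop_first=True`, a batch that contains only one protocol value would lose its `protocol_UDP` column, so the feature count would no longer match the training data. The following is a minimal sketch of one way to guard against that, assuming a hypothetical helper `encode_protocol_fixed` and a fixed `PROTOCOLS` list (neither is part of the original notebook). It relies on `pd.get_dummies` emitting a column for every declared category of a categorical column, including unobserved ones.

```python
import pandas as pd

# Assumption: the full set of protocols is known up front.
PROTOCOLS = ["TCP", "UDP"]

def encode_protocol_fixed(df):
    """Hypothetical helper: one-hot encode `protocol` against a fixed category list."""
    out = df.copy()
    # Declaring the categories makes the output columns independent of the batch contents.
    out["protocol"] = pd.Categorical(out["protocol"], categories=PROTOCOLS)
    return pd.get_dummies(out, columns=["protocol"], drop_first=True, dtype=int)

# Tiny illustrative frames, not the notebook's dataset.
train = pd.DataFrame({"packet_size": [1000, 400], "protocol": ["TCP", "UDP"]})
single = pd.DataFrame({"packet_size": [950], "protocol": ["TCP"]})

encoded_train = encode_protocol_fixed(train)
encoded_single = encode_protocol_fixed(single)

print(encoded_train)
print(encoded_single)  # still has a protocol_UDP column, set to 0
```

Both frames end up with the same columns, which is what a fitted model expects at inference time.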

In [6]:
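def preprocess_data(df):
    """One-hot encode `protocol` and return an all-numeric NumPy array."""
    # drop_first=True keeps only protocol_UDP; dtype=int avoids boolean columns
    df_encoded = pd.get_dummies(df, columns=["protocol"], drop_first=True, dtype=int)
    return df_encoded.to_numpy()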

In [10]:
with open("training_data.json") as f:  # same relative path the file was written to
    raw_data = json.load(f)

df = pd.DataFrame(raw_data)
preprocessed_data = preprocess_data(df)


### 🤖 Train Isolation Forest

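The `IsolationForest` algorithm is an unsupervised model used to detect anomalies. Each tree isolates observations by repeatedly picking a random feature and a random split value. Anomalies tend to be isolated in fewer splits than normal points, so a shorter average path length indicates a more anomalous sample.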

- `n_estimators=100`: number of trees in the forest.
- `contamination=0.1`: the expected fraction of anomalies (10%); it sets the score threshold that `predict` uses to flag samples.
- `random_state=42`: ensures reproducibility.

The model is trained on the preprocessed numerical dataset.
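To make the effect of `contamination` concrete, here is a small sketch on placeholder Gaussian data (an assumption for illustration, not the notebook's dataset). It fits the forest with two contamination values and measures the fraction of training points that `predict` labels as `-1`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))  # placeholder data for illustration only

flagged = {}
for contamination in (0.01, 0.1):
    forest = IsolationForest(
        n_estimators=100, contamination=contamination, random_state=42
    )
    labels = forest.fit(X).predict(X)
    # Fraction of training samples labelled as anomalies (-1)
    flagged[contamination] = float(np.mean(labels == -1))

for contamination, fraction in flagged.items():
    print(f"contamination={contamination}: flagged fraction = {fraction:.3f}")
```

The flagged fraction tracks the `contamination` value closely, even when the training data contains no real anomalies, so the parameter is best read as a threshold choice rather than a property the model discovers.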

In [13]:
model = IsolationForest(contamination=0.1, random_state=42, n_estimators=100)
model.fit(preprocessed_data)

### 💾 Save Trained Model

Save the trained model using `joblib`, which allows for efficient serialization and deserialization. This saved model can be reused later for inference or deployment.
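As a sketch of the reuse step, the snippet below saves a model and loads it back with `joblib.load`, then checks that the restored model gives the same predictions. The placeholder data, the `demo_model` name, and the temporary directory are assumptions made so the example is self-contained.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # placeholder features, not the notebook's dataset

demo_model = IsolationForest(
    n_estimators=100, contamination=0.1, random_state=42
).fit(X)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "anomaly_model.joblib")
    joblib.dump(demo_model, model_path)   # serialize to disk
    restored = joblib.load(model_path)    # deserialize for inference

same_predictions = bool(np.array_equal(demo_model.predict(X), restored.predict(X)))
print("Restored model reproduces predictions:", same_predictions)
```

Files written by `joblib` should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.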


In [14]:
joblib.dump(model, "anomaly_model.joblib")


['anomaly_model.joblib']

### 🔍 Predict Data

In [16]:
sample_predictions = model.predict(preprocessed_data[:10])
print("Sample predictions:", sample_predictions)
print("Legend: 1 = Normal, -1 = Anomaly")

Sample predictions: [1 1 1 1 1 1 1 1 1 1]
Legend: 1 = Normal, -1 = Anomaly


In [17]:
# decision_function: positive scores indicate inliers, negative scores indicate anomalies
anomaly_scores = model.decision_function(preprocessed_data)
print("Model trained successfully!")
print(f"Training data shape: {preprocessed_data.shape}")
print(f"Average anomaly score: {np.mean(anomaly_scores):.3f}")

Model trained successfully!
Training data shape: (1000, 5)
Average anomaly score: 0.101
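The labels from `predict` and the scores from `decision_function` are two views of the same result: a sample is labelled `-1` exactly when its score is negative. The sketch below checks that relationship and ranks the most anomalous rows, using placeholder data and a `demo_model` name that are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))  # placeholder data for illustration only

demo_model = IsolationForest(
    n_estimators=100, contamination=0.1, random_state=42
).fit(X)

scores = demo_model.decision_function(X)  # lower means more anomalous
labels = demo_model.predict(X)            # 1 = normal, -1 = anomaly

# Rebuild the labels from the sign of the score and compare.
labels_from_scores = np.where(scores < 0, -1, 1)
consistent = bool(np.array_equal(labels, labels_from_scores))

# Indices of the five lowest-scoring (most anomalous) samples.
most_anomalous = np.argsort(scores)[:5]

print("Labels match score signs:", consistent)
print("Most anomalous row indices:", most_anomalous)
```

Ranking by score is often more useful than the hard labels, because it lets you review the most suspicious samples first without committing to a single threshold.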
