### 🧩 Imports

Import the libraries used in this notebook:
- `json` for reading/writing JSON files.
- `pandas` and `numpy` for data manipulation.
- `random` for generating random synthetic data.
- `IsolationForest` from `sklearn` for anomaly detection.
- `joblib` for saving the trained model.
- `PCA` from `sklearn` for dimensionality reduction.
- `matplotlib` for plotting.

In [30]:
import json
import pandas as pd
from sklearn.ensemble import IsolationForest
import numpy as np
import random
import joblib
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

In [4]:
!unzip dataset.zip

Archive:  dataset.zip
replace dataset/training_data.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
 extracting: dataset/training_data.json  


### 📊 Generate Dataset

Create synthetic "normal" network traffic data to train an anomaly detection model. Each sample includes:

- `src_port`: randomly selected from common service ports.
- `dst_port`: a random high port number.
- `packet_size`: typical packet sizes.
- `duration_ms`: duration of the communication.
- `protocol`: randomly selected between TCP and UDP.

This data is saved to `dataset/training_data.json` and read back in the following cell.

In [82]:
COMMON_PORTS = [80, 443, 22, 8080]

def generate_normal_data():
    return {
        "src_port": random.choice(COMMON_PORTS),
        "dst_port": random.randint(1024, 65535),
        "packet_size": random.randint(100, 1500),
        "duration_ms": random.randint(50, 500),
        "protocol": random.choice(["TCP", "UDP"])
    }

dataset = [generate_normal_data() for _ in range(1000)]

with open("/content/dataset/training_data.json", "w") as f:
    json.dump(dataset, f, indent=2)
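
The generator above draws from the module-level `random` functions, so each run produces a different dataset. If a repeatable dataset is wanted, one option is a dedicated, seeded `random.Random` instance. The sketch below is an assumption layered on the cell above: the `generate_dataset` helper and its `seed` parameter are hypothetical names, not part of the original notebook.

```python
import random

COMMON_PORTS = [80, 443, 22, 8080]

def generate_dataset(n_samples, seed=42):
    """Hypothetical helper: same fields as generate_normal_data, but seeded."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [
        {
            "src_port": rng.choice(COMMON_PORTS),
            "dst_port": rng.randint(1024, 65535),
            "packet_size": rng.randint(100, 1500),
            "duration_ms": rng.randint(50, 500),
            "protocol": rng.choice(["TCP", "UDP"]),
        }
        for _ in range(n_samples)
    ]

first = generate_dataset(1000)
second = generate_dataset(1000)
print(first == second)  # the same seed yields the same samples
```

Passing a different `seed` gives a different, but equally repeatable, dataset.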


In [83]:
with open("/content/dataset/training_data.json") as f:
    raw_data = json.load(f)

df = pd.DataFrame(raw_data)
display(df)

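|     | src_port | dst_port | packet_size | duration_ms | protocol |
|-----|----------|----------|-------------|-------------|----------|
| 0   | 80       | 47068    | 1281        | 350         | TCP      |
| 1   | 22       | 45564    | 563         | 326         | UDP      |
| 2   | 80       | 56343    | 110         | 333         | UDP      |
| 3   | 443      | 39196    | 1228        | 301         | TCP      |
| 4   | 80       | 38909    | 750         | 130         | TCP      |
| ... | ...      | ...      | ...         | ...         | ...      |
| 995 | 80       | 52829    | 588         | 351         | UDP      |
| 996 | 8080     | 25852    | 555         | 189         | TCP      |
| 997 | 22       | 29520    | 233         | 63          | TCP      |
| 998 | 8080     | 21166    | 386         | 463         | TCP      |
| 999 | 8080     | 32005    | 583         | 187         | UDP      |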


### 🧼 Preprocessing Function

Machine learning models like Isolation Forest require **numerical input only**. Any categorical variables, such as the `protocol` column (`TCP`, `UDP`), must be converted into numbers.

We handle this with **one-hot encoding**, using `pd.get_dummies`.

#### 🛠️ Preprocessing Steps:

1. **Identify categorical columns**:
   - In our case, the `protocol` column is categorical (`TCP`, `UDP`).

2. **Use `pd.get_dummies`**:
   - This creates a new binary column for each category.
   - For instance:
     ```
     protocol
     ---------
     TCP   →   protocol_UDP = 0
     UDP   →   protocol_UDP = 1
     ```
   - Setting `drop_first=True` prevents multicollinearity by dropping the first category (`TCP` here), as it can be inferred from the others.

3. **Return an all-numeric array**:
   - `preprocess_data` converts the encoded DataFrame to a NumPy array, which is ready for model input.

> ✅ This preprocessing is essential to avoid errors during training and ensure the model can learn from categorical variables.
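
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a three-row frame. The `sample` DataFrame is made up for illustration and is not part of the training data.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical values).
sample = pd.DataFrame({
    "src_port": [80, 22, 443],
    "protocol": ["TCP", "UDP", "TCP"],
})

# Without drop_first: one binary column per category.
full = pd.get_dummies(sample, columns=["protocol"], dtype=int)
# With drop_first: the first category (TCP) is dropped and implied by protocol_UDP == 0.
reduced = pd.get_dummies(sample, columns=["protocol"], drop_first=True, dtype=int)

print(full.columns.tolist())     # ['src_port', 'protocol_TCP', 'protocol_UDP']
print(reduced.columns.tolist())  # ['src_port', 'protocol_UDP']
print(reduced["protocol_UDP"].tolist())  # [0, 1, 0]
```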

In [84]:
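def preprocess_data(df):
    """One-hot encode `protocol` and return the features as a NumPy array."""
    # drop_first=True keeps a single protocol_UDP column (TCP is the dropped baseline).
    df = pd.get_dummies(df, columns=['protocol'], drop_first=True, dtype=int)
    return df.to_numpy()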

preprocessed_data = preprocess_data(df)
preprocessed_data

array([[   80, 47068,  1281,   350,     0],
       [   22, 45564,   563,   326,     1],
       [   80, 56343,   110,   333,     1],
       ...,
       [   22, 29520,   233,    63,     0],
       [ 8080, 21166,   386,   463,     0],
       [ 8080, 32005,   583,   187,     1]])
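
One caveat with `pd.get_dummies`: the columns it produces depend on which categories appear in the input. A batch that contains only one protocol, for example a single sample at inference time, yields no `protocol_UDP` column at all when `drop_first=True`, so the feature count no longer matches the trained model. A possible workaround, sketched below, is to declare the categories up front. `PROTOCOLS` and `preprocess_fixed` are hypothetical names, and the sketch assumes TCP and UDP are the only protocols.

```python
import pandas as pd

PROTOCOLS = ["TCP", "UDP"]  # assumption: the complete set of protocol values

def preprocess_fixed(df):
    """Hypothetical variant of preprocess_data with a fixed category list."""
    df = df.copy()
    # A Categorical with explicit categories makes get_dummies emit a column
    # for every category, even ones missing from this particular batch.
    df["protocol"] = pd.Categorical(df["protocol"], categories=PROTOCOLS)
    df = pd.get_dummies(df, columns=["protocol"], drop_first=True, dtype=int)
    return df.to_numpy()

single = pd.DataFrame([{
    "src_port": 443, "dst_port": 50000,
    "packet_size": 512, "duration_ms": 120, "protocol": "TCP",
}])

plain = pd.get_dummies(single, columns=["protocol"], drop_first=True, dtype=int)
print(plain.shape)                     # (1, 4): the protocol column vanished
print(preprocess_fixed(single).shape)  # (1, 5): matches the training layout
```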

### 🤖 Train Isolation Forest

The `IsolationForest` algorithm is an unsupervised model used to detect anomalies. It isolates observations by repeatedly selecting a random feature and a random split value; anomalies tend to be isolated in fewer splits than normal points.

- `n_estimators`: number of trees in the forest (left at its default of 100).
- `contamination=0.1`: assumes 10% of the data is anomalous.
- `random_state=42`: ensures reproducibility.

The model is trained on the preprocessed numerical dataset.

In [85]:
model = IsolationForest(contamination=0.1, random_state=42)
model.fit(preprocessed_data)
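
The `contamination` value sets the score threshold, so it directly controls how many training rows end up flagged. The sketch below illustrates this on stand-in data: `X` is a randomly generated placeholder with the same five feature columns, not the notebook's `preprocessed_data`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Stand-in for preprocessed_data: 1000 rows, same five feature columns.
X = np.column_stack([
    rng.choice([80, 443, 22, 8080], size=1000),  # src_port
    rng.integers(1024, 65536, size=1000),        # dst_port
    rng.integers(100, 1501, size=1000),          # packet_size
    rng.integers(50, 501, size=1000),            # duration_ms
    rng.integers(0, 2, size=1000),               # protocol_UDP
])

model = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
model.fit(X)

scores = model.decision_function(X)  # negative scores fall below the threshold
labels = model.predict(X)            # -1 = anomaly, 1 = normal
print("flagged fraction:", np.mean(labels == -1))
```

Because the threshold is taken from the training scores, roughly the `contamination` fraction of the training rows is labelled `-1` even when every row was generated as "normal".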

### 💾 Save Trained Model

Save the trained model using `joblib`, which allows for efficient serialization and deserialization. This saved model can be reused later for inference or deployment.


In [86]:
joblib.dump(model, "anomaly_model.joblib")

['anomaly_model.joblib']
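
Reloading works the same way with `joblib.load`. The following is a minimal round-trip sketch under stated assumptions: it trains a throwaway model on placeholder features and writes to a temporary directory rather than touching the notebook's `anomaly_model.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # placeholder features, not the notebook's data

model = IsolationForest(contamination=0.1, random_state=42).fit(X)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "anomaly_model.joblib")
    joblib.dump(model, path)       # serialize the fitted model
    restored = joblib.load(path)   # deserialize it into a new object

# The restored model should reproduce the original predictions.
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```

A model saved this way is generally best loaded with the same scikit-learn version that produced it.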

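### 🔍 Predict

`predict` returns `1` for samples the model considers normal and `-1` for samples it flags as anomalies.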

In [87]:
predictions = model.predict(preprocessed_data)
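
A short sketch of how the predictions might be inspected and how a new sample could be scored. Everything here is illustrative: `X` is stand-in data rather than the notebook's `preprocessed_data`, and the `suspicious` row is a hypothetical out-of-range sample.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Stand-in for preprocessed_data with the same column order.
X = np.column_stack([
    rng.choice([80, 443, 22, 8080], size=1000),  # src_port
    rng.integers(1024, 65536, size=1000),        # dst_port
    rng.integers(100, 1501, size=1000),          # packet_size
    rng.integers(50, 501, size=1000),            # duration_ms
    rng.integers(0, 2, size=1000),               # protocol_UDP
])
model = IsolationForest(contamination=0.1, random_state=42).fit(X)
predictions = model.predict(X)

# Row indices of the training samples flagged as anomalies.
anomaly_idx = np.where(predictions == -1)[0]
print(f"{len(anomaly_idx)} of {len(X)} training rows flagged")

# Hypothetical sample: unusual source port, oversized packet, long duration.
suspicious = np.array([[31337, 4444, 65000, 10000, 0]])
print(model.predict(suspicious), model.decision_function(suspicious))
```

New samples must use the same column order and encoding as the training array before they are passed to `predict`.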