### 🧩 Imports

Import all the necessary libraries:
- `json` for reading/writing JSON files.
- `pandas` and `numpy` for data manipulation.
- `random` for generating random synthetic data.
- `IsolationForest` from `sklearn` for anomaly detection.
- `joblib` for saving the trained model.

In [3]:
import json
import pandas as pd
from sklearn.ensemble import IsolationForest
import numpy as np
import random
import joblib

### 📊 Generate Dataset

Create synthetic "normal" network traffic data to train an anomaly detection model. Each sample includes:

- `src_port`: randomly selected from common service ports (80, 443, 22, 8080).
- `dst_port`: a random high port number between 1024 and 65535.
- `packet_size`: a random packet size between 100 and 1500.
- `duration_ms`: duration of the communication, between 50 and 500 ms.
- `protocol`: randomly selected between TCP and UDP.

This data is saved to `training_data.json` for future use.

In [5]:
COMMON_PORTS = [80, 443, 22, 8080]

def generate_normal_data():
    return {
        "src_port": random.choice(COMMON_PORTS),
        "dst_port": random.randint(1024, 65535),
        "packet_size": random.randint(100, 1500),
        "duration_ms": random.randint(50, 500),
        "protocol": random.choice(["TCP", "UDP"])
    }

dataset = [generate_normal_data() for _ in range(1000)]

with open("training_data.json", "w") as f:
    json.dump(dataset, f, indent=2)
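
Because the generator draws from Python's `random` module without a seed, each run writes a different dataset. If repeatable training data is wanted, one option is to seed the generator before sampling. The sketch below is only an illustration: the seed value 42 and the helper name `make_dataset` are assumptions, not part of the original notebook.

```python
import random

COMMON_PORTS = [80, 443, 22, 8080]

def generate_normal_data():
    # Same sampling ranges as the notebook's generator.
    return {
        "src_port": random.choice(COMMON_PORTS),
        "dst_port": random.randint(1024, 65535),
        "packet_size": random.randint(100, 1500),
        "duration_ms": random.randint(50, 500),
        "protocol": random.choice(["TCP", "UDP"]),
    }

def make_dataset(n, seed):
    """Hypothetical helper: seed `random`, then draw n samples."""
    random.seed(seed)
    return [generate_normal_data() for _ in range(n)]

first = make_dataset(5, seed=42)
second = make_dataset(5, seed=42)
print(first == second)  # True: the same seed yields the same samples
```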


In [6]:
with open("training_data.json") as f:
    raw_data = json.load(f)

df = pd.DataFrame(raw_data)
display(df)

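| | src_port | dst_port | packet_size | duration_ms | protocol |
|---|---|---|---|---|---|
| 0 | 22 | 35902 | 453 | 316 | TCP |
| 1 | 80 | 54768 | 1163 | 486 | TCP |
| 2 | 443 | 3109 | 439 | 197 | TCP |
| 3 | 443 | 64096 | 1020 | 482 | TCP |
| 4 | 80 | 22246 | 365 | 282 | UDP |
| ... | ... | ... | ... | ... | ... |
| 995 | 80 | 19917 | 101 | 344 | TCP |
| 996 | 8080 | 17287 | 1067 | 322 | UDP |
| 997 | 22 | 45671 | 513 | 118 | TCP |
| 998 | 8080 | 9923 | 1205 | 96 | UDP |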


### 🧼 Preprocessing Function

Machine learning models like Isolation Forest require **numerical input only**. Any categorical variables, such as the `protocol` column (`TCP`, `UDP`), must be converted into numbers.

We handle this with **one-hot encoding**, using `pd.get_dummies`.

#### 🛠️ Preprocessing Steps:

1. **Identify categorical columns**:
   - In our case, the `protocol` column is categorical (`TCP`, `UDP`).

2. **Use `pd.get_dummies`**:
   - This creates a new binary column for each category.
   - For instance:
     ```
     protocol
     ---------
     TCP   →   protocol_UDP = 0
     UDP   →   protocol_UDP = 1
     ```
   - Setting `drop_first=True` drops the first category (`TCP` here), since it can be inferred from the remaining column. This avoids a redundant, perfectly correlated feature (multicollinearity).

3. **Return a DataFrame with all numerical values**:
   - This is ready for model input.

> ✅ This preprocessing is essential to avoid errors during training and ensure the model can learn from categorical variables.

In [7]:
def preprocess_data(df):
    #TODO 1: Implement preprocessing steps
    # Convert the protocol column to one-hot encoding
    df_processed = pd.get_dummies(df, columns=['protocol'], drop_first=True)
    # Convert the DataFrame to a NumPy array
    return np.array(df_processed)
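
One caveat with this approach: `pd.get_dummies` only creates columns for the categories present in the frame it receives. A batch that contains only `TCP` rows would produce no `protocol_UDP` column at all, so the resulting array would have fewer columns than the model was trained on. Below is a minimal sketch of one way to guard against this. The column list `FEATURE_COLUMNS` and the helper name `preprocess_aligned` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Assumed feature layout: the training frame's columns after one-hot encoding.
FEATURE_COLUMNS = ["src_port", "dst_port", "packet_size", "duration_ms", "protocol_UDP"]

def preprocess_aligned(df):
    """One-hot encode `protocol`, then force the training-time column layout."""
    encoded = pd.get_dummies(df, columns=["protocol"])
    # Missing dummy columns are filled with 0; unexpected ones are dropped.
    aligned = encoded.reindex(columns=FEATURE_COLUMNS, fill_value=0)
    return aligned.to_numpy(dtype=float)

tcp_only = pd.DataFrame([
    {"src_port": 80, "dst_port": 5000, "packet_size": 500, "duration_ms": 100, "protocol": "TCP"},
])
udp_only = pd.DataFrame([
    {"src_port": 443, "dst_port": 6000, "packet_size": 300, "duration_ms": 150, "protocol": "UDP"},
])

tcp_array = preprocess_aligned(tcp_only)
udp_array = preprocess_aligned(udp_only)
print(tcp_array.shape, udp_array.shape)  # (1, 5) (1, 5)
```

Fixing the column list explicitly keeps single-row or single-protocol batches compatible with the trained model, at the cost of having to update the list if the feature set changes.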

### 🤖 Train Isolation Forest

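The `IsolationForest` algorithm is an unsupervised anomaly detector. It isolates observations by repeatedly picking a random feature and a random split value for it. Anomalies tend to be isolated in fewer splits than normal points, and the model uses this to score each sample.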

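- `n_estimators`: number of trees in the forest; the code below leaves it at the scikit-learn default of 100.
- `contamination=0.1`: assumes roughly 10% of the training data is anomalous and sets the decision threshold accordingly.
- `random_state=42`: fixes the random seed so the fitted model is reproducible.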

The model is trained on the preprocessed numerical dataset.

In [8]:
processed_data = preprocess_data(df)
model = IsolationForest(contamination=0.1, random_state=42)
model.fit(processed_data)
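
To make the role of `contamination` more concrete, the sketch below fits two forests on the same stand-in data and reports the fraction of training points each one flags. The Gaussian data and the variable names are assumptions for illustration only; this is not the notebook's traffic dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))  # stand-in data, not the notebook's dataset

flagged = {}
for contamination in (0.01, 0.1):
    forest = IsolationForest(
        n_estimators=100, contamination=contamination, random_state=42
    )
    labels = forest.fit_predict(X)  # -1 for anomalies, 1 for normal points
    flagged[contamination] = float(np.mean(labels == -1))
    print(f"contamination={contamination}: flagged {flagged[contamination]:.3f}")
```

The flagged fraction on the training set should land close to the chosen `contamination`, because the decision threshold is set from the training scores. On data that is entirely "normal", a value of 0.1 therefore still labels about one in ten training points as anomalous.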

### 💾 Save Trained Model

Save the trained model using `joblib`, which allows for efficient serialization and deserialization. This saved model can be reused later for inference or deployment.


In [9]:
joblib.dump(model, "anomaly_model.joblib")

['anomaly_model.joblib']
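
For later inference, the saved file can be restored with `joblib.load`. The following is a minimal round-trip sketch under stated assumptions: it trains on stand-in random data and writes to a temporary directory so that it does not overwrite the notebook's `anomaly_model.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # stand-in data, not the notebook's dataset

trained = IsolationForest(contamination=0.1, random_state=42).fit(X)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "anomaly_model.joblib")
    joblib.dump(trained, path)      # serialize the fitted model
    restored = joblib.load(path)    # deserialize it into a new object

# The restored model should reproduce the original predictions.
same = np.array_equal(trained.predict(X), restored.predict(X))
print(same)
```

Files written by `joblib` are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.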

### 🔎 Predict on Test Data

In [10]:
# Build the test data
test_data = [
    {"src_port": 80, "dst_port": 5000, "packet_size": 500, "duration_ms": 100, "protocol": "TCP"},  # normal
    {"src_port": 9999, "dst_port": 23, "packet_size": 5000, "duration_ms": 10, "protocol": "UDP"},  # anomalous
    {"src_port": 443, "dst_port": 8080, "packet_size": 200, "duration_ms": 200, "protocol": "TCP"}   # normal
]
test_df = pd.DataFrame(test_data)

# Preprocess the test data
test_processed = preprocess_data(test_df)

# Predict
predictions = model.predict(test_processed)
# -1 means anomaly, 1 means normal
for i, pred in enumerate(predictions):
    status = "Anomaly" if pred == -1 else "Normal"
    print(f"Sample {i+1}: {test_data[i]} -> {status}")

Sample 1: {'src_port': 80, 'dst_port': 5000, 'packet_size': 500, 'duration_ms': 100, 'protocol': 'TCP'} -> Normal
Sample 2: {'src_port': 9999, 'dst_port': 23, 'packet_size': 5000, 'duration_ms': 10, 'protocol': 'UDP'} -> Anomaly
Sample 3: {'src_port': 443, 'dst_port': 8080, 'packet_size': 200, 'duration_ms': 200, 'protocol': 'TCP'} -> Normal
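
The `-1`/`1` labels hide how strongly each sample is flagged. `decision_function` returns a signed score, where negative values correspond to the anomaly label. The sketch below is self-contained and therefore retrains on stand-in data drawn from the same ranges as `generate_normal_data`; the seed and array layout are assumptions, so the exact scores will differ from those of the notebook's model.

```python
import random

import numpy as np
from sklearn.ensemble import IsolationForest

random.seed(0)  # assumed seed, only to make the sketch repeatable

# Stand-in training set; the last column plays the role of protocol_UDP.
train = np.array([
    [random.choice([80, 443, 22, 8080]), random.randint(1024, 65535),
     random.randint(100, 1500), random.randint(50, 500), random.randint(0, 1)]
    for _ in range(1000)
], dtype=float)

model = IsolationForest(contamination=0.1, random_state=42).fit(train)

samples = np.array([
    [80, 5000, 500, 100, 0],   # within the training ranges
    [9999, 23, 5000, 10, 1],   # outside the training ranges on four features
], dtype=float)

scores = model.decision_function(samples)  # negative means anomaly
labels = model.predict(samples)            # -1 for anomaly, 1 for normal
for score, label in zip(scores, labels):
    print(f"score={score:+.3f} label={label}")
```

Ranking samples by score can be more useful than the hard labels when triaging alerts, since the label threshold depends entirely on the chosen `contamination`.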
