# Threat and Anomaly Detection

I used an unsupervised machine learning model (Isolation Forest) to find unusual network activity.

Since I do not have labeled data for intrusions or threats, this method helps identify records that stand out based on behavior, such as very high traffic or rare IP combinations.


In [None]:
import pandas as pd
from pathlib import Path

DATA_DIR = Path("data")
df = pd.read_csv(DATA_DIR / "combined_firewall.csv", parse_dates=["timestamp"])
df["timestamp"] = df["timestamp"].dt.tz_localize(None)
df.tail()


Unnamed: 0,timestamp,src_ip,dst_ip,bytes_sent,bytes_received,action,application,url_category,user_id,total_traffic_byte
66527,2025-08-11 11:00:00,10.0.135.112,172.16.192.192,21252.0,21087.0,allow,ftp,news,user306,42339.0
66528,2025-08-11 12:00:00,10.0.73.163,172.16.35.73,74696.0,85451.0,allow,ssl,unknown,user228,160147.0
66529,2025-08-11 13:00:00,10.0.192.1,172.16.130.86,81963.0,97926.0,allow,snmp,news,user24,179889.0
66530,2025-08-11 14:00:00,10.0.140.253,172.16.6.215,67884.0,75313.0,allow,web-browsing,malware,user553,143197.0
66531,2025-08-11 15:00:00,10.0.71.72,172.16.186.133,49910.0,9898.0,allow,snmp,malware,user255,59808.0


## Feature Selection

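We'll start with a simple set of numeric features:
- `bytes_sent`
- `bytes_received`
- `total_traffic_bytes` (the sum of the two byte counts)
- `hour_of_day` (derived from `timestamp`)

The byte counts represent how much data is going out and coming in, and the hour captures when the traffic happened.  
We'll add more features later if needed.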

In [12]:
df["bytes_sent"] = df["bytes_sent"].fillna(0)
df["bytes_received"] = df["bytes_received"].fillna(0)

In [13]:
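# The CSV ships a `total_traffic_byte` column (singular), so this check
# creates a separate `total_traffic_bytes` column, which the model uses.
if "total_traffic_bytes" not in df.columns:
    df["total_traffic_bytes"] = df["bytes_sent"] + df["bytes_received"]
df["hour_of_day"] = df["timestamp"].dt.hour
num_features = ["bytes_sent", "bytes_received", "total_traffic_bytes", "hour_of_day"]
X = df[num_features].fillna(0)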

## Feature Scaling
We scale the features so that large values (like bytes) do not overpower the model.

In [14]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [15]:
X_scaled

array([[-0.10587809, -0.10733896, -0.11391339, -1.66107401],
       [-0.10587809, -0.10733896, -0.11391339, -1.51660711],
       [-0.10587809, -0.10733896, -0.11391339, -1.37214021],
       ...,
       [11.69475434, 13.85847144, 13.65711127,  0.21699565],
       [ 9.66772868, 10.63349676, 10.84823226,  0.36146255],
       [ 7.07991994,  1.3042738 ,  4.46456246,  0.50592945]],
      shape=(66532, 4))
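
To make the scaling step concrete, here is a minimal sketch of the calculation `StandardScaler` applies to each column: subtract the column mean and divide by the population standard deviation. The byte values are made up for illustration and are not taken from the firewall data, and `standardize` is a hypothetical helper, not part of scikit-learn.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:  # constant column: every z-score is 0
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]

# Hypothetical bytes_sent values with one very large transfer.
bytes_sent = [1_200, 1_500, 1_100, 1_300, 250_000]
z_scores = standardize(bytes_sent)
print([round(z, 2) for z in z_scores])
```

After scaling, the column has mean 0 and standard deviation 1, and the one large transfer gets the highest z-score.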

As an initial step, I want to surface the records that are unusual. For this I am not choosing a heavyweight trained model.
I simply chose Isolation Forest for its ease of use.
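
The idea behind Isolation Forest can be sketched in one dimension: keep splitting the data at random points and count how many splits it takes until a record is alone. Unusual records tend to be isolated after very few splits. The sketch below is a simplified illustration on made-up byte counts, not the pyod implementation, which works on several features and builds many trees on subsamples. The helpers `isolation_depth` and `average_depth` are hypothetical names.

```python
import random

def isolation_depth(values, target, rng):
    """Count random splits needed until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1:
        lo, hi = min(current), max(current)
        if lo == hi:  # identical values cannot be separated further
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

def average_depth(values, target, trials=200, seed=42):
    """Average the isolation depth of `target` over many random trials."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(trials)) / trials

# Hypothetical traffic: 50 similar byte counts plus one very large transfer.
normal = [20_000 + 100 * i for i in range(50)]
outlier = 500_000
values = normal + [outlier]

outlier_depth = average_depth(values, outlier)
typical_depth = average_depth(values, 22_500)
print(f"outlier: {outlier_depth:.2f} splits, typical: {typical_depth:.2f} splits")
```

The large transfer is usually separated by the very first split, while a typical record needs several more. A short average path is what the model treats as a sign of an anomaly.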

In [16]:
from pyod.models.iforest import IForest

model = IForest(contamination=0.02, random_state=42)
model.fit(X_scaled)

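# pyod convention: predict() returns 0 for inliers and 1 for outliers
df["anomaly"] = model.predict(X_scaled)
# Higher decision_function() values mean more anomalous
df["anomaly_score"] = model.decision_function(X_scaled)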

In [17]:
df.tail()

Unnamed: 0,timestamp,src_ip,dst_ip,bytes_sent,bytes_received,action,application,url_category,user_id,total_traffic_byte,total_traffic_bytes,hour_of_day,anomaly,anomaly_score
66527,2025-08-11 11:00:00,10.0.135.112,172.16.192.192,21252.0,21087.0,allow,ftp,news,user306,42339.0,42339.0,11,1,0.160813
66528,2025-08-11 12:00:00,10.0.73.163,172.16.35.73,74696.0,85451.0,allow,ssl,unknown,user228,160147.0,160147.0,12,1,0.29494
66529,2025-08-11 13:00:00,10.0.192.1,172.16.130.86,81963.0,97926.0,allow,snmp,news,user24,179889.0,179889.0,13,1,0.297328
66530,2025-08-11 14:00:00,10.0.140.253,172.16.6.215,67884.0,75313.0,allow,web-browsing,malware,user553,143197.0,143197.0,14,1,0.286194
66531,2025-08-11 15:00:00,10.0.71.72,172.16.186.133,49910.0,9898.0,allow,snmp,malware,user255,59808.0,59808.0,15,1,0.174371
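
As a rough illustration of what `contamination=0.02` means for the `anomaly` column: about 2% of the rows with the highest scores end up labeled 1. With 66,532 rows that is roughly 1,330 records. The sketch below is a simplification on made-up scores. It assumes a plain "top fraction" rule, whereas pyod derives a score threshold from a percentile of the training scores, so counts can differ slightly. `flag_top_fraction` is a hypothetical helper.

```python
def flag_top_fraction(scores, contamination):
    """Label the highest-scoring fraction of rows as 1 (outlier), the rest 0."""
    n_flagged = int(round(contamination * len(scores)))
    # Row indices ordered from most to least anomalous.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_flagged])
    return [1 if i in flagged else 0 for i in range(len(scores))]

# Hypothetical anomaly scores for 100 rows (higher = more anomalous).
scores = [0.01 * i for i in range(100)]
labels = flag_top_fraction(scores, contamination=0.02)
print("rows flagged:", sum(labels))
```

Because the share of flagged rows is fixed by `contamination`, the model always flags some records, even if nothing in the data is truly malicious.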


### Sanity Check

As far as I know, traffic that the firewall dropped or denied could indicate suspicious activity. We do not have any labeled attacks,
but we do have the `action` column, so we can compare the deny/drop rows against the anomaly flag.

In [18]:
from sklearn.metrics import precision_recall_fscore_support

proxy_y = df["action"].isin(["deny", "drop"]).astype(int)
precision, recall, f1, _ = precision_recall_fscore_support(proxy_y, df["anomaly"], average="binary")

print(f"Precision: {precision:.2f}")
print(f"Recall   : {recall:.2f}")
print(f"F1‑score : {f1:.2f}")

Precision: 0.15
Recall   : 1.00
F1‑score : 0.26


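These numbers are not a true accuracy measure. They only show how well the flagged outliers align with the traffic the firewall already tagged as deny or drop. A recall of 1.00 means every deny/drop row was flagged as an anomaly, while a precision of 0.15 means only about 15% of the flagged rows were deny/drop traffic.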
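
To show how precision, recall, and F1 relate to each other in a check like this one, here is a small hand-rolled version of the calculation on toy data. The lists are made up for illustration, and `precision_recall_f1` is a hypothetical helper, not the scikit-learn function used above.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute binary precision, recall, and F1 from lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical toy data: 2 deny/drop rows, 5 rows flagged as anomalies.
proxy = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
flagged = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_precision, toy_recall, toy_f1 = precision_recall_f1(proxy, flagged)
print(f"precision={toy_precision:.2f} recall={toy_recall:.2f} f1={toy_f1:.2f}")
```

In this toy case both deny/drop rows are flagged, so recall is 1.00, but three of the five flagged rows are allowed traffic, so precision is only 0.40. The real result has the same shape: full recall with low precision.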

In [None]:
import joblib
joblib.dump(model, "model.joblib")
joblib.dump(scaler, "scaler.joblib")
print("\nSaved artefacts:")
for fname in ("model.joblib", "scaler.joblib"):
    obj = joblib.load(fname)
    print(f"{fname} →", type(obj))


Saved artefacts:
model.joblib → <class 'pyod.models.iforest.IForest'>
scaler.joblib → <class 'sklearn.preprocessing._data.StandardScaler'>


In [20]:
import joblib
model_or_data = joblib.load("./model.joblib")
print("Model Type:", type(model_or_data))
print("Model:", model_or_data)
print("Model Parameters:", model_or_data.get_params())

Model Type: <class 'pyod.models.iforest.IForest'>
Model: IForest(behaviour='old', bootstrap=False, contamination=0.02,
    max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
    random_state=42, verbose=0)
Model Parameters: {'behaviour': 'old', 'bootstrap': False, 'contamination': 0.02, 'max_features': 1.0, 'max_samples': 'auto', 'n_estimators': 100, 'n_jobs': 1, 'random_state': 42, 'verbose': 0}


In [21]:
import joblib
model_or_data = joblib.load("./scaler.joblib")
print("Model Type:", type(model_or_data))
print("Model:", model_or_data)
print("Model Parameters:", model_or_data.get_params())

Model Type: <class 'sklearn.preprocessing._data.StandardScaler'>
Model: StandardScaler()
Model Parameters: {'copy': True, 'with_mean': True, 'with_std': True}
