In [11]:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score, confusion_matrix

print("Project 8 Research Evaluation Ready")


Project 8 Research Evaluation Ready


In [18]:
# --- Z-Score Method ---

# Standardize packet size; pandas .std() uses the sample standard deviation (ddof=1)
df["zscore_packet"] = (df["packet_size"] - df["packet_size"].mean()) / df["packet_size"].std()

# Flag an anomaly when |z-score| > 2 (1 = anomaly, 0 = normal)
df["zscore_prediction"] = (df["zscore_packet"].abs() > 2).astype(int)

df


Unnamed: 0,packet_size,duration,bytes,true_label,ml_prediction,zscore_packet,zscore_prediction
0,500,2.1,3000,0,0,-0.503545,0
1,520,2.3,3100,0,0,-0.474307,0
2,510,2.0,3050,0,0,-0.488926,0
3,495,2.2,2980,0,0,-0.510855,0
4,505,2.1,3020,0,0,-0.496236,0
5,2000,10.5,20000,1,1,1.689313,0
6,480,1.8,2900,0,0,-0.532783,0
7,490,2.0,2950,0,0,-0.518164,0
8,2100,12.0,25000,1,1,1.835504,0


In [12]:
data = [
    # packet_size, duration, bytes, true_label
    [500, 2.1, 3000, 0],
    [520, 2.3, 3100, 0],
    [510, 2.0, 3050, 0],
    [495, 2.2, 2980, 0],
    [505, 2.1, 3020, 0],
    [2000, 10.5, 20000, 1],  # anomaly
    [480, 1.8, 2900, 0],
    [490, 2.0, 2950, 0],
    [2100, 12.0, 25000, 1],  # anomaly
]

df = pd.DataFrame(data, columns=[
    "packet_size", "duration", "bytes", "true_label"
])

df


Unnamed: 0,packet_size,duration,bytes,true_label
0,500,2.1,3000,0
1,520,2.3,3100,0
2,510,2.0,3050,0
3,495,2.2,2980,0
4,505,2.1,3020,0
5,2000,10.5,20000,1
6,480,1.8,2900,0
7,490,2.0,2950,0
8,2100,12.0,25000,1


In [13]:
features = df[["packet_size", "duration", "bytes"]]

# contamination is the assumed share of anomalies; random_state fixes the random splits
model = IsolationForest(contamination=0.2, random_state=42)
model.fit(features)

# predict() returns 1 for inliers and -1 for outliers; remap to 0 = normal, 1 = anomaly
df["ml_prediction"] = (model.predict(features) == -1).astype(int)

df


Unnamed: 0,packet_size,duration,bytes,true_label,ml_prediction
0,500,2.1,3000,0,0
1,520,2.3,3100,0,0
2,510,2.0,3050,0,0
3,495,2.2,2980,0,0
4,505,2.1,3020,0,0
5,2000,10.5,20000,1,1
6,480,1.8,2900,0,0
7,490,2.0,2950,0,0
8,2100,12.0,25000,1,1


In [14]:
precision = precision_score(df["true_label"], df["ml_prediction"])
recall = recall_score(df["true_label"], df["ml_prediction"])
conf_matrix = confusion_matrix(df["true_label"], df["ml_prediction"])

print("Precision:", precision)
print("Recall:", recall)
print("\nConfusion Matrix:")
print(conf_matrix)


Precision: 1.0
Recall: 1.0

Confusion Matrix:
[[7 0]
 [0 2]]


In [16]:
df.columns


Index(['packet_size', 'duration', 'bytes', 'true_label', 'ml_prediction'], dtype='str')

In [19]:
# No positives are predicted here, so precision is 0/0;
# zero_division=0 reports it as 0.0 without raising a warning
z_precision = precision_score(df["true_label"], df["zscore_prediction"], zero_division=0)
z_recall = recall_score(df["true_label"], df["zscore_prediction"])
z_conf_matrix = confusion_matrix(df["true_label"], df["zscore_prediction"])

print("Z-Score Precision:", z_precision)
print("Z-Score Recall:", z_recall)
print("Z-Score Confusion Matrix:")
print(z_conf_matrix)


Z-Score Precision: 0.0
Z-Score Recall: 0.0
Z-Score Confusion Matrix:
[[7 0]
 [2 0]]


Project 8 Research-Style Evaluation Study

Comparative Evaluation of Statistical and Machine Learning Methods for Network Traffic Anomaly Detection

Abstract
This study compares a traditional statistical anomaly detection method (Z-Score) with an unsupervised machine learning approach (Isolation Forest) for identifying anomalous network traffic behavior. Using nine simulated connection-level traffic records, two of them anomalous, both methods were assessed with precision, recall, and the confusion matrix. Isolation Forest detected both anomalies with no false positives (precision and recall of 1.0), whereas the Z-Score threshold detected neither (recall of 0.0).

Methodology
The dataset consisted of simulated network traffic observations containing:

Packet size

Connection duration

Total bytes transferred

Ground-truth anomaly labels

Two connections were intentionally designed to represent anomalous behavior through unusually large packet sizes and high data transfer volumes.

Two detection approaches were implemented:

1. Statistical Z-Score Method
The Z-Score method computed standardized values for packet size and flagged anomalies when:

|z| > 2

This method assumes that anomalous behavior significantly deviates from the statistical distribution of the dataset.

2. Isolation Forest
An unsupervised Isolation Forest model was trained using:

Packet size

Duration

Bytes transferred

The model isolates anomalies by recursively partitioning the feature space with random splits and flagging observations that require fewer splits to isolate than the rest.
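The isolation idea can be illustrated with a minimal standard-library sketch. It is not scikit-learn's implementation: it omits subsampling and score normalization, and the helper names `isolation_depth` and `mean_depth`, the trial count, and the seed are assumptions made for illustration. It only counts how many random splits are needed before a given row is alone in its partition.

```python
import random

# Feature rows from the study: (packet_size, duration, bytes)
rows = [
    (500, 2.1, 3000), (520, 2.3, 3100), (510, 2.0, 3050),
    (495, 2.2, 2980), (505, 2.1, 3020), (2000, 10.5, 20000),
    (480, 1.8, 2900), (490, 2.0, 2950), (2100, 12.0, 25000),
]

def isolation_depth(point, data, rng, depth=0, max_depth=12):
    """Count the random splits made before `point` is alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))      # pick a random feature
    low = min(row[feature] for row in data)
    high = max(row[feature] for row in data)
    if low == high:                          # simplification: stop on a constant feature
        return depth
    split = rng.uniform(low, high)           # pick a random split value
    # Keep only the rows that fall on the same side of the split as `point`
    same_side = [
        row for row in data
        if (row[feature] < split) == (point[feature] < split)
    ]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def mean_depth(point, data, n_trials=500, seed=0):
    """Average the isolation depth of `point` over many random partitionings."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, data, rng) for _ in range(n_trials))
    return total / n_trials

depths = [mean_depth(row, rows) for row in rows]
for row, depth in zip(rows, depths):
    print(row, round(depth, 2))
```

On this data the two large-transfer rows should come out with the smallest average depth, which is the property the model turns into an anomaly score.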

Results
Z-Score Method Performance
Precision: 0.0
Recall: 0.0

Confusion Matrix:

[[7 0]
 [2 0]]
The Z-Score method failed to detect either of the two true anomalies. All seven normal connections were correctly classified, but no connection was flagged as anomalous, so precision is formally undefined (0/0) and is reported as 0.0.

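This occurred because the two anomalies inflated the very statistics used to detect them. With all nine observations included, the packet-size mean is about 844 and the sample standard deviation about 684, so the anomalous connections score only z ≈ 1.69 and z ≈ 1.84, both below the threshold of 2. This masking effect is especially pronounced in small samples.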
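The effect can be checked with a short standard-library sketch. The second half is a hypothetical comparison that was not part of the study: it computes the mean and standard deviation from the normal rows only, using the ground-truth labels, purely to show how much the two anomalies shift the statistics.

```python
import statistics

packet_sizes = [500, 520, 510, 495, 505, 2000, 480, 490, 2100]
true_labels = [0, 0, 0, 0, 0, 1, 0, 0, 1]

# Statistics over all nine rows, anomalies included (as in the study)
mean_all = statistics.mean(packet_sizes)
std_all = statistics.stdev(packet_sizes)  # sample std, same default as pandas .std()
z_all = [(x - mean_all) / std_all for x in packet_sizes]

# Hypothetical baseline: statistics over the normal rows only
normal_sizes = [x for x, label in zip(packet_sizes, true_labels) if label == 0]
mean_normal = statistics.mean(normal_sizes)
std_normal = statistics.stdev(normal_sizes)
z_baseline = [(x - mean_normal) / std_normal for x in packet_sizes]

print(f"all rows:    mean={mean_all:.1f}, std={std_all:.1f}")        # mean=844.4, std=684.0
print(f"normal rows: mean={mean_normal:.1f}, std={std_normal:.1f}")  # mean=500.0, std=13.2

# Apply the same |z| > 2 rule to both sets of z-scores
flags_all = [int(abs(z) > 2) for z in z_all]
flags_baseline = [int(abs(z) > 2) for z in z_baseline]
print(flags_all)       # [0, 0, 0, 0, 0, 0, 0, 0, 0]
print(flags_baseline)  # [0, 0, 0, 0, 0, 1, 0, 0, 1]
```

The baseline variant relies on labels that a real detector would not have, so it serves as a diagnostic of the masking effect rather than as an alternative detection method.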

Isolation Forest Performance
Precision: 1.0
Recall: 1.0

Confusion Matrix:

[[7 0]
 [0 2]]
The Isolation Forest successfully identified both anomalous connections without producing false positives.
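Both sets of metrics follow directly from the confusion matrices reported above. The sketch below recomputes them by hand, assuming the scikit-learn layout `[[TN, FP], [FN, TP]]`; the helper name `precision_recall` is illustrative, and the 0/0 case is mapped to 0.0 to match the values reported for the Z-Score method.

```python
def precision_recall(conf_matrix):
    """Derive precision and recall from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = conf_matrix
    # Guard the 0/0 cases (no predicted or no actual positives) by reporting 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

z_score_matrix = [[7, 0], [2, 0]]
isolation_forest_matrix = [[7, 0], [0, 2]]

print(precision_recall(z_score_matrix))           # (0.0, 0.0)
print(precision_recall(isolation_forest_matrix))  # (1.0, 1.0)
```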

Discussion
The experimental results demonstrate a key cybersecurity insight:

Statistical threshold-based methods may underperform in small, skewed, or behaviorally complex datasets. Because the Z-Score is computed from the sample mean and standard deviation, the anomalies themselves inflate both parameters and reduce detection sensitivity.

In contrast, Isolation Forest does not rely on distributional assumptions, although its contamination parameter (0.2 in this study) still encodes an expected share of anomalies. It detects structural differences in behavior across all three features, making it more robust for intrusion detection scenarios where labeled attack data is limited or traffic patterns evolve dynamically.

Conclusion
This research-style evaluation highlights the advantage of an unsupervised machine learning approach over a basic statistical technique for behavioral anomaly detection in simulated network traffic.

While statistical methods are simple and computationally inexpensive, unsupervised models such as Isolation Forest provide stronger detection capability in realistic cybersecurity environments.

These findings support the adoption of machine learning-based anomaly detection systems in modern network security monitoring pipelines.