# **TrafficNet: A Machine Learning Approach to Network Traffic Classification**
<h2>
Name: Ayushman Anupam</h2>

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.graph_objects as go
from sklearn.preprocessing import LabelEncoder
import plotly.express as px
import ipaddress
import json

import xgboost as xgb
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

import joblib
from sklearn.metrics import accuracy_score, classification_report, f1_score, precision_score, recall_score, r2_score

### ----------- Loading Data -------------

In [None]:
file_path = r"netflow.csv"
df=pd.read_csv(file_path)
print("Shape of Datafile:",df.shape)

<div style="border:2px solid #3344ffff; padding:10px; border-radius:8px; overflow-x:auto; width:100%; box-sizing:border-box;">

## -------- Description of the Data --------

- **Dataset File**: `netflow.csv`  
- **Dataset Size**: `2,16,23,118` rows × `14` columns  

In [None]:
df.info()

In [None]:
df.head()

In [None]:
print("Column names:", df.columns.tolist())

### Column Description
* `IPV4_SRC_ADDR`: Source IPv4 address from which the network flow originated.
* `L4_SRC_PORT`: Source port number associated with the transport layer (Layer 4) protocol.
* `IPV4_DST_ADDR`: Destination IPv4 address to which the network flow is directed.
* `L4_DST_PORT`: Destination port number used by the transport layer protocol (e.g., TCP/UDP).
* `PROTOCOL`: Network protocol number used for communication (e.g., 6 for TCP, 17 for UDP).
* `L7_PROTO`: Application layer (Layer 7) protocol identifier; represents the detected application protocol type.
* `IN_BYTES`: Total number of bytes received in the inbound direction for the flow.
* `OUT_BYTES`: Total number of bytes sent in the outbound direction for the flow.
* `IN_PKTS`: Number of packets received in the inbound direction.
* `OUT_PKTS`: Number of packets transmitted in the outbound direction.
* `TCP_FLAGS`: TCP control flags observed in the flow (e.g., SYN, ACK, FIN).
* `FLOW_DURATION_MILLISECONDS`: Total duration of the network flow measured in milliseconds.
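
NetFlow exporters commonly report `TCP_FLAGS` as the bitwise OR of all flag bits seen during a flow. Assuming this dataset follows that convention (the notebook does not confirm it), a value can be decoded as sketched below. `decode_tcp_flags` is a hypothetical helper, not part of the original notebook.

```python
# Standard TCP flag bit positions from the TCP header.
TCP_FLAG_BITS = {
    0x01: "FIN",
    0x02: "SYN",
    0x04: "RST",
    0x08: "PSH",
    0x10: "ACK",
    0x20: "URG",
    0x40: "ECE",
    0x80: "CWR",
}

def decode_tcp_flags(value: int) -> list[str]:
    """Return the names of the flag bits set in a cumulative TCP_FLAGS value."""
    return [name for bit, name in TCP_FLAG_BITS.items() if value & bit]

print(decode_tcp_flags(2))   # ['SYN']
print(decode_tcp_flags(27))  # ['FIN', 'SYN', 'PSH', 'ACK']
```

A flow with value 2 saw only SYN packets, which is typical of an unanswered connection attempt; 27 corresponds to a completed exchange that carried data.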

##### `Label`: Binary indicator where `0` represents *Benign* (normal traffic) and `1` represents *Attack* (malicious traffic).

##### `Attack`: Categorical class label specifying the exact type of traffic, e.g., *Benign* for normal activity or a specific *Attack Type* for malicious flows. This is the target the models in this notebook are trained on.


### About the Target Variable

The dataset contains **two related target fields** — `Label` and `Attack`.

* **`Label`** serves as the **binary classification target**, distinguishing between *Benign (0)* and *Attack (1)* network flows.
* **`Attack`** provides **multi-class or categorical context**, indicating the specific attack category when applicable.

`Label` simplifies the prediction task to “normal vs malicious,” while `Attack` allows finer-grained classification into individual attack types, making this dataset suitable for both **binary and multi-class intrusion detection** tasks.
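
The relationship between the two targets can be sketched on toy data. This assumes `Label` is always consistent with `Attack` (1 whenever `Attack` is not `Benign`), which the notebook does not verify explicitly; the `toy` frame is illustrative only.

```python
import pandas as pd

# Toy frame standing in for the `Attack` column of the real dataset.
toy = pd.DataFrame({"Attack": ["Benign", "DoS", "Benign", "Worms"]})

# Binary target: 0 for benign flows, 1 for any attack category.
toy["Label"] = (toy["Attack"] != "Benign").astype(int)

print(toy)
```

On the real data, the same expression could be compared against the existing `Label` column as a quick consistency check.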



In [None]:
df.describe()

In [None]:
numerical_cols = df.select_dtypes(include=np.number).columns.tolist()
categorical_cols = df.select_dtypes(include='object').columns.tolist()

print("Numerical Columns:",numerical_cols)
print("\nCategorical Columns:",categorical_cols)

In [None]:
distinct_counts = df.nunique()

print("Number of distinct entries per column:")
print(distinct_counts)

In [None]:
print("--- Null values BEFORE processing ---")
print("Null values per column")
print(df.isnull().sum())
print("\nTotal rows with at least one null value:",df.isnull().any(axis=1).sum())

<div style="border:2px solid #3344ffff; padding:10px; border-radius:8px;">

## -------- What questions are we addressing --------

* Can we effectively distinguish between **benign** and **malicious network flows** based on extracted statistical and protocol-level features (e.g., bytes, packets, flow duration, TCP flags)?
* Which **machine learning models** (here XGBoost, Random Forest, and K-Nearest Neighbours) perform best for **binary attack detection** and **multi-class attack classification**?
* What is the **impact of feature importance** (e.g., `IN_BYTES`, `OUT_BYTES`, `FLOW_DURATION_MILLISECONDS`) on predicting whether a flow is benign or malicious?
* What **insights and trends** emerge from exploratory data analysis (EDA) — such as volume of inbound/outbound traffic, flow durations, and port usage — that differentiate benign traffic from attack flows?

## -------- EDA and Plotting the data --------

In [None]:
# Count of each class
category_counts = df["Attack"].value_counts()
top_labels = category_counts.index 

fig, axes = plt.subplots(1, 2, figsize=(18, 6))
sns.countplot(
    data=df[df["Attack"].isin(top_labels)],
    x="Attack",
    hue="Attack",
    order=top_labels,
    palette="tab20",
    legend=False,
    ax=axes[0]
)

axes[0].set_title("Attack Distribution - All Classes (Count)")
axes[0].tick_params(axis='x', rotation=45)

for container in axes[0].containers:
    axes[0].bar_label(container, fmt='%d', label_type='edge', fontsize=10)

# Pie chart (all classes)
colors = sns.color_palette("tab20", len(category_counts))
wedges, _ = axes[1].pie(
    category_counts,
    startangle=140,
    colors=colors
)
axes[1].set_title("Attack Distribution - All Classes (Share)")

total = category_counts.sum()
legend_labels = [
    f"{cat} - {count/total:.1%}" for cat, count in category_counts.items()
]

axes[1].legend(
    wedges,
    legend_labels,
    title="Attack",
    loc="center left",
    bbox_to_anchor=(1, 0, 0.5, 1),
    ncol=2,
    frameon=False
)

plt.tight_layout()
plt.show()


In [None]:
# Encode categorical columns
df_encoded = df.copy()
label_encoders = {}

for col in df_encoded.columns:
    if df_encoded[col].dtype == 'object':
        le = LabelEncoder()
        df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))
        label_encoders[col] = le

# Compute correlation matrix
corr = df_encoded.corr()

# Plot interactive correlation heatmap
fig = go.Figure(
    data=go.Heatmap(
        z=corr.values,
        x=corr.columns,
        y=corr.columns,
        colorscale="RdBu",
        zmin=-1,
        zmax=1,
        colorbar=dict(title="Correlation"),
        hovertemplate=(
            "Feature 1: %{x}<br>"
            "Feature 2: %{y}<br>"
            "Correlation: %{z:.3f}<extra></extra>"
        )
    )
)

fig.update_layout(
    title="Interactive Correlation Heatmap — Network Traffic & Malware Detection Dataset",
    width=1100,
    height=600,
    xaxis=dict(tickangle=45),
    template="plotly_dark"
)

fig.show()


In [None]:
# Top 20 ports
top_src_ports = df['L4_SRC_PORT'].value_counts().head(20)
top_dst_ports = df['L4_DST_PORT'].value_counts().head(20)

fig, axes = plt.subplots(1, 2, figsize=(18,6))

# Source ports: bar chart of the 20 most frequent values
sns.barplot(x=top_src_ports.index, y=top_src_ports.values, color='teal', ax=axes[0])
axes[0].set_title("Top 20 Source Ports")
axes[0].set_xlabel("L4_SRC_PORT")
axes[0].set_ylabel("Count")
axes[0].tick_params(axis='x', rotation=45)

# Destination ports: bar chart of the 20 most frequent values
sns.barplot(x=top_dst_ports.index, y=top_dst_ports.values, color='orange', ax=axes[1])
axes[1].set_title("Top 20 Destination Ports")
axes[1].set_xlabel("L4_DST_PORT")
axes[1].set_ylabel("Count")
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()


In [None]:
# IN_BYTES vs OUT_BYTES
fig = px.scatter(
    df,
    x="IN_BYTES",
    y="OUT_BYTES",
    color="Attack",
    title="Inbound vs Outbound Bytes by Attack Type",
    hover_data=["IPV4_SRC_ADDR", "IPV4_DST_ADDR", "FLOW_DURATION_MILLISECONDS"],
    height=400,
    width=1100,
    template="plotly_dark"
)
fig.show()

In [None]:
num_features = ['IN_BYTES', 'OUT_BYTES', 'IN_PKTS', 'OUT_PKTS', 'FLOW_DURATION_MILLISECONDS', 'TCP_FLAGS']

# First 4 features: 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(12, 6))
axes = axes.flatten()
for i, col in enumerate(num_features[:4]):
    sns.boxplot(data=df, x='Label', y=col, color='lightblue', ax=axes[i])
    axes[i].set_title(f"{col} Distribution by Traffic Type (0=Benign, 1=Attack)")
    axes[i].set_xlabel("Label")
    axes[i].set_ylabel(col)
plt.tight_layout()
plt.show()

# Last 2 features: 1x2 grid
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes = axes.flatten()
for i, col in enumerate(num_features[4:]):
    sns.boxplot(data=df, x='Label', y=col, color='lightgreen', ax=axes[i])
    axes[i].set_title(f"{col} Distribution by Traffic Type (0=Benign, 1=Attack)")
    axes[i].set_xlabel("Label")
    axes[i].set_ylabel(col)
plt.tight_layout()
plt.show()


In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14,4))
axes[0].hist(df['IN_BYTES'], bins=50, color='teal')
axes[0].set_title("Inbound Bytes Distribution")
axes[0].set_xlabel("IN_BYTES")
axes[0].set_ylabel("Count")
axes[1].hist(df['OUT_BYTES'], bins=50, color='orange')
axes[1].set_title("Outbound Bytes Distribution")
axes[1].set_xlabel("OUT_BYTES")
axes[1].set_ylabel("Count")
plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14,4))
axes[0].hist(df['IN_PKTS'], bins=50, color='slateblue')
axes[0].set_title("Inbound Packets Distribution")
axes[0].set_xlabel("IN_PKTS")
axes[0].set_ylabel("Count")
axes[1].hist(df['OUT_PKTS'], bins=50, color='salmon')
axes[1].set_title("Outbound Packets Distribution")
axes[1].set_xlabel("OUT_PKTS")
axes[1].set_ylabel("Count")
plt.tight_layout()
plt.show()

## ------ Balancing the Data -------

In [None]:
label_percent = df['Label'].value_counts(normalize=True) * 100
print("\nPercentage:\n", label_percent.round(2))

In [None]:
# Percentage calculation
label_counts = df['Attack'].value_counts()
label_percent = df['Attack'].value_counts(normalize=True) * 100
print("\nPercentage:\n", label_percent.round(2))

In [None]:
min_prop = 0.06
total_len = len(df)
min_count = int(total_len * min_prop)
dfs = []
for attack_class, group in df.groupby('Attack'):
    if len(group) < min_count:
        oversampled = group.sample(min_count, replace=True, random_state=42)
        dfs.append(oversampled)
    else:
        dfs.append(group)
df_balanced = pd.concat(dfs).sample(frac=1, random_state=42).reset_index(drop=True)

# Check new class percentages
new_percent = df_balanced['Attack'].value_counts(normalize=True) * 100
print("New class distribution (%)\n", new_percent.round(2))


In [None]:
label_percent = df_balanced['Label'].value_counts(normalize=True) * 100
print("\nPercentage:\n", label_percent.round(2))

## ------- Train-Test Split ----------

In [None]:
df_balanced.columns

In [None]:
# Desired test percentages
test_percentages = {
    'Benign': 0.055,
    'Exploits': 0.005,
    'Fuzzers': 0.005,
    'Reconnaissance': 0.005,
    'Generic': 0.005,
    'DoS': 0.005,
    'Analysis': 0.005,
    'Backdoor': 0.005,
    'Shellcode': 0.005,
    'Worms': 0.005
}

test_list = []
train_list = []

df_balanced = df_balanced.drop(columns=['Label'])
for cls, frac in test_percentages.items():
    cls_data = df_balanced[df_balanced['Attack'] == cls]
    n_samples = int(len(df_balanced) * frac)
    if n_samples > len(cls_data):
        n_samples = len(cls_data)  # cannot draw more rows than the class contains
    test_samples = cls_data.sample(n=n_samples, random_state=42)
    train_samples = cls_data.drop(test_samples.index)
    test_list.append(test_samples)
    train_list.append(train_samples)

# Combine
test_df = pd.concat(test_list).sample(frac=1, random_state=42).reset_index(drop=True)
train_df = pd.concat(train_list).sample(frac=1, random_state=42).reset_index(drop=True)

# Verify distributions
train_percent = train_df['Attack'].value_counts(normalize=True) * 100
test_percent = test_df['Attack'].value_counts(normalize=True) * 100

print("Train class distribution (%)\n", train_percent.round(2))
print("\nTest class distribution (%)\n", test_percent.round(2))
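
The cells above oversample first and split afterwards, so a duplicated minority-class row can end up in both the training and the test set. A minimal sketch of one alternative ordering follows: split first, then oversample only the training part. It uses toy data and scikit-learn's `train_test_split` rather than the per-class fractions above; `oversample_to_floor` and `flow_id` are hypothetical names introduced for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def oversample_to_floor(frame: pd.DataFrame, target: str, min_prop: float,
                        seed: int = 42) -> pd.DataFrame:
    """Resample (with replacement) every class smaller than min_prop of the frame."""
    min_count = int(len(frame) * min_prop)
    parts = []
    for _, group in frame.groupby(target):
        if len(group) < min_count:
            group = group.sample(min_count, replace=True, random_state=seed)
        parts.append(group)
    return pd.concat(parts).sample(frac=1, random_state=seed)

# Toy data: 90 benign rows and 10 rare-attack rows, each with a unique id.
toy = pd.DataFrame({
    "flow_id": range(100),
    "Attack": ["Benign"] * 90 + ["Worms"] * 10,
})

# 1) Split first, stratified so the rare class appears in both parts.
train_part, test_part = train_test_split(
    toy, test_size=0.2, stratify=toy["Attack"], random_state=42
)

# 2) Oversample only the training part; the test part keeps unique rows.
train_bal = oversample_to_floor(train_part, "Attack", min_prop=0.25)

overlap = set(train_bal["flow_id"]) & set(test_part["flow_id"])
print("rows shared between train and test:", len(overlap))
print(train_bal["Attack"].value_counts())
```

With this ordering, test-set scores reflect flows the model has not seen during training, at the cost of a test set that keeps the original class imbalance.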


<div style="border:2px solid #3344ffff; padding:10px; border-radius:8px; overflow-x:auto; width:100%; box-sizing:border-box;">

# Models Used 
**Model 01: Boosting Classifier - XGBoost**<br>
**Model 02: Bagging Classifier - Random Forest**<br>
**Model 03: Instance-based Classifier - K-Nearest Neighbour (KNN)**


## --------- Model 01: Boosting Classifier - XGBoost ----------

In [None]:
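feature_cols = [col for col in df_balanced.columns if col not in ['Label', 'Attack']]

# Copy the slices so the in-place IP conversion below works on independent
# frames and does not trigger pandas' SettingWithCopyWarning.
X_train = train_df[feature_cols].copy()
X_test  = test_df[feature_cols].copy()

# Encode the attack names as integer class ids (fit on the training labels only)
le_attack = LabelEncoder()
y_train = le_attack.fit_transform(train_df['Attack'])
y_test  = le_attack.transform(test_df['Attack'])

def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))

for col in ['IPV4_SRC_ADDR', 'IPV4_DST_ADDR']:
    X_train[col] = X_train[col].apply(ip_to_int)
    X_test[col] = X_test[col].apply(ip_to_int)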

In [None]:
# XGBoost classifier
xgb_clf = xgb.XGBClassifier(
    objective="multi:softmax", 
    num_class=len(le_attack.classes_),
    eval_metric="mlogloss",   
    random_state=42
)

# Train
xgb_clf.fit(X_train, y_train)

In [None]:
joblib.dump(xgb_clf, "xgb_clf.pkl")
print("XGBoost model saved as xgb_clf.pkl")

xgb_clf = joblib.load("xgb_clf.pkl")
print("XGBoost model loaded successfully")

In [None]:
# Prediction
y_pred_xg = xgb_clf.predict(X_test)

# Metrics
acc = accuracy_score(y_test, y_pred_xg )
prec = precision_score(y_test, y_pred_xg, average='weighted', zero_division=0)
rec = recall_score(y_test, y_pred_xg, average='weighted', zero_division=0)
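# Note: r2_score treats the encoded class ids as numeric values, so this
# "pseudo" R² depends on the arbitrary label ordering. Accuracy, precision,
# and recall are the meaningful metrics for this task.
r2 = r2_score(y_test, y_pred_xg)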
n = len(y_test)
p = X_test.shape[1]
adj_r2 = 1 - (1-r2)*(n-1)/(n-p-1)

print(f"Accuracy: {acc:.4f}")
print(f"Precision (weighted): {prec:.4f}")
print(f"Recall (weighted): {rec:.4f}")
print(f"R² (pseudo): {r2:.4f}")
print(f"Adjusted R2: {adj_r2:.4f}")
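
The R² values printed by these metric cells are computed on integer class codes, so they change when the classes are numbered differently even though the predictions are the same. The toy example below (made-up labels, not taken from the dataset) illustrates this: accuracy is identical under both encodings, while R² is not.

```python
import numpy as np
from sklearn.metrics import accuracy_score, r2_score

# Three classes encoded as 0, 1, 2; one sample of class 0 is predicted as class 2.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 0, 1, 1, 2, 2])

# Renumber the classes (swap codes 1 and 2); the underlying predictions are unchanged.
swap = np.array([0, 2, 1])
y_true_alt, y_pred_alt = swap[y_true], swap[y_pred]

acc_a = accuracy_score(y_true, y_pred)
acc_b = accuracy_score(y_true_alt, y_pred_alt)
r2_a = r2_score(y_true, y_pred)
r2_b = r2_score(y_true_alt, y_pred_alt)

print(f"accuracy: {acc_a:.3f} vs {acc_b:.3f}")
print(f"R2:       {r2_a:.3f} vs {r2_b:.3f}")
```

For that reason the R² and adjusted R² columns are best read as a rough indicator only when comparing models.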

## ------------- Model 02: Bagging classifier - Random Forest ---------------

In [None]:
# Train Random Forest
rf_clf = RandomForestClassifier(
    n_estimators=250,       
    max_depth=None,         
    random_state=42,
    n_jobs=-1             
)
rf_clf.fit(X_train, y_train)

In [None]:
# Save model
joblib.dump(rf_clf, "rf_clf.pkl")
print("Random Forest model saved as rf_clf.pkl")

# Load model
rf_clf = joblib.load("rf_clf.pkl")
print("Random Forest model loaded successfully")

In [None]:
y_pred_rf = rf_clf.predict(X_test)

# Metrics
acc = accuracy_score(y_test, y_pred_rf)
prec = precision_score(y_test, y_pred_rf, average='weighted', zero_division=0)
rec = recall_score(y_test, y_pred_rf, average='weighted', zero_division=0)
r2 = r2_score(y_test, y_pred_rf)
n = len(y_test)
p = X_test.shape[1]
adj_r2 = 1 - (1-r2)*(n-1)/(n-p-1)

print(f"Accuracy: {acc:.4f}")
print(f"Precision (weighted): {prec:.4f}")
print(f"Recall (weighted): {rec:.4f}")
print(f"R² (pseudo): {r2:.4f}")
print(f"Adjusted R2: {adj_r2:.4f}")

## ------------ Model 03: Instance-based Classifier - K-Nearest Neighbour (KNN) ------------------

In [None]:
# KNN Classifier
knn_clf = KNeighborsClassifier(
    n_neighbors=5,
    weights="distance",
    n_jobs=-1   
)

# Train
knn_clf.fit(X_train, y_train)

In [None]:
y_pred_knn = knn_clf.predict(X_test)

y_pred_list = y_pred_knn.tolist() 
with open("y_pred_knn.json", "w") as f:
    json.dump(y_pred_list, f)
print("Predictions saved to y_pred_knn.json")

with open("y_pred_knn.json", "r") as f:
    y_pred_knn = json.load(f)

In [None]:
# Metrics
acc = accuracy_score(y_test, y_pred_knn)
prec = precision_score(y_test, y_pred_knn, average='weighted', zero_division=0)
rec = recall_score(y_test, y_pred_knn, average='weighted', zero_division=0)
r2 = r2_score(y_test, y_pred_knn)
n = len(y_test)
p = X_test.shape[1]
adj_r2 = 1 - (1-r2)*(n-1)/(n-p-1)

print(f"Accuracy: {acc:.4f}")
print(f"Precision (weighted): {prec:.4f}")
print(f"Recall (weighted): {rec:.4f}")
print(f"R² (pseudo): {r2:.4f}")
print(f"Adjusted R²: {adj_r2:.4f}")

# -----  Multi class Result -------

In [None]:
# DataFrame for results
results_df = pd.DataFrame({
    "y_true": y_test,
    "y_pred_knn": y_pred_knn,
    "y_pred_rf": y_pred_rf,
    "y_pred_xg": y_pred_xg
})
# Decode the integer class ids back to attack names with the encoder fitted on the training labels
results_df_decoded = results_df.apply(le_attack.inverse_transform)
print(results_df_decoded.head())


## ------- Binary Result benign (0), malware (1) -------------

In [None]:
def binary_label(label):
    return 0 if label.lower() == "benign" else 1

# Use le_attack (the encoder fitted on the training labels), not the loop variable `le` from the EDA cell
results_df_decoded = results_df.copy()
results_df_decoded["y_true"] = le_attack.inverse_transform(results_df["y_true"])
results_df_decoded["y_pred_knn"] = le_attack.inverse_transform(results_df["y_pred_knn"])
results_df_decoded["y_pred_rf"] = le_attack.inverse_transform(results_df["y_pred_rf"])
results_df_decoded["y_pred_xg"] = le_attack.inverse_transform(results_df["y_pred_xg"])

results_0_1_df = pd.DataFrame({
    "y_true_binary": results_df_decoded["y_true"].apply(binary_label),
    "y_pred_knn_binary": results_df_decoded["y_pred_knn"].apply(binary_label),
    "y_pred_rf_binary": results_df_decoded["y_pred_rf"].apply(binary_label),
    "y_pred_xg_binary": results_df_decoded["y_pred_xg"].apply(binary_label),
})

print(results_0_1_df.head())


<div style="border:2px solid #3344ffff; padding:10px; border-radius:8px; overflow-x:auto; width:100%; box-sizing:border-box;">


# Result Discussion for Multiclass - Classification

In [None]:
def model_metrics_df(y_true, y_pred, model_name, X_test):
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    rec = recall_score(y_true, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    r2 = r2_score(y_true, y_pred)

    n = len(y_true)
    p = X_test.shape[1]
    adj_r2 = 1 - (1-r2)*(n-1)/(n-p-1)

    return pd.DataFrame([{
        "Model": model_name,
        "Accuracy": acc,
        "Precision_weighted": prec,
        "Recall_weighted": rec,
        "F1_weighted": f1,
        "R2_pseudo": r2,
        "Adj_R2": adj_r2
    }])

metrics_list = []
metrics_list.append(model_metrics_df(y_test, y_pred_knn, "KNN", X_test))
metrics_list.append(model_metrics_df(y_test, y_pred_rf, "RandomForest", X_test))
metrics_list.append(model_metrics_df(y_test, y_pred_xg, "XGBoost", X_test))
metrics_df = pd.concat(metrics_list, ignore_index=True)

metrics_df.to_csv("metrics_df.csv", index=False)
print("metrics_df saved as metrics_df.csv")
metrics_df

In [None]:
sns.set_style("white")
metric_pairs = [
    ("Accuracy", "F1_weighted"),
    ("Precision_weighted", "Recall_weighted"),
    ("R2_pseudo", "Adj_R2")
]

fig, axes = plt.subplots(3, 2, figsize=(14, 7.5))
axes = axes.flatten()

for i, (m1, m2) in enumerate(metric_pairs):
    sns.barplot(
        data=metrics_df, x="Model", y=m1, hue="Model",
        palette="viridis", legend=False, ax=axes[2*i]
    )
    axes[2*i].set_title(f"{m1} Comparison")
    axes[2*i].set_ylabel(m1)
    axes[2*i].set_xlabel("")
    y_min, y_max = metrics_df[m1].min(), metrics_df[m1].max()
    axes[2*i].set_ylim(y_min - 0.1, y_max + 0.1)
    axes[2*i].grid(False)

    for container in axes[2*i].containers:
        axes[2*i].bar_label(container, fmt="%.3f", fontsize=12)

    sns.barplot(
        data=metrics_df, x="Model", y=m2, hue="Model",
        palette="magma", legend=False, ax=axes[2*i+1]
    )
    axes[2*i+1].set_title(f"{m2} Comparison")
    axes[2*i+1].set_ylabel(m2)
    axes[2*i+1].set_xlabel("")
    
    y_min, y_max = metrics_df[m2].min(), metrics_df[m2].max()
    axes[2*i+1].set_ylim(y_min - 0.1, y_max + 0.1)
    axes[2*i+1].grid(False)

    for container in axes[2*i+1].containers:
        axes[2*i+1].bar_label(container, fmt="%.3f", fontsize=12)

plt.tight_layout()
plt.show()


In [None]:
def model_report_df(y_true, y_pred, model_name, classes):
    """Return per-class precision, recall, and F1 under a `model_name` column level."""
    report_dict = classification_report(
        y_true, y_pred, target_names=classes,
        zero_division=0, output_dict=True
    )
    report = pd.DataFrame(report_dict).transpose()
    report = report.loc[classes, ["precision", "recall", "f1-score"]]  # only class rows
    report.columns = pd.MultiIndex.from_product([[model_name], report.columns])
    return report

# Per-class reports for each model (le_attack is the encoder fitted on the training labels)
report_list = []
report_list.append(model_report_df(y_test, y_pred_knn, "KNN", le_attack.classes_))
report_list.append(model_report_df(y_test, y_pred_rf, "RandomForest", le_attack.classes_))
report_list.append(model_report_df(y_test, y_pred_xg, "XGBoost", le_attack.classes_))

# Merge side by side
report_df_all = pd.concat(report_list, axis=1)
# Keep the index so the class names are written to the CSV
report_df_all.to_csv("report_df_all.csv")
print("report_df_all saved as report_df_all.csv")

report_df_all

<div style="border:2px solid #3344ffff; padding:10px; border-radius:8px;">

## Result Discussion for Binary Result benign (0), malware (1)

In [None]:
def model_metrics_df_binary(y_true, y_pred, model_name, X_test):
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, zero_division=0)   # binary precision
    rec = recall_score(y_true, y_pred, zero_division=0)       # binary recall
    f1 = f1_score(y_true, y_pred, zero_division=0)            # binary F1
    r2 = r2_score(y_true, y_pred)

    n = len(y_true)
    p = X_test.shape[1]
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

    return pd.DataFrame([{
        "Model": model_name,
        "Accuracy": acc,
        "Precision": prec,
        "Recall": rec,
        "F1": f1,
        "R2_pseudo": r2,
        "Adj_R2": adj_r2
    }])

# metrics DF for all models
metrics_list_01 = []
metrics_list_01.append(model_metrics_df_binary(results_0_1_df["y_true_binary"], results_0_1_df["y_pred_knn_binary"], "KNN", X_test))
metrics_list_01.append(model_metrics_df_binary(results_0_1_df["y_true_binary"], results_0_1_df["y_pred_rf_binary"], "RandomForest", X_test))
metrics_list_01.append(model_metrics_df_binary(results_0_1_df["y_true_binary"], results_0_1_df["y_pred_xg_binary"], "XGBoost", X_test))

metrics_df_01 = pd.concat(metrics_list_01, ignore_index=True)
metrics_df_01.to_csv("metrics_df_0_1.csv", index=False)
print("Binary (0-1) metrics saved as metrics_df_0_1.csv")
metrics_df_01


In [None]:
sns.set_style("white")
metric_pairs_01 = [
    ("Accuracy", "F1"),
    ("Precision", "Recall"),
    ("R2_pseudo", "Adj_R2")
]

fig, axes = plt.subplots(3, 2, figsize=(14, 7.5))
axes = axes.flatten()

for i, (m1, m2) in enumerate(metric_pairs_01):
    # Left plot (first metric)
    sns.barplot(
        data=metrics_df_01, x="Model", y=m1, hue="Model",
        palette="viridis", legend=False, ax=axes[2*i]
    )
    axes[2*i].set_title(f"{m1} Comparison (Binary 0–1)")
    axes[2*i].set_ylabel(m1)
    axes[2*i].set_xlabel("")
    y_min, y_max = metrics_df_01[m1].min(), metrics_df_01[m1].max()
    axes[2*i].set_ylim(y_min - 0.1, y_max + 0.1)
    axes[2*i].grid(False)

    for container in axes[2*i].containers:
        axes[2*i].bar_label(container, fmt="%.3f", fontsize=12)

    # Right plot (second metric)
    sns.barplot(
        data=metrics_df_01, x="Model", y=m2, hue="Model",
        palette="magma", legend=False, ax=axes[2*i + 1]
    )
    axes[2*i + 1].set_title(f"{m2} Comparison (Binary 0–1)")
    axes[2*i + 1].set_ylabel(m2)
    axes[2*i + 1].set_xlabel("")
    y_min, y_max = metrics_df_01[m2].min(), metrics_df_01[m2].max()
    axes[2*i + 1].set_ylim(y_min - 0.1, y_max + 0.1)
    axes[2*i + 1].grid(False)

    for container in axes[2*i + 1].containers:
        axes[2*i + 1].bar_label(container, fmt="%.3f", fontsize=12)

plt.tight_layout()
plt.show()


<div style="border:2px solid #3344ffff; padding:10px; border-radius:8px; overflow-x:auto; width:100%; box-sizing:border-box;">

## **Results Summary**

The experiments reveal that all three models — **KNN**, **Random Forest**, and **XGBoost** — performed strongly in both multiclass and binary setups, effectively identifying network traffic behavior patterns and distinguishing benign flows from attacks.

### **Multiclass Classification Results**

The multiclass evaluation shows that the models could **differentiate between multiple attack types** and benign traffic using flow-level statistics.
Key findings:

* **Random Forest** achieved the **highest accuracy (0.909)** and the strongest weighted precision (0.937) and F1-score (0.917), indicating consistent and balanced classification across all 10 traffic categories.
* **KNN** performed reasonably well (accuracy = 0.904, F1 = 0.909) but struggled slightly with less frequent attack categories, likely due to distance-based sensitivity to data imbalance.
* **XGBoost** achieved good precision (0.920) but showed a slight performance drop (accuracy = 0.886), reflecting moderate sensitivity to class imbalance and limited hyperparameter tuning.

Overall, **tree-based ensemble models** captured **nonlinear feature interactions and protocol-level differences** effectively, outperforming KNN in handling diverse and imbalanced traffic types.
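
The scores quoted above are weighted averages, which are dominated by the largest class. As an illustration on made-up labels (not the notebook's results), the sketch below shows how a weighted F1 can stay high while a rare class is mostly missed; the macro average and the per-class scores make that visible.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels: class 0 is common (90 samples), class 1 is rare (10 samples).
y_true = np.array([0] * 90 + [1] * 10)
# A model that finds every common sample but only 2 of the 10 rare ones.
y_pred = np.array([0] * 90 + [1] * 2 + [0] * 8)

weighted = f1_score(y_true, y_pred, average="weighted")
macro = f1_score(y_true, y_pred, average="macro")
per_class = f1_score(y_true, y_pred, average=None)

print(f"weighted F1: {weighted:.3f}")
print(f"macro F1:    {macro:.3f}")
print("per-class F1:", per_class.round(3))
```

Reporting macro-averaged or per-class scores alongside the weighted ones gives a clearer picture of minority-class behaviour.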


### **Binary Classification Results (Benign (0), Malware (1))**

When traffic types were simplified into **benign (0)** vs **malware (1)** categories, all models achieved **remarkably high performance**, confirming that the extracted features can **clearly separate normal and malicious flows**.

* **Random Forest** again led with the **best accuracy (0.996)** and F1 = 0.996, demonstrating exceptional precision and recall balance.
* **KNN** performed slightly lower (accuracy = 0.992, F1 = 0.991), but maintained near-perfect recall (0.9996), showing strong detection ability for malware.
* **XGBoost** achieved accuracy = 0.989 and F1 = 0.988, still highly reliable but marginally less stable compared to Random Forest.

These binary results confirm that the models generalize well when the task focuses on detecting **malicious vs benign** behavior, rather than distinguishing between specific attack categories.
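
Part of the gap between the multiclass and binary scores is mechanical: once predictions are collapsed to benign vs attack, confusing one attack type with another no longer counts as an error. A minimal sketch on made-up labels (the `to_binary` helper is hypothetical and mirrors `binary_label` above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true_names = np.array(["Benign", "DoS", "Exploits", "Benign", "Worms"])
y_pred_names = np.array(["Benign", "Exploits", "DoS", "Worms", "Worms"])

def to_binary(names: np.ndarray) -> np.ndarray:
    """Map attack names to 0 (benign) or 1 (any attack)."""
    return (np.char.lower(names) != "benign").astype(int)

y_true_bin = to_binary(y_true_names)
y_pred_bin = to_binary(y_pred_names)

multiclass_acc = (y_true_names == y_pred_names).mean()
binary_acc = (y_true_bin == y_pred_bin).mean()

# Rows are true classes, columns are predicted classes, in the order [0, 1].
tn, fp, fn, tp = confusion_matrix(y_true_bin, y_pred_bin, labels=[0, 1]).ravel()

print(f"multiclass accuracy: {multiclass_acc:.2f}")
print(f"binary accuracy:     {binary_acc:.2f}")
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Here the two DoS/Exploits mix-ups are errors in the multiclass view but correct detections in the binary view; only the benign flow predicted as Worms remains a false positive.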


### **Overall Insight**

* The models demonstrate **excellent generalization** and confirm that flow-level features are highly discriminative for intrusion detection.
* **Random Forest** consistently stands out for both **multiclass** and **binary** classification, combining interpretability, stability, and strong predictive power.
* **KNN** remains a simpler but effective baseline, while **XGBoost** offers competitive performance with potential for further tuning.
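
Because KNN is distance-based, features with very large numeric ranges (such as IP addresses converted to integers) can dominate the distance computation. The notebook trains KNN on unscaled features; the sketch below, on synthetic data, shows one possible variation that standardises features first. The data and the resulting scores are illustrative and say nothing about how scaling would affect the results reported here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Feature 0 carries the class signal on a small scale; feature 1 is
# large-scale noise, comparable in range to an integer-encoded IPv4 address.
signal = np.concatenate([rng.normal(0, 1, n), rng.normal(4, 1, n)])
noise = rng.uniform(0, 4e9, 2 * n)
X = np.column_stack([signal, noise])
y = np.array([0] * n + [1] * n)

# Simple shuffled hold-out split.
idx = rng.permutation(2 * n)
train, test = idx[:300], idx[300:]

raw_knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
scaled_knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, weights="distance"),
)

raw_acc = raw_knn.fit(X[train], y[train]).score(X[test], y[test])
scaled_acc = scaled_knn.fit(X[train], y[train]).score(X[test], y[test])

print(f"unscaled KNN accuracy: {raw_acc:.2f}")
print(f"scaled KNN accuracy:   {scaled_acc:.2f}")
```

Tree-based models split on one feature at a time, so they are not affected by feature scale in the same way.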



<div style="border:2px solid #3344ffff; padding:10px; border-radius:8px; overflow-x:auto; width:100%; box-sizing:border-box;">

## **Discussion and Insights**

### **1. What the models captured well**

#### **Binary Classification (Benign vs. Attack)**

* All models achieved **high accuracy and F1-scores**, effectively distinguishing normal from malicious network traffic.
* Features such as **flow duration**, **byte counts**, and **packet rates** strongly contributed to the separation between benign and attack flows.
* Tree-based models (**Random Forest**, **XGBoost**) performed particularly well, capturing **nonlinear relationships** and handling feature interactions efficiently.
* The models showed **robust generalization** for large-volume attacks, where flow-level statistics were highly distinctive.

#### **Multiclass Classification (Attack Categories)**

* Random Forest and XGBoost effectively learned patterns in **dominant classes** such as DoS and Fuzzers.
* Models leveraged **IP-level and port-level** information to differentiate attacks targeting specific services or protocols.
* Feature importance analysis revealed that **byte and packet flow metrics**, along with **duration**, were key in classifying traffic types.
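
The cells shown in this notebook do not include the feature-importance computation itself. One common way to obtain such a ranking from a fitted Random Forest is its impurity-based `feature_importances_` attribute, sketched here on synthetic data with a target that, by construction, depends only on `IN_BYTES`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for three of the flow features.
X = pd.DataFrame({
    "IN_BYTES": rng.exponential(1000, n),
    "OUT_BYTES": rng.exponential(1000, n),
    "FLOW_DURATION_MILLISECONDS": rng.exponential(50, n),
})
# Synthetic target that depends only on IN_BYTES.
y = (X["IN_BYTES"] > 1000).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Impurity-based importances sum to 1 across features.
importances = (
    pd.Series(rf.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(importances)
```

With the notebook's own model, the same pattern would use `rf_clf.feature_importances_` together with `X_train.columns`.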


### **2. What the models struggled with**

#### **Binary Classification**

* Some **false positives** were observed where benign traffic resembled attack-like behavior (e.g., large data transfers).
* **Short-duration attacks** or **low-traffic malicious flows** were occasionally misclassified as benign due to weak statistical signals.

#### **Multiclass Classification**

* There was significant **overlap** between attack types such as **Reconnaissance**, **Exploits**, and **Shellcode**, leading to frequent misclassifications.
* **Minority classes** (e.g., Worms, Backdoor, or Analysis attacks) had **very low recall**, reflecting insufficient training data.
* Even after balancing, the models favored majority classes, demonstrating **class imbalance bias**.


### **3. Reasons for observed limitations**

* **High class imbalance**: Certain attack categories had very few samples, which restricted model learning.
* **Feature correlation**: High redundancy among features (e.g., between byte and packet counts) means several columns carry nearly the same information, which adds little independent signal for separating similar attack types.
* **Lack of payload-level features**: Flow-based statistics alone cannot capture deeper behavioral or protocol-level attack traits.
* **Encrypted traffic**: TLS/SSL flows obscure payload content, limiting discriminative power for subtle attacks.
* **Limited diversity in training samples**: Many rare attacks may not exhibit enough variability for generalization.
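
Feature redundancy of the kind mentioned above can be checked directly from a correlation matrix. The sketch below lists feature pairs whose absolute correlation exceeds a threshold; the toy data and the 0.9 cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
pkts = rng.integers(1, 100, n)

toy = pd.DataFrame({
    "IN_PKTS": pkts,
    # Nearly proportional to the packet count, so highly correlated with it.
    "IN_BYTES": pkts * 500 + rng.normal(0, 100, n),
    # Drawn independently of the other two columns.
    "FLOW_DURATION_MILLISECONDS": rng.exponential(50, n),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
redundant = pairs[pairs > 0.9]

print(redundant)
```

On the real data, `df_encoded.corr()` from the EDA section could be used in place of `toy.corr()`.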


### **4. Key Takeaways**

* Binary classification is **highly reliable** for detecting whether a flow is malicious.
* Multiclass classification remains **challenging** due to overlapping statistical features and data imbalance.
* Incorporating **temporal**, **application-layer**, and **behavioral** features could substantially improve multiclass detection.
* Ensemble and deep learning models can further enhance **minority-class detection** and reduce false alarms.

</div>


## **Insights for each model**

<div style="border:2px solid #3344ffff; padding:10px; border-radius:8px; overflow-x:auto; width:100%; box-sizing:border-box;">

### **Model 01: XGBoost**

XGBoost demonstrated robust performance across both **binary** and **multiclass** scenarios, leveraging its gradient boosting mechanism to iteratively minimize classification errors and model **nonlinear feature interactions** effectively.

#### **Binary Classification (Benign vs. Attack)**

* Achieved an **accuracy of 98.4%**, with precision, recall, and F1-scores all around **0.98–0.99**, indicating **excellent discrimination** between normal and malicious flows.
* The model effectively captured **high-volume and long-duration attack flows**, where statistical and packet-level differences were substantial.
* Few **false positives** were observed, showing XGBoost’s strong ability to generalize to unseen benign traffic.

**What it captured well:**
XGBoost’s boosting iterations allowed it to learn **subtle nonlinear distinctions** in flow behavior, improving detection of attacks with distinctive traffic signatures.

**Limitations:**
Occasional misclassification of **low-volume or stealthy attacks** (e.g., scanning or probing traffic) as benign, suggesting limits in flow-level resolution.

**Improvement:**
Introducing **time-based and entropy features** or **autoencoder-based prefilters** could further enhance the sensitivity to subtle anomalies.
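
As one example of an entropy feature, the Shannon entropy of the destination ports contacted by each source address tends to be high for port scans and low for ordinary client traffic. A minimal sketch on toy flows follows; `shannon_entropy` is a hypothetical helper, and whether this particular feature helps on this dataset is untested.

```python
import numpy as np
import pandas as pd

def shannon_entropy(values: pd.Series) -> float:
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    probs = values.value_counts(normalize=True).to_numpy()
    return float((probs * np.log2(1.0 / probs)).sum())

toy = pd.DataFrame({
    "IPV4_SRC_ADDR": ["10.0.0.1"] * 4 + ["10.0.0.2"] * 4,
    # The first host always talks to port 443; the second probes four ports.
    "L4_DST_PORT": [443, 443, 443, 443, 21, 22, 23, 80],
})

port_entropy = toy.groupby("IPV4_SRC_ADDR")["L4_DST_PORT"].apply(shannon_entropy)
print(port_entropy)
```

The resulting per-source value could then be merged back onto the flows as an additional column before training.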


#### **Multiclass Classification (Attack Categories)**

* Achieved an **overall accuracy of 88.6%**, with weighted precision, recall, and F1-scores in the **0.89–0.92** range.
* Strong at identifying **dominant attack types** such as DoS and Fuzzers, thanks to their distinct traffic signatures.
* Struggled with **minority or overlapping classes** like Reconnaissance and Exploits, where feature distributions were similar.

**What it captured well:**
The model’s boosting structure let it handle **heterogeneous attack behaviors**, improving recall on moderately represented classes.

**Limitations:**
Despite balanced training, **minority classes** (e.g., Worms, Backdoor) still suffered from low recall due to underrepresentation and overlapping feature patterns.

**Improvement:**
Applying **SMOTE-based augmentation**, **cost-sensitive learning**, or **focal loss** could help improve minority-class recall.
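
Of these options, cost-sensitive learning is the simplest to sketch: give each training row a weight inversely proportional to its class frequency and pass the weights to `fit`. The example uses scikit-learn's `compute_sample_weight` with a Random Forest on toy data so that it stays self-contained; `XGBClassifier.fit` also accepts a `sample_weight` argument.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)

# Toy imbalanced problem: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = rng.normal(size=(100, 3)) + y[:, None]

# "balanced" assigns each row n_samples / (n_classes * class_count).
weights = compute_sample_weight(class_weight="balanced", y=y)

clf = RandomForestClassifier(n_estimators=20, random_state=42)
clf.fit(X, y, sample_weight=weights)

print("weight per common-class row:", weights[0])
print("weight per rare-class row:  ", weights[-1])
```

Each class then contributes the same total weight to training, without duplicating rows as the oversampling step does.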


**Overall Summary:**
XGBoost remains a **high-performing and interpretable** model for both binary and multiclass intrusion detection tasks. It provides an optimal trade-off between **accuracy, robustness, and computational efficiency**, though **Random Forest** showed marginally higher performance for class-balanced datasets.

</div>

<div style="border:2px solid #3344ffff; padding:10px; border-radius:8px; overflow-x:auto; width:100%; box-sizing:border-box;">

### **Model 02: Random Forest**

Random Forest delivered the **best overall performance** across both **binary** and **multiclass** classifications. Its ensemble of decision trees enabled strong generalization, robustness to noise, and the ability to model complex, nonlinear interactions between flow-based features.


#### **Binary Classification (Benign vs. Attack)**

* Achieved the **highest accuracy of 99.6%**, with precision, recall, and F1-scores all exceeding **0.99**, indicating **exceptional reliability** in distinguishing benign and malicious traffic.
* The model learned dominant traffic signatures through **bootstrap aggregation (bagging) and random feature subsets at each split**, which reduces variance and makes it resilient to overfitting.
* It captured both **high-volume DoS flows** and **low-volume attack variants**, achieving near-perfect recall (0.999).

**What it captured well:**
Random Forest excelled in identifying **attack behaviors across varying intensities**, leveraging ensemble diversity to handle noisy or correlated features effectively.

**Limitations:**
Minor **false positives** were occasionally observed in low-variance benign flows, likely due to overly complex decision boundaries.

**Improvement:**
Model interpretability can be enhanced by applying **SHAP feature analysis** to better understand key discriminative attributes in flow-based data.

#### **Multiclass Classification (Attack Categories)**

* Achieved an **accuracy of 90.9%**, outperforming both KNN and XGBoost, with strong weighted precision (0.94) and F1-score (0.92).
* The model handled **heterogeneous attack types** effectively, maintaining high recall even for mid-frequency classes.
* Random feature selection during tree growth ensured robustness against correlated metrics such as `IN_BYTES`, `OUT_BYTES`, and packet counts.
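
The effect of random feature selection on correlated inputs can be sketched on synthetic data. The example below is an assumption-laden toy, not the notebook's experiment: two nearly duplicate informative columns stand in for correlated metrics such as `IN_BYTES` and `OUT_BYTES`, and a third column is pure noise. The feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
signal = rng.normal(size=n)
y = (signal > 0).astype(int)

# Synthetic stand-ins: two near-duplicate informative features and one noise feature.
X = np.column_stack([
    signal + 0.1 * rng.normal(size=n),
    signal + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])
feature_names = ["bytes_like_a", "bytes_like_b", "noise"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# max_features="sqrt" restricts each split to a random subset of the features,
# so neither correlated column can dominate every tree.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

test_accuracy = forest.score(X_test, y_test)
importances = dict(zip(feature_names, forest.feature_importances_))
print(test_accuracy, importances)
```

In a setup like this, the importance is typically shared between the two correlated columns rather than assigned to only one of them, which is worth keeping in mind when reading feature-importance plots.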

**What it captured well:**
The ensemble structure allowed the model to **learn diverse decision boundaries**, improving classification across attacks with overlapping traffic characteristics.

**Limitations:**
Slight performance degradation was noted for **minority attack classes** (e.g., Worms, Backdoor), due to the limited number of representative samples.

**Improvement:**
Combining Random Forest with **feature selection** or **synthetic sample generation (e.g., SMOTE)** could enhance recognition of underrepresented classes.
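
To make the idea of synthetic sample generation concrete, here is a simplified SMOTE-style interpolation written with NumPy only. It is a sketch of the core idea (interpolating between a minority sample and one of its nearest minority neighbours), not the `imbalanced-learn` implementation. The function name, parameters, and toy points are hypothetical.

```python
import numpy as np


def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Create `n_new` synthetic points by interpolating between minority samples.

    Each synthetic point lies on the segment between a randomly chosen minority
    sample and one of its `k` nearest minority neighbours. Needs >= 2 samples.
    """
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise squared distances; mask the diagonal so a point is not its own neighbour.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base_idx = rng.integers(0, n, size=n_new)
    neigh_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_minority[base_idx] + gap * (X_minority[neigh_idx] - X_minority[base_idx])


# Hypothetical minority-class samples in a 2-D feature space.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(X_minority, n_new=6)
print(synthetic)
```

Oversampling of this kind should be applied to the training split only, so that the test set still reflects the real class distribution.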

**Overall Summary:**
Random Forest stands out as the **most reliable and balanced classifier** in this study. Its built-in feature importances, resistance to overfitting, and superior overall performance make it a **benchmark model** for both binary and multiclass intrusion detection tasks.

</div>

<div style="border:2px solid #3344ffff; padding:10px; border-radius:8px; overflow-x:auto; width:100%; box-sizing:border-box;">

### **Model 03: K-Nearest Neighbors (KNN)**

KNN achieved **competitive performance** across both binary and multiclass tasks, demonstrating its effectiveness in identifying patterns based purely on **distance metrics** and **local feature similarities**. Although computationally heavier on large datasets, its simplicity and interpretability remain valuable.

#### **Binary Classification (Benign vs. Attack)**

* Achieved a strong **accuracy of 99.2%**, with precision (0.983), recall (0.999), and F1-score (0.991).
* The model’s **high recall** means it misses very few malicious samples, making it a good fit for intrusion detection scenarios that prioritize **low false negatives**.
* KNN effectively identified distinct traffic clusters corresponding to benign and attack flows in the feature space.
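
The reported F1-score can be cross-checked from the reported precision and recall, since F1 is their harmonic mean. The helper name below is hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported above for KNN binary classification.
f1 = f1_from_precision_recall(0.983, 0.999)
print(round(f1, 3))  # 0.991
```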

**What it captured well:**
Its distance-based approach performed well on **clearly separable attack behaviors**, especially those with large deviations in packet or byte statistics.

**Limitations:**
KNN can be **sensitive to feature scaling and local noise**, sometimes leading to false positives when benign and attack flows are close in feature space.

**Improvement:**
Applying **feature normalization**, **optimal K selection**, and **distance weighting** could further stabilize performance and reduce misclassifications.
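
The three suggestions above can be combined in a single scikit-learn pipeline. The following is a minimal sketch on synthetic data, assuming one informative small-scale feature and one uninformative large-scale feature (a stand-in for raw byte counts). The grid values are illustrative, not tuned for the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    y + rng.normal(0, 0.2, size=n),   # informative, small scale
    rng.normal(0, 1000.0, size=n),    # uninformative, large scale
])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Baseline: without scaling, distances are dominated by the large-scale column.
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
raw_accuracy = raw_knn.score(X_test, y_test)

# Scaling, K selection, and distance weighting searched together.
pipeline = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {
    "knn__n_neighbors": [3, 5, 11],
    "knn__weights": ["uniform", "distance"],
}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)
scaled_accuracy = search.score(X_test, y_test)

print("raw:", raw_accuracy, "scaled:", scaled_accuracy, search.best_params_)
```

Placing the scaler inside the pipeline means it is refit on each cross-validation training fold, so no statistics leak from the validation folds into the scaling step.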


#### **Multiclass Classification (Attack Categories)**

* Achieved an **accuracy of 90.36%**, with weighted precision (0.919) and F1-score (0.909), indicating reliable performance across most classes.
* Performed well for **dominant attack types**, but showed limited generalization for **minority or overlapping attacks**.
* The algorithm’s non-parametric nature enabled flexible adaptation to varying feature distributions.

**What it captured well:**
KNN accurately classified **well-separated attack families**, benefitting from its ability to preserve local neighborhood structure.

**Limitations:**
Performance dropped for **classes with overlapping statistical patterns**, as KNN weights all features equally and learns no explicit decision boundary, so uninformative features blur the neighborhoods of similar classes.

**Improvement:**
Incorporating **metric learning**, **dimensionality reduction (e.g., PCA)**, or **hybrid ensemble variants (e.g., KNN + RF)** could improve its scalability and robustness.


**Overall Summary:**
KNN remains a **simple yet powerful baseline** model, achieving high recall and competitive accuracy. While it lacks the structural learning advantages of tree-based models, its strong detection capability makes it suitable for **distance-based anomaly detection systems**, including real-time use where the reference set is small enough, or indexed well enough, to keep neighbor search fast.

</div>