XGBoost

Our dataset is imbalanced: it contains many more Normal entries than Attack entries.

XGBoost (Extreme Gradient Boosting) is a good choice because:

  Imbalance Handling: It naturally supports the scale_pos_weight parameter. This parameter tells the algorithm to weigh misclassifications of the rare class (Attacks) much higher, forcing the model to pay more attention to the anomalies.

  Ensemble Power: XGBoost is an ensemble model built on Decision Trees. It learns sequentially: each new tree corrects the errors (residuals) left by the trees before it. This incremental learning is effective at capturing the complex, subtle patterns that define an anomaly, and it often outperforms simpler models like Logistic Regression.

  In other words, many small decision trees are built one after another and then combined.
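
To make the "each new tree corrects the previous errors" idea concrete, here is a minimal pure-Python sketch of boosting with one-split trees (stumps) on squared error. It is a simplification for illustration only: XGBoost itself uses gradients and second-order terms of the chosen loss plus regularization, and the toy data and function names (`fit_stump`, `boost`) are made up for this example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (squared error)."""
    best = None
    # Skip the largest value so the right-hand leaf is never empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Fit stumps sequentially; each one is trained on the current residuals."""
    base = sum(ys) / len(ys)           # start from the mean prediction
    preds = [base] * len(ys)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        # Add a shrunken version of the new stump to the running prediction
        preds = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, preds)
        ]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, mse_history


# Toy 1-D data (hypothetical): label 1 becomes more likely as x grows
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
base, stumps, mse_history = boost(xs, ys)
print(f"MSE after round 1:  {mse_history[0]:.4f}")
print(f"MSE after round 20: {mse_history[-1]:.4f}")
```

The training error shrinks round by round because every stump is fitted to whatever the ensemble so far still gets wrong.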

Importing the necessary libraries and loading the data

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import numpy as np

# Load the saved, preprocessed, and scaled dataset
df_sampled = pd.read_csv("Dataset.csv")

# -------------------------------------------------------------

'''
When I first ran XGBoost, it failed because the column 'FTP Command Count' was still
an object (string) type, and XGBoost requires numeric data.

To fix this, convert the problematic 'object' column to numeric:

1) Use errors='coerce' to turn any remaining non-numeric text (like ' - ') into NaN,
2) then fill those NaNs with 0 (or the column median, but 0 is safe here).

'''

# FIX:
df_sampled['FTP Command Count'] = pd.to_numeric(
    df_sampled['FTP Command Count'],
    errors='coerce' # Convert non-numbers to NaN
)
df_sampled['FTP Command Count'] = df_sampled['FTP Command Count'].fillna(0)

# Verification (Check to ensure no other 'object' types remain)
object_cols = df_sampled.select_dtypes(include=['object']).columns
if len(object_cols) > 0:
    print(f"WARNING: Still found object columns: {object_cols.tolist()}")
else:
    print("SUCCESS: All feature columns are now numerical.")

# Separate Features (X) and Target (y)
X = df_sampled.drop('Label', axis=1)
y = df_sampled['Label']


SUCCESS: All feature columns are now numerical.
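
As a rough illustration of the coerce-then-fill step above, here is a pure-Python sketch of what happens to each value. It only imitates the per-value behaviour and is not how pandas implements `to_numeric`; the helper name `coerce_to_float` and the sample values are made up for this example.

```python
import math
import statistics


def coerce_to_float(value):
    """Return the value as a float, or NaN if it cannot be parsed as a number."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


# Hypothetical raw column: numeric strings mixed with placeholder text
raw = ["3", " - ", "7", None]

coerced = [coerce_to_float(v) for v in raw]

# Option 1 (used in this notebook): fill the gaps with 0
filled_zero = [0.0 if math.isnan(v) else v for v in coerced]

# Option 2: fill the gaps with the median of the values that did parse
median = statistics.median(v for v in coerced if not math.isnan(v))
filled_median = [median if math.isnan(v) else v for v in coerced]

print(filled_zero)
print(filled_median)
```

Filling with 0 treats an unparseable entry as "no FTP commands", while the median keeps the column's typical value; which one is appropriate depends on what the placeholder means in the data.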


I then ran into a data leakage problem: the model memorised the outcome from columns that directly reveal the label (leaky features).

To fix this, I added the code below.



In [None]:
# Assuming X and y are loaded from Dataset.csv

# -------------------------------------------------------------
# 🚨 CRITICAL FIX: Remove features that leak the target (Label)
# -------------------------------------------------------------

# 1. Identify all OHE columns created from 'Attack Category'
leaky_cols = [col for col in X.columns if 'Attack Category' in col]

# 2. Also drop 'FTP Command Count' as a precaution due to its initial 'object' type and potential leakage
leaky_cols.append('FTP Command Count')

# Remove the leaky columns from the features X
X = X.drop(columns=leaky_cols, errors='ignore')

print(f"Dropped {len(leaky_cols)} leaky features.")
print(f"New Feature Count: {X.shape[1]}")

Dropped 15 leaky features.
New Feature Count: 199
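
One simple way to spot columns like these before training is to check whether a feature's value alone fixes the label. The sketch below is a heuristic only, assuming low-cardinality (for example one-hot encoded) columns; the function name and the toy rows, including the column names `Attack Category_DoS` and `Duration`, are hypothetical.

```python
from collections import defaultdict


def find_label_determining_features(rows, label_key):
    """Return features where every distinct value maps to exactly one label.

    A model can use such a feature to look the answer up instead of
    learning it, which is a common symptom of target leakage.
    """
    feature_names = [name for name in rows[0] if name != label_key]
    suspicious = []
    for name in feature_names:
        labels_seen = defaultdict(set)
        for row in rows:
            labels_seen[row[name]].add(row[label_key])
        if all(len(labels) == 1 for labels in labels_seen.values()):
            suspicious.append(name)
    return suspicious


# Toy rows (hypothetical): the one-hot attack category mirrors the label exactly
rows = [
    {"Attack Category_DoS": 1, "Duration": 5, "Label": 1},
    {"Attack Category_DoS": 0, "Duration": 5, "Label": 0},
    {"Attack Category_DoS": 1, "Duration": 9, "Label": 1},
    {"Attack Category_DoS": 0, "Duration": 9, "Label": 0},
]

suspects = find_label_determining_features(rows, "Label")
print(suspects)
```

Continuous columns with mostly unique values would also be flagged by this check, so it is best applied to categorical or one-hot columns and followed by a manual look at what each flagged column means.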


Split the data

In [None]:
# Split 70% for Training, 30% for Testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,
    stratify=y     # CRITICAL: Preserves the Attack/Normal ratio
)

print(f"Training Set Size: {X_train.shape[0]} rows")
print(f"Testing Set Size: {X_test.shape[0]} rows")

Training Set Size: 350000 rows
Testing Set Size: 150000 rows
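
To show what `stratify=y` is protecting, here is a minimal pure-Python sketch of a stratified split: each class is shuffled and split separately, so both sets keep the same Attack share. This is not scikit-learn's implementation, and the function name and toy labels are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) with every class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels (hypothetical) with a 7:1 Normal/Attack imbalance
labels = [0] * 700 + [1] * 100
train_idx, test_idx = stratified_split(labels)

attack_share_train = sum(labels[i] for i in train_idx) / len(train_idx)
attack_share_test = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"Attack share in train: {attack_share_train:.3f}")
print(f"Attack share in test:  {attack_share_test:.3f}")
```

Without stratification, a random split of a rare class can leave the test set with noticeably more or fewer attacks than the training set, which distorts precision and recall.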


Train XGBoost (Handling Imbalance)

We use the scale_pos_weight parameter to address the class imbalance, which gives more importance to the rare Attack class (1).
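
The effect of this weighting can be sketched with a weighted log loss, where errors on the positive class are multiplied by a constant. This is a simplified view of what scale_pos_weight does (XGBoost applies the weight to the gradients of positive examples); the function name is made up, and 6.91 is the Normal/Attack ratio this notebook computes for the training set.

```python
import math


def weighted_log_loss(y_true, p_pred, pos_weight=1.0):
    """Mean log loss where positive-class (Attack) terms are scaled by pos_weight."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)


# Two equally confident mistakes:
missed_attack = ([1], [0.1])   # true Attack, predicted 10% Attack
false_alarm = ([0], [0.9])     # true Normal, predicted 90% Attack

plain_missed = weighted_log_loss(*missed_attack)
plain_false_alarm = weighted_log_loss(*false_alarm)
weighted_missed = weighted_log_loss(*missed_attack, pos_weight=6.91)

print(f"Unweighted: missed attack {plain_missed:.3f}, false alarm {plain_false_alarm:.3f}")
print(f"Weighted:   missed attack {weighted_missed:.3f}")
```

Unweighted, the two mistakes cost the same; with the weight applied, a missed attack costs about seven times as much, which pushes the model toward higher recall on the Attack class, usually at some cost in precision.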

In [None]:
# Calculate the imbalance ratio (Count of Normal / Count of Attack)
ratio = y_train.value_counts()[0] / y_train.value_counts()[1]
print(f"Class Imbalance Ratio (0/1): {ratio:.2f}")

# Initialize and train the XGBoost model
xgb_model = XGBClassifier(
    objective='binary:logistic',
    n_estimators=100,
    learning_rate=0.1,
    scale_pos_weight=ratio,  # Weight the Attack class by the Normal/Attack ratio
    eval_metric='logloss',
    random_state=42
)  # use_label_encoder removed: this XGBoost version ignores it and only warns

print("Starting XGBoost training...")
xgb_model.fit(X_train, y_train)
print("XGBoost training complete.")

Class Imbalance Ratio (0/1): 6.91
Starting XGBoost training...
XGBoost training complete.


Results

In [None]:
# Predict on the held-out TEST data
y_pred = xgb_model.predict(X_test)
y_proba = xgb_model.predict_proba(X_test)[:, 1]

# Calculate Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_proba)

# Print the Centralized Baseline Results for your paper
print("\n--- XGBoost Centralized Baseline Results ---")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
print(f"ROC-AUC:   {roc_auc:.4f} (Benchmark Metric!)")

# You should save these 5 numbers as your first benchmark!


--- XGBoost Centralized Baseline Results ---
Accuracy:  0.9881
Precision: 0.9141
Recall:    0.9999
F1 Score:  0.9551
ROC-AUC:   0.9997 (Benchmark Metric!)
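
As a quick sanity check, the F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name `f1_from` is made up for this sketch; the two inputs are the rounded values printed above.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


reported_precision = 0.9141
reported_recall = 0.9999

f1_check = f1_from(reported_precision, reported_recall)
print(f"Recomputed F1: {f1_check:.4f}")  # prints 0.9551, matching the report

# Share of flagged samples that were actually Normal (false alarms)
false_alarm_share = 1 - reported_precision
print(f"False alarms among flagged samples: {false_alarm_share:.1%}")
```

Recall near 1.0 with precision around 0.91 means the model misses almost no attacks, and its remaining errors are mostly Normal traffic flagged as Attack.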
