# Machine Learning Assignment 2  
## Cybersecurity Intrusion Detection Classification

This notebook trains and compares six machine learning classification models
on the NSL-KDD cybersecurity intrusion detection dataset (the `KDDTrain+` file).
The workflow covers preprocessing, training, evaluation, and saving the models
for deployment in a Streamlit web application.


In [1]:
import pandas as pd
import numpy as np
import joblib
import os

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, roc_auc_score,
    precision_score, recall_score,
    f1_score, matthews_corrcoef
)

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier


## Step 2: Load Cybersecurity Dataset

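We use the `KDDTrain+` file of the NSL-KDD intrusion detection dataset.
The file has no header row, so column names are supplied explicitly.
Each row describes one network connection with 41 traffic features,
an attack label (`target`), and a difficulty score (`difficulty`).
The label is used to classify connections as normal or attack.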


In [2]:
url = "https://raw.githubusercontent.com/defcom17/NSL_KDD/master/KDDTrain+.txt"

columns = [
    "duration","protocol_type","service","flag","src_bytes","dst_bytes",
    "land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted",
    "num_root","num_file_creations","num_shells","num_access_files",
    "num_outbound_cmds","is_host_login","is_guest_login","count",
    "srv_count","serror_rate","srv_serror_rate","rerror_rate",
    "srv_rerror_rate","same_srv_rate","diff_srv_rate",
    "srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate","dst_host_srv_diff_host_rate",
    "dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate",
    "target","difficulty"
]

df = pd.read_csv(url, names=columns)

print("Dataset shape:", df.shape)
df.head()


Dataset shape: (125973, 43)


| | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | target | difficulty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tcp | ftp_data | SF | 491 | 0 | 0 | 0 | 0 | 0 | ... | 0.17 | 0.03 | 0.17 | 0.0 | 0.0 | 0.0 | 0.05 | 0.0 | normal | 20 |
| 1 | 0 | udp | other | SF | 146 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.6 | 0.88 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal | 15 |
| 2 | 0 | tcp | private | S0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.1 | 0.05 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | neptune | 19 |
| 3 | 0 | tcp | http | SF | 232 | 8153 | 0 | 0 | 0 | 0 | ... | 1.0 | 0.0 | 0.03 | 0.04 | 0.03 | 0.01 | 0.0 | 0.01 | normal | 21 |
| 4 | 0 | tcp | http | SF | 199 | 420 | 0 | 0 | 0 | 0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal | 21 |


## Step 3: Data Preprocessing

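- Convert the attack labels into a binary target (`normal` = 0, any attack type = 1)  
- Drop the `difficulty` column, which is dataset metadata rather than a traffic feature  
- One-hot encode the categorical columns (`protocol_type`, `service`, `flag`)  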


In [3]:
# Convert attack labels to binary
df["target"] = df["target"].apply(lambda x: 0 if x == "normal" else 1)

# Drop difficulty column
df.drop("difficulty", axis=1, inplace=True)

# Convert categorical features
df = pd.get_dummies(df, drop_first=True)

print("Processed dataset shape:", df.shape)


Processed dataset shape: (125973, 120)


## Step 4: Feature and Target Split

Separate the feature matrix (`X`) from the
binary target variable (`y`).


In [4]:
X = df.drop("target", axis=1)
y = df["target"]


## Step 5: Train-Test Split

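Split the dataset into an 80% training set and a 20% test set,
stratified on the target so that both sets keep the same normal/attack ratio.
The test set is held out to evaluate model performance.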
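
As a minimal sketch of what `stratify` does, the example below splits a hypothetical label array with a 20% attack rate. The arrays `X_toy` and `y_toy` are made up for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 normal (0) and 20 attack (1).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the original attack rate.
train_attack_rate = y_tr.mean()
test_attack_rate = y_te.mean()
print(train_attack_rate, test_attack_rate)
```

Without `stratify`, the attack rate in a small test set can drift from the rate in the full data purely by chance, which makes metrics such as recall harder to compare.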


In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)


## Step 6: Feature Scaling

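`StandardScaler` standardizes each feature to zero mean and unit variance.
The scaler is fitted on the training set only and then applied to the test set,
so no test-set statistics leak into training. Scaling matters most for
distance-based and gradient-based models such as KNN and Logistic Regression;
tree-based models are largely insensitive to it.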
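
A small sketch of the fit-on-train, transform-on-test pattern, using made-up numbers. The names `train`, `test`, and `scaler_demo` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature values on very different scales.
train = np.array([[0.0, 100.0], [1.0, 300.0], [2.0, 500.0]])
test = np.array([[1.0, 700.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns mean and std from train
test_scaled = scaler_demo.transform(test)        # reuses the training statistics

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```

The test row is scaled with the training mean and standard deviation, so its values are not forced to zero mean; a test value far outside the training range simply maps to a large standardized value.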


In [6]:
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


## Step 7: Model Training

The following classification models are instantiated here and fitted in the loop in Step 8:

- Logistic Regression  
- Decision Tree  
- KNN  
- Naive Bayes  
- Random Forest  
- XGBoost  


In [7]:
os.makedirs("models", exist_ok=True)

models = {
    "Logistic Regression": LogisticRegression(max_iter=2000),
    "Decision Tree": DecisionTreeClassifier(),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "XGBoost": XGBClassifier(eval_metric='logloss')
}


## Step 8: Model Evaluation

Evaluation metrics used (computed on the held-out test set, with attack as the positive class):

- Accuracy  
- AUC Score  
- Precision  
- Recall  
- F1 Score  
- Matthews Correlation Coefficient (MCC)
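
To make the threshold-based metrics concrete, the sketch below derives them by hand from confusion-matrix counts and checks them against scikit-learn. The labels `y_true` and `y_hat` are hypothetical. AUC is left out because it is computed from predicted probabilities rather than from hard labels.

```python
import math
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, matthews_corrcoef, confusion_matrix,
)

# Hypothetical labels and predictions (1 = attack, 0 = normal).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_hat).ravel())

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted attacks that are real
recall = tp / (tp + fn)      # share of real attacks that are caught
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# The hand-computed values should agree with scikit-learn.
assert math.isclose(accuracy, accuracy_score(y_true, y_hat))
assert math.isclose(precision, precision_score(y_true, y_hat))
assert math.isclose(recall, recall_score(y_true, y_hat))
assert math.isclose(f1, f1_score(y_true, y_hat))
assert math.isclose(mcc, matthews_corrcoef(y_true, y_hat))

print(accuracy, precision, recall, f1, mcc)
```

MCC uses all four cells of the confusion matrix, so it stays informative when a model trades recall for precision, as can happen with imbalanced or skewed predictions.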


In [8]:
results = []

for name, model in models.items():

    print(f"Training {name}...")

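    # Fit on the scaled training data
    model.fit(X_train, y_train)

    # Hard labels for the threshold-based metrics;
    # attack-class probabilities (column 1) for AUC
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]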

    metrics = {
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "AUC": roc_auc_score(y_test, y_prob),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1 Score": f1_score(y_test, y_pred),
        "MCC": matthews_corrcoef(y_test, y_pred)
    }

    results.append(metrics)

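    # Persist the fitted model for the Streamlit app
    # (the file name is the model's display name, spaces included)
    joblib.dump(model, f"models/{name}.pkl")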


Training Logistic Regression...
Training Decision Tree...
Training KNN...
Training Naive Bayes...
Training Random Forest...
Training XGBoost...


## Step 9: Save Models and Results

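The fitted models were already written to `models/` inside the training loop.
This step saves the fitted scaler alongside them and writes the evaluation
metrics to `model_results.csv` for Streamlit deployment and comparison.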
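
A minimal sketch of how the saved artifacts could be loaded again for inference, for example inside the Streamlit app. The data, the temporary directory, and the names `demo_model` and `demo_scaler` are stand-ins for illustration; the real app would load the files from `models/`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; real inputs must have the same columns,
# in the same order, as the training matrix.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_scaler = StandardScaler().fit(X_demo)
demo_model = LogisticRegression(max_iter=2000).fit(
    demo_scaler.transform(X_demo), y_demo
)

with tempfile.TemporaryDirectory() as tmp:
    # Save using the same file-name pattern as the notebook.
    joblib.dump(demo_model, os.path.join(tmp, "Logistic Regression.pkl"))
    joblib.dump(demo_scaler, os.path.join(tmp, "scaler.pkl"))

    # Later: load both artifacts back from disk.
    loaded_scaler = joblib.load(os.path.join(tmp, "scaler.pkl"))
    loaded_model = joblib.load(os.path.join(tmp, "Logistic Regression.pkl"))

# Apply the scaler first, then the model, exactly as during training.
new_rows = X_demo[:5]
predictions = loaded_model.predict(loaded_scaler.transform(new_rows))
print(predictions)
```

Because the training features were one-hot encoded with `pd.get_dummies`, new input would also need to be encoded and aligned to the training columns before scaling; one option is to save the list of training column names next to the scaler.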


In [9]:
joblib.dump(scaler, "models/scaler.pkl")

results_df = pd.DataFrame(results)
results_df.to_csv("model_results.csv", index=False)

results_df


| | Model | Accuracy | AUC | Precision | Recall | F1 Score | MCC |
|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.972177 | 0.996372 | 0.976736 | 0.963159 | 0.9699 | 0.944116 |
| 1 | Decision Tree | 0.998452 | 0.998486 | 0.9977 | 0.998977 | 0.998338 | 0.99689 |
| 2 | KNN | 0.996745 | 0.999616 | 0.997267 | 0.995736 | 0.996501 | 0.99346 |
| 3 | Naive Bayes | 0.840683 | 0.978025 | 0.997548 | 0.659304 | 0.7939 | 0.711069 |
| 4 | Random Forest | 0.999008 | 0.999994 | 0.999488 | 0.99838 | 0.998933 | 0.998006 |
| 5 | XGBoost | 0.999325 | 0.999989 | 0.999318 | 0.999232 | 0.999275 | 0.998644 |
