# Network Intrusion Detection Pipeline

This pipeline outlines the steps involved in creating a machine learning model for detecting network intrusions using the NSL-KDD dataset. The process involves data preprocessing, handling class imbalance, feature engineering, model training, and evaluation using various classifiers.

---

## 1. Project Setup

### 1.1 Import Libraries
Start by importing the necessary Python libraries such as pandas, numpy, sklearn, and imbalanced-learn for data manipulation, preprocessing, and model evaluation.

---

## 2. Load and Prepare the Data

### 2.1 Load the Dataset
We use two datasets: a training set and a test set. The NSL-KDD dataset is loaded into pandas DataFrames from the provided URLs.

### 2.2 Initial Data Inspection
Inspect the dimensions of both the training and test sets to understand the size of the data and ensure the data has been loaded correctly.

---

## 3. Data Preprocessing

### 3.1 Data Cleaning
Perform data cleaning by:
- Removing spaces and converting column names to lowercase for consistency.
- Dropping rows with missing values to ensure clean data.
- Removing duplicate rows to avoid redundant information.
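
The cleaning steps above can be sketched on a small toy frame. The column names and values here are made up for illustration, and `clean` is a hypothetical helper, not the function used on the NSL-KDD data.

```python
import pandas as pd

# Toy frame with messy column names, one missing value, and one duplicate row
raw = pd.DataFrame({
    " Protocol Type ": ["tcp", "udp", "tcp", None],
    "Src Bytes": [100, 20, 100, 5],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is left untouched."""
    out = df.copy()
    # Normalize column names: trim, lowercase, spaces to underscores
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    # Drop rows with missing values, then exact duplicate rows
    return out.dropna().drop_duplicates().reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

Working on a copy is a design choice in this sketch: it avoids surprising the caller by mutating the frame that was passed in.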

### 3.2 Data Type Optimization
Convert high-memory data types (e.g., int64, float64) to more memory-efficient types (e.g., int32, float32) to reduce memory usage during processing.
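
As a rough illustration of the saving, the sketch below downcasts a synthetic frame and compares memory use before and after. The frame and its column names are assumptions for the example. Note that `int32` only holds values up to 2,147,483,647 and `float32` keeps about seven significant digits, so the value range should be checked before downcasting real data.

```python
import numpy as np
import pandas as pd

# Synthetic frame stored with 64-bit dtypes
df_demo = pd.DataFrame({
    "count": np.arange(1000, dtype=np.int64),
    "rate": np.linspace(0.0, 1.0, 1000, dtype=np.float64),
})

bytes_before = df_demo.memory_usage(deep=True).sum()

# Downcast each column to its 32-bit counterpart
df_small = df_demo.astype({"count": np.int32, "rate": np.float32})
bytes_after = df_small.memory_usage(deep=True).sum()

print(f"before: {bytes_before} bytes, after: {bytes_after} bytes")
```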

---

## 4. Feature Engineering

### 4.1 Separating Categorical and Numerical Features
Identify and separate the categorical columns (e.g., protocol_type, service, flag) from the numerical features. This step is necessary for different treatment in the preprocessing pipeline.
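
One way to make this split without hard-coding column names is to select by dtype. This is a minimal sketch on a made-up frame; it assumes every non-numeric feature column should be treated as categorical, which may not hold for other datasets.

```python
import pandas as pd

# Small stand-in for the real data; the values are illustrative only
df_demo = pd.DataFrame({
    "duration": [0, 1],
    "protocol_type": ["tcp", "udp"],
    "serror_rate": [0.0, 0.5],
    "label": ["normal", "neptune"],
})

# Exclude the target before splitting the feature columns by dtype
features = df_demo.drop(columns=["label"])
numeric_cols = features.select_dtypes(include=["number"]).columns.tolist()
categorical_cols = features.select_dtypes(exclude=["number"]).columns.tolist()

print(numeric_cols, categorical_cols)
```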

### 4.2 Feature Scaling and Encoding
- Scale numerical features to standardize them.
- Apply one-hot encoding to categorical features to convert them into numerical form that machine learning algorithms can understand.
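
The two steps can be combined in a single scikit-learn `ColumnTransformer`, sketched below on toy data. The frames are invented for the example. The sketch sets `handle_unknown="ignore"` so that a category absent from the training data (here `icmp`) is encoded as all zeros rather than raising an error, and `sparse_threshold=0` so that the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "duration": [0.0, 2.0, 4.0],
    "protocol_type": ["tcp", "udp", "tcp"],
})
# The test row contains a protocol that never appears in the training frame
test = pd.DataFrame({"duration": [2.0], "protocol_type": ["icmp"]})

pre = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["duration"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["protocol_type"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

# Fit on the training data only, then reuse the fitted statistics on the test data
train_enc = pre.fit_transform(train)
test_enc = pre.transform(test)

print(train_enc.shape, test_enc)
```

Fitting the transformer on the training frame alone keeps test-set statistics out of the scaling step.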

---

## 5. Handling Class Imbalance

### 5.1 Oversampling with Random Oversampler
Because the class distribution may be imbalanced (e.g., more normal connections than attack connections), we use random oversampling to balance the classes. This technique replicates samples from the minority classes until every class is as large as the majority class. It should be applied to the training data only, so that the test set keeps its original class distribution.
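
To show what random oversampling does mechanically, here is a hand-rolled NumPy sketch. It is not the `RandomOverSampler` implementation from imbalanced-learn; `random_oversample` is a hypothetical helper written only to illustrate the idea of duplicating minority-class rows until all classes match the largest one.

```python
import numpy as np

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen minority-class rows until classes are balanced."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    # Start with every original row, then append extra draws per minority class
    keep = [np.arange(len(y))]
    for cls, n in zip(classes, counts):
        if n < target:
            idx = np.flatnonzero(y == cls)
            keep.append(rng.choice(idx, size=target - n, replace=True))
    order = np.concatenate(keep)
    return X[order], y[order]

X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0, 0, 0, 0, 1, 1])  # four majority rows, two minority rows

X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))
```

Because the added rows are exact copies, no new information is created; the classifier simply sees the minority classes more often.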

---

## 6. Feature Extraction

### 6.1 Clustering with K-Means
Introduce clustering as a way to generate meta-features:
- Apply K-Means clustering to the oversampled data to identify underlying patterns or groups.
- Add the resulting cluster labels as a new feature to the dataset, which may give the classifiers an additional signal about which group a connection belongs to.
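
A minimal sketch of the cluster-label meta-feature, using two synthetic, well-separated blobs rather than the NSL-KDD data. The array names and the choice of two clusters are assumptions for the example. The key point is that K-Means is fitted once on the training data and then only used to predict on new data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic groups centred at 0 and 5
X_train = np.vstack([rng.normal(0, 0.5, (50, 3)), rng.normal(5, 0.5, (50, 3))])
X_new = rng.normal(5, 0.5, (10, 3))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
train_clusters = kmeans.fit_predict(X_train)  # fit on training data only
new_clusters = kmeans.predict(X_new)          # reuse the fitted centroids

# Append the cluster id as one extra column
X_train_aug = np.hstack([X_train, train_clusters.reshape(-1, 1)])
X_new_aug = np.hstack([X_new, new_clusters.reshape(-1, 1)])

print(X_train_aug.shape, X_new_aug.shape)
```

Cluster ids are arbitrary integers with no inherent order. Tree-based models can split on them directly, but a distance-based or linear model would usually need them one-hot encoded.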

### 6.2 Dimensionality Reduction with PCA
Apply Principal Component Analysis (PCA) to reduce the dimensionality of the dataset while preserving as much variance as possible. This helps in reducing the complexity of the model while retaining key information.
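
How much variance a given number of components preserves can be checked with `explained_variance_ratio_`. The sketch below uses synthetic data that is essentially two-dimensional, embedded in six features; the data and the choice of two components are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Six observed features driven by two latent factors plus a little noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 6))

pca = PCA(n_components=2, random_state=42)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance kept by the retained components
retained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape, round(float(retained), 4))
```

On real data, inspecting this ratio for several values of `n_components` is one way to justify the number chosen.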

---

## 7. Model Selection and Training

### 7.1 Model Selection
Select various machine learning models to train and evaluate:
- **Decision Tree**
- **Random Forest**
- **Extra Trees**


These models are chosen for their ability to handle classification problems effectively, especially with structured tabular data.

### 7.2 Training with K-Fold Cross-Validation
Perform 10-fold cross-validation on the training dataset to:
- Split the dataset into 10 equal parts.
- Train the model on 9 parts and validate it on the 10th part.
- Repeat this process 10 times with different splits to ensure the model generalizes well to unseen data.
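
The procedure above can be sketched with scikit-learn's `KFold` and `cross_val_score`. The dataset here is synthetic (`make_classification`), so the scores say nothing about NSL-KDD; the sketch only shows the mechanics of the 10-fold split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem standing in for the real features
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Each of the 10 folds serves as the validation part exactly once
fold_sizes = [len(val_idx) for _, val_idx in kf.split(X)]

scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X, y, cv=kf, scoring="accuracy"
)
print(fold_sizes)
print(np.round(scores, 4), round(float(scores.mean()), 4))
```

`cross_val_score` clones the estimator for every fold, so no state carries over from one fold to the next.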

---

## 8. Model Evaluation

### 8.1 Evaluation Metrics
Evaluate each model using the following metrics:
- **Accuracy**: The proportion of correct predictions to the total predictions.
- **Precision**: The proportion of true positive predictions to all positive predictions.
- **Recall**: The proportion of true positive predictions to all actual positives.
- **F1-Score**: The harmonic mean of precision and recall.
- **ROC Curve**: Visualize the trade-off between true positive rate and false positive rate.
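
For reference, these metrics can be computed with `sklearn.metrics` as in the sketch below. The labels and scores are a made-up binary example. For the multi-class NSL-KDD labels, precision, recall, and F1 would additionally need an `average` argument such as `average="macro"`.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy binary example: 1 = attack, 0 = normal
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
# Predicted probability of the positive class, used for the ROC computation
y_score = [0.1, 0.2, 0.3, 0.6, 0.9, 0.8, 0.7, 0.4]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_score),
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With three true positives, one false positive, and one false negative, accuracy, precision, recall, and F1 all come out to 0.75 here, while the area under the ROC curve is 0.9375.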

### 8.2 Compare Model Performance
Compare the performance of all models based on the evaluation metrics. Identify the model with the best performance for further tuning and deployment.

---

## 9. Final Model Selection

After evaluating the models, select the one with the best performance (e.g., based on accuracy, F1-score, or other metrics). This model can be used for further testing on the test dataset or future deployment.

---

## 10. Conclusion



The following table summarizes the performance of the three machine learning models evaluated in the pipeline:

| **Model** | **Accuracy Scores (per fold)** | **Average Accuracy** |
|---|---|---|
| Decision Tree | [0.9997, 0.9996, 0.9997, 0.9997, 0.9996, 0.9997, 0.9997, 0.9997, 0.9997, 0.9997] | 0.9997 |
| Random Forest | [0.9998, 0.9998, 0.9998, 0.9998, 0.9997, 0.9998, 0.9998, 0.9998, 0.9998, 0.9998] | 0.9998 |
| Extra Trees | [0.9998, 0.9998, 0.9998, 0.9998, 0.9997, 0.9998, 0.9999, 0.9998, 0.9999, 0.9998] | 0.9998 |

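From the evaluation, both **Random Forest** and **Extra Trees** performed slightly better than the **Decision Tree**, each with an average accuracy of **0.9998** against **0.9997** for the Decision Tree. These figures are 10-fold cross-validation accuracies on the oversampled training set, not scores on the held-out test set. Because oversampling duplicates rows before the folds are split, copies of the same record can land in both the training and validation parts of a fold, so the figures are likely optimistic. Both Random Forest and Extra Trees are therefore recommended for further tuning and for evaluation on the test set before any deployment.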

---



In [2]:
import pandas as pd
train_url = 'https://raw.githubusercontent.com/merteroglu/NSL-KDD-Network-Instrusion-Detection/master/NSL_KDD_Train.csv'
test_url = 'https://raw.githubusercontent.com/merteroglu/NSL-KDD-Network-Instrusion-Detection/master/NSL_KDD_Test.csv'
col_names = ["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]


df = pd.read_csv(train_url,header=None, names = col_names)

df_test = pd.read_csv(test_url, header=None, names = col_names)

print('Dimensions of the Training set:',df.shape)
print('Dimensions of the Test set:',df_test.shape)


Dimensions of the Training set: (125973, 42)
Dimensions of the Test set: (22544, 42)


In [4]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score


#  Preprocessing for df and df_test
def preprocess_df(df):
    # Replace column names with no spaces and lowercase
    df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

    # Handle missing values (drop rows with missing values for simplicity)
    df.dropna(inplace=True)

    # Drop duplicate rows
    df.drop_duplicates(inplace=True)

    # Convert int64 to int32 and float64 to float32 to reduce memory usage
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = df[col].astype(np.int32)
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = df[col].astype(np.float32)

    return df

df = preprocess_df(df)
df_test = preprocess_df(df_test)

# Feature Scaling for both df and df_test
# Separate categorical and numerical columns
categorical_cols = ['protocol_type', 'service', 'flag']
numeric_cols = df.select_dtypes(include=['int32', 'float32']).columns.tolist()

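# Define transformers for categorical and numerical features
# handle_unknown='ignore' encodes categories unseen during fit as all zeros
# instead of raising; sparse_threshold=0 forces a dense array for np.hstack later
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ],
    sparse_threshold=0)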

# Prepare features and labels for df
X_train = df.drop(columns=['label'])
y_train = df['label']

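# Prepare features and labels for df_test (the test file also has a label column)
X_test = df_test.drop(columns=['label'])
y_test = df_test['label']

# Encode the labels. The encoder is fitted once on the union of the train and
# test labels so both splits share one mapping; refitting it on the test labels
# would assign different integers to the same attack name.
label_encoder = LabelEncoder()
label_encoder.fit(pd.concat([y_train, y_test]))
y_train_encoded = label_encoder.transform(y_train)
y_test_encoded = label_encoder.transform(y_test)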

# Fit and transform the input features for df and df_test
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

#  Handle Class Imbalance Using Random Oversampling (RO)
ros = RandomOverSampler(random_state=42)

# Apply Random Oversampling to the training data only
X_train_resampled, y_train_resampled = ros.fit_resample(X_train_preprocessed, y_train_encoded)

# Leave the test data at its original class distribution; the variable names
# are kept so the later steps run unchanged
X_test_resampled, y_test_resampled = X_test_preprocessed, y_test_encoded

#  Clustering for Meta-Features
n_clusters = 10

# Apply KMeans clustering to the resampled dataset
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(X_train_resampled)

# Add cluster labels as a new feature to the dataset
X_train_resampled_with_clusters = np.hstack((X_train_resampled, cluster_labels.reshape(-1, 1)))

#  Feature Extraction using PCA
n_components = 20 

# Apply PCA to the resampled dataset with clusters
pca = PCA(n_components=n_components, random_state=42)
X_train_pca = pca.fit_transform(X_train_resampled_with_clusters)

# Apply the same steps to the test data
# Apply KMeans clustering to the test data (using the already fitted model)
cluster_labels_test = kmeans.predict(X_test_resampled)

# Add cluster labels as a new feature to the test dataset
X_test_resampled_with_clusters = np.hstack((X_test_resampled, cluster_labels_test.reshape(-1, 1)))

# Apply PCA to the test data using the already fitted PCA model
X_test_pca = pca.transform(X_test_resampled_with_clusters)

# Define the number of splits (k-fold)
k = 10

# Initialize k-fold cross-validation
kf = KFold(n_splits=k, shuffle=True, random_state=42)

# Initialize the models
models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Extra Trees': ExtraTreesClassifier(random_state=42),
    
}


accuracy_scores = {model_name: [] for model_name in models}


for model_name, model in models.items():
    print(f"Evaluating {model_name}...")

    fold_accuracy_scores = []

    # X_train_pca holds the PCA-transformed, oversampled training features
    for train_index, val_index in kf.split(X_train_pca):
        # Use fold-local names so the full X_train/X_test/y_train/y_test
        # defined earlier are not overwritten
        X_fold_train, X_fold_val = X_train_pca[train_index], X_train_pca[val_index]
        y_fold_train, y_fold_val = y_train_resampled[train_index], y_train_resampled[val_index]

        # Train on nine folds, then predict the held-out fold
        model.fit(X_fold_train, y_fold_train)
        y_pred = model.predict(X_fold_val)

        # Record the accuracy for this fold
        fold_accuracy_scores.append(accuracy_score(y_fold_val, y_pred))

    # Compute the average accuracy for this model across all folds
    average_accuracy = np.mean(fold_accuracy_scores)
    accuracy_scores[model_name] = fold_accuracy_scores

    print(f"{model_name} - Accuracy scores for each fold: {fold_accuracy_scores}")
    print(f"{model_name} - Average accuracy: {average_accuracy:.4f}")

# Print the final accuracy scores for all models
print("\nFinal Model Evaluation:")
for model_name, scores in accuracy_scores.items():
    print(f"{model_name} - Mean accuracy: {np.mean(scores):.4f}")


Evaluating Decision Tree...
Decision Tree - Accuracy scores for each fold: [0.9996965568891271, 0.9995868008703007, 0.9997223818347333, 0.9996642757071192, 0.9995738883974976, 0.9997030131255286, 0.9996771881799223, 0.9997159255983317, 0.9997352943075364, 0.999696554930014]
Decision Tree - Average accuracy: 0.9997
Evaluating Random Forest...
Random Forest - Accuracy scores for each fold: [0.9997675754895441, 0.9997869441987488, 0.9998256816171581, 0.9998063129079534, 0.9997030131255286, 0.9998192253807565, 0.9997934004351503, 0.9998256816171581, 0.9998321378535596, 0.9997804865451164]
Random Forest - Average accuracy: 0.9998
Evaluating Extra Trees...
Extra Trees - Accuracy scores for each fold: [0.9997934004351503, 0.9997740317259457, 0.9998192253807565, 0.9997998566715519, 0.9997352943075364, 0.9998385940899612, 0.9998515065627643, 0.9998450503263627, 0.9998515065627643, 0.9997804865451164]
Extra Trees - Average accuracy: 0.9998

Final Model Evaluation:
Decision Tree - Mean accuracy: 

In [7]:
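import pickle
# After the loop above, `model` is the last entry in `models` (Extra Trees), as fitted on the final fold.
# The fitted preprocessor, kmeans, and pca objects are not saved here but are needed to transform new data.
with open('ML-Based IDS.pickle', 'wb') as f:
    pickle.dump(model, f)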