
Intrusion-and-Vulnerability-Detection-in-Software-Defined-Networks-SDN-Team-ML-IDS

Intrusion and Vulnerability Detection in Software-Defined Networks (SDN) by Team ML-IDS

Original and modified data for the experiments can be found here: https://drive.google.com/drive/folders/1hp8FB270BEYhK2dAIZJsBHIfNV8Fhwy1?usp=drive_link

Overview

This repository contains the research work conducted by Nasik Sami Khan, Md. Shamim Towhid, and Md Mahibul Hasan from the Department of Computer Science at the University of Regina, Canada. The goal of this research was to address the challenge posted by ULAK on the ITU platform, tackling the IDS problem in SDN environments. The research focuses on developing effective intrusion detection systems for Software-Defined Networks (SDNs) using machine learning techniques.

Abstract

The transition from conventional networking architectures to SDNs has brought about significant advancements in network management. However, the centralization of control within SDNs poses a security risk, necessitating robust intrusion detection systems. This research explores the development of multiclass classifiers capable of identifying various intrusion types in SDN-enabled networks. A comprehensive dataset, including normal, DDoS, malware, and web-based flow data, is provided by ULAK to facilitate this research. Machine learning techniques are employed to create effective intrusion detection models, contributing to the protection of SDN-based networks against a wide range of threats.

Key Words

  • SDN
  • IDS
  • Data Imbalance
  • Machine Learning
  • Ensemble Techniques

Research Methodology

Dataset

The provided dataset consists of 1.78 million rows with 77 distinct columns, including a label column representing the output variable. The dataset is characterized by a class imbalance, requiring specialized techniques for fair treatment of all classes. Feature selection is performed using the Random Forest classifier, resulting in a subset of 28 significant features.

Feature Selection

The feature selection process evaluates each attribute's significance using the Random Forest classifier, followed by Principal Component Analysis (PCA) for validation. The final subset of 28 features enhances predictive ability and interpretability; we used an embedded method based on the Random Forest classifier to curate these 28 features.
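
As a rough illustration, the sketch below shows how such an importance-based (embedded) selection can be expressed with scikit-learn; the file path, DataFrame layout, and the PCA check are assumptions for illustration, not the exact notebook code.

    # Sketch of embedded feature selection with Random Forest importances.
    # Assumes a pandas DataFrame loaded from a CSV whose "Label" column is
    # the target; the 28-feature cutoff matches the number reported above.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.decomposition import PCA

    df = pd.read_csv("train.csv")                      # illustrative path
    X, y = df.drop(columns=["Label"]), df["Label"]

    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
    rf.fit(X, y)

    # Rank features by impurity-based importance and keep the top 28.
    importances = pd.Series(rf.feature_importances_, index=X.columns)
    top_features = importances.sort_values(ascending=False).head(28).index.tolist()
    X_selected = X[top_features]

    # Optional PCA check: how much variance the selected subset retains.
    print(PCA(n_components=10).fit(X_selected).explained_variance_ratio_.cumsum())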

Data Preprocessing

Data preprocessing includes cleaning, conversion to numeric values, handling class imbalance, and scaling feature values. The dataset is split into training and validation sets to ensure consistent and well-prepared data for model training and evaluation. To handle the class imbalance, we down-sampled the Benign (majority) class and augmented the minority classes with synthetic samples generated by the SMOTE technique. We separated the data into three groups based on the class sample distribution.
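
A minimal sketch of the resampling step, using imbalanced-learn, is shown below; the class name "Benign" and the target sample counts are illustrative assumptions, not the exact values used in the notebook.

    # Sketch of the imbalance handling described above: down-sample the
    # majority (Benign) class, then oversample minority classes with SMOTE.
    # Continues from the feature-selection sketch (X_selected, y).
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.over_sampling import SMOTE

    X_train, X_val, y_train, y_val = train_test_split(
        X_selected, y, test_size=0.2, stratify=y, random_state=42)

    # Target count for the majority class is illustrative.
    under = RandomUnderSampler(sampling_strategy={"Benign": 200_000}, random_state=42)
    X_under, y_under = under.fit_resample(X_train, y_train)

    smote = SMOTE(random_state=42)                 # synthesize minority samples
    X_res, y_res = smote.fit_resample(X_under, y_under)

    # Scale features so all classifiers see comparable value ranges.
    scaler = StandardScaler()
    X_res = scaler.fit_transform(X_res)
    X_val = scaler.transform(X_val)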

Model Architecture

The model employs an ensemble approach, combining Random Forest, AdaBoost, and XGBoost classifiers. Random Forest serves as the meta-model, leveraging the strengths of the individual models to enhance accuracy and robustness. We used the stacking ensemble method to tackle this problem.

[Figure 4: Model architecture]
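
One way to express this stacking setup with scikit-learn and XGBoost is sketched below; the hyperparameters are illustrative, not the tuned values from the notebook.

    # Sketch of the stacking ensemble: Random Forest, AdaBoost and XGBoost as
    # base learners, with a Random Forest meta-model trained on their
    # out-of-fold predictions. Continues from the preprocessing sketch.
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, StackingClassifier
    from sklearn.preprocessing import LabelEncoder
    from xgboost import XGBClassifier

    label_encoder = LabelEncoder()
    y_res_enc = label_encoder.fit_transform(y_res)     # XGBoost needs numeric labels

    base_learners = [
        ("rf",  RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)),
        ("ada", AdaBoostClassifier(n_estimators=100, random_state=42)),
        ("xgb", XGBClassifier(n_estimators=200, tree_method="hist", random_state=42)),
    ]

    stack = StackingClassifier(
        estimators=base_learners,
        final_estimator=RandomForestClassifier(n_estimators=100, random_state=42),
        cv=5,              # out-of-fold base predictions feed the meta-model
        n_jobs=-1,
    )
    stack.fit(X_res, y_res_enc)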

Result Analysis

The ensemble model demonstrates promising results with an average F1-score of 97.77% during 5-fold cross-validation on the validation set. Challenges include handling rare security threats and complex attack patterns. The model's performance is compared with existing methods, including a hybrid CNN + Random Forest approach, a standalone Random Forest, and a standalone XGBoost classifier.
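
The cross-validated score can be reproduced along the lines of the snippet below; the README does not state the F1 averaging mode, so "weighted" is an assumption here.

    # Sketch of 5-fold cross-validated F1 evaluation of the stacked model.
    # Continues from the stacking sketch (stack, X_res, y_res_enc).
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(stack, X_res, y_res_enc, cv=cv,
                             scoring="f1_weighted", n_jobs=-1)
    print(f"mean F1 over 5 folds: {scores.mean():.4f}")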

Stacking model performance on test data

[Figure 5: Stacking model performance on test data]

Stacking model performance on a separate test dataset (CICIDS 2017), demonstrating model generalization:

[Figure 6: Stacking model performance on CICIDS 2017 data]

Previous Baseline models

Standalone Random Forest Baseline

[Figure 1: Random Forest baseline on test data]

Standalone XGBoost Baseline

[Figure 2: XGBoost baseline on test data]

CNN+Random Forest

[Figure 3: CNN + Random Forest on validation set]

Computational expense comparison

Model training time (sec):

[Figure 7: Model training time]

Inference time per sample (sec):

[Figure 8: Inference time per sample]

Ongoing research direction

We are working on building a hierarchical model to tackle the data imbalance issue in intrusion detection. So far, it has given promising results on the validation set, the test set, and a separate dataset.
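
As a generic illustration of the hierarchical idea only, the sketch below uses a common two-stage pattern (benign vs. attack, then attack type); the stages, classifiers, and the "Benign" label are assumptions, not the architecture summarized in the figure below.

    # Illustrative two-stage hierarchical classifier. The stages, classifiers,
    # and label names here are assumptions used only to sketch the concept.
    import numpy as np
    from sklearn.base import BaseEstimator, ClassifierMixin
    from sklearn.ensemble import RandomForestClassifier

    class TwoStageIDS(BaseEstimator, ClassifierMixin):
        """Stage 1: benign vs. attack. Stage 2: classify the attack type."""

        def __init__(self, benign_label="Benign"):
            self.benign_label = benign_label
            self.stage1 = RandomForestClassifier(n_estimators=200, random_state=42)
            self.stage2 = RandomForestClassifier(n_estimators=200, random_state=42)

        def fit(self, X, y):
            X, y = np.asarray(X), np.asarray(y)
            is_attack = (y != self.benign_label).astype(int)
            self.stage1.fit(X, is_attack)                           # binary stage
            self.stage2.fit(X[is_attack == 1], y[is_attack == 1])   # attack types
            return self

        def predict(self, X):
            X = np.asarray(X)
            preds = np.full(len(X), self.benign_label, dtype=object)
            attack_mask = self.stage1.predict(X) == 1
            if attack_mask.any():
                preds[attack_mask] = self.stage2.predict(X[attack_mask])
            return preds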

Model Architecture Summary:

[Figure 12: Ongoing approach architecture]

Result Summary:

Classification report on test set:


Classification report on different dataset (CIC_IDS_2017):


Comparison with baseline models:

[Figure 9: Comparison graph with baseline models]

[Figure 10: Ongoing approach results]

Performance on CICIDS_2017 dataset:

[Figure 11: Ongoing research results on CICIDS 2017]

Future Works:

We are optimistic about utilizing few-shot learning to tackle this problem set, given the nature of the challenge. We also plan to explore custom ensemble models and voting mechanisms to address class imbalance issues.

Conclusion

The research highlights the importance of intrusion detection in SDN and proposes an ensemble model for effective threat detection. Future work includes addressing challenges with rare classes, exploring alternate feature engineering, optimizing threshold values, and refining the ensemble model architecture.

How to Use

Overview

  • Model Training: The Stacking_Model Train and Test Updated.ipynb notebook contains the code for training the ensemble model. The dataset is loaded, features are selected, and the data is preprocessed. The base classifiers (Random Forest, XGBoost, AdaBoost) are trained, and their predictions are used as input to the meta-model (Random Forest). The F1-score is used as the evaluation metric.

  • Model Evaluation: The model is evaluated on a validation set, and the classification report and confusion matrix are generated. The best meta-model is saved for future use.

  • Model Testing: The trained ensemble model is loaded, and a test dataset is used to make predictions. The classification report and confusion matrix for the test set are generated.

Files

  • Stacking_Model Train and Test Updated.ipynb: Jupyter notebook containing the code for model training and testing.
  • *.pkl: Pickle files containing the trained models, scaler, and label encoder. Run the notebook and replace the test file with your own test files.

Usage

  1. Clone the repository:

    git clone https://github.com/ITU-AI-ML-in-5G-Challenge/Intrusion-and-Vulnerability-Detection-in-Software-Defined-Networks-SDN-Team-ML-IDS.git
  2. Open and run the Stacking_Model Train and Test Updated.ipynb notebook using Jupyter Notebook or Google Colab.

  3. After training, you can use the saved model to make predictions on your own test data. Update the model_path and data_path variables in the provided testing script.
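
A minimal sketch of that last step is shown below; the pickle and CSV file names are placeholders and should be matched to the artifacts actually produced by the notebook.

    # Sketch of running the saved model on your own data. File names are
    # placeholders; point them at the *.pkl artifacts saved by the notebook
    # and at a CSV containing the same 28 selected features used in training.
    import pickle
    import pandas as pd

    model_path = "stacking_model.pkl"          # placeholder
    scaler_path = "scaler.pkl"                 # placeholder
    encoder_path = "label_encoder.pkl"         # placeholder
    data_path = "my_test_data.csv"             # placeholder

    with open(model_path, "rb") as f:
        model = pickle.load(f)
    with open(scaler_path, "rb") as f:
        scaler = pickle.load(f)
    with open(encoder_path, "rb") as f:
        label_encoder = pickle.load(f)

    X_test = scaler.transform(pd.read_csv(data_path))
    y_pred = label_encoder.inverse_transform(model.predict(X_test))
    print(pd.Series(y_pred).value_counts())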

Results

The model achieves an average F1-score of 0.9777 on the validation set and 0.8612 on the test set. Detailed classification reports and confusion matrices are provided in the notebook.
