# COCS 670 - CyberAI - Assignment 2

## Table of Contents
1. Assignment 2 Problem Description
2. Dataset Description: CICIoT2023 - IoT Attack Classification Dataset
   - Overview
   - Key Contributions
   - Challenges in IoT Security Data Production
   - Canadian Institute for Cybersecurity (CIC)
   - Supporting IoT Security Research 
   - Data Categories
   - Citation
3. 3 Classification Algorithms
4. Experiment: Comparing Classification Algorithms  

## Assignment 2 Problem Description

Choose a dataset that is well suited for classification. You can use any dataset related to cybersecurity that
you would like to classify. A good number of datasets can be found in the UCI machine learning data
repository (searching keywords such as cybersecurity, security, attack and defense, intrusion
detection, etc.) but feel free to use any dataset that you want. Make sure that you select a dataset that has
a class variable. Then use a tool such as Python, R, Weka, or RapidMiner to classify the dataset. The
specific requirements for the assignment are as follows:

* Choose a dataset that is of interest to you and is well suited for classification
* Describe the dataset
* Research at least 3 different classification algorithms.
* Give an explanation of the algorithms that you are using.
* Design an experiment using training and testing (holdout method), cross-validation, or the
bootstrap method. Use a statistical test to validate your partition of the data.
* Compare the results of three or more classification methods using the same experimental setup
using one or more classification evaluation methods discussed in class. The metrics that you
choose are up to you and can include accuracy, error rate, sensitivity, specificity, precision, recall,
and F measure.
* Write a report that describes your experiment and results. The report should be in either ACM or
IEEE conference paper format and should include an introduction section that details the dataset
and the objectives of the analysis; a methodology section that explains the approach that you are
using to mine the dataset including the steps used to preprocess the data, the classification
algorithms and parameters, experimental setup (e.g. holdout, cross validation, bootstrapping),
accuracy metrics (e.g. precision, recall, f-measure, etc...); a results section that shows the results
of your analysis and any interesting patterns that you found; and a conclusion section that
summarizes your results, discusses the limitations of your approach, and any difficulties that you
had with your experiment.

## Dataset Description: CICIoT2023 - IoT Attack Classification Dataset

### Overview

CICIoT2023 is a real-time dataset designed to support research and development in security analytics for Internet of Things (IoT) environments. It comprises a wide range of IoT attack scenarios executed within an IoT topology of 105 devices. The attacks are grouped into seven classes: Distributed Denial of Service (DDoS), Denial of Service (DoS), Reconnaissance (Recon), Web-based attacks, Brute Force attacks, Spoofing, and Mirai attacks. The primary goal of the dataset is to facilitate the development of machine learning and deep learning algorithms that classify and detect IoT network traffic as malicious or benign.

### Key Contributions

The creators of CICIoT2023 have made significant contributions to the field of IoT security:

1. Realistic IoT Attack Dataset: CICIoT2023 introduces a novel and realistic IoT attack dataset. Unlike many previous datasets that rely on simulated or limited IoT device setups, this dataset leverages an extensive topology composed of actual IoT devices, mimicking real-world IoT applications.

2. 33 Reproducible Attacks: The dataset includes 33 distinct attacks, carefully documented and collected as part of seven different classes. These attacks serve as valuable resources for understanding and reproducing IoT attack scenarios.

3. Performance Evaluation: The dataset enables the evaluation of machine learning and deep learning algorithms for classifying and detecting IoT network traffic as malicious or benign. Researchers can assess the effectiveness of various techniques in identifying IoT attacks.

### Challenges in IoT Security Data Production

Producing high-quality IoT security data is a challenging endeavor, primarily due to the following factors:
- Extensive Network Topology: Creating an IoT attack dataset that accurately reflects real-world scenarios requires an extensive network topology comprising multiple authentic IoT devices. Such setups involve substantial costs, specialized network equipment (e.g., switches, routers, network taps), and dedicated personnel for maintenance.

### Canadian Institute for Cybersecurity (CIC)

The Canadian Institute for Cybersecurity (CIC) is a prominent entity in the cybersecurity ecosystem, known for making substantial contributions to both industry and academia. CIC's achievements include the creation of datasets for developing new cybersecurity applications and establishing partnerships with industry stakeholders to enhance cybersecurity practices and develop innovative solutions.

### Supporting IoT Security Research

CIC's success has allowed the establishment of an IoT lab equipped with a dedicated network infrastructure to facilitate the development of IoT security solutions. By sharing the extensive dataset collected from this infrastructure, CIC aims to advance research in IoT security and support various initiatives addressing different aspects of IoT security.

### Data Categories

The dataset encompasses a wide range of IoT attacks categorized into seven classes:

* DDoS Attacks: Various DDoS attack types, including ACK fragmentation, UDP flood, SlowLoris, ICMP flood, RSTFIN flood, PSHACK flood, HTTP flood, UDP fragmentation, TCP flood, SYN flood, and SynonymousIP flood.
* Brute Force Attacks: Dictionary brute force attacks.
* Spoofing Attacks: ARP spoofing and DNS spoofing.
* DoS Attacks: Denial of Service attacks, including TCP flood, HTTP flood, SYN flood, and UDP flood.
* Reconnaissance Attacks: Reconnaissance activities such as ping sweeps, OS scans, vulnerability scans, port scans, and host discovery.
* Web-based Attacks: Web-based attack methods, including SQL injection, command injection, backdoor malware, uploading attacks, XSS (Cross-Site Scripting), and browser hijacking.
* Mirai Attacks: Specific Mirai attacks, such as GRE-IP flood, GRE-Eth flood, and UDPPlain.
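
The grouping above can be expressed as a lookup from fine-grained attack labels to the seven classes. The sketch below is only an illustration: the label strings and the `to_category` helper are assumptions, not taken from the dataset files, so they should be checked against `dataset['label'].unique()` before use.

```python
# Hypothetical label strings: verify them against dataset['label'].unique().
PREFIX_TO_CATEGORY = {
    "DDoS-": "DDoS",
    "DoS-": "DoS",
    "Mirai-": "Mirai",
    "Recon-": "Recon",
}

# Labels assumed to carry no category prefix are listed explicitly.
EXACT_TO_CATEGORY = {
    "DictionaryBruteForce": "Brute Force",
    "DNS_Spoofing": "Spoofing",
    "SqlInjection": "Web-based",
    "XSS": "Web-based",
    "BenignTraffic": "Benign",
}


def to_category(label: str) -> str:
    """Map a fine-grained attack label to its coarse class."""
    if label in EXACT_TO_CATEGORY:
        return EXACT_TO_CATEGORY[label]
    for prefix, category in PREFIX_TO_CATEGORY.items():
        if label.startswith(prefix):
            return category
    # Anything not covered by the assumed names is flagged, not guessed.
    return "Unknown"


print(to_category("DDoS-SYN_Flood"))
print(to_category("DoS-SYN_Flood"))
```

Returning `"Unknown"` for unmatched labels makes gaps in the assumed mapping visible instead of silently misclassifying them.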

### Citation
E. C. P. Neto, S. Dadkhah, R. Ferreira, A. Zohourian, R. Lu, A. A. Ghorbani. "CICIoT2023: A real-time dataset and benchmark for large-scale attacks in IoT environment," Sensor (2023) – (submitted to Journal of Sensors).

This dataset serves as a valuable resource for researchers and practitioners aiming to enhance the security of IoT environments and develop robust intrusion detection and prevention systems.

## 3 Classification Algorithms

The three classification algorithms that I will be using are:
1. K-Nearest Neighbors
2. Random Forest
3. Support Vector Machine
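
To make the first of these algorithms concrete, here is a minimal from-scratch sketch of K-Nearest Neighbors: a query point receives the majority label of its `k` closest training points. The function name `knn_predict` and the toy feature values are illustrative assumptions. This is not the implementation used in the experiment, where a library classifier would normally be used.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Keep the labels of the k closest points and take the most frequent one.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy two-feature data (made up for illustration).
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["benign", "benign", "benign", "attack", "attack", "attack"]

print(knn_predict(train_X, train_y, (0.2, 0.1)))
print(knn_predict(train_X, train_y, (5.5, 5.5)))
```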

## Experiment: Comparing Classification Algorithms

In this experiment, we aim to assess and compare the performance of three classification algorithms on the CICIoT2023 dataset. The objective is to determine which algorithm performs best on this classification task.

### Step 1: Data Preprocessing

1. Load the dataset: The dataset is loaded into a Pandas DataFrame. It contains information about IoT attacks classified into seven categories, namely DDoS, DoS, Recon, Web-based, Brute Force, Spoofing, and Mirai.

2. Split the dataset: The dataset is split into a training set and a test set. The training set is used to train the classifiers, and the test set is used for evaluation. This experiment uses a 70% training and 30% testing split.

### Step 2: Model Selection and Training

For this experiment, three classification algorithms are chosen:

#### Decision Tree Classifier

3. Initialize the Decision Tree classifier.
4. Train the classifier on the training data.

#### K-Nearest Neighbors (KNN) Classifier

5. Initialize the KNN classifier.
6. Train the classifier on the training data.

#### Support Vector Machine (SVM) Classifier

7. Initialize the SVM classifier.
8. Train the classifier on the training data.

### Step 3: Model Evaluation

9. Use the trained classifiers to make predictions on the test data.
10. Evaluate the performance of each classifier using various metrics, including accuracy, precision, recall, F1-score, and the confusion matrix.
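
As a reference for how these metrics are defined, the sketch below computes accuracy and per-class precision, recall, and F1-score from lists of true and predicted labels. The helper names and the toy labels are assumptions for illustration. In practice a library routine would typically be used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, f1), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels (made up for illustration).
y_true = ["DDoS", "DDoS", "DoS", "DoS", "Mirai"]
y_pred = ["DDoS", "DoS", "DoS", "DoS", "Mirai"]

print(accuracy(y_true, y_pred))
print(per_class_metrics(y_true, y_pred, "DoS"))
```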

### Step 4: Results and Comparison

11. Compare the results of the three classifiers, considering their accuracy and other relevant metrics. Present the results in tables or visualizations.
12. Draw conclusions about which classifier performs best on the dataset based on the evaluation metrics.

Note: Hyperparameter tuning and cross-validation may be applied to optimize classifier performance.
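
For readers unfamiliar with cross-validation, the following sketch shows how k-fold splitting partitions row indices so that each row is used for testing exactly once. The function `k_fold_indices` is a hypothetical helper written for illustration and is not part of the experiment's code.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Shuffle with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


for train_idx, test_idx in k_fold_indices(10, k=5):
    print(len(train_idx), len(test_idx))
```

Each classifier would be trained and evaluated once per fold, and the metric of interest averaged across the folds.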

By following this experiment, we aim to make an informed decision about the most suitable classification algorithm for the given dataset and classification task.

## Experiment

Import the required libraries.

In [None]:
# Import pandas and NumPy
import pandas as pd
import numpy as np

Read in a portion of the dataset.
**NOTE:** Only a portion of the dataset is used due to time and memory constraints.

In [None]:
dataset = pd.read_csv("./datasets/part-00000-363d1ba3-8ab5-4f96-bc25-4d5862db7cb9-c000.csv")

View the structure of the dataframe.
The label column is the target variable.

In [None]:
dataset

Use the holdout method to split the dataset into training and testing sets.

In [None]:
# Split the data into training and testing sets
train = dataset.sample(frac=0.7, random_state=0)  # random_state is a seed value, so that the same sample is selected each time
test = dataset.drop(train.index)  # drop the rows that are in the training set, so that the test set is disjoint
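
One possible way to check that the split preserved the class distribution is a chi-square test of homogeneity on the label counts of the two partitions. The sketch below computes only the test statistic and its degrees of freedom. The function name is an assumption, and this is one of several tests that could be used. It could be called as `chi_square_homogeneity(train['label'], test['label'])`.

```python
from collections import Counter


def chi_square_homogeneity(train_labels, test_labels):
    """Return (statistic, degrees_of_freedom) comparing two label distributions."""
    train_counts = Counter(train_labels)
    test_counts = Counter(test_labels)
    classes = sorted(set(train_counts) | set(test_counts))
    n_train, n_test = len(train_labels), len(test_labels)
    total = n_train + n_test

    statistic = 0.0
    for c in classes:
        class_total = train_counts[c] + test_counts[c]
        for observed, row_total in ((train_counts[c], n_train), (test_counts[c], n_test)):
            # Expected count if both partitions shared the same class proportions.
            expected = row_total * class_total / total
            statistic += (observed - expected) ** 2 / expected
    # Two partitions and len(classes) categories give (2 - 1) * (len(classes) - 1).
    return statistic, len(classes) - 1


# Toy labels with identical class proportions in both partitions.
stat, dof = chi_square_homogeneity(["a"] * 70 + ["b"] * 30, ["a"] * 35 + ["b"] * 15)
print(stat, dof)
```

A statistic below the chi-square critical value for the given degrees of freedom gives no evidence that the training and test label distributions differ.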

Get the names of the feature (observation) columns and of the target column.

In [None]:
# Feature (observation) column names: every column except the target
obs = list(dataset.columns)
# Remove the 'label' column
obs.remove('label')

# Target (class) column name, kept as a one-element list
clas = ['label']

Separate the observations from the class/target variable in both the training and testing sets.
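
The sketch below illustrates this separation on a small made-up DataFrame; the column names `feature_a` and `feature_b` and all values are hypothetical. Selecting the target with a one-element list such as `clas` yields a two-dimensional array of shape `(n, 1)`, and `ravel()` flattens it to the one-dimensional shape `(n,)` that libraries such as scikit-learn generally expect for a target vector.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical columns and values).
toy = pd.DataFrame({
    "feature_a": [0.1, 0.5, 0.2, 0.9],
    "feature_b": [10.0, 200.0, 15.0, 300.0],
    "label": ["BenignTraffic", "DDoS", "BenignTraffic", "DDoS"],
})

obs = [c for c in toy.columns if c != "label"]  # feature column names
clas = ["label"]                                # target column name, as a list

X = toy[obs]              # 2-D feature table, shape (4, 2)
y_2d = toy[clas].values   # 2-D column vector, shape (4, 1)
y = y_2d.ravel()          # flattened to 1-D, shape (4,)

print(X.shape, y_2d.shape, y.shape)
```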

Use of ravel() 