# Assignment 2: Machine Learning with Scikit-Learn
**Iftekhar Rafi B00871031**

**Abdulla Sadoun B00900541**

## Question 0: Revisiting Your Dataset (A1Q1)

### a) Dataset Selection
For this assignment, we selected the **CSE-CIC-IDS2018** dataset from the **Canadian Institute for Cybersecurity at the University of New Brunswick**, the same dataset we used for Assignment 1. It is well suited to analyzing **network intrusion detection**. The underlying research covers several intrusion methods; we focus on the data collected from the **brute-force attack scenarios**. The data comes from their processed dataset for Wednesday, 14 February 2018, as described below.

#### Dataset Details
- **Source**: [CICIDS 2018 Dataset](https://www.unb.ca/cic/datasets/ids-2018.html)
- **Types of Attacks Covered**: Brute-force attacks (FTP and SSH)

| Attacker                     | Victim                          | Attack Name      | Date          | Attack Start Time | Attack Finish Time |
|------------------------------|---------------------------------|------------------|---------------|-------------------|--------------------|
| 172.31.70.4 (Valid IP:18.221.219.4) | 172.31.69.25 (Valid IP:18.217.21.148) | FTP-BruteForce   | Wed-14-02-2018 | 10:32             | 12:09              |
| 172.31.70.6 (Valid IP:13.58.98.64)  | 172.31.69.25 (Valid IP:18.217.21.148) | SSH-BruteForce   | Wed-14-02-2018 | 14:01             | 15:31              |


### b) Dataset Loading
The dataset was loaded into a Pandas DataFrame using the following code:




In [3]:
import pandas as pd
import numpy as np
#from google.colab import drive
#drive.mount('/content/drive')

# Load the dataset
file_path = 'Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv'
df = pd.read_csv(file_path)

# Display basic information about the dataset
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 80 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   Dst Port           1048575 non-null  int64  
 1   Protocol           1048575 non-null  int64  
 2   Timestamp          1048575 non-null  object 
 3   Flow Duration      1048575 non-null  int64  
 4   Tot Fwd Pkts       1048575 non-null  int64  
 5   Tot Bwd Pkts       1048575 non-null  int64  
 6   TotLen Fwd Pkts    1048575 non-null  int64  
 7   TotLen Bwd Pkts    1048575 non-null  int64  
 8   Fwd Pkt Len Max    1048575 non-null  int64  
 9   Fwd Pkt Len Min    1048575 non-null  int64  
 10  Fwd Pkt Len Mean   1048575 non-null  float64
 11  Fwd Pkt Len Std    1048575 non-null  float64
 12  Bwd Pkt Len Max    1048575 non-null  int64  
 13  Bwd Pkt Len Min    1048575 non-null  int64  
 14  Bwd Pkt Len Mean   1048575 non-null  float64
 15  Bwd Pkt Len Std    1048575 non-n

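|   | Dst Port | Protocol | Timestamp | Flow Duration | Tot Fwd Pkts | Tot Bwd Pkts | TotLen Fwd Pkts | TotLen Bwd Pkts | Fwd Pkt Len Max | Fwd Pkt Len Min | ... | Fwd Seg Size Min | Active Mean | Active Std | Active Max | Active Min | Idle Mean | Idle Std | Idle Max | Idle Min | Label |
|---|----------|----------|-----------|---------------|--------------|--------------|-----------------|-----------------|-----------------|-----------------|-----|------------------|-------------|------------|------------|------------|-----------|----------|----------|----------|-------|
| 0 | 0 | 0 | 14/02/2018 08:31:01 | 112641719 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0.0 | 0.0 | 0 | 0 | 56320859.5 | 139.300036 | 56320958 | 56320761 | Benign |
| 1 | 0 | 0 | 14/02/2018 08:33:50 | 112641466 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0.0 | 0.0 | 0 | 0 | 56320733.0 | 114.551299 | 56320814 | 56320652 | Benign |
| 2 | 0 | 0 | 14/02/2018 08:36:39 | 112638623 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0.0 | 0.0 | 0 | 0 | 56319311.5 | 301.934596 | 56319525 | 56319098 | Benign |
| 3 | 22 | 6 | 14/02/2018 08:40:13 | 6453966 | 15 | 10 | 1239 | 2273 | 744 | 0 | ... | 32 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | Benign |
| 4 | 22 | 6 | 14/02/2018 08:40:23 | 8804066 | 14 | 11 | 1143 | 2209 | 744 | 0 | ... | 32 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | Benign |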


### c) Description of the Dataset

The dataset consists of network traffic data collected on **February 14, 2018**, as part of the **CSE-CIC-IDS2018** dataset. It captures various network activities, including normal (benign) traffic and brute-force attacks targeting FTP and SSH services. The dataset includes **1,048,575 rows** and **80 columns**, each representing different characteristics of the network flows.

Here is a breakdown of the dataset:
- **Total Number of Records**: 1,048,575
- **Total Number of Features (Columns)**: 80
- **Types of Data**:
  - **Numeric Features**: Most of the columns are numeric, such as packet counts, flow durations, and packet sizes. These can help analyze the behavior of network traffic.
  - **Categorical Features**: The dataset includes a column labeled "Label," which identifies whether a given flow is benign or part of a specific attack type.

### Key Features in the Dataset
- **Flow Duration**: Measures how long a network flow lasted.
- **Total Forward and Backward Packets** (`Tot Fwd Pkts`, `Tot Bwd Pkts`): Counts the number of packets sent in the forward and backward directions.
- **Packet Length Statistics**: Provides information about the maximum, minimum, average, and standard deviation of packet lengths in both directions.
- **Flow Rate (`Flow Byts/s`, `Flow Pkts/s`)**: Indicates the number of bytes or packets transmitted per second during a flow.
- **Inter-Arrival Time**: Measures the time between consecutive packets.
- **Flags**: Various TCP flags, such as `SYN`, `ACK`, and `RST`, are used to indicate specific network conditions.
- **Active and Idle Times**: Represents the time intervals when the flow was actively transmitting data and when it was idle.

### Dataset Usage
The data can be used to detect and analyze network intrusions, specifically **brute-force attacks**. By examining patterns in traffic, such as sudden spikes in flow rates or repeated login attempts, it is possible to identify suspicious behaviors indicative of an ongoing attack.

### Example Records
Each row in the dataset represents a network flow, containing information like the flow's duration, the total number of packets, and the size of packets transmitted in both directions. The dataset also provides a label indicating whether the flow is normal (benign) or an attack.

Overall, this dataset is useful for tasks such as detecting brute-force attacks and understanding the characteristics of network traffic during different types of events.
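
Because the `Label` column drives every model later in the notebook, it helps to check how the classes are distributed before training. The snippet below is a minimal sketch on a made-up toy DataFrame (the rows are illustrative, not taken from the real file); on the real data the same two calls would be applied to `df['Label']`.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the label names mirror the dataset,
# but the rows themselves are invented for illustration.
toy_df = pd.DataFrame({
    "Flow Duration": [112641719, 6453966, 8804066, 1200, 950, 300],
    "Label": ["Benign", "Benign", "Benign",
              "FTP-BruteForce", "SSH-Bruteforce", "SSH-Bruteforce"],
})

# Absolute and relative frequency of each class
label_counts = toy_df["Label"].value_counts()
label_share = toy_df["Label"].value_counts(normalize=True)

print(label_counts)
print(label_share.round(3))
```

A strongly imbalanced distribution is one reason to pass `stratify=y` when splitting, as the notebook does in the next question.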

## Question 1: Utilizing Machine Learning (15 points)

### Experimenting with 3 machine learning models. For each model:
#### a) Train your models by applying an appropriate training/test split.  
#### b) If applicable to that model, explore how using regularization can improve it. 
#### c) For each model, create visuals that highlight interesting aspects of your model and the results. 
#### d) For classification models, obtain the classification report and the confusion matrix. Comment on the results.
#### e) If applicable, calculate and visualize feature importance.

We started by preprocessing the data. Our first training attempts failed with errors caused by infinite and NaN values in `X`, so we replaced the infinite values with NaN and then used an imputer to replace each NaN with the mean of its column.

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
#from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# Prepare the data
X = df.drop(columns=['Timestamp', 'Label'])
y = df['Label']

# infinity x values fix (replaced with nan)
X.replace([np.inf, -np.inf], np.nan, inplace=True)

# Fill missing values with the mean of the column (filling nan values)
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)

# Split the data into training and testing sets
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.25, stratify=y) # 75% training and 25% test (used by professor)

# Reducing the size to avoid vm crash
X_train, y_train = X_train[:100000], y_train[:100000]  # Use only the first 100k samples for training to avoid crashing

We also concluded that, given our limited hardware, we could not train on as much data as we had planned. The loaded dataset has 1,048,575 records, so we limited the training set to the first 100,000 samples of the training split to avoid crashing the VM.
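
The cell above imputes on the full data before splitting and then keeps the first 100,000 training rows. As a hedged alternative (not what the notebook actually ran), the sketch below draws a stratified subsample and fits the imputer on training rows only, so no test-set statistics leak into preprocessing. The data is synthetic, and the names `X_demo`, `y_demo`, and `imputer_demo` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 1,000 rows, 5 features, 3 imbalanced classes (made up).
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.choice(["Benign", "FTP-BruteForce", "SSH-Bruteforce"],
                    size=1000, p=[0.6, 0.2, 0.2])

# Inject a few infinities and missing values, then map inf -> NaN.
X_demo[::50, 0] = np.inf
X_demo[::70, 1] = np.nan
X_demo[~np.isfinite(X_demo)] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=123)

# Stratified subsample of the training set (200 rows here instead of 100,000).
X_small, _, y_small, _ = train_test_split(
    X_tr, y_tr, train_size=200, stratify=y_tr, random_state=123)

# Fit the imputer on training rows only, then apply it to both sets.
imputer_demo = SimpleImputer(strategy="mean")
X_small = imputer_demo.fit_transform(X_small)
X_te = imputer_demo.transform(X_te)

print(X_small.shape, X_te.shape)
```

Since `train_test_split` shuffles by default, taking the first 100,000 rows is already a random sample; the stratified variant only guarantees that the class proportions are preserved exactly.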

### Logistic Regression model

We start with logistic regression because it is a widely used baseline for classification problems.

In [5]:
from sklearn.linear_model import LogisticRegression

# Initialize the logistic regression model.
# Earlier attempts, kept for reference:
# lr_model_1 = LogisticRegression(max_iter=1000)
# lr_model_1 = LogisticRegression(max_iter=2000)  # raised max_iter to avoid the convergence warning
# lr_model_1 = LogisticRegression(max_iter=2000, solver='saga')  # tried saga because max_iter=2000 alone did not work
lr_model_1 = LogisticRegression(max_iter=1000, solver='liblinear')  # retried liblinear after reducing the training data size

# Train the model
lr_model_1.fit(X_train, y_train)

# Model Evaluation
accuracy = lr_model_1.score(X_test, y_test)
print('The test accuracy is {0:5.2f} %'.format(accuracy*100))

The test accuracy is 98.78 %


In [6]:
# Predict on the test data
y_pred = lr_model_1.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Classification Report:
                precision    recall  f1-score   support

        Benign       0.99      0.99      0.99    166907
FTP-BruteForce       0.98      1.00      0.99     48340
SSH-Bruteforce       0.99      0.96      0.98     46897

      accuracy                           0.99    262144
     macro avg       0.99      0.98      0.99    262144
  weighted avg       0.99      0.99      0.99    262144

Confusion Matrix:
[[165408    984    515]
 [     1  48339      0]
 [  1695     13  45189]]
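
The figures in the classification report can be cross-checked directly from the confusion matrix above, which also shows where the errors concentrate. This is a small worked calculation using the matrix exactly as printed; rows are true classes and columns are predicted classes, following scikit-learn's convention.

```python
import numpy as np

labels = ["Benign", "FTP-BruteForce", "SSH-Bruteforce"]

# Confusion matrix copied from the logistic regression output above
cm = np.array([[165408,   984,   515],
               [     1, 48339,     0],
               [  1695,    13, 45189]])

accuracy_cm = np.trace(cm) / cm.sum()     # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # per true class (row-wise)
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted class (column-wise)

print(f"accuracy = {accuracy_cm:.4f}")
for name, p, r in zip(labels, precision, recall):
    print(f"{name:<15} precision={p:.4f} recall={r:.4f}")
```

The largest off-diagonal entry is the 1,695 SSH-Bruteforce flows predicted as Benign, which is what pulls SSH recall down to 0.96; the 984 Benign flows predicted as FTP-BruteForce explain the 0.98 precision for that class.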


In [8]:
# from sklearn.linear_model import LinearRegression
# lr_model = LinearRegression() #uses the closed-form solution
# print(lr_model)
# lr_model.fit(X_train, y_train)
# y_pred = lr_model.predict(X_test)
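
For part b), one way to explore regularization with logistic regression is to vary `C`, which in scikit-learn is the inverse of the regularization strength (smaller `C` means a stronger L2 penalty by default). The sketch below runs on synthetic data from `make_classification`, not on the IDS2018 flows, and standardizes the features first so that the penalty treats all coefficients comparably; the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem standing in for the real data
X_syn, y_syn = make_classification(n_samples=2000, n_features=20,
                                   n_informative=5, n_classes=3,
                                   random_state=0)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=123)

# Smaller C = stronger regularization
scores = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(C=C, max_iter=1000))
    model.fit(X_a, y_a)
    scores[C] = model.score(X_b, y_b)
    print(f"C={C:<5} test accuracy={scores[C]:.3f}")
```

On the real data the same loop could be run against `X_train` and `X_test`; whether a stronger or weaker penalty helps there would have to be measured rather than assumed.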

### Random Forest
Random forests handle high-dimensional data well and provide a feature importance score for each input.

In [7]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Random Forest model
rf_model.fit(X_train, y_train)

# Predict on the test data
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model
accuracy_rf = rf_model.score(X_test, y_test)
print('The test accuracy of the Random Forest model is {0:5.2f} %'.format(accuracy_rf * 100))

print("Classification Report for Random Forest:")
print(classification_report(y_test, y_pred_rf))

print("Confusion Matrix for Random Forest:")
print(confusion_matrix(y_test, y_pred_rf))

The test accuracy of the Random Forest model is 100.00 %
Classification Report for Random Forest:
                precision    recall  f1-score   support

        Benign       1.00      1.00      1.00    166907
FTP-BruteForce       1.00      1.00      1.00     48340
SSH-Bruteforce       1.00      1.00      1.00     46897

      accuracy                           1.00    262144
     macro avg       1.00      1.00      1.00    262144
  weighted avg       1.00      1.00      1.00    262144

Confusion Matrix for Random Forest:
[[166905      1      1]
 [     0  48340      0]
 [     1      5  46891]]
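
Since random forests expose feature importances, the ranking can be read from the fitted model's `feature_importances_` attribute. The following is a minimal sketch on synthetic data with hypothetical feature names; in the notebook, `X` becomes a NumPy array after imputation, so the real column names would have to be taken from `df.drop(columns=['Timestamp', 'Label']).columns`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 3 informative features out of 8
X_fi, y_fi = make_classification(n_samples=1000, n_features=8,
                                 n_informative=3, n_redundant=0,
                                 n_classes=3, n_clusters_per_class=1,
                                 random_state=0)
feature_names = [f"feature_{i}" for i in range(X_fi.shape[1])]  # hypothetical

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_fi, y_fi)

# Impurity-based importances; they sum to 1 across all features
importances = forest.feature_importances_
order = np.argsort(importances)[::-1]  # most important first

for idx in order[:5]:
    print(f"{feature_names[idx]:<12} {importances[idx]:.3f}")
```

The sorted values can be passed to a bar chart for the visualization asked for in the assignment.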


### Support Vector Machine
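SVMs can separate classes that are not linearly separable by using kernels, but their results depend heavily on feature scaling. Here we use a linear kernel on unscaled features and cap training at 1,000 iterations.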

In [9]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Initialize the SVM model
svm_model = SVC(kernel='linear', max_iter=1000)

# Train the model
svm_model.fit(X_train, y_train)

# Predict on the test data
y_pred_svm = svm_model.predict(X_test)

# Evaluate the model (store the score separately so svm_model stays the fitted estimator)
accuracy_svm = svm_model.score(X_test, y_test)
print('The test accuracy of the Support Vector Machine model is {0:5.2f} %'.format(accuracy_svm * 100))

print("Classification Report for SVM:")
print(classification_report(y_test, y_pred_svm))

print("Confusion Matrix for SVM:")
print(confusion_matrix(y_test, y_pred_svm))



The test accuracy of the Support Vector Machine model is 64.22 %
Classification Report for SVM:
                precision    recall  f1-score   support

        Benign       0.80      0.58      0.67    166907
FTP-BruteForce       0.98      1.00      0.99     48340
SSH-Bruteforce       0.25      0.50      0.34     46897

      accuracy                           0.64    262144
     macro avg       0.68      0.69      0.67    262144
  weighted avg       0.74      0.64      0.67    262144

Confusion Matrix for SVM:
[[96551   739 69617]
 [    0 48340     0]
 [23432     5 23460]]
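
Because SVMs are sensitive to feature scale, a common remedy is to standardize the inputs inside a pipeline before fitting. The sketch below shows that pattern on synthetic data with one deliberately large-valued feature; it is an assumption that this would help on the IDS2018 flows, and the notebook did not run it.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the IDS2018 flows)
X_sv, y_sv = make_classification(n_samples=1500, n_features=10,
                                 n_informative=5, n_classes=3,
                                 random_state=0)
X_sv[:, 0] *= 1e6  # exaggerate one feature's scale

X_c, X_d, y_c, y_d = train_test_split(
    X_sv, y_sv, test_size=0.25, stratify=y_sv, random_state=123)

# Same linear kernel as in the notebook, but with standardized inputs.
# The scaler is fitted on the training split only, inside the pipeline.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scaled_svm.fit(X_c, y_c)

accuracy_scaled = scaled_svm.score(X_d, y_d)
print(f"Scaled linear SVM test accuracy: {accuracy_scaled:.3f}")
```

Keeping the scaler in the pipeline means `predict` and `score` apply the same transformation automatically, which avoids scaling the test set with its own statistics.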


### K-Nearest Neighbors

In [8]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize the KNN model
knn_model = KNeighborsClassifier(n_neighbors=5)

# Train the KNN model
knn_model.fit(X_train, y_train)

# Predict on the test data
y_pred_knn = knn_model.predict(X_test)

# Evaluate the model
accuracy_knn = knn_model.score(X_test, y_test)
print('The test accuracy of the K-Nearest Neighbors model is {0:5.2f} %'.format(accuracy_knn * 100))

print("Classification Report for K-Nearest Neighbors:")
print(classification_report(y_test, y_pred_knn))

print("Confusion Matrix for K-Nearest Neighbors:")
print(confusion_matrix(y_test, y_pred_knn))

The test accuracy of the K-Nearest Neighbors model is 99.94 %
Classification Report for K-Nearest Neighbors:
                precision    recall  f1-score   support

        Benign       1.00      1.00      1.00    166907
FTP-BruteForce       1.00      1.00      1.00     48340
SSH-Bruteforce       1.00      1.00      1.00     46897

      accuracy                           1.00    262144
     macro avg       1.00      1.00      1.00    262144
  weighted avg       1.00      1.00      1.00    262144

Confusion Matrix for K-Nearest Neighbors:
[[166805     25     77]
 [     1  48339      0]
 [    33     20  46844]]


### Linear Regression Section