# Network Intrusion Detection Using Machine Learning/Deep Learning
This notebook builds machine learning and deep learning models that classify network traffic records into different classes (malicious or benign). Given a sample, the model's objective is to classify it as **Benign** or as a **BruteForce** attack (either FTP or SSH).

# Importing Libraries
First, we import the libraries needed for our workflow:
* NumPy
* Pandas
* Matplotlib
* Seaborn
* Plotly
* Scikit-learn
* Keras
* TensorFlow

In [None]:
# import libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os, re, time, math, tqdm, itertools
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.offline as pyo
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.neural_network import MLPClassifier
import keras
# import layers, models and callbacks from tensorflow.keras only (avoids duplicate imports and mixing the two packages)
from tensorflow.keras.layers import Conv2D, Conv1D, MaxPooling2D, MaxPooling1D, Flatten, BatchNormalization, Dense
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import CSVLogger, ModelCheckpoint

In [None]:
# check the available data
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

A lot of data is available to us in this notebook. We will perform analysis, preprocessing, and modeling on one of the datasets and summarize the results at the end.

In [None]:
!nvidia-smi

# Loading the Data
The first step is to load the available data into memory.

In [None]:
%%time
# Load the data into memory

# Step 1: Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Step 2: List the dataset file paths in Google Drive (pandas is already imported as pd)
file_paths = [
    '/content/drive/MyDrive/Intrusion Detection/02-14-2018.csv',
    '/content/drive/MyDrive/Intrusion Detection/02-15-2018.csv',
]

# Initialize an empty list to hold the individual DataFrames
df_list = []

# Loop through each file path, read the CSV into a DataFrame, and append it to the list
for file_path in file_paths:
    try:
        df = pd.read_csv(file_path, low_memory=False)  # Suppress the DtypeWarning by reading the file in one go
        df_list.append(df)
    except Exception as e:
        print(f"Error loading {file_path}: {e}")

# Concatenate all DataFrames in the list into a single DataFrame
network_data = pd.concat(df_list, ignore_index=True)

# Step 3: Check if the data is loaded correctly
print(network_data.head())
print("Total samples loaded:", network_data.shape[0])


# EDA (Exploratory Data Analysis)
To build a proper understanding of the dataset we are using, we perform a brief EDA (Exploratory Data Analysis). The EDA is subdivided into:
* Data Visuals
* Data Understanding
* Data Analysis

In [None]:
# check the shape of data
network_data.shape

In [None]:
# check the number of rows and columns
print('Number of Rows (Samples): %s' % str((network_data.shape[0])))
print('Number of Columns (Features): %s' % str((network_data.shape[1])))

We have a total of **1 million+** samples and **80** features in data.

In [None]:
network_data.head(4)

In [None]:
# check the columns in data
network_data.columns

In [None]:
# check the number of columns
print('Total columns in our data: %s' % str(len(network_data.columns)))

The dataset is huge. We have a total of **80** columns in our data.

In [None]:
network_data.info()

The output above tells us that:
* We have a huge amount of data, containing **1 million+** entries (samples)
* There are a total of **80** columns belonging to each sample
* There are missing values in our data, which need to be filled or dropped for proper modelling
* The memory consumption of data is **700 MB**

In [None]:
# check the number of values for labels
network_data['Label'].value_counts()

Most of the samples in our data are benign, as the output of the cell above shows.

## Data Visualizations
After getting some useful information about our data, we now visualize it to see what trends it follows. The visuals include bar plots, distribution plots, scatter plots, etc.

In [None]:
# plot the number of samples per label
sns.set_theme(rc={'figure.figsize': (12, 6)})
ax = sns.countplot(x='Label', data=network_data)
# the counts include benign traffic, so the y-axis is the number of samples
ax.set(xlabel='Attack Type', ylabel='Number of Samples')
plt.show()

In [None]:
# Get the column names of the dataframe
network_data.columns

# Identify binary columns, i.e. columns with at most 2 unique values
# (use the full network_data, not the leftover loop variable df, which holds only the last file)
categorical_data = [col for col in network_data.columns if len(pd.unique(network_data[col])) <= 2]


In [None]:
# Plot histograms of all binary categorical data columns
network_data[categorical_data].hist(figsize=(25, 25))


In [None]:
# Initialize Plotly in offline mode (plotly.express and plotly.offline are imported at the top)
pyo.init_notebook_mode(connected=True)

# Create scatter plot
fig = px.scatter(
    network_data,
    x="Bwd Pkts/s",
    y="Fwd Seg Size Min",
    title="Scatter Plot of Network Data",
    labels={"Bwd Pkts/s": "Bwd Pkts/s", "Fwd Seg Size Min": "Fwd Seg Size Min"}
)

# Show plot
fig.show()


In [None]:
%%time
sns.set(rc={'figure.figsize':(12, 6)})
# plot only the first 50,000 samples so that x, y and hue all come from the same rows
subset = network_data.iloc[:50000]
sns.scatterplot(x='Bwd Pkts/s', y='Fwd Seg Size Min', hue='Label', data=subset)

From the graphs, we learn that:
* Most of the samples are benign (almost 700k)
* **FTP-BruteForce** and **SSH-BruteForce** attacks are fewer in number (less than 200k)
* The classes are imbalanced, with benign traffic dominating the data

In [None]:
# check the dtype of timestamp column
network_data['Timestamp'].dtype

# Data Preprocessing
Data preprocessing plays an important part in data science, since data may not be fully clean and can contain missing or null values. In this step, we apply some preprocessing steps to handle any null or missing values in our data.

In [None]:
# check for some null or missing values in our dataset
network_data.isna().sum().to_numpy()

All features in the data have no null or missing values, except one feature that contains **2277** missing values. We drop the rows that contain these missing values so that our data is clean.

In [None]:
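# drop rows that contain null or missing values
# (.copy() gives an independent DataFrame, so later column assignments are safe)
cleaned_data = network_data.dropna().copy()
cleaned_data.isna().sum().to_numpy()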

After removing the rows with missing values, no feature contains any missing or null value. The data is now clean.
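The difference between dropping rows and dropping columns that contain missing values can be sketched on a small made-up DataFrame. The values below are illustrative only and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# toy frame (hypothetical values): one column has a single missing entry
toy = pd.DataFrame({
    "Flow Duration": [10.0, 20.0, 30.0],
    "Flow Byts/s": [1.5, np.nan, 3.5],
})

rows_dropped = toy.dropna()        # default axis=0: removes the row holding the NaN
cols_dropped = toy.dropna(axis=1)  # axis=1: removes the whole column instead

print(rows_dropped.shape)
print(cols_dropped.shape)
```

Dropping rows keeps every feature at the cost of a few samples, which is usually the cheaper choice when only a small fraction of rows is affected.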

### Label Encoding
The Label feature in the data contains 3 labels: **Benign**, **BruteForceFTP** and **BruteForceSSH**. All of them are strings. For our neural network, we need to convert them into numbers so that it can work with them.

In [None]:
# encode the column labels
label_encoder = LabelEncoder()
cleaned_data['Label']= label_encoder.fit_transform(cleaned_data['Label'])
cleaned_data['Label'].unique()

In [None]:
# check for encoded labels
cleaned_data['Label'].value_counts()
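To see which integer each label string receives, the encoder's `classes_` attribute can be inspected. The sketch below uses a short made-up label list rather than the real column, so the mapping it prints is only an illustration of how `LabelEncoder` assigns codes (in sorted order of the strings).

```python
from sklearn.preprocessing import LabelEncoder

# toy labels (illustrative only)
labels = ["SSH-BruteForce", "Benign", "FTP-BruteForce", "Benign"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original strings in sorted order; the position is the integer code
mapping = {name: code for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts the codes back to the original strings
print(encoder.inverse_transform(encoded))
```

Checking the mapping on the real data before filtering by `Label == 0`, `1` and `2` is a cheap way to confirm which class each code stands for.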

## Shaping the data for CNN
To apply a convolutional neural network to our data, we follow these steps:
* Separate the data for each of the labels
* Create a numerical matrix representation of labels
* Resample the data so that the distribution is equal for all labels
* Create X (predictor) and Y (target) variables
* Split the data into train and test sets
* Make data multi-dimensional for CNN
* Apply CNN on data

In [None]:
# make 3 separate datasets, one per encoded label
data_1 = cleaned_data[cleaned_data['Label'] == 0]
data_2 = cleaned_data[cleaned_data['Label'] == 1]
data_3 = cleaned_data[cleaned_data['Label'] == 2]

# target vector for the benign class (label 0)
y_1 = np.zeros(data_1.shape[0])
y_benign = pd.DataFrame(y_1)

# target vector for the bruteforce class (label 1)
y_2 = np.ones(data_2.shape[0])
y_bf = pd.DataFrame(y_2)

# target vector for the bruteforce SSH class (label 2)
y_3 = np.full(data_3.shape[0], 2)
y_ssh = pd.DataFrame(y_3)

# merge the per-label pieces back into single X and y frames
X = pd.concat([data_1, data_2, data_3], sort=True)
y = pd.concat([y_benign, y_bf, y_ssh], sort=True)

In [None]:
y_1, y_2, y_3

In [None]:
print(X.shape)
print(y.shape)

In [None]:
# checking if there are some null values in data
X.isnull().sum().to_numpy()

The output of the cell above shows that there are no null values in our data, so it can now be used for model fitting. We have two types of datasets, normal and abnormal, and they'll be used for model fitting.

## Data Augmentation
To avoid bias in the data, we resample each class to the same number of samples so that the label distribution is balanced.

In [None]:
from sklearn.utils import resample

data_1_resample = resample(data_1, n_samples=20000,
                           random_state=123, replace=True)
data_2_resample = resample(data_2, n_samples=20000,
                           random_state=123, replace=True)
data_3_resample = resample(data_3, n_samples=20000,
                           random_state=123, replace=True)
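The effect of resampling every class to the same size can be sketched on a tiny made-up frame. The column name `feature` and the class sizes are hypothetical and chosen only to keep the example small.

```python
import pandas as pd
from sklearn.utils import resample

# toy imbalanced data (hypothetical): 6 samples of class 0, 2 samples of class 1
toy = pd.DataFrame({"feature": range(8), "Label": [0] * 6 + [1] * 2})

balanced_parts = []
for label, group in toy.groupby("Label"):
    # draw the same number of rows from every class, with replacement
    balanced_parts.append(
        resample(group, n_samples=4, replace=True, random_state=123)
    )

balanced = pd.concat(balanced_parts)
print(balanced["Label"].value_counts().to_dict())
```

Because sampling is done with replacement, a class with fewer rows than `n_samples` is filled up with repeated rows, while a larger class is simply subsampled.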

In [None]:
train_dataset = pd.concat([data_1_resample, data_2_resample, data_3_resample])
train_dataset.head(2)

In [None]:
# viewing the distribution of intrusion attacks in our dataset
plt.figure(figsize=(10, 8))
circle = plt.Circle((0, 0), 0.7, color='white')
plt.title('Intrusion Attack Type Distribution')
plt.pie(train_dataset['Label'].value_counts(), labels=['Benign', 'BF', 'BF-SSH'], colors=['blue', 'magenta', 'cyan'])
p = plt.gcf()
p.gca().add_artist(circle)

## Making X & Y Variables (CNN)

In [None]:
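# hold out 20% of the balanced data as a test set, stratified by label,
# so that test rows are not drawn from the rows used for training
train_dataset, test_dataset = train_test_split(
    train_dataset, test_size=0.2, random_state=123, stratify=train_dataset['Label'])
target_train = train_dataset['Label']
target_test = test_dataset['Label']
target_train.unique(), target_test.unique()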

In [None]:
y_train = to_categorical(target_train, num_classes=3)
y_test = to_categorical(target_test, num_classes=3)
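For intuition, the one-hot encoding produced above can be reproduced with plain NumPy. This is only an illustrative equivalent on made-up labels, not a replacement for `to_categorical`.

```python
import numpy as np

labels = np.array([0, 2, 1, 0])  # toy integer labels
num_classes = 3

# row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes)[labels]
print(one_hot)

# argmax along each row recovers the original integer labels
recovered = one_hot.argmax(axis=1)
print(recovered)
```

The same `argmax` step is what turns the softmax outputs of the network back into class indices at prediction time.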

## Data Splitting
This stage splits the data into train and test sets. The training data is used to train our model, and the test data is used to check the model's performance on unseen data. We're using an **80-20** split, i.e., **80%** of the data for training and **20%** for testing.

In [None]:
# drop the unused columns and the target from both sets
drop_cols = ["Timestamp", "Protocol", "PSH Flag Cnt", "Init Fwd Win Byts", "Flow Byts/s", "Flow Pkts/s", "Label"]
train_dataset = train_dataset.drop(columns=drop_cols)
test_dataset = test_dataset.drop(columns=drop_cols)

In [None]:
# convert to NumPy arrays; Label was already dropped above, so [:, :-1]
# also leaves out the last remaining feature column
X_train = train_dataset.iloc[:, :-1].values
X_test = test_dataset.iloc[:, :-1].values
X_test

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
# reshape the data for CNN
X_train = X_train.reshape(len(X_train), X_train.shape[1], 1)
X_test = X_test.reshape(len(X_test), X_test.shape[1], 1)
X_train.shape, X_test.shape



In [None]:
# build the 1D CNN (named build_cnn so the function is not overwritten by the model instance)
def build_cnn():
    model = Sequential()
    # input_shape is only needed on the first layer
    model.add(Conv1D(filters=64, kernel_size=6, activation='relu',
                    padding='same', input_shape=(72, 1)))
    model.add(BatchNormalization())

    # adding a pooling layer
    model.add(MaxPooling1D(pool_size=3, strides=2, padding='same'))

    model.add(Conv1D(filters=64, kernel_size=6, activation='relu',
                    padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling1D(pool_size=3, strides=2, padding='same'))

    model.add(Conv1D(filters=64, kernel_size=6, activation='relu',
                    padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling1D(pool_size=3, strides=2, padding='same'))

    model.add(Flatten())
    model.add(Dense(64, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(3, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [None]:
model = build_cnn()
model.summary()
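The output shapes reported by the summary can be checked by hand. The sketch below assumes the usual rule for `padding='same'`, where the output length is the input length divided by the stride, rounded up; the helper name `same_padding_length` is made up for this example.

```python
import math

def same_padding_length(length, stride):
    # with padding='same', output length = ceil(input length / stride)
    return math.ceil(length / stride)

length = 72   # number of input features per sample
filters = 64  # filters in every Conv1D layer

for block in range(3):
    length = same_padding_length(length, stride=1)  # Conv1D, default stride of 1
    length = same_padding_length(length, stride=2)  # MaxPooling1D with strides=2
    print(f"after block {block + 1}: ({length}, {filters})")

flattened = length * filters
print("flattened size:", flattened)
```

Each pooling layer halves the sequence length, so three blocks reduce 72 steps to 9 before the `Flatten` layer.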

In [None]:
logger = CSVLogger('logs.csv', append=True)
his = model.fit(X_train, y_train, epochs=20, batch_size=32,
          validation_data=(X_test, y_test), callbacks=[logger])



## Visualization of Results (CNN)
Let's make a graphical visualization of results obtained by applying CNN to our data.

In [None]:
# check the model performance on test data
scores = model.evaluate(X_test, y_test)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))

In [None]:
# check history of model
history = his.history
history.keys()

In [None]:
epochs = range(1, len(history['loss']) + 1)
acc = history['accuracy']
loss = history['loss']
val_acc = history['val_accuracy']
val_loss = history['val_loss']

# visualize training and val accuracy
plt.figure(figsize=(10, 5))
plt.title('Training and Validation Accuracy (CNN)')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.plot(epochs, acc, label='accuracy')
plt.plot(epochs, val_acc, label='val_acc')
plt.legend()

# visualize train and val loss
plt.figure(figsize=(10, 5))
plt.title('Training and Validation Loss(CNN)')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.plot(epochs, loss, label='loss', color='g')
plt.plot(epochs, val_loss, label='val_loss', color='r')
plt.legend()

# Conclusion after CNN Training
After training our deep CNN model on the training data and validating it on the validation data, we can conclude that:
* The model was trained for 50 epochs and then for 30 epochs
* The CNN performed exceptionally well on the training data, with an accuracy of **99%**
* Validation accuracy dropped to **83.55%** after **50** epochs, while **30** epochs gave a good accuracy of **92%**. Thus, of the two settings tried, **30** epochs works better for this model.
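Instead of comparing separate runs, the best epoch can also be read from the history of a single run. This is a minimal sketch that assumes a dict shaped like `his.history`; the numbers are made-up placeholders, not results from this notebook.

```python
# hypothetical history values, for illustration only
history_example = {
    "val_accuracy": [0.80, 0.88, 0.92, 0.90, 0.85],
    "val_loss": [0.60, 0.40, 0.30, 0.35, 0.50],
}

val_acc = history_example["val_accuracy"]

# index of the highest validation accuracy
best_index = max(range(len(val_acc)), key=val_acc.__getitem__)
best_epoch = best_index + 1  # epochs are counted from 1

print(f"best epoch: {best_epoch}, val_accuracy: {val_acc[best_index]:.2f}")
```

A validation accuracy that peaks and then falls while training accuracy keeps rising is the usual sign of overfitting.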

In [None]:
# SGD OPTIMIZATION

In [None]:
print("Columns in the dataset: ", network_data.columns)

# Identify binary categorical columns (at most 2 unique values) on the full dataset
categorical_data = [col for col in network_data.columns if len(pd.unique(network_data[col])) <= 2]
print("Binary categorical columns: ", categorical_data)

# Visualize the distribution of binary categorical columns
network_data[categorical_data].hist(figsize=(25, 25))
plt.suptitle('Distribution of Binary Categorical Columns', fontsize=16)
plt.show()

In [None]:
# Import SGD optimizer
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.utils import to_categorical

# Define a new model with SGD optimizer
def model_with_sgd():
    model = Sequential()
    model.add(Conv1D(filters=64, kernel_size=6, activation='relu',
                    padding='same', input_shape=(72, 1)))
    model.add(BatchNormalization())

    # Adding a pooling layer
    model.add(MaxPooling1D(pool_size=(3), strides=2, padding='same'))

    model.add(Conv1D(filters=64, kernel_size=6, activation='relu',
                    padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling1D(pool_size=(3), strides=2, padding='same'))

    model.add(Conv1D(filters=64, kernel_size=6, activation='relu',
                    padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling1D(pool_size=(3), strides=2, padding='same'))

    model.add(Flatten())
    model.add(Dense(64, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(3, activation='softmax'))

    # Compile the model with SGD optimizer
    sgd = SGD(learning_rate=0.01, momentum=0.9)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model

# Instantiate and summarize the model
model_sgd = model_with_sgd()
model_sgd.summary()

# Define a CSV logger for the SGD model
logger_sgd = CSVLogger('logs_sgd.csv', append=True)

# Labels were already one-hot encoded above as y_train / y_test; reuse them
# (calling to_categorical again on re-run would encode the one-hot arrays a second time)
target_train = y_train
target_test = y_test

# Check Shapes
print("X_train shape:", X_train.shape)
print("y_train shape:", target_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", target_test.shape)

# Train the model with SGD optimizer
his_sgd = model_sgd.fit(X_train, target_train, epochs=20, batch_size=32,
                        validation_data=(X_test, target_test), callbacks=[logger_sgd])

# Evaluate the model performance
scores_sgd = model_sgd.evaluate(X_test, target_test)
print("%s: %.2f%%" % (model_sgd.metrics_names[1], scores_sgd[1] * 100))

# Check the history of the SGD model
history_sgd = his_sgd.history
history_sgd.keys()

# Plot training and validation accuracy for SGD model
epochs_sgd = range(1, len(history_sgd['loss']) + 1)
acc_sgd = history_sgd['accuracy']
loss_sgd = history_sgd['loss']
val_acc_sgd = history_sgd['val_accuracy']
val_loss_sgd = history_sgd['val_loss']

plt.figure(figsize=(10, 5))
plt.title('Training and Validation Accuracy (SGD)')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.plot(epochs_sgd, acc_sgd, label='accuracy')
plt.plot(epochs_sgd, val_acc_sgd, label='val_acc')
plt.legend()

# Plot training and validation loss for SGD model
plt.figure(figsize=(10, 5))
plt.title('Training and Validation Loss (SGD)')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.plot(epochs_sgd, loss_sgd, label='loss', color='g')
plt.plot(epochs_sgd, val_loss_sgd, label='val_loss', color='r')
plt.legend()


In [None]:
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns

# Predict the classes for the test set
y_pred = model_sgd.predict(X_test)
y_pred_classes = y_pred.argmax(axis=-1)
y_true_classes = target_test.argmax(axis=-1)

# Confusion Matrix
conf_matrix = confusion_matrix(y_true_classes, y_pred_classes)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix (SGD)')
plt.ylabel('Actual Class')
plt.xlabel('Predicted Class')
plt.show()

# Classification Report
print('Classification Report (SGD):')
print(classification_report(y_true_classes, y_pred_classes))
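The precision and recall values in the classification report follow directly from the confusion matrix. The sketch below derives them from a made-up matrix (hypothetical counts, rows as actual classes and columns as predicted classes, matching the layout of `confusion_matrix`).

```python
import numpy as np

# toy confusion matrix (hypothetical counts): rows = actual, columns = predicted
conf = np.array([
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
])

true_positives = np.diag(conf)
precision = true_positives / conf.sum(axis=0)  # column sums = samples predicted as each class
recall = true_positives / conf.sum(axis=1)     # row sums = samples that actually belong to each class

for cls, (p, r) in enumerate(zip(precision, recall)):
    print(f"class {cls}: precision={p:.2f}, recall={r:.2f}")
```

For intrusion detection, recall on the attack classes is often the more important number, since a missed attack typically costs more than a false alarm.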


In [None]:
from sklearn.metrics import roc_curve, auc

# Plot ROC curve for each class
for i in range(3):
    fpr, tpr, _ = roc_curve(target_test[:, i], y_pred[:, i])
    roc_auc = auc(fpr, tpr)

    plt.plot(fpr, tpr, label=f'Class {i} (AUC = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
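Each curve above treats one class as the positive class and the other two as negative (one-vs-rest). How `roc_curve` and `auc` work on a single binary problem can be sketched with a few made-up scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# toy binary problem (illustrative values): 1 = positive class, 0 = rest
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve sweeps the decision threshold over the scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

print("FPR:", fpr)
print("TPR:", tpr)
print(f"AUC = {roc_auc:.2f}")
```

An AUC of 1.0 means every positive sample is scored above every negative one, while 0.5 corresponds to the dashed diagonal of a random classifier.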


In [None]:
from sklearn.metrics import precision_recall_curve

# Plot precision-recall curve for each class
y_scores = model_sgd.predict(X_test)  # predict once and reuse for every class
for i in range(3):  # Assuming 3 classes
    precision, recall, _ = precision_recall_curve(target_test[:, i], y_scores[:, i])

    plt.plot(recall, precision, label=f'Class {i}')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve (SGD)')
plt.legend(loc="lower left")
plt.show()


In [None]:
# Assuming `y_pred_classes` and `y_true_classes` from before
residuals = y_true_classes - y_pred_classes

plt.scatter(y_true_classes, residuals)
plt.title('Residual Plot (SGD)')
plt.xlabel('True Values')
plt.ylabel('Residuals')
plt.show()


In [None]:
# Plot the empirical CDF of the model's top predicted probabilities
import numpy as np

y_pred_proba = model_sgd.predict(X_test).max(axis=1)
y_pred_sorted = np.sort(y_pred_proba)
# empirical CDF: fraction of samples with probability <= each sorted value
cdf = np.arange(1, len(y_pred_sorted) + 1) / len(y_pred_sorted)

plt.plot(y_pred_sorted, cdf)
plt.title('Cumulative Distribution Function (CDF) of Predictions (SGD)')
plt.xlabel('Prediction Probability')
plt.ylabel('CDF')
plt.show()
