#  <span style="color:#0b186c;">Introduction to Intrusion Detection Systems (IDS)</span>

---

An Intrusion Detection System (IDS) is a system that **monitors** network traffic for suspicious activity and issues alerts when such activity is discovered. These systems can be applications that sit at specific network locations and can be tailored to search for known threats as well as abnormal behavior.

<br></br>
## <span style="color:#0b186c;">Table of Contents:</span>
* [Project Description](#first-bullet)
* [Dataset Information](#second-bullet)
* [Data Preprocessing](#third-bullet)
* [Initial Model Development](#fourth-bullet)
* [Dimensionality Reduction](#fifth-bullet)
* [Conclusion](#sixth-bullet)

#  <span style="color:#0b186c;">Project Description</span><a class="anchor" id="first-bullet"></a>

---

Intrusion Detection Systems can use different methods to detect suspicious activities, which can be broadly divided into:

- **Signature-based intrusion detection** – These systems compare incoming traffic against a pre-existing database of known attack patterns, called signatures.

- **Anomaly-based intrusion detection** – These systems were introduced to detect unknown attacks. They use statistics or machine learning to build a baseline model of regular network activity at different time intervals, then compare new behaviour against that model.
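
The contrast between the two approaches can be sketched in a few lines. This is a toy illustration only: the signature set, the payloads, the baseline counts, and the three-standard-deviation threshold are all hypothetical values chosen for the example, not part of any real IDS.

```python
import statistics

# Hypothetical signature database: byte patterns of known attacks (illustrative only)
SIGNATURES = {b"/etc/passwd", b"' OR 1=1"}

def signature_alert(payload: bytes) -> bool:
    """Flag a payload that contains any known attack signature."""
    return any(sig in payload for sig in SIGNATURES)

def anomaly_alert(history, new_value, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the baseline mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(new_value - mean) > threshold * stdev

# Toy baseline: connections per minute observed during normal operation
baseline = [48, 52, 50, 47, 53, 51, 49, 50]

sig_hit = signature_alert(b"GET /index.php?id=' OR 1=1 --")   # matches a known pattern
sig_miss = signature_alert(b"GET /index.html")                # no known pattern
anomaly_hit = anomaly_alert(baseline, 400)                    # far outside the baseline
anomaly_miss = anomaly_alert(baseline, 52)                    # within normal variation
print(sig_hit, sig_miss, anomaly_hit, anomaly_miss)
```

The signature check can only catch what is already in its database, while the anomaly check needs no prior knowledge of the attack but depends on a representative baseline.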

## <span style="color:#0b186c;">Required Imports:</span>

<div class="alert alert-warning">

<b>Note:</b> If you have not already installed these packages, use the cell below to run the required `pip` installs.

</div>

In [None]:
# In case you still need to perform some pip installs:
! pip install --user pandas -q
! pip install --user numpy -q
! pip install --user scikit-learn -q
! pip install --user matplotlib -q
! pip install --user seaborn -q

In [None]:
# Dataframe and array libraries
import pandas as pd
import numpy as np

# Libraries for visualizing data
import matplotlib.pyplot as plt
import seaborn as sns

# Required for performing standardization robust to outliers
from sklearn.preprocessing import RobustScaler

# Required for performing encoding on categorical input variables
from sklearn.preprocessing import OrdinalEncoder

# Required for performing encoding categorical input variables into new columns
from sklearn.preprocessing import OneHotEncoder

# Required for instantiating and running a Decision Tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Classification metrics and confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score, plot_confusion_matrix, ConfusionMatrixDisplay

# Required for performing dimensionality reduction
from sklearn.decomposition import PCA

# Filters out warning messages
import warnings
warnings.filterwarnings('ignore')

#  <span style="color:#0b186c;">Dataset Information</span><a class="anchor" id="second-bullet"></a>

---

Widely considered the *Hello World* of machine learning for intrusion detection, the KDD '99 dataset is used to build supervised machine learning models that distinguish normal network behavior from intrusions of various attack types. The dataset originated from the Third International Knowledge Discovery and Data Mining Tools Competition, held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector: a predictive model capable of distinguishing between bad connections, called intrusions or attacks, and good, normal connections. The dataset contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

NSL-KDD is a dataset proposed to solve some of the inherent problems of the KDD '99 dataset, chiefly its large number of redundant and duplicate records. Removing them reduces the bias of trained models toward the more frequent records. The dataset is maintained by the Canadian Institute for Cybersecurity:

https://www.unb.ca/cic/datasets/nsl.html

In [None]:
from google.colab import files
uploaded = files.upload()

Saving KDDTest.csv to KDDTest.csv
Saving KDDTrain.csv to KDDTrain.csv


In [None]:
# Column names for the dataset
cols = ['duration','protocol_type','service','flag','src_bytes','dst_bytes','land','wrong_fragment','urgent','hot',
        'num_failed_logins','logged_in','num_compromised','root_shell','su_attempted','num_root','num_file_creations',
        'num_shells','num_access_files','num_outbound_cmds','is_host_login','is_guest_login','count','srv_count',
        'serror_rate','srv_serror_rate','rerror_rate','srv_rerror_rate','same_srv_rate','diff_srv_rate',
        'srv_diff_host_rate','dst_host_count','dst_host_srv_count','dst_host_same_srv_rate','dst_host_diff_srv_rate',
        'dst_host_same_src_port_rate','dst_host_srv_diff_host_rate','dst_host_serror_rate','dst_host_srv_serror_rate',
        'dst_host_rerror_rate','dst_host_srv_rerror_rate','target','difficulty']

# Read in the training dataset
train_df = pd.read_csv("/content/KDDTrain.csv", names = cols)

# Read in the test dataset
test_df = pd.read_csv("/content/KDDTest.csv", names = cols)

# Display the training set (pandas truncates the output to the first and last rows)
print("Training Set:")
train_df

Training Set:


Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,target,difficulty
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal,20
1,0,udp,other,SF,146,0,0,0,0,0,...,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00,normal,15
2,0,tcp,private,S0,0,0,0,0,0,0,...,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00,neptune,19
3,0,tcp,http,SF,232,8153,0,0,0,0,...,1.00,0.00,0.03,0.04,0.03,0.01,0.00,0.01,normal,21
4,0,tcp,http,SF,199,420,0,0,0,0,...,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal,21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125968,0,tcp,private,S0,0,0,0,0,0,0,...,0.10,0.06,0.00,0.00,1.00,1.00,0.00,0.00,neptune,20
125969,8,udp,private,SF,105,145,0,0,0,0,...,0.96,0.01,0.01,0.00,0.00,0.00,0.00,0.00,normal,21
125970,0,tcp,smtp,SF,2231,384,0,0,0,0,...,0.12,0.06,0.00,0.00,0.72,0.00,0.01,0.00,normal,18
125971,0,tcp,klogin,S0,0,0,0,0,0,0,...,0.03,0.05,0.00,0.00,1.00,1.00,0.00,0.00,neptune,20


In [None]:
# Output the top 5 records of the test set
print("Test Set:")
test_df.head()

In [None]:
# Drop the difficulty feature from both datasets
train_df.drop(columns=['difficulty'], inplace=True)
test_df.drop(columns=['difficulty'], inplace=True)

# Output dataframe information for the training set
train_df.info()

In [None]:
# Output dataframe information for the test set
test_df.info()

#  <span style="color:#0b186c;">Data Preprocessing</span><a class="anchor" id="third-bullet"></a>

---

Data Preprocessing is the process of selecting and/or transforming existing features (columns) in the raw data to make the data compatible with the chosen machine learning model and to improve the model's predictive performance. Choosing the appropriate preprocessing techniques for a particular model requires knowledge of how the model interprets the different features, as well as domain expertise about the data itself. If preprocessing is done improperly, the resulting predictions can be inaccurate, or the model can misinterpret the dependencies between features.

Common Data Preprocessing techniques include:

- Imputation
- Log Transformations
- Encoding
- Feature Selection
- Scaling
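
As a rough sketch of what a few of these techniques look like in `pandas`, the example below applies imputation, a log transformation, and encoding to a tiny made-up table. The column names are borrowed from the dataset for readability, but the values are hypothetical and the choices (median imputation, `log1p`, integer category codes) are only one reasonable option among several.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from NSL-KDD)
toy = pd.DataFrame({
    "src_bytes": [0.0, 100.0, np.nan, 1_000_000.0],
    "flag": ["SF", "S0", "SF", "REJ"],
})

# Imputation: replace the missing value with the column median
median = toy["src_bytes"].median()
toy["src_bytes"] = toy["src_bytes"].fillna(median)

# Log transformation: log1p compresses the long right tail and is defined at zero
toy["src_bytes_log"] = np.log1p(toy["src_bytes"])

# Encoding: map each category to an integer code (categories are sorted alphabetically)
toy["flag_code"] = toy["flag"].astype("category").cat.codes

print(toy)
```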


## <span style="color:#0b186c;">Numerical Features</span>

In [None]:
# Select the columns in the training set with numerical dtypes
num_cols = train_df.select_dtypes(include=['int64', 'float64']).columns

# Create a 4x5 figure to plot the first 20 box plots
fig, axes = plt.subplots(4, 5, figsize=(20, 20))
fig.suptitle('Training Set', fontsize=20)
for ax, col in zip(axes.ravel(), num_cols[0:20]):
    # Plot the feature values themselves (value_counts() would plot category frequencies instead)
    train_df[col].plot(ax=ax, kind='box', fontsize=10)
    ax.set_title(str(col), fontsize=12)
plt.show()

In [None]:
# Create a 3x6 figure to plot the remaining 18 box plots
fig, axes = plt.subplots(3, 6, figsize=(20, 20))
fig.suptitle('Training Set', fontsize=20)
for ax, col in zip(axes.ravel(), num_cols[20:]):
    # Plot the feature values themselves (value_counts() would plot category frequencies instead)
    train_df[col].plot(ax=ax, kind='box', fontsize=10)
    ax.set_title(str(col), fontsize=12)
plt.show()

In [None]:
# Select the columns in the test set with numerical dtypes
num_cols = test_df.select_dtypes(include=['int64', 'float64']).columns

# Create a 4x5 figure to plot the first 20 box plots
fig, axes = plt.subplots(4, 5, figsize=(20, 20))
fig.suptitle('Test Set', fontsize=20)
for ax, col in zip(axes.ravel(), num_cols[0:20]):
    # Plot the feature values themselves (value_counts() would plot category frequencies instead)
    test_df[col].plot(ax=ax, kind='box', fontsize=10)
    ax.set_title(str(col), fontsize=12)
plt.show()

In [None]:
# Create a 3x6 figure to plot the remaining 18 box plots
fig, axes = plt.subplots(3, 6, figsize=(20, 20))
fig.suptitle('Test Set', fontsize=20)
for ax, col in zip(axes.ravel(), num_cols[20:]):
    # Plot the feature values themselves (value_counts() would plot category frequencies instead)
    test_df[col].plot(ax=ax, kind='box', fontsize=10)
    ax.set_title(str(col), fontsize=12)
plt.show()

In [None]:
# Instantiate the scaler
scaler = RobustScaler()

# Fit and transform the numerical columns in the training set
train_df[num_cols] = scaler.fit_transform(train_df[num_cols])

# Review the changes made to the training set
train_df

In [None]:
# Transform the numerical test columns in the test set
test_df[num_cols] = scaler.transform(test_df[num_cols])

# Review the changes made to the test set
test_df
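
To make the scaling step less of a black box, the sketch below reproduces what `RobustScaler` does with its default settings: it subtracts the median and divides by the interquartile range (75th minus 25th percentile). The five-value array is a made-up example with one extreme outlier; it is not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy feature with one extreme outlier (hypothetical values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# Manual robust scaling: center on the median, scale by the interquartile range
median = np.median(x)
q25, q75 = np.percentile(x, [25, 75])
manual = (x - median) / (q75 - q25)

# The same transformation using scikit-learn's default RobustScaler
scaled = RobustScaler().fit_transform(x)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

Because the median and interquartile range barely move when an outlier is added, the bulk of the values keep a comparable scale, whereas mean and standard deviation based scaling would be dominated by the outlier.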

## <span style="color:#0b186c;">Categorical Features</span>

In [None]:
# Select the columns in the training set with dtype of object
obj_cols = train_df.select_dtypes(include='object').columns

# Create a 2x2 figure to plot pie charts of the categorical distributions
fig, axes = plt.subplots(2, 2)
fig.suptitle('Training Set', fontsize=20)
for ax, col in zip(axes.ravel(), obj_cols):
    train_df[col].value_counts().plot(ax=ax, kind='pie', figsize=(15, 15), fontsize=10, autopct='%1.0f%%')
    ax.set_title(str(col), fontsize = 12)
plt.show()


In [None]:
# Select the columns in the test set with dtype of object
obj_cols = test_df.select_dtypes(include='object').columns

# Create a 2x2 figure to plot pie charts of the categorical distributions
fig, axes = plt.subplots(2, 2)
fig.suptitle('Test Set', fontsize=20)
for ax, col in zip(axes.ravel(), obj_cols):
    test_df[col].value_counts().plot(ax=ax, kind='pie', figsize=(15, 15), fontsize=10, autopct='%1.0f%%')
    ax.set_title(str(col), fontsize = 12)
plt.show()

In [None]:
# Instantiate the ordinal encoder
oe = OrdinalEncoder(dtype=int, handle_unknown = 'use_encoded_value', unknown_value = 999)

# Fit the encoder on the flag feature  
oe.fit(train_df[['flag']])

# View the identified categories in the fitted feature
oe.categories_

In [None]:
# Transform the identified categories into numerical representations
train_df['flag'] = oe.transform(train_df[['flag']])

# Review the changes made to the flag feature in the training set
train_df

In [None]:
# Transform the identified categories into numerical representations
test_df['flag'] = oe.transform(test_df[['flag']])

# Review the changes made to the flag feature in the test set
test_df

In [None]:
# Fit on the service feature and transform in place
train_df['service'] = oe.fit_transform(train_df[['service']])

# Verify changes made to the training set
train_df

In [None]:
# Transform the categories identified in training for the test set
test_df['service'] = oe.transform(test_df[['service']])

# Verify changes made to the test set
test_df
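
The `handle_unknown` and `unknown_value` arguments matter because the test set can contain categories that never appear in training. The minimal sketch below, using made-up service names, shows how a category unseen during `fit` is mapped to the sentinel value instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical category values for illustration
train_services = np.array([["http"], ["smtp"], ["ftp"]], dtype=object)
test_services = np.array([["smtp"], ["gopher"]], dtype=object)  # "gopher" is unseen

enc = OrdinalEncoder(dtype=int, handle_unknown="use_encoded_value", unknown_value=999)
enc.fit(train_services)

# Categories are sorted alphabetically: ftp -> 0, http -> 1, smtp -> 2
codes = enc.transform(test_services)
print(enc.categories_)
print(codes.ravel())
```

Note that ordinal codes impose an artificial ordering on the categories. Tree-based models tolerate this reasonably well, but it is an assumption worth revisiting for other model types.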

In [None]:
# Instantiate the one hot encoder
ohe = OneHotEncoder(sparse=False, dtype=int)

# Fit the encoder on the appropriate column
ohe.fit(train_df[['protocol_type']])

# Select the categories identified by the encoder
col = ohe.categories_

# Run transform to one hot encode the column, adding the new columns to the dataframe
train_df[col[0]] = ohe.transform(train_df[['protocol_type']])

# Drop the original column, not needed anymore
train_df.drop(columns=['protocol_type'], inplace=True)

# Review the changes made to the training set
train_df

In [None]:
# Run transform to one hot encode the column, adding the new columns to the dataframe
test_df[col[0]] = ohe.transform(test_df[['protocol_type']])

# Drop the original column, not needed anymore
test_df.drop(columns=['protocol_type'], inplace=True)

# Review the changes made to the test set
test_df

In [None]:
# Assign 0 for normal and 1 for intrusion in the target variable in both sets
train_df.loc[(train_df.target == 'normal'), 'target'] = 0
train_df.loc[(train_df.target != 0), 'target'] = 1

test_df.loc[(test_df.target == 'normal'), 'target'] = 0
test_df.loc[(test_df.target != 0), 'target'] = 1

# Visualize the altered target variable distributions
fig, axes = plt.subplots(1, 2)
c1 = ['#1f77b4', '#d62728']
c2 = ['#d62728', '#1f77b4']
train_df.target.value_counts().plot(ax=axes[0], kind='pie', colors = c1, 
                                    figsize=(15, 15), fontsize=10, autopct='%1.0f%%')
test_df.target.value_counts().plot(ax=axes[1], kind='pie', colors = c2,
                                   figsize=(15, 15), fontsize=10, autopct='%1.0f%%')

#  <span style="color:#0b186c;">Initial Model Development</span><a class="anchor" id="fourth-bullet"></a>

---

Since the dataset contains a labeled target variable, which we have scoped for binary classification of intrusions and normal network activity, we can leverage classification models for predictions on future data. As discussed previously, Decision Trees are one of the most popular types of classification algorithms due to their flexibility and overall performance. 

First, we will split the data into independent variables (X) and the dependent variable (y) for both the training and test sets. Since the data has already been partitioned for us, we will not need to use the `train_test_split()` function.

In [None]:
# Separate X and y for the training set
y_train = train_df.pop('target').astype('int')
X_train = train_df

X_train

In [None]:
# Separate X and y for the test set
y_test = test_df.pop('target').astype('int')
X_test = test_df

X_test

The Decision Tree model can be loaded directly from `scikit-learn`.


https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [None]:
# Instantiate the classifier
classifier = DecisionTreeClassifier()

# Fit the model on the training data
classifier.fit(X_train, y_train)

# Plot the top levels of the fitted tree (feature names are passed as a plain list)
fig = plt.figure(figsize=(25, 20))
_ = tree.plot_tree(classifier,
                   max_depth=2,
                   feature_names=list(X_train.columns),
                   class_names=['Normal', 'Intrusion'],
                   filled=True)

In [None]:
# Make predictions based on the X values in the test set
y_pred = classifier.predict(X_test)

# Calculate the accuracy score of the test set
score = round((accuracy_score(y_test, y_pred) * 100), 2)

# Set the default figure size through the rc parameters
plt.rcParams['figure.figsize'] = [10, 10]

# Plot the confusion matrix for the predictions
# (plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator is its replacement)
disp = ConfusionMatrixDisplay.from_estimator(classifier, X_test, y_test, cmap=plt.cm.Blues)
disp.ax_.set_title("Confusion Matrix")
plt.show()


print(f"Accuracy = {score}%")
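
Accuracy alone can hide the two error types that matter most for an IDS: missed intrusions and false alarms. As a hedged sketch, the example below derives a few additional metrics from a confusion matrix. The labels are a small hypothetical sample, not the notebook's actual predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = normal, 1 = intrusion
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # share of alerts that were real intrusions
recall = tp / (tp + fn)             # share of intrusions that were detected
false_alarm_rate = fp / (fp + tn)   # share of normal traffic flagged as intrusion

print(accuracy, precision, recall, false_alarm_rate)
```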

#  <span style="color:#0b186c;">Dimensionality Reduction</span><a class="anchor" id="fifth-bullet"></a>

---

The number of input variables or features in a dataset is referred to as its dimensionality. More input features often make a predictive modeling task more challenging: the model generally becomes harder to interpret, and redundant or low-variance features add computation without adding much information. Dimensionality reduction techniques are therefore popular for handling data with hundreds, or even thousands, of features. One such technique, Principal Component Analysis (PCA), transforms the data to a new basis whose dimensions are uncorrelated (non-redundant) and ordered by decreasing variance, so that the first few dimensions retain most of the variance. PCA is an Unsupervised Learning method and can be loaded directly from `scikit-learn`.

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
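
To illustrate what the change of basis means, the sketch below computes the principal component variances by hand, as eigenvalues of the covariance matrix, and compares them with `scikit-learn`'s result. The synthetic data (two strongly correlated features plus one independent feature) and the 95% variance threshold are assumptions made for this example only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: two strongly correlated features and one independent feature
a = rng.normal(size=500)
X_demo = np.column_stack([
    a,
    2 * a + rng.normal(scale=0.1, size=500),
    rng.normal(scale=0.3, size=500),
])

# Manual PCA: eigen-decomposition of the covariance matrix of the centered data
X_centered = X_demo - X_demo.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals = np.sort(eigvals)[::-1]  # largest variance first

# scikit-learn PCA on the same data
pca_demo = PCA().fit(X_demo)

# Smallest number of components whose cumulative variance ratio reaches 95%
cum_ratio = np.cumsum(pca_demo.explained_variance_ratio_)
n_keep = int(np.searchsorted(cum_ratio, 0.95) + 1)

print(eigvals)
print(pca_demo.explained_variance_)
print(cum_ratio, n_keep)
```

A cumulative variance threshold like this is one common heuristic for picking `n_components`; reading the elbow of a variance bar chart is another.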

In [None]:
# Review the training dataframe; the target label should no longer be present
train_df

In [None]:
# Instantiate the PCA
pca = PCA()

# Fit pca on the training data
pca.fit(train_df)

# Plot the principal components against their explained variance
features = range(pca.n_components_)
_ = plt.figure(figsize=(15, 5))
_ = plt.bar(features, pca.explained_variance_)
_ = plt.xlabel('Principal Component')
_ = plt.ylabel('Variance')
_ = plt.xticks(features)
_ = plt.title("Importance of the Principal Components Based on Explained Variance")
plt.show()

In [None]:
# Instantiate PCA with 3 components, the number chosen from the variance plot above
pca = PCA(n_components=3)

# Since we reduced the number of features, we are going to use new column names
pc_columns = ['pc_%i' % i for i in range(3)]

# Transform the training data into reduced dimensions of 3 principal components
pca_train = pd.DataFrame(pca.fit_transform(train_df), columns = pc_columns, index = train_df.index)

# View the transformed dataframe
pca_train

In [None]:
# Transform the test data into reduced dimensions of 3 principal components
pca_test = pd.DataFrame(pca.transform(test_df), columns = pc_columns, index = test_df.index)

# View the transformed dataframe
pca_test

In [None]:
# Fit the model on the training data
classifier.fit(pca_train, y_train)

# Plot the top levels of the fitted tree (feature names are passed as a plain list)
fig = plt.figure(figsize=(25, 20))
_ = tree.plot_tree(classifier,
                   max_depth=2,
                   feature_names=list(pca_train.columns),
                   class_names=['Normal', 'Intrusion'],
                   filled=True)

In [None]:
# Make predictions based on the X values in the test set
y_pred = classifier.predict(pca_test)

# Calculate the accuracy score of the test set
score = round((accuracy_score(y_test, y_pred) * 100), 2)

# Set the default figure size through the rc parameters
plt.rcParams['figure.figsize'] = [10, 10]

# Plot the confusion matrix for the predictions
# (plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator is its replacement)
disp = ConfusionMatrixDisplay.from_estimator(classifier, pca_test, y_test, cmap=plt.cm.Blues)
disp.ax_.set_title("Confusion Matrix")
plt.show()


print(f"Accuracy = {score}%")

#  <span style="color:#0b186c;">Conclusion</span><a class="anchor" id="sixth-bullet"></a>

---
