Author: Nik Alleyne <br>
Author Blog: https://www.securitynik.com <br>
Author GitHub: github.com/securitynik <br>
Author Books: [  <br>
                "https://www.amazon.ca/Learning-Practicing-Leveraging-Practical-Detection/dp/1731254458/",  <br>
                "https://www.amazon.ca/Learning-Practicing-Mastering-Network-Forensics/dp/1775383024/" <br>
            ] <br>


## 17. Beginning Deep Learning, - Classification, PyTorch

This post is part of my beginning machine learning series.  <br>
The series includes the following: <br>

01 - Beginning Numpy <br>
02 - Beginning Tensorflow  <br>
03 - Beginning PyTorch <br>
04 - Beginning Pandas <br>
05 - Beginning Matplotlib <br>
06 - Beginning Data Scaling <br>
07 - Beginning Principal Component Analysis (PCA) <br>
08 - Beginning Machine Learning Anomaly Detection - Isolation Forest and Local Outlier Factor <br>
09 - Beginning Unsupervised Machine Learning - Clustering - K-means and DBSCAN <br>
10 - Beginning Supervised Learning - Machine Learning - Logistic Regression, Decision Trees and Metrics <br>
11 - Beginning Linear Regression - Machine Learning <br>
12 - Beginning Deep Learning - Anomaly Detection with AutoEncoders, Tensorflow <br>
13 - Beginning Deep Learning - Anomaly Detection with AutoEncoders, PyTorch <br>
14 - Beginning Deep Learning, - Linear Regression, Tensorflow <br>
15 - Beginning Deep Learning, - Linear Regression, PyTorch <br>
16 - Beginning Deep Learning, - Classification, Tensorflow <br>
17 - Beginning Deep Learning, - Classification, PyTorch <br>
18 - Beginning Deep Learning, - Classification - regression - MIMO - Functional API Tensorflow <br> 
19 - Beginning Deep Learning, - Convolution Networks - Tensorflow <br>
20 - Beginning Deep Learning, - Convolution Networks - PyTorch <br>
21 - Beginning Regularization - Early Stopping, Dropout, L2 (Ridge), L1 (Lasso) <br>
23 - Beginning Model TFServing <br>

But conn.log is not the only log file Zeek produces. Let's build some models for the DNS and HTTP logs. <br>
I chose unsupervised learning because these data do not come with labels. <br>

24 - Continuing Anomaly Learning - Zeek DNS Log - Machine Learning <br>
25 - Continuing Unsupervised Learning - Zeek HTTP Log - Machine Learning <br>

This was a specific ask by someone in one of my classes. <br>
26 - Beginning - Reading Executables and Building a Neural Network to make predictions on suspicious vs suspicious  <br><br>

With 26 notebooks in this series, it is quite possible there are things I could have or should have done differently.  <br>
If you find anything you think fits those criteria, drop me a line. <br>

In [68]:
# import some libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# In the earlier notebooks on Pandas, Matplotlib and Scaling, starting with
#   04 - Beginning Pandas
# we loaded our dataset as follows
df_conn = pd.read_csv(r'df_conn_with_labels.csv', index_col=0)
df_conn

This file represents the Zeek (formerly Bro) connection log, `conn.log`. 
Zeek is a framework used for Network Security Monitoring. 
This entire series is based on using Zeek's data. 
The majority of the notebooks use the conn.log.
You can learn more about Zeek here:
   
    https://zeek.org/

Alternatively, come hang out with us in the:
SANS SEC595: Applied Data Science and Machine Learning for Cybersecurity Professionals

        https://www.sans.org/cyber-security-courses/applied-data-science-machine-learning/ OR

SANS SEC503: Network Monitoring and Threat Detection In-Depth

        https://www.sans.org/cyber-security-courses/network-monitoring-threat-detection/


Here are also some blog posts on using Zeek for security monitoring
Installing Zeek: 

        https://www.securitynik.com/2020/06/installing-zeek-314-on-ubuntu-2004.html

Detecting PowerShell Empire Usage: 

        https://www.securitynik.com/2022/02/powershell-empire-detection-with-zeek.html

Detecting Log4J Vulnerability Exploitation: 

        https://www.securitynik.com/2021/12/continuing-log4shell-zeek-detection.html

In [None]:
# Drop the port column
df_conn = df_conn.drop(columns=['id.resp_p'], inplace=False)
df_conn

In [None]:
# Looking at above, we see a number of records with 0s. 
# These will add no value to our learning process
# Let's find all those records and drop them
# Reference: https://stackoverflow.com/questions/13851535/how-to-delete-rows-from-a-pandas-dataframe-based-on-a-conditional-expression
df_conn = df_conn.drop(df_conn[(df_conn.duration == 0 ) & (df_conn.orig_bytes == 0 ) \
                               & (df_conn.resp_bytes == 0 ) & (df_conn.orig_pkts == 0 )  \
                                & (df_conn.orig_ip_bytes == 0 ) & (df_conn.resp_pkts == 0 ) \
                                    & (df_conn.resp_ip_bytes == 0 )].index)
df_conn

In [None]:
# The graph below shows this dataset is highly imbalanced.
# As a result, using measures like accuracy is more than likely not the best approach, 
# to understand how well our eventual model has "learned"
# via the training data
plt.title('Bar graph showing highly imbalanced dataset')
plt.bar(x=['normal', 'suspicious'], height=[ df_conn[df_conn.label == 0].shape[0], \
                                            df_conn[df_conn.label == 1].shape[0] ])
plt.ylabel(ylabel='Number of Records')
plt.xlabel(xlabel='Normal vs Suspicious')
plt.xticks(rotation=45)
plt.show()


In [None]:
# Getting the percentage of samples that are considered suspicious in this dataset
# This is going to be quite a challenge for this learning algorithm
(df_conn[df_conn.label == 1].shape[0] / df_conn.shape[0]) * 100

In [None]:
# Extract the X_data
X_data = df_conn.drop(columns=['label'], inplace=False)
X_data

In [None]:
# Extract the labels
y_label = df_conn.label
y_label

In [None]:
# prepare to split the data into training and testing sets
from sklearn.model_selection import train_test_split

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_data.values, y_label, test_size=0.2, \
                                                    train_size=0.8, stratify=y_label, random_state=10)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
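
With so few suspicious records, the `stratify=y_label` argument matters: it keeps the class ratio roughly the same in the training and test sets. Below is a minimal pure-Python sketch of the idea. The helper `stratified_split` and the toy labels are hypothetical and are not how scikit-learn implements it internally.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=10):
    ''' Hypothetical helper: split indices so each class keeps the same test fraction '''
    rng = random.Random(seed)

    # Group the row indices by their class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then carve off the test portion
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy, made-up labels: 95 normal (0) and 5 suspicious (1)
toy_labels = [0] * 95 + [1] * 5
train_idx, test_idx = stratified_split(toy_labels)

test_suspicious = sum(toy_labels[i] for i in test_idx)
train_suspicious = sum(toy_labels[i] for i in train_idx)
print(len(train_idx), len(test_idx), train_suspicious, test_suspicious)
```

Without stratification, a random 20% sample of such a dataset could easily end up with no suspicious records at all in the test set.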

In [None]:
# With a statistical understanding of the normal and suspicious datasets, time to build the model
# Scaling was covered in 
#   06 - Beginning Data Scaling
# Scaling the data first
# import the scaler library
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Setup the scaler
min_max_scaler = MinMaxScaler(feature_range=(0,1))

# Fit on the training data
min_max_scaler.fit(X_train)

# Transform the train data
X_train = min_max_scaler.transform(X_train)
X_train

In [None]:
# Scale the test data
X_test = min_max_scaler.transform(X_test)
X_test
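
Notice the scaler was fit on the training data only and then reused on the test data. Here is a minimal sketch of what that means for a single feature, in plain Python. The two helper functions and the sample values are hypothetical. Returning 0.0 for a constant column is my own assumption for this sketch.

```python
def fit_min_max(column):
    ''' Learn the minimum and maximum from the training values only '''
    return min(column), max(column)

def transform_min_max(column, col_min, col_max):
    ''' Scale values with the previously learned minimum and maximum '''
    span = col_max - col_min
    # Assumption for this sketch: a constant column is mapped to 0.0
    return [(value - col_min) / span if span else 0.0 for value in column]

# Made-up durations, for illustration only
train_duration = [2.0, 4.0, 10.0]
test_duration = [1.0, 6.0, 12.0]

col_min, col_max = fit_min_max(train_duration)
scaled_train = transform_min_max(train_duration, col_min, col_max)
scaled_test = transform_min_max(test_duration, col_min, col_max)

print(scaled_train)  # [0.0, 0.25, 1.0]
print(scaled_test)   # [-0.125, 0.5, 1.25]
```

The test values can fall outside the 0 to 1 range. That is expected, as the test data must not influence what the scaler learned.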

In [None]:
# Import PCA to leverage dimensionality reduction
# PCA was covered in notebook
#   07 - Beginning Principal Component Analysis (PCA)
from sklearn.decomposition import PCA

In [None]:
# Setup PCA to use 3 principal Components
pca = PCA(n_components=3, random_state=10)
pca

In [None]:
# Fit on the X_train
pca.fit(X_train)

# transform the training data
X_train = pca.transform(X_train)
X_train

In [None]:
# Use the opportunity to PCA transform the X_test
X_test = pca.transform(X_test)
X_test

In [None]:
# Import torch
import torch

# Import the torchinfo to get summary information
import torchinfo

In [None]:
# Convert the train and test data from numpy arrays to tensors
X_train, X_test = torch.tensor(data=X_train, dtype=torch.float32), torch.tensor(data=X_test, dtype=torch.float32)

# Get a snapshot of the data
X_train[:5], X_test[:5]

In [None]:
# Convert the pandas Series to a numpy array
# Make the array 2 dimensions
# Reshape to have multiple rows and 1 column
y_train = np.array(y_train.values, ndmin=2, dtype=np.float32).reshape(-1, 1)
y_test = np.array(y_test.values, ndmin=2, dtype=np.float32).reshape(-1, 1)

# Get 5 samples from each
y_train[:5], y_test[:5]

In [None]:
# Convert the labels to torch tensor
y_train = torch.tensor(data=y_train, dtype=torch.float32)
y_test = torch.tensor(data=y_test, dtype=torch.float32)

y_train[:5], y_test[:5]

In [None]:
# Setup the model using the Sequential Class
torch_clf_model = torch.nn.Sequential(
    torch.nn.Linear(in_features=3, out_features=8),
    torch.nn.ReLU(),
    torch.nn.Linear(in_features=8, out_features=1),
    torch.nn.Sigmoid()
)
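
The last layer of the model is a Sigmoid. It squashes whatever number the final Linear layer produces into a value between 0 and 1, which we can read as the probability that a record is suspicious. A minimal sketch in plain Python, where `sigmoid` is my own helper and not the PyTorch implementation:

```python
import math

def sigmoid(z: float) -> float:
    ''' Map any real number to a value between 0 and 1 '''
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f'z={z:6.1f} \t sigmoid(z)={sigmoid(z):.5f}')
```

An input of exactly 0 gives 0.5, which is why 0.5 is the usual starting point for a decision threshold.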

In [None]:
# Get the summary of the model
torchinfo.summary(torch_clf_model)
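
The parameter count reported in the summary can be verified by hand. Each Linear layer has one weight per input/output pair plus one bias per output. A minimal sketch, where `linear_parameter_count` is a hypothetical helper:

```python
def linear_parameter_count(in_features: int, out_features: int) -> int:
    ''' Weights (in * out) plus one bias per output '''
    return in_features * out_features + out_features

hidden_layer = linear_parameter_count(in_features=3, out_features=8)
output_layer = linear_parameter_count(in_features=8, out_features=1)
total_parameters = hidden_layer + output_layer

print(hidden_layer, output_layer, total_parameters)  # 32 9 41
```

ReLU and Sigmoid have nothing to learn, so they add no parameters.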

In [None]:
# Prepare to visualize the model
# https://github.com/mert-kurttutan/torchview
from torchview import draw_graph

In [None]:
# Plot the model
# https://stackoverflow.com/questions/52468956/how-do-i-visualize-a-net-in-pytorch
model_graph = draw_graph(model=torch_clf_model, input_data=X_train, graph_name='torch_clf_model', \
                         expand_nested=True, save_graph=False,show_shapes=True, graph_dir='RL', \
                            roll=True, hide_inner_tensors=False, hide_module_functions=False)
model_graph.visual_graph

In [None]:
# Get a look at the initialized parameters - weights and bias
torch_clf_model.state_dict()

In [None]:
# Before training the model, let's see what ReLU does
# Setup some samples between -10 and 10, space them by 0.1
sample_numbers = np.arange(-10, 10, 0.1)
np.round(sample_numbers[95:106], 2)

In [None]:
# Plot the numbers that were created above
# we see there are 200 numbers between -10 and 10
plt.title('Range of values between -10 and +10')
plt.plot(sample_numbers)
plt.xlabel('Number of records')
plt.ylabel('Range of Values')

In [None]:
# With ReLU, anything less than or equal to 0 becomes 0 and anything above 0 is kept the same
# Setup a function to take care of this 
def my_relu(x: np.ndarray) -> list:
    ''' Computes ReLU for each value in x '''
    return [ 0 if i <= 0 else i for i in x ]

In [None]:
# Testing our function with 2 values
# One less than 0 and another greater than 0
# We see below when x is less than 0, the value returned is 0
my_relu(np.array([-10])), my_relu(np.array([10])), 

In [None]:
# Running ReLU against our sample_numbers
my_relu(sample_numbers)[96:106]

In [None]:
# Making this more visual by plotting the original numbers vs the numbers which RELU has been applied to
# As we can see, anything below 0 has now become 0
plt.plot(sample_numbers, lw=10, label='original values')
plt.plot(my_relu(sample_numbers), lw=2, linestyle='--', label='values after RELU activation')
plt.xlabel('Number of records')
plt.ylabel('Range of Values')
plt.legend()
plt.show()

In [None]:
# In the following notebooks:
#   13. Beginning Deep Learning - Anomaly Detection with AutoEncoders, PyTorch
#   15. Beginning Deep Learning, - Linear Regression, PyTorch
# the training was all done outside of a function. 
# Rather than rewriting the same code all the time, let's create a function
def torch_training(model=None, epochs=10, learning_rate=0.01, x_train=X_train, \
                   y_train=y_train, x_test=X_test, y_test=y_test):
    ''' Performs training of the model '''
    # Create two lists to save the training and validation loss respectively 
    training_loss, validation_loss = [], []

    # Setup the loss function
    clf_loss_fn = torch.nn.BCELoss()

    # Setup the optimizer
    clf_optimizer = torch.optim.Adam(params=model.parameters(), lr=learning_rate)
    
    for epoch in range(epochs):
        # Clear the gradients
        clf_optimizer.zero_grad()

        # Put the model in training mode
        model.train()

        # Make predictions on the training data
        train_preds = model(x_train)
    
        # Get the loss
        train_loss = clf_loss_fn(train_preds, y_train)
        # Store a detached copy so the list does not hold on to the computation graph
        training_loss.append(train_loss.detach())

        # Calculate the gradients
        train_loss.backward()

        # Update the weights and biases using the gradients
        clf_optimizer.step()

        # Evaluate the model at the same time
        model.eval()
        with torch.inference_mode():
            # Use the x_test argument rather than the global X_test
            val_preds = model(x_test)

            # Calculate the loss on the validation data
            val_loss = clf_loss_fn(val_preds, y_test)
            validation_loss.append(val_loss)

        if epoch % 50 == 0:
            print(f'Epoch: {epoch} \t training loss: {train_loss} \t validation loss {val_loss}')
    
    return model, training_loss, validation_loss
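
The loss function used above, `torch.nn.BCELoss`, is binary cross entropy. To get a feel for what it measures, here is a minimal sketch in plain Python. The helper `binary_cross_entropy` and the sample values are my own, and the `eps` clamp is an assumption made for this sketch to avoid `log(0)`. PyTorch handles that edge case in its own way.

```python
import math

def binary_cross_entropy(probabilities, targets, eps=1e-12):
    ''' Mean of -(y * log(p) + (1 - y) * log(1 - p)) over all samples '''
    total = 0.0
    for p, y in zip(probabilities, targets):
        # Assumption: clamp p so that log() never sees exactly 0
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(targets)

# Made-up predictions for one suspicious (1.0) and one normal (0.0) record
targets = [1.0, 0.0]
confident_right = binary_cross_entropy([0.9, 0.1], targets)
confident_wrong = binary_cross_entropy([0.1, 0.9], targets)

print(f'confident and right: {confident_right:.4f}')
print(f'confident and wrong: {confident_wrong:.4f}')
```

The loss is small when the predicted probability is close to the label and grows quickly when the model is confidently wrong.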

In [None]:
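# Set a random seed to make this repeatable
torch.manual_seed(seed=10)

# Call the function with the associated parameters
# torch_training() returns a tuple: (model, training_loss, validation_loss)
# The whole tuple is stored in torch_clf_model below, so from here on:
#   torch_clf_model[0] is the trained model
#   torch_clf_model[1] is the list of training losses
#   torch_clf_model[2] is the list of validation losses
#(model, train_loss, val_loss) = torch_training(model=torch_clf_model, epochs=300, learning_rate=0.01)
torch_clf_model = torch_training(model=torch_clf_model, epochs=300, learning_rate=0.01)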

In [None]:
# What has the training function returned?
torch_clf_model

In [None]:
# Get the model's learned parameters - weights and biases
torch_clf_model[0].state_dict()

In [None]:
# Plotting the training and validation loss values
plt.title('Training vs Validation Loss per Epoch')

# Trying to plot the list of losses by itself will more than likely not work,
# because each loss was stored as a tensor rather than a plain number
# Hence we needed to do "torch.tensor(...).detach().numpy()" first
plt.plot(torch.tensor(torch_clf_model[1]).detach().numpy(), lw=3, label='train_loss')
plt.plot(torch.tensor(torch_clf_model[2]).detach().numpy(), lw=3, label='val_loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
# That's a nice looking graph above there

In [None]:
# With the training loss still trending downwards, a few more epochs may make the model perform even better
# How well did our model learn?
# import some metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
# Looking at y_true
y_test

In [None]:
# Make predictions on the test set
# We see the values are continuous
with torch.inference_mode():
    test_preds = torch_clf_model[0](X_test)

test_preds

In [None]:
# If we try to feed this to some of our metrics functions, we will get a ValueError such as
# "Classification metrics can't handle a mix of binary and continuous targets"
# As a result, I round the predictions to 0 or 1 instead
np.round(test_preds)

In [None]:
# Let's save these predictions out to a variable
x_test_preds = np.round(test_preds)
x_test_preds
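
Rounding the predictions is the same as applying a decision threshold of 0.5. Here is a minimal sketch of that idea in plain Python, where `to_labels` and the sample probabilities are hypothetical:

```python
def to_labels(probabilities, threshold=0.5):
    ''' Anything strictly above the threshold is labelled 1 (suspicious), otherwise 0 (normal) '''
    return [1 if p > threshold else 0 for p in probabilities]

# Made-up model outputs, for illustration only
sample_probabilities = [0.02, 0.40, 0.50, 0.97]

print(to_labels(sample_probabilities))                 # [0, 0, 0, 1]
print(to_labels(sample_probabilities, threshold=0.3))  # [0, 1, 1, 1]
```

Writing the threshold out explicitly makes it easy to experiment. On a highly imbalanced dataset, lowering the threshold is one way to trade more false positives for fewer false negatives.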

In [None]:
# With that understanding above, let's grab the accuracy score
accuracy_score(y_true=y_test, y_pred=x_test_preds)

In [None]:
# Not a bad accuracy score
# import seaborn
# https://seaborn.pydata.org/generated/seaborn.heatmap.html
import seaborn as sns

In [None]:
# At first glance, the confusion matrix does not seem so bad
# We learned about metrics in notebook:
#   10. Beginning Supervised Learning - Machine Learning - Logistic Regression, Decision Trees and Metrics
sns.heatmap(confusion_matrix(y_true=y_test, y_pred=x_test_preds), annot=True, fmt='d')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

In [None]:
# The above should help us understand that accuracy is not the best measurement for imbalanced classification problems
# Overall, this model is terrible. Context is important!!
# This is a model I would not put in production for my security monitoring
# If I am able to ignore 41,509 records, that's a good thing. 
# However, I have 13 false negatives. 
# This model also did not pick up any true positives
# Obviously, no one wants these false negatives. Hopefully, we can catch those "threats" via threat hunting
# Looking at the classification report
print(classification_report(y_true=y_test, y_pred=x_test_preds))
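
To see why accuracy is misleading here, the main metrics can be worked out by hand from the counts mentioned in the comments above: 41,509 true negatives, 13 false negatives and no positives predicted at all. A minimal sketch, where `safe_divide` is a hypothetical helper that returns 0.0 when the denominator is 0:

```python
def safe_divide(numerator: float, denominator: float) -> float:
    ''' Avoid a ZeroDivisionError when the model predicts no positives '''
    return numerator / denominator if denominator else 0.0

# Counts taken from the confusion matrix discussed above
tn, fp, fn, tp = 41509, 0, 13, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = safe_divide(tp, tp + fp)
recall = safe_divide(tp, tp + fn)
f1 = safe_divide(2 * precision * recall, precision + recall)

print(f'accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```

The accuracy is above 99.9%, yet the recall for the suspicious class is 0. The model simply learned to call everything normal.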

In [None]:
# Confirming there are only 13 records labelled as suspicious in the test set
int((y_test == 1).sum())

In [None]:
# Let's define a sample with the feature values: 
# duration, orig_bytes, resp_bytes, orig_pkts, orig_ip_bytes, resp_pkts, resp_ip_bytes
new_sample = np.array([141., 356138566, 11037090, 60, 3026679, 33, 982584], dtype=float, ndmin=2)

# Preprocess the new samples as was done with the training data
new_sample = pca.transform(min_max_scaler.transform(new_sample))

# Convert the sample to a torch tensor
new_sample = torch.tensor(data= new_sample, dtype=torch.float32)
new_sample

In [None]:
# Make a prediction on the sample
# Remember, the previously unseen data has to go through the same transformation as the training data
# See:
#   06 - Beginning Data Scaling
#   07 - Beginning Principal Component Analysis (PCA)
torch_clf_model[0].eval()
new_pred = torch_clf_model[0](new_sample)
new_pred

In [None]:
# Import datetime to timestamp the alerts
from datetime import datetime

In [None]:
# Report a sample as suspicious if its predicted probability is greater than 0.5
f'{datetime.now()} - [!] ALERT ** SUSPICIOUS ACTIVITY ** Zeek conn.log' if new_pred.item() > 0.5  \
    else f'[**] {datetime.now()} - Normal Traffic'
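
Since the same alerting logic gets used more than once, it could be wrapped in a small function. This is only a sketch: `format_alert`, its `threshold` parameter and the added score in the message are my own assumptions.

```python
from datetime import datetime

def format_alert(probability: float, threshold: float = 0.5) -> str:
    ''' Hypothetical helper: build an alert line from a predicted probability '''
    timestamp = datetime.now().isoformat(timespec='seconds')
    if probability > threshold:
        return f'{timestamp} - [!] ALERT ** SUSPICIOUS ACTIVITY ** Zeek conn.log (score={probability:.3f})'
    return f'{timestamp} - [**] Normal Traffic (score={probability:.3f})'

# Made-up probabilities, for illustration only
print(format_alert(0.91))
print(format_alert(0.07))
```

With a PyTorch prediction holding a single value, the probability could be passed in as `new_pred.item()`.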

In [None]:
# Setup a suspicious sample
new_sample = np.array([5000., 356138566, 11037090, 6000, 3026679, 9999999, 982584], dtype=float, ndmin=2)

# Preprocess the new samples as was done with the training data
new_sample = pca.transform(min_max_scaler.transform(new_sample))

# Convert the sample to a torch tensor
new_sample = torch.tensor(data= new_sample, dtype=torch.float32)
new_sample

In [None]:
# Make a prediction on the sample
new_pred = torch_clf_model[0](new_sample)
new_pred

In [None]:
# Report a sample as suspicious if its predicted probability is greater than 0.5
f'{datetime.now()} - [!] ALERT ** SUSPICIOUS ACTIVITY ** Zeek conn.log' if new_pred.item() > 0.5  \
    else f'[**] {datetime.now()} - Normal Traffic'

In [None]:
# Interesting, in both instances the sample was classified as normal (negative). 
# Remember, this model failed to find any true or false positives. 
# Hence we should not trust what was done here
# Remember, the model also had false negatives. I would consider this last prediction a false negative.
# If you want to know why I think so, hit me up for my opinion.

In [None]:
# Import the os library
import os

In [None]:
# Create the location to save the model
# os.makedirs() returns None, so there is nothing useful to assign
PATH = './SAVED_MODELS/TORCH_classification/'
os.makedirs(name=PATH, exist_ok=True)

In [None]:
# Save the model
# Note: this saves the entire model object, not just its state_dict
torch.save(obj=torch_clf_model[0], f=os.path.join(PATH, 'torch_clf_model_saved_dict.pth'))

In [None]:
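# Load the saved model
# Because the whole model object was pickled, newer PyTorch releases (2.6 and later) 
# need weights_only=False to load it. Only do this with files you trust
loaded_clf_torch_model = torch.load(f=f'{PATH}torch_clf_model_saved_dict.pth', weights_only=False)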
loaded_clf_torch_model
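
An alternative to saving the whole model object is to save only its `state_dict` and load it into a freshly built model. Below is a minimal sketch of that approach, assuming the same architecture as above. The `build_model` helper, the temporary directory and the file name are hypothetical and used only to keep the example self-contained.

```python
import os
import tempfile
import torch

def build_model() -> torch.nn.Sequential:
    ''' Hypothetical helper: recreate the same architecture used in this notebook '''
    return torch.nn.Sequential(
        torch.nn.Linear(in_features=3, out_features=8),
        torch.nn.ReLU(),
        torch.nn.Linear(in_features=8, out_features=1),
        torch.nn.Sigmoid(),
    )

torch.manual_seed(10)
original_model = build_model()

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save only the learned parameters
    state_path = os.path.join(tmp_dir, 'torch_clf_state_dict.pth')
    torch.save(original_model.state_dict(), state_path)

    # Build a new model, then load the saved parameters into it
    restored_model = build_model()
    restored_model.load_state_dict(torch.load(state_path))

# Both models should now produce identical predictions
original_model.eval()
restored_model.eval()
sample = torch.zeros(1, 3)
with torch.inference_mode():
    same_output = torch.equal(original_model(sample), restored_model(sample))
print(same_output)
```

The trade-off is that the code defining the architecture has to be available when loading, but the saved file no longer depends on pickling the whole model object.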


In [None]:
# Make a prediction on the loaded model
loaded_clf_torch_model(new_sample)

In [None]:
# That's it! Moving on!!