<h1> Notebook 2: Introduction to Deep Learning and PyTorch </h1>

In [None]:
import torch
import torch.nn as nn
import pandas as pd
from pathlib import Path
from torch.utils.data import Dataset, DataLoader

<h2> Why deep learning? </h2>

The linear perceptron model described in yesterday's introduction to machine learning can learn a linear boundary for binary classification tasks. This is a powerful technique, and one that works well when the underlying optimal decision boundary is itself linear. However, many practical problems require decision boundaries that are nonlinear, and our linear perceptron model isn't expressive enough to capture such relationships.

Consider the following set of data:

<img src="../images/nonseparablelinear.png" />

We would like to separate the two colors, and clearly this cannot be done in a single dimension (a one-dimensional decision boundary would be a point, separating the axis into two regions).
To fix this problem, we can add additional (potentially nonlinear) features from which to construct a decision boundary. Consider the same dataset with the addition of x^2 as a feature:

<img src="../images/separablequadratic.png" />

With this additional piece of information, we are now able to construct a linear separator in the two-dimensional space containing the points. In this case, we fixed the problem by mapping our data to a higher-dimensional space, manually adding useful features to the data points. However, in many high-dimensional problems, such as image classification, manually selecting useful features is tedious. It requires domain-specific effort and expertise, and works against the goal of generalization across tasks. A natural desire is to learn these featurization or transformation functions as well, perhaps using a nonlinear function class that is capable of representing a wider variety of functions.

(The part above is taken from this note: https://inst.eecs.berkeley.edu/~cs188/fa23/assets/notes/cs188-fa23-note24.pdf)
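
To make the feature-mapping idea concrete, here is a minimal sketch in plain Python. The data points and the helper `separable_by_threshold` are made up for illustration and are not the data shown in the images. The sketch checks whether a single threshold can split the two classes, first on x and then on x^2.

```python
# Hypothetical 1-D data: label 1 when the point is far from the origin.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [1, 1, 0, 0, 0, 1, 1]


def separable_by_threshold(values, labels):
    """Return True if one threshold puts all class-1 values on one side
    and all class-0 values on the other."""
    ones = [v for v, y in zip(values, labels) if y == 1]
    zeros = [v for v, y in zip(values, labels) if y == 0]
    return max(zeros) < min(ones) or max(ones) < min(zeros)


# On the original feature x, no single point separates the classes.
print(separable_by_threshold(xs, labels))       # False

# On the added feature x^2, a single threshold (for example 2.5) works.
squared = [x ** 2 for x in xs]
print(separable_by_threshold(squared, labels))  # True
```

In the two-dimensional space (x, x^2), that threshold corresponds to a horizontal line, which is a linear separator.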

Now we will learn how to build a simple fully connected neural network, also known as a multi-layer perceptron. This model is used as a component in a wide variety of machine learning models. Some examples are the Transformer model, as well as image generation models (such as DALL-E). That means it is used to create stuff like this:

<img src="../images/banana.png" width=700 />

and this:

<img src="../images/gpt.jpg" width=700 />

<h2> Fully Connected Neural Network </h2>

Let's first explore the data we are working with. This is a dataset gathered from Kaggle that contains API call sequences that can potentially be used to classify patterns of online activity as normal or malware. For more information, see https://www.kaggle.com/datasets/ang3loliveira/malware-analysis-datasets-api-call-sequences. 

<br/>

We can use pandas to load the dataset and visualize it. It is now up to you how you wish to explore it! 

In [None]:
malware_data_path = Path.cwd().parent / "data" / "malware_detection" / "dynamic_api_call_sequence_per_malware_100_0_306.csv"
malware_df = pd.read_csv(malware_data_path).set_index("hash")
malware_df

In [None]:
# Explore away!

If we had a lot of time and energy, and liked suboptimal solutions, we could spend a lot of time analysing the data to look for features that can be used for a classic machine learning model. Instead, let us build a neural network that learns the features for us.

<h4> Preparing the data for our neural network: the PyTorch Dataset class </h4>

It is time to get the data in a format that our network can work with. We will use the PyTorch Dataset class, and define our own transformations to preprocess the input. Let's get started.

In [None]:
class MalwareDataset(Dataset):

    """
    This class inherits from the PyTorch Dataset class, so that it can be used with other convenient PyTorch
    tools that we will shortly look at. This means that we need to define the
    3 methods __init__, __len__, and __getitem__ that you can see below.
    """

    def __init__(
            self, 
            malware_df: pd.DataFrame,
            preprocess=True,
            to_torch=True
            ) -> None:
        """
        Initialize all the variables you need to return samples from the dataset.
        """
        super().__init__()

        self.malware_df = malware_df
        self.preprocess = preprocess
        self.to_torch = to_torch

        if self.preprocess:
            self.malware_df = MalwareDataset.preprocess_df(self.malware_df)
        

    def __len__(self):
        """
        Return the size of the dataset, that means the total number of samples.
        """

        ### Your code here

        pass

    def __getitem__(self, index) -> dict:
        """
        Return the sample at the specified index. In this notebook each sample is returned as a dictionary,
        which the DataLoader can batch key by key.
        """

        ### Your code here


        # Remember to convert both the sequence vector and label to tensors if self.to_torch==True
        call_seq_tensor = None
        label = None

        sample = {
            "call_seq": call_seq_tensor,
            "label": label
        }

        return sample

    @staticmethod
    def preprocess_df(df):
        """
        Return the preprocessed dataframe. You can preprocess it any way you want,
        but we recommend you look into normalizing and mean-centering (aka standardizing) the data
        (https://www.geeksforgeeks.org/how-to-standardize-data-in-a-pandas-dataframe/).
        """

        ### Your code here
        
        pass
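
As a point of comparison, here is a minimal sketch of the same three-method pattern on a tiny made-up table. The class `ToyDataset`, the column names `a`, `b`, and `label`, and the dictionary keys `x` and `y` are all hypothetical and do not match the malware dataset; the standardization shown is just one possible preprocessing choice.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    """Hypothetical dataset: numeric feature columns plus a 'label' column."""

    def __init__(self, df: pd.DataFrame) -> None:
        super().__init__()
        features = df.drop(columns=["label"]).astype("float32")
        # Standardize each column: subtract its mean, divide by its standard deviation.
        # (A constant column would have std 0 and would need special handling.)
        self.features = (features - features.mean()) / features.std()
        self.labels = df["label"]

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, index) -> dict:
        return {
            "x": torch.tensor(self.features.iloc[index].to_numpy(), dtype=torch.float32),
            "y": torch.tensor(self.labels.iloc[index], dtype=torch.float32),
        }


toy_df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0], "label": [0, 1, 0, 1]}
)
toy_dataset = ToyDataset(toy_df)
print(len(toy_dataset))
print(toy_dataset[0])
```

After standardization the two columns hold the same values, even though their original scales differ by a factor of ten.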

<h4> Preparing the data for batched processing: the PyTorch DataLoader class </h4>

One of the primary advantages of using a library like PyTorch is its built-in support for parallel computation. Training deep models is considerably sped up by processing many examples simultaneously, usually on a GPU. To achieve this, PyTorch processes data in batches. A batch is simply a collection of data samples put together. Instead of running one example at a time through the model and optimising the model on one example at a time, we do both in batches. The DataLoader class converts the Dataset into an iterable object of batches:

In [None]:
batch_size = 32

malware_dataset = MalwareDataset(malware_df=malware_df, preprocess=True, to_torch=True)
train_loader = DataLoader(malware_dataset, batch_size=batch_size)
batch = next(iter(train_loader))['call_seq']

print(f"Sample shape (before the dataloader) = {malware_dataset[0]['call_seq'].shape}")
print(f"Batch shape (after the dataloader) = {batch.shape}")
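
The batching behaviour can also be seen on a small made-up tensor, independent of the malware data. The sketch below assumes a toy dataset of 10 samples with 4 features each, wrapped in PyTorch's `TensorDataset`; the names `toy_features` and `toy_loader` are illustrative only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 10 samples with 4 features each.
toy_features = torch.arange(40, dtype=torch.float32).reshape(10, 4)
toy_loader = DataLoader(TensorDataset(toy_features), batch_size=4)

# Each batch stacks up to 4 samples along a new leading dimension.
batch_shapes = [tuple(batch_x.shape) for (batch_x,) in toy_loader]
print(batch_shapes)  # [(4, 4), (4, 4), (2, 4)]
```

The last batch is smaller because 10 is not divisible by 4. Passing `drop_last=True` to the DataLoader would discard it instead.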

<h4> Understanding PyTorch layers </h4>

Now that we have batched the input, we can explore the fundamental building blocks of deep neural networks, and how they change the data we pass in. A neural net is essentially a complicated function that consists of many, many nested functions. One such function is known as a layer. We build a neural net by passing the input through a layer, and then passing the output from this layer to the next one, and this process goes on until we have a final output. Let's explore some layers.

Let's first look at the raw input values of the first sample in the batch:

In [None]:
print(batch[0])

<h4> 1. Linear layer </h4>

This is just a linear transformation of the input. That means the input is transformed by this formula:

<img src="../images/linear.png" />

In [None]:
# Code it up yourself! Here is a link to the documentation: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
linear_layer = ... 
out = ...

print(f"Input shape = {batch.shape}")
print(f"Output shape = {out.shape}")
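
For reference, here is a minimal sketch on made-up sizes (5 samples, 3 input features, 2 output features) that compares `nn.Linear` with the same computation written by hand. It assumes the formula in the image is the one given in the PyTorch documentation, y = xW^T + b; the names `toy_batch` and `toy_linear` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_batch = torch.randn(5, 3)                          # 5 samples, 3 features
toy_linear = nn.Linear(in_features=3, out_features=2)  # weight has shape (2, 3), bias shape (2,)

layer_out = toy_linear(toy_batch)
manual_out = toy_batch @ toy_linear.weight.T + toy_linear.bias

print(layer_out.shape)                                 # torch.Size([5, 2])
print(torch.allclose(layer_out, manual_out, atol=1e-6))
```

Only the last dimension changes: the batch dimension is carried through untouched.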

<h4> 2. ReLU layer </h4>

This is the famous ReLU activation function. If the input is positive, the output equals the input; otherwise the output is 0.

<img src="../images/relu.png" />


In [None]:
# Code it up yourself! Here is a link to the documentation: https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html
relu_layer = ... 
out = ...

print(f"Input shape = {batch.shape}")
print(f"Output shape = {out.shape}")

Use the next cell to print the first sample in the batch before passing it into the ReLU layer, and after passing it into the ReLU layer. What do you see?

In [None]:
# your code here
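
As a sanity check on a hand-picked tensor (the values below are made up and unrelated to the dataset), ReLU can be applied like this:

```python
import torch
import torch.nn as nn

toy_input = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
toy_relu = nn.ReLU()
toy_relu_out = toy_relu(toy_input)

# Negative entries become 0; non-negative entries pass through unchanged.
print(toy_relu_out)
```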

<h4> 3. Sigmoid layer </h4>

The sigmoid function squeezes the input into the range (0, 1). This is convenient when we want to turn the input into probability estimates.

<img src="../images/sigmoid.png" />

In [None]:
# Code it up yourself! Here is a link to the documentation: https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html
sigmoid_layer = ... 
out = ...

print(f"Input shape = {batch.shape}")
print(f"Output shape = {out.shape}")

Use the next cell to print the first sample in the batch before passing it into the sigmoid layer, and after passing it into the sigmoid layer. What do you see?

In [None]:
# your code here
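
The same kind of check works for the sigmoid. The sketch below uses made-up inputs and compares `nn.Sigmoid` with the standard formula 1 / (1 + exp(-x)), which is assumed to be what the image above shows.

```python
import torch
import torch.nn as nn

toy_input = torch.tensor([-4.0, 0.0, 4.0])
toy_sigmoid = nn.Sigmoid()

probs = toy_sigmoid(toy_input)
manual = 1 / (1 + torch.exp(-toy_input))

print(probs)                          # small for -4, exactly 0.5 for 0, close to 1 for 4
print(torch.allclose(probs, manual))
```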

<h4> Building the model </h4>

We now know all we need to create a Fully Connected Neural Network (also known as a Multilayer Perceptron). We recommend you try 3-4 hidden layers, where each hidden layer is composed of a Linear and ReLU layer. The Sigmoid layer can be used to compute the final output. Here is a link for inspiration: https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html. 

 Let's code it!

In [None]:
class MLP(nn.Module):

    def __init__(self, in_features: int, out_features: int):
        super(MLP, self).__init__()

        # Initialize the network
        # Input: a batched sample, output: the predicted label (0 for no malware, 1 for malware)
        #                                  (there should be one prediction per sample in the batch)

        # The network should be composed of Linear, ReLU, and Sigmoid layers
        # We suggest looking into the Sequential module (https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html):
        self.network = ...

    def forward(self, x):

        # Your code here
        
        pass
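
To illustrate how layers are chained, here is a minimal sketch of `nn.Sequential` on made-up sizes. It is deliberately smaller than the architecture suggested above (a single hidden layer, 8 input features), and the names `toy_network` and `toy_batch` are hypothetical.

```python
import torch
import torch.nn as nn

# Made-up sizes: 8 input features, one hidden layer of 16 units, 1 output.
toy_network = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
    nn.Sigmoid(),
)

toy_batch = torch.randn(4, 8)      # a batch of 4 samples
toy_out = toy_network(toy_batch)   # each layer's output feeds the next layer
print(toy_out.shape)               # torch.Size([4, 1])
```

Calling the Sequential module applies its layers in order, so a `forward` method that wraps it only needs to pass the input through.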

Let's test if the model behaves as expected:

In [None]:
in_features = ...
out_features = ...

model = MLP(in_features, out_features)
out = model(batch)

print(f"Output shape = {out.shape}")
print(f"The predicted probability of the first sample being malware = {out[0].item()}")

<h2> Training and evaluating the model </h2>

We now have a model. What's left is to train it, and check how well it performs.

Training a neural network consists of the following steps:

- Select your hyperparameters. These are parameters that dictate how to train the model, separate from the model parameters that we optimise. The parameters we have included below are the most important ones:

    1.  Number of epochs: An epoch is one iteration through the entire training dataset. That means that the model runs through every example from the dataset once, and optimises the parameters based on each example once. One time through the data is usually not enough, and the model needs to see the same examples multiple times to learn all the necessary patterns.

    2. Learning rate: This is the gradient descent step size, i.e., how much you wish to change the parameters each time the model learns something new. If you set the value too big, training may not converge, as the parameters can "jump over" the minimum. If you set it too small, convergence may happen too slowly. You can look into learning rate schedulers that adapt the learning rate during training (but it's probably not necessary for this notebook).

    3. Batch size: The size of the batches into which the DataLoader will divide the input (see the DataLoader discussion above). This is the number of examples that the network processes in parallel, and also the number of examples the model uses to calculate the gradient. The loss is computed over all examples in the batch (usually by averaging or summing the losses of the individual samples), and the network updates its parameters based on the resulting gradient, one batch at a time.

- Iterate through all of the train data once for each epoch, and validate the model once per epoch. That means the outer loop should be iterating through the epochs.

- The train loop. The train loop happens inside each epoch. This is where you iterate through all the train data, compute the loss, and run backpropagation. This is what needs to happen:

    1. Run the input through the model, and compute the output. This is the forward pass.

    2. Calculate the loss, and optimise the parameters with gradient descent and backpropagation.

- The validation loop. This happens at the end of an epoch, and is where you evaluate what the model learnt during that epoch. You have set aside a separate dataset to evaluate this. Here, you don't optimise the parameters of the model. Instead, you compute metrics that quantify the performance of the model. For this example, accuracy (the fraction of correct predictions), precision, recall, and F1 score are good metrics (you should do some Googling if you haven't seen them before).
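
The four metrics mentioned in the validation step can be sketched in plain Python from the counts of true/false positives and negatives. The helper `binary_metrics` and the toy labels below are made up for illustration; libraries such as scikit-learn offer ready-made equivalents.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)  # all four values are 2/3 for this toy example
```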


This sounds like a lot, but PyTorch does the heavy lifting. Here is some documentation: https://pytorch.org/tutorials/beginner/introyt/trainingyt.html.
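
The steps listed above can be sketched end to end on synthetic data. Everything in this sketch is an assumption made for illustration: the made-up two-feature data, the tiny model, the Adam optimizer, the `nn.BCELoss` loss, and the hyperparameter values. It is not the intended solution for the malware dataset.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic data: 2 features, label 1 when the features sum to a positive number.
toy_x = torch.randn(256, 2)
toy_y = (toy_x.sum(dim=1, keepdim=True) > 0).float()
toy_train = TensorDataset(toy_x[:192], toy_y[:192])
toy_val = TensorDataset(toy_x[192:], toy_y[192:])
toy_train_loader = DataLoader(toy_train, batch_size=32, shuffle=True)
toy_val_loader = DataLoader(toy_val, batch_size=32)

toy_model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-2)
toy_loss_fn = nn.BCELoss()  # binary cross-entropy on sigmoid outputs

num_toy_epochs = 20
for epoch in range(1, num_toy_epochs + 1):
    # Train loop: forward pass, loss, backpropagation, parameter update.
    toy_model.train()
    for x_batch, y_batch in toy_train_loader:
        toy_optimizer.zero_grad()           # clear gradients from the previous step
        preds = toy_model(x_batch)          # forward pass
        loss = toy_loss_fn(preds, y_batch)  # mean loss over the batch
        loss.backward()                     # backpropagation
        toy_optimizer.step()                # gradient-based update

    # Validation loop: no parameter updates, only metrics.
    toy_model.eval()
    correct = 0
    with torch.no_grad():
        for x_batch, y_batch in toy_val_loader:
            predicted_labels = (toy_model(x_batch) > 0.5).float()
            correct += (predicted_labels == y_batch).sum().item()
    val_accuracy = correct / len(toy_val)

print(f"Final validation accuracy = {val_accuracy:.3f}")
```

Shuffling is enabled only for the training loader, so that batches differ between epochs while validation stays deterministic.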

In [None]:
def train_and_evaluate(
        train_data,
        validation_data,
        num_epochs=20,
        lr=1e-03,
        batch_size=64,
):
    
    train_loader = DataLoader(train_data, batch_size=batch_size)
    validation_loader = DataLoader(validation_data, batch_size=batch_size)

    # Your code here ...

    optimizer = ...
    loss_fn = ...

    for epoch in range(1, num_epochs + 1):
        print(f"RUNNING EPOCH {epoch}/{num_epochs}")

        # Insert your training and validation code here

<h4> Done already? Try to improve the model! </h4>