# Title

**AI and Cybersecurity**

**Spring 2024, University of New Haven**

**Instructor:** Dr. Behzadan

**Student:** Arshad Badfar

# Overview

The objective of the midterm project was to use the MalConv architecture with the Ember dataset on Amazon SageMaker to classify Portable Executable (PE) files as either benign or malicious.



![image.png](attachment:dac2d905-c7ab-46dd-b93b-4de4bda26cd2.png)

_Fig 1: MalConv Architecture_







# Task 1

## Dataset

Before feeding the data into a deep neural network, it's essential to preprocess and vectorize the Ember dataset. Understanding the dataset and the associated paper is crucial for this task.


## Limitations on SDK 

Because the Ember dataset is large, constraints such as RAM, storage limits, and library compatibility issues had to be considered when exploring platforms like Google Colab and SageMaker.

Options for handling these constraints include purchasing Colab Pro and migrating the project to SageMaker. Each option has its own advantages and limitations, which should be evaluated against the project requirements and the available resources.

## Preprocess Data

Before standardizing or normalizing the data, it's necessary to eliminate data with a label of -1. This step ensures that only relevant data is considered for further processing.
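As a minimal sketch of this filtering step, the snippet below drops every row whose label is -1 (the marker Ember uses for unlabeled samples). The function name `drop_unlabeled` and the toy arrays are illustrative assumptions, not the project's actual code.

```python
import numpy as np

def drop_unlabeled(X, y, unlabeled=-1):
    """Keep only the rows whose label differs from the 'unlabeled' marker."""
    mask = y != unlabeled          # boolean mask, True for labeled rows
    return X[mask], y[mask]

# Toy data standing in for vectorized Ember features (hypothetical values).
X = np.arange(12, dtype=np.float32).reshape(6, 2)
y = np.array([0, 1, -1, 1, -1, 0])

X_kept, y_kept = drop_unlabeled(X, y)
```

After this call, only samples labeled 0 (benign) or 1 (malicious) remain, so the scaler and the network never see unlabeled rows.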

Preprocessing involves standardizing or normalizing the data to help the neural network converge, or at least to speed up convergence. For this we chose `StandardScaler` from the sklearn library. Because of the large size of the data, it was not feasible to fit the scaler to the entire dataset at once. Instead, we called `partial_fit` on partitions of the training data.
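The incremental fitting described above can be sketched as follows. The chunk size, the helper name `fit_scaler_in_chunks`, and the synthetic data are assumptions for illustration; in the project the partitions would come from the vectorized Ember training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def fit_scaler_in_chunks(X, chunk_size):
    """Fit a StandardScaler incrementally, one partition at a time."""
    scaler = StandardScaler()
    for start in range(0, X.shape[0], chunk_size):
        # partial_fit updates the running mean and variance with this chunk only
        scaler.partial_fit(X[start:start + chunk_size])
    return scaler

# Synthetic stand-in for the training features (hypothetical values).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))

scaler = fit_scaler_in_chunks(X_train, chunk_size=128)
X_scaled = scaler.transform(X_train)
```

Reusing the same fitted `scaler` to transform the validation and test data keeps all splits on the same scale.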

## Model Architecture

The MalConv architecture is designed to analyze and classify Portable Executable (PE) files as either benign or malicious. Below is the architecture description:

- **Embedding Layer:**
  - Input Size: size of the vocabulary
  - Output Size: 8
  - Purpose: Embeds the input tokens into a lower-dimensional space.

- **conv1:**
  - Input Channels: 8
  - Output Channels: 128
  - Purpose: Extracts features from the embedded input.

- **conv2:**
  - Same configuration as conv1.
  - Purpose: Another convolutional layer for feature extraction.

- **Sigmoid activation function:**
  - Purpose: Acts as a gate that controls the flow of information between the convolutional layers. In MalConv, the sigmoid of one convolution's output multiplies the other convolution's output.

- **Global Max Pooling Layer:**
  - Purpose: Aggregates the maximum value from each feature map.

- **fc1:**
  - Input Features: 128
  - Output Features: 128
  - Purpose: Transforms the features extracted by the convolutional layers.

- **fc2:**
  - Input Features: 128
  - Output Features: 1
  - Purpose: Produces the final output indicating the probability of a file being malicious.

- **Sigmoid Activation:**
  - Purpose: Generates the final output probability between 0 and 1, indicating the likelihood of the input file being malicious.

This architecture is specifically tailored for the task of PE file classification, utilizing convolutional layers to capture local patterns within the files and fully connected layers for further feature transformation and classification.
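The layer list above can be sketched in PyTorch roughly as follows. This is an illustrative reconstruction, not the project's actual code: the vocabulary size of 257, the kernel size and stride of 500, the ReLU after `fc1`, and the class name `MalConvSketch` are all assumptions, since the report specifies only the embedding width (8) and the channel and feature counts (128).

```python
import torch
import torch.nn as nn

class MalConvSketch(nn.Module):
    """Gated-convolution classifier following the layer list in this report."""

    def __init__(self, vocab_size=257, embed_dim=8, channels=128,
                 kernel_size=500, stride=500):  # assumed hyperparameters
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv1 = nn.Conv1d(embed_dim, channels, kernel_size, stride=stride)
        self.conv2 = nn.Conv1d(embed_dim, channels, kernel_size, stride=stride)
        self.fc1 = nn.Linear(channels, channels)
        self.fc2 = nn.Linear(channels, 1)

    def forward(self, tokens):
        x = self.embed(tokens)        # (batch, length, embed_dim)
        x = x.transpose(1, 2)         # Conv1d expects (batch, channels, length)
        # Gating: the sigmoid of conv2's output scales conv1's output.
        gated = self.conv1(x) * torch.sigmoid(self.conv2(x))
        pooled = gated.max(dim=2).values        # global max pooling over length
        hidden = torch.relu(self.fc1(pooled))   # ReLU here is an assumption
        return torch.sigmoid(self.fc2(hidden)).squeeze(1)  # probability per file

# Forward pass on random integer tokens (hypothetical input).
model = MalConvSketch()
batch = torch.randint(1, 257, (2, 2000))
probs = model(batch)
```

The output holds one value per input sequence, each between 0 and 1. Note that the embedding layer expects integer token indices, so continuous, standardized features would need a different input layer or a discretization step.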








## Training 

During the training process, multiple model checkpoints were generated with varying settings. It was recommended to adjust the batch size according to available memory to optimize training performance.

The final model checkpoint was trained for 20 epochs. The results of this training session are listed below.

- Epochs: 20
- Batch Size: 128 to 4000, adjusted according to memory constraints
- Training Loss: 0.000212
- Validation Loss: 0.68

![image.png](attachment:742f2e6b-f729-46c6-8f10-d6ee6c860b2a.png)


## Test

**Test Accuracy: 0.6817**

**Precision: 0.8102**

**Recall: 0.4747**
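For reference, the three metrics above are defined from the confusion-matrix counts as in the short sketch below. The toy labels are made up for illustration. The F1 score at the end is simply derived from the reported precision and recall and is not a separately measured result.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for 0/1 labels, with 1 = malicious."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Toy labels (hypothetical), only to exercise the definitions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
acc, prec, rec = binary_metrics(y_true, y_pred)

# F1 implied by the precision and recall reported above.
reported_f1 = f1_score(0.8102, 0.4747)
```

With a precision of 0.8102 and a recall of 0.4747, the model is usually right when it flags a file as malicious, but it detects fewer than half of the malicious files in the test set.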

# Result


During training, the model started to overfit after epoch 3, as indicated by an increase in validation loss while the training loss kept decreasing. This happened consistently across different scaling methods and sample sizes. However, we decided to retain this checkpoint as the final one because it was trained with the most data available.

The overfitting may be attributed to how the embedding layer works and to the feature engineering process. It is likely influenced by a partial understanding of the Ember dataset and of the nuances of the embedding layer.
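One common way to act on a rising validation loss is to track it per epoch and either keep the checkpoint with the lowest value or stop early. The sketch below shows that idea in isolation. The helper names, the patience value, and the loss values are illustrative assumptions and are not measurements from this project.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_idx + 1

def should_stop(val_losses, patience=3):
    """True when none of the last `patience` epochs beat the earlier best loss."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return all(loss >= best_before for loss in val_losses[-patience:])

# Made-up validation losses that start rising after the third epoch.
toy_val_losses = [0.90, 0.70, 0.60, 0.65, 0.70, 0.80]

chosen = best_epoch(toy_val_losses)
stop_now = should_stop(toy_val_losses, patience=3)
```

This trades off against the choice made in this project, which kept the last checkpoint because it had seen the most data.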

Despite the challenges encountered during training, the test accuracy was deemed satisfactory. To obtain this result, the test dataset was standardized before evaluation.



Significant time was invested in the deployment phase, but deployment was ultimately unsuccessful. This may be due to limitations of the student account (AWS free tier), which restricts access to certain credentials needed for deployment. It may also stem from limitations in knowledge and expertise.

The deployment failure highlights the importance of considering account restrictions and technical limitations when attempting to deploy models in real-world scenarios.


