**Revised on 3/5/2024: Changed source files**

This is the skeleton code for Task 1 of the midterm project. The files downloaded in Step 4 are based on the [Ember 2018 dataset](https://arxiv.org/abs/1804.04637) and contain the features (and corresponding labels) extracted from 1 million PE files, split into 80% training and 20% test datasets. The code used for feature extraction is available [here](https://colab.research.google.com/drive/16q9bOlCmnTquPtVXVzxUj4ZY1ORp10R2?usp=sharing). However, preprocessing and featurization may take up to 3 hours on Google Colab, so I recommend using the processed datasets (Step 4) to speed up your development.

Also, note that there is a new optional Step 8.5: to speed up your experiments, you may want to sample the original dataset of 800k training samples and 200k test samples down to smaller datasets.

**Step 1:** Mount your Google Drive by clicking on "Mount Drive" in the Files section (the panel to the left of this text).

**Step 2:** Go to Runtime -> Change runtime type and select T4 GPU.

**Step 3:** Create a folder in your Google Drive and name it "vMalConv".

**Step 4:** Download the pre-processed training and test datasets.

In [None]:
# ~8GB
!wget https://dsci6015s24-midterm.s3.amazonaws.com/v2/X_train.dat
!wget https://dsci6015s24-midterm.s3.amazonaws.com/v2/X_test.dat
!wget https://dsci6015s24-midterm.s3.amazonaws.com/v2/y_train.dat
!wget https://dsci6015s24-midterm.s3.amazonaws.com/v2/y_test.dat
!wget https://dsci6015s24-midterm.s3.amazonaws.com/v2/metadata.csv

--2024-03-12 18:53:00--  https://dsci6015s24-midterm.s3.amazonaws.com/v2/X_train.dat
Resolving dsci6015s24-midterm.s3.amazonaws.com (dsci6015s24-midterm.s3.amazonaws.com)... 52.216.54.9, 3.5.28.203, 52.217.165.89, ...
Connecting to dsci6015s24-midterm.s3.amazonaws.com (dsci6015s24-midterm.s3.amazonaws.com)|52.216.54.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7619200000 (7.1G) [binary/octet-stream]
Saving to: ‘X_train.dat’


2024-03-12 18:55:28 (49.4 MB/s) - ‘X_train.dat’ saved [7619200000/7619200000]

--2024-03-12 18:55:28--  https://dsci6015s24-midterm.s3.amazonaws.com/v2/X_test.dat
Resolving dsci6015s24-midterm.s3.amazonaws.com (dsci6015s24-midterm.s3.amazonaws.com)... 3.5.3.100, 54.231.138.185, 54.231.236.193, ...
Connecting to dsci6015s24-midterm.s3.amazonaws.com (dsci6015s24-midterm.s3.amazonaws.com)|3.5.3.100|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1904800000 (1.8G) [binary/octet-stream]
Saving to: ‘X_test.dat’




In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


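**Step 5:** Copy the downloaded files to your Google Drive folder. The commands below use a folder named `TModel`; adjust the path if your folder has a different name (for example "vMalConv").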

In [None]:
!cp /content/X_train.dat /content/drive/MyDrive/TModel/X_train.dat
!cp /content/X_test.dat /content/drive/MyDrive/TModel/X_test.dat
!cp /content/y_train.dat /content/drive/MyDrive/TModel/y_train.dat
!cp /content/y_test.dat /content/drive/MyDrive/TModel/y_test.dat
!cp /content/metadata.csv /content/drive/MyDrive/TModel/metadata.csv

**Step 6:** Download and install Ember:

In [None]:
!pip install git+https://github.com/PFGimenez/ember.git

Collecting git+https://github.com/PFGimenez/ember.git
  Cloning https://github.com/PFGimenez/ember.git to /tmp/pip-req-build-9m14wrgd
  Running command git clone --filter=blob:none --quiet https://github.com/PFGimenez/ember.git /tmp/pip-req-build-9m14wrgd
  Resolved https://github.com/PFGimenez/ember.git to commit 3b82fe63069884882e743af725d29cc2a67859f1
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ember
  Building wheel for ember (setup.py) ... done
  Created wheel for ember: filename=ember-0.1.0-py3-none-any.whl size=13050 sha256=5bc8861a8ee6b34eb556b94340ec3853d926d8743ee3d3921e66c3adf1f0322d
  Stored in directory: /tmp/pip-ephem-wheel-cache-8oi_lkre/wheels/8f/69/f9/1917c8df03b25fe53e8e2f6cb2c9f61a43dec179b19b10ab9f
Successfully built ember
Installing collected packages: ember
Successfully installed ember-0.1.0


In [None]:
!pip install lief

Collecting lief
  Downloading lief-0.14.1-cp310-cp310-manylinux_2_28_x86_64.manylinux_2_27_x86_64.whl (2.7 MB)
Installing collected packages: lief
Successfully installed lief-0.14.1


**Step 7:** Read vectorized features from the data files.

In [13]:
import ember
X_train, y_train, X_test, y_test = ember.read_vectorized_features("drive/MyDrive/TModel/")
metadata_dataframe = ember.read_metadata("drive/MyDrive/TModel/")
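
As a quick sanity check on the downloads, the file sizes reported by `wget` in Step 4 can be related to the array shapes. The sketch below assumes that the `.dat` files are raw `float32` matrices (4 bytes per value) with one row per sample; that layout is an assumption, and `features_per_sample` is a hypothetical helper. Under that assumption both files work out to the same number of features per sample.

```python
# Byte counts copied from the wget log in Step 4
X_TRAIN_BYTES = 7_619_200_000
X_TEST_BYTES = 1_904_800_000

# Sample counts stated in the introduction (800k train, 200k test)
N_TRAIN = 800_000
N_TEST = 200_000

BYTES_PER_VALUE = 4  # assumption: values are stored as float32


def features_per_sample(total_bytes, n_samples, bytes_per_value=BYTES_PER_VALUE):
    """Return the number of features per row, or raise if the size does not divide evenly."""
    row_bytes, remainder = divmod(total_bytes, n_samples)
    if remainder or row_bytes % bytes_per_value:
        raise ValueError("file size is not consistent with the assumed layout")
    return row_bytes // bytes_per_value


train_dim = features_per_sample(X_TRAIN_BYTES, N_TRAIN)
test_dim = features_per_sample(X_TEST_BYTES, N_TEST)
print(train_dim, test_dim)
```

If the two values agree, `X_train.shape[1]` and `X_test.shape[1]` should report the same number after Step 7.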



**Step 8:** Remove rows that have no label (the code below treats a label of -1 as unlabeled).

In [14]:
labelrows = (y_train != -1)
X_train = X_train[labelrows]
y_train = y_train[labelrows]
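
Before sampling, it can help to see how many rows each label has left after this filter. The following is a minimal sketch on a toy array, assuming (as the mask above does) that `-1` marks an unlabeled row; `toy_X` and `toy_y` are made-up stand-ins for `X_train` and `y_train`.

```python
import numpy as np

# Toy stand-ins for X_train / y_train (hypothetical values, not real Ember data)
toy_X = np.arange(12, dtype=np.float32).reshape(6, 2)
toy_y = np.array([0, -1, 1, 1, -1, 0], dtype=np.float32)

# Same boolean mask as in Step 8: keep only rows whose label is not -1
labelrows = (toy_y != -1)
toy_X = toy_X[labelrows]
toy_y = toy_y[labelrows]

# Count how many rows remain per class
classes, counts = np.unique(toy_y, return_counts=True)
class_counts = {int(c): int(n) for c, n in zip(classes, counts)}
print(toy_X.shape, class_counts)
```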

In [15]:
import h5py
h5f = h5py.File('X_train.h5', 'w')
h5f.create_dataset('X_train', data=X_train)
h5f.close()
h5f = h5py.File('y_train.h5', 'w')
h5f.create_dataset('y_train', data=y_train)
h5f.close()

In [None]:
!cp /content/X_train.h5 /content/drive/MyDrive/TModel/X_train.h5
!cp /content/y_train.h5 /content/drive/MyDrive/TModel/y_train.h5

In [None]:
!cp /content/y_train.h5 /content/drive/MyDrive/TModel/sampled_y_train.h5
!cp /content/X_train.h5 /content/drive/MyDrive/TModel/sampled_x_train.h5

**Optional Step 8.5:** To speed up your experiments, you may want to sample the original dataset of 800k training samples and 200k test samples down to smaller datasets. You can use the [Pandas DataFrame sample() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html), or come up with your own sampling methodology. Be mindful of the fact that the dataset is heavily imbalanced.
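
One possible sampling methodology is to draw the same fraction from each class, so that the sample keeps the class proportions of the full dataset. The sketch below is one such approach written with NumPy only; `stratified_indices` is a hypothetical helper, not part of Ember or Pandas. It returns row indices, and indexing both `X` and `y` with the same index array keeps features and labels aligned.

```python
import numpy as np


def stratified_indices(y, fraction, seed=1):
    """Pick `fraction` of the rows of each class in y; return shuffled row indices."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).ravel()
    picked = []
    for label in np.unique(y):
        rows = np.flatnonzero(y == label)          # all rows of this class
        n_keep = max(1, int(round(fraction * len(rows))))
        picked.append(rng.choice(rows, size=n_keep, replace=False))
    idx = np.concatenate(picked)
    rng.shuffle(idx)                               # mix the classes together
    return idx


# Toy data: 90 rows of class 0 and 10 rows of class 1 (hypothetical values)
toy_y = np.array([0] * 90 + [1] * 10)
toy_X = np.arange(200).reshape(100, 2)

idx = stratified_indices(toy_y, fraction=0.2)
small_X, small_y = toy_X[idx], toy_y[idx]          # same indices for X and y
print(small_X.shape, np.bincount(small_y))
```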

In [16]:
import pandas as pd
import numpy as np
import h5py

# Converting numpy arrays to Pandas DataFrames
X_train_df = pd.DataFrame(X_train)
y_train_df = pd.DataFrame(y_train)
X_test_df = pd.DataFrame(X_test)
y_test_df = pd.DataFrame(y_test)

# Sampling only a portion of the original dataset to create a smaller dataset
sample_size = 75000

# Sampling the training dataset
sampled_X_train = X_train_df.sample(n=sample_size, random_state=1)
# Select labels by the sampled index so that row i of y matches row i of X
# (a boolean isin() mask would return the labels in their original order)
sampled_y_train = y_train_df.loc[sampled_X_train.index]

# Sampling the test dataset
sampled_X_test = X_test_df.sample(n=sample_size, random_state=1)
sampled_y_test = y_test_df.loc[sampled_X_test.index]

# Saving the sampled datasets to h5 files
h5f = h5py.File('sampled_X_train.h5', 'w')
h5f.create_dataset('sampled_X_train', data=sampled_X_train)
h5f.close()

h5f = h5py.File('sampled_y_train.h5', 'w')
h5f.create_dataset('sampled_y_train', data=sampled_y_train)
h5f.close()

h5f = h5py.File('sampled_X_test.h5', 'w')
h5f.create_dataset('sampled_X_test', data=sampled_X_test)
h5f.close()

h5f = h5py.File('sampled_y_test.h5', 'w')
h5f.create_dataset('sampled_y_test', data=sampled_y_test)
h5f.close()

In [30]:
!cp /content/sampled_X_test.h5 /content/drive/MyDrive/TModel/sampled_X_test.h5
!cp /content/sampled_X_train.h5 /content/drive/MyDrive/TModel/sampled_X_train.h5
!cp /content/sampled_y_train.h5 /content/drive/MyDrive/TModel/sampled_y_train.h5
!cp /content/sampled_y_test.h5 /content/drive/MyDrive/TModel/sampled_y_test.h5

> **Task 1:** Complete the following code to build the architecture of MalConv in PyTorch:

In [6]:
import torch
import torch.nn as nn
import h5py

# Loading the sampled training data from Colab environment
h5f = h5py.File('/content/drive/MyDrive/TModel/sampled_X_train.h5', 'r')
sampled_X_train_tensor = torch.tensor(h5f['sampled_X_train'][:], dtype=torch.long)
h5f.close()

# Define the MalConv model class
class MalConv(nn.Module):
    def __init__(self, input_length=2000000, embedding_dim=8, window_size=500, output_dim=1):
        super(MalConv, self).__init__()
        self.embed = nn.Embedding(256, embedding_dim, padding_idx=0)  # Embedding layer for 256 unique bytes with specified dimension
        self.conv1 = nn.Conv1d(in_channels=embedding_dim, out_channels=128, kernel_size=32, stride=32)
        self.conv2 = nn.Conv1d(in_channels=128, out_channels=128, kernel_size=32, stride=32)
        self.fc1 = nn.Linear(128, 128)
        self.fc2 = nn.Linear(128, output_dim)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.embed(x.clamp(min=0, max=255))  # Clamping indices to ensure they're within the valid range before embedding
        x = x.transpose(1, 2)  # Adjusting dimensions for Conv1d
        x = self.conv1(x)
        x = torch.relu(x)
        x = self.conv2(x)
        x = torch.relu(x)
        x = torch.max(x, dim=2)[0]  # Global max pooling over the length dimension -> (batch, 128); no squeeze, so a batch of size 1 keeps its batch dimension
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.fc2(x)
        x = self.sigmoid(x)
        return x

# Example input using the sampled training data
input_length = 2000000  # Fixed length for each input file
batch_size = 4
example_input = sampled_X_train_tensor[:batch_size, :input_length]  # Using the first 4 examples from sampled_X_train

# Creating and initializing the MalConv model
model = MalConv(input_length=input_length, embedding_dim=8, window_size=500, output_dim=1)

# Using the MalConv model for predictions
output = model(example_input)
print(output)  # Displaying the output probabilities for each example


tensor([[0.4864],
        [0.4845],
        [0.4859],
        [0.4843]], grad_fn=<SigmoidBackward0>)
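
To see how the two strided convolutions shrink the sequence before global max pooling, the output length of each `nn.Conv1d` layer can be computed by hand. For a layer with no padding and dilation 1 (the defaults, which the model above uses), the output length is `floor((L_in - kernel_size) / stride) + 1`. The helpers `conv1d_out_len` and `malconv_lengths` below are hypothetical names used only for this illustration.

```python
def conv1d_out_len(length, kernel_size, stride):
    """Output length of a 1-D convolution with padding=0 and dilation=1."""
    if length < kernel_size:
        raise ValueError("input is shorter than the kernel")
    return (length - kernel_size) // stride + 1


def malconv_lengths(input_length, kernel_size=32, stride=32):
    """Sequence length after conv1 and after conv2 for the model defined above."""
    after_conv1 = conv1d_out_len(input_length, kernel_size, stride)
    after_conv2 = conv1d_out_len(after_conv1, kernel_size, stride)
    return after_conv1, after_conv2


# Default input_length of the MalConv class above
print(malconv_lengths(2_000_000))
```

With these layer settings, an input shorter than 1,024 positions would fail at `conv2`, because the first layer would leave fewer than 32 positions.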


**Step 9:** Partially fit the `StandardScaler` in chunks to avoid overloading memory:

In [9]:
import h5py
from sklearn.preprocessing import StandardScaler

# Load sampled_X_train
drive_path = '/content/drive/MyDrive/TModel/sampled_X_train.h5'
with h5py.File(drive_path, 'r') as hf:
    sampled_X_train = hf['sampled_X_train'][:]


mms = StandardScaler()
for x in range(0, len(sampled_X_train), 100000):
    mms.partial_fit(sampled_X_train[x:x+100000])

In [33]:
sampled_X_train = mms.transform(sampled_X_train)
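
The reason chunked fitting gives usable statistics is that a mean and variance can be updated one batch at a time without holding all rows in memory. The sketch below illustrates that idea in plain NumPy by merging per-chunk counts, means, and sums of squared deviations. It is an illustration of the principle, not scikit-learn's actual implementation, and `RunningStats` is a hypothetical class.

```python
import numpy as np


class RunningStats:
    """Per-feature mean and variance accumulated over chunks of rows."""

    def __init__(self, n_features):
        self.n = 0
        self.mean = np.zeros(n_features)
        self.m2 = np.zeros(n_features)  # sum of squared deviations from the mean

    def update(self, chunk):
        chunk = np.asarray(chunk, dtype=np.float64)
        n_b = chunk.shape[0]
        mean_b = chunk.mean(axis=0)
        m2_b = ((chunk - mean_b) ** 2).sum(axis=0)
        delta = mean_b - self.mean
        total = self.n + n_b
        # Merge the chunk statistics into the running statistics
        self.m2 = self.m2 + m2_b + delta ** 2 * self.n * n_b / total
        self.mean = self.mean + delta * n_b / total
        self.n = total

    @property
    def var(self):
        return self.m2 / self.n  # population variance


rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))  # toy data standing in for the feature matrix

stats = RunningStats(n_features=3)
for start in range(0, len(data), 256):  # chunks of 256 rows, last one shorter
    stats.update(data[start:start + 256])

print(np.allclose(stats.mean, data.mean(axis=0)), np.allclose(stats.var, data.var(axis=0)))
```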

**Load, Tensorize, and Split:** The following code converts the training data into Torch tensors and then splits it into 80% training and 20% validation datasets.

In [1]:
import h5py
import numpy as np
import torch
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Read sampled_y_train
drive_path_sampled_y_train = '/content/drive/MyDrive/TModel/sampled_y_train.h5'
with h5py.File(drive_path_sampled_y_train, 'r') as hf:
    sampled_y_train = hf['sampled_y_train'][:]

# Load sampled_X_train
drive_path = '/content/drive/MyDrive/TModel/sampled_X_train.h5'
with h5py.File(drive_path, 'r') as hf:
    sampled_X_train = hf['sampled_X_train'][:]

# Keep the data 2D with shape (n_samples, n_features): StandardScaler
# standardizes each feature column separately, so the array must not be flattened

# Scale the input data
scaler = StandardScaler()
sampled_X_train_scaled = scaler.fit_transform(sampled_X_train)

# Convert your numpy arrays to PyTorch tensors
sampled_X_train_tensor = torch.tensor(sampled_X_train_scaled, dtype=torch.float32)
sampled_y_train_tensor = torch.tensor(sampled_y_train, dtype=torch.float32)

# Split the data into training and validation sets (80% training, 20% validation)
sampled_X_train_split, sampled_X_val_split, sampled_y_train_split, sampled_y_val_split = train_test_split(
    sampled_X_train_tensor, sampled_y_train_tensor, test_size=0.2, random_state=42
)

# Create TensorDatasets and DataLoaders for training and validation sets
sampled_train_dataset = TensorDataset(sampled_X_train_split, sampled_y_train_split)
sampled_val_dataset = TensorDataset(sampled_X_val_split, sampled_y_val_split)

batch_size = 64  # Task 2 asks for a batch size of 64
sampled_train_loader = DataLoader(sampled_train_dataset, batch_size=batch_size, shuffle=True)
sampled_val_loader = DataLoader(sampled_val_dataset, batch_size=batch_size, shuffle=False)


> **Task 2:** Complete the following code to train the model on the GPU for 15 epochs, with a batch size of 64. If you are on Google Colab, don't forget to change the kernel in Runtime -> Change runtime type -> T4 GPU.

In [17]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'


In [22]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import os

# Assuming MalConv class definition is already provided as above

# Initialize the MalConv model
model = MalConv()

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Loss function and optimizer
criterion = nn.BCELoss()  # Binary Cross-Entropy Loss for binary classification
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adjust learning rate as needed

# Directory to save model checkpoints
save_dir = "/content"

# Training Loop with Validation
num_epochs = 15  # Adjust the number of epochs as needed

for epoch in range(num_epochs):
    model.train()  # Set model to training mode
    running_loss = 0.0

    for inputs, labels in sampled_train_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs.squeeze(), labels.squeeze())

        # Backward pass and optimize
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f'Epoch {epoch+1}, Training Loss: {running_loss/len(sampled_train_loader)}')

    # Validation step
    model.eval()  # Set model to evaluation mode
    val_loss = 0.0
    with torch.no_grad():
        for inputs, labels in sampled_val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs.squeeze(), labels.squeeze())
            val_loss += loss.item()
    print(f'Validation Loss: {val_loss/len(sampled_val_loader)}')

    # Save checkpoint every 5 epochs
    if (epoch + 1) % 5 == 0:
        checkpoint_path = os.path.join(save_dir, f'model_epoch_{epoch+1}.pt')
        torch.save(model.state_dict(), checkpoint_path)
        print(f'Model checkpoint saved to {checkpoint_path}')


RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.FloatTensor instead (while checking arguments for embedding)
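
The error above comes from `nn.Embedding`, which looks up rows of a table and therefore needs integer indices, whereas the tensors built in the Load, Tensorize, and Split cell are `float32`. The sketch below reproduces the behaviour on a toy tensor and shows that casting to `torch.long` resolves the type error. Whether casting is meaningful for a given input is a separate question: the cast is only sensible when the values really are integer codes in the range 0 to 255 (an assumption made here for the toy data), not standardized real-valued features.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed = nn.Embedding(256, 8)  # same table size as the MalConv embedding above

# Toy batch: 2 sequences of 5 integer codes, stored as float32 by mistake
codes_float = torch.randint(0, 256, (2, 5)).to(torch.float32)

try:
    embed(codes_float)
    float_lookup_failed = False
except RuntimeError:
    # nn.Embedding rejects floating-point indices
    float_lookup_failed = True

# Casting the indices to an integer type makes the lookup valid
vectors = embed(codes_float.long())
print(float_lookup_failed, vectors.shape)
```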

In [None]:
!cp /content/drive/MyDrive/vMalConv/model_epoch_5.pt /content/drive/MyDrive/vMalConv1/model_epoch_5.pt
!cp /content/drive/MyDrive/vMalConv/model_epoch_10.pt /content/drive/MyDrive/vMalConv1/model_epoch_10.pt
!cp /content/drive/MyDrive/vMalConv/model_epoch_15.pt /content/drive/MyDrive/vMalConv1/model_epoch_15.pt


> **Task 3:** Complete the following code to evaluate your trained model on the test data.

In [None]:
import h5py
import torch
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Reading sampled_y_test
drive_path_sampled_y_test = '/content/drive/MyDrive/vMalConv1/sampled_y_test.h5'
with h5py.File(drive_path_sampled_y_test, 'r') as hf:
    sampled_y_test = hf['sampled_y_test'][:]

# Loading sampled_X_test
drive_path_sampled_X_test = '/content/drive/MyDrive/vMalConv1/sampled_X_test.h5'
with h5py.File(drive_path_sampled_X_test, 'r') as hf:
    sampled_X_test = hf['sampled_X_test'][:]

# Converting test data to PyTorch tensors
sampled_X_test_tensor = torch.tensor(sampled_X_test, dtype=torch.long)
sampled_y_test_tensor = torch.tensor(sampled_y_test, dtype=torch.float32)

# Creating a TensorDataset and DataLoader for test data
sampled_test_dataset = TensorDataset(sampled_X_test_tensor, sampled_y_test_tensor)
sampled_test_loader = DataLoader(sampled_test_dataset, batch_size=batch_size, shuffle=False)

# Ensuring the model is in evaluation mode
model.eval()

# Lists to store model predictions and actual labels
predictions = []
labels = []

with torch.no_grad():
    for inputs, labels_batch in sampled_test_loader:
        inputs, labels_batch = inputs.to(device), labels_batch.to(device)
        outputs = model(inputs)
        predicted = torch.round(outputs).view(-1)  # Threshold at 0.5 and flatten to shape (batch,)
        predictions.extend(predicted.cpu().numpy())
        labels.extend(labels_batch.view(-1).cpu().numpy())

# Computing metrics
accuracy = accuracy_score(labels, predictions)
precision = precision_score(labels, predictions)
recall = recall_score(labels, predictions)

print(f'Test Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')


Test Accuracy: 0.5077
Precision: 0.5068
Recall: 0.5950


> **Task 4:** Comment on the results in this text box.


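The evaluation may not give a complete picture of the model's performance because it uses a sample of only 75,000 instances. A sample of this size may not reflect how well the model performs across all classes, particularly when the class distribution is imbalanced. Note also that the reported test accuracy (0.5077) and precision (0.5068) are both close to 0.5, which is about what a binary classifier would achieve by guessing if the two classes were equally frequent.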

Precision and Recall: These metrics serve as complementary measures. High precision signifies a low false positive rate, indicating that when the model predicts a class, it's likely to be correct. On the other hand, high recall implies a low false negative rate, suggesting that the model can effectively capture instances of the positive class. However, it's important to note that a high recall might result in a higher number of false positives, potentially leading to a situation where benign instances are incorrectly classified as malware.
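
For reference, both metrics follow directly from the confusion-matrix counts. The following is a minimal sketch with made-up counts: the values of `tp`, `fp`, and `fn` are hypothetical and are not the counts behind the results above, and the helper names are chosen only for this illustration.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 60 malware caught, 40 benign flagged by mistake, 20 malware missed
p, r = precision_recall(tp=60, fp=40, fn=20)
f1 = f1_score(p, r)
print(p, r, f1)
```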

Training Loss: This metric represents the error incurred during the training phase of the model. It indicates how well the model is fitting the training data. Lower training loss values generally imply that the model is effectively learning the patterns in the training data.

Validation Loss: This metric measures the error on a separate validation dataset that the model hasn't seen during training. It helps assess how well the model generalizes to unseen data. A low validation loss indicates that the model is performing well not only on the training data but also on new, unseen instances.
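
Both losses in this notebook are computed with binary cross-entropy (`nn.BCELoss`). A useful reference point when reading them: a model that outputs a probability of 0.5 for every sample has a loss of ln 2 ≈ 0.693, regardless of the labels. The plain-Python sketch below, with the hypothetical helper `bce` and toy labels, shows this.

```python
import math


def bce(probs, labels):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [0, 1, 1, 0]                        # toy labels
uninformed = bce([0.5, 0.5, 0.5, 0.5], labels)
confident = bce([0.1, 0.9, 0.8, 0.2], labels)
print(uninformed, confident)
```

Training and validation losses that stay near 0.693 therefore suggest that the model has not yet learned to separate the two classes.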

Combining Precision, Recall, Training Loss, and Validation Loss provides a comprehensive understanding of the model's performance, considering both its predictive accuracy and generalization capabilities.