**Revised on 3/5/2024: Changed source files**

This is the skeleton code for Task 1 of the midterm project. The files downloaded in Step 4 are based on the [Ember 2018 dataset](https://arxiv.org/abs/1804.04637) and contain the features (and corresponding labels) extracted from 1 million PE files, split into 80% training and 20% test datasets. The code used for feature extraction is available [here](https://colab.research.google.com/drive/16q9bOlCmnTquPtVXVzxUj4ZY1ORp10R2?usp=sharing). However, preprocessing and featurization may take up to 3 hours on Google Colab, so I recommend using the processed datasets (Step 4) to speed up your development.

Also note the new optional Step 8.5: to speed up your experiments, you may want to sample the original dataset of 800k training samples and 200k test samples down to smaller datasets.

**Step 1:** Mount your Google Drive by clicking on "Mount Drive" in the Files section (panel to the left of this text.)

**Step 2:** Go to Runtime -> Change runtime type and select T4 GPU.

**Step 3:** Create a folder in your Google Drive, and rename it to "vMalConv"

**Step 4:** Download the pre-processed training and test datasets.

In [None]:
# ~8GB
!wget https://dsci6015s24-midterm.s3.amazonaws.com/v2/X_train.dat
!wget https://dsci6015s24-midterm.s3.amazonaws.com/v2/X_test.dat
!wget https://dsci6015s24-midterm.s3.amazonaws.com/v2/y_train.dat
!wget https://dsci6015s24-midterm.s3.amazonaws.com/v2/y_test.dat
!wget https://dsci6015s24-midterm.s3.amazonaws.com/v2/metadata.csv

--2024-03-06 05:14:32--  https://dsci6015s24-midterm.s3.amazonaws.com/v2/X_train.dat
Resolving dsci6015s24-midterm.s3.amazonaws.com (dsci6015s24-midterm.s3.amazonaws.com)... 52.217.202.129, 3.5.8.176, 3.5.29.195, ...
Connecting to dsci6015s24-midterm.s3.amazonaws.com (dsci6015s24-midterm.s3.amazonaws.com)|52.217.202.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7619200000 (7.1G) [binary/octet-stream]
Saving to: ‘X_train.dat’


2024-03-06 05:16:56 (50.8 MB/s) - ‘X_train.dat’ saved [7619200000/7619200000]

--2024-03-06 05:16:56--  https://dsci6015s24-midterm.s3.amazonaws.com/v2/X_test.dat
Resolving dsci6015s24-midterm.s3.amazonaws.com (dsci6015s24-midterm.s3.amazonaws.com)... 52.216.217.145, 52.216.78.196, 52.216.152.172, ...
Connecting to dsci6015s24-midterm.s3.amazonaws.com (dsci6015s24-midterm.s3.amazonaws.com)|52.216.217.145|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1904800000 (1.8G) [binary/octet-stream]
Saving to: ‘X_t

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Step 5:** Copy the downloaded files to vMalConv

In [None]:
!cp /content/X_train.dat /content/drive/MyDrive/vMalConv/X_train.dat
!cp /content/X_test.dat /content/drive/MyDrive/vMalConv/X_test.dat
!cp /content/y_train.dat /content/drive/MyDrive/vMalConv/y_train.dat
!cp /content/y_test.dat /content/drive/MyDrive/vMalConv/y_test.dat
!cp /content/metadata.csv /content/drive/MyDrive/vMalConv/metadata.csv

**Step 6:** Download and install Ember:

In [None]:
!pip install git+https://github.com/PFGimenez/ember.git

Collecting git+https://github.com/PFGimenez/ember.git
  Cloning https://github.com/PFGimenez/ember.git to /tmp/pip-req-build-q5scnzym
  Running command git clone --filter=blob:none --quiet https://github.com/PFGimenez/ember.git /tmp/pip-req-build-q5scnzym
  Resolved https://github.com/PFGimenez/ember.git to commit 3b82fe63069884882e743af725d29cc2a67859f1
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ember
  Building wheel for ember (setup.py) ... done
  Created wheel for ember: filename=ember-0.1.0-py3-none-any.whl size=13050 sha256=ad23fa927f31c132d3175502a3cde261709e3657d6cc19864db88d90928125fa
  Stored in directory: /tmp/pip-ephem-wheel-cache-w0kfar2d/wheels/8f/69/f9/1917c8df03b25fe53e8e2f6cb2c9f61a43dec179b19b10ab9f
Successfully built ember
Installing collected packages: ember
Successfully installed ember-0.1.0


In [None]:
!pip install lief

Collecting lief
  Downloading lief-0.14.1-cp310-cp310-manylinux_2_28_x86_64.manylinux_2_27_x86_64.whl (2.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.7/2.7 MB 9.9 MB/s eta 0:00:00
Installing collected packages: lief
Successfully installed lief-0.14.1


**Step 7:** Read vectorized features from the data files.

In [None]:
import ember
X_train, y_train, X_test, y_test = ember.read_vectorized_features("drive/MyDrive/vMalConv/")
metadata_dataframe = ember.read_metadata("drive/MyDrive/vMalConv/")



**Step 8:** Get rid of rows with no labels.

In [None]:
# labelrows = (y_train != -1)
# X_train = X_train[labelrows]
# y_train = y_train[labelrows]

array([-1.,  0.,  1.], dtype=float32)
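The output above appears to list the distinct label values, where `-1` marks a row with no label. As a minimal sketch of the masking idea in this step, the following uses small made-up arrays (`X_toy` and `y_toy` are hypothetical stand-ins, not the real data) to show how a boolean mask keeps features and labels aligned:

```python
import numpy as np

# Toy stand-ins for the real arrays (values are hypothetical).
X_toy = np.arange(12, dtype=np.float32).reshape(6, 2)
y_toy = np.array([-1, 0, 1, -1, 1, 0], dtype=np.float32)

# Boolean mask that is True only for labeled rows.
labeled = y_toy != -1

# Apply the same mask to both arrays so rows stay aligned.
X_labeled = X_toy[labeled]
y_labeled = y_toy[labeled]

print(np.unique(y_labeled))  # label values that remain
print(X_labeled.shape)       # fewer rows, same number of features
```

Note that indexing with a mask builds a new in-memory array, which may be expensive on the full training set.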

In [None]:
import h5py
h5f = h5py.File('X_train.h5', 'w')
h5f.create_dataset('X_train', data=X_train)
h5f.close()
h5f = h5py.File('y_train.h5', 'w')
h5f.create_dataset('y_train', data=y_train)
h5f.close()

KeyboardInterrupt: 

In [None]:
!cp /content/X_train.h5 /content/drive/MyDrive/vMalConv/X_train.h5
!cp /content/y_train.h5 /content/drive/MyDrive/vMalConv/y_train.h5

**Optional Step 8.5:** To speed up your experiments, you may want to sample the original dataset of 800k training samples and 200k test samples down to smaller datasets. You can use the [Pandas Dataframe sample() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html), or come up with your own sampling methodology. Be mindful of the fact that the dataset is heavily imbalanced.
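One possible way to approach this, sketched on toy labels: first count how many rows each label has, then draw the same number of rows per labeled class without replacement. The helper name `balanced_sample_indices`, the seed, and the toy label array are assumptions made for illustration only.

```python
import numpy as np

def balanced_sample_indices(y, per_class, seed=0):
    """Return shuffled row indices with at most `per_class` rows per labeled class."""
    rng = np.random.default_rng(seed)
    chosen = []
    for label in np.unique(y[y != -1]):          # skip the unlabeled marker -1
        idx = np.flatnonzero(y == label)
        take = min(per_class, idx.size)          # never ask for more rows than exist
        chosen.append(rng.choice(idx, size=take, replace=False))
    chosen = np.concatenate(chosen)
    rng.shuffle(chosen)                          # mix the classes together
    return chosen

# Toy labels (hypothetical): 5 unlabeled, 20 of class 0, 8 of class 1.
y_toy = np.array([-1] * 5 + [0] * 20 + [1] * 8, dtype=np.float32)

labels, counts = np.unique(y_toy, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))  # label distribution before sampling

idx = balanced_sample_indices(y_toy, per_class=6)
print(np.unique(y_toy[idx], return_counts=True))    # label distribution after sampling
```

Shuffling the selected indices avoids leaving the sampled rows grouped by class, which matters for any later step that reads only a prefix of the data.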

In [None]:
### Your code (optional) for sampling the original dataset.
import numpy as np

samples_per_class = 100000

unique_labels = np.unique(y_train[y_train!=-1])

selected_indices = []

for label in unique_labels:
    label_indices = np.where(y_train == label)[0]
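    # Sample without replacement so no row is duplicated; duplicates could
    # otherwise end up on both sides of a later train/validation split
    random_indices = np.random.choice(label_indices, samples_per_class, replace=False)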
    selected_indices.extend(random_indices.tolist())

X_train = X_train[selected_indices]
y_train = y_train[selected_indices]


print(X_train.shape)
print(y_train.shape)

torch.Size([200000, 2381])
torch.Size([200000, 1])


> **Task 1:** Complete the following code to build the architecture of MalConv in PyTorch:

In [None]:
import torch
import torch.nn as nn

class MalConv(nn.Module):
    def __init__(self, input_length=2000000, window_size=5, output_dim=1):
        super(MalConv, self).__init__()
        self.window_size = window_size
        self.input_length = input_length
        self.flatten = nn.Flatten()

        self.conv1 = nn.Conv1d(in_channels=1, out_channels=32, kernel_size=window_size, stride=window_size, bias=True)
        self.conv2 = nn.Conv1d(in_channels=32, out_channels=64, kernel_size=window_size, stride=window_size, bias=True)
        self.conv3 = nn.Conv1d(in_channels=64, out_channels=128, kernel_size=window_size, stride=window_size, bias=True)

        # Calculate the output size after the convolutional layers
        conv_output_length = 1 +((input_length - window_size) // window_size)
        conv_output_length = 1 + ((conv_output_length - window_size) // window_size)
        conv_output_length = 1 + ((conv_output_length - window_size) // window_size)
        conv_output_length *= 128

        self.fc1 = nn.Linear(conv_output_length, 128)
        self.fc2 = nn.Linear(128, output_dim)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = x.unsqueeze(1)  # Add a channel dimension
        x = self.conv1(x)
        x = torch.relu(x)
        x = self.conv2(x)
        x = torch.relu(x)
        x = self.conv3(x)
        x = torch.relu(x)
        x = self.flatten(x)
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.fc2(x)
        x = self.sigmoid(x)
        return x


input_size = 2381

# Create a random input tensor with the specified input size
random_input = torch.rand(10, input_size)

# Instantiate the MalConv model
model = MalConv(input_length=input_size)
print(model)

# Pass the random input through the model
output = model(random_input)

print("Input Size:", random_input.size())
print("Output Size:", output.size())
print(output)

MalConv(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (conv1): Conv1d(1, 32, kernel_size=(5,), stride=(5,))
  (conv2): Conv1d(32, 64, kernel_size=(5,), stride=(5,))
  (conv3): Conv1d(64, 128, kernel_size=(5,), stride=(5,))
  (fc1): Linear(in_features=2432, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
Input Size: torch.Size([10, 2381])
Output Size: torch.Size([10, 1])
tensor([[0.5007],
        [0.5011],
        [0.5001],
        [0.5009],
        [0.5005],
        [0.5001],
        [0.5005],
        [0.5012],
        [0.5009],
        [0.5004]], grad_fn=<SigmoidBackward0>)


**Step 9:** Fit the `StandardScaler` incrementally with `partial_fit` to avoid exhausting memory:

In [None]:
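from sklearn.preprocessing import StandardScaler
mms = StandardScaler()  # a StandardScaler, despite the "mms" name
# Fit in chunks of 1000 rows. Cover every row: the sampled data is grouped
# by class, so stopping early would skew the statistics toward one class.
for x in range(0, X_train.shape[0], 1000):
  mms.partial_fit(X_train[x:x+1000])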

In [None]:
X_train = mms.transform(X_train)

In [None]:
## Ensure 2-D shapes: (N, 2381) for features and (N, 1) for labels ##
import numpy as np
X_train = np.reshape(X_train,(-1,2381))
y_train = np.reshape(y_train,(-1,1))

In [None]:
y_train.shape

torch.Size([200000, 1])

In [None]:
X_train.shape

(200000, 2381)

**Load, Tensorize, and Split** The following code takes care of converting the training data into Torch Tensors, and then splits it into 80% training and 20% validation datasets.

In [None]:
import numpy as np
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split

# Assuming MalConv class definition is already provided as above

# Convert your numpy arrays to PyTorch tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32)

# Split the data into training and validation sets (80% training, 20% validation)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

# Create TensorDatasets and DataLoaders for training and validation sets
train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)

batch_size = 2048  # Adjust based on your GPU memory
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)



> **Task 2:** Complete the following code to train the model on the GPU for 15 epochs, with a batch size of 64. If you are on Google Colab, don't forget to change the kernel in Runtime -> Change runtime type -> T4 GPU.

In [None]:
# Initialize the MalConv model
model = MalConv(input_length=2381)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

MalConv(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (conv1): Conv1d(1, 32, kernel_size=(5,), stride=(5,))
  (conv2): Conv1d(32, 64, kernel_size=(5,), stride=(5,))
  (conv3): Conv1d(64, 128, kernel_size=(5,), stride=(5,))
  (fc1): Linear(in_features=2432, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

In [None]:
import os

# Loss function and optimizer
criterion = nn.BCELoss()  # Binary Cross-Entropy Loss for binary classification
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adjust learning rate as needed

# Directory to save model checkpoints
save_dir = "drive/MyDrive/vMalConv/"

# Training Loop with Validation
num_epochs = 20  # Adjust the number of epochs as needed

for epoch in range(num_epochs):
    model.train()  # Set model to training mode
    running_loss = 0.0

    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()  # Zero the gradients

        outputs = model(inputs)

        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f'Epoch {epoch+1}, Training Loss: {running_loss/len(train_loader)}')

    # Validation step
    model.eval()  # Set model to evaluation mode
    val_loss = 0.0
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            val_loss += loss.item()
    print(f'Validation Loss: {val_loss/len(val_loader)}')

    # Save checkpoint every 5 epochs
    if (epoch + 1) % 5 == 0:
        checkpoint_path = os.path.join(save_dir, f'model_epoch_{epoch+1}.pt')
        torch.save(model.state_dict(), checkpoint_path)
        print(f'Model checkpoint saved to {checkpoint_path}')


Epoch 1, Training Loss: 0.37229777683940113
Validation Loss: 0.26051189079880716
Epoch 2, Training Loss: 0.2110768337792988
Validation Loss: 0.1945260114967823
Epoch 3, Training Loss: 0.171075267410731
Validation Loss: 0.1829151801764965
Epoch 4, Training Loss: 0.1502069597002826
Validation Loss: 0.15531195029616357
Epoch 5, Training Loss: 0.13120943458774423
Validation Loss: 0.13977479115128516
Model checkpoint saved to drive/MyDrive/vMalConv/model_epoch_5.pt
Epoch 6, Training Loss: 0.11649830280979977
Validation Loss: 0.1545613318681717
Epoch 7, Training Loss: 0.10879426383519475
Validation Loss: 0.13119899928569795
Epoch 8, Training Loss: 0.0956240020975282
Validation Loss: 0.12038677595555783
Epoch 9, Training Loss: 0.0889210792470582
Validation Loss: 0.11693084761500358
Epoch 10, Training Loss: 0.08331964872305907
Validation Loss: 0.10859535969793796
Model checkpoint saved to drive/MyDrive/vMalConv/model_epoch_10.pt
Epoch 11, Training Loss: 0.07707774940925309
Validation Loss: 0.1
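The loss values printed above are mean binary cross-entropy, the quantity `nn.BCELoss` reports by default. As a rough guide to reading them, this NumPy sketch (the function name and toy predictions are made up for illustration) computes the same quantity by hand for two simple cases:

```python
import numpy as np

def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    losses = -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
    return float(np.mean(losses))

targets = np.array([1.0, 0.0, 1.0, 0.0])

uninformed = np.full(4, 0.5)                # always predicts 0.5
confident = np.array([0.9, 0.1, 0.9, 0.1])  # assigns 0.9 to the correct class each time

print(binary_cross_entropy(uninformed, targets))  # ln(2), about 0.693
print(binary_cross_entropy(confident, targets))   # -ln(0.9), about 0.105
```

By this yardstick, a loss near 0.693 corresponds to guessing, and a loss near 0.1 corresponds to assigning roughly 0.9 probability to the correct class on average (in the geometric-mean sense).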

> **Task 3:** Complete the following code to evaluate your trained model on the test data.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Convert test data to PyTorch tensors
X_test = mms.transform(X_test[:50000])
y_test = y_test[:50000]

X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.float32)

# Create a TensorDataset and DataLoader for test data
test_dataset = TensorDataset(X_test, y_test)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

model.eval()



MalConv(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (conv1): Conv1d(1, 32, kernel_size=(5,), stride=(5,))
  (conv2): Conv1d(32, 64, kernel_size=(5,), stride=(5,))
  (conv3): Conv1d(64, 128, kernel_size=(5,), stride=(5,))
  (fc1): Linear(in_features=2432, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

In [None]:
predictions = []
labels = []

with torch.no_grad():
    for inputs, labels_batch in test_loader:

        inputs, labels_batch = inputs.to(device), labels_batch.to(device)
        outputs = model(inputs)
        predicted = (outputs > 0.5).float()
        # Store predictions and labels
        predictions.extend(predicted.cpu().numpy())
        labels.extend(labels_batch.cpu().numpy())

# Compute metrics
accuracy = accuracy_score(labels, predictions)
precision = precision_score(labels, predictions)
recall = recall_score(labels, predictions)

print(f'Test Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')

Test Accuracy: 0.5973
Precision: 0.5686
Recall: 0.8131


**Task 4:** Comment on the results in this text box.

Test Accuracy (0.5973):

Accuracy is the proportion of test instances that the model classifies correctly.
A value of 0.5973 means the model is right on 59.73% of the 50,000 test samples evaluated.
Accuracy alone does not give a complete picture, especially when classes are imbalanced or when different kinds of error carry different costs.


Precision (0.5686):

Precision is the proportion of the model's positive predictions that are actually positive.
A precision of 0.5686 means that when the model predicts the positive class, it is correct 56.86% of the time.
Precision matters most where false positives are costly. In this task, a false positive is a benign file flagged as malicious.


Recall (0.8131):

Recall, also known as sensitivity or true positive rate, is the proportion of actual positive instances that the model correctly identifies.
A recall of 0.8131 means the model catches 81.31% of the actual positives in the evaluated test samples.
Recall matters most where missing a positive has severe consequences. In this task, a false negative is a malicious file that goes undetected.
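The three definitions above can be written directly in terms of confusion-matrix counts. The sketch below does that with made-up counts, and also back-solves the fraction of actual positives implied by an (accuracy, precision, recall) triple, using accuracy = 1 - π(1 - R(2 - 1/P)), where π is the positive fraction. Both function names are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

def implied_positive_fraction(accuracy, precision, recall):
    """Fraction of actual positives that is consistent with the three metrics."""
    return (1.0 - accuracy) / (1.0 - recall * (2.0 - 1.0 / precision))

# Hypothetical counts: 11 actual positives out of 20 samples.
m = classification_metrics(tp=8, fp=2, tn=7, fn=3)
print(m)
print(implied_positive_fraction(**m))  # about 0.55, i.e. 11 / 20

# Applied to the metrics reported above.
print(implied_positive_fraction(0.5973, 0.5686, 0.8131))  # about 0.50
```

Under this back-calculation, roughly half of the evaluated test samples are positive, which gives a baseline for judging the accuracy figure.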

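Overall, these results are weak for a binary classifier. High recall combined with low precision indicates that the model predicts the positive class too often, catching most positives at the cost of many false alarms. A rough back-calculation from the three metrics suggests that about half of the evaluated samples are positive, so an accuracy of 59.73% is only about ten points above chance. The validation loss had fallen to about 0.11 by epoch 10, so the much weaker test result suggests that the model does not generalize well from the sampled training data to the test set. Further analysis, such as examining the trade-off between precision and recall at different thresholds or checking performance on different subsets of the data, would help identify the cause. These results should also be interpreted in light of the error costs of the application.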