# ML in Cybersecurity: Task 3

## Team
  * **Team name**:  :flushed:
  * **Members**:  Tim Schneider (s8tiscne@stud.uni-saarland.de), Qiankun Zheng (qizh00001@stud.uni-saarland.de), Maximilian Löffler (s8maloef@stud.uni-saarland.de)


## Logistics
  * **Due date**: 9th December 2021, 23:59:59
  * Email the completed notebook to: `mlcysec_ws2022_staff@lists.cispa.saarland`
  * Complete this in **teams of 3**
  * Feel free to use the forum to discuss.
  
## Timeline
  * 26-Nov-2021: hand-out
  * **09-Dec-2021**: Email completed notebook
  
  
## About this Project
In this project, you will explore an application of ML to a popular task in cybersecurity: malware classification.
You will be presented with precomputed behaviour analysis reports of thousands of program binaries, many of which are malware.
Your goal is to train a malware detector using these behavioural reports.


## A Note on Grading
The grading for this project will depend on:
 1. Vectorizing Inputs
   * Obtaining a reasonable vectorized representation of the input data (a file containing a sequence of system calls)
   * Understanding the influence these representations have on your model
 1. Classification Model  
   * Following a clear ML pipeline
   * Obtaining reasonable performance (>60\%) on a held-out test set
   * Choice of evaluation metric
   * Visualizing loss/accuracy curves
 1. Analysis
   * Which methods (input representations/ML models) work better than the rest and why?
   * Which hyper-parameters and design-choices were important in each of your methods?
   * Quantifying influence of these hyper-parameters on loss and/or validation accuracies
   * Trade-offs between methods, hyper-parameters, design-choices
   * Anything else you find interesting (this part is open-ended)


## Grading Details
 * 40 points: Vectorizing input data (each input = behaviour analysis file in our case)
 * 40 points: Training a classification model
 * 15 points: Analysis/Discussion
 * 5 points: Clean code
 
## Filling-in the Notebook
You'll be submitting this very notebook that is filled-in with your code and analysis. Make sure you submit one that has been previously executed in-order. (So that results/graphs are already visible upon opening it). 

The notebook you submit **should compile** (or should be self-contained and sufficiently commented). Check tutorial 1 on how to set up the Python3 environment.


**The notebook is your project report. So, to make the report readable, omit code for techniques/models/things that did not work. You can use the final summary to provide a report about these.**

It is extremely important that you **do not** re-order the existing sections. Apart from that, the code blocks that you need to fill-in are given by:
```
#
#
# ------- Your Code -------
#
#
```
Feel free to break this into multiple-cells. It's even better if you interleave explanations and code-blocks so that the entire notebook forms a readable "story".


## Code of Honor
We encourage discussing ideas and concepts with other students to help you learn and better understand the course content. However, the work you submit and present **must be original** and demonstrate your effort in solving the presented problems. **We will not tolerate** blatantly using existing solutions (such as from the internet), improper collaboration (e.g., sharing code or experimental data between groups) and plagiarism. If the honor code is not met, no points will be awarded.

 
 ## Versions
  * v1.1: Updated deadline
  * v1.0: Initial notebook
  
  ---

In [7]:
import time 
 
import numpy as np 
import matplotlib.pyplot as plt 

import json 
import pickle 
import sys 
import csv 
import os 
import os.path as osp 
import shutil 
import pathlib
from pathlib import Path

from IPython.display import display, HTML
 
%matplotlib inline 
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots 
plt.rcParams['image.interpolation'] = 'nearest' 
plt.rcParams['image.cmap'] = 'gray' 
 
# for auto-reloading external modules 
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython 
%load_ext autoreload
%autoreload 2

In [8]:
# Some suggestions of our libraries that might be helpful for this project
from collections import Counter          # an even easier way to count
from multiprocessing import Pool         # for multiprocessing
from tqdm import tqdm                    # fancy progress bars

# Load other libraries here.
# Keep it minimal! We should be easily able to reproduce your code.

# We preload pytorch as an example
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, TensorDataset

In [9]:
# helpers

def loading(part, whole):
    assert(isinstance(part, int))
    assert(isinstance(whole, int))
    partial = (part + 1) * 100 / whole
    sys.stdout.write('\r{}%\r'.format(int(partial * 100) / 100))
    sys.stdout.flush()
    
malware_map = {
    'NothingFound': 0,
    'Basun': 1,
    'Patched': 2,
    'Texel': 3,
    'Swizzor': 4,
    'Virut': 5,
    'VB': 6,
    'AutoIt': 7,
    'Agent': 8,
    'Allaple': 9
}
    
def int_to_class(i):    
    for key in malware_map.keys():
        if malware_map[key] == i:
            return key
    
def class_to_int(c):
    return malware_map[c]

# Setup

  * Download the datasets: [train](https://nextcloud.mpi-klsb.mpg.de/index.php/s/pJrRGzm2So2PMZm) (128M) and [test](https://nextcloud.mpi-klsb.mpg.de/index.php/s/zN3yeWzQB3i5WqE) (92M)
  * Unpack them under `./data/train` and `./data/test`
  * Hint: you can execute shell scripts from notebooks using the `!` prefix, e.g., `! wget <url>`

In [9]:
# Check that you are prepared with the data
! printf '# train examples (Should be 13682) : '; ls data/train | wc -l
! printf '# test  examples (Should be 10000) : '; ls data/test | wc -l

# train examples (Should be 13682) :    13682
# test  examples (Should be 10000) :    10000


Now that you're set, let's briefly look at the data you have been handed.
Each file encodes the behavior report of a program (potentially a malware), using an encoding scheme called "The Malware Instruction Set" (MIST for short).
At this point, we highly recommend you briefly read-up Sec. 2 of the [MIST](http://www.mlsec.org/malheur/docs/mist-tr.pdf) documentation.

You will find each file named as `filename.<malwarename>`:
```
» ls data/train | head
00005ecc06ae3e489042e979717bb1455f17ac9d.NothingFound
0008e3d188483aeae0de62d8d3a1479bd63ed8c9.Basun
000d2eea77ee037b7ef99586eb2f1433991baca9.Patched
000d996fa8f3c83c1c5568687bb3883a543ec874.Basun
0010f78d3ffee61101068a0722e09a98959a5f2c.Basun
0013cd0a8febd88bfc4333e20486bd1a9816fcbf.Basun
0014aca72eb88a7f20fce5a4e000c1f7fff4958a.Texel
001ffc75f24a0ae63a7033a01b8152ba371f6154.Texel
0022d6ba67d556b931e3ab26abcd7490393703c4.Basun
0028c307a125cf0fdc97d7a1ffce118c6e560a70.Swizzor
...
```
and within each file, you will see a sequence of individual system calls monitored during the run-time of the binary - a malware named 'Basun' in this case:
```
» head data/train/000d996fa8f3c83c1c5568687bb3883a543ec874.Basun
# process 000006c8 0000066a 022c82f4 00000000 thread 0001 #
02 01 | 000006c8 0000066a 00015000
02 02 | 00006b2c 047c8042 000b9000
02 02 | 00006b2c 047c8042 00108000
02 02 | 00006b2c 047c8042 00153000
02 02 | 00006b2c 047c8042 00091000
02 02 | 00006b2c 047c8042 00049000
02 02 | 00006b2c 047c8042 000aa000
02 02 | 00006b2c 047c8042 00092000
02 02 | 00006b2c 047c8042 00011000
...
```
(**Note**: Please ignore the first line that begins with `# process ...`.)

Your task in this project is to train a malware detector, which given the sequence of system calls (in the MIST-formatted file like above), predicts one of 10 classes: `{ Agent, Allaple, AutoIt, Basun, NothingFound, Patched, Swizzor, Texel, VB, Virut }`, where `NothingFound` roughly means that no malware is present.
In terms of machine learning terminology, your malware detector $F: X \rightarrow Y$ should learn a mapping from the MIST-encoded behaviour report (the input $x \in X$) to the malware class $y \in Y$.

Consequently, you will primarily tackle two challenges in this project:
  1. "Vectorizing" the input data i.e., representing each input (file) as a tensor
  1. Training an ML model
  

### Some tips:
  * Begin with an extremely simple representation/ML model and get above chance-level classification performance
  * Choose your evaluation metric wisely
  * Save intermediate computations (e.g., a token to index mapping). This will avoid you parsing the entire dataset for every experiment
  * Try using `multiprocessing.Pool` to parallelize your `for` loops
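
The tip about saving intermediate computations can be sketched as a small caching helper. This is only an illustration: the helper name `cached`, the file name `token_to_idx.pkl` and the toy mapping are assumptions, not part of the hand-out.

```python
import os
import pickle
import tempfile

def cached(path, compute):
    """Load a pickled result from `path` if it exists, otherwise compute and store it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def expensive_mapping():
    # Stand-in for a full pass over the dataset (hypothetical toy mapping).
    calls.append(1)
    return {("02 02",): 0, ("02 01",): 1}

cache_path = os.path.join(tempfile.mkdtemp(), "token_to_idx.pkl")
first = cached(cache_path, expensive_mapping)   # computes and writes the file
second = cached(cache_path, expensive_mapping)  # loads from disk, no recomputation
```

With a helper like this, later experiments can reuse the stored mapping instead of parsing every file again.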

---

# 1. Vectorize Data

## 1.a. Load Raw Data

In [10]:
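def load_content(filepath):
    '''Given a filepath, returns (content, classname), where content = [list of lines in file]'''
    with open(filepath) as file:
        # line[:-2] drops the last two characters of every line (assumes a two-character line ending)
        lines = list(map(lambda line: line[:-2], file.readlines()))
    # the class name is the text after the last dot, e.g. '<hash>.Basun' -> 'Basun';
    # split(".")[2] only worked for paths that start with './'
    return lines, filepath.rsplit(".", 1)[-1]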


def load_data(data_path, nworkers=10):
    '''Returns each data sample as a tuple (x, y), x = sequence of strings (i.e., syscalls), y = malware program class'''
    raw_data_samples = []
    files = os.listdir(data_path)
    for idx, file in enumerate(files):
        loading(idx, len(files))
        lines, label = load_content(data_path + "/" + file)
        raw_data_samples.append((lines, label))
    return raw_data_samples

In [11]:
print('=> Loading training data ... ')
train_raw_samples = load_data('./data/train')

=> Loading training data ... 
100.0%

In [13]:
project_mode = 'test'    # one of: trainval, test, debug
np.random.seed(123)          # To perform the same split across multiple runs

train = list()
val = list()
test = list()

if project_mode == 'trainval':
    l = len(train_raw_samples)
    n = int(round(l * 0.8))
    train = train_raw_samples[0:n]
    val = train_raw_samples[n:l]
    
elif project_mode == 'test':
    print('=> Loading testing data ... ')
    test = load_data('./data/test')
    
elif project_mode == 'debug':
    l = len(train_raw_samples)
    n = int(round(l * 0.1))
    train = train_raw_samples[0:n]
    val = train_raw_samples[n+1:2*n+1]
else:
    raise ValueError('Unrecognized mode')
    
print('=> # Train samples = ', len(train))
print('=> # Val samples = ', len(val))
print('=> # Test  samples = ', len(test))

=> Loading testing data ... 
=> # Train samples =  0
=> # Val samples =  0
=> # Test  samples =  10000
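
The `trainval` split above takes the first 80% of the samples in directory-listing order, so the random seed has no effect on it. A minimal sketch of a seeded, shuffled split is shown below; the helper name `shuffled_split` and its defaults are assumptions made for illustration.

```python
import random

def shuffled_split(samples, train_fraction=0.8, seed=123):
    """Shuffle a copy of `samples` with a fixed seed, then cut it into (train, val)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Toy input: ten integers stand in for (lines, label) samples.
train_part, val_part = shuffled_split(range(10))
```

Because the generator is seeded, repeated runs produce the same split, which keeps results comparable across experiments.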


## 1.b. Vectorize: Setup

Make one pass over the inputs to identify relevant features/tokens.

Suggestion:
  - identify tokens (e.g., unigrams, bigrams)
  - create a token -> index (int) mapping. Note that you might have >10K unique tokens, so you will have to choose a suitable "vocabulary" size.
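
The suggestion above can be sketched with `collections.Counter`. This is a compact illustration rather than the implementation used in this notebook: the function names `count_ngrams` and `build_vocab` are hypothetical, the toy reports are made up, and indices start at 0 here.

```python
from collections import Counter

def count_ngrams(lines, n=1):
    """Count every run of n consecutive lines in one report."""
    return Counter(tuple(lines[i:i + n]) for i in range(len(lines) - n + 1))

def build_vocab(reports, vocab_size, n=1):
    """Map the vocab_size most frequent n-grams to indices 0..vocab_size-1."""
    totals = Counter()
    for lines in reports:
        totals.update(count_ngrams(lines, n))
    most_common = totals.most_common(vocab_size)
    return {token: idx for idx, (token, _) in enumerate(most_common)}

# Toy reports (made-up lines, not real MIST data).
reports = [["02 01", "02 02", "02 02"], ["02 02", "03 01"]]
vocab = build_vocab(reports, vocab_size=2)
```

Setting `n=2` counts bigrams with the same code, at the cost of a much larger set of candidate tokens.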

In [21]:

import copy

def get_key_idx_map(input_sets, vocab_size, n=1):
    '''Builds the vocabulary from the `vocab_size` most frequent n-grams.
    Returns (key_to_idx, idx_to_key); indices are ranks starting at 1.'''
    counts = dict()  # mapping: n-gram (tuple of lines) -> total count
    for idx, input_set in enumerate(input_sets):
        loading(idx, len(input_sets))
        counts = aggregate_dicts(counts, ngrams(input_set, n))
    key_to_idx = cutoff_and_rank(counts, vocab_size)
    idx_to_key = invert_dict(key_to_idx)
    return key_to_idx, idx_to_key

def ngrams(input_set, n):
    new_counter = dict()
    for idx in range(len(input_set)):
        seq = ngram(input_set, idx, n)
        if seq in new_counter:
            new_counter[seq] += 1
        elif len(seq) == n:
            new_counter[seq] = 1
    return new_counter

def ngram(input_set, idx, n):
    return tuple(input_set[idx:idx + n])

def aggregate_dicts(d1, d2):
    '''Returns a new dict holding the per-key sums of the counts in d1 and d2.'''
    new_dict = copy.deepcopy(d1)
    for key, count in d2.items():
        new_dict[key] = new_dict.get(key, 0) + count
    return new_dict

def cutoff_and_rank(counter, vocab_size):
    new_counter = dict()
    counter = copy.deepcopy(counter)
    rank = 1
    vocab_size = min(vocab_size, len(counter.items()))
    for _ in range(vocab_size):
        max_value = max(counter, key=counter.get)
        new_counter[max_value] = rank
        counter[max_value] = 0
        rank += 1
    return new_counter

def invert_dict(d):
    return {value: key for key, value in d.items()}

In [15]:
MAX_VOCAB_SIZE = 50
path = 'application_vocab_{}.pkl'.format(MAX_VOCAB_SIZE)

In [634]:
uni_grams = get_key_idx_map(list(map(lambda s: s[0], train)), MAX_VOCAB_SIZE)

In [635]:
with open(path, 'wb') as file:
    d = {'token_to_idx': uni_grams[0],
          'idx_to_token': uni_grams[1]}
    pickle.dump(d, file)

## 1.c. Vectorize Data

Use the (token $\rightarrow$ index) mapping you created before to vectorize your data
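
A count vector can be built in a single pass over the file by looking up each n-gram in the mapping. The sketch below assumes a mapping with contiguous indices starting at 0 (the mapping built in this notebook starts at 1, so it would need an offset); `vectorize_counts` and `toy_vocab` are hypothetical names.

```python
def vectorize_counts(lines, token_to_idx, n=1):
    """Return a count vector: position i holds how often the n-gram with index i occurs."""
    vector = [0] * len(token_to_idx)
    for i in range(len(lines) - n + 1):
        idx = token_to_idx.get(tuple(lines[i:i + n]))
        if idx is not None:  # n-grams outside the vocabulary are ignored
            vector[idx] += 1
    return vector

# Toy vocabulary and report (made-up lines, not real MIST data).
toy_vocab = {("02 02",): 0, ("02 01",): 1}
example = vectorize_counts(["02 01", "02 02", "02 02", "03 01"], toy_vocab)
```

The cost is one dictionary lookup per line, independent of the vocabulary size.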

In [16]:
with open(path, 'rb') as file:
    d = pickle.load(file)
    token_to_idx = d['token_to_idx']
    idx_to_token = d['idx_to_token']

In [19]:
def vectorize_raw_samples(raw_samples, nworkers=10):
    '''Turns each (lines, label) sample into (count vector, label). `nworkers` is currently unused.'''
    vectorized_samples = []
    for idx, sample in enumerate(raw_samples):
        loading(idx, len(raw_samples))
        lines, label = sample
        vectorized_samples.append((vectorize(lines), label))
    return vectorized_samples

def vectorize(sample):
    counter = list()
    for idx in range(len(token_to_idx)):
        elem = idx_to_token[idx + 1]
        count = occurence(sample, elem)
        counter.append(count)
    return counter

def occurence(lst, obj):
    count = 0
    for idx in range(len(lst)):
        seq = ngram(lst, idx, len(obj))
        if seq == obj:
            count += 1
    return count

In [47]:
print('=> Processing: Train')
train_data = vectorize_raw_samples(train)

=> Processing: Train
100.0%

In [None]:
print('=> Processing: Val')
val_data = vectorize_raw_samples(val)

=> Processing: Val
100.0%

In [22]:
print('=> Processing: Test')
test_data = vectorize_raw_samples(test)

=> Processing: Test
100.0%

In [23]:
train_x, train_y, val_x, val_y, test_x, test_y = list(), list(), list(), list(), list(), list()
    
if project_mode == 'trainval' or project_mode == 'debug':
    for vector, label in train_data:
        train_x.append(vector)
        train_y.append(class_to_int(label))
    for vector, label in val_data:
        val_x.append(vector)
        val_y.append(class_to_int(label))
else:
    for vector, label in test_data:
        test_x.append(vector)
        test_y.append(class_to_int(label))

In [24]:
assert(len(train_x) == len(train_y))
assert(len(val_x) == len(val_y))
assert(len(test_x) == len(test_y))

if project_mode == 'trainval' or project_mode == 'debug':
    trainset = TensorDataset(torch.tensor(train_x).float(), torch.tensor(train_y))
    valset = TensorDataset(torch.tensor(val_x).float(), torch.tensor(val_y))
    torch.save(trainset, 'trainset.pt')
    torch.save(valset, 'valset.pt')
else:
    testset = TensorDataset(torch.tensor(test_x).float(), torch.tensor(test_y))
    torch.save(testset, 'testset.pt')

In [25]:
test_data[:10]

[([297, 263, 6714, 6704, 6702, 3351, 3352, 1038, 35, 401, 16, 12, 0, 0, 348, 348,
   346, 346, 346, 346, 346, 346, 288, 0, 199, 0, 0, 182, 172, 24, 173, 173, 173,
   173, 0, 0, 0, 51, 0, 0, 0, 159, 0, 0, 0, 25, 25, 25, 0, 4],
  'Basun'),
 ([36790, 36785, 0, 0, 0, 0, 0, 1776, 4, 612, 2, 0, 0, 0, 592, 600, 592, 592, 592,
   592, 592, 592, 0, 0, 310, 0, 0, 305, 298, 1, 296, 296, 296, 296, 0, 0, 0, 18,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
  'Basun'),
 ([3759, 1, 0, 0, 0, 0, 0, 0, 44, 38, 14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   31, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 45, 0, 0, 0, 0, 0

# 2. Train Model

You will now train an ML model on the vectorized datasets you created previously.

_Note_: Although we often refer to each input as a 'vector' for simplicity, each of your inputs can also be higher dimensional tensors.

## 2.a. Helpers

In [13]:
def highest(tensor):
    max_value = torch.max(tensor)
    for idx, val in enumerate(tensor):
        if val == max_value:
            return idx

def save_data(eval_data, out_path):
    with open(out_path, 'wb') as wf:
        pickle.dump(eval_data, wf)  # dump to the open file handle, not to the path string
        
def load_set():
    trainset = torch.load('trainset.pt')
    valset = torch.load('valset.pt')
    testset = torch.load('testset.pt')
    return trainset, valset, testset

In [14]:
trainset, valset, testset = load_set()

## 2.b. Define Model

Describe your model here.

In [23]:
# Fully connected neural networks
class Net(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(Net, self).__init__()
        # Layer definitions
        self.fc1 = nn.Linear(input_dim, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, output_dim)
        

    def forward(self, x):
        # Forward
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return x

## 2.c. Set Hyperparameters

In [32]:
# Hyperparameters are defined here
in_dims = trainset[0][0].shape[0]
out_dims = 11

n_epochs = 50
batch_size = 8
lr = 0.0002

malwareClassifier = Net(in_dims, out_dims)
loss_func = nn.CrossEntropyLoss()

## 2.d. Train your Model

In [33]:
loss_values = []
def train_model(model, 
                train_loader,
                valset,
                num_epochs,
                learning_rate,
                loss_function):
    model.train()
    optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.5)
    
    for epoch in tqdm(range(num_epochs)):
        print(f'---------- Started Epoch {epoch} ----------')
        model.train()  # back to training mode; the validation step below switches the model to eval mode
        running_loss = 0.0
        
        for train_data in train_loader:
            
            optimizer.zero_grad()
            
            inputs, targets = train_data
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
            
            loss.backward()
            optimizer.step()
            
            running_loss += loss.item()
        # every 10 epochs, print accuracy once
        if epoch % 10 == 9:
            with torch.no_grad():
                model.eval()
                pred = model(valset[:][0])
                accuracy = (torch.argmax(pred, -1) == valset[:][1]).float().mean()
                print(f"Accuracy: {accuracy}")
        loss_value = running_loss/len(train_loader)
        loss_values.append(loss_value)
        # each epoch, print loss
        print("Loss: {:.4f}".format(loss_value))

In [35]:
# Data Loaders
trainloader = DataLoader(trainset, batch_size=batch_size, shuffle=True)

train_model(model=malwareClassifier, train_loader=trainloader, valset=valset, num_epochs=n_epochs,
            loss_function=loss_func, learning_rate=lr)

---------- Started Epoch 0 ----------
Loss: 0.8916
---------- Started Epoch 1 ----------
Loss: 0.8892
---------- Started Epoch 2 ----------
Loss: 0.8821
---------- Started Epoch 3 ----------
Loss: 0.8811
---------- Started Epoch 4 ----------
Loss: 0.8747
---------- Started Epoch 5 ----------
Loss: 0.8669
---------- Started Epoch 6 ----------
Loss: 0.8654
---------- Started Epoch 7 ----------
Loss: 0.8600
---------- Started Epoch 8 ----------
Loss: 0.8673
---------- Started Epoch 9 ----------
Accuracy: 0.7174707651138306
Loss: 0.8515
---------- Started Epoch 10 ----------
Loss: 0.8540
---------- Started Epoch 11 ----------
Loss: 0.8427
---------- Started Epoch 12 ----------
Loss: 0.8428
---------- Started Epoch 13 ----------
Loss: 0.8337
---------- Started Epoch 14 ----------
Loss: 0.8324
---------- Started Epoch 15 ----------
Loss: 0.8400
---------- Started Epoch 16 ----------
Loss: 0.8331
---------- Started Epoch 17 ----------
Loss: 0.8270
---------- Started Epoch 18 ----------
Loss: 0.8294
---------- Started Epoch 19 ----------
Accuracy: 0.7236841917037964
Loss: 0.8251
---------- Started Epoch 20 ----------
Loss: 0.8167
---------- Started Epoch 21 ----------
Loss: 0.8096
---------- Started Epoch 22 ----------
Loss: 0.8118
---------- Started Epoch 23 ----------
Loss: 0.8096
---------- Started Epoch 24 ----------
Loss: 0.8346
---------- Started Epoch 25 ----------
Loss: 0.8233
---------- Started Epoch 26 ----------
Loss: 0.8079
---------- Started Epoch 27 ----------
Loss: 0.8051
---------- Started Epoch 28 ----------
Loss: 0.7992
---------- Started Epoch 29 ----------
Accuracy: 0.7244151830673218
Loss: 0.7931
---------- Started Epoch 30 ----------
Loss: 0.7902
---------- Started Epoch 31 ----------
Loss: 0.7864
---------- Started Epoch 32 ----------
Loss: 0.7810
---------- Started Epoch 33 ----------
Loss: 0.7794
---------- Started Epoch 34 ----------
Loss: 0.7789
---------- Started Epoch 35 ----------
Loss: 0.7762
---------- Started Epoch 36 ----------
Loss: 0.7835
---------- Started Epoch 37 ----------
Loss: 0.7913
---------- Started Epoch 38 ----------
Loss: 0.7739
---------- Started Epoch 39 ----------
Accuracy: 0.7266082167625427
Loss: 0.7710
---------- Started Epoch 40 ----------
Loss: 0.7722
---------- Started Epoch 41 ----------
Loss: 0.7707
---------- Started Epoch 42 ----------
Loss: 0.7655
---------- Started Epoch 43 ----------
Loss: 0.7615
---------- Started Epoch 44 ----------
Loss: 0.7611
---------- Started Epoch 45 ----------
Loss: 0.7618
---------- Started Epoch 46 ----------
Loss: 0.7572
---------- Started Epoch 47 ----------
Loss: 0.7575
---------- Started Epoch 48 ----------
Loss: 0.7519
---------- Started Epoch 49 ----------
Accuracy: 0.7339181303977966
Loss: 0.7505

100%|██████████| 50/50 [00:33<00:00,  1.49it/s]





## 2.e. Evaluate model

In [36]:
def test(model):
    model.eval()
    correct = 0
    total = len(testset)
    with torch.no_grad():
        for inputs, labels in testset:
            outputs = model(inputs)
            predicted = highest(outputs.data)
            correct += (predicted == labels).sum().item() 
    accuracy = correct / total
    print(f"Test Accuracy: {accuracy}")

In [37]:
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import warnings 
warnings.filterwarnings('ignore')

In [38]:
# def test(model):
#     model.eval()
#     correct = 0
#     total = len(valset)
#     with torch.no_grad():
#         for inputs, labels in valset:
#             outputs = model(inputs)
#             predicted = highest(outputs.data)
#             correct += (predicted == labels).sum().item() 
#     accuracy = correct / total
#     print(f"Test Accuracy: {accuracy}")

In [39]:
# `test` as defined above evaluates on `testset`; the commented-out variant in the previous cell evaluates on `valset` instead
test(malwareClassifier)

Test Accuracy: 0.743


In [40]:
predictions = torch.argmax(malwareClassifier(valset[:][0]), -1)
# use the classification report to identify which malware classes are difficult to detect
print(classification_report(valset[:][1], predictions))
# label 1 is missing from the report below; we could not find label 1 in the processed data either
# note: these predictions are made on the validation set; they should be repeated on the test set

              precision    recall  f1-score   support

           0       0.62      0.48      0.54       690
           2       0.00      0.00      0.00        94
           3       0.73      1.00      0.84       415
           4       0.65      0.89      0.75       654
           5       0.89      0.98      0.94       103
           6       0.91      0.76      0.83       251
           7       0.80      0.66      0.72       155
           8       0.99      0.99      0.99       167
           9       0.70      0.08      0.14        90
          10       0.98      0.99      0.99       117

    accuracy                           0.73      2736
   macro avg       0.73      0.68      0.67      2736
weighted avg       0.71      0.73      0.70      2736
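
The macro average in the report above weights every class equally, while accuracy and the weighted average are dominated by the large classes. A minimal sketch of this difference for recall, on made-up labels (the helper `per_class_recall` is hypothetical and not part of scikit-learn):

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall per class = correctly predicted samples of the class / all samples of the class."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / support[c] for c in sorted(support)}

# Made-up labels: class 2 is never predicted correctly.
y_true = [0, 0, 0, 0, 2, 2, 3, 3]
y_pred = [0, 0, 0, 3, 0, 0, 3, 3]

recall = per_class_recall(y_true, y_pred)
macro_recall = sum(recall.values()) / len(recall)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the completely missed class pulls the macro recall below the accuracy, which is why a macro-averaged metric is the more informative choice when classes are imbalanced.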



In [41]:
rf = RandomForestClassifier(max_depth=5, random_state=222).fit(trainset[:][0], trainset[:][1])

In [42]:
print(classification_report(valset[:][1], rf.predict(valset[:][0])))

              precision    recall  f1-score   support

           0       0.48      0.45      0.46       690
           2       0.00      0.00      0.00        94
           3       0.75      0.96      0.85       415
           4       0.58      0.88      0.70       654
           5       0.94      0.66      0.78       103
           6       0.86      0.73      0.79       251
           7       0.00      0.00      0.00       155
           8       1.00      0.99      1.00       167
           9       0.00      0.00      0.00        90
          10       0.98      0.99      0.99       117

    accuracy                           0.66      2736
   macro avg       0.56      0.57      0.56      2736
weighted avg       0.59      0.66      0.62      2736



In [53]:
svm = make_pipeline(StandardScaler(), SVC(gamma='auto')).fit(trainset[:][0], trainset[:][1])
print(classification_report(valset[:][1], svm.predict(valset[:][0])))

              precision    recall  f1-score   support

           0       0.45      0.11      0.17       690
           2       0.00      0.00      0.00        94
           3       0.73      1.00      0.84       415
           4       0.38      0.98      0.54       654
           5       0.00      0.00      0.00       103
           6       0.71      0.04      0.08       251
           7       0.00      0.00      0.00       155
           8       1.00      0.99      1.00       167
           9       0.00      0.00      0.00        90
          10       0.99      0.99      0.99       117

    accuracy                           0.52      2736
   macro avg       0.43      0.41      0.36      2736
weighted avg       0.48      0.52      0.41      2736



In [11]:
xgboost = XGBClassifier().fit(trainset[:][0].numpy(), trainset[:][1].numpy())
print(classification_report(valset[:][1], xgboost.predict(valset[:][0].numpy())))

              precision    recall  f1-score   support

           0       0.71      0.68      0.70       690
           2       0.40      0.02      0.04        94
           3       0.77      0.91      0.83       415
           4       0.74      0.89      0.81       654
           5       0.99      1.00      1.00       103
           6       0.92      0.83      0.87       251
           7       0.93      0.76      0.84       155
           8       1.00      0.99      1.00       167
           9       0.80      0.44      0.57        90
          10       0.98      1.00      0.99       117

    accuracy                           0.80      2736
   macro avg       0.82      0.75      0.76      2736
weighted avg       0.79      0.80      0.78      2736



## 2.f. Save Model + Data

In [43]:
# save the model 
torch.save(malwareClassifier.state_dict(), "./malwareClassifier_parameters.pt")

---

# 3. Analysis

## 3.a. Summary: Main Results

Summarize your approach and results here

* First, we loaded the data and counted how often each syscall line occurs across the training files.
* We stored the 50 most frequent ones in a dictionary (token $\rightarrow$ index). These define the vocabulary used to vectorize files.
* To vectorize a file, we count how often each of these 50 tokens occurs in it and collect the counts in a 50-dimensional vector.
* This turns every MIST file into a feature vector for classification.
* On these vectors we trained a fully connected neural network (MLP).
* The MLP reached about 73% validation accuracy (0.743 in the final `test` run). This result depended heavily on hyperparameters such as batch size.
* Of the other classifiers we tried on the same vectors, XGBoost scored highest on the validation set (0.80 accuracy), ahead of the random forest (0.66) and the SVM (0.52).

## 3.b. Discussion

Enter your final summary here.

For instance, you can address:
- What was the performance you obtained with the simplest approach?
- Which vectorized input representations helped more than the others?
- Which malwares are difficult to detect and why?
- Which approach do you recommend to perform malware classification?

* Training the model was fast: 50 epochs took about 33 seconds. The most expensive step by far was vectorizing the data, because our search for unigrams is not particularly optimized. Still, we think the simple unigram approach helped, since it ran considerably faster than bigrams (which we also tested).
* Overall, the simplest approach gave the best runtime and decent results, with a final accuracy of about 73%.
* Label 1 (`Basun` in `malware_map`) does not appear in our classification reports at all, so we cannot judge how hard it is to detect. We did not find the reason for this. Among the labels that do appear, label 2 was the hardest: the MLP, the random forest and the SVM all reached a recall of 0.00 on it, and XGBoost only 0.02.
* For a simple, if imperfect, approach we recommend vectorizing MIST files by counting uni- or bigrams. More elaborate approaches could look for specific patterns of syscalls that co-occur, which might give better results. Note that this can increase the vectorization runtime significantly. For classification, a fully connected neural network is a simple but rather effective choice, as we demonstrated.