<a href="https://colab.research.google.com/github/OctoberFall/SoK-Security/blob/main/PDF_malware_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a multi-layer perceptron classifier to classify PDF samples as benign or malicious

This black-box PDF classifier is then used to evaluate explanations on a test PDF sample.

Reference: Evaluating Explanation Methods for Deep Learning in Security, A. Warnecke, D. Arp, C. Wressnegger and K. Rieck, IEEE European Symposium on Security and Privacy (Euro S&P), 2020. [GitHub](https://github.com/alewarne/explain-mlsec)

### Run each cell to execute the code. We first download a zipped folder from Google Drive that contains all the files required for this notebook.

Google drive link: https://drive.google.com/file/d/19xRRmpJiaqXiVc5YkA4E_qBUQvKxkaeH/view?usp=sharing 

In [4]:
!gdown 19xRRmpJiaqXiVc5YkA4E_qBUQvKxkaeH

Downloading...
From: https://drive.google.com/uc?id=19xRRmpJiaqXiVc5YkA4E_qBUQvKxkaeH
To: /content/pdf_data.zip
100% 2.11M/2.11M [00:00<00:00, 163MB/s]


#### Unzip the downloaded archive into the current working directory.

In [5]:
!unzip "/content/pdf_data.zip"

Archive:  /content/pdf_data.zip
   creating: pdf_data/
  inflating: pdf_data/Columns.txt    
  inflating: pdf_data/custom_metrics.py  
   creating: pdf_data/data/
  inflating: pdf_data/data/contagio-all.csv  
   creating: pdf_data/lemna/
  inflating: pdf_data/lemna/0        
  inflating: pdf_data/lemna/betaspickle.pkl  
  inflating: pdf_data/lemna/sigmas.npy  
   creating: pdf_data/lime/
  inflating: pdf_data/lime/relevances_lime.pkl  
   creating: pdf_data/models/
  inflating: pdf_data/models/keras_model.h5  
   creating: pdf_data/perturbation/
  inflating: pdf_data/perturbation/linreg_representations_seed_40.pkl  
  inflating: pdf_data/perturbation/perturbation_labels_seed_40.npy  
  inflating: pdf_data/utils.py       


In [6]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split

In [7]:

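path_to_csv = "pdf_data/data/contagio-all.csv"
non_relevant_columns = [1]  # the filename column is not relevant to the classifier
label_column = 0
# Read everything as strings first, because the filename column is not numeric
arr = np.genfromtxt(path_to_csv, dtype=str, delimiter=',', skip_header=0)
filenames = arr[1:, 1]  # skip the header row
no_features = arr.shape[1]  # total number of CSV columns, including label and filename
no_features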

137

In [8]:
columns_to_use = [i for i in range(no_features) if i not in non_relevant_columns]
arr = np.genfromtxt(path_to_csv, dtype=float, delimiter=',', skip_header=1, usecols=columns_to_use)
labels = arr[:, label_column]
# One-hot encode the labels: 0 -> [1, 0], any other value -> [0, 1]
labels = np.array([[1, 0] if l == 0 else [0, 1] for l in labels])
data = np.delete(arr, label_column, axis=1)  # drop the label column, keeping only features
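The label handling above can be illustrated on a toy array. This is a minimal sketch with made-up values (not taken from `contagio-all.csv`), and it assumes the labels are exactly 0 or 1, whereas the notebook's list comprehension maps every non-zero label to `[0, 1]`.

```python
import numpy as np

# Toy stand-in for the loaded array: column 0 is the label, the rest are features.
# The values are invented for illustration only.
toy = np.array([
    [0.0, 3.0, 0.0, 1.0],
    [1.0, 0.0, 2.0, 0.0],
    [1.0, 5.0, 0.0, 0.0],
])

toy_labels = toy[:, 0].astype(int)

# Row i of the 2x2 identity matrix is the one-hot vector for class i:
# label 0 -> [1, 0], label 1 -> [0, 1]
one_hot = np.eye(2, dtype=int)[toy_labels]

# Drop the label column so only the feature columns remain
toy_features = np.delete(toy, 0, axis=1)

print(one_hot)
print(toy_features.shape)
```

Indexing an identity matrix is a common vectorised alternative to a Python loop; for a two-class problem both produce the same array.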


In [9]:
random_seed = 123456
vec_output = True
loss = 'binary_crossentropy'
binary_encoding = True

In [10]:
if binary_encoding:
    data[np.where(data != 0)] = 1
else:
    data = normalize(data, 'max', axis=0)
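To make the two preprocessing options concrete, here is a small NumPy-only sketch on invented count data. The binary branch mirrors the cell above; the scaling branch is an approximation of what `normalize(data, 'max', axis=0)` computes for non-negative features, written by hand rather than by calling scikit-learn.

```python
import numpy as np

# Invented feature counts: 3 samples, 3 features
counts = np.array([
    [0.0, 4.0, 10.0],
    [2.0, 0.0, 5.0],
    [1.0, 2.0, 0.0],
])

# Binary encoding: any non-zero count becomes 1
binary = (counts != 0).astype(float)

# Column-wise max scaling: divide each feature by its largest absolute value
col_max = np.abs(counts).max(axis=0)
col_max[col_max == 0] = 1.0  # avoid division by zero for all-zero columns
scaled = counts / col_max

print(binary)
print(scaled)
```

Binary encoding discards how often a feature occurs and keeps only whether it occurs, while max scaling keeps relative magnitudes within each feature.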

In [11]:
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.25, random_state=random_seed)
_, filenames_test = train_test_split(filenames, test_size=0.25, random_state=random_seed)
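The second call above relies on the fact that a split driven by the same seed over an array of the same length selects the same rows, which keeps `filenames_test` aligned with `x_test`. The sketch below illustrates that idea with a seeded NumPy permutation. The helper `split_indices` is hypothetical and is not how scikit-learn implements `train_test_split` internally.

```python
import numpy as np

def split_indices(n_samples, test_size, seed):
    """Hypothetical helper: return (train_idx, test_idx) from a seeded shuffle."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_samples)
    n_test = int(np.ceil(test_size * n_samples))
    return order[n_test:], order[:n_test]

# Toy data: feature row i belongs to file_i.pdf
features = np.arange(8).reshape(8, 1)
names = np.array(["file_{}.pdf".format(i) for i in range(8)])

# Two independent calls with the same seed and the same number of samples
_, test_a = split_indices(len(features), 0.25, seed=123456)
_, test_b = split_indices(len(names), 0.25, seed=123456)

x_test_toy = features[test_a]
names_test_toy = names[test_b]

# Both calls picked the same rows, so features and filenames stay aligned
print(np.array_equal(test_a, test_b))
```

With scikit-learn itself, passing all three arrays to a single `train_test_split(data, labels, filenames, ...)` call achieves the same alignment without depending on a repeated seed.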


## Add an MLP network

In [23]:
import numpy as np
from sklearn.metrics import confusion_matrix
# Use the Keras API bundled with TensorFlow consistently
from tensorflow import keras
from tensorflow.keras.layers import Dense, Dropout

In [24]:
no_features=x_train.shape[1]
vec_output = True
final_nonlinearity = 'softmax'
optimizer = 'adam'
loss = 'binary_crossentropy'
epochs = 100
batch_size = 32

In [25]:
model = keras.Sequential()
model.add(Dense(units=200, activation='relu', input_shape=(no_features,)))
model.add(Dropout(rate=0.5))
model.add(Dense(units=200, activation='relu'))
model.add(Dropout(rate=0.5))
model.add(Dense(units=2, activation=final_nonlinearity))
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
# The held-out test split is also passed as validation data to monitor training
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(x_test, y_test), verbose=2)

Epoch 1/100
235/235 - 2s - loss: 0.1338 - accuracy: 0.9568 - val_loss: 0.0314 - val_accuracy: 0.9908 - 2s/epoch - 7ms/step
Epoch 2/100
235/235 - 1s - loss: 0.0459 - accuracy: 0.9876 - val_loss: 0.0340 - val_accuracy: 0.9860 - 699ms/epoch - 3ms/step
Epoch 3/100
235/235 - 1s - loss: 0.0427 - accuracy: 0.9868 - val_loss: 0.0240 - val_accuracy: 0.9924 - 678ms/epoch - 3ms/step
Epoch 4/100
235/235 - 1s - loss: 0.0331 - accuracy: 0.9904 - val_loss: 0.0226 - val_accuracy: 0.9928 - 672ms/epoch - 3ms/step
Epoch 5/100
235/235 - 1s - loss: 0.0370 - accuracy: 0.9879 - val_loss: 0.0205 - val_accuracy: 0.9936 - 656ms/epoch - 3ms/step
Epoch 6/100
235/235 - 1s - loss: 0.0294 - accuracy: 0.9903 - val_loss: 0.0242 - val_accuracy: 0.9936 - 679ms/epoch - 3ms/step
Epoch 7/100
235/235 - 1s - loss: 0.0304 - accuracy: 0.9900 - val_loss: 0.0176 - val_accuracy: 0.9944 - 655ms/epoch - 3ms/step
Epoch 8/100
235/235 - 1s - loss: 0.0271 - accuracy: 0.9915 - val_loss: 0.0188 - val_accuracy: 0.9932 - 668ms/epoch - 3ms/

<keras.callbacks.History at 0x7fec32789590>
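For readers who want to see what the network computes at prediction time, the following NumPy sketch reproduces the layer structure (two ReLU layers of 200 units and a 2-unit softmax output) with random weights. It is an illustration only, not the trained model. The input width of 135 is an assumption derived from the 137 CSV columns minus the label and filename columns, and dropout is omitted because Keras disables it during inference.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row-wise maximum for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

n_in = 135  # assumed: 137 CSV columns minus label and filename
layer_sizes = [n_in, 200, 200, 2]

# Random weights as placeholders for the trained parameters
rng = np.random.RandomState(0)
weights = [rng.normal(scale=0.05, size=(a, b))
           for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(b) for b in layer_sizes[1:]]

def forward(x):
    h = x
    # Hidden layers: affine transform followed by ReLU
    for w, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ w + b)
    # Output layer: affine transform followed by softmax
    return softmax(h @ weights[-1] + biases[-1])

x = rng.randint(0, 2, size=(4, n_in)).astype(float)  # 4 binary feature vectors
probs = forward(x)
n_params = sum(w.size + b.size for w, b in zip(weights, biases))

print(probs.shape, n_params)
```

Each output row is a pair of class probabilities that sums to 1, matching the one-hot label layout `[1, 0]` / `[0, 1]` used for training.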

In [26]:
model.save(filepath="models/keras_model.h5")

In [27]:
# Prints accuracy, precision, recall, FPR and F1 score for the given model and labelled test set
def get_statistics(model, x_test, y_test):
    # Convert softmax outputs and one-hot labels to class indices
    y_pred = np.argmax(model.predict(x_test), axis=1)
    y_test = np.argmax(y_test, axis=1)
    assert len(y_pred) == len(y_test)
    acc = np.sum(y_pred == y_test) / float(len(y_pred))
    # cm[i, j] counts samples with true class i that were predicted as class j
    cm = confusion_matrix(y_test, y_pred)
    TN, FN, TP, FP = cm[0, 0], cm[1, 0], cm[1, 1], cm[0, 1]
    TPR = TP / (TP + FN)  # recall
    FPR = FP / (FP + TN)
    precision = TP / (TP + FP)
    F1 = 2 * TP / (2 * TP + FP + FN)
    print('The model achieved: Accuracy:{}, Precision:{}, Recall:{}, FPR:{}, F1 score:{} on the test set.'.format(
        acc, precision, TPR, FPR, F1))
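The metric formulas in `get_statistics` can be checked by hand on a tiny example. The sketch below uses invented labels and builds the confusion matrix with plain NumPy, following the same convention as scikit-learn (rows are true classes, columns are predicted classes), so it runs without a trained model.

```python
import numpy as np

# Invented class indices: 5 samples of class 0 and 3 of class 1
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0])

# cm[i, j] counts samples with true class i predicted as class j
cm = np.zeros((2, 2), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)

TN, FP, FN, TP = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]

accuracy = (TP + TN) / cm.sum()
precision = TP / (TP + FP)
recall = TP / (TP + FN)  # true positive rate
fpr = FP / (FP + TN)     # false positive rate
f1 = 2 * TP / (2 * TP + FP + FN)

print(cm)
print(accuracy, precision, recall, fpr, f1)
```

Here the counts are TN=3, FP=2, FN=1 and TP=2, which gives an accuracy of 5/8, a precision of 1/2, a recall of 2/3, an FPR of 2/5 and an F1 score of 4/7.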

In [28]:
get_statistics(model, x_test, y_test)

The model achieved: Accuracy:0.9976, Precision:0.9975103734439834, Recall:0.9975103734439834, FPR:0.0023166023166023165, F1 score:0.9975103734439834 on the test set.



## The model obtains excellent results on the test set and is ready for evaluating explanation methods.