# Using Deep Averaging Networks for malware classification


In this notebook we will apply Deep Averaging Networks (DANs) to our malware classification task.
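As a reminder of the idea, a Deep Averaging Network averages its input vectors and passes the result through a few dense layers with a softmax output. The sketch below shows that forward pass in plain NumPy. It is only an illustration: the layer sizes, the random weights and the helper names (`dan_forward`, `relu`, `softmax`) are assumptions, and the network actually built by `cla_dan` is not shown in this notebook.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dan_forward(input_vectors, weights, biases):
    """Average the input vectors, then apply dense layers with a softmax output."""
    h = input_vectors.mean(axis=0)               # the averaging step
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)                      # hidden layers
    return softmax(h @ weights[-1] + biases[-1])

rng = np.random.default_rng(0)
inputs = rng.normal(size=(7, 16))                # 7 hypothetical input vectors of size 16
sizes = [16, 32, 32, 65]                         # 65 outputs, one per malware family (assumed sizes otherwise)
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

probs = dan_forward(inputs, weights, biases)
print(probs.shape, probs.sum())
```

The output is a probability vector over the families; the predicted family would be `probs.argmax()`.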

Let's start by loading some packages necessary for the experiment.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from classification import cla_action, cla_dan
from utilities import constants, evaluation
from preprocessing import pp_action
import plotly.graph_objs as go
import plotly.offline as ply
import tensorflow as tf
import pandas as pd
import numpy as np
import json
import os



In [3]:
# Load the experiment configuration and close the file afterwards
with open('config.json', 'r') as config_file:
    config = json.load(config_file)
ply.init_notebook_mode(connected=True)

## Data selection

We first select a subset of the original dataset. The selected subset is then split into training, development and test sets.


In [4]:
samples_data = pp_action.pre_process(config)
pp_action.split_show_data(samples_data)

Please choose the subset of data to workon on:
l for all labeled samples
k for samples of families mydoom, gepys, lamer, neshta, bladabindi, flystudio, eorezo
s for a small balanced subset
f for a single family
b for a balanced subset of samples
q to quit
b

Would you like to compute the Jensen-Shannon distance matrix for the chosen data? [y/n]
n

20007 train samples belonging to 65 malware families
Malware family:      multiplug       Number of samples:  748  
Malware family:     installcore      Number of samples:  724  
Malware family:       firseria       Number of samples:  720  
Malware family:      outbrowse       Number of samples:  712  
Malware family:       virlock        Number of samples:  707  
Malware family:      loadmoney       Number of samples:  706  
Malware family:        sality        Number of samples:  703  
Malware family:      browsefox       Number of samples:  701  
Malware family:       allaple        Number of samples:  698  
Malware family:         mira  

In [5]:
x_train = samples_data.index[samples_data['train'] == 1].tolist()
x_dev = samples_data.index[samples_data['dev'] == 1].tolist()
x_test = samples_data.index[samples_data['test'] == 1].tolist()
y_train = samples_data.fam_num[samples_data['train'] == 1].tolist()
y_dev = samples_data.fam_num[samples_data['dev'] == 1].tolist()
y_test = samples_data.fam_num[samples_data['test'] == 1].tolist()
y_test_fam = samples_data.family[samples_data['test'] == 1].tolist()
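The label lists above hold one integer per sample, while the training log further down reports label matrices of shape `(65, 20007)`, that is one column per sample and one row per family. A plausible reading is that the labels are one-hot encoded before training. The sketch below shows such an encoding; the helper `one_hot_columns` is hypothetical, and it assumes `fam_num` values are 0-based integers, which this notebook does not confirm.

```python
import numpy as np

def one_hot_columns(labels, num_classes):
    """Return a (num_classes, num_samples) matrix with a single 1 per column."""
    labels = np.asarray(labels)
    encoded = np.zeros((num_classes, labels.size))
    # Row index = class, column index = sample position
    encoded[labels, np.arange(labels.size)] = 1.0
    return encoded

toy_labels = [2, 0, 1, 2]            # toy stand-in for y_train
Y = one_hot_columns(toy_labels, num_classes=3)
print(Y.shape)
```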

## Feature extraction

Since the DAN required a considerable amount of time to train on the full dataset, we will try reducing the dimensionality of the input.

To achieve this we will apply Principal Component Analysis (PCA) to the sparse feature vectors. The cell below loads the precomputed projections.

In [None]:
xm_train_e = np.loadtxt('data/matrix/pca_1024_20007_tr.txt')
xm_dev_e = np.loadtxt('data/matrix/pca_1024_4288_dv.txt')
xm_test_e = np.loadtxt('data/matrix/pca_1024_4287_te.txt')
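The matrices loaded above were computed beforehand, and the code that produced them is not part of this notebook. As an illustration of the technique only, the sketch below fits PCA on a training matrix with a NumPy SVD and applies the same projection to a development matrix. The function names, the toy data and the sizes are assumptions.

```python
import numpy as np

def fit_pca(X_train, n_components):
    """Return the training mean and the top principal directions."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:n_components]

def apply_pca(X, mean, components):
    """Project samples (rows of X) onto the fitted principal directions."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(200, 50))    # toy data: 200 samples, 50 features
X_dv = rng.normal(size=(40, 50))

mean, components = fit_pca(X_tr, n_components=8)
Z_tr = apply_pca(X_tr, mean, components)
Z_dv = apply_pca(X_dv, mean, components)   # reuse the training mean and directions
print(Z_tr.shape, Z_dv.shape)
```

The projection is fitted on the training set only and then reused for the other splits, so no information from the development or test data leaks into the features. Note that centering turns a sparse matrix into a dense one; for large sparse inputs, scikit-learn's `TruncatedSVD` avoids the centering step and may be a better fit.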

## Feature selection

An alternative to feature extraction, which creates a new, artificial set of features, is feature selection. By feature selection we mean a method that tries to isolate, among the natural features of the dataset, those most important for a specific learning task.

We will attempt to select the most relevant features by using random forest classifiers.

In [None]:
xm_train_s = np.loadtxt('data/matrix/rfc_1024_20007_tr.txt')
xm_dev_s = np.loadtxt('data/matrix/rfc_1024_4288_dv.txt')
xm_test_s = np.loadtxt('data/matrix/rfc_1024_4287_te.txt')
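The selected matrices are also loaded from disk, so the selection step itself is not shown here. One common way to do it, sketched below with scikit-learn, is to fit a random forest on the training set and keep the columns with the highest impurity-based importance. The helper `select_top_features`, the number of trees and the toy data are assumptions, not the procedure used to build the files above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_top_features(X_train, y_train, k, seed=0):
    """Return the sorted column indices of the k most important features."""
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    forest.fit(X_train, y_train)
    top = np.argsort(forest.feature_importances_)[::-1][:k]
    return np.sort(top)

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(300, 20))            # toy data: 300 samples, 20 features
y_tr = (X_tr[:, 3] > 0).astype(int)          # toy labels that depend only on column 3
X_dv = rng.normal(size=(50, 20))

selected = select_top_features(X_tr, y_tr, k=5)
X_tr_s = X_tr[:, selected]
X_dv_s = X_dv[:, selected]                   # same columns for every split
print(selected)
```

As with PCA, the column indices are chosen from the training set alone and then applied unchanged to the development and test sets.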

## Classification

Now we can try classification with both data sets.

First with extracted features.

In [None]:
y_predicted, model, modifier = cla_dan.classify(xm_train_e, xm_dev_e, xm_test_e, y_train, y_dev, y_test, config)

In [None]:
evaluation.evaluate_classification(model[0], y_test_fam, y_predicted, model[1])

Now with selected features.

In [None]:
y_predicted, model, modifier = cla_dan.classify(xm_train_s, xm_dev_s, xm_test_s, y_train, y_dev, y_test, config)

In [None]:
evaluation.evaluate_classification(model[0], y_test_fam, y_predicted, model[1])

Let's try a larger number of features (2048 instead of 1024), again selected with the random forest classifier method.

In [6]:
xm_train_s2 = np.loadtxt('data/matrix/rfc_2048_20007_tr.txt')
xm_dev_s2 = np.loadtxt('data/matrix/rfc_2048_4288_dv.txt')
xm_test_s2 = np.loadtxt('data/matrix/rfc_2048_4287_te.txt')

In [None]:
y_predicted, model, modifier = cla_dan.classify(xm_train_s2, xm_dev_s2, xm_test_s2, y_train, y_dev, y_test, config)

X_train shape: (2048, 20007)
Y_train shape: (65, 20007)
X_dev shape: (2048, 4288)
Y_dev shape: (65, 4288)
X_test shape: (2048, 4287)
Y_test shape: (65, 4287)
Cost after epoch 0: 1.437555
Train Accuracy: 0.880291
Dev Accuracy: 0.868237
Learning Rate: 0.000999682

Cost after epoch 100: 0.047710
Train Accuracy: 0.988303
Dev Accuracy: 0.948694
Learning Rate: 0.000968352

Cost after epoch 200: 0.037709
Train Accuracy: 0.991502
Dev Accuracy: 0.945429
Learning Rate: 0.000938004

Cost after epoch 300: 0.031411
Train Accuracy: 0.993202
Dev Accuracy: 0.944963
Learning Rate: 0.000908608
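The logged learning rates shrink by a nearly constant factor every 100 epochs, which is what an exponential decay schedule would produce. The sketch below checks this using only the values printed above. The actual schedule is set inside `cla_dan` and `config`, neither of which is shown here, so this is an observation about the log rather than a description of the implementation.

```python
# Learning rates copied from the training log above, keyed by epoch
logged_lr = {0: 0.000999682, 100: 0.000968352, 200: 0.000938004, 300: 0.000908608}

epochs = sorted(logged_lr)
pairs = list(zip(epochs, epochs[1:]))

# Ratio between consecutive logged values, and the implied per-epoch factor
ratios = [logged_lr[b] / logged_lr[a] for a, b in pairs]
per_epoch = [r ** (1.0 / (b - a)) for r, (a, b) in zip(ratios, pairs)]

for (a, b), r, d in zip(pairs, ratios, per_epoch):
    print(f"epochs {a}-{b}: ratio {r:.6f}, per-epoch factor {d:.8f}")
```

If the three ratios agree, the schedule multiplies the learning rate by the same factor each epoch over the logged range.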



In [None]:
evaluation.evaluate_classification(model[0], y_test_fam, y_predicted, model[1])