# Using Deep Averaging Networks for malware classification


In this notebook we will experiment with the concept of Deep Averaging Networks in our malware classification setting.

Let's start by loading some packages necessary for the experiment.

In [1]:
%load_ext autoreload
%autoreload 2

from sklearn.preprocessing import LabelBinarizer
from sklearn.decomposition import IncrementalPCA
from collections import defaultdict, Counter
from sklearn.externals import joblib
from preprocessing import pp_action
from helpers import loader_tfidf
from utilities import constants
import plotly.graph_objs as go
import plotly.offline as ply
import tensorflow as tf
import pandas as pd
import numpy as np
import random
import json
import os

In [2]:
config = json.load(open('config.json', 'r'))
uuids_family = json.load(open(os.path.join(constants.dir_d, constants.json_labels), 'r'))
words = json.load(open(os.path.join(constants.dir_d, constants.json_words), 'r'))
ply.init_notebook_mode(connected=True)
load_batch_size = 1100

## Data selection

Select a subset of the original dataset. The selected subset is then split into training, development, and test sets.


In [5]:
samples_data = pp_action.pre_process(config)
pp_action.split_show_data(samples_data)

Please choose the subset of data to workon on:
l for all labeled samples
k for samples of families mydoom, gepys, lamer, neshta, bladabindi, flystudio, eorezo
s for 8 samples of families mydoom, gepys, bladabindi, flystudio
f for a single family
b for a balanced subset of samples
q to quit
b

Would you like to compute the Jensen-Shannon distance matrix for the chosen data? [y/n]
It may take very long, depending on the number of selected samples
n

20007 train samples belonging to 65 malware families
Malware family:      multiplug       Number of samples:  748  
Malware family:     installcore      Number of samples:  724  
Malware family:       firseria       Number of samples:  720  
Malware family:      outbrowse       Number of samples:  712  
Malware family:       virlock        Number of samples:  707  
Malware family:      loadmoney       Number of samples:  706  
Malware family:        sality        Number of samples:  703  
Malware family:      browsefox       Number of samples

In [6]:
uuids = samples_data.index[samples_data['selected'] == 1].tolist()
x_train = samples_data.index[samples_data['train'] == 1].tolist()
x_dev = samples_data.index[samples_data['dev'] == 1].tolist()
x_test = samples_data.index[samples_data['test'] == 1].tolist()
y_train = samples_data.fam_num[samples_data['train'] == 1].tolist()
y_dev = samples_data.fam_num[samples_data['dev'] == 1].tolist()
y_test = samples_data.fam_num[samples_data['test'] == 1].tolist()

## Dimensionality Reduction

Since training the DAN requires a considerable amount of time, we will try reducing the dimensionality of the dataset.

We would also like this approach to be scalable to the entire balanced dataset so we will load sparse representations of the data vectors.

To achieve this we will use Principal Component Analysis (here scikit-learn's `IncrementalPCA`) to reduce the sparse vectors batch by batch. Let's define two helper functions first.


In [7]:
def train_pca(config, i_pca, samples, load_batch_size):
    t = 0
    
    while t < len(samples):
        data = loader_tfidf.load_tfidf(config, samples[t : t + load_batch_size], dense=True, ordered=False)
        t += load_batch_size

        i_pca.partial_fit(data)

In [8]:
def transform_data(config, i_pca, samples, load_batch_size):
    new_data = [] 
    t = 0
    
    while t < len(samples):
        data = loader_tfidf.load_tfidf(config, samples[t : t + load_batch_size], dense=True, ordered=True)
        t += load_batch_size

        new_data.append(i_pca.transform(data))
        
    return np.concatenate(new_data)
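
The batch-wise pattern behind `train_pca` and `transform_data` can be tried in isolation. The sketch below is a minimal illustration on synthetic data, not the notebook's pipeline: the random matrix stands in for the dense Tf-Idf batches returned by `loader_tfidf.load_tfidf`, and all sizes are made up. Every batch is kept at least as large as `n_components`, an assumption made here so that each `partial_fit` call is well defined.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
data = rng.rand(600, 50)           # synthetic stand-in for dense Tf-Idf rows
batch_size = 200                   # hypothetical; plays the role of load_batch_size

pca_demo = IncrementalPCA(n_components=10)

# Fit incrementally: only one batch needs to be in memory at a time.
for start in range(0, len(data), batch_size):
    pca_demo.partial_fit(data[start:start + batch_size])

# Transform batch by batch and stitch the reduced blocks back together.
reduced = np.concatenate([
    pca_demo.transform(data[start:start + batch_size])
    for start in range(0, len(data), batch_size)
])

print(reduced.shape)
```

The fitting order does not need to match the sample order, but the transform step must preserve it so that rows stay aligned with their labels.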

In [9]:
i_pca = IncrementalPCA(n_components=1024, batch_size=load_batch_size)

We will train the PCA algorithm incrementally, using only the training dataset.

In [None]:
train_pca(config, i_pca, random.sample(x_train, len(x_train)), load_batch_size)
joblib.dump(i_pca, 'temp_pca_1000.pkl')

Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)


In [11]:
print(i_pca.explained_variance_ratio_.sum())  

0.72908727985
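
The value printed above is the fraction of the total variance retained by the 1024 components. As a rough illustration of what that sum measures, the sketch below computes the same kind of ratio for a plain (non-incremental) PCA on synthetic data using NumPy's SVD. The data and sizes are invented for the example.

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.rand(200, 30)                     # synthetic data
centered = data - data.mean(axis=0)

# Singular values of the centered data give the variance along each component.
singular_values = np.linalg.svd(centered, compute_uv=False)
variances = singular_values ** 2 / (len(data) - 1)

k = 10                                       # number of components kept
retained = variances[:k].sum() / variances.sum()
print(retained)
```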


In [None]:
# or directly load the trained PCA model saved above, if available
i_pca = joblib.load('temp_pca_1000.pkl')

Then we will use the trained model to transform all the data vectors incrementally. This allows us to transform datasets larger than what would fit in RAM.

In [12]:
X_train = transform_data(config, i_pca, x_train, load_batch_size)
X_dev = transform_data(config, i_pca, x_dev, load_batch_size)
X_test = transform_data(config, i_pca, x_test, load_batch_size)

Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 207 documents
(207, 297360)
Loading Tf-Idf of 1100 documents
(1100, 297360)
Loading Tf-Idf of 1100 documents
(1100, 29

In [13]:
# Scale every component by the number of features, then transpose so that
# samples are stored as columns (features x samples).
X_train = (X_train / X_train.shape[1]).T
X_dev = (X_dev / X_dev.shape[1]).T
X_test = (X_test / X_test.shape[1]).T

## Labels pre-processing

First, we convert the true labels into a one-hot vector representation.

In [14]:
classes = sorted(set(y_train))
n_classes = len(classes)

classes_dict = dict(zip(classes, range(n_classes)))
y_train = [classes_dict[i] for i in y_train]
y_dev = [classes_dict[i] for i in y_dev]
y_test = [classes_dict[i] for i in y_test]

In [15]:
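lb = LabelBinarizer()
# Fit the binarizer on the training labels only and reuse it for dev and test,
# so that all three label matrices share the same class-to-row mapping.
Y_train = lb.fit_transform(y_train).T
Y_dev = lb.transform(y_dev).T
Y_test = lb.transform(y_test).T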
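
For reference, the one-hot layout used here has one row per class and one column per sample. A minimal NumPy sketch of the same encoding, using a made-up label list, could look like this:

```python
import numpy as np

labels = [2, 0, 1, 2]                 # made-up class indices
n_demo_classes = 3

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(n_demo_classes, dtype=int)[labels].T

print(one_hot.shape)   # (3, 4): classes along rows, samples along columns
```

Taking `argmax` along axis 0 recovers the original class indices, which is how the accuracy is computed later on.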

In [16]:
print ("X_train shape: " + str(X_train.shape))
print ("Y_train shape: " + str(Y_train.shape))
print ("X_dev shape: " + str(X_dev.shape))
print ("Y_dev shape: " + str(Y_dev.shape))
print ("X_test shape: " + str(X_test.shape))
print ("Y_test shape: " + str(Y_test.shape))

X_train shape: (1024, 20007)
Y_train shape: (65, 20007)
X_dev shape: (1024, 4288)
Y_dev shape: (65, 4288)
X_test shape: (1024, 4287)
Y_test shape: (65, 4287)


## Setting the Hyper-parameters

Let's set the hyper-parameters. We will start with a small, fast network.

In [38]:
learning_rate = 0.001
n_epochs = 1500
minibatch_size = 512
n_h_layers = 2
# ls = [[128,X_train.shape[0]], [128,1], [128,128], [128,1], [Y_train.shape[0],128], [Y_train.shape[0],1]]
# ls = [[512,X_train.shape[0]], [512,1], [256, 512], [256,1], [128,256], [128,1], [Y_train.shape[0],128], [Y_train.shape[0],1]]
ls = [[128,X_train.shape[0]], [128,1], [Y_train.shape[0],128], [Y_train.shape[0],1]]
keep_probs = 0.8
reg = 0.0
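
The list `ls` alternates weight and bias shapes, so entry `2*i` is the shape of `W{i}` and entry `2*i + 1` is the shape of `b{i}`. The sketch below rebuilds the list for the dimensions reported earlier (1024 features, 65 classes) and counts the trainable parameters. The helper `count_params` is introduced only for this illustration.

```python
n_feats, n_out = 1024, 65     # from the X_train / Y_train shapes printed above
layer_shapes = [[128, n_feats], [128, 1], [n_out, 128], [n_out, 1]]

def count_params(shapes):
    """Sum rows * cols over every weight and bias shape."""
    return sum(rows * cols for rows, cols in shapes)

# Pair each weight shape with the bias shape that follows it.
for i in range(len(layer_shapes) // 2):
    print('W{}'.format(i), layer_shapes[2 * i], 'b{}'.format(i), layer_shapes[2 * i + 1])

print(count_params(layer_shapes))
```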

## Model definition

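Each hidden layer applies an affine transformation followed by a ReLU activation and dropout. The output of the last layer is passed through a softmax function inside the cost computation.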

First let's define some placeholders for the input X and the labels Y

In [39]:
def init_ph(n_feats, n_classes):
    with tf.device('/gpu:0'):
        X = tf.placeholder(dtype=tf.float32, shape=(n_feats, None))
        Y = tf.placeholder(dtype=tf.float32, shape=(n_classes, None))
        keep_prob = tf.placeholder(tf.float32)
        
        return X,Y, keep_prob

Then we initialize the weights using the Xavier initialization method.

In [40]:
def init_weights(n_layers, layer_sizes):
    params = {}
    
    with tf.device('/gpu:0'):
        for i in range(n_layers):
            Wn = 'W{}'.format(i)
            bn = 'b{}'.format(i)
            
            params[Wn] = tf.get_variable(
                Wn, 
                layer_sizes[i * 2], 
                initializer = tf.contrib.layers.xavier_initializer(seed = 1)
            )
            
            params[bn] = tf.get_variable(
                bn, 
                layer_sizes[(i * 2) + 1],
                initializer = tf.zeros_initializer()
            )
    
    return params
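
As a rough picture of what the initializer does: with its default uniform setting, Xavier (Glorot) initialization draws weights from a uniform distribution whose limit depends on the number of input and output units. The NumPy sketch below illustrates the idea and is not the TensorFlow implementation; `xavier_uniform` is a hypothetical helper.

```python
import numpy as np

def xavier_uniform(shape, seed=1):
    """Draw a weight matrix from U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    n_out, n_in = shape
    limit = np.sqrt(6.0 / (n_in + n_out))
    rng = np.random.RandomState(seed)
    return rng.uniform(-limit, limit, size=shape)

W_demo = xavier_uniform((128, 1024))
print(W_demo.shape, np.abs(W_demo).max())
```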

Forward propagation

In [41]:
def fwd(X, params, keep_prob):
    Zn = None
    
    with tf.device('/gpu:0'):
        An = X
#         An = tf.nn.dropout(X, keep_prob)
        
        for i in range(n_h_layers):
            Wn = 'W{}'.format(i)
            bn = 'b{}'.format(i)
            
            Zn = tf.add(tf.matmul(params[Wn], An), params[bn])
            An = tf.nn.dropout(tf.nn.relu(Zn), keep_prob)
            
    return Zn
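
To make the data flow easier to follow, here is a NumPy sketch of the same kind of forward pass in the column-per-sample layout. It is an illustration under simplifying assumptions, with invented shapes and a hypothetical `forward_demo` helper. Unlike the sketch, `fwd` also builds a dropout op after the last layer, but its result is unused because the raw `Zn` is returned.

```python
import numpy as np

def forward_demo(X, weights, biases, keep_prob, rng):
    """ReLU + inverted dropout on hidden layers, raw logits from the last layer."""
    A = X
    Z = None
    for i, (W, b) in enumerate(zip(weights, biases)):
        Z = W @ A + b                       # b has shape (units, 1) and broadcasts over samples
        if i < len(weights) - 1:
            A = np.maximum(Z, 0.0)          # ReLU
            mask = rng.rand(*A.shape) < keep_prob
            A = A * mask / keep_prob        # rescale so the expected activation is unchanged
    return Z

rng = np.random.RandomState(0)
X_demo = rng.rand(8, 5)                                   # 8 features, 5 samples
weights = [rng.randn(4, 8), rng.randn(3, 4)]
biases = [np.zeros((4, 1)), np.zeros((3, 1))]

logits = forward_demo(X_demo, weights, biases, keep_prob=1.0, rng=rng)
print(logits.shape)    # (3, 5)
```

With `keep_prob=1.0`, as used at evaluation time, the dropout mask keeps every unit and the pass is deterministic.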


Cost function

In [42]:
def compute_cost(Zn, Y, reg, params, n_layers):
    
    with tf.device('/gpu:0'):
        logits = tf.transpose(Zn)
        labels = tf.transpose(Y)
        
        regularization = 0.0
        for i in range(n_layers):
            Wn = 'W{}'.format(i)
            regularization += tf.nn.l2_loss(params[Wn])
        
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels = labels)) + (reg * regularization)
    
    return cost
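
The cost combines the mean softmax cross-entropy with an L2 penalty on the weights. The NumPy sketch below spells out both terms for the (classes x samples) layout. The helper names are introduced here for illustration, and the inputs are toy values.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy; logits and labels have shape (n_classes, n_samples)."""
    shifted = logits - logits.max(axis=0, keepdims=True)     # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    return -(labels * log_probs).sum(axis=0).mean()

def l2_penalty(weights):
    """Sum of squared weights divided by 2, the quantity tf.nn.l2_loss computes."""
    return sum((W ** 2).sum() / 2.0 for W in weights)

logits = np.zeros((4, 3))                 # uniform predictions over 4 classes
labels = np.eye(4)[:, :3]                 # 3 one-hot columns
reg_demo = 0.0                            # same value as reg in this notebook
cost_demo = softmax_cross_entropy(logits, labels) + reg_demo * l2_penalty([np.ones((2, 2))])
print(cost_demo)                          # equals log(4) for uniform predictions
```

For 65 classes a uniform prediction would give ln 65 ≈ 4.17, which is a useful reference point when reading the cost of the first epochs.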

Finally, the DAN model.

In [43]:
def dan(X_train, Y_train, X_test, Y_test, l_rate, n_epochs, minibatch_size, n_h_layers, layers, k_prob, reg):

    with tf.device('/gpu:0'):
        
        tf.reset_default_graph()
        
        X, Y, keep_prob = init_ph(X_train.shape[0], Y_train.shape[0])
        
        params = init_weights(n_h_layers, layers)
        
        Z = fwd(X, params, keep_prob)
        
        cost = compute_cost(Z, Y, reg, params, n_h_layers)
        
        global_step = tf.Variable(0, trainable=False)
        
        learning_rate = tf.train.exponential_decay(l_rate, global_step, 100000, 0.99)
        
        # Pass the decayed learning rate to the optimizer so that the schedule is actually applied
        optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost, global_step=global_step)
        
        init = tf.global_variables_initializer()  
        
        correct_prediction = tf.equal(tf.argmax(Z), tf.argmax(Y))
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
        
        with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
        
            num_minibatches = int(X_train.shape[1] / minibatch_size)

            sess.run(init)

            for epoch in range(n_epochs):
                epoch_cost = 0.
                
                minibatch_idxs = np.random.permutation(X_train.shape[1])
                
                for i in range(num_minibatches):
                   

                    minibatch_X = np.take(
                        X_train,
                        minibatch_idxs[i * minibatch_size : (i + 1) * minibatch_size], 
                        axis=1
                    )
                    minibatch_Y = np.take(
                        Y_train, 
                        minibatch_idxs[i * minibatch_size : (i + 1) * minibatch_size], 
                        axis=1
                    )

                    _ , minibatch_cost = sess.run(
                        [optimizer, cost], 
                        feed_dict={
                            X: minibatch_X, 
                            Y: minibatch_Y,
                            keep_prob: k_prob
                        }
                    )

                    epoch_cost += minibatch_cost / num_minibatches

                if epoch % 100 == 0:
                    print ("Cost after epoch %i: %f" % (epoch, epoch_cost))
                    print ("Train Accuracy:", accuracy.eval({X: X_train, Y: Y_train, keep_prob: 1.0}))
                    print ("Dev Accuracy:", accuracy.eval({X: X_dev, Y: Y_dev, keep_prob: 1.0}))
                    print ("Learning Rate:", learning_rate.eval())
                    print ("")
                    
                if epoch % 5 == 0:
                    costs.append(epoch_cost)


            tr_acc =  accuracy.eval({X: X_train, Y: Y_train, keep_prob: 1.0})
            dv_acc = accuracy.eval({X: X_dev, Y: Y_dev, keep_prob: 1.0})
            
            print ("Train Accuracy:",tr_acc)
            print ("Dev Accuracy:", dv_acc)


        return params, costs, tr_acc, dv_acc
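
The minibatch logic inside the training loop can be isolated as follows. This is a sketch with toy arrays and a hypothetical `iterate_minibatches` helper. Like the loop above, it reshuffles the column indices on every call and drops the last partial batch. With 20007 training samples and a minibatch size of 512, that gives 39 updates per epoch and leaves 39 samples unused in each epoch.

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size, rng):
    """Yield shuffled column slices of X and Y; the last partial batch is dropped."""
    n_samples = X.shape[1]
    order = rng.permutation(n_samples)
    for i in range(n_samples // batch_size):
        idx = order[i * batch_size:(i + 1) * batch_size]
        yield X[:, idx], Y[:, idx]

rng = np.random.RandomState(0)
X_demo = np.arange(20).reshape(2, 10)     # 2 features, 10 samples
Y_demo = np.arange(10).reshape(1, 10)
batches = list(iterate_minibatches(X_demo, Y_demo, batch_size=4, rng=rng))
print(len(batches), batches[0][0].shape)  # 2 batches of shape (2, 4)
```

Indexing `X` and `Y` with the same index array keeps every sample paired with its label after shuffling.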
        
        

In [None]:
tf.set_random_seed(1)
costs = []

tf.reset_default_graph()
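# Note: dan() evaluates on the global X_dev / Y_dev, so the second accuracy
# returned (stored in ts_acc) is the dev accuracy, not the test accuracy.
parameters, cost_list, tr_acc, ts_acc = dan(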
    X_train,
    Y_train,
    X_test,
    Y_test,
    learning_rate,
    n_epochs,
    minibatch_size,
    n_h_layers,
    ls,
    keep_probs,
    reg
)

Cost after epoch 0: 4.015227
Train Accuracy: 0.641225
Dev Accuracy: 0.631996
Learning Rate: 0.000999996

Cost after epoch 100: 0.302301
Train Accuracy: 0.925925
Dev Accuracy: 0.90555
Learning Rate: 0.000999604

Cost after epoch 200: 0.210804
Train Accuracy: 0.944419
Dev Accuracy: 0.91861
Learning Rate: 0.000999212

Cost after epoch 300: 0.168607
Train Accuracy: 0.957164
Dev Accuracy: 0.927239
Learning Rate: 0.000998821

Cost after epoch 400: 0.140943
Train Accuracy: 0.961713
Dev Accuracy: 0.93097
Learning Rate: 0.000998429

Cost after epoch 500: 0.118579
Train Accuracy: 0.96801
Dev Accuracy: 0.934002
Learning Rate: 0.000998038

Cost after epoch 600: 0.105775
Train Accuracy: 0.972609
Dev Accuracy: 0.9368
Learning Rate: 0.000997647

Cost after epoch 700: 0.093336
Train Accuracy: 0.976358
Dev Accuracy: 0.937034
Learning Rate: 0.000997256

Cost after epoch 800: 0.082394
Train Accuracy: 0.979956
Dev Accuracy: 0.9382
Learning Rate: 0.000996865

Cost after epoch 900: 0.073736
Train Accuracy: 
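
The learning rates in the log follow the `tf.train.exponential_decay` schedule, `l_rate * 0.99 ** (global_step / 100000)`, with `global_step` increasing by one per minibatch update. As a sanity check, assuming 39 updates per epoch (20007 training samples, minibatches of 512, last partial batch dropped), the sketch below reproduces the first logged values. `decayed_lr` is a helper introduced here.

```python
def decayed_lr(base_lr, step, decay_steps=100000, decay_rate=0.99):
    """Exponential decay without the staircase option."""
    return base_lr * decay_rate ** (step / decay_steps)

updates_per_epoch = 20007 // 512          # 39

# The log line for a given epoch is printed after that epoch's updates have run.
for epoch in (0, 100, 200):
    step = (epoch + 1) * updates_per_epoch
    print(epoch, decayed_lr(0.001, step))
```

With 100000 decay steps the rate barely moves over 1500 epochs, so the schedule has little practical effect in this run.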

In [None]:
trace = go.Scatter(
    x = np.arange(len(costs)),
    y = costs
)
ply.iplot([trace], filename='costs')

In [None]:
tr_acc

In [None]:
ts_acc