# Using Deep Averaging Networks for malware classification


In this notebook we will experiment with the concept of Deep Averaging Networks in our malware classification setting.

Let's start by loading some packages necessary for the experiment.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from sklearn.preprocessing import LabelBinarizer
from sklearn.decomposition import IncrementalPCA
from collections import defaultdict, Counter
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from preprocessing import pp_action
from helpers import loader_tfidf
from utilities import constants
import plotly.graph_objs as go
import plotly.offline as ply
import tensorflow as tf
import pandas as pd
import numpy as np
import random
import json
import os

In [None]:
with open('config.json', 'r') as f:
    config = json.load(f)
with open(os.path.join(constants.dir_d, constants.json_labels), 'r') as f:
    uuids_family = json.load(f)
with open(os.path.join(constants.dir_d, constants.json_words), 'r') as f:
    words = json.load(f)
ply.init_notebook_mode(connected=True)
load_batch_size = 1100

## Data selection

Select a subset of the original dataset, then split it into training, development, and test sets.


In [None]:
samples_data = pp_action.pre_process(config)
pp_action.split_show_data(samples_data)

In [None]:
uuids = samples_data.index[samples_data['selected'] == 1].tolist()
x_train = samples_data.index[samples_data['train'] == 1].tolist()
x_dev = samples_data.index[samples_data['dev'] == 1].tolist()
x_test = samples_data.index[samples_data['test'] == 1].tolist()
y_train = samples_data.fam_num[samples_data['train'] == 1].tolist()
y_dev = samples_data.fam_num[samples_data['dev'] == 1].tolist()
y_test = samples_data.fam_num[samples_data['test'] == 1].tolist()

## Dimensionality Reduction

Since training the DAN takes a considerable amount of time, we will try reducing the dimensionality of the dataset.

We would also like this approach to be scalable to the entire balanced dataset so we will load sparse representations of the data vectors.

To achieve this we will use an incremental Principal Component Analysis, which can be fitted batch by batch on the data vectors. Let's define two helper functions first.


In [None]:
def train_pca(config, i_pca, samples, load_batch_size):
    t = 0
    
    while t < len(samples):
        data = loader_tfidf.load_tfidf(config, samples[t : t + load_batch_size], dense=True, ordered=False)
        t += load_batch_size

        i_pca.partial_fit(data)

In [None]:
def transform_data(config, i_pca, samples, load_batch_size):
    new_data = [] 
    t = 0
    
    while t < len(samples):
        data = loader_tfidf.load_tfidf(config, samples[t : t + load_batch_size], dense=True, ordered=True)
        t += load_batch_size

        new_data.append(i_pca.transform(data))
        
    return np.concatenate(new_data)
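
Both helpers walk the sample list in fixed-size slices, so only one batch of vectors is held in memory at a time. The slicing pattern can be sketched in isolation as follows; `iter_batches` and the toy list are hypothetical names used only for this illustration and are not part of the project's helpers.

```python
def iter_batches(samples, batch_size):
    """Yield consecutive slices of `samples` with at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Toy stand-in for the list of sample UUIDs.
toy_samples = list(range(10))
batches = list(iter_batches(toy_samples, 4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Note that the last slice is usually shorter than the others. Depending on the scikit-learn version, `partial_fit` may reject a batch that holds fewer samples than `n_components`, so it is worth checking the size of the final batch.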

In [None]:
i_pca = IncrementalPCA(n_components=1024, batch_size=load_batch_size)

We will train the PCA algorithm incrementally, using only the training dataset.

In [None]:
train_pca(config, i_pca, random.sample(x_train, len(x_train)), load_batch_size)
joblib.dump(i_pca, 'temp_pca_1000.pkl')

In [None]:
# or directly load the trained PCA model if available
i_pca = joblib.load('temp_pca_1000.pkl') 

In [None]:
print(i_pca.explained_variance_ratio_.sum())  

Then we will use the trained algorithm to (incrementally) transform all the data vectors. This allows us to transform datasets larger than what would fit in RAM.

In [None]:
X_train = transform_data(config, i_pca, x_train, load_batch_size)
X_dev = transform_data(config, i_pca, x_dev, load_batch_size)
X_test = transform_data(config, i_pca, x_test, load_batch_size)

In [None]:
# Scale every value by the number of features, then transpose so that
# features are on axis 0 and samples on axis 1.
X_train = (X_train / X_train.shape[1]).T
X_dev = (X_dev / X_dev.shape[1]).T
X_test = (X_test / X_test.shape[1]).T

## Labels pre-processing

We will initially convert the true labels into a one-hot vector representation.

In [None]:
classes = sorted(set(y_train))
n_classes = len(classes)

classes_dict = dict(zip(classes, range(n_classes)))
y_train = [classes_dict[i] for i in y_train]
y_dev = [classes_dict[i] for i in y_dev]
y_test = [classes_dict[i] for i in y_test]

In [None]:
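lb = LabelBinarizer()
# Fit the binarizer on the training labels only, then reuse it so that
# all three matrices share the same class-to-row mapping.
Y_train = lb.fit_transform(y_train).T
Y_dev = lb.transform(y_dev).T
Y_test = lb.transform(y_test).T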
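
The transposed layout places classes on the rows and samples on the columns, matching the (features, samples) layout of the inputs. A minimal NumPy sketch of the same encoding, on made-up labels (the `toy_` names are illustrative and not part of the notebook), could look like this:

```python
import numpy as np

# Hypothetical class indices for five samples and three classes.
toy_labels = np.array([0, 2, 1, 2, 0])
toy_n_classes = 3

# Row i of the identity matrix is the one-hot vector for class i;
# transposing puts classes on axis 0 and samples on axis 1.
toy_one_hot = np.eye(toy_n_classes, dtype=int)[toy_labels].T
print(toy_one_hot.shape)  # (3, 5)
```

Taking the argmax along axis 0 recovers the original class index of each sample.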

In [None]:
print ("X_train shape: " + str(X_train.shape))
print ("Y_train shape: " + str(Y_train.shape))
print ("X_dev shape: " + str(X_dev.shape))
print ("Y_dev shape: " + str(Y_dev.shape))
print ("X_test shape: " + str(X_test.shape))
print ("Y_test shape: " + str(Y_test.shape))

## Setting the Hyper-parameters

Let's set the hyper-parameters. We will start with a small network that is fast to train.

In [None]:
learning_rate = 0.001
n_epochs = 384
minibatch_size = 256
n_h_layers = 3
ls = [[256,X_train.shape[0]], [256,1], [128,256], [128,1], [Y_train.shape[0],128], [Y_train.shape[0],1]]
# ls = [[512,X_train.shape[0]], [512,1], [256, 512], [256,1], [128,256], [128,1], [Y_train.shape[0],128], [Y_train.shape[0],1]]
# ls = [[128,X_train.shape[0]], [128,1], [Y_train.shape[0],128], [Y_train.shape[0],1]]
# ls = [[512,X_train.shape[0]], [512,1], [Y_train.shape[0],512], [Y_train.shape[0],1]]
keep_probs = 0.9
reg = 0.0

## Model definition

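Each layer applies an affine transformation followed by normalization, a ReLU activation and dropout. The output of the last layer goes through a softmax function, which is applied inside the cost function.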
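
As a reminder, softmax maps a vector of scores to a probability distribution. The sketch below is a plain NumPy illustration with classes on axis 0, as in this notebook; the function name and the values are made up for the example.

```python
import numpy as np

def softmax_columns(z):
    """Column-wise softmax for scores shaped (n_classes, n_samples)."""
    # Subtracting the column maximum does not change the result
    # but avoids overflow in the exponential.
    shifted = z - z.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

# Two samples (columns), three classes (rows).
toy_scores = np.array([[2.0, 0.0],
                       [1.0, 0.0],
                       [0.1, 0.0]])
toy_probs = softmax_columns(toy_scores)
print(toy_probs.sum(axis=0))  # each column sums to 1
```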

First, let's define some placeholders for the input X, the labels Y and the dropout keep probability.

In [None]:
def init_ph(n_feats, n_classes):
    with tf.device('/gpu:0'):
        X = tf.placeholder(dtype=tf.float32, shape=(n_feats, None))
        Y = tf.placeholder(dtype=tf.float32, shape=(n_classes, None))
        keep_prob = tf.placeholder(tf.float32)
        
        return X,Y, keep_prob

Then we initialize the weights using the Xavier initialization method and set the biases to zero.

In [None]:
def init_weights(n_layers, layer_sizes):
    params = {}
    
    with tf.device('/gpu:0'):
        for i in range(n_layers):
            Wn = 'W{}'.format(i)
            bn = 'b{}'.format(i)
            
            params[Wn] = tf.get_variable(
                Wn, 
                layer_sizes[i * 2], 
                initializer = tf.contrib.layers.xavier_initializer(seed = 1)
            )
            
            params[bn] = tf.get_variable(
                bn, 
                layer_sizes[(i * 2) + 1],
                initializer = tf.zeros_initializer()
            )
    
    return params
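
For intuition, the uniform variant of Xavier (Glorot) initialization draws weights from a range that shrinks as the layer gets wider. The following NumPy sketch shows that rule under the assumption of a 2-D weight matrix; it is an illustration of the scheme, not TensorFlow's own code, and `xavier_uniform` is a hypothetical helper.

```python
import numpy as np

def xavier_uniform(shape, rng):
    """Sample a weight matrix from U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    fan_out, fan_in = shape  # weights are stored as (units_out, units_in) here
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=shape)

rng = np.random.default_rng(1)
toy_W = xavier_uniform((256, 1024), rng)
toy_limit = np.sqrt(6.0 / (256 + 1024))
print(toy_W.shape, float(np.abs(toy_W).max()) <= toy_limit)
```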

Forward propagation

In [None]:
def fwd(X, params, keep_prob):
    Zn = None
    epsilon = 1e-4
    
    with tf.device('/gpu:0'):
#         An = X
        An = tf.nn.dropout(X, keep_prob)
        
        for i in range(n_h_layers):
            Wn = 'W{}'.format(i)
            bn = 'b{}'.format(i)
            
            Zn = tf.add(tf.matmul(params[Wn], An), params[bn])
            
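            # Note: Zn has shape (units, samples), so axis 0 yields one mean and
            # variance per sample, computed across the units of the layer.
            batch_mean, batch_var = tf.nn.moments(Zn,[0])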
            BN = tf.nn.batch_normalization(
                x=Zn,
                mean=batch_mean,
                variance=batch_var,
                offset=None,
                scale=None,
                variance_epsilon=epsilon
            )
            
            An = tf.nn.dropout(tf.nn.relu(BN), keep_prob)
            
    return Zn
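
To make the data flow easier to follow, here is a rough NumPy sketch of the same sequence of steps (dropout on the input, then affine transform, normalization along axis 0, ReLU and dropout per layer). The names `toy_forward`, `weights` and `biases` are assumptions made for this illustration; the sketch mirrors the structure of the function above rather than reproducing TensorFlow's behavior exactly.

```python
import numpy as np

def toy_forward(X, weights, biases, keep_prob, rng, eps=1e-4):
    """Return the pre-activation output of the last layer for X shaped (features, samples)."""
    def dropout(a):
        # Inverted dropout: drop units with probability 1 - keep_prob, rescale the rest.
        mask = rng.random(a.shape) < keep_prob
        return a * mask / keep_prob

    A = dropout(X)
    Z = None
    for W, b in zip(weights, biases):
        Z = W @ A + b                      # affine transform, shape (units, samples)
        mean = Z.mean(axis=0)              # statistics along axis 0, one per sample
        var = Z.var(axis=0)
        normed = (Z - mean) / np.sqrt(var + eps)
        A = dropout(np.maximum(normed, 0.0))  # ReLU followed by dropout
    return Z

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(8, 5))            # 8 features, 5 samples
toy_weights = [rng.normal(size=(6, 8)), rng.normal(size=(3, 6))]
toy_biases = [np.zeros((6, 1)), np.zeros((3, 1))]
toy_out = toy_forward(toy_X, toy_weights, toy_biases, keep_prob=1.0, rng=rng)
print(toy_out.shape)  # (3, 5)
```

As in the function above, the value returned is the raw output of the last layer, so it can be used directly as logits.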


Cost function

In [None]:
def compute_cost(Zn, Y, reg, params, n_layers):
    
    with tf.device('/gpu:0'):
        logits = tf.transpose(Zn)
        labels = tf.transpose(Y)
        
        regularization = 0.0
        for i in range(n_layers):
            Wn = 'W{}'.format(i)
            regularization += tf.nn.l2_loss(params[Wn])
        
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels = labels)) + (reg * regularization)
    
    return cost
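
The cost is the mean softmax cross-entropy over the samples plus an L2 penalty on the weights. A small NumPy sketch of that formula, using the same "half the sum of squares" convention as `tf.nn.l2_loss`, is shown below; `toy_cost` and its arguments are hypothetical names for this example.

```python
import numpy as np

def toy_cost(Z, Y, weights, reg):
    """Mean softmax cross-entropy for logits Z and one-hot Y, both (n_classes, n_samples)."""
    shifted = Z - Z.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    cross_entropy = -(Y * log_probs).sum(axis=0).mean()
    l2 = sum(0.5 * np.sum(W ** 2) for W in weights)  # sum(w^2) / 2 per matrix
    return cross_entropy + reg * l2

# With all-zero logits every class gets probability 1/3,
# so the cross-entropy equals log(3).
toy_Z = np.zeros((3, 4))
toy_Y = np.eye(3)[[0, 1, 2, 0]].T
toy_value = toy_cost(toy_Z, toy_Y, [np.ones((2, 2))], reg=0.0)
print(toy_value)
```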

Finally, the DAN model

In [None]:
def dan(X_train, Y_train, X_dev, Y_dev, l_rate, n_epochs, minibatch_size, n_h_layers, layers, k_prob, reg):

    with tf.device('/gpu:0'):
        tf.reset_default_graph()
        
        X, Y, keep_prob = init_ph(X_train.shape[0], Y_train.shape[0])

        params = init_weights(n_h_layers, layers)
        
        Z = fwd(X, params, keep_prob)
        
        cost = compute_cost(Z, Y, reg, params, n_h_layers)
        
        global_step = tf.Variable(0, trainable=False)
        
        learning_rate = tf.train.exponential_decay(l_rate, global_step, 5000, 0.96)
        
        optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost, global_step=global_step)
        
        y_pred = tf.argmax(Z)
        
        y_true = tf.argmax(Y)
        
        correct_prediction = tf.equal(y_pred, y_true)
        
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    
    init = tf.global_variables_initializer()  

    sess =  tf.Session(config=tf.ConfigProto(log_device_placement=True))

    num_minibatches = int(X_train.shape[1] / minibatch_size)

    sess.run(init)

    for epoch in range(n_epochs):
        epoch_cost = 0.

        minibatch_idxs = np.random.permutation(X_train.shape[1])

        for i in range(num_minibatches):

            minibatch_X = np.take(
                X_train,
                minibatch_idxs[i * minibatch_size : (i + 1) * minibatch_size], 
                axis=1
            )
            minibatch_Y = np.take(
                Y_train, 
                minibatch_idxs[i * minibatch_size : (i + 1) * minibatch_size], 
                axis=1
            )

            _ , minibatch_cost = sess.run(
                [optimizer, cost], 
                feed_dict={
                    X: minibatch_X, 
                    Y: minibatch_Y,
                    keep_prob: k_prob
                }
            )

            epoch_cost += minibatch_cost / num_minibatches

        if epoch % 100 == 0:
            print ("Cost after epoch %i: %f" % (epoch, epoch_cost))
            print ("Train Accuracy:", accuracy.eval({X: X_train, Y: Y_train, keep_prob: 1.0}, session=sess))
            print ("Dev Accuracy:", accuracy.eval({X: X_dev, Y: Y_dev, keep_prob: 1.0}, session=sess))
            print ("Learning Rate:", learning_rate.eval(session=sess))
            print ("")

        if epoch % 5 == 0:
            costs.append(epoch_cost)


    tr_acc =  accuracy.eval({X: X_train, Y: Y_train, keep_prob: 1.0}, session=sess)
    dv_acc = accuracy.eval({X: X_dev, Y: Y_dev, keep_prob: 1.0}, session=sess)

    print ("Train Accuracy:",tr_acc)
    print ("Dev Accuracy:", dv_acc)

    return params, costs, tr_acc, dv_acc, accuracy, sess, X, Y, keep_prob, y_pred, y_true 
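
The learning rate schedule used above decays smoothly with the number of optimizer steps. Assuming the default non-staircase behavior of `tf.train.exponential_decay`, the rate at a given step can be sketched in plain Python as follows; `decayed_rate` is a hypothetical helper written for this illustration.

```python
def decayed_rate(initial_rate, step, decay_steps=5000, decay_rate=0.96):
    """Continuous exponential decay: initial_rate * decay_rate ** (step / decay_steps)."""
    return initial_rate * decay_rate ** (step / decay_steps)

# Learning rate at a few optimizer steps, starting from 0.001.
for toy_step in (0, 5000, 50000):
    print(toy_step, decayed_rate(0.001, toy_step))
```

With these settings the rate is multiplied by 0.96 every 5000 minibatch updates, so the decay only becomes noticeable on long training runs.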
        
        

In [None]:
tf.set_random_seed(1)
costs = []

parameters, cost_list, tr_acc, dv_acc, accuracy, sess, X, Y, keep_prob, y_pred, y_true = dan(
    X_train,
    Y_train,
    X_dev,
    Y_dev,
    learning_rate,
    n_epochs,
    minibatch_size,
    n_h_layers,
    ls,
    keep_probs,
    reg
)

Now that the network has been trained, let's run it on the test set and plot the training cost.

In [None]:
ts_acc, y_predicted, y_labels = sess.run(
    [accuracy, y_pred, y_true], 
    feed_dict={X: X_test, Y: Y_test, keep_prob: 1.0}
)

In [None]:
trace = go.Scatter(
    x = np.arange(len(costs)),
    y = costs
)
ply.iplot([trace], filename='costs')

## Evaluation

After having trained our NN let's see how it behaves on the real test set. To get a sense of the performance of the network we will look at the F1 score, which is the harmonic mean of precision and recall.
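
As a quick worked example of the metric, the sketch below computes the F1 score of a single class from made-up counts of true positives, false positives and false negatives. The helper name and the counts are illustrative only.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Precision is 8 / 10 and recall is 8 / 12 for these counts.
toy_f1 = f1_from_counts(tp=8, fp=2, fn=4)
print(round(toy_f1, 3))  # 0.727
```

The harmonic mean stays low whenever either precision or recall is low, which makes it stricter than a simple average of the two.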

In [None]:
f1s = f1_score(y_labels, y_predicted, average=None)
print(f1s)

In [None]:
print ("Train Accuracy:",tr_acc)
print ("Dev Accuracy:", dv_acc)
print ("Test Accuracy:", ts_acc)

In [None]:
print('Test F1 Average Score:', f1_score(y_labels, y_predicted, average='macro'))

This seems like a very nice result. Let's look at the score of each class in detail.

In [None]:
y_test_fams = samples_data.family[samples_data['test'] == 1].tolist()
y_test_fams_num = samples_data.fam_num[samples_data['test'] == 1].tolist()

class_fam = {}
for i in range(len(y_test_fams)):
    class_fam[classes_dict[y_test_fams_num[i]]] = y_test_fams[i]

fam_score = {}
for fam_num, fam in class_fam.items():
    fam_score[fam] = f1s[fam_num]

In [None]:
for fam, score in sorted(fam_score.items()):
    print('{:20} {:20}'.format(fam, score))

Finally, let's look at the confusion matrix, with each row normalized by the number of samples in the true class.

In [None]:
cm = confusion_matrix(y_labels, y_predicted).astype(float)
for vec in cm:
    vec /= np.sum(vec)
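
The same row normalization can be written without a loop. The following NumPy sketch uses a made-up two-class matrix to show that, after dividing each row by its sum, the diagonal holds the recall of each class.

```python
import numpy as np

# Hypothetical counts: rows are true classes, columns are predicted classes.
toy_cm = np.array([[8, 2],
                   [1, 9]], dtype=float)

# Divide every row by the number of samples of that true class.
toy_cm_norm = toy_cm / toy_cm.sum(axis=1, keepdims=True)
print(np.diag(toy_cm_norm))  # per-class recall
```

Both versions produce NaN for a row that sums to zero, which happens when a class has no samples in the evaluated set.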

In [None]:
families = [class_fam[i] for i in sorted(class_fam.keys())]

In [None]:
trace = go.Heatmap(z=cm, x=families, y=families)
ply.iplot([trace], filename='conf_matrix_28k')