# Using Deep Averaging Networks for malware classification


In this notebook we will experiment with Deep Averaging Networks (DANs) in our malware classification setting.
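
As a rough illustration of the concept, a Deep Averaging Network averages the embedding vectors of its input items and feeds that average through a stack of dense layers ending in a softmax. The NumPy sketch below shows only such a forward pass. The function name, the layer sizes and the random weights are illustrative assumptions, and it is not the implementation behind `cla_dan`.

```python
import numpy as np

def dan_forward(embeddings, weights, biases):
    """Forward pass of a toy Deep Averaging Network (illustrative only).

    embeddings: (n_items, emb_dim) array, one row per input item.
    weights, biases: parameters of the dense layers; the last layer
    produces one logit per class.
    """
    h = embeddings.mean(axis=0)                  # averaging step
    for w, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ w + b)           # hidden layers with ReLU
    logits = h @ weights[-1] + biases[-1]
    exp = np.exp(logits - logits.max())          # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
emb_dim, hidden, n_classes = 8, 16, 3            # hypothetical sizes
weights = [rng.normal(size=(emb_dim, hidden)),
           rng.normal(size=(hidden, n_classes))]
biases = [np.zeros(hidden), np.zeros(n_classes)]
items = rng.normal(size=(5, emb_dim))            # five toy input items
probs = dan_forward(items, weights, biases)
```

Because of the averaging step, the output does not depend on the order of the input items.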

Let's start by loading some packages necessary for the experiment.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from classification import cla_action, cla_dan
from utilities import constants, evaluation
from preprocessing import pp_action
import plotly.graph_objs as go
import plotly.offline as ply
import tensorflow as tf
import pandas as pd
import numpy as np
import json
import os

In [None]:
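# Use a context manager so the configuration file is closed after reading
with open('config.json', 'r') as config_file:
    config = json.load(config_file)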
ply.init_notebook_mode(connected=True)

## Data selection

Select a subset of the original dataset. The selected subset is then split into training, development and test sets.


In [None]:
samples_data = pp_action.pre_process(config)
pp_action.split_show_data(samples_data)

In [None]:
x_train = samples_data.index[samples_data['train'] == 1].tolist()
x_dev = samples_data.index[samples_data['dev'] == 1].tolist()
x_test = samples_data.index[samples_data['test'] == 1].tolist()
y_train = samples_data.fam_num[samples_data['train'] == 1].tolist()
y_dev = samples_data.fam_num[samples_data['dev'] == 1].tolist()
y_test = samples_data.fam_num[samples_data['test'] == 1].tolist()
y_test_fam = samples_data.family[samples_data['test'] == 1].tolist()
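
The cells above rely on 0/1 `train`, `dev` and `test` columns in `samples_data`. How `pp_action` assigns them is not shown in this notebook, so the following is only a minimal sketch of one way such flags could be produced and then used for selection. The helper name `add_split_flags`, the toy frame and the 70/15/15 proportions are assumptions, and the split here is a plain shuffle rather than a stratified one.

```python
import numpy as np
import pandas as pd

def add_split_flags(frame, train_frac=0.70, dev_frac=0.15, seed=0):
    """Return a copy of frame with 0/1 'train', 'dev' and 'test' columns."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(frame))          # shuffled row positions
    n_train = int(round(train_frac * len(frame)))
    n_dev = int(round(dev_frac * len(frame)))
    split = np.zeros(len(frame), dtype=int)      # 0 = train
    split[order[n_train:n_train + n_dev]] = 1    # 1 = dev
    split[order[n_train + n_dev:]] = 2           # 2 = test
    out = frame.copy()
    for code, name in enumerate(('train', 'dev', 'test')):
        out[name] = (split == code).astype(int)
    return out

# Toy stand-in for samples_data (hypothetical sample ids and labels)
toy = pd.DataFrame({'fam_num': [0, 1, 2, 0, 1] * 4},
                   index=[f'sample_{i}' for i in range(20)])
flagged = add_split_flags(toy)
x_train_toy = flagged.index[flagged['train'] == 1].tolist()
y_train_toy = flagged.fam_num[flagged['train'] == 1].tolist()
```

Each row ends up in exactly one of the three sets, so the boolean masks used for selection never overlap.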

## Feature extraction

Since the DAN required a considerable amount of time to train on the full dataset, we will try reducing the dimensionality of the input.

To achieve this we apply Principal Component Analysis (PCA) to the sparse feature vectors.

In [None]:
xm_train_e = np.loadtxt('data/matrix/pca_512_846_tr.txt')
xm_dev_e = np.loadtxt('data/matrix/pca_512_182_dv.txt')
xm_test_e = np.loadtxt('data/matrix/pca_512_181_te.txt')
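
The matrices loaded above are precomputed, and the code that generated them is not part of this notebook. As a hedged sketch of the general idea, PCA can be fitted on the training matrix only and the same projection reused for the other sets. The function names, the toy data and the number of components below are assumptions, and a dense SVD like this one may be too expensive for very large sparse matrices.

```python
import numpy as np

def fit_pca(x_train, n_components):
    """Return the training mean and the top principal directions."""
    mean = x_train.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(x_train - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(x, mean, components):
    """Center x with the training mean and project it onto the components."""
    return (x - mean) @ components.T

rng = np.random.default_rng(0)
x_tr = rng.random((60, 40))      # toy stand-ins for the real feature matrices
x_dv = rng.random((12, 40))
mean, components = fit_pca(x_tr, n_components=8)
x_tr_e = apply_pca(x_tr, mean, components)
x_dv_e = apply_pca(x_dv, mean, components)
```

Fitting on the training set alone keeps information from the development and test sets out of the projection.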

## Feature selection

An alternative to feature extraction, which creates a new, artificial set of features, is feature selection. By feature selection we mean a method that tries to isolate, among the natural features of the dataset, those that matter most for a specific learning task.

We will attempt to select the most relevant features by using random forest classifiers.

In [None]:
xm_train_s = np.loadtxt('data/matrix/rfc_512_846_tr.txt')
xm_dev_s = np.loadtxt('data/matrix/rfc_512_182_dv.txt')
xm_test_s = np.loadtxt('data/matrix/rfc_512_181_te.txt')
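
These matrices are also precomputed. One common way to select features with a random forest, shown here only as a sketch under assumptions, is to rank the columns by the `feature_importances_` of a fitted scikit-learn `RandomForestClassifier` and keep the top `k`. The helper name, the toy data and the value of `k` are hypothetical, and the original pipeline may have worked differently.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_top_features(x_train, y_train, k, seed=0):
    """Return the sorted column indices of the k most important features."""
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    forest.fit(x_train, y_train)
    ranking = np.argsort(forest.feature_importances_)[::-1]  # best first
    return np.sort(ranking[:k])

rng = np.random.default_rng(0)
x_tr = rng.random((200, 20))
y_tr = (x_tr[:, 3] > 0.5).astype(int)    # only column 3 carries signal
x_dv = rng.random((40, 20))
selected = select_top_features(x_tr, y_tr, k=5)
# Apply the same column subset to every split
x_tr_s, x_dv_s = x_tr[:, selected], x_dv[:, selected]
```

Unlike PCA, the retained columns are original features, so they remain directly interpretable.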

## Classification

Now we can try classification with both data sets.

First with the extracted features.

In [None]:
y_predicted, model, modifier = cla_dan.classify(xm_train_e, xm_dev_e, xm_test_e, y_train, y_dev, y_test, config)

In [None]:
evaluation.evaluate_classification(model[0], y_test_fam, y_predicted, model[1])

Now with the selected features.

In [None]:
y_predicted, model, modifier = cla_dan.classify(xm_train_s, xm_dev_s, xm_test_s, y_train, y_dev, y_test, config)

In [None]:
evaluation.evaluate_classification(model[0], y_test_fam, y_predicted, model[1])