# Domain Generation Algorithm detection using a Convolutional Neural Net

**Author:** Alexandra Ding

**Date last modified:** 2/21/2018

**Domain Generation Algorithms** (DGAs) are used by malware to generate domain names. Given a domain name such as 'google.com' or 'ejwfeijwofweofeofejfj833.net', the model below predicts whether the name was generated by a DGA. The examples show how to train a deep learning model in Keras (Python), and how to train and run inference on multiple GPUs. This notebook was last run on a p2.xlarge instance on Amazon Web Services; to experiment with the multi-GPU code, I recommend using a p2.8xlarge.

Training datasets are the [Majestic Million](https://majestic.com/reports/majestic-million) and [Bambenek DGA](http://osint.bambenekconsulting.com/feeds/dga-feed.txt). 

In [16]:
# Import required packages
import random
import numpy as np
import tensorflow as tf
from keras import backend as K
from keras.backend import manual_variable_initialization
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout, Activation
from keras.layers import Flatten
from keras.layers import Conv1D, MaxPooling1D
from keras.optimizers import Adam

import pandas as pd
import time

In [18]:
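# Verify that Jupyter notebook is running on GPU
device_name = "/gpu:0"
shape = (10000, 10000)

with tf.device(device_name):
    random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1)
    dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
    sum_operation = tf.reduce_sum(dot_operation)

# Run the graph; log_device_placement prints the device assigned to each op
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as session:
    print(session.run(sum_operation))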

## Convolutional Neural Net

For this example, we will compile and train an open-source model from [Bin Yu et al. 2017](http://faculty.washington.edu/mdecock/papers/byu2017a.pdf) in Keras with a TensorFlow backend. These frameworks are commonly used for model prototyping, but have memory and speed limitations for model deployment.

In [11]:
### Compile Bin Yu et al. 2017's CNN Model

# Specify Parameters
max_len = 63 # Input length; 63 is the maximum length of a single DNS label
input_tokens = 256 # Vocabulary size: one token per possible byte value (0-255)
embed_size = 128 # Dimensionality of embedding space

def build_compile_model():
    # Use Keras Sequential to add layers
    model = Sequential(name='Seq')
    model.add(Embedding(input_tokens, embed_size,
                        input_length=max_len))
    model.add(Conv1D(1000, 2, padding='same',
                     kernel_initializer='glorot_normal',
                     activation='relu'))
    model.add(Dropout(0.5))
    model.add(Flatten())
    model.add(Dense(100, activation='relu',
                    kernel_initializer='glorot_normal'))
    model.add(Dense(1, activation='sigmoid',
                    kernel_initializer='glorot_normal'))

    # Compile Model using Adam optimizer
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])

    # Print model layers and input/output dimensions
    # (summary() prints directly and returns None, so no print() is needed)
    model.summary()
    return model


def train_save_model(dga_model, X, y, n_epochs=20, 
                     filename='dga_model'):
    """
    Train DGA deep learning model and save trained model several ways.
    Args:
        dga_model: compiled Keras model
        X: features in training set
        y: labels in training set
        n_epochs: number of training epochs (default 20)
        filename: what to save file as
    Returns:
        dga_model: trained Keras model
    """

    dga_model.summary()
    start = time.time()
    history_fit = dga_model.fit(X, 
                          y,
                          batch_size = 100,
                          epochs = n_epochs,
                          shuffle = True,
                          verbose = 1,
                          validation_split = 0.2)
    
    # Ensure that variables do not get reinitialized right after training
    manual_variable_initialization(True)

    # Save model as .h5 and .hdf5 (Keras default)
    dga_model.save(filename+'.h5') 
    dga_model.save(filename+'.hdf5')
    
    print("Saved model")
    return dga_model

## (Open-source) training datasets

For this example, we pull open-source datasets containing known DGA domains ([Bambenek DGA Archive](https://osint.bambenekconsulting.com/feeds/dga-feed.txt)) and popular domains ([Majestic Million](https://majestic.com/reports/majestic-million)). The Bambenek DGA Archive contains malicious names generated by more than 20 DGA families, as well as domains from several botnets, including Conficker, Kraken, Torpig, and Kwyjibo. The Majestic Million dataset contains the million domains with the most referring subnets; we assume that the majority of these domains are benign. Other approaches for generating datasets to train DGA detection models include taking historical network traffic and flagging domains that have never resolved and/or have resulted in an NXDOMAIN response. See the Yu et al. 2017 publication for further details.

To preprocess the domain name inputs, we convert the strings to lowercase, map each character to its ASCII code (i.e. strings are mapped to ints), and zero-pad the result up to the maximum length for a domain name.

In [3]:
from sklearn.model_selection import train_test_split

def vectorize_domain(domain, max_len):
    """
    Convert a single domain name to a 2d array of integers (ASCII encoding).

    Args:
        domain: the domain name to be vectorized (ex: 'goog.com')
        max_len: length to zero-pad or truncate the domain to
    Returns:
        Numpy array (dim 1 x max_len) containing vectorized name
    """
    data = np.zeros((max_len), dtype=np.int32)
    for index in range(0, len(domain)):
        if index < max_len:
            data[index] = ord(domain[index])
    return np.array(data, dtype=np.int32, ndmin=2)

def column_vectorizer(domain_array, max_len=110):
    """
    Helper function takes a numpy array column or pandas series values
    containing domain names, vectorizes the names, and outputs a numpy array.

    Args:
        domain_array: array of domain names
        max_len: Max length of domain (Default 110)

    Returns:
        array containing vectorized domain names
        (dim: n_domains x max_len)
    """
    domain_data = np.zeros((len(domain_array), max_len), dtype=np.int32)

    for i in range(len(domain_array)):
        domain_data[i] = vectorize_domain(domain_array[i], max_len)

    return np.array(domain_data, dtype=np.int32)

def stratified_split_dga_benign(benign_domains_raw,
                                dga_domains_raw,
                                prop=0.2,
                                rs=42):
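    """
    Combine DGA and benign domains, then split into Train and Test sets.

    Labels are the DGA family names, with 'benign' for benign domains.
    The split is random; pass stratify=y to train_test_split to preserve
    the family proportions exactly.

    Args:
        benign_domains_raw: DataFrame of benign domains
        dga_domains_raw: DataFrame with 'Domain' and 'Family' columns
        prop: proportion to use as test
        rs: random seed

    Returns:
        X_train: domain names in training set
        X_test: domain names in testing set
        y_train: labels in training set
        y_test: labels in testing set

    """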
    X_dga = dga_domains_raw['Domain'].values
    y_dga = dga_domains_raw['Family'].values

    X_benign = benign_domains_raw.values.flatten()
    y_benign = np.tile('benign', (X_benign.shape[0]))

    X = np.concatenate((X_dga, X_benign))
    y = np.concatenate((y_dga, y_benign))

    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        test_size=prop,
                                                        random_state=rs)
    return [X_train, X_test, y_train, y_test]

In [4]:
# Example of vectorized domain
print("Vectorized google.com is: \n", vectorize_domain('google.com', 100))

Vectorized google.com is: 
 [[103 111 111 103 108 101  46  99 111 109   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0]]
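
As a quick sanity check on the encoding shown above, a vector can be decoded back into a string by dropping the zero padding and mapping each code back to a character. This is only a sketch: `decode_domain` is a hypothetical helper that is not part of the original notebook, and `encode_domain` is a compact stand-in for `vectorize_domain` (returning a 1d array) so that the block runs on its own.

```python
import numpy as np

def encode_domain(domain, max_len):
    """Stand-in for vectorize_domain: ASCII codes, zero-padded or truncated."""
    data = np.zeros(max_len, dtype=np.int32)
    codes = [ord(ch) for ch in domain[:max_len]]
    data[:len(codes)] = codes
    return data

def decode_domain(vector):
    """Hypothetical inverse: skip the zero padding and rebuild the string."""
    return "".join(chr(int(code)) for code in vector if code != 0)

encoded = encode_domain("google.com", 63)
decoded = decode_domain(encoded)
print(encoded[:10])
print(decoded)
```

Because 0 is used as the padding value, the round trip only works for names that contain no NUL characters, which holds for ordinary domain names.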


In [6]:
# Load datasets. These are saved locally (you will have to pull them from the URLs above)
# Load Bambenek DGA feeds
bamb_df = pd.read_csv('~/DGA_data/merged_feeds.csv')
print(bamb_df.head(n=5))
print("DGA Size:", bamb_df.shape)

# Load majestic millions
mm_df = pd.read_csv('~/DGA_data/majestic_million.csv')
print(mm_df.head(n=5))
print("Benign Size:", mm_df.shape)

# Stratified split into train/test datasets
X_train, X_test, y_train_multiclass, y_test_multiclass = stratified_split_dga_benign(
    mm_df, bamb_df, prop=0.2)
y_train = y_train_multiclass != 'benign'
y_train = y_train.astype(int)
y_test = y_test_multiclass != 'benign'
y_test = y_test.astype(int)

print(X_train)
print(y_train_multiclass)
print(y_train)

# Vectorize the domain names
X_train_vec = column_vectorizer(X_train, max_len)
print(X_train_vec.shape)
X_test_vec = column_vectorizer(X_test, max_len)
print(X_test_vec.shape)

   Unnamed: 0             Domain Family
0           0  isdpjdnnjhibq.com     cl
1           1  vegmjnsqtwuka.net     cl
2           2  jtfansoakdfga.biz     cl
3           3   qsjiqfrcvnhch.ru     cl
4           4  eiivuknlmtrxx.org     cl
DGA Size: (957049, 3)
          Domain
0     google.com
1   facebook.com
2    youtube.com
3    twitter.com
4  microsoft.com
Benign Size: (1000000, 1)
['childhoodexpect.net' 'urbackup.org' 'bvohellefrictionlessv.com' ...
 'owuekspqehcr.com' 'mntqnwesbssp.com' 'kkmngrvwtuxw.pw']
['pizd' 'benign' 'banjori' ... 'tinba' 'ramnit' 'tinba']
[1 0 1 ... 1 1 1]
(1565639, 63)
(391410, 63)


## Train and save a CNN in Keras 

In [19]:
dga_model = build_compile_model()
train_save_model(dga_model, X_train_vec, y_train, n_epochs=20, 
                     filename='model')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 63, 128)           32768     
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 63, 1000)          257000    
_________________________________________________________________
dropout_5 (Dropout)          (None, 63, 1000)          0         
_________________________________________________________________
flatten_5 (Flatten)          (None, 63000)             0         
_________________________________________________________________
dense_9 (Dense)              (None, 100)               6300100   
_________________________________________________________________
dense_10 (Dense)             (None, 1)                 101       
Total params: 6,589,969
Trainable params: 6,589,969
Non-trainable params: 0
_________________________________________________________________


KeyboardInterrupt: 
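
The parameter counts in the summary above can be reproduced by hand from the layer definitions. The arithmetic below is a sketch for checking them; the variable names `n_filters`, `kernel_size`, and `dense_units` are introduced here for illustration, with values copied from `build_compile_model`.

```python
# Values taken from the model definition
input_tokens = 256   # embedding vocabulary size
embed_size = 128     # embedding dimension
max_len = 63         # input sequence length
n_filters = 1000     # Conv1D filters
kernel_size = 2      # Conv1D kernel width
dense_units = 100    # width of the hidden Dense layer

# Embedding: one embed_size vector per token
embedding_params = input_tokens * embed_size
# Conv1D: kernel weights plus one bias per filter
conv_params = kernel_size * embed_size * n_filters + n_filters
# padding='same' keeps the sequence length at max_len, so Flatten yields
# max_len * n_filters features
flatten_size = max_len * n_filters
# Dense layers: weights plus biases
dense_params = flatten_size * dense_units + dense_units
output_params = dense_units * 1 + 1

total_params = embedding_params + conv_params + dense_params + output_params
print(embedding_params, conv_params, dense_params, output_params)
print(total_params)
```

Most of the parameters sit in the first Dense layer, because it is fully connected to all 63,000 flattened convolution outputs.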

## Training on multiple GPUs in Keras

The code below is adapted from the Keras documentation and from an online [demo](https://www.pyimagesearch.com/2017/10/30/how-to-multi-gpu-training-with-keras-python-and-deep-learning/). Keras `multi_gpu_model` uses data parallelism to partition the workload over multiple devices. With n devices, each device receives a copy of the complete model and processes 1/n of each batch. Results such as gradients and the updated model are then communicated across the devices. Data parallelism is also available in Gluon, MXNet, and TensorFlow.
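
To make the idea of data parallelism concrete, the NumPy sketch below splits one batch of 256 samples into 8 equal shards, computes a gradient on each shard, and averages the results. It uses a toy linear model with a mean squared error loss as an assumption for illustration; it is not how Keras implements `multi_gpu_model` internally, and the function `mse_gradient` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_devices = 8
batch_size = 256

# Toy regression batch; the weight vector stands in for a full model replica
X = rng.normal(size=(batch_size, 4))
y = rng.normal(size=batch_size)
w = np.zeros(4)

def mse_gradient(X_part, y_part, w):
    """Gradient of the mean squared error of a linear model on one shard."""
    residual = X_part @ w - y_part
    return 2.0 * X_part.T @ residual / len(y_part)

# Each "device" gets 1/n of the batch and the same copy of the weights
X_shards = np.array_split(X, n_devices)
y_shards = np.array_split(y, n_devices)
shard_grads = [mse_gradient(xs, ys, w) for xs, ys in zip(X_shards, y_shards)]

# Averaging the per-shard gradients recovers the full-batch gradient
averaged_grad = np.mean(shard_grads, axis=0)
full_grad = mse_gradient(X, y, w)

print([len(ys) for ys in y_shards])
print(np.allclose(averaged_grad, full_grad))
```

The equality holds exactly here because all shards have the same size; with uneven shards, the per-shard gradients would need to be weighted by shard size.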

In [None]:
from keras.utils import multi_gpu_model
NUM_GPUS = 8

dga_model = build_compile_model()

# Replicates `dga_model` on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(dga_model, gpus=NUM_GPUS)
parallel_model.compile(loss='binary_crossentropy',
                       optimizer='adam')

# This `fit` call will be distributed on 8 GPUs.
# Since the batch size is 256, each GPU will process 32 samples.
parallel_model.fit(X_train_vec, y_train, epochs=20, batch_size=256)

## Inference on multiple GPUs

We can load a copy of a saved Keras model onto each of several GPUs, and again use **data parallelism** to deliver a subset of each data batch to each GPU for inference. This example uses Python's `multiprocessing` module to initialize one worker per GPU. In a deployment environment, data arrives from an [Apache Kafka](https://kafka.apache.org/) topic, which has its own API and controls the batch size. The example below should work with a pandas DataFrame or NumPy array as input.
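
The splitting and collection pattern can be tried without any GPU. The sketch below is a CPU-only stand-in under stated assumptions: threads replace the per-GPU processes, and `fake_predict` is a hypothetical placeholder for `model.predict` that scores a domain by its length. It only illustrates how `np.array_split` divides the input among workers and how the results are gathered in order.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def fake_predict(batch):
    """Hypothetical stand-in for model.predict: score = length / 63."""
    return [len(domain) / 63.0 for domain in batch]

domains = np.array(["google.com", "urbackup.org", "owuekspqehcr.com",
                    "mntqnwesbssp.com", "kkmngrvwtuxw.pw", "youtube.com",
                    "twitter.com"])
n_workers = 3

# One sub-array per worker; sizes differ by at most one element
batches = np.array_split(domains, n_workers)

# pool.map returns results in the same order as the input batches
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    results = list(pool.map(fake_predict, batches))

# Flatten the per-worker results back into one list of scores
scores = [score for batch_scores in results for score in batch_scores]
print([len(batch) for batch in batches])
print(len(scores) == len(domains))
```

With real GPU workers, separate processes are the safer choice, since each process can set `CUDA_VISIBLE_DEVICES` before loading its own copy of the model.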

In [None]:
import os
import multiprocessing as mp
from multiprocessing import Process, Manager, set_start_method

from keras.models import load_model

def worker(batch, gpuid, d, filename='saved_model.h5'):
    """Worker function.

    Sets the number of visible cuda devices and executes the main function
    for a deep learning model.

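    Args:
        batch: the rows of input data assigned to this worker
        gpuid: the GPU that will be utilized by the process
        d: the shared dictionary that collects results under 'output'
        filename: path to the saved Keras model

    Returns:
        None.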

    """

    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpuid)

    df = pd.DataFrame(batch).fillna(value="none")
    model = load_model(filename)
    # The pad length (75 here) must match the input length of the saved model
    domain_vec = column_vectorizer(df['query'].str[0], 75)
    df['pred'] = model.predict(domain_vec)

    split = np.array_split(df[['pred','flow_id']],5)

    for batch in split:
        result = batch.to_csv(index=False, header=False)
        d['output'] = d['output'] + [result]
    return


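def execute_inference_batching(data):
    """Split the data across GPU workers and collect their predictions."""
    # Shared dictionary that the worker processes write their results to
    manager = Manager()
    d = manager.dict()
    d['output'] = []

    # Split array into multiple sub-arrays, one per worker (one worker per GPU)
    batches = np.array_split(np.array(data), 3)
    workers = []
    gid = 0

    for batch in batches:
        worker_assignment = Process(target=worker,
                                    args=(batch.tolist(), gid, d))
        workers.append(worker_assignment)
        worker_assignment.start()
        gid += 1

    # Wait for all workers to finish
    for w in workers:
        w.join()

    print("at producer for loop")
    for arr in d['output']:
        print('length: ')
        print(len(arr))
    return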

In [None]:
# Call the function as
# execute_inference_batching(data)