# DGA domain classifier

This notebook demonstrates real-time classification of domains produced by Domain Generation Algorithms (`DGA`), using a machine learning model applied to logs stored in the Devo platform.

First, a model is trained with the `H2O` engine and registered in Devo using the Python ML Model Manager Client.

Then, domains are classified in real time using the `mlevalmodel` operation from the Devo query engine.

## Requirements

Access to the *demo.ecommerce.data* table in Devo.

## Install

In [None]:
!pip install devo-sdk
!pip install devo-mlmodelmanager
!pip install h2o

## Imports

In [None]:
import os
import h2o

from h2o.estimators import H2OGradientBoostingEstimator
from devo.api import Client, ClientConfig, SIMPLECOMPACT_TO_OBJ
from devo_ml.modelmanager import create_client_from_token, engines

## Setup

In [None]:
# A valid Devo access token
TOKEN = ''

# URL of Devo API, e.g. https://apiv2-us.devo.com/search/query/
DEVO_API_URL = ''

# URL of Devo ML Model Manager, e.g. https://api-us.devo.com/mlmodelmanager/
DEVO_MLMM_URL = ''

# The domain to connect to, e.g. self
DOMAIN = ''

# The name of the model
NAME = 'dga_classifier'

# The description of the model
DESCRIPTION = 'DGA domain classifier'

# The path where model file will be stored
MODELS_PATH = 'models'

# The URL of a dataset to build the model
DATASET_URL = "https://devo-ml-models-public-demos.s3.eu-west-3.amazonaws.com/legit_dga/dataset.csv"

# Characters counted as vowels when computing the vowel proportion
VOWELS = "aeiouAEIOU"
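
The connection settings above are left empty to be filled in by hand. As a hedged alternative, the sketch below reads them from environment variables so the token does not have to be written into the notebook. The variable names (`DEVO_TOKEN`, `DEVO_API_URL`, `DEVO_MLMM_URL`, `DEVO_DOMAIN`) and the `load_settings` helper are hypothetical and not part of any Devo library.

```python
import os

# Hypothetical environment variable names; adapt them to your own setup.
ENV_NAMES = {
    "TOKEN": "DEVO_TOKEN",
    "DEVO_API_URL": "DEVO_API_URL",
    "DEVO_MLMM_URL": "DEVO_MLMM_URL",
    "DOMAIN": "DEVO_DOMAIN",
}


def load_settings(environ=os.environ):
    """Return the settings found in `environ`, failing if any is missing or empty."""
    missing = [var for var in ENV_NAMES.values() if not environ.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: environ[var] for key, var in ENV_NAMES.items()}
```

Calling `settings = load_settings()` and then assigning, for example, `TOKEN = settings["TOKEN"]` would keep the rest of the notebook unchanged.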

## ML model

In [None]:
h2o.init()

In [None]:
# Import the dataset; header=1 marks the first row as the header
domains = h2o.import_file(DATASET_URL, header=1)

In [None]:
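# Prepare the data set by deriving the columns used for training:
#   1. Domain length
#   2. Shannon entropy
#   3. Vowel proportion
#   4. Malicious flag (the target)

# Drop rows that have no subclass
domains = domains[~domains['subclass'].isna()]
domains['length'] = domains['domain'].nchar()
domains['entropy'] = domains['domain'].entropy()
# Count every vowel occurrence, then divide by the domain length
domains['vowel_proportion'] = 0
for v in VOWELS:
    domains['vowel_proportion'] += domains['domain'].countmatches(v)
domains['vowel_proportion'] /= domains['length']
# Every class other than 'legit' is flagged as malicious;
# converting the flag to a factor makes H2O train a classifier
domains['malicious'] = domains['class'] != 'legit'
domains['malicious'] = domains['malicious'].asfactor()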
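
To make the three features concrete, here is a minimal pure-Python sketch that computes them for a single domain string. It is only an illustration, not the code the H2O frame runs: the `shannon_entropy` and `domain_features` helpers are hypothetical, and the entropy is assumed to be measured in bits (base-2 logarithm) over the character distribution.

```python
import math
from collections import Counter

VOWELS = "aeiouAEIOU"


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits (assumed base 2)."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def domain_features(domain):
    """Return length, entropy and vowel proportion for one domain string."""
    length = len(domain)
    vowels = sum(1 for ch in domain if ch in VOWELS)
    return {
        "length": length,
        "entropy": shannon_entropy(domain),
        "vowel_proportion": vowels / length if length else 0.0,
    }


print(domain_features("google.com"))
```

A string of random characters tends to have higher entropy and a lower vowel proportion than a dictionary-based name, which is the intuition these features try to capture.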

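In [None]:
# Roughly 80% of the rows for training and the rest for validation;
# the fixed seed makes the split reproducible
train, valid = domains.split_frame(ratios=[.8], seed=1234)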

In [None]:
model = H2OGradientBoostingEstimator(model_id=NAME)

In [None]:
model.train(
    x=['length', 'entropy', 'vowel_proportion'],
    y='malicious',
    training_frame=train,
    validation_frame=valid
)

In [None]:
# Create the models directory if it does not exist
os.makedirs(MODELS_PATH, exist_ok=True)

In [None]:
# Export the trained model as a MOJO archive named after its model_id
model.download_mojo(path=MODELS_PATH)

In [None]:
h2o.cluster().shutdown()

## Register the model in Devo

In [None]:
mlmm = create_client_from_token(DEVO_MLMM_URL, TOKEN)

In [None]:
mlmm.add_model(
    NAME,
    engines.H2O,
    os.path.join(MODELS_PATH, f"{NAME}.zip"),
    description=DESCRIPTION,
    force=True
)

## Classify DGA domains

In [None]:
# Use the mlevalmodel operation in the query to evaluate the model.
# The features must be computed as in training, so the vowel count
# is divided by the domain length to obtain a proportion.
query = f'''from demo.ecommerce.data
  select split(referralUri, "/", 2) as domain,
  float(length(domain)) as length,
  shannonentropy(domain) as entropy,
  float(countbyfilter(domain, "{VOWELS}")) / length as vowel_proportion,
  mlevalmodel("{DOMAIN}", "{NAME}", length, entropy, vowel_proportion) as class
'''
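
The query takes the domain from `referralUri` by splitting on `/` and keeping the piece at position 2. The Python sketch below mimics that step to show why position 2 holds the host of an absolute URL. It assumes the Devo `split` operation uses a zero-based index; the `split_piece` helper and the sample URL are made up for illustration.

```python
from urllib.parse import urlsplit


def split_piece(text, separator, index):
    """Return the piece at zero-based position `index`, or None if it is absent."""
    pieces = text.split(separator)
    return pieces[index] if index < len(pieces) else None


url = "https://www.example.com/cart?id=1"  # illustrative URL, not real data

# "https:", "", "www.example.com", "cart?id=1" -> position 2 is the host
host = split_piece(url, "/", 2)

# The standard library parser agrees for this URL
assert host == urlsplit(url).netloc
print(host)
```

Values without a scheme (for example a bare path) do not have a host at position 2, so such rows would yield an empty or unrelated piece rather than a domain.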

In [None]:
api = Client(
    auth={"token": TOKEN},
    address=DEVO_API_URL,
    config=ClientConfig(
        response="json/simple/compact",
        stream=True,
        processor=SIMPLECOMPACT_TO_OBJ
    )
)

In [None]:
# Run the query over the last hour of data
response = api.query(query=query, dates={'from': "now()-1*hour()"})

In [None]:
for row in response:
    print(row)
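
Printing every row can be noisy on a busy table. As a hedged sketch, the rows could instead be tallied by predicted class. This assumes each row is a dict keyed by the field names of the query (`domain`, `class`, and so on); the `tally_by` helper is hypothetical and the sample rows and their class values are placeholders, not real output.

```python
from collections import Counter


def tally_by(rows, key="class"):
    """Count how many rows carry each value of `key` (None when the key is absent)."""
    return Counter(row.get(key) for row in rows)


# Placeholder rows for illustration only; real rows come from `response`,
# and the actual class values depend on the registered model.
sample_rows = [
    {"domain": "www.example.com", "class": "A"},
    {"domain": "shop.example.org", "class": "A"},
    {"domain": "xkqjzvbw.example.net", "class": "B"},
]

counts = tally_by(sample_rows)
for value, count in counts.most_common():
    print(value, count)
```

In the notebook, `tally_by(response)` would consume the streamed rows once, so it should replace the printing loop rather than follow it.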
