<h2 align="center">INFORMATION SECURITY MARKET ESTIMATION</h2> 

![title](imgs/cs_market_estimation.png)

<h2 align="center">ANOTHER TOOL TO DETECT MALWARE</h2> 

If computers on your network are infected, the attackers' management (command-and-control) server can communicate with them through domains that the malware generates automatically, and use those domains to send commands and updates.

An explanation is available at https://en.wikipedia.org/wiki/Domain_generation_algorithm

<h2 align="center">GENERATE DOMAINS BY YOURSELF</h2> 

![title](imgs/hacker.jpg)

In [1]:
# Example from Wikipedia

def generate_domain(year, month, day):
    """Generates a domain name for the given date."""
    domain = ""

    for i in range(16):
        year = ((year ^ 8 * year) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
        month = ((month ^ 4 * month) >> 25) ^ 16 * (month & 0xFFFFFFF8)
        day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
        domain += chr(((year ^ month ^ day) % 25) + 97)

    return domain

generate_domain(2018, 9, 20)

'hlxclhpmcajwaquf'
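
The point of such a generator is that it is deterministic: anyone who knows the algorithm and the date derives the same names. The sketch below illustrates this with a deliberately simple, hypothetical hash-based generator (`toy_dga`, its `seed` value and the `.com` suffix are assumptions made for illustration, not taken from any real malware family).

```python
import hashlib
from datetime import date, timedelta

def toy_dga(day, seed="demo-seed", length=12):
    """Hypothetical generator: hash the seed and the date, map hex digits to letters."""
    digest = hashlib.sha256(f"{seed}:{day.isoformat()}".encode()).hexdigest()
    # each hex digit 0..f becomes one of the letters a..p
    name = "".join(chr(97 + int(ch, 16)) for ch in digest[:length])
    return name + ".com"

def candidates(start, days):
    """Domains for `days` consecutive dates starting at `start`."""
    return [toy_dga(start + timedelta(days=i)) for i in range(days)]

# Both sides run the same code independently and get the same list.
infected_host = candidates(date(2018, 9, 20), 3)
operator = candidates(date(2018, 9, 20), 3)
print(infected_host == operator)
```

The same property helps defenders: once the algorithm is known, the upcoming domains can be computed in advance and used as labeled examples.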

<h2 align="center">DGA DOMAINS CLASSIFICATION</h2> 
<br /><center>Recognition of automatically generated domains can be considered as a binary classification problem.</center>

#### Do we have data for analysis?

Yes. If network traffic is logged in some way, we can pull the request logs of the computers on the network (corporate, home, public, or any other), extract the requested domain from each entry, analyze it, and issue a verdict that feeds the next decision.
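
As a rough sketch of that first step, the snippet below pulls domains out of a DNS query log. The four-column log format (timestamp, client IP, record type, queried name) and the sample lines are assumptions made for illustration; real resolvers and proxies each have their own format.

```python
from collections import Counter

# Hypothetical log excerpt; the domains are taken from the sample shown later.
sample_log = """\
2018-09-20T10:15:02 10.0.0.5 A workerscompensation.com
2018-09-20T10:15:03 10.0.0.7 A spnlnnvurq.org
2018-09-20T10:15:09 10.0.0.5 AAAA verraes.net
this line is malformed and gets skipped
2018-09-20T10:16:41 10.0.0.7 A spnlnnvurq.org
"""

def extract_domains(log_text):
    """Return the queried domain of every well-formed log line."""
    domains = []
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed format
        # normalise case and drop a trailing root dot, if any
        domains.append(fields[3].lower().rstrip("."))
    return domains

queried = Counter(extract_domains(sample_log))
print(queried.most_common())
```

Each distinct domain in `queried` can then be passed to the classifier built below.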

#### Do we have labeled data to learn and test the classifier?

Yes. Security engineers around the world try to obtain the algorithms that attackers use in order to prevent attacks, and there are two ways to get this information.
The first, widely used in industry, is reverse engineering.
The second is to capture the attackers' software itself.
Both work well.

See https://github.com/andrewaeva/DGA

<h2 align="center">Let's start</h2>

The first thing to do with this type of problem is to get the raw dataset and prepare it for the next step. Our dataset is already clean enough, so we can skip some of the cleaning steps that are common to most ML tasks.

In [None]:
import pandas as pd

# load dataset
all_legit = pd.read_csv('data/domains/all_legit.txt', delimiter=' ', names=['domain', 'label']).dropna()
all_dga = pd.read_csv('data/domains/all_dga.txt', delimiter=' ', names=['domain', 'label']).dropna()
raw_data = pd.concat([all_legit, all_dga])

# convert the label column to a boolean for binary classification:
# label 0 marks a legitimate domain, anything else a generated one
data = raw_data.copy()
data['label'] = data['label'] == 0
data.rename(columns={'label':'is_legitime'}, inplace=True)

# take a view
print(data['is_legitime'].value_counts())
data.sample(n=10)

True     1000000
False     801667
Name: is_legitime, dtype: int64


| | domain | is_legitime |
|---|---|---|
| 839127 | goxxxfuck.com | True |
| 658213 | xuatnhapcanh.com.vn | True |
| 559779 | workerscompensation.com | True |
| 523342 | spnlnnvurq.org | False |
| 741030 | verraes.net | True |
| 369438 | crossroadstrading.com | True |
| 969917 | lcsnw.org | True |
| 967331 | liyandigital.com | True |
| 774831 | insaneproductivity.com | True |
| 292860 | pahabeow.ru | False |


So we have a dataset, and it is reasonably balanced (0.555 positive : 0.445 negative).
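
The ratio follows directly from the class counts printed above. The short check below also shows the accuracy of a trivial classifier that always answers "legitimate", which is a useful floor for any model trained on this data.

```python
# Class counts copied from the value_counts() output above.
legit_count = 1_000_000
dga_count = 801_667
total = legit_count + dga_count

positive_share = legit_count / total
negative_share = dga_count / total
print(round(positive_share, 3), round(negative_share, 3))  # 0.555 0.445

# Always predicting the majority class is right exactly positive_share of the time.
majority_baseline_accuracy = max(positive_share, negative_share)
```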

Our next step is to extract features. We will not extract features such as the first letter or dummy variables for the letters a domain contains.
Instead, we will concentrate on the features that are likely to affect the predictions the most. Our experience in related fields, together with dozens of scientific papers, articles, and reports from companies and independent researchers in cybersecurity, tells us what to look for first. This is how data scientists usually approach a problem or build a baseline.
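
One such feature is the Shannon entropy of the characters in a domain. As a minimal, self-contained sketch (the helper name `shannon_entropy` is an assumption; the domains come from the sample above), it can be computed like this:

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy, in bits per character, of the character distribution of `text`."""
    counts = Counter(text)
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# One repeated character carries no information; four distinct ones carry 2 bits each.
print(shannon_entropy("aaaa"), shannon_entropy("abcd"))

# Names taken from the sample above (two labeled legitimate, two labeled generated).
for name in ("verraes", "lcsnw", "spnlnnvurq", "pahabeow"):
    print(name, round(shannon_entropy(name), 3))
```

Entropy also grows with the length of a name and the variety of its letters, so on its own it is a weak signal. That is one reason to combine it with the length and n-gram features.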

In [None]:
# TODO: add explanations and intuition for each feature

import numpy as np
import re
import math
from collections import Counter
import tldextract
from sklearn.feature_extraction.text import CountVectorizer

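def max_subword_len(word):
    """Length of the longest chunk left after splitting on digits, dots and hyphens."""
    return max(len(subword) for subword in re.split(r'[\d\.\-]+', word))

def specific_symbols_count(domain):
    """Number of digits, dots, hyphens and underscores in the domain."""
    return sum((symbol.isdigit() or symbol in ('.', '-', '_')) for symbol in domain)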

# count word entropy
def count_entropy(word):

    word_counter, word_len = Counter(word), float(len(word))

    return -sum(
        count / word_len * math.log(count / word_len, 2)
        for count in word_counter.values()
    )

# extract info from frequency vocabulary
def extract_vocab_n_counts(path):

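    words_lst = list()
    with open(path) as file:
        for line in file:
            # strip only the newline; line[:-1] would also cut a real
            # character from a last line that has no trailing newline
            words_lst.append(line.rstrip('\n'))
    lang = pd.DataFrame(words_lst, columns = ['word'])
    # character 3- to 5-grams; min_df=1e-5 drops n-grams seen in fewer
    # than 0.001% of the vocabulary words
    lang_vc = CountVectorizer(analyzer='char', ngram_range=(3, 5), min_df=1e-5, max_df=1.0)
    lang_counts_matrix = lang_vc.fit_transform(lang['word'])
    # log10 of each n-gram's total count across the vocabulary
    lang_counts = np.log10(lang_counts_matrix.sum(axis=0).getA1())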

    return lang_vc, lang_counts

# prepare features from domains
def extract_domain_based_features_to_df(blank_domains_df, vocab_data):

    domains_df = blank_domains_df.copy()

    domains_df['domain_name'] = domains_df['domain'].apply(lambda x: tldextract.extract(x).domain)
    domains_df['domain_name_len'] = domains_df['domain_name'].apply(len)
    domains_df['specific_symbols_count'] = domains_df['domain'].apply(specific_symbols_count)
    domains_df['max_word_len'] = domains_df['domain'].apply(max_subword_len)
    domains_df['percentage_of_specific'] = domains_df['specific_symbols_count'] / domains_df['domain_name_len']
    domains_df['entropy'] = domains_df['domain'].apply(count_entropy)
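    en_vc, en_counts, tr_vc, tr_counts = vocab_data
    # n-gram score: for every character n-gram found in the domain name,
    # add its log10 frequency in the vocabulary (weighted by how often it occurs in the name)
    domains_df['en_grams'] = en_counts * en_vc.transform(domains_df['domain_name']).T
    domains_df['tr_grams'] = tr_counts * tr_vc.transform(domains_df['domain_name']).T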

    return domains_df

en_vc, en_counts = extract_vocab_n_counts('data/vocabs/en.txt')
tr_vc, tr_counts = extract_vocab_n_counts('data/vocabs/tr.txt')
vocab_data = (en_vc, en_counts, tr_vc, tr_counts)

prepared_data = extract_domain_based_features_to_df(data, vocab_data)

The dataset is now prepared: we have feature vectors and labels for them.
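
To build intuition for the `en_grams` and `tr_grams` columns, here is a standard-library sketch of the same idea: score a name by the log frequency of its character n-grams in a word list. The tiny `toy_vocab` list and the helper names are assumptions made for illustration, and the sketch omits details of `CountVectorizer` such as the `min_df` cut-off.

```python
import math
from collections import Counter

def char_ngrams(word, n_min=3, n_max=5):
    """All character n-grams of `word` with n_min <= n <= n_max."""
    word = word.lower()
    return [word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]

# Hypothetical stand-in for a frequency vocabulary file.
toy_vocab = ["secure", "security", "market", "marketing", "trading", "digital"]

# How often each n-gram occurs across the vocabulary, on a log10 scale.
vocab_counts = Counter(g for w in toy_vocab for g in char_ngrams(w))
log_counts = {g: math.log10(c) for g, c in vocab_counts.items()}

def ngram_score(name):
    """Sum of log10 vocabulary counts over the n-grams of `name` (0 for unseen n-grams)."""
    return sum(log_counts.get(g, 0.0) for g in char_ngrams(name))

print(ngram_score("securemarket"), ngram_score("xqzvkw"))
```

A name built from fragments of real words shares many n-grams with the vocabulary and scores above zero, while a random-looking string shares none. Note that an n-gram seen only once contributes `log10(1) = 0`, so only repeated n-grams raise the score.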

Next, we split the data into a training set and a test set.

In [None]:
# TODO: visualise the feature descriptions
# per-class summary statistics: one (label, describe()) pair per class
features_description = [
    (label, group.describe()) for label, group in prepared_data.groupby('is_legitime')
]

# split the data into labels and features
X = prepared_data[
    [
        'domain_name_len', 'specific_symbols_count', 'max_word_len',
        'percentage_of_specific', 'entropy', 'en_grams', 'tr_grams'
    ]
]
y = prepared_data['is_legitime']

# split our data into train and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

We have the data, and we know that we are dealing with a binary classification problem.

So, what should we do?

Now we need to:

- select features
- select the model
- tune parameters
- train the classifier
- test our classifier
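
The steps above can be sketched end to end with scikit-learn. This is an illustration under stated assumptions, not the procedure used to pick the parameters in the next cell: the data is synthetic (`make_classification`), `GradientBoostingClassifier` stands in for the XGBoost model, and the parameter grid is a small hypothetical one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the seven domain features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Tune parameters: cross-validated search over a small hypothetical grid.
param_grid = {"max_depth": [2, 3], "n_estimators": [20, 40]}
search = GridSearchCV(
    GradientBoostingClassifier(learning_rate=0.1, random_state=42),
    param_grid,
    cv=3,
    scoring="f1",
)
search.fit(X_tr, y_tr)  # trains one model per grid point and fold, then refits the best

# Test: evaluate the refitted best model on data it has not seen.
best_model = search.best_estimator_
print(search.best_params_)
print(classification_report(y_te, best_model.predict(X_te)))
```

Keeping the test set out of the search matters: the parameters are chosen with cross-validation on the training part only, so the final report is not biased by the tuning.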

In [None]:
# TODO: walk through the steps instead of only showing the result

from sklearn import metrics
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

model = XGBClassifier(
    colsample_bytree=0.3,
    min_child_weight=0.1,
    learning_rate=0.1,
    max_depth=3,
    n_estimators=37,
    random_state=42
)

clf = model.fit(X_train,y_train)

y_pred = clf.predict(X_test)

class_names = ['non_legitime', 'legitime']

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    import itertools
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

plt.show()

print(metrics.classification_report(y_test, y_pred))
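
To read the report, it helps to see how its numbers follow from the confusion matrix. The sketch below recomputes precision, recall, F1 and accuracy for the positive class from four counts. The counts are hypothetical, not results of the model above, and the helper name `scores_from_confusion` is an assumption.

```python
def scores_from_confusion(tn, fp, fn, tp):
    """Metrics for the positive class from the four cells of a 2x2 confusion matrix."""
    precision = tp / (tp + fp)   # of everything predicted positive, how much is right
    recall = tp / (tp + fn)      # of everything truly positive, how much is found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts for illustration only.
scores = scores_from_confusion(tn=90, fp=10, fn=20, tp=80)
for metric, value in scores.items():
    print(f"{metric}: {value:.3f}")
```

With scikit-learn's layout (rows are true labels, columns are predictions, labels sorted), the four cells of a binary matrix can be unpacked as `tn, fp, fn, tp = cnf_matrix.ravel()`.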