# Worksheet: DGA Detection

For this exercise we will use the following libraries:
* Pandas (http://pandas.pydata.org/pandas-docs/stable/)
* Numpy (https://docs.scipy.org/doc/numpy/reference/)
* Matplotlib (http://matplotlib.org/api/pyplot_api.html)
* Scikit-learn (http://scikit-learn.org/stable/documentation.html)
* YellowBrick (http://www.scikit-yb.org/en/latest/)
* Seaborn (https://seaborn.pydata.org)

In [None]:
# Import the libraries

import pandas as pd
import numpy as np
import re
from collections import Counter
from sklearn import feature_extraction, tree, model_selection, metrics
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
%matplotlib inline
from yellowbrick.features import Rank2D
from yellowbrick.features import RadViz

## Part 1: Feature Engineering

In [None]:
# Load the provided dataset
df = pd.read_csv('dga_data_small.csv')
df.drop(['host', 'subclass'], axis=1, inplace=True)
print(df.shape)
df.sample(n=5).head() # print a random sample of the DataFrame

In [None]:
df[df.isDGA == 'legit'].sample(5)

In [None]:
df[df.isDGA == 'dga'].sample(5)

**List of features to derive (based on academic papers)**:

1. Length ["length"]
2. Number of digits ["digits"]
3. Entropy ["entropy"] - use `H_entropy`
4. Vowel to consonant ratio ["vowel-cons"] - use `vowel_consonant_ratio`
5. Index of the first digit ["firstDigitIndex"] - use `firstDigitIndex`
6. n-grams ["ngrams"] - use `ngram_feature` and `average_ngram_feature`

In [None]:
import pickle

# Load the dictionary of counts derived from common English words
with open('d_common_en_words.pickle', 'rb') as f:
    d = pickle.load(f)

def H_entropy (x):
    # Calculate Shannon Entropy
    prob = [ float(x.count(c)) / len(x) for c in dict.fromkeys(list(x)) ] 
    H = - sum([ p * np.log2(p) for p in prob ]) 
    return H

def firstDigitIndex( s ):
    # Return the 1-based position of the first digit in s, or 0 if s has no digits
    for i, c in enumerate(s):
        if c.isdigit():
            return i + 1
    return 0

def vowel_consonant_ratio (x):
    # Calculate vowel to consonant ratio
    x = x.lower()
    vowels_pattern = re.compile('([aeiou])')
    consonants_pattern = re.compile('([b-df-hj-np-tv-z])')
    vowels = re.findall(vowels_pattern, x)
    consonants = re.findall(consonants_pattern, x)
    try:
        ratio = len(vowels) / len(consonants)
    except ZeroDivisionError:  # the string contains no consonants
        ratio = 0
    return ratio

# ngrams: Implementation according to Schiavoni 2014: "Phoenix: DGA-based Botnet Tracking and Intelligence"
# http://s2lab.isg.rhul.ac.uk/papers/files/dimva2014.pdf

def ngrams(word, n):
    # Extract all ngrams and return a regular Python list
    # Input word: can be a single string or a list of strings
    # Input n: can be a single integer, or a list of integers
    # if you want to extract multiple ngram lengths and have them all in one list
    
    l_ngrams = []
    if isinstance(word, list):
        for w in word:
            if isinstance(n, list):
                for curr_n in n:
                    ngrams = [w[i:i+curr_n] for i in range(0,len(w)-curr_n+1)]
                    l_ngrams.extend(ngrams)
            else:
                ngrams = [w[i:i+n] for i in range(0,len(w)-n+1)]
                l_ngrams.extend(ngrams)
    else:
        if isinstance(n, list):
            for curr_n in n:
                ngrams = [word[i:i+curr_n] for i in range(0,len(word)-curr_n+1)]
                l_ngrams.extend(ngrams)
        else:
            ngrams = [word[i:i+n] for i in range(0,len(word)-n+1)]
            l_ngrams.extend(ngrams)
#     print(l_ngrams)
    return l_ngrams

def ngram_feature(domain, d, n):
    # Input domain: the domain string
    # Input d: a dictionary object that contains the counts for the most common English words
    # Input n: the ngram length (the normalization below assumes a single int)
    
    # Core idea: look up each ngram of the domain in the English dictionary d and sum up
    # the matching dictionary counts;
    # the sum is then normalized by the number of ngrams in the domain
    
    l_ngrams = ngrams(domain, n)
#     print(l_ngrams)
    count_sum=0
    for ngram in l_ngrams:
        if d[ngram]:
            count_sum+=d[ngram]
    try:
        feature = count_sum/(len(domain)-n+1)
    except:
        feature = 0
    return feature
    
def average_ngram_feature(l_ngram_feature):
    # input is a list of calls to ngram_feature(domain, d, n)
    # usually you would use various n values, like 1,2,3...
    return sum(l_ngram_feature)/len(l_ngram_feature)
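
To make the entropy and n-gram features more concrete, here is a minimal, self-contained sketch. It is not the worksheet's own implementation: `shannon_entropy` and `char_ngrams` are hypothetical stand-ins for `H_entropy` and `ngrams`, written with the standard library only. The two sample strings are taken from the test domains used later in this worksheet.

```python
import math
from collections import Counter

def shannon_entropy(s):
    # Shannon entropy in bits: -sum(p * log2(p)) over the distinct characters of s
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def char_ngrams(s, n):
    # Sliding window of width n over the string (hypothetical helper)
    return [s[i:i + n] for i in range(len(s) - n + 1)]

for sample in ["google", "1vxznov16031kjxneqjk1rtofi6"]:
    print(sample, round(shannon_entropy(sample), 3), char_ngrams(sample, 2)[:3])
```

A string that repeats few characters has low entropy, while a long string that spreads its characters evenly has high entropy, which is why this feature tends to help separate algorithmically generated names from human-chosen ones.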

In [None]:
df['ngrams'] = df['domain'].apply(lambda x: average_ngram_feature([ngram_feature(x, d, 1), 
                                                                ngram_feature(x, d, 2), 
                                                                ngram_feature(x, d, 3)]))

# check final 2D pandas DataFrame containing all final features and the target vector isDGA
df.sample(10)
df['entropy'] = df['domain'].apply(H_entropy)
df['vowel-cons'] = df['domain'].apply(vowel_consonant_ratio)
df['firstDigitIndex'] = df['domain'].apply(firstDigitIndex)
# Compute the missing features: length and digits
# Your code:




# Encode the isDGA column: dga as 1, and legit as 0
# Your code:



print(df['isDGA'].value_counts())
df.sample(n=5).head()
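
As a hint for the two tasks in the cell above, the following sketch shows one possible approach on a tiny made-up DataFrame. The name `toy` and its two rows are assumptions for illustration only; the same pandas calls would need to be adapted to `df`.

```python
import pandas as pd

# Hypothetical two-row frame that mimics the 'domain' and 'isDGA' columns
toy = pd.DataFrame({
    "domain": ["google", "1vxznov16031kjxneqjk1rtofi6"],
    "isDGA": ["legit", "dga"],
})

# Length of each domain string
toy["length"] = toy["domain"].str.len()

# Number of digit characters, counted with a regular expression
toy["digits"] = toy["domain"].str.count(r"[0-9]")

# Map the string labels to integers: legit -> 0, dga -> 1
toy["isDGA"] = toy["isDGA"].map({"legit": 0, "dga": 1})

print(toy)
```

Using the vectorized `.str` accessor avoids an explicit Python loop, although `apply` with `len` would work equally well.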


In [None]:
df_final = df
df_final = df_final.drop(['domain'], axis=1)
df_final.to_csv('dga_features_final_df.csv', index=False)
df_final.head()

## Data Visualization

In [None]:
feature_names = ['length','digits','entropy','vowel-cons','firstDigitIndex', 'ngrams']
features = df_final[feature_names]
target = df_final['isDGA']

In [None]:
sns.pairplot(df_final, hue='isDGA', vars=feature_names)

Explain how to interpret the entropy feature against any of the other features.

Chosen plot and interpretation:

In [None]:
visualizer = Rank2D(algorithm='pearson',features=feature_names)
visualizer.fit_transform( features )
visualizer.poof()
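
Each cell of the Rank2D plot with `algorithm='pearson'` is a Pearson correlation coefficient between two features. The sketch below computes that coefficient by hand and checks it against NumPy. The `length` and `entropy` values are made-up numbers for illustration, not taken from the dataset.

```python
import numpy as np

# Hypothetical feature values for five domains
length = np.array([6.0, 9.0, 12.0, 20.0, 27.0])
entropy = np.array([1.9, 2.5, 2.9, 3.6, 3.9])

# Pearson r = covariance(x, y) / (std(x) * std(y)), written with centered values
dx = length - length.mean()
dy = entropy - entropy.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# NumPy returns the full 2x2 correlation matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(length, entropy)[0, 1]

print(round(r_manual, 4), round(r_numpy, 4))
```

A value near +1 means the two features grow together, a value near -1 means one falls as the other grows, and a value near 0 means there is no linear relationship.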

Explain how to interpret the correlation between the entropy and length features.

Answer:


In [None]:
features = df_final[feature_names].values
target = df_final['isDGA'].values

radvizualizer = RadViz(classes=['Benign','isDga'], features=feature_names)
radvizualizer.fit_transform( features, target)
radvizualizer.poof()

Interpret the plot generated by the RadViz algorithm.

Answer:

## Part 2: Model Implementation

### Step 1: Prepare the feature matrix and the target vector

- In statistics, the feature matrix is usually known as `X`
- The target is a vector that contains the label for each domain (also known as *y* in statistics)
- In sklearn, both X and the target can be pandas DataFrames/Series or NumPy arrays/vectors (avoid plain Python lists)

Task:
- Assign the 'isDGA' column to a pandas Series and name it 'target'
- Drop the 'isDGA' column from the `df_final` DataFrame and name the resulting DataFrame 'feature_matrix'

In [None]:
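target = df_final['isDGA']
# Drop the label and fix the column order to match feature_names,
# which is the order that is_dga() uses later when building its feature vector
feature_matrix = df_final.drop(['isDGA'], axis=1)[feature_names]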
print('Final features', feature_matrix.columns)

feature_matrix.head()

### Step 2: Data split

- Split the dataset into training data (75%) and test data (25%), stored in the variables feature_matrix_train, feature_matrix_test, target_train, target_test
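
A 75/25 split can be sketched with scikit-learn's `train_test_split`, shown here on synthetic arrays rather than on the worksheet's data. The names `X` and `y`, the fixed `random_state`, and the `stratify` argument are assumptions for illustration; the task itself only asks for the split ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples with 2 features each, and balanced binary labels
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# test_size=0.25 keeps 75% for training; stratify=y preserves the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible between runs, which helps when comparing models.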

In [None]:
# Your code

In [None]:
feature_matrix_train.count()

In [None]:
feature_matrix_test.count()

In [None]:
target_train.head()

In [None]:
target_train.value_counts()

### Step 3: Model training

- Use sklearn's decision tree model [tree.DecisionTreeClassifier()](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) to create a model with the default parameters, then train it using the `.fit()` function with the `feature_matrix_train` and `target_train` data.

In [None]:
clf = tree.DecisionTreeClassifier()  # clf means classifier
clf = clf.fit(feature_matrix_train, target_train)

# Extract a row from the test data
test_feature = feature_matrix_test[192:193]
test_target = target_test[192:193]

# Make the prediction
pred = clf.predict(test_feature)
print('Predicted class:', pred)
print('Accurate prediction?', pred[0] == test_target.iloc[0])

In [None]:
pred[0] == test_target

### Step 4: Predictions

- To make predictions with the model, the same features must first be derived for the test domains

In [None]:
def is_dga(domain, clf, d):
    # Parameters: the domain, the trained model clf, and the English word counts dictionary d
    # Returns a prediction
    
    domain_features = np.empty([1,6])
    # feature order: ['length', 'digits', 'entropy', 'vowel-cons', 'firstDigitIndex', 'ngrams']
    domain_features[0,0] = len(domain)
    pattern = re.compile('([0-9])')
    domain_features[0,1] = len(re.findall(pattern, domain))
    domain_features[0,2] = H_entropy(domain)
    domain_features[0,3] = vowel_consonant_ratio(domain)
    domain_features[0,4] = firstDigitIndex(domain)
    domain_features[0,5] = average_ngram_feature([ngram_feature(domain, d, 1), 
                                                  ngram_feature(domain, d, 2), 
                                                  ngram_feature(domain, d, 3)])
    pred = clf.predict(domain_features)
    return pred[0]

Use the is_dga function to make predictions for the following test domains:

- microsoft
- google
- 1vxznov16031kjxneqjk1rtofi6

In [None]:
# Your code

### Step 5: Validation

- Compute the accuracy and the confusion matrix

In [None]:
target_pred = clf.predict(feature_matrix_test)
print(metrics.accuracy_score(target_test, target_pred))
print('Confusion Matrix\n', metrics.confusion_matrix(target_test, target_pred))

Label the matrix appropriately, and explain the value of:

- True positives
- True negatives
- False positives
- False negatives
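
For a binary problem, scikit-learn's `confusion_matrix` places the true labels on the rows and the predicted labels on the columns, so flattening it with `.ravel()` yields the four counts in the order TN, FP, FN, TP. The sketch below uses made-up label vectors (`y_true`, `y_pred`) with 1 standing for dga and 0 for legit.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = legit, 1 = dga
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes, both ordered as [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

In this context a false positive is a legitimate domain flagged as DGA, and a false negative is a DGA domain that slipped through as legitimate.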

In [None]:
print(metrics.classification_report(target_test, target_pred, target_names=['legit', 'dga']))

Explain what the values of the precision, recall, and f1-score metrics mean.
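
The three metrics can be derived directly from confusion-matrix counts. The following sketch uses made-up counts (`tp`, `fp`, `fn`) for the dga class and only illustrates the formulas; it does not use the worksheet's results.

```python
# Hypothetical counts for the positive (dga) class
tp, fp, fn = 8, 2, 4

# Precision: of everything predicted as dga, how much really was dga
precision = tp / (tp + fp)

# Recall: of everything that really was dga, how much was detected
recall = tp / (tp + fn)

# F1-score: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because the harmonic mean is pulled toward the smaller of the two values, the f1-score only gets high when precision and recall are both high.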