# Case Study

This case study addresses the problem of detecting malware on Android devices by analyzing the network traffic the device generates, using decision trees.

[Canadian Institute for Cybersecurity](https://www.unb.ca/cic/datasets/android-adware.html)

# Description

## Android adware and general malware dataset (CIC-AAGM2017)
Sophisticated Android malware is able to identify the presence of the emulator used by the malware analyst and, in response, alter its behaviour to evade detection. To overcome this issue, we installed the Android applications on a real device and captured their network traffic.

The CICAAGM dataset was captured by installing the Android apps on real smartphones in a semi-automated way. The dataset is generated from 1,900 applications in the following three categories:

**1. Adware (250 apps)**
Airpush: Designed to deliver unsolicited advertisements to the user’s systems for information stealing.

Dowgin: Designed as an advertisement library that can also steal the user’s information.

Kemoge: Designed to take over a user’s Android device. This adware is a hybrid of botnet and disguises itself as popular apps via repackaging.

Mobidash: Designed to display ads and to compromise the user’s personal information.

Shuanet: Similar to Kemoge, Shuanet is also designed to take over a user’s device.

**2. General Malware (150 apps)**
AVpass: Designed to be distributed in the guise of a Clock app.

FakeAV: Designed as a scam that tricks the user into purchasing a full version of the software in order to remediate non-existent infections.

FakeFlash/FakePlayer: Designed as a fake Flash app in order to direct users to a website (after being successfully installed).

GGtracker: Designed for SMS fraud (sends SMS messages to a premium-rate number) and information stealing.

Penetho: Designed as a fake service (hacktool for Android devices that can be used to crack the WiFi password). The malware is also able to infect the user’s computer via infected email attachment, fake updates, external media and infected documents.

**3. Benign (1,500 apps)**
2015 GooglePlay market (top free popular and top free new)

2016 GooglePlay market (top free popular and top free new)

License
The CICAAGM dataset, which consists of the following items, is publicly available for researchers.

.pcap files – the network traffic of both the malware and benign (20% malware and 80% benign)

.csv files - the list of extracted network traffic features generated by the CIC-flowmeter

If you are using our dataset, you should cite our related paper that outlines the details of the dataset and its underlying principles:

Arash Habibi Lashkari, Andi Fitriah A. Kadir, Hugo Gonzalez, Kenneth Fon Mbah and Ali A. Ghorbani, “Towards a Network-Based Framework for Android Malware Detection and Characterization”, In the proceedings of the 15th International Conference on Privacy, Security and Trust, PST, Calgary, Canada, 2017.

[Download Dataset](http://cicresearch.ca//CICDataset/CICAndAdGMal2017/)

# Imports

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import confusion_matrix, recall_score, f1_score, precision_score
from pandas import DataFrame

### Helper Functions

In [None]:
# Helper that performs the full split: 60% train, 20% validation, 20% test
def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)
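
As a quick sanity check, the helper above should yield a 60/20/20 split, since 40% of the rows are held out first and then halved. A minimal sketch on a toy DataFrame follows; the column names `feature` and `label` are made up for illustration, and the helper is repeated only so the example runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
    # Same helper as above, repeated so the example is self-contained
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)


# Toy data: 100 rows with hypothetical column names
toy = pd.DataFrame({"feature": range(100), "label": [0, 1] * 50})
train_set, val_set, test_set = train_val_test_split(toy)

# Sizes of the three subsets
sizes = (len(train_set), len(val_set), len(test_set))
print(sizes)

# The three subsets should not share any row
all_indices = set(train_set.index) | set(val_set.index) | set(test_set.index)
print(len(all_indices) == len(toy))
```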

In [None]:
def remove_labels(df, label_name):
    X = df.drop(label_name, axis=1)
    y = df[label_name].copy()
    return (X, y)

In [None]:
def evaluate_result(y_pred, y, y_prep_pred, y_prep, metric):
    # scikit-learn metrics expect the true labels first, then the predictions
    print(metric.__name__, "WITHOUT preparation:", metric(y, y_pred, average='weighted'))
    print(metric.__name__, "WITH preparation:", metric(y_prep, y_prep_pred, average='weighted'))
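
scikit-learn metric functions take the true labels as the first argument and the predictions as the second. For asymmetric metrics the order matters: swapping the arguments makes precision and recall trade places. A small sketch with made-up binary labels:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]

# Correct order: (y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# Swapped order: y_pred is treated as if it were the ground truth
swapped_precision = precision_score(y_pred, y_true)

print(precision, recall, swapped_precision)
```

Here the swapped precision equals the recall of the correctly ordered call, which is why the argument order is worth double-checking when reading metric outputs.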

## 1.- Reading the Dataset

In [None]:

df = pd.read_csv('datasets/datasets/TotalFeatures-ISCXFlowMeter.csv')

## 2.- Visualizing the Dataset

In [None]:
df.head(10)

In [None]:
df.info()

In [None]:
df['calss'].value_counts()
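
When checking class balance, relative frequencies are often easier to read than raw counts. A minimal sketch on a toy column; the label values below are invented and are not taken from the real `calss` column.

```python
import pandas as pd

# Toy label column with invented values
toy = pd.DataFrame({"calss": ["benign"] * 8 + ["adware"] + ["malware"]})

# normalize=True returns the fraction of rows per class instead of counts
proportions = toy["calss"].value_counts(normalize=True)
print(proportions)
```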

### Looking for Correlations

In [None]:
# Copy the dataset and convert the output variable to numeric in order to compute the correlations
# factorize() returns (codes, uniques); [0] selects the one-dimensional array of integer codes
X = df.copy()
X['calss'] = X['calss'].factorize()[0]
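
To make the `factorize()` step concrete: it returns a pair of integer codes and the unique values, with codes assigned in order of first appearance. Because that ordering is arbitrary, correlations with the encoded label depend on it when there are more than two classes. A sketch with invented labels:

```python
import pandas as pd

# Invented labels for illustration only
labels = pd.Series(["benign", "adware", "benign", "malware", "adware"])

codes, uniques = labels.factorize()
print(codes)          # one integer code per row, numbered by first appearance
print(list(uniques))  # the category that each code stands for
```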

In [None]:
# Compute the correlations
corr_matrix = X.corr()
corr_matrix['calss'].sort_values(ascending=False)

In [None]:
# Show the full correlation matrix
X.corr()

In [None]:
# Filtered query to find the features most correlated with the label
# One option is to keep only the features with the highest correlation
corr_matrix[corr_matrix["calss"]>0.05]
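
The filter above keeps only positive correlations greater than 0.05. A strongly negative correlation is just as informative, so one option is to filter on the absolute value instead. A sketch on synthetic data; the column names `a`, `b`, `c` and `target` and the 0.5 threshold are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=500)

toy = pd.DataFrame({
    "a": target + rng.normal(0, 0.1, size=500),   # strong positive correlation
    "b": -target + rng.normal(0, 0.1, size=500),  # strong negative correlation
    "c": rng.normal(0, 1, size=500),              # unrelated noise
    "target": target,
})

# Correlation of every feature with the target, excluding the target itself
corr = toy.corr()["target"].drop("target")

# Keep features whose correlation is strong in either direction
selected = corr[corr.abs() > 0.5].index.tolist()
print(selected)
```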

## 3.- Splitting the Dataset

In [None]:
# Split the dataset
train_set, val_set, test_set = train_val_test_split(X)

In [None]:
# Separate the 'calss' label from the features; only 'calss' is used as the target to reduce the load on the hardware
X_train, y_train = remove_labels(train_set, 'calss')
X_val, y_val = remove_labels(val_set, 'calss')
X_test, y_test = remove_labels(test_set, 'calss')
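
The split in this section is called without `stratify`, so class proportions may drift slightly between subsets. The helper accepts a column name through its `stratify` parameter; the sketch below illustrates the effect using `train_test_split` directly on imbalanced toy data (the data and the `feature` column are hypothetical).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy data: 80% class 0, 20% class 1
toy = pd.DataFrame({"feature": range(200), "calss": [0] * 160 + [1] * 40})

train_set, test_set = train_test_split(
    toy, test_size=0.4, random_state=42, stratify=toy["calss"])

# Fraction of class 1 in each subset; stratification keeps it at the original 20%
train_ratio = train_set["calss"].mean()
test_ratio = test_set["calss"].mean()
print(train_ratio, test_ratio)
```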

## 4.- Scaling the Dataset

It is important to understand that decision trees are algorithms that **do not require much data preparation**; specifically, they do not require scaling or normalization. In this exercise the dataset is scaled and the results are compared with those of the unscaled dataset. This shows how preprocessing steps such as scaling can affect the performance of the model.

In [None]:
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)

In [None]:
# Reuse the scaler fitted on the training set; refitting on the test set
# would apply a different transformation to it
X_test_scaled = scaler.transform(X_test)

In [None]:
X_val_scaled = scaler.transform(X_val)

In [None]:
# Convert the scaled array back to a pandas DataFrame
X_train_scaled = DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_train_scaled.head(10)

## 5.- Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

MAX_DEPTH = 20
# Model trained on the unscaled dataset
clf_tree = DecisionTreeClassifier(max_depth=MAX_DEPTH, random_state=42)
clf_tree.fit(X_train, y_train)

In [None]:
# Model trained on the scaled dataset
clf_tree_scaled = DecisionTreeClassifier(max_depth=MAX_DEPTH, random_state=42)
clf_tree_scaled.fit(X_train_scaled, y_train)
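
A tree splits on thresholds, so a monotonic rescaling of a feature should not change which samples fall on each side of a split, as long as the same fitted scaler is applied to every subset. The sketch below checks this on synthetic data from `make_classification`, which stands in for the real features; it is an illustration under those assumptions, not a result on the CICAAGM dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then reuse it for validation
scaler = RobustScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)

tree = DecisionTreeClassifier(max_depth=20, random_state=42)
tree.fit(X_train, y_train)

tree_scaled = DecisionTreeClassifier(max_depth=20, random_state=42)
tree_scaled.fit(X_train_scaled, y_train)

# Fraction of validation samples on which both trees predict the same class
agreement = (tree.predict(X_val) == tree_scaled.predict(X_val_scaled)).mean()
print(agreement)
```

If the two trees agree on essentially every sample, any difference seen between scaled and unscaled results is more likely to come from how the scaler was fitted than from the scaling itself.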