**Case Study: Android Malware Detection**


This case study tackles malware detection on Android devices by applying decision trees to the network traffic that the device generates.


Download the data files

https://drive.google.com/file/d/1FyRlPKiMnC2cDypeipX3lrAqy0wU3y0X/view?usp=sharing


Additional references on the dataset

_Arash Habibi Lashkari, Andi Fitriah A. Kadir, Hugo Gonzalez, Kenneth Fon Mbah and Ali A. Ghorbani, “Towards a Network-Based Framework for Android Malware Detection and Characterization”, in Proceedings of the 15th International Conference on Privacy, Security and Trust (PST), Calgary, Canada, 2017._


Notes:

    You can use this helper function to separate the input features from the output label:

def remove_labels(df, label_name):
    X = df.drop(label_name, axis=1)
    y = df[label_name].copy()
    return (X, y)


    And this one to compare results with and without preprocessing:

def evaluate_result(y_pred, y, y_prep_pred, y_prep, metric):
    # scikit-learn metrics expect the true labels first, then the predictions
    print(metric.__name__, "WITHOUT preparation:", metric(y, y_pred, average='weighted'))
    print(metric.__name__, "WITH preparation:", metric(y_prep, y_prep_pred, average='weighted'))
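As a minimal sketch of how these helpers might be used, the example below trains one tree on raw features and one on scaled features, then scores both. The toy DataFrame, its `f1`/`f2` columns and its values are made up for illustration and are not taken from the dataset. The true labels are passed first, following scikit-learn's `metric(y_true, y_pred)` convention.

```python
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier


def remove_labels(df, label_name):
    X = df.drop(label_name, axis=1)
    y = df[label_name].copy()
    return (X, y)


# Toy data: illustrative values only, not taken from the real dataset
toy = pd.DataFrame({
    "f1": [0.1, 0.2, 0.9, 1.0, 0.15, 0.95],
    "f2": [10, 12, 90, 95, 11, 92],
    "calss": [0, 0, 1, 1, 0, 1],
})
X, y = remove_labels(toy, "calss")

# Model trained on the raw features
clf_raw = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)
y_pred = clf_raw.predict(X)

# Model trained on scaled features
X_prep = RobustScaler().fit_transform(X)
clf_prep = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_prep, y)
y_prep_pred = clf_prep.predict(X_prep)

score_raw = f1_score(y, y_pred, average="weighted")
score_prep = f1_score(y, y_prep_pred, average="weighted")
print("f1_score WITHOUT preparation:", score_raw)
print("f1_score WITH preparation:", score_prep)
```

Because the toy classes are perfectly separable, both variants score the same here. On real data the two numbers may differ slightly.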


    Check that there are no categorical features, and convert the target class ("calss") from categorical to numeric with factorize().
    Review the correlations: you may be able to drop input features that are highly correlated with each other, or keep only those whose correlation with the target class (calss) exceeds a threshold.
    Another useful step is to scale the data and compare the results with training on unscaled data (decision trees do not depend on feature scale, so scaling brings little benefit and may even affect the model's performance).
    Train the algorithm with DecisionTreeClassifier from sklearn.tree (set the hyperparameters max_depth, trying values around 10 to 20 because higher values can cause overfitting, and random_state).
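The steps in these notes can be sketched as follows. This is only an illustration on synthetic data: the columns `a`, `b`, `c`, the class names and the 0.95 correlation threshold are assumptions chosen for the example, and only the label column name `calss` comes from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real CSV (hypothetical columns a, b, c)
rng = np.random.default_rng(42)
n = 200
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "b": a * 2.0 + rng.normal(scale=0.01, size=n),  # nearly a copy of "a"
    "c": rng.normal(size=n),
    "calss": np.where(a > 0, "malware", "benign"),
})

# 1. List the non-numeric (categorical) columns
categorical = df.select_dtypes(exclude="number").columns.tolist()

# 2. Encode the label column as integers
df["calss"], class_names = pd.factorize(df["calss"])

# 3. Drop one feature from each highly correlated pair (arbitrary threshold)
features = df.drop("calss", axis=1)
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X = features.drop(columns=to_drop)
y = df["calss"]

# 4. Train a depth-limited decision tree
clf = DecisionTreeClassifier(max_depth=10, random_state=42)
clf.fit(X, y)
print("Categorical columns:", categorical)
print("Dropped as redundant:", to_drop)
```

Only the upper triangle of the correlation matrix is inspected, so each correlated pair is counted once and one feature of the pair survives.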


Try to achieve an f1_score > 0.89


Extra I: 

Try to visualize the decision boundary that the algorithm has built (render the tree with graphviz using the two attributes most correlated with the output class, scaled so that they display properly in the plot; train it with a shallow depth, for example max_depth=2).
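A minimal sketch of the tree-rendering part of this exercise is shown below. It assumes two already-scaled features. The synthetic data, the feature names and the class names are placeholders, not values from the dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic two-feature data standing in for the two selected, scaled attributes
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)

clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X, y)

# With out_file=None, export_graphviz returns the DOT source as a string
dot_source = export_graphviz(
    clf,
    out_file=None,
    feature_names=["feature_1", "feature_2"],  # hypothetical names
    class_names=["benign", "malware"],          # hypothetical names
    filled=True,
    rounded=True,
)
print(dot_source[:60])
```

In a notebook, `graphviz.Source(dot_source)` should render the tree, provided both the `graphviz` Python package and the Graphviz binaries are installed.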


Extra II:

Now train it with Random Forest to check whether it improves:

from sklearn.ensemble import RandomForestClassifier with, for example, n_estimators=100

Try to achieve an f1_score > 0.93

In [1]:
def remove_labels(df, label_name):
    X = df.drop(label_name, axis=1)
    y = df[label_name].copy()
    return (X, y)

In [2]:
def evaluate_result(y_pred, y, y_prep_pred, y_prep, metric):
    # scikit-learn metrics expect the true labels first, then the predictions
    print(metric.__name__, "WITHOUT preparation:", metric(y, y_pred, average='weighted'))
    print(metric.__name__, "WITH preparation:", metric(y_prep, y_prep_pred, average='weighted'))
    

# Improvements

## 14_Feature selection techniques

In [3]:
import pandas as pd
import numpy as np
import graphviz

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import f1_score

In [4]:
# Build a function that performs the full train/validation/test split
def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)
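To illustrate what this function returns, the sketch below runs it on a synthetic frame. The `x` and `label` columns are hypothetical. Passing `stratify` is optional and is used here only to show that the class ratio can be preserved in every partition.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)


# Synthetic frame with a 75/25 class imbalance (hypothetical column names)
toy = pd.DataFrame({
    "x": np.arange(1000),
    "label": [0] * 750 + [1] * 250,
})
train_set, val_set, test_set = train_val_test_split(toy, stratify="label")

# Partition sizes, then the share of class 1 in each partition
print(len(train_set), len(val_set), len(test_set))
print(train_set["label"].mean(), val_set["label"].mean(), test_set["label"].mean())
```

The first split holds out 40% of the rows and the second halves that hold-out, which gives a 60/20/20 partition.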

In [5]:
def remove_labels(df, label_name):
    X = df.drop(label_name, axis=1)
    y = df[label_name].copy()
    return (X, y)

In [6]:
df = pd.read_csv('../../datasets/TotalFeatures-ISCXFlowMeter/TotalFeatures-ISCXFlowMeter.csv')

In [7]:
train_set, val_set, test_set = train_val_test_split(df)

In [8]:
X_train, y_train = remove_labels(train_set, 'calss')
X_val, y_val = remove_labels(val_set, 'calss')
X_test, y_test = remove_labels(test_set, 'calss')

### Random Forest


In [15]:
from sklearn.ensemble import RandomForestClassifier

clf_rnd = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
clf_rnd.fit(X_train, y_train)

In [16]:
y_pred = clf_rnd.predict(X_val)

In [17]:
print("F1 score:", f1_score(y_pred, y_val, average='weighted'))

F1 score: 0.9324043007314987


In [18]:
clf_rnd.feature_importances_

array([0.03096656, 0.00303719, 0.00440737, 0.02318232, 0.01184895,
       0.01721388, 0.00881173, 0.02199267, 0.01122589, 0.01910279,
       0.01229994, 0.00912599, 0.0049411 , 0.01864105, 0.00468261,
       0.01359503, 0.0060695 , 0.01755146, 0.00504174, 0.01740915,
       0.00478204, 0.00668029, 0.00337915, 0.00937514, 0.00572423,
       0.        , 0.        , 0.00268121, 0.00471322, 0.02948284,
       0.0175912 , 0.02737585, 0.0276842 , 0.02610625, 0.0159516 ,
       0.0247063 , 0.01454405, 0.02000791, 0.03888253, 0.03004006,
       0.00794144, 0.03300505, 0.00432689, 0.0041829 , 0.01156361,
       0.00794625, 0.        , 0.        , 0.        , 0.01207349,
       0.02251504, 0.01938611, 0.00347552, 0.00116829, 0.00072676,
       0.00094549, 0.00527031, 0.0106541 , 0.00290367, 0.00144508,
       0.00254706, 0.00234171, 0.00912673, 0.00249816, 0.00228634,
       0.00765582, 0.00907677, 0.01158292, 0.00196904, 0.0121832 ,
       0.00783499, 0.00994449, 0.00162465, 0.00188881, 0.14141

In [19]:
# Pair each importance with the training columns (the label column is excluded)
feature_importances = {name: score for name, score in zip(X_train.columns, clf_rnd.feature_importances_)}

In [20]:
feature_importances_sorted = pd.Series(feature_importances).sort_values(ascending=False)
feature_importances_sorted.head(20)

Init_Win_bytes_forward     0.141411
max_flowiat                0.038883
flow_fin                   0.033005
Init_Win_bytes_backward    0.031345
duration                   0.030967
mean_flowiat               0.030040
fPktsPerSecond             0.029483
flowBytesPerSecond         0.027684
flowPktsPerSecond          0.027376
min_flowpktl               0.026106
mean_flowpktl              0.024706
total_fpktl                0.023182
avgPacketSize              0.022515
max_fpktl                  0.021993
min_flowiat                0.020008
fAvgSegmentSize            0.019386
mean_fpktl                 0.019103
total_fiat                 0.018641
min_seg_size_forward       0.017701
bPktsPerSecond             0.017591
dtype: float64
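The ranking step can also be written more compactly by indexing the importances with the training columns. The sketch below is one possible way to do it, not the notebook's own code. It uses synthetic data from `make_classification`, and the `feat_*` names and the choice of 3 features are assumptions for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 3 informative features out of 8 (column names are made up)
X_arr, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=42,
)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X, y)

# Index the importances by the training columns so names and scores stay aligned
importances = pd.Series(clf.feature_importances_, index=X.columns)
top_k = importances.sort_values(ascending=False).head(3)
X_reduced = X[top_k.index]
print(top_k)
```

Taking the names from the feature matrix rather than from the full DataFrame keeps the label column out of the ranking.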

### Reduce to the most important features

In [21]:
# Extract the 20 most relevant features for the algorithm
columns = list(feature_importances_sorted.head(20).index)

In [22]:
columns

['Init_Win_bytes_forward',
 'max_flowiat',
 'flow_fin',
 'Init_Win_bytes_backward',
 'duration',
 'mean_flowiat',
 'fPktsPerSecond',
 'flowBytesPerSecond',
 'flowPktsPerSecond',
 'min_flowpktl',
 'mean_flowpktl',
 'total_fpktl',
 'avgPacketSize',
 'max_fpktl',
 'min_flowiat',
 'fAvgSegmentSize',
 'mean_fpktl',
 'total_fiat',
 'min_seg_size_forward',
 'bPktsPerSecond']

In [23]:
X_train_reduced = X_train[columns].copy()
X_val_reduced = X_val[columns].copy()

In [24]:
from sklearn.ensemble import RandomForestClassifier

clf_rnd = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
clf_rnd.fit(X_train_reduced, y_train)

In [25]:
y_pred = clf_rnd.predict(X_val_reduced)

In [26]:
print("F1 score:", f1_score(y_pred, y_val, average='weighted'))

F1 score: 0.934169742585424


### Add randomized search (with the reduced set)

In [34]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=10, high=300),
        'max_depth': randint(low=10, high=55)
        }

rnd_clf = RandomForestClassifier(n_jobs=-1)

# 100 sampled candidates x 10 folds each = 1000 rounds of training
rnd_search = RandomizedSearchCV(rnd_clf, param_distributions=param_distribs,
                                n_iter=100, cv=10, scoring='f1_weighted', verbose=4)

rnd_search.fit(X_train_reduced, y_train)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits
[CV 1/10] END ...max_depth=36, n_estimators=137;, score=0.933 total time=  14.9s
[CV 2/10] END ...max_depth=36, n_estimators=137;, score=0.933 total time=  16.1s
[CV 3/10] END ...max_depth=36, n_estimators=137;, score=0.935 total time=  15.8s
[CV 4/10] END ...max_depth=36, n_estimators=137;, score=0.932 total time=  15.4s
[CV 5/10] END ...max_depth=36, n_estimators=137;, score=0.933 total time=  16.0s
[CV 6/10] END ...max_depth=36, n_estimators=137;, score=0.930 total time=  15.5s
[CV 7/10] END ...max_depth=36, n_estimators=137;, score=0.935 total time=  15.8s
[CV 8/10] END ...max_depth=36, n_estimators=137;, score=0.933 total time=  15.9s
[CV 9/10] END ...max_depth=36, n_estimators=137;, score=0.935 total time=  15.6s
[CV 10/10] END ..max_depth=36, n_estimators=137;, score=0.934 total time=  15.8s
[CV 1/10] END ....max_depth=42, n_estimators=88;, score=0.933 total time=  10.7s
[CV 2/10] END ....max_depth=42, n_estimators

In [36]:
rnd_search.best_params_

{'max_depth': 24, 'n_estimators': 246}
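The search above takes 1000 fits, so it can help to rehearse the setup on a small problem first. The following sketch runs on synthetic data with much smaller ranges. The ranges, `n_iter=5` and `cv=3` are arbitrary choices for illustration. It also sets `random_state` on both the estimator and the search so that the sampled candidates are repeatable, which the original run does not do.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem so the search finishes quickly
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

param_distribs = {
    "n_estimators": randint(low=10, high=50),  # upper bound is exclusive
    "max_depth": randint(low=2, high=10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distribs,
    n_iter=5,            # 5 sampled candidates
    cv=3,                # 3 folds each, so 5 * 3 = 15 fits
    scoring="f1_weighted",
    random_state=42,     # makes the sampled candidates repeatable
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```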

In [37]:
rnd_search.best_estimator_

In [38]:
cvres = rnd_search.cv_results_
# Sort by "mean_test_score" in descending order (sorting on the score only avoids comparing dicts on ties)
sorted_results = sorted(zip(cvres["mean_test_score"], cvres["params"]), key=lambda pair: pair[0], reverse=True)

# Print the sorted results
for mean_score, params in sorted_results:
    print("F1 score:", mean_score, "-", "Parameters:", params)

F1 score: 0.9348872195310125 - Parameters: {'max_depth': 24, 'n_estimators': 246}
F1 score: 0.9348620332791269 - Parameters: {'max_depth': 24, 'n_estimators': 231}
F1 score: 0.9348263945077239 - Parameters: {'max_depth': 25, 'n_estimators': 236}
F1 score: 0.9348167584075986 - Parameters: {'max_depth': 22, 'n_estimators': 278}
F1 score: 0.9347281627912603 - Parameters: {'max_depth': 22, 'n_estimators': 245}
F1 score: 0.9347023978054908 - Parameters: {'max_depth': 23, 'n_estimators': 112}
F1 score: 0.9346662811958222 - Parameters: {'max_depth': 26, 'n_estimators': 287}
F1 score: 0.9346657536706197 - Parameters: {'max_depth': 24, 'n_estimators': 98}
F1 score: 0.9346330188177525 - Parameters: {'max_depth': 26, 'n_estimators': 293}
F1 score: 0.9345647219136227 - Parameters: {'max_depth': 25, 'n_estimators': 234}
F1 score: 0.934541211152038 - Parameters: {'max_depth': 25, 'n_estimators': 206}
F1 score: 0.9344859021862767 - Parameters: {'max_depth': 27, 'n_estimators': 279}
F1 score: 0.934429

### Model

In [39]:
rnd_search.best_estimator_.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 24,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 246,
 'n_jobs': -1,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [40]:
# Select the best model
clf_rnd = rnd_search.best_estimator_

In [41]:
# Predict on the training set
y_train_pred = clf_rnd.predict(X_train_reduced)

In [42]:
# F1 score on the training set
print("F1 score Train Set:", f1_score(y_train_pred, y_train, average='weighted'))

F1 score Train Set: 0.9753489207437736


In [43]:
# Predict on the validation set
y_val_pred = clf_rnd.predict(X_val_reduced)

In [44]:
# F1 score on the validation set
print("F1 score Validation Set:", f1_score(y_val_pred, y_val, average='weighted'))

F1 score Validation Set: 0.9360347234080588


### PCA


In [45]:
X_df, y_df = remove_labels(df, 'calss')
y_df = y_df.factorize()[0]

In [46]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.9999)
df_reduced = pca.fit_transform(X_df)

In [47]:
print("Number of components:", pca.n_components_)

Number of components: 11


In [48]:
pca.explained_variance_ratio_

array([9.16952089e-01, 5.61087653e-02, 2.16566915e-02, 3.65011318e-03,
       5.56686331e-04, 3.79356201e-04, 1.86256059e-04, 1.75273567e-04,
       1.38834106e-04, 8.83685900e-05, 4.56458730e-05])
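To see why 11 components are needed for the 0.9999 threshold, the ratios printed above can be accumulated. The short sketch below copies those values and reports how many leading components reach a few thresholds. The helper name `components_needed` and the extra thresholds 0.95 and 0.99 are illustrative choices.

```python
import numpy as np

# Explained variance ratios copied from the output above
ratios = np.array([
    9.16952089e-01, 5.61087653e-02, 2.16566915e-02, 3.65011318e-03,
    5.56686331e-04, 3.79356201e-04, 1.86256059e-04, 1.75273567e-04,
    1.38834106e-04, 8.83685900e-05, 4.56458730e-05,
])
cumulative = np.cumsum(ratios)


def components_needed(threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    return int(np.argmax(cumulative >= threshold)) + 1


for threshold in (0.95, 0.99, 0.9999):
    print(threshold, components_needed(threshold))
```

The first component alone accounts for about 91.7% of the variance. Since PCA is sensitive to feature scale, this may partly reflect that the features were not scaled before the projection.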

In [49]:
df_reduced = pd.DataFrame(df_reduced, columns=["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10", "c11"])
df_reduced["Class"] = y_df

In [50]:
train_set, val_set, test_set = train_val_test_split(df_reduced)

In [51]:
X_train, y_train = remove_labels(train_set, 'Class')
X_val, y_val = remove_labels(val_set, 'Class')
X_test, y_test = remove_labels(test_set, 'Class')

### Random forest

In [52]:
clf_rnd = RandomForestClassifier(n_estimators=200, max_depth=30, random_state=42, n_jobs=-1)
clf_rnd.fit(X_train, y_train)

In [53]:
y_val_pred = clf_rnd.predict(X_val)

In [54]:
print("F1 score validation set:", f1_score(y_val_pred, y_val, average='weighted'))

F1 score validation set: 0.904880286545858


In [55]:
y_test_pred = clf_rnd.predict(X_test)

In [56]:
print("F1 score test set:", f1_score(y_test_pred, y_test, average='weighted'))

F1 score test set: 0.9070542131346466
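In the cells above, PCA is fitted on the full dataset before splitting, so the projection has already seen the validation and test rows. One possible alternative, sketched here on synthetic data, is to wrap PCA and the classifier in a pipeline so that the projection is learned from the training partition only. The data, the 0.95 variance threshold and the forest size are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the feature matrix and labels
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# The pipeline fits PCA on X_train only; X_val is merely transformed
model = make_pipeline(
    PCA(n_components=0.95),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
model.fit(X_train, y_train)

y_val_pred = model.predict(X_val)
score = f1_score(y_val, y_val_pred, average="weighted")
n_components = model.named_steps["pca"].n_components_
print(n_components, round(score, 3))
```

The same pipeline object can be passed to a cross-validated search, in which case PCA is refitted inside every fold.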
