**Case Study: Android Malware Detection**


This case study tackles malware detection on Android devices by analyzing the network traffic each device generates, using decision trees.


Download the data files

https://drive.google.com/file/d/1FyRlPKiMnC2cDypeipX3lrAqy0wU3y0X/view?usp=sharing


Additional references on the dataset

_Arash Habibi Lashkari, Andi Fitriah A. Kadir, Hugo Gonzalez, Kenneth Fon Mbah and Ali A. Ghorbani, “Towards a Network-Based Framework for Android Malware Detection and Characterization”, in Proceedings of the 15th International Conference on Privacy, Security and Trust (PST), Calgary, Canada, 2017._


Notes:

    You can use this helper function to separate the input features from the output label:

def remove_labels(df, label_name):
    X = df.drop(label_name, axis=1)
    y = df[label_name].copy()
    return (X, y)


    And this one to compare the results with and without preprocessing:

def evaluate_result(y_pred, y, y_prep_pred, y_prep, metric):
    print(metric.__name__, "WITHOUT preparation:", metric(y_pred, y, average='weighted'))
    print(metric.__name__, "WITH preparation:", metric(y_prep_pred, y_prep, average='weighted'))


    Check that there are no categorical features, and convert the target column y (named "calss" in the dataset) from categorical to numeric with factorize().
    Review the correlations: you may be able to drop input features that are highly correlated with each other, or keep only those whose correlation with the target (calss) exceeds a threshold.
    Another useful step is to scale the data and compare the results against training on unscaled data (decision trees gain little from scaling, and it can even affect model performance).
    Train the model with DecisionTreeClassifier from sklearn.tree, setting random_state and max_depth (try values around 10 or 20, since larger values can lead to overfitting).
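
The correlation note above mentions dropping input features that are highly correlated with each other, a step the notebook itself does not carry out. The following is a minimal sketch of one way to do it. The helper name `drop_highly_correlated`, the 0.95 threshold, and the toy data are assumptions made for illustration, not part of the exercise.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(X, threshold=0.95):
    """Drop one feature from each pair whose absolute correlation exceeds threshold.

    Hypothetical helper for illustration; not part of the original exercise.
    """
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop), to_drop


# Toy data: "b" is an almost exact copy of "a"; "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

reduced, dropped = drop_highly_correlated(toy, threshold=0.95)
print("dropped:", dropped)
print("kept:", list(reduced.columns))
```

On the real dataset the same call could be applied to `X` before splitting. Which feature of a correlated pair gets dropped depends on column order, so the result is worth reviewing by hand.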


Aim for an f1_score > 0.89.


Extra I: 

Try to visualize the decision boundary the algorithm has built: render the tree with graphviz using the two attributes most correlated with the output class, scaled so they display properly in the plot, and train it with a shallow depth, for example max_depth=2.


Extra II:

Now train it with Random Forest to check whether the results improve:

from sklearn.ensemble import RandomForestClassifier, with, for example, n_estimators=100

Aim for an f1_score > 0.93.

In [1]:
def remove_labels(df, label_name):
    X = df.drop(label_name, axis=1)
    y = df[label_name].copy()
    return (X, y)
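
As a quick sanity check of the helper, here is a small usage sketch on a toy frame. The values are made up; only the column names mimic the dataset.

```python
import pandas as pd


def remove_labels(df, label_name):
    X = df.drop(label_name, axis=1)
    y = df[label_name].copy()
    return (X, y)


# Toy frame with invented values, used only to show the shape of the result.
toy = pd.DataFrame({
    "duration": [10, 20, 30],
    "total_fpackets": [1, 2, 3],
    "calss": ["benign", "asware", "benign"],
})

X_toy, y_toy = remove_labels(toy, "calss")
print(list(X_toy.columns))
print(y_toy.tolist())
```

The original frame is left untouched, because `drop` returns a new DataFrame and the label column is copied.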

In [2]:
def evaluate_result(y_pred, y, y_prep_pred, y_prep, metric):
    print(metric.__name__, "WITHOUT preparation:", metric(y_pred, y, average='weighted'))
    print(metric.__name__, "WITH preparation:", metric(y_prep_pred, y_prep, average='weighted'))

In [46]:
import pandas as pd
import numpy as np
import graphviz

In [48]:
df = pd.read_csv("../../datasets/TotalFeatures-ISCXFlowMeter/TotalFeatures-ISCXFlowMeter.csv")
df_original = df.copy()

In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 631955 entries, 0 to 631954
Data columns (total 80 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration                 631955 non-null  int64  
 1   total_fpackets           631955 non-null  int64  
 2   total_bpackets           631955 non-null  int64  
 3   total_fpktl              631955 non-null  int64  
 4   total_bpktl              631955 non-null  int64  
 5   min_fpktl                631955 non-null  int64  
 6   min_bpktl                631955 non-null  int64  
 7   max_fpktl                631955 non-null  int64  
 8   max_bpktl                631955 non-null  int64  
 9   mean_fpktl               631955 non-null  float64
 10  mean_bpktl               631955 non-null  float64
 11  std_fpktl                631955 non-null  float64
 12  std_bpktl                631955 non-null  float64
 13  total_fiat               631955 non-null  int64  
 14  tota

In [50]:
df.head(10)

Unnamed: 0,duration,total_fpackets,total_bpackets,total_fpktl,total_bpktl,min_fpktl,min_bpktl,max_fpktl,max_bpktl,mean_fpktl,...,mean_idle,max_idle,std_idle,FFNEPD,Init_Win_bytes_forward,Init_Win_bytes_backward,RRT_samples_clnt,Act_data_pkt_forward,min_seg_size_forward,calss
0,1020586,668,1641,35692,2276876,52,52,679,1390,53.431138,...,0.0,-1,0.0,2,4194240,1853440,1640,668,32,benign
1,80794,1,1,75,124,75,124,75,124,75.0,...,0.0,-1,0.0,2,0,0,0,1,0,benign
2,998,3,0,187,0,52,-1,83,-1,62.333333,...,0.0,-1,0.0,4,101888,-1,0,3,32,benign
3,189868,9,9,1448,6200,52,52,706,1390,160.888889,...,0.0,-1,0.0,2,4194240,2722560,8,9,32,benign
4,110577,4,6,528,1422,52,52,331,1005,132.0,...,0.0,-1,0.0,2,155136,31232,5,4,32,benign
5,261876,7,6,1618,882,52,52,730,477,231.142857,...,0.0,-1,0.0,2,4194240,926720,3,7,32,benign
6,14,2,0,104,0,52,-1,52,-1,52.0,...,0.0,-1,0.0,3,5824,-1,0,2,32,benign
7,29675,1,1,71,213,71,213,71,213,71.0,...,0.0,-1,0.0,2,0,0,0,1,0,benign
8,806635,4,0,239,0,52,-1,83,-1,59.75,...,0.0,-1,0.0,5,107008,-1,0,4,32,benign
9,56620,3,2,1074,719,52,52,592,667,358.0,...,0.0,-1,0.0,3,128512,10816,1,3,32,benign


In [51]:
df["calss"].value_counts()

calss
benign            471597
asware            155613
GeneralMalware      4745
Name: count, dtype: int64

In [52]:
# Check whether there are any categorical columns
print(df.dtypes)

# Convert the "calss" column from categorical to numeric
df['calss'], _ = pd.factorize(df['calss'])

# Verify the conversion
print(df['calss'].head())

duration                     int64
total_fpackets               int64
total_bpackets               int64
total_fpktl                  int64
total_bpktl                  int64
min_fpktl                    int64
min_bpktl                    int64
max_fpktl                    int64
max_bpktl                    int64
mean_fpktl                 float64
mean_bpktl                 float64
std_fpktl                  float64
std_bpktl                  float64
total_fiat                   int64
total_biat                   int64
min_fiat                     int64
min_biat                     int64
max_fiat                     int64
max_biat                     int64
mean_fiat                  float64
mean_biat                  float64
std_fiat                   float64
std_biat                   float64
fpsh_cnt                     int64
bpsh_cnt                     int64
furg_cnt                     int64
burg_cnt                     int64
total_fhlen                  int64
total_bhlen         
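
The cell above discards the second value returned by `pd.factorize`, which holds the original class names. A small sketch of keeping it, using a toy Series rather than the real column, in case the names are wanted later (for example as readable class names when plotting a tree):

```python
import pandas as pd

# Toy labels; pd.factorize assigns codes in order of first appearance.
labels = pd.Series(["benign", "asware", "benign", "GeneralMalware"])

codes, uniques = pd.factorize(labels)
code_to_name = dict(enumerate(uniques))

print(codes)
print(code_to_name)
```

Because codes follow order of first appearance, the mapping on the real data depends on the row order of the CSV and should be read from `uniques` rather than assumed.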

In [22]:
matriz_correlación = df.corr()

In [59]:
pd.set_option('display.max_rows', None) 
print(matriz_correlación['calss'].sort_values(ascending=False))

calss                      1.000000
flow_fin                   0.286175
min_seg_size_forward       0.258352
Init_Win_bytes_forward     0.129425
std_fpktl                  0.123758
std_flowpktl               0.119375
flow_syn                   0.115044
min_fiat                   0.074491
min_flowiat                0.074452
max_bpktl                  0.073212
std_bpktl                  0.072953
mean_flowiat               0.072549
mean_fiat                  0.071397
Init_Win_bytes_backward    0.069405
max_active                 0.067178
duration                   0.067066
max_flowiat                0.066992
max_idle                   0.066965
mean_active                0.066552
mean_idle                  0.066348
min_active                 0.066001
min_idle                   0.065804
max_fiat                   0.064875
total_fiat                 0.064770
bAvgSegmentSize            0.064755
mean_bpktl                 0.064753
flow_rst                   0.058715
bVarianceDataBytes         0

In [32]:
threshold = 0.1  
high_corr_features = matriz_correlación.index[abs(matriz_correlación['calss']) > threshold]

filtered_data = df[high_corr_features]

# Plain decision tree


Unmodified features


In [42]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Split the features (X) from the target class (y)
X = df.drop('calss', axis=1)
y = df['calss']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a plain decision tree on all features
tree = DecisionTreeClassifier(max_depth=20, random_state=42)
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)

f1 = f1_score(y_test, y_pred, average='weighted')
print("F1 Score (no scaling):", f1)

F1 Score (no scaling): 0.9332863569367247
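
The notes suggest max_depth values around 10 or 20 because deeper trees can overfit. One way to see this is to compare training and test scores across a few depths. The sketch below does so on synthetic data from `make_classification`, which is an assumption standing in for the traffic features, so the numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the traffic features (not the ISCXFlowMeter data).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=20, n_informative=8, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

scores = {}
for depth in (2, 5, 10, 20):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    scores[depth] = (
        f1_score(y_tr, model.predict(X_tr), average="weighted"),  # training F1
        f1_score(y_te, model.predict(X_te), average="weighted"),  # test F1
    )
    print(f"max_depth={depth}: train={scores[depth][0]:.3f} test={scores[depth][1]:.3f}")
```

A training score that keeps rising while the test score stalls or drops is the usual sign that the depth is too high.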


With the most correlated features

In [56]:
X_filtered = filtered_data.drop('calss', axis=1)
y_filtered = df['calss']

X_train_filtered, X_test_filtered, y_train_filtered, y_test_filtered = train_test_split(X_filtered, y_filtered, test_size=0.3, random_state=42)

tree_filtered = DecisionTreeClassifier(max_depth=20, random_state=42)
tree_filtered.fit(X_train_filtered, y_train_filtered)

# Make predictions
y_pred_filtered = tree_filtered.predict(X_test_filtered)

# Evaluate performance with the F1 score
f1_filtered = f1_score(y_test_filtered, y_pred_filtered, average='weighted')
print("F1 Score (filtered features):", f1_filtered)

F1 Score (filtered features): 0.9087537706466773


In [60]:
# evaluate_result passes the predictions as the metric's first argument, whereas
# scikit-learn metrics expect (y_true, y_pred). Weighted F1 is not symmetric in its
# arguments, so these values can differ from the scores printed in the cells above.

evaluate_result(y_pred, y_test, y_pred_filtered, y_test_filtered, f1_score)

f1_score WITHOUT preparation: 0.9353016791627968
f1_score WITH preparation: 0.9146096456213157
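
The notes also propose scaling as a preparation step and comparing against the unscaled run, which this notebook does not do for the full feature set. Below is a minimal sketch of that comparison on synthetic data (an assumption; the real features are not used here). The scaler is fitted on the training split only, and the true labels are passed first to `f1_score`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, used only to illustrate the comparison.
X_syn, y_syn = make_classification(
    n_samples=1500, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

raw_tree = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X_tr, y_tr)
scaled_tree = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X_tr_s, y_tr)

f1_raw = f1_score(y_te, raw_tree.predict(X_te), average="weighted")
f1_scaled = f1_score(y_te, scaled_tree.predict(X_te_s), average="weighted")
print("f1_score WITHOUT scaling:", f1_raw)
print("f1_score WITH scaling:", f1_scaled)
```

Tree splits depend on the ordering of feature values rather than their magnitude, so the two scores are expected to be very close; any gap usually comes from floating-point ties rather than from the scaling itself.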


In [None]:
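# Draft cell, never executed: clf_simple and top_2_features are not defined in this
# notebook. The Graphviz section further down contains the version that was run.
from sklearn.tree import export_graphviz

dot_data = export_graphviz(clf_simple, out_file=None,
                           feature_names=top_2_features,
                           class_names=['Class 0', 'Class 1'],  # Adjust these names to your dataset
                           filled=True, rounded=True,
                           special_characters=True)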

# Random forest

In [57]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

y_pred_forest = forest.predict(X_test)

f1_forest = f1_score(y_test, y_pred_forest, average='weighted')
print("F1 Score (Random Forest):", f1_forest)

F1 Score (Random Forest): 0.9324178863458773


In [61]:

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train_filtered, y_train_filtered)

y_pred_forest = forest.predict(X_test_filtered)

f1_forest = f1_score(y_test_filtered, y_pred_forest, average='weighted')
print("F1 Score (Random Forest, filtered features):", f1_forest)

F1 Score (Random Forest, filtered features): 0.9213017514398837


# Graphviz

In [72]:
correlation_matrix = df.corr()

In [73]:
correlation_with_class = correlation_matrix['calss'].drop('calss')
top_features = correlation_with_class.abs().nlargest(2).index.tolist()

In [74]:
top2cor = df[top_features + ['calss']]

In [75]:
from sklearn.preprocessing import StandardScaler
from sklearn.tree import export_graphviz

scaler = StandardScaler()
X_scaled = scaler.fit_transform(top2cor[top_features])
y_scaled = top2cor['calss'].values

X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = train_test_split(X_scaled, y_scaled, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X_train_scaled, y_train_scaled)

# Export the decision tree so it can be visualized
dot_data = export_graphviz(tree, out_file=None,
                           feature_names=top_features,
                           class_names=np.unique(y_scaled).astype(str).tolist(),  # class labels as strings
                           filled=True, rounded=True,
                           special_characters=True)

# Render the tree
graph = graphviz.Source(dot_data)
graph.render("decision_tree_visualization")  # Saves the graph as a .pdf file
graph.view()

'decision_tree_visualization.pdf'
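
If the Graphviz `dot` binary is not installed, the splits of a shallow tree can also be inspected as plain text with `sklearn.tree.export_text`. The sketch below uses synthetic two-feature data, and the names `feature_a` and `feature_b` are placeholders, not columns from the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic two-feature data standing in for the two most correlated attributes.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, n_classes=3, random_state=42,
)

shallow = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_syn, y_syn)

# Placeholder feature names; on the real data, top_features could be passed instead.
rules = export_text(shallow, feature_names=["feature_a", "feature_b"])
print(rules)
```

Each indented line is a threshold test on one feature, and the leaves show the predicted class, which is the same information the rendered PDF conveys.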

In [70]:
!where dot


C:\Users\David\anaconda3\envs\aprendizaje_auto_i\Library\bin\dot.exe
C:\Users\David\anaconda3\envs\aprendizaje_auto_i\Scripts\dot.bat
