# Case Study: Random Forest

This case study addresses the detection of malware on Android devices by analyzing the network traffic the device generates, using ensembles of decision trees.

### Dataset: Android Malware Detection

#### Description
The sophisticated and advanced Android malware is able to identify the presence of the emulator used by the malware analyst and in response, alter its behavior to evade detection. To overcome this issue, we installed the Android applications on the real device and captured its network traffic. See our publicly available Android Sandbox.

The CICAAGM dataset was captured by installing the Android apps on real smartphones in a semi-automated way. The dataset is generated from 1900 applications in the following three categories:

**1. Adware (250 apps)**
* Airpush: Designed to deliver unsolicited advertisements to the user’s systems for information stealing.
* Dowgin: Designed as an advertisement library that can also steal the user’s information.
* Kemoge: Designed to take over a user’s Android device. This adware is a hybrid of botnet and disguises itself as popular apps via repackaging.
* Mobidash: Designed to display ads and to compromise user’s personal information.
* Shuanet: Similar to Kemoge, Shuanet also is designed to take over a user’s device.

**2. General Malware (150 apps)**
* AVpass: Designed to be distributed in the guise of a Clock app.
* FakeAV: Designed as a scam that tricks the user into purchasing a full version of the software in order to remediate nonexistent infections.
* FakeFlash/FakePlayer: Designed as a fake Flash app in order to direct users to a website (once successfully installed).
* GGtracker: Designed for SMS fraud (sends SMS messages to a premium-rate number) and information stealing.
* Penetho: Designed as a fake service (hacktool for Android devices that can be used to crack the WiFi password). The malware is also able to infect the user’s computer via infected email attachment, fake updates, external media and infected documents.

**3. Benign (1500 apps)**
* 2015 GooglePlay market (top free popular and top free new)
* 2016 GooglePlay market (top free popular and top free new)

### Data files
* pcap files – the network traffic of both the malware and benign (20% malware and 80% benign)
* <span style="color:green">.csv files - the list of extracted network traffic features generated by the CIC-flowmeter</span>

### Downloading the data files
https://www.unb.ca/cic/datasets/android-adware.html

### Additional references on the dataset
_Arash Habibi Lashkari, Andi Fitriah A. Kadir, Hugo Gonzalez, Kenneth Fon Mbah and Ali A. Ghorbani, “Towards a Network-Based Framework for Android Malware Detection and Characterization”, in Proceedings of the 15th International Conference on Privacy, Security and Trust, PST, Calgary, Canada, 2017._

## Imports

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
# classification_report is used later in the Random Forest evaluation cells
from sklearn.metrics import f1_score, classification_report


## Helper functions

In [None]:
# Helper that performs the full train/validation/test split (60/20/20)
def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)
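
As a sanity check on the proportions, the two chained `train_test_split` calls (40% held out, then halved) should yield a 60/20/20 split. The sketch below reproduces the same two cuts on a hypothetical toy DataFrame; the column names `feature` and `label` and the 80/20 class mix are assumptions made for illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 80 "benign" and 20 "malware"
toy = pd.DataFrame({
    "feature": range(100),
    "label": ["benign"] * 80 + ["malware"] * 20,
})

# First cut: 60% train, 40% held out, stratified on the label
train, rest = train_test_split(
    toy, test_size=0.4, random_state=42, stratify=toy["label"])
# Second cut: the held-out 40% is halved into validation and test
val, test = train_test_split(
    rest, test_size=0.5, random_state=42, stratify=rest["label"])

split_sizes = (len(train), len(val), len(test))
malware_counts = tuple(int((s["label"] == "malware").sum()) for s in (train, val, test))
print("Split sizes (train, val, test):", split_sizes)
print("Malware rows per split:", malware_counts)
```

Passing `stratify='calss'` to `train_val_test_split` would apply the same stratification to the real data, which may be worth considering given the class imbalance.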

In [None]:
def remove_labels(df, label_name):
    X = df.drop(label_name, axis=1)
    y = df[label_name].copy()
    return (X, y)


In [None]:
def evaluate_result(y_pred, y, y_prep_pred, y_prep, metric):
    # scikit-learn metrics expect (y_true, y_pred), in that order
    print(metric.__name__, "WITHOUT preparation:", metric(y, y_pred, average='weighted'))
    print(metric.__name__, "WITH preparation:", metric(y_prep, y_prep_pred, average='weighted'))
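
scikit-learn metrics take their arguments as `(y_true, y_pred)`. With `average='weighted'` the order matters, because each class is weighted by its support in the first argument. A minimal sketch on hypothetical labels (the two lists are made up for illustration):

```python
from sklearn.metrics import f1_score

# Hypothetical labels: three true samples of class 0 and one of class 1
y_true_demo = [0, 0, 0, 1]
y_pred_demo = [0, 0, 1, 1]

# Weighted F1 weights each class by its support in the FIRST argument
f1_correct_order = f1_score(y_true_demo, y_pred_demo, average='weighted')
f1_swapped_order = f1_score(y_pred_demo, y_true_demo, average='weighted')

print(f"f1_score(y_true, y_pred): {f1_correct_order:.4f}")
print(f"f1_score(y_pred, y_true): {f1_swapped_order:.4f}")
```

The per-class F1 values are the same in both calls (0.8 and 2/3); only the class weights change, from 3/4 and 1/4 to 1/2 and 1/2.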

### 1.- Reading the dataset

In [None]:
df = pd.read_csv('datasets/datasets/TotalFeatures-ISCXFlowMeter.csv')

### 2.- Exploring the dataset

In [None]:
df.head(10)

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
# Check the size of the data
print("Number of rows in the dataset:", len(df))
print("Number of columns in the dataset (features plus label):", len(df.columns))

In [None]:
df['calss'].value_counts()

## Looking for correlations

##### Converting the categorical output into a numeric one

In [None]:
# Encode the output variable numerically in order to look for correlations
X = df.copy()
# factorize() returns (codes, uniques); [0] keeps the one-dimensional array of integer codes
X['calss'] = X['calss'].factorize()[0]
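
To make the encoding step concrete, here is a small sketch of what `factorize()` returns. The label strings are hypothetical stand-ins; the actual values of the `calss` column are not shown in this notebook.

```python
import pandas as pd

# Hypothetical label values, for illustration only
labels = pd.Series(["benign", "adware", "benign", "malware", "adware"])

# factorize() returns integer codes plus the unique values they refer to
codes, uniques = labels.factorize()
code_list = codes.tolist()
category_list = list(uniques)

print(code_list)      # integer code for each row
print(category_list)  # category behind each code, in order of first appearance
```

The codes follow the order of first appearance, so they carry no ordinal meaning. For a label with more than two classes, a correlation computed against these codes is only a rough screening signal.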

In [None]:
# Compute the correlations with the encoded label
corr_matrix = X.corr()
corr_matrix['calss'].sort_values(ascending=False)


In [None]:
X.corr()

In [None]:
# One option is to keep only the features with the highest correlation with the label
corr_matrix[corr_matrix['calss'] > 0.05]
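
A signed threshold such as `> 0.05` keeps only positively correlated features, so a strongly negative correlation would be discarded. The sketch below illustrates the difference with a filter on the absolute value. All data is synthetic, and the column names (`pos`, `neg`, `noise`, `target`) and the 0.5 threshold are assumptions chosen for the demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)

# Synthetic features: one positively related, one negatively related, one unrelated
toy = pd.DataFrame({
    "pos": target + rng.normal(0, 0.1, 200),
    "neg": -target + rng.normal(0, 0.1, 200),
    "noise": rng.normal(0, 1, 200),
    "target": target,
})

corr_with_target = toy.corr()["target"].drop("target")

# Signed filter: drops the strongly negative feature
selected_signed = corr_with_target[corr_with_target > 0.5].index.tolist()
# Absolute-value filter: keeps strong correlations of either sign
selected_abs = corr_with_target[corr_with_target.abs() > 0.5].index.tolist()

print("Signed filter:  ", selected_signed)
print("Absolute filter:", selected_abs)
```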

## 3.- Splitting the dataset

In [None]:
# Split the dataset
train_set, val_set, test_set = train_val_test_split(X)

In [None]:
# Separate the 'calss' label from the features; only 'calss' is used as the target, to reduce the load on the hardware
X_train, y_train = remove_labels(train_set, 'calss')
X_val, y_val = remove_labels(val_set, 'calss')
X_test, y_test = remove_labels(test_set, 'calss')

## 4.- Scaling the dataset

It is important to understand that decision trees are algorithms that **do not require much data preparation**; specifically, they do not require scaling or normalization. In this exercise the dataset is scaled and the results are compared with those obtained on the unscaled dataset. This shows how applying preprocessing such as scaling can affect the model's performance.
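
The reason trees need no scaling is that each split only compares a feature against a threshold, so a monotonic rescaling of a feature preserves the ordering of the samples and therefore the candidate partitions. The following is a minimal sketch on synthetic data (an assumption; it does not use the Android dataset) that trains one tree on raw features and one on scaled features and measures how often their predictions agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Scaler fitted on the training split only, then applied to the held-out split
demo_scaler = RobustScaler().fit(X_tr)
X_tr_s, X_te_s = demo_scaler.transform(X_tr), demo_scaler.transform(X_te)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_tr_s, y_tr)

# Fraction of held-out samples on which both trees give the same prediction
agreement = np.mean(tree_raw.predict(X_te) == tree_scaled.predict(X_te_s))
print(f"Prediction agreement: {agreement:.3f}")
```

The agreement is expected to be at or very near 1.0; small differences can only arise from floating-point ties near a split threshold.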

In [None]:
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)

In [None]:
# Reuse the scaler fitted on the training set; refitting on each split would scale them inconsistently
X_test_scaled = scaler.transform(X_test)

In [None]:
X_val_scaled = scaler.transform(X_val)
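
The difference between refitting a scaler on each split and reusing the one fitted on the training data can be seen on a tiny hypothetical example. `RobustScaler` subtracts the median and divides by the interquartile range, and both statistics depend on the data it was fitted on. The arrays below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single-feature splits with different ranges
train_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
val_demo = np.array([[10.0], [20.0], [30.0]])

# Reuse: statistics come from the training split (median 3, IQR 2)
shared_scaler = RobustScaler().fit(train_demo)
val_reused = shared_scaler.transform(val_demo).ravel()

# Refit: statistics come from the validation split itself (median 20, IQR 10)
val_refit = RobustScaler().fit_transform(val_demo).ravel()

print("Reused training scaler:", val_reused)
print("Refitted on validation:", val_refit)
```

With the reused scaler the validation values keep their position relative to the training distribution. With a refit they are re-centered around zero, so the same raw value maps to different scaled values in different splits.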

In [None]:
# Convert the scaled array back into a pandas DataFrame
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_train_scaled.head(10)

In [None]:
X_val_scaled = pd.DataFrame(X_val_scaled, columns=X_val.columns, index=X_val.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)


In [None]:
X_train_scaled.describe()

## 5.- Decision Forest

In [None]:
# Model trained on the unscaled dataset
from sklearn.tree import DecisionTreeClassifier

clf_tree = DecisionTreeClassifier(random_state=42)
clf_tree.fit(X_train, y_train)

In [None]:
# Predict on the training set
y_train_pred = clf_tree.predict(X_train)

In [None]:
# f1_score expects (y_true, y_pred), in that order
print("F1 Score Train Set: ", f1_score(y_train, y_train_pred, average='weighted'))

In [None]:
# Predict on the validation set
y_val_pred = clf_tree.predict(X_val)

In [None]:
# F1 score on the validation set
print("F1 Score Validation Set: ", f1_score(y_val, y_val_pred, average='weighted'))

Remaining steps in this notebook:

* 1.- Train the model on the scaled and the unscaled dataset
* 2.- Compare both models
* 3.- Predict on the validation data
* 4.- Compare the scaled and unscaled results
* 7.- Regression Forest: trees and ensembles of decision trees can also be applied to regression problems

## 1.- Training the model on the scaled and the unscaled dataset

In [None]:
# 1.- Scaling the dataset
from sklearn.preprocessing import RobustScaler

def scale_datasets(X_train, X_val, X_test):
    """
    Scale the training, validation and test sets with RobustScaler.
    The scaler is fitted on the training set only.
    Returns the scaled versions in the same order.
    """
    scaler = RobustScaler()
    X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
    X_val_scaled = pd.DataFrame(scaler.transform(X_val), columns=X_val.columns, index=X_val.index)
    X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
    return X_train_scaled, X_val_scaled, X_test_scaled

X_train_scaled, X_val_scaled, X_test_scaled = scale_datasets(X_train, X_val, X_test)
print("Scaling complete.")


In [None]:
# 1.- Random Forest on the unscaled data
from sklearn.ensemble import RandomForestClassifier


def train_rf_unscaled(X_train, y_train):
    """
    Train a RandomForestClassifier on the unscaled data.
    """
    clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
    clf.fit(X_train, y_train)
    return clf

clf_rf_unscaled = train_rf_unscaled(X_train, y_train)

y_train_pred_unscaled = clf_rf_unscaled.predict(X_train)
y_val_pred_unscaled = clf_rf_unscaled.predict(X_val)
y_test_pred_unscaled = clf_rf_unscaled.predict(X_test)

print("=== Random Forest (UNSCALED) ===")
print("F1 Train:", f1_score(y_train, y_train_pred_unscaled, average='weighted'))
print("F1 Val:  ", f1_score(y_val, y_val_pred_unscaled, average='weighted'))
print("F1 Test: ", f1_score(y_test, y_test_pred_unscaled, average='weighted'))
print("\nClassification report (validation):")
print(classification_report(y_val, y_val_pred_unscaled))
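
To illustrate what "ensemble of decision trees" means in practice: in scikit-learn, the class probabilities predicted by a `RandomForestClassifier` are the mean of the probabilities predicted by its individual trees, which are exposed through the `estimators_` attribute. The sketch below checks this on synthetic data (an assumption; a small forest of 10 trees is used to keep it fast).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Average the class probabilities predicted by each individual tree
mean_tree_probas = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0)
forest_probas = forest.predict_proba(X_demo)

# The forest's probabilities should match the average over its trees
matches = bool(np.allclose(mean_tree_probas, forest_probas))
print("Forest probabilities equal the tree average:", matches)
```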


In [None]:
# 2.- Random Forest on the scaled data
clf_rf_scaled = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf_rf_scaled.fit(X_train_scaled, y_train)

# Predictions
y_train_pred_scaled = clf_rf_scaled.predict(X_train_scaled)
y_val_pred_scaled = clf_rf_scaled.predict(X_val_scaled)
y_test_pred_scaled = clf_rf_scaled.predict(X_test_scaled)

# Evaluation
print("=== Random Forest (SCALED) ===")
print("Train F1-weighted:", f1_score(y_train, y_train_pred_scaled, average='weighted'))
print("Val   F1-weighted:", f1_score(y_val, y_val_pred_scaled, average='weighted'))
print("Test  F1-weighted:", f1_score(y_test, y_test_pred_scaled, average='weighted'))
print("\nClassification report (Validation):\n", classification_report(y_val, y_val_pred_scaled))


In [None]:
# 2.- Comparing both models on the validation set

evaluate_result(
    y_val_pred_unscaled, y_val,
    y_val_pred_scaled, y_val,
    f1_score
)


In [None]:
# 3.- Comparing both models across the three splits

f1_train_unscaled = f1_score(y_train, y_train_pred_unscaled, average='weighted')
f1_val_unscaled   = f1_score(y_val, y_val_pred_unscaled, average='weighted')
f1_test_unscaled  = f1_score(y_test, y_test_pred_unscaled, average='weighted')

f1_train_scaled = f1_score(y_train, y_train_pred_scaled, average='weighted')
f1_val_scaled   = f1_score(y_val, y_val_pred_scaled, average='weighted')
f1_test_scaled  = f1_score(y_test, y_test_pred_scaled, average='weighted')

print("MODEL COMPARISON")
print(f"Train (Unscaled): {f1_train_unscaled:.4f} | (Scaled): {f1_train_scaled:.4f}")
print(f"Valid (Unscaled): {f1_val_unscaled:.4f} | (Scaled): {f1_val_scaled:.4f}")
print(f"Test  (Unscaled): {f1_test_unscaled:.4f} | (Scaled): {f1_test_scaled:.4f}")


## 7.- Regression Forest

In [None]:
# 7.- Random Forest Regressor: example
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Pick the first numeric column as the target for the regression example
numeric_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols = [c for c in numeric_cols if c != 'calss']  # make sure the label is excluded

if len(numeric_cols) == 0:
    print("❌ No numeric columns were found for regression.")
else:
    target_col = numeric_cols[0]
    print(f"🎯 Using column '{target_col}' as the regression target.")

    # Build the regression datasets
    X_reg_train = X_train.drop(columns=[target_col])
    X_reg_val = X_val.drop(columns=[target_col])
    X_reg_test = X_test.drop(columns=[target_col])

    y_reg_train = X_train[target_col]
    y_reg_val = X_val[target_col]
    y_reg_test = X_test[target_col]

    # Scale the features (scaler fitted on the training set only)
    scaler_reg = RobustScaler()
    X_reg_train_scaled = pd.DataFrame(scaler_reg.fit_transform(X_reg_train), columns=X_reg_train.columns, index=X_reg_train.index)
    X_reg_val_scaled = pd.DataFrame(scaler_reg.transform(X_reg_val), columns=X_reg_val.columns, index=X_reg_val.index)
    X_reg_test_scaled = pd.DataFrame(scaler_reg.transform(X_reg_test), columns=X_reg_test.columns, index=X_reg_test.index)

    # Train the regressor
    rfr = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
    rfr.fit(X_reg_train_scaled, y_reg_train)

    # Predictions on the validation set
    y_reg_pred = rfr.predict(X_reg_val_scaled)

    # Metrics
    print("MAE (val):", mean_absolute_error(y_reg_val, y_reg_pred))
    print("MSE (val):", mean_squared_error(y_reg_val, y_reg_pred))
    print("R² (val):", r2_score(y_reg_val, y_reg_pred))
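
For reference, the three regression metrics reported above can be reproduced by hand. The sketch below computes them with NumPy on three hypothetical values and compares the results with the scikit-learn functions; the numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions
y_true_reg = np.array([3.0, 5.0, 7.0])
y_pred_reg = np.array([2.0, 5.0, 9.0])
errors = y_true_reg - y_pred_reg

# MAE: mean of the absolute errors -> (1 + 0 + 2) / 3
mae_manual = np.mean(np.abs(errors))
# MSE: mean of the squared errors -> (1 + 0 + 4) / 3
mse_manual = np.mean(errors ** 2)
# R²: 1 - residual sum of squares / total sum of squares -> 1 - 5/8
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true_reg - y_true_reg.mean()) ** 2)

print("MAE:", mae_manual, "| sklearn:", mean_absolute_error(y_true_reg, y_pred_reg))
print("MSE:", mse_manual, "| sklearn:", mean_squared_error(y_true_reg, y_pred_reg))
print("R²: ", r2_manual, "| sklearn:", r2_score(y_true_reg, y_pred_reg))
```

MAE and MSE are expressed in the units of the target (squared, for MSE), while R² is unitless, which makes it easier to compare across targets.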
