# Random Forest

## Android malware detection

Sophisticated Android malware can detect the emulator used by a malware analyst and, in response, alter its behavior to evade detection. To overcome this problem, we installed the Android applications on real devices and captured their network traffic. See our publicly available Android Sandbox.

The CICAAGM dataset was captured by installing Android applications on real smartphones in a semi-automated way. It was generated from 1900 applications in the following three categories:

### 1. Adware (250 applications)

* **Airpush:** designed to deliver unsolicited advertisements to the user's system in order to steal information.

* **Dowgin:** designed as an advertising library that can also steal the user's information.

* **Kemoge:** designed to take control of a user's Android device. This adware is a botnet hybrid that disguises itself as popular applications through repackaging.

* **Mobidash:** designed to display ads and compromise the user's personal information.

* **Shuanet:** like Kemoge, Shuanet is also designed to take control of a user's device.

### 2. General malware (150 applications)

* **AVpass:** designed to be distributed disguised as a clock application.

* **FakeAV:** designed as a scam that tricks the user into buying a full version of the software in order to fix non-existent infections.

* **FakeFlash / FakePlayer:** designed as a fake Flash application that directs users to a website (after a successful installation).

* **GGtracker:** designed for SMS fraud (it sends SMS messages to a premium-rate number) and information theft.

* **Penetho:** designed as a fake service (a hacktool for Android devices that can be used to crack WiFi passwords). The malware can also infect the user's computer through infected email attachments, fake updates, external media, and infected documents.

### 3. Benign (1500 applications)

* 2015 GooglePlay market (top free popular and top free new)
* 2016 GooglePlay market (top free popular and top free new)

## Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import f1_score

## Helper functions

### Function for splitting the dataset

In [2]:
def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
    """Split df into 60% training, 20% validation and 20% test sets."""
    # Optionally stratify on the column named by `stratify`
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    # Split the held-out 40% in half: validation and test
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)
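
As a quick sanity check of the 60/20/20 proportions, the same two-step split can be tried on a toy DataFrame. This is only a minimal sketch: the `toy` frame and its `label` column are invented for illustration and are not part of the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, with an 80/20 class imbalance
toy = pd.DataFrame({"x": range(100), "label": ["a"] * 80 + ["b"] * 20})

# Step 1: hold out 40% of the rows, stratifying on the label
train, rest = train_test_split(
    toy, test_size=0.4, random_state=42, stratify=toy["label"])
# Step 2: split the held-out rows in half (validation and test)
val, test = train_test_split(
    rest, test_size=0.5, random_state=42, stratify=rest["label"])

sizes = (len(train), len(val), len(test))
print(sizes)
print(train["label"].value_counts())
```

With 100 rows this should yield 60, 20 and 20 rows, and stratification keeps the 80/20 class ratio in each subset.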

### Separating input features from labels

In [4]:
def remove_labels(df, label_name):
    X = df.drop(label_name, axis=1)
    y = df[label_name].copy()
    return (X, y)

## Reading the dataset

In [5]:
df = pd.read_csv('Datasets/TotalFeatures-ISCXFlowMeter.csv')
df

Unnamed: 0,duration,total_fpackets,total_bpackets,total_fpktl,total_bpktl,min_fpktl,min_bpktl,max_fpktl,max_bpktl,mean_fpktl,...,mean_idle,max_idle,std_idle,FFNEPD,Init_Win_bytes_forward,Init_Win_bytes_backward,RRT_samples_clnt,Act_data_pkt_forward,min_seg_size_forward,calss
0,1020586,668,1641,35692,2276876,52,52,679,1390,53.431138,...,0.0,-1,0.000000e+00,2,4194240,1853440,1640,668,32,benign
1,80794,1,1,75,124,75,124,75,124,75.000000,...,0.0,-1,0.000000e+00,2,0,0,0,1,0,benign
2,998,3,0,187,0,52,-1,83,-1,62.333333,...,0.0,-1,0.000000e+00,4,101888,-1,0,3,32,benign
3,189868,9,9,1448,6200,52,52,706,1390,160.888889,...,0.0,-1,0.000000e+00,2,4194240,2722560,8,9,32,benign
4,110577,4,6,528,1422,52,52,331,1005,132.000000,...,0.0,-1,0.000000e+00,2,155136,31232,5,4,32,benign
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
631950,530,1,1,74,334,74,334,74,334,74.000000,...,0.0,-1,0.000000e+00,2,0,0,0,1,0,benign
631951,50240627,23,24,4767,6107,52,52,533,855,207.260870,...,9842879.0,9964749,1.196806e+05,2,317952,107008,11,23,32,GeneralMalware
631952,35471450,1,2,52,104,52,52,52,52,52.000000,...,35300000.0,35290631,0.000000e+00,2,3904,88704,1,1,32,asware
631953,41713629,12,26,1821,18643,40,40,489,1390,151.750000,...,20200000.0,32711382,1.770000e+07,2,227456,2432,23,12,20,benign


## Inspecting the dataset

In [6]:
df.head(10)

Unnamed: 0,duration,total_fpackets,total_bpackets,total_fpktl,total_bpktl,min_fpktl,min_bpktl,max_fpktl,max_bpktl,mean_fpktl,...,mean_idle,max_idle,std_idle,FFNEPD,Init_Win_bytes_forward,Init_Win_bytes_backward,RRT_samples_clnt,Act_data_pkt_forward,min_seg_size_forward,calss
0,1020586,668,1641,35692,2276876,52,52,679,1390,53.431138,...,0.0,-1,0.0,2,4194240,1853440,1640,668,32,benign
1,80794,1,1,75,124,75,124,75,124,75.0,...,0.0,-1,0.0,2,0,0,0,1,0,benign
2,998,3,0,187,0,52,-1,83,-1,62.333333,...,0.0,-1,0.0,4,101888,-1,0,3,32,benign
3,189868,9,9,1448,6200,52,52,706,1390,160.888889,...,0.0,-1,0.0,2,4194240,2722560,8,9,32,benign
4,110577,4,6,528,1422,52,52,331,1005,132.0,...,0.0,-1,0.0,2,155136,31232,5,4,32,benign
5,261876,7,6,1618,882,52,52,730,477,231.142857,...,0.0,-1,0.0,2,4194240,926720,3,7,32,benign
6,14,2,0,104,0,52,-1,52,-1,52.0,...,0.0,-1,0.0,3,5824,-1,0,2,32,benign
7,29675,1,1,71,213,71,213,71,213,71.0,...,0.0,-1,0.0,2,0,0,0,1,0,benign
8,806635,4,0,239,0,52,-1,83,-1,59.75,...,0.0,-1,0.0,5,107008,-1,0,4,32,benign
9,56620,3,2,1074,719,52,52,592,667,358.0,...,0.0,-1,0.0,3,128512,10816,1,3,32,benign


In [7]:
df.describe()

Unnamed: 0,duration,total_fpackets,total_bpackets,total_fpktl,total_bpktl,min_fpktl,min_bpktl,max_fpktl,max_bpktl,mean_fpktl,...,min_idle,mean_idle,max_idle,std_idle,FFNEPD,Init_Win_bytes_forward,Init_Win_bytes_backward,RRT_samples_clnt,Act_data_pkt_forward,min_seg_size_forward
count,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,...,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0,631955.0
mean,21952450.0,6.728514,10.431934,954.0172,12060.42,141.475727,44.357688,263.675901,183.248084,174.959706,...,19973270.0,20312280.0,20752380.0,466387.5,2.360896,962079.6,310451.9,9.733144,6.72471,19.965713
std,190057800.0,174.161354,349.424019,82350.4,482471.6,157.68088,89.099554,289.644383,371.863224,162.024811,...,189798600.0,189790200.0,189972100.0,6199704.0,3.04181,1705655.0,664795.6,347.877923,174.13813,14.914261
min,-18.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,0.0,...,-1.0,0.0,-1.0,0.0,2.0,-1.0,-1.0,0.0,0.0,0.0
25%,0.0,1.0,0.0,69.0,0.0,52.0,-1.0,52.0,-1.0,52.0,...,-1.0,0.0,-1.0,0.0,2.0,0.0,-1.0,0.0,1.0,0.0
50%,24450.0,1.0,0.0,184.0,0.0,52.0,-1.0,83.0,-1.0,83.0,...,-1.0,0.0,-1.0,0.0,2.0,87616.0,-1.0,0.0,1.0,32.0
75%,1759751.0,3.0,1.0,427.0,167.0,108.0,52.0,421.0,115.0,356.0,...,1013498.0,1291379.0,1306116.0,0.0,2.0,304640.0,90496.0,1.0,3.0,32.0
max,44310760000.0,48255.0,74768.0,40496440.0,103922200.0,1390.0,1390.0,1500.0,1390.0,1390.0,...,44310720000.0,44300000000.0,44310720000.0,847000000.0,2269.0,4194240.0,4194240.0,74524.0,48255.0,44.0


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 631955 entries, 0 to 631954
Data columns (total 80 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration                 631955 non-null  int64  
 1   total_fpackets           631955 non-null  int64  
 2   total_bpackets           631955 non-null  int64  
 3   total_fpktl              631955 non-null  int64  
 4   total_bpktl              631955 non-null  int64  
 5   min_fpktl                631955 non-null  int64  
 6   min_bpktl                631955 non-null  int64  
 7   max_fpktl                631955 non-null  int64  
 8   max_bpktl                631955 non-null  int64  
 9   mean_fpktl               631955 non-null  float64
 10  mean_bpktl               631955 non-null  float64
 11  std_fpktl                631955 non-null  float64
 12  std_bpktl                631955 non-null  float64
 13  total_fiat               631955 non-null  int64  
 14  tota

In [9]:
print("Length of the dataset:", len(df))

Length of the dataset: 631955


In [10]:
# len(df.columns) counts all 80 columns, including the label column "calss",
# so there are 79 input features.
print("Number of columns in the dataset:", len(df.columns))

Number of columns in the dataset: 80


In [11]:
# Classification categories
df["calss"].value_counts()

benign            471597
asware            155613
GeneralMalware      4745
Name: calss, dtype: int64
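
These counts show a strong class imbalance. The following sketch turns them into proportions; the numbers are copied from the output above, and nothing else is assumed about the dataset.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"benign": 471597, "asware": 155613, "GeneralMalware": 4745})

# Fraction of the dataset that belongs to each class
proportions = counts / counts.sum()
print(proportions.round(4))
```

GeneralMalware makes up less than 1% of the flows. The `train_val_test_split` helper accepts a `stratify` argument, so passing `stratify='calss'` would be one way to keep these proportions in every subset; the notebook itself splits without stratification.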

## Splitting the dataset

In [12]:
train_set, val_set, test_set = train_val_test_split(df)

In [13]:
X_train, y_train = remove_labels(train_set, 'calss')
X_val, y_val = remove_labels(val_set, 'calss')
X_test, y_test = remove_labels(test_set, 'calss')

In [14]:
print("Training set size:", len(train_set))
print("Validation set size:", len(val_set))
print("Test set size:", len(test_set))

Training set size: 379173
Validation set size: 126391
Test set size: 126391


## Random Forests

In [15]:
# Goal of this section: reduce the input features (feature selection), which can
# shorten training time and may also help the model's classification performance.


from sklearn.ensemble import RandomForestClassifier

# Instantiate RandomForestClassifier as "clf_rnd" with the following parameters:
# n_estimators=50 trains 50 trees, random_state=42 fixes the seed,
# and n_jobs=-1 uses all available CPU cores.
clf_rnd = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)

# Call fit on "clf_rnd" with the training subsets.
clf_rnd.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=-1,
                       oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [16]:
# Predict on the validation set
y_pred = clf_rnd.predict(X_val)
y_pred

array(['asware', 'asware', 'benign', ..., 'benign', 'asware', 'benign'],
      dtype=object)

In [18]:
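# Weighted F1 score on the validation set: about 0.932. F1 is not accuracy, so this
# is not the fraction of correctly classified samples.
# Note: scikit-learn's signature is f1_score(y_true, y_pred); with the arguments in
# this order, the weighted average uses the predicted labels' class counts as weights.
print("F1 score:", f1_score(y_pred, y_val, average='weighted'))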

F1 score: 0.9324043007314987
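
The F1 call above passes the predictions first, whereas scikit-learn documents the order as `f1_score(y_true, y_pred)`. The sketch below uses a handful of invented labels (not taken from the dataset) to illustrate two points: weighted F1 differs from accuracy, and swapping the arguments changes the weighted average because the class weights come from the first argument.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels invented for illustration (class names borrowed from the dataset)
y_true = ["benign"] * 6 + ["asware"] * 3 + ["GeneralMalware"]
y_pred = ["benign"] * 5 + ["asware"] * 3 + ["benign"] * 2

# Fraction of samples classified correctly
acc = accuracy_score(y_true, y_pred)

# Weighted F1 with the documented argument order (weights = true class counts)
f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)

# Same call with the arguments swapped (weights = predicted class counts)
f1_swapped = f1_score(y_pred, y_true, average="weighted", zero_division=0)

print("accuracy:", acc)
print("weighted F1 (y_true, y_pred):", f1_weighted)
print("weighted F1 (y_pred, y_true):", f1_swapped)
```

On imbalanced data such as this dataset, the gap between the two orderings depends on how far the predicted class distribution drifts from the true one.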


## Feature importance

Extracting feature importances is easy: when the Random Forest algorithm builds the model, it stores them in an attribute of the "clf_rnd" object called feature_importances_. In scikit-learn these are impurity-based importances, normalized so that they sum to 1.
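
To see the attribute in isolation, here is a minimal sketch on synthetic data. The data from `make_classification`, the `feature_i` names and the `forest` variable are all assumptions made for this example; they are unrelated to the malware dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 8 features, of which 3 are informative
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=42)
names = [f"feature_{i}" for i in range(X.shape[1])]

# Fit a small forest; feature_importances_ is available after fit()
forest = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

# One importance value per input feature, sorted from most to least important
importances = pd.Series(forest.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)

print(importances)
print("Sum of importances:", importances.sum())
```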

In [19]:
# Feature importances
clf_rnd.feature_importances_

array([0.03096656, 0.00303719, 0.00440737, 0.02318232, 0.01184895,
       0.01721388, 0.00881173, 0.02199267, 0.01122589, 0.01910279,
       0.01229994, 0.00912599, 0.0049411 , 0.01864105, 0.00468261,
       0.01359503, 0.0060695 , 0.01755146, 0.00504174, 0.01740915,
       0.00478204, 0.00668029, 0.00337915, 0.00937514, 0.00572423,
       0.        , 0.        , 0.00268121, 0.00471322, 0.02948284,
       0.0175912 , 0.02737585, 0.0276842 , 0.02610625, 0.0159516 ,
       0.0247063 , 0.01454405, 0.02000791, 0.03888253, 0.03004006,
       0.00794144, 0.03300505, 0.00432689, 0.0041829 , 0.01156361,
       0.00794625, 0.        , 0.        , 0.        , 0.01207349,
       0.02251504, 0.01938611, 0.00347552, 0.00116829, 0.00072676,
       0.00094549, 0.00527031, 0.0106541 , 0.00290367, 0.00144508,
       0.00254706, 0.00234171, 0.00912673, 0.00249816, 0.00228634,
       0.00765582, 0.00907677, 0.01158292, 0.00196904, 0.0121832 ,
       0.00783499, 0.00994449, 0.00162465, 0.00188881, 0.14141

In [20]:
# Find out which features matter most for classifying the data correctly.
# Build a dictionary (key, value) mapping each feature name to its importance.
# X_train.columns is used so that the label column "calss" is not included.
feature_importances = {name: score for name, score in zip(X_train.columns, clf_rnd.feature_importances_)}
feature_importances

{'duration': 0.0309665551436818,
 'total_fpackets': 0.0030371879478990325,
 'total_bpackets': 0.00440736580271348,
 'total_fpktl': 0.023182320203179382,
 'total_bpktl': 0.011848946866387284,
 'min_fpktl': 0.0172138800874452,
 'min_bpktl': 0.008811725005059473,
 'max_fpktl': 0.021992674160090632,
 'max_bpktl': 0.011225894335249498,
 'mean_fpktl': 0.019102793492948293,
 'mean_bpktl': 0.012299937432945812,
 'std_fpktl': 0.009125990714497939,
 'std_bpktl': 0.004941100960028792,
 'total_fiat': 0.01864105413523527,
 'total_biat': 0.0046826053903186015,
 'min_fiat': 0.013595030719278397,
 'min_biat': 0.006069501776756673,
 'max_fiat': 0.01755146299222828,
 'max_biat': 0.0050417449242102525,
 'mean_fiat': 0.017409145281749354,
 'mean_biat': 0.004782035041627898,
 'std_fiat': 0.006680286577580882,
 'std_biat': 0.0033791452474712593,
 'fpsh_cnt': 0.009375136418490434,
 'bpsh_cnt': 0.005724226852978681,
 'furg_cnt': 0.0,
 'burg_cnt': 0.0,
 'total_fhlen': 0.002681208069916213,
 'total_bhlen': 0.00

In [21]:
feature_importances_sorted = pd.Series(feature_importances).sort_values(ascending=False)
feature_importances_sorted.head(20)

Init_Win_bytes_forward     0.141411
max_flowiat                0.038883
flow_fin                   0.033005
Init_Win_bytes_backward    0.031345
duration                   0.030967
mean_flowiat               0.030040
fPktsPerSecond             0.029483
flowBytesPerSecond         0.027684
flowPktsPerSecond          0.027376
min_flowpktl               0.026106
mean_flowpktl              0.024706
total_fpktl                0.023182
avgPacketSize              0.022515
max_fpktl                  0.021993
min_flowiat                0.020008
fAvgSegmentSize            0.019386
mean_fpktl                 0.019103
total_fiat                 0.018641
min_seg_size_forward       0.017701
bPktsPerSecond             0.017591
dtype: float64

## Reducing the number of features

In [22]:
# Extract the 10 features that are most important to the algorithm
columns = list(feature_importances_sorted.head(10).index)
columns

['Init_Win_bytes_forward',
 'max_flowiat',
 'flow_fin',
 'Init_Win_bytes_backward',
 'duration',
 'mean_flowiat',
 'fPktsPerSecond',
 'flowBytesPerSecond',
 'flowPktsPerSecond',
 'min_flowpktl']

In [23]:
# Create the reduced training sets: from X_train and X_val, which have 79 features each,
# keep only the columns (input features) listed in "columns".
X_train_reduced = X_train[columns].copy()
X_val_reduced = X_val[columns].copy()

In [24]:
# Training data with 10 features
X_train_reduced

Unnamed: 0,Init_Win_bytes_forward,max_flowiat,flow_fin,Init_Win_bytes_backward,duration,mean_flowiat,fPktsPerSecond,flowBytesPerSecond,flowPktsPerSecond,min_flowpktl
508881,0,490,0,0,490,490.0,2040.816327,679591.836700,4081.632653,73
208326,0,-1,0,-1,0,0.0,0.000000,0.000000,0.000000,422
107213,0,-1,0,-1,0,0.0,0.000000,0.000000,0.000000,436
466726,0,23933,0,0,23933,23933.0,41.783312,21267.705680,83.566623,54
230085,0,-1,0,-1,0,0.0,0.000000,0.000000,0.000000,422
...,...,...,...,...,...,...,...,...,...,...
110268,0,5018131,0,0,8856187,4428093.5,0.225831,36.584593,0.338746,108
259178,88704,28238005,2,-1,28238005,28200000.0,0.070827,3.682980,0.070827,52
365838,4194240,34928,1,1718208,72542,14508.4,41.355353,5955.170798,82.710706,52
131932,13376,-1,0,-1,0,0.0,0.000000,0.000000,0.000000,52


In [25]:
# Validation data with 10 features
X_val_reduced

Unnamed: 0,Init_Win_bytes_forward,max_flowiat,flow_fin,Init_Win_bytes_backward,duration,mean_flowiat,fPktsPerSecond,flowBytesPerSecond,flowPktsPerSecond,min_flowpktl
240832,90496,8580002,2,-1,8580002,8580002.000,0.233100,1.212121e+01,0.233100,52
326539,0,114583,0,0,114583,114583.000,8.727298,3.482192e+03,17.454596,67
200606,0,-1,0,-1,0,0.000,0.000000,0.000000e+00,0.000000,422
431142,106816,7941127,1,-1,7941129,3970564.500,0.377780,2.354829e+01,0.377780,52
478100,4194240,31205763,1,1853440,31590262,1504298.190,0.379864,2.316220e+02,0.696417,52
...,...,...,...,...,...,...,...,...,...,...
215540,89792,7379378,2,-1,7379378,7379378.000,0.271026,2.249512e+01,0.271026,83
516620,62912,8,0,-1,8,8.000,250000.000000,1.690000e+07,250000.000000,52
592495,262336,103128,0,32768,151998,30399.600,26.316136,1.193437e+04,39.474204,52
279808,4194240,60186541,1,1145472,60262041,6695782.333,0.082971,3.746969e+01,0.165942,52


### Retraining

In [26]:
# Training is faster with fewer features
from sklearn.ensemble import RandomForestClassifier

clf_rnd = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
clf_rnd.fit(X_train_reduced, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=-1,
                       oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [27]:
# Predict on the validation set
y_pred = clf_rnd.predict(X_val_reduced)
y_pred

array(['asware', 'asware', 'benign', ..., 'benign', 'asware', 'benign'],
      dtype=object)

In [28]:
print("F1 score:", f1_score(y_pred, y_val, average='weighted'))

F1 score: 0.926788599012114


**Our model's performance degrades only slightly (weighted F1 drops from about 0.932 to 0.927) after removing 69 of its 79 features. In exchange, training and prediction become substantially faster.**
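
The notebook does not time either run, so the speed-up is stated rather than measured. One way to check it is sketched below on synthetic data. The dataset from `make_classification`, the `fit_and_score` helper and all variable names are assumptions for illustration. The timings depend on the machine and do not reproduce the notebook's results.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 40 features, 8 of them informative
X, y = make_classification(
    n_samples=2000, n_features=40, n_informative=8, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=42)


def fit_and_score(X_fit, X_eval):
    """Fit a 50-tree forest; return the model, fit time in seconds, weighted F1."""
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    start = time.perf_counter()
    clf.fit(X_fit, y_tr)
    elapsed = time.perf_counter() - start
    score = f1_score(y_va, clf.predict(X_eval), average="weighted")
    return clf, elapsed, score


# Train on all features, then keep the 10 most important ones and retrain
full_clf, full_time, full_f1 = fit_and_score(X_tr, X_va)
top10 = np.argsort(full_clf.feature_importances_)[::-1][:10]
_, reduced_time, reduced_f1 = fit_and_score(X_tr[:, top10], X_va[:, top10])

print(f"all features: fit {full_time:.3f}s, weighted F1 {full_f1:.4f}")
print(f"top 10 only:  fit {reduced_time:.3f}s, weighted F1 {reduced_f1:.4f}")
```

Applied to the notebook's own data, the same pattern would wrap the two `clf_rnd.fit(...)` calls with `time.perf_counter()`.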