# Case Study: Feature Selection
This practical use case presents a feature-selection mechanism based on **random forest**.

## Dataset: Android Malware Detection

We propose our new Android malware dataset here, named CICAndMal2017. In this approach, we run both our malware and benign applications on real smartphones to avoid the runtime behaviour modification of advanced malware samples that are able to detect the emulator environment. We collected more than 10,854 samples (4,354 malware and 6,500 benign) from several sources. We have collected over six thousand benign apps from the Google Play market published in 2015, 2016, and 2017.

We installed 5,000 of the collected samples (426 malware and 5,065 benign) on real devices. Our malware samples in the CICAndMal2017 dataset are classified into four categories:

- Adware
- Ransomware
- Scareware
- SMS Malware

Our samples come from 42 unique malware families. The family kinds of each category and the numbers of the captured samples are as follows:

- Adware
    - Dowgin family, 10 captured samples
    - Ewind family, 10 captured samples
    - Feiwo family, 15 captured samples
    - Gooligan family, 14 captured samples
    - Kemoge family, 11 captured samples
    - koodous family, 10 captured samples
    - Mobidash family, 10 captured samples
    - Selfmite family, 4 captured samples
    - Shuanet family, 10 captured samples
    - Youmi family, 10 captured samples
- Ransomware
    - Charger family, 10 captured samples
    - Jisut family, 10 captured samples
    - Koler family, 10 captured samples
    - LockerPin family, 10 captured samples
    - Simplocker family, 10 captured samples
    - Pletor family, 10 captured samples
    - PornDroid family, 10 captured samples
    - RansomBO family, 10 captured samples
    - Svpeng family, 11 captured samples
    - WannaLocker family, 10 captured samples
- Scareware
    - AndroidDefender family, 17 captured samples
    - AndroidSpy.277 family, 6 captured samples
    - AV for Android family, 10 captured samples
    - AVpass family, 10 captured samples
    - FakeApp family, 10 captured samples
    - FakeApp.AL family, 11 captured samples
    - FakeAV family, 10 captured samples
    - FakeJobOffer family, 9 captured samples
    - FakeTaoBao family, 9 captured samples
    - Penetho family, 10 captured samples
    - VirusShield family, 10 captured samples
- SMS Malware
    - BeanBot family, 9 captured samples
    - Biige family, 11 captured samples
    - FakeInst family, 10 captured samples
    - FakeMart family, 10 captured samples
    - FakeNotify family, 10 captured samples
    - Jifake family, 10 captured samples
    - Mazarbot family, 9 captured samples
    - Nandrobox family, 11 captured samples
    - Plankton family, 10 captured samples
    - SMSsniffer family, 9 captured samples
    - Zsone family, 10 captured samples

In order to acquire a comprehensive view of our malware samples, we created a specific scenario for each malware category. We also defined three states of data capturing in order to overcome the stealthiness of an advanced malware:

- Installation: The first state of data capturing, which occurs immediately after installing the malware (1-3 min).
- Before restart: The second state of data capturing, which occurs 15 min before rebooting the phones.
- After restart: The last state of data capturing, which occurs 15 min after rebooting the phones.

For feature extraction and selection, we captured network traffic features (.pcap files) and extracted more than 80 features by using CICFlowMeter-V3 during all three mentioned states (installation, before restart, and after restart).

### License
The CICAndMal2017 dataset is publicly available for researchers. If you are using our dataset, you should cite our related research paper that outlines the details of the dataset and its underlying principles:

Arash Habibi Lashkari, Andi Fitriah A. Kadir, Laya Taheri, and Ali A. Ghorbani, “Toward Developing a Systematic Approach to Generate Benchmark Android Malware Datasets and Classification”, In the proceedings of the 52nd IEEE International Carnahan Conference on Security Technology (ICCST), Montreal, Quebec, Canada, 2018.

[Download dataset](http://205.174.165.80/CICDataset/CICMalAnal2017/)

## Imports

In [24]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import f1_score
from pandas import DataFrame

## Homework

- Helper functions (partitioning)
- Label removal (remove_labels)
- Reading the dataset (path)
- Dataset visualization
    - Head
    - Describe
    - Info
- Dataset split

### Helper Functions

In [25]:
# Build a function that performs the full partitioning of the dataset:
# 60% training, 20% validation, 20% test

def train_val_test_split(df, rstate = 42, shuffle = True, stratify = None):
    
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size = 0.4, random_state = rstate, shuffle = shuffle, stratify = strat
    )

    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size = 0.5, random_state = rstate, shuffle = shuffle, stratify = strat
    )

    return (train_set, val_set, test_set)
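
To see what the helper produces, here is a minimal sketch on a hypothetical toy DataFrame (the `toy` data and its `feature` column are assumptions for illustration; only the `calss` label name comes from the dataset). It repeats the helper so the block runs on its own, and passes `stratify` so that each partition keeps the original class proportions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
    # Same logic as the helper above: 60% train, then the remaining 40% split in half
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat
    )
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat
    )
    return train_set, val_set, test_set


# Hypothetical toy data: 100 rows, 75% "benign" and 25% "asware"
toy = pd.DataFrame({
    "feature": range(100),
    "calss": ["benign"] * 75 + ["asware"] * 25,
})

train_toy, val_toy, test_toy = train_val_test_split(toy, stratify="calss")
split_sizes = (len(train_toy), len(val_toy), len(test_toy))
train_benign_share = (train_toy["calss"] == "benign").mean()
print(split_sizes, train_benign_share)
```

Stratifying is optional, but it can be worth considering when the classes are imbalanced, since a purely random split may shift the class ratio slightly between partitions.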

### Label Removal

In [26]:
# Split the label_name column off the DataFrame and return it separately (y)

def remove_labels(df, label_name):
    X = df.drop(label_name, axis = 1)
    y = df[label_name].copy()
    return (X,y)

In [27]:
# Compare the model's performance with and without data preparation

def evaluate_result(y_pred, y, y_prep_pred, y_prep, metric): # metric is the comparison metric
    # scikit-learn metrics take the ground truth first: metric(y_true, y_pred)
    print(metric.__name__, 'Without preparation: ', metric(y, y_pred, average = 'weighted'))
    print(metric.__name__, 'With preparation: ', metric(y_prep, y_prep_pred, average = 'weighted'))
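
scikit-learn metrics expect the ground truth as the first argument, `f1_score(y_true, y_pred)`. With `average='weighted'` the order matters, because the per-class scores are weighted by the class counts of the first argument. The following sketch uses four made-up labels (an assumption for illustration, not data from this notebook) to show that swapping the arguments changes the result.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: three benign flows and one adware flow
y_true = ["benign", "benign", "benign", "asware"]
y_pred = ["benign", "benign", "asware", "asware"]

# Per-class F1 is 0.8 for "benign" and 2/3 for "asware" in both cases;
# only the class weights (taken from the first argument) differ.
f1_correct = f1_score(y_true, y_pred, average="weighted")
f1_swapped = f1_score(y_pred, y_true, average="weighted")
print(f1_correct, f1_swapped)
```

The difference is usually small on large datasets, but it is enough to make two reported scores not strictly comparable.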

### Reading the Dataset

In [28]:
df = pd.read_csv('data/TotalFeatures-ISCXFlowMeter.csv')

### Dataset Visualization

In [29]:
df.head(10)

Unnamed: 0,duration,total_fpackets,total_bpackets,total_fpktl,total_bpktl,min_fpktl,min_bpktl,max_fpktl,max_bpktl,mean_fpktl,...,mean_idle,max_idle,std_idle,FFNEPD,Init_Win_bytes_forward,Init_Win_bytes_backward,RRT_samples_clnt,Act_data_pkt_forward,min_seg_size_forward,calss
0,1020586,668,1641,35692,2276876,52,52,679,1390,53.431138,...,0.0,-1,0.0,2,4194240,1853440,1640,668,32,benign
1,80794,1,1,75,124,75,124,75,124,75.0,...,0.0,-1,0.0,2,0,0,0,1,0,benign
2,998,3,0,187,0,52,-1,83,-1,62.333333,...,0.0,-1,0.0,4,101888,-1,0,3,32,benign
3,189868,9,9,1448,6200,52,52,706,1390,160.888889,...,0.0,-1,0.0,2,4194240,2722560,8,9,32,benign
4,110577,4,6,528,1422,52,52,331,1005,132.0,...,0.0,-1,0.0,2,155136,31232,5,4,32,benign
5,261876,7,6,1618,882,52,52,730,477,231.142857,...,0.0,-1,0.0,2,4194240,926720,3,7,32,benign
6,14,2,0,104,0,52,-1,52,-1,52.0,...,0.0,-1,0.0,3,5824,-1,0,2,32,benign
7,29675,1,1,71,213,71,213,71,213,71.0,...,0.0,-1,0.0,2,0,0,0,1,0,benign
8,806635,4,0,239,0,52,-1,83,-1,59.75,...,0.0,-1,0.0,5,107008,-1,0,4,32,benign
9,56620,3,2,1074,719,52,52,592,667,358.0,...,0.0,-1,0.0,3,128512,10816,1,3,32,benign


In [30]:
df.describe()

Unnamed: 0,duration,total_fpackets,total_bpackets,total_fpktl,total_bpktl,min_fpktl,min_bpktl,max_fpktl,max_bpktl,mean_fpktl,...,min_idle,mean_idle,max_idle,std_idle,FFNEPD,Init_Win_bytes_forward,Init_Win_bytes_backward,RRT_samples_clnt,Act_data_pkt_forward,min_seg_size_forward
count,315977.0,315977.0,315977.0,315977.0,315977.0,315977.0,315977.0,315977.0,315977.0,315977.0,...,315977.0,315977.0,315977.0,315977.0,315977.0,315977.0,315977.0,315977.0,315977.0,315977.0
mean,14561700.0,5.408305,7.969283,782.8554,9656.793,211.070894,34.541739,302.603449,131.493068,238.203467,...,13502280.0,13712910.0,13984110.0,283800.6,2.268228,743868.1,250348.4,7.563826,5.40822,14.738655
std,186821400.0,193.223225,359.298667,63391.65,497752.4,179.564073,80.47328,254.311154,321.15533,173.115136,...,186669000.0,186651400.0,186766100.0,4412134.0,0.973226,1563906.0,602756.6,357.928135,193.223228,15.526032
min,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,0.0,...,-1.0,0.0,-1.0,0.0,2.0,-1.0,-1.0,0.0,0.0,0.0
25%,0.0,1.0,0.0,83.0,0.0,52.0,-1.0,60.0,-1.0,58.0,...,-1.0,0.0,-1.0,0.0,2.0,0.0,-1.0,0.0,1.0,0.0
50%,2.0,1.0,0.0,365.0,0.0,71.0,-1.0,365.0,-1.0,231.142857,...,-1.0,0.0,-1.0,0.0,2.0,0.0,-1.0,0.0,1.0,0.0
75%,191496.0,2.0,1.0,422.0,112.0,420.0,52.0,422.0,60.0,420.0,...,-1.0,0.0,-1.0,0.0,2.0,96768.0,10240.0,1.0,2.0,32.0
max,44310760000.0,48255.0,74768.0,33146820.0,103922200.0,1390.0,1390.0,1390.0,1390.0,1390.0,...,44310720000.0,44300000000.0,44310720000.0,535000000.0,62.0,4194240.0,4194240.0,74524.0,48255.0,44.0


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 315977 entries, 0 to 315976
Data columns (total 80 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration                 315977 non-null  int64  
 1   total_fpackets           315977 non-null  int64  
 2   total_bpackets           315977 non-null  int64  
 3   total_fpktl              315977 non-null  int64  
 4   total_bpktl              315977 non-null  int64  
 5   min_fpktl                315977 non-null  int64  
 6   min_bpktl                315977 non-null  int64  
 7   max_fpktl                315977 non-null  int64  
 8   max_bpktl                315977 non-null  int64  
 9   mean_fpktl               315977 non-null  float64
 10  mean_bpktl               315977 non-null  float64
 11  std_fpktl                315977 non-null  float64
 12  std_bpktl                315977 non-null  float64
 13  total_fiat               315977 non-null  int64  
 14  tota

In [32]:
# Print the dataset length and the number of columns (79 features plus the 'calss' label)
print('Dataset length:', len(df))
print('Number of features:', len(df.columns))

Dataset length: 315977
Number of features: 80


In [33]:
df['calss'].value_counts()

calss
benign    234706
asware     81271
Name: count, dtype: int64

### Dataset Split

In [34]:
train_set, val_set, test_set = train_val_test_split(df)

In [35]:
X_train, y_train = remove_labels(train_set, 'calss')
X_val, y_val = remove_labels(val_set, 'calss')
X_test, y_test = remove_labels(test_set, 'calss')

### Random Forest

In [36]:
from sklearn.ensemble import RandomForestClassifier

clf_rnd = RandomForestClassifier(n_estimators = 50, random_state = 42, n_jobs = -1) # n_jobs = -1 uses all available CPU cores
clf_rnd.fit(X_train, y_train)

Parameter,Value
n_estimators,50
criterion,'gini'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True


In [37]:
# Predict on the validation set
y_pred = clf_rnd.predict(X_val)

In [38]:
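# The F1 score evaluates the model's performance.
# Note: scikit-learn's signature is f1_score(y_true, y_pred); the argument order here is the one that produced the score below.
print("F1 Score:", f1_score(y_pred, y_val, average='weighted'))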

F1 Score: 0.9624686581306697


### Feature Importance

In [39]:
clf_rnd.feature_importances_    # Importance of each feature, in column order (sorted from highest to lowest in the following cells)

array([0.03352998, 0.00204695, 0.00246819, 0.02317963, 0.01300585,
       0.01894395, 0.0079682 , 0.03254743, 0.00959185, 0.02350418,
       0.01314166, 0.0085183 , 0.00358414, 0.01080972, 0.00416948,
       0.01039611, 0.00409059, 0.01299075, 0.0041399 , 0.01113759,
       0.00313383, 0.00400255, 0.00301038, 0.00842324, 0.004219  ,
       0.        , 0.        , 0.00220932, 0.00423967, 0.03011715,
       0.01772341, 0.03298286, 0.02753135, 0.03715189, 0.01585442,
       0.03980366, 0.03447542, 0.01742659, 0.05497709, 0.03311086,
       0.00772097, 0.02097283, 0.00149409, 0.00341121, 0.0078283 ,
       0.01166879, 0.        , 0.        , 0.        , 0.01046121,
       0.02921456, 0.02501198, 0.00257796, 0.00156958, 0.00048032,
       0.00098007, 0.00335447, 0.00727829, 0.00162588, 0.00087183,
       0.00144448, 0.0031847 , 0.00307433, 0.00244125, 0.00175729,
       0.00974963, 0.00792785, 0.00827405, 0.00116164, 0.00802302,
       0.01274659, 0.01198313, 0.0005293 , 0.00334104, 0.10199

In [40]:
# The features that matter most for classifying the data correctly can be extracted.
# X_train.columns excludes the 'calss' label, so names and scores line up one to one.

feature_importances = {
    name: score for name, score in zip(X_train.columns, clf_rnd.feature_importances_)
}

In [41]:
feature_importances_sorted = pd.Series(feature_importances).sort_values(ascending=False)
feature_importances_sorted.head(20)

Init_Win_bytes_forward     0.101999
max_flowiat                0.054977
min_seg_size_forward       0.041446
mean_flowpktl              0.039804
min_flowpktl               0.037152
std_flowpktl               0.034475
duration                   0.033530
mean_flowiat               0.033111
flowPktsPerSecond          0.032983
max_fpktl                  0.032547
fPktsPerSecond             0.030117
avgPacketSize              0.029215
flowBytesPerSecond         0.027531
fAvgSegmentSize            0.025012
mean_fpktl                 0.023504
total_fpktl                0.023180
flow_fin                   0.020973
min_fpktl                  0.018944
Init_Win_bytes_backward    0.018683
bPktsPerSecond             0.017723
dtype: float64
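
One way to judge how many features to keep is to look at the cumulative importance. Random forest importances in scikit-learn are normalized to sum to 1, so the running total reads as a share of the whole. The sketch below copies the ten highest values from the output above into a hypothetical `top10` Series; in the notebook itself the same call could be applied directly to `feature_importances_sorted`.

```python
import pandas as pd

# Top-10 importances copied from the sorted output above
top10 = pd.Series({
    "Init_Win_bytes_forward": 0.101999,
    "max_flowiat": 0.054977,
    "min_seg_size_forward": 0.041446,
    "mean_flowpktl": 0.039804,
    "min_flowpktl": 0.037152,
    "std_flowpktl": 0.034475,
    "duration": 0.033530,
    "mean_flowiat": 0.033111,
    "flowPktsPerSecond": 0.032983,
    "max_fpktl": 0.032547,
})

# Running total of importance as features are added in ranked order
cumulative = top10.cumsum()
top10_share = cumulative.iloc[-1]

print(cumulative.round(3))
print(f"Top 10 features hold {top10_share:.1%} of the total importance")
```

Under these values the ten highest-ranked features account for roughly 44% of the total importance, which suggests the remaining importance is spread thinly over many features.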

## Reducing the Number of Features

In [None]:
# Extract the 10 features most relevant to the algorithm
columns = list(feature_importances_sorted.head(10).index)

In [44]:
X_train_reduced = X_train[columns].copy()
X_val_reduced = X_val[columns].copy()
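
The manual top-10 selection above can also be expressed with scikit-learn's `SelectFromModel`. This is an alternative sketch, not the approach used in this notebook, and it runs on synthetic data from `make_classification` (an assumption made so the block is self-contained). Setting `threshold=-np.inf` disables the importance threshold, so `max_features` alone decides how many columns survive.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the real dataset (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42
)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=42),
    threshold=-np.inf,  # disable the importance threshold ...
    max_features=10,    # ... and keep the 10 highest-ranked features
)
selector.fit(X_demo, y_demo)

selected_mask = selector.get_support()       # boolean mask over the 20 columns
X_demo_reduced = selector.transform(X_demo)  # only the selected columns
print(selected_mask.sum(), X_demo_reduced.shape)
```

A selector object like this can be placed in a `Pipeline`, which keeps the column choice tied to the training data and applies the same reduction to validation and test sets.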

In [45]:
X_train_reduced.head(10)

Unnamed: 0,Init_Win_bytes_forward,max_flowiat,min_seg_size_forward,mean_flowpktl,min_flowpktl,std_flowpktl,duration,mean_flowiat,flowPktsPerSecond,max_fpktl
116976,4194240,67748,32,292.4,52,425.87689,205123,8546.792,121.878093,785
261432,4928,-1,32,52.0,52,0.0,0,0.0,0.0,52
293551,87616,449880,32,52.0,52,1.0,449880,449880.0,4.44563,52
296426,88704,2,20,40.0,40,1.0,2,2.0,1000000.0,40
243164,0,-1,0,436.0,436,0.0,0,0.0,0.0,436
183278,93184,199656661,32,58.0,52,8.42615,199656661,200000000.0,0.010017,64
20404,4194240,33916,32,72.0,52,41.335215,75891,15178.2,79.060758,156
25781,0,-1,0,420.0,420,0.0,0,0.0,0.0,420
130727,0,-1,0,408.0,408,0.0,0,0.0,0.0,408
230506,15232,7030,32,52.0,52,1.0,7030,7030.0,284.495021,52


In [46]:
from sklearn.ensemble import RandomForestClassifier

clf_rnd = RandomForestClassifier(n_estimators = 50, random_state = 42, n_jobs = -1)
clf_rnd.fit(X_train_reduced, y_train)

Parameter,Value
n_estimators,50
criterion,'gini'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True


In [47]:
# Predict on the validation set
y_pred = clf_rnd.predict(X_val_reduced)

In [48]:
# The F1 score evaluates the model's performance (same argument order as in the earlier evaluation, so both scores are comparable)
print("F1 Score:", f1_score(y_pred, y_val, average='weighted'))

F1 Score: 0.958669653898154


As the previous cell shows, the model's performance degrades very little (the weighted F1 score drops from about 0.962 to about 0.959) after removing 69 of the 79 available features. In return, training and prediction become substantially faster, since the model works with 10 columns instead of 79.
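
The speed-up mentioned above is not measured in the notebook. A minimal way to check it is to time `fit` on the full and on the reduced feature set. The sketch below does this on synthetic data (an assumption; `fit_seconds` is a hypothetical helper), so the absolute numbers say nothing about the real dataset, and on small inputs the difference may be negligible.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with 79 features, mirroring the width of the real dataset
X_full, y = make_classification(
    n_samples=2000, n_features=79, n_informative=10, random_state=42
)
X_small = X_full[:, :10]  # arbitrary 10-column subset, used only to compare cost


def fit_seconds(X, y):
    """Train a forest with the notebook's settings and return (seconds, model)."""
    clf = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
    start = time.perf_counter()
    clf.fit(X, y)
    return time.perf_counter() - start, clf


full_time, clf_full = fit_seconds(X_full, y)
small_time, clf_small = fit_seconds(X_small, y)

print(f"79 features: {full_time:.3f} s")
print(f"10 features: {small_time:.3f} s")
```

Running the same comparison with `X_train` and `X_train_reduced` would give the figures that actually back the claim for this dataset.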