# Dataset Preparation

This notebook shows some of the most widely used techniques for transforming a dataset.

## Dataset

## Description
ISCX NSL-KDD is a data set suggested to solve some of the inherent problems of the KDD'99 data set which are mentioned in [1]. Although, this new version of the KDD data set still suffers from some of the problems discussed by McHugh and may not be a perfect representative of existing real networks, because of the lack of public data sets for network-based IDSs, we believe it still can be applied as an effective benchmark data set to help researchers compare different intrusion detection methods.

Furthermore, the number of records in the NSL-KDD train and test sets are reasonable. This advantage makes it affordable to run the experiments on the complete set without the need to randomly select a small portion. Consequently, evaluation results of different research work will be consistent and comparable.

Data files

* <span style="color:green">KDDTrain+.ARFF: The full NSL-KDD train set with binary labels in ARFF format</span>
* KDDTrain+.TXT: The full NSL-KDD train set including attack-type labels and difficulty level in CSV format
* KDDTrain+_20Percent.ARFF: A 20% subset of the KDDTrain+.arff file
* KDDTrain+_20Percent.TXT: A 20% subset of the KDDTrain+.txt file
* KDDTest+.ARFF: The full NSL-KDD test set with binary labels in ARFF format
* KDDTest+.TXT: The full NSL-KDD test set including attack-type labels and difficulty level in CSV format
* KDDTest-21.ARFF: A subset of the KDDTest+.arff file which does not include records with difficulty level of 21 out of 21
* KDDTest-21.TXT: A subset of the KDDTest+.txt file which does not include records with difficulty level of 21 out of 21

[Download link](https://www.unb.ca/cic/datasets/nsl.html)

License
You may redistribute, republish, and mirror the ISCX NSL-KDD dataset in any form. However, any use or redistribution of the data must include a citation to the NSL-KDD dataset and the paper referenced below.

References: [1] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A Detailed Analysis of the KDD CUP 99 Data Set,” Submitted to Second IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2009.

Imports

In [1]:
import arff 
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

Helper functions

In [2]:
def load_kdd_dataset(data_path):
    """Read the NSL-KDD dataset from an ARFF file into a pandas DataFrame."""
    with open(data_path, 'r') as train_set:
        dataset = arff.load(train_set)
    # Each ARFF attribute is a (name, type) pair; keep only the names
    attributes = [attr[0] for attr in dataset["attributes"]]
    return pd.DataFrame(dataset["data"], columns=attributes)

In [3]:
def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
    """Split df into 60% train, 20% validation and 20% test sets."""
    print("Dataset length:", len(df))
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)

## 1.- Reading the dataset

In [4]:
df = load_kdd_dataset("./datasets/datasets/NSL-KDD/KDDTrain+.arff")

## 2.- Splitting the dataset

In [5]:
train_set, val_set, test_set = train_val_test_split(df, stratify='protocol_type')

Dataset length: 125973


In [6]:
print("Training set length:", len(train_set))
print("Validation set length:", len(val_set))
print("Test set length:", len(test_set))

Training set length: 75583
Validation set length: 25195
Test set length: 25195
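
The sizes above correspond to a 60/20/20 split. As a minimal sketch of the same two-step procedure, the following uses a small synthetic DataFrame (hypothetical data, not NSL-KDD; the `toy_` names are illustrative) to check that stratifying keeps the `protocol_type` proportions roughly equal in each part.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (hypothetical data, not NSL-KDD)
rng = np.random.default_rng(42)
toy_df = pd.DataFrame({
    "protocol_type": rng.choice(["tcp", "udp", "icmp"], size=1000, p=[0.8, 0.15, 0.05]),
    "src_bytes": rng.integers(0, 1000, size=1000),
})

# First split: 60% train, 40% held out, preserving protocol_type proportions
toy_train, toy_rest = train_test_split(
    toy_df, test_size=0.4, random_state=42, stratify=toy_df["protocol_type"])
# Second split: halve the held-out part into validation and test (20% each)
toy_val, toy_test = train_test_split(
    toy_rest, test_size=0.5, random_state=42, stratify=toy_rest["protocol_type"])

print(len(toy_train), len(toy_val), len(toy_test))
# Class proportions in the full data and in the training part
print(toy_df["protocol_type"].value_counts(normalize=True).round(2).to_dict())
print(toy_train["protocol_type"].value_counts(normalize=True).round(2).to_dict())
```

Without `stratify`, a rare category such as `icmp` could end up under-represented in one of the parts purely by chance.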


## 3.- Cleaning the data

Before starting, we need to recover the clean dataset and separate the labels from the rest of the data, since we do not necessarily want to apply the same transformations to both.

In [7]:
# Separate the input features from the output feature (the label)
X_train = train_set.drop("class", axis=1)
y_train = train_set["class"].copy()

In [8]:
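# To illustrate this section, we need to add some null values to some features of the dataset
X_train.loc[(X_train["src_bytes"] > 400) & (X_train["src_bytes"] < 800), "src_bytes"] = np.nan
# Note: this condition (> 500 and < 200) can never be true, so no nulls are
# added to dst_bytes; the outputs below confirm that only src_bytes has nulls
X_train.loc[(X_train["dst_bytes"] > 500) & (X_train["dst_bytes"] < 200), "dst_bytes"] = np.nan
X_train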

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
113467,0.0,tcp,http,SF,,53508.0,0,0.0,0.0,0.0,...,9.0,255.0,1.00,0.00,0.11,0.03,0.00,0.00,0.0,0.0
31899,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,255.0,4.0,0.02,0.05,0.00,0.00,1.00,1.00,0.0,0.0
108116,0.0,tcp,http,SF,304.0,636.0,0,0.0,0.0,0.0,...,39.0,255.0,1.00,0.00,0.03,0.06,0.00,0.00,0.0,0.0
89913,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,255.0,15.0,0.06,0.07,0.00,0.00,1.00,1.00,0.0,0.0
106319,0.0,icmp,eco_i,SF,8.0,0.0,0,0.0,0.0,0.0,...,2.0,7.0,1.00,0.00,1.00,0.57,0.00,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64559,0.0,tcp,systat,S0,0.0,0.0,0,0.0,0.0,0.0,...,255.0,20.0,0.08,0.06,0.00,0.00,1.00,1.00,0.0,0.0
67272,0.0,tcp,http,SF,210.0,736.0,0,0.0,0.0,0.0,...,119.0,255.0,1.00,0.00,0.01,0.02,0.02,0.01,0.0,0.0
32452,3.0,tcp,smtp,SF,889.0,328.0,0,0.0,0.0,0.0,...,111.0,155.0,0.64,0.04,0.01,0.01,0.01,0.00,0.0,0.0
112657,0.0,tcp,http,SF,284.0,444.0,0,0.0,0.0,0.0,...,255.0,255.0,1.00,0.00,0.00,0.00,0.00,0.00,0.0,0.0


Most machine learning algorithms cannot work with features that contain null values. There are three options for handling them:
* Drop the rows that contain null values.
* Drop the corresponding attribute (column).
* Fill them with a given value (zero, the mean, ...).
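
As a quick illustration of the three options, here is a minimal sketch on a tiny hypothetical DataFrame (the `toy` frame and its values are made up for this example).

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame with one missing value per numeric column
toy = pd.DataFrame({
    "src_bytes": [10.0, np.nan, 30.0, 50.0],
    "dst_bytes": [1.0, 2.0, np.nan, 4.0],
    "flag": ["SF", "S0", "SF", "SF"],
})

# Option 1: drop every row that has at least one null
rows_dropped = toy.dropna()

# Option 2: drop every column that has at least one null
cols_dropped = toy.dropna(axis=1)

# Option 3: fill the nulls, here with the mean of each numeric column
filled = toy.fillna(toy.mean(numeric_only=True))

print(rows_dropped.shape)             # (2, 3)
print(cols_dropped.columns.tolist())  # ['flag']
print(filled["src_bytes"].tolist())   # [10.0, 30.0, 30.0, 50.0]
```

Options 1 and 2 discard information, while option 3 keeps every row and column at the cost of introducing estimated values.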

In [9]:
# Check whether any attribute contains null values
X_train.isna().any()

duration                       False
protocol_type                  False
service                        False
flag                           False
src_bytes                       True
dst_bytes                      False
land                           False
wrong_fragment                 False
urgent                         False
hot                            False
num_failed_logins              False
logged_in                      False
num_compromised                False
root_shell                     False
su_attempted                   False
num_root                       False
num_file_creations             False
num_shells                     False
num_access_files               False
num_outbound_cmds              False
is_host_login                  False
is_guest_login                 False
count                          False
srv_count                      False
serror_rate                    False
srv_serror_rate                False
rerror_rate                    False
...

In [10]:
# Select the rows that contain null values
filas_valores_nulos = X_train[X_train.isnull().any(axis=1)]
filas_valores_nulos

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
113467,0.0,tcp,http,SF,,53508.0,0,0.0,0.0,0.0,...,9.0,255.0,1.00,0.00,0.11,0.03,0.00,0.00,0.00,0.0
64957,1.0,tcp,smtp,SF,,329.0,0,0.0,0.0,0.0,...,198.0,181.0,0.65,0.03,0.01,0.01,0.02,0.02,0.00,0.0
76437,0.0,icmp,ecr_i,SF,,0.0,0,0.0,0.0,0.0,...,255.0,255.0,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.0
77010,0.0,tcp,ftp_data,SF,,0.0,0,0.0,0.0,0.0,...,87.0,29.0,0.33,0.07,0.33,0.00,0.00,0.00,0.01,0.0
111550,0.0,tcp,smtp,SF,,317.0,0,0.0,0.0,0.0,...,17.0,127.0,0.59,0.24,0.06,0.02,0.00,0.00,0.00,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2379,0.0,tcp,smtp,SF,,327.0,0,0.0,0.0,0.0,...,43.0,130.0,0.67,0.14,0.02,0.02,0.00,0.02,0.00,0.0
66041,3.0,tcp,smtp,SF,,329.0,0,0.0,0.0,0.0,...,95.0,179.0,0.83,0.06,0.01,0.01,0.02,0.01,0.00,0.0
31079,0.0,tcp,ftp_data,SF,,0.0,0,0.0,0.0,0.0,...,255.0,77.0,0.30,0.02,0.30,0.00,0.00,0.00,0.00,0.0
13023,0.0,tcp,ftp_data,SF,,0.0,0,0.0,0.0,0.0,...,100.0,58.0,0.01,0.03,0.01,0.03,0.00,0.00,0.00,0.0


### Option 1: Drop the rows with null values

In [11]:
# Copy the dataset so the original is not altered
X_train_copy = X_train.copy()

In [12]:
# Drop the rows with null values
X_train_copy.dropna(subset=["src_bytes", "dst_bytes"], inplace=True)
X_train_copy

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
31899,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,255.0,4.0,0.02,0.05,0.00,0.00,1.00,1.00,0.0,0.0
108116,0.0,tcp,http,SF,304.0,636.0,0,0.0,0.0,0.0,...,39.0,255.0,1.00,0.00,0.03,0.06,0.00,0.00,0.0,0.0
89913,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,255.0,15.0,0.06,0.07,0.00,0.00,1.00,1.00,0.0,0.0
106319,0.0,icmp,eco_i,SF,8.0,0.0,0,0.0,0.0,0.0,...,2.0,7.0,1.00,0.00,1.00,0.57,0.00,0.00,0.0,0.0
98007,0.0,udp,domain_u,SF,46.0,139.0,0,0.0,0.0,0.0,...,255.0,254.0,1.00,0.01,0.00,0.00,0.00,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64559,0.0,tcp,systat,S0,0.0,0.0,0,0.0,0.0,0.0,...,255.0,20.0,0.08,0.06,0.00,0.00,1.00,1.00,0.0,0.0
67272,0.0,tcp,http,SF,210.0,736.0,0,0.0,0.0,0.0,...,119.0,255.0,1.00,0.00,0.01,0.02,0.02,0.01,0.0,0.0
32452,3.0,tcp,smtp,SF,889.0,328.0,0,0.0,0.0,0.0,...,111.0,155.0,0.64,0.04,0.01,0.01,0.01,0.00,0.0,0.0
112657,0.0,tcp,http,SF,284.0,444.0,0,0.0,0.0,0.0,...,255.0,255.0,1.00,0.00,0.00,0.00,0.00,0.00,0.0,0.0


In [13]:
# Count the number of rows dropped
print("Number of rows dropped:", len(X_train) - len(X_train_copy))

Number of rows dropped: 1887


### Option 2: Drop the attributes with null values

In [14]:
# Copy the dataset so the original is not altered
X_train_copy = X_train.copy()

In [15]:
# Drop the attributes with null values
X_train_copy.drop(["src_bytes", "dst_bytes"], axis=1, inplace=True)
X_train_copy

Unnamed: 0,duration,protocol_type,service,flag,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
113467,0.0,tcp,http,SF,0,0.0,0.0,0.0,0.0,1,...,9.0,255.0,1.00,0.00,0.11,0.03,0.00,0.00,0.0,0.0
31899,0.0,tcp,private,S0,0,0.0,0.0,0.0,0.0,0,...,255.0,4.0,0.02,0.05,0.00,0.00,1.00,1.00,0.0,0.0
108116,0.0,tcp,http,SF,0,0.0,0.0,0.0,0.0,1,...,39.0,255.0,1.00,0.00,0.03,0.06,0.00,0.00,0.0,0.0
89913,0.0,tcp,private,S0,0,0.0,0.0,0.0,0.0,0,...,255.0,15.0,0.06,0.07,0.00,0.00,1.00,1.00,0.0,0.0
106319,0.0,icmp,eco_i,SF,0,0.0,0.0,0.0,0.0,0,...,2.0,7.0,1.00,0.00,1.00,0.57,0.00,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64559,0.0,tcp,systat,S0,0,0.0,0.0,0.0,0.0,0,...,255.0,20.0,0.08,0.06,0.00,0.00,1.00,1.00,0.0,0.0
67272,0.0,tcp,http,SF,0,0.0,0.0,0.0,0.0,1,...,119.0,255.0,1.00,0.00,0.01,0.02,0.02,0.01,0.0,0.0
32452,3.0,tcp,smtp,SF,0,0.0,0.0,0.0,0.0,1,...,111.0,155.0,0.64,0.04,0.01,0.01,0.01,0.00,0.0,0.0
112657,0.0,tcp,http,SF,0,0.0,0.0,0.0,0.0,1,...,255.0,255.0,1.00,0.00,0.00,0.00,0.00,0.00,0.0,0.0


In [16]:
# Count the number of attributes dropped
print("Number of attributes dropped:", len(list(X_train)) - len(list(X_train_copy)))

Number of attributes dropped: 2


### Option 3: Fill the null values with a given value

In [17]:
# Copy the dataset so the original is not altered
X_train_copy = X_train.copy()


In [18]:
# Fill the null values with the mean of the attribute's values
media_src_bytes = X_train_copy["src_bytes"].mean()
media_dst_bytes = X_train_copy["dst_bytes"].mean()

# Assign the result back instead of calling fillna(inplace=True) on a column,
# which pandas warns about and which stops working in pandas 3.0
X_train_copy["src_bytes"] = X_train_copy["src_bytes"].fillna(media_src_bytes)
X_train_copy["dst_bytes"] = X_train_copy["dst_bytes"].fillna(media_dst_bytes)
X_train_copy


Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
113467,0.0,tcp,http,SF,66914.530762,53508.0,0,0.0,0.0,0.0,...,9.0,255.0,1.00,0.00,0.11,0.03,0.00,0.00,0.0,0.0
31899,0.0,tcp,private,S0,0.000000,0.0,0,0.0,0.0,0.0,...,255.0,4.0,0.02,0.05,0.00,0.00,1.00,1.00,0.0,0.0
108116,0.0,tcp,http,SF,304.000000,636.0,0,0.0,0.0,0.0,...,39.0,255.0,1.00,0.00,0.03,0.06,0.00,0.00,0.0,0.0
89913,0.0,tcp,private,S0,0.000000,0.0,0,0.0,0.0,0.0,...,255.0,15.0,0.06,0.07,0.00,0.00,1.00,1.00,0.0,0.0
106319,0.0,icmp,eco_i,SF,8.000000,0.0,0,0.0,0.0,0.0,...,2.0,7.0,1.00,0.00,1.00,0.57,0.00,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64559,0.0,tcp,systat,S0,0.000000,0.0,0,0.0,0.0,0.0,...,255.0,20.0,0.08,0.06,0.00,0.00,1.00,1.00,0.0,0.0
67272,0.0,tcp,http,SF,210.000000,736.0,0,0.0,0.0,0.0,...,119.0,255.0,1.00,0.00,0.01,0.02,0.02,0.01,0.0,0.0
32452,3.0,tcp,smtp,SF,889.000000,328.0,0,0.0,0.0,0.0,...,111.0,155.0,0.64,0.04,0.01,0.01,0.01,0.00,0.0,0.0
112657,0.0,tcp,http,SF,284.000000,444.0,0,0.0,0.0,0.0,...,255.0,255.0,1.00,0.00,0.00,0.00,0.00,0.00,0.0,0.0
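
In the table above the nulls in `src_bytes` were filled with a mean of about 66914, far larger than the other values shown in that column. The following sketch uses hypothetical byte counts (not taken from the dataset) to show how a single very large value pulls the mean away from typical values while the median stays close to them.

```python
import numpy as np
import pandas as pd

# Hypothetical byte counts: mostly small, one very large transfer, one missing
src = pd.Series([0.0, 8.0, 46.0, np.nan, 210.0, 304.0, 1_000_000.0])

# Both statistics ignore the NaN when they are computed
mean_fill = src.fillna(src.mean())
median_fill = src.fillna(src.median())

print("mean:", src.mean(), "median:", src.median())
print("filled with mean:", mean_fill[3])
print("filled with median:", median_fill[3])   # 128.0
```

Which statistic is preferable depends on the data; for heavily skewed features the median is often a safer default.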


In [19]:
# Copy the dataset so the original is not altered
X_train_copy = X_train.copy()

#### An alternative for option 3 is to use the SimpleImputer class from Sklearn

In [20]:
# Now we use the SimpleImputer class from Sklearn to fill the null values
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')


In [21]:
# SimpleImputer with the median strategy does not accept categorical data, so the categorical attributes are removed
X_train_copy_num = X_train_copy.select_dtypes(exclude=['object'])
X_train_copy_num.info()

<class 'pandas.core.frame.DataFrame'>
Index: 75583 entries, 113467 to 99030
Data columns (total 34 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   duration                     75583 non-null  float64
 1   src_bytes                    73696 non-null  float64
 2   dst_bytes                    75583 non-null  float64
 3   wrong_fragment               75583 non-null  float64
 4   urgent                       75583 non-null  float64
 5   hot                          75583 non-null  float64
 6   num_failed_logins            75583 non-null  float64
 7   num_compromised              75583 non-null  float64
 8   root_shell                   75583 non-null  float64
 9   su_attempted                 75583 non-null  float64
 10  num_root                     75583 non-null  float64
 11  num_file_creations           75583 non-null  float64
 12  num_shells                   75583 non-null  float64
 13  num_access_files

In [22]:
# Pass the numeric attributes to the imputer so that it computes the fill values (the medians)
imputer.fit(X_train_copy_num)

SimpleImputer(strategy='median')


In [23]:
# Now fill the null values using the imputer
X_train_copy_num_nonan = imputer.transform(X_train_copy_num)
# Convert the result (a NumPy array) into a pandas DataFrame
X_train_copy = pd.DataFrame(X_train_copy_num_nonan, columns=X_train_copy_num.columns)
X_train_copy

Unnamed: 0,duration,src_bytes,dst_bytes,wrong_fragment,urgent,hot,num_failed_logins,num_compromised,root_shell,su_attempted,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
0,0.0,43.0,53508.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,9.0,255.0,1.00,0.00,0.11,0.03,0.00,0.00,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,4.0,0.02,0.05,0.00,0.00,1.00,1.00,0.0,0.0
2,0.0,304.0,636.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,39.0,255.0,1.00,0.00,0.03,0.06,0.00,0.00,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,15.0,0.06,0.07,0.00,0.00,1.00,1.00,0.0,0.0
4,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,7.0,1.00,0.00,1.00,0.57,0.00,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75578,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,20.0,0.08,0.06,0.00,0.00,1.00,1.00,0.0,0.0
75579,0.0,210.0,736.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,119.0,255.0,1.00,0.00,0.01,0.02,0.02,0.01,0.0,0.0
75580,3.0,889.0,328.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,111.0,155.0,0.64,0.04,0.01,0.01,0.01,0.00,0.0,0.0
75581,0.0,284.0,444.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,255.0,1.00,0.00,0.00,0.00,0.00,0.00,0.0,0.0
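
The DataFrame above is rebuilt without an `index` argument, so its rows are relabeled from 0 instead of keeping the original row labels. The sketch below, on small hypothetical frames (`num_train` and `num_val` are illustrative names), shows one way to keep the labels and to reuse the medians learned from the training data on another set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric frames with a non-default index, standing in for the
# numeric part of the training and validation sets
num_train = pd.DataFrame(
    {"src_bytes": [0.0, np.nan, 304.0, 8.0], "dst_bytes": [0.0, 636.0, 0.0, 0.0]},
    index=[113467, 31899, 108116, 89913])
num_val = pd.DataFrame(
    {"src_bytes": [np.nan, 210.0], "dst_bytes": [736.0, np.nan]},
    index=[64559, 67272])

# Learn the medians from the training data only
imp = SimpleImputer(strategy="median")
imp.fit(num_train)

# Reuse the same medians on both sets, keeping the original row labels
train_filled = pd.DataFrame(
    imp.transform(num_train), columns=num_train.columns, index=num_train.index)
val_filled = pd.DataFrame(
    imp.transform(num_val), columns=num_val.columns, index=num_val.index)

print(imp.statistics_)               # medians learned per column
print(train_filled.index.tolist())   # original labels are preserved
```

Keeping the index makes it possible to join the imputed numeric columns back with the categorical columns or the labels later on.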


### Sklearn APIs

Before continuing, a brief overview of how the Sklearn APIs work is needed:
* **Estimators**: Any object that can estimate some parameter from data:
    * The estimation itself is performed by the fit() method, which always takes a dataset as argument.
    * Any other parameter that guides the estimation is a hyperparameter.
* **Transformers**: Estimators that can transform a dataset (such as SimpleImputer).
    * The transformation is performed by the transform() method.
    * They take a dataset as input.

* **Predictors**: Estimators that can make predictions.
    * The prediction is performed by the predict() method.
    * They take a dataset as input.
    * They return the predictions as output.
    * They have a score() method to evaluate the quality of the predictions.
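
The three roles can be seen side by side in a short sketch. The data is hypothetical and `LogisticRegression` is only chosen here as a convenient example of a predictor; it is not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature with a missing value, binary labels
toy_X = np.array([[0.0], [1.0], [2.0], [np.nan], [10.0], [11.0], [12.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1, 1])

# Estimator and transformer: fit() learns the median, transform() applies it
toy_imputer = SimpleImputer(strategy="median")   # strategy is a hyperparameter
toy_imputer.fit(toy_X)
toy_X_filled = toy_imputer.transform(toy_X)

# Predictor: fit() learns from data and labels, predict() returns predictions,
# score() evaluates them (mean accuracy for classifiers)
clf = LogisticRegression()
clf.fit(toy_X_filled, toy_y)
predictions = clf.predict(toy_X_filled)
accuracy = clf.score(toy_X_filled, toy_y)

print(toy_imputer.statistics_)   # the parameter learned by fit()
print(predictions, accuracy)
```

Values learned by `fit()` are exposed as attributes ending in an underscore, such as `statistics_` here or `categories_` in the encoders used later.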


## 4.- Transforming categorical attributes into numeric ones

Before starting, we recover the clean dataset and separate the labels from the rest of the data, since we do not necessarily want to apply the same transformations to both.

In [24]:
X_train = train_set.drop("class", axis=1)
y_train = train_set["class"].copy()

As a general rule, machine learning algorithms take numeric data as input. The dataset contains a large number of categorical values, so they must be converted into numeric categories.

In [25]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 75583 entries, 113467 to 99030
Data columns (total 41 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   duration                     75583 non-null  float64
 1   protocol_type                75583 non-null  object 
 2   service                      75583 non-null  object 
 3   flag                         75583 non-null  object 
 4   src_bytes                    75583 non-null  float64
 5   dst_bytes                    75583 non-null  float64
 6   land                         75583 non-null  object 
 7   wrong_fragment               75583 non-null  float64
 8   urgent                       75583 non-null  float64
 9   hot                          75583 non-null  float64
 10  num_failed_logins            75583 non-null  float64
 11  logged_in                    75583 non-null  object 
 12  num_compromised              75583 non-null  float64
 13  root_shell      

There are different ways to convert categorical attributes into numeric ones.
Probably the simplest is the **factorize** method from pandas, which maps each category to a sequential integer.

In [26]:
protocol_type = X_train["protocol_type"]
protocol_type_encoded, categorias = protocol_type.factorize()
protocol_type_encoded

array([0, 0, 0, ..., 0, 0, 0], shape=(75583,))

In [27]:
# Print how the categories have been encoded
for i in range(10):
    print(protocol_type.iloc[i], "=" , protocol_type_encoded[i])

tcp = 0
tcp = 0
tcp = 0
tcp = 0
icmp = 1
udp = 2
tcp = 0
tcp = 0
tcp = 0
tcp = 0
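
`factorize` returns two objects: the integer codes and the unique categories they refer to. A minimal sketch on a short hypothetical Series shows how the second one can be used to decode the codes again.

```python
import pandas as pd

protocols = pd.Series(["tcp", "tcp", "icmp", "udp", "tcp"])  # hypothetical sample

# Codes follow the order of first appearance; uniques holds the category labels
codes, uniques = protocols.factorize()
print(codes.tolist())     # [0, 0, 1, 2, 0]
print(uniques.tolist())   # ['tcp', 'icmp', 'udp']

# Recover the original labels by indexing the uniques with the codes
decoded = uniques.take(codes)
print(decoded.tolist())   # ['tcp', 'tcp', 'icmp', 'udp', 'tcp']
```

Because the codes depend on the order in which categories appear, factorizing two different subsets separately can assign different codes to the same category.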


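### Advanced transformations with Sklearn

#### Ordinal Encoding

Performs the same kind of encoding as the **factorize** method from pandas, but assigns the codes following the sorted order of the categories instead of their order of appearance, and returns them as floats.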

In [28]:
from sklearn.preprocessing import OrdinalEncoder
protocol_type = X_train[["protocol_type"]]
ordinal_encoder = OrdinalEncoder()
protocol_type_encoded = ordinal_encoder.fit_transform(protocol_type)
protocol_type_encoded[:10]

array([[1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [2.],
       [1.],
       [1.],
       [1.],
       [1.]])

In [29]:
# Print how the categories have been encoded
for i in range(10):
    print(protocol_type.iloc[i,0], "=", protocol_type_encoded[i,0])

tcp = 1.0
tcp = 1.0
tcp = 1.0
tcp = 1.0
icmp = 0.0
udp = 2.0
tcp = 1.0
tcp = 1.0
tcp = 1.0
tcp = 1.0


In [30]:
print(ordinal_encoder.categories_)

[array(['icmp', 'tcp', 'udp'], dtype=object)]


The problem with this kind of encoding is that some ML algorithms work by measuring the similarity of two points through their distance. They will consider 1 to be closer to 2 than to 3, which makes no sense for these categorical values. For this reason other encoding methods are used, such as One-Hot Encoding.
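
The distance argument can be checked numerically. This sketch compares Euclidean distances between the three protocol categories under the ordinal codes printed above and under one-hot vectors; the `dist` helper is a name introduced only for this example.

```python
import numpy as np

# Ordinal codes as assigned above: icmp=0, tcp=1, udp=2
ordinal = {"icmp": np.array([0.0]), "tcp": np.array([1.0]), "udp": np.array([2.0])}
# One-hot vectors for the same three categories
one_hot = {"icmp": np.array([1.0, 0.0, 0.0]),
           "tcp": np.array([0.0, 1.0, 0.0]),
           "udp": np.array([0.0, 0.0, 1.0])}

def dist(enc, a, b):
    """Euclidean distance between the encodings of categories a and b."""
    return float(np.linalg.norm(enc[a] - enc[b]))

# Ordinal: icmp looks twice as far from udp as from tcp, an artifact of the codes
print(dist(ordinal, "icmp", "tcp"), dist(ordinal, "icmp", "udp"))   # 1.0 2.0
# One-hot: every pair of distinct categories is equally far apart (sqrt(2))
print(dist(one_hot, "icmp", "tcp"), dist(one_hot, "icmp", "udp"))
```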


#### One-Hot Encoding

For each category of the categorical attribute, it generates a binary column that indicates whether the row takes that value.

In [32]:
# The result is a sparse matrix, which stores only the positions of the non-zero values to save memory

from sklearn.preprocessing import OneHotEncoder
protocol_type = X_train[["protocol_type"]]
oh_encoder = OneHotEncoder()
protocol_type_oh = oh_encoder.fit_transform(protocol_type)
protocol_type_oh

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 75583 stored elements and shape (75583, 3)>

In [33]:
# Convert the sparse matrix into a NumPy array
protocol_type_oh.toarray()


array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       ...,
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.]], shape=(75583, 3))

In [34]:
# Print how the categories have been encoded
protocol_type_dense = protocol_type_oh.toarray()  # convert once, outside the loop
for i in range(10):
    print(protocol_type["protocol_type"].iloc[i], "=", protocol_type_dense[i])

tcp = [0. 1. 0.]
tcp = [0. 1. 0.]
tcp = [0. 1. 0.]
tcp = [0. 1. 0.]
icmp = [1. 0. 0.]
udp = [0. 0. 1.]
tcp = [0. 1. 0.]
tcp = [0. 1. 0.]
tcp = [0. 1. 0.]
tcp = [0. 1. 0.]


In [35]:
print(oh_encoder.categories_)

[array(['icmp', 'tcp', 'udp'], dtype=object)]


When splitting the dataset or making predictions on new examples, new values often appear for certain categories, which will cause an error in the **transform** function. The OneHotEncoder class provides the **handle_unknown** parameter to either raise an error or ignore an unknown categorical value found during the transformation (the default is to raise an error).
When this parameter is set to "ignore" and an unknown category is found during the transformation, the resulting one-hot columns for that feature will be all zeros. In the inverse transformation, an unknown category is denoted as None.

In [36]:
oh_encoder = OneHotEncoder(handle_unknown="ignore")
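
The effect of `handle_unknown="ignore"` can be seen in a small sketch. The frames below are hypothetical, and `sctp` is just an example of a value that was not present when the encoder was fitted.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and new data; "sctp" never appears during fit
seen = pd.DataFrame({"protocol_type": ["tcp", "udp", "icmp", "tcp"]})
new = pd.DataFrame({"protocol_type": ["udp", "sctp"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(seen)

encoded = enc.transform(new).toarray()
print(enc.categories_)
print(encoded)   # the "sctp" row is all zeros

# The inverse transform maps the all-zero row back to None
print(enc.inverse_transform(encoded))
```

With the default `handle_unknown="error"`, the same `transform` call would raise an error instead.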

### Get dummies
**get_dummies** is a pandas function that applies one-hot encoding directly to a Series or DataFrame.

In [37]:
pd.get_dummies(X_train["protocol_type"])

Unnamed: 0,icmp,tcp,udp
113467,False,True,False
31899,False,True,False
108116,False,True,False
89913,False,True,False
106319,True,False,False
...,...,...,...
64559,False,True,False
67272,False,True,False
32452,False,True,False
112657,False,True,False
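
Unlike an encoder object, `get_dummies` does not remember the categories it saw, so encoding two subsets separately can produce different columns. The sketch below, with hypothetical Series, shows one possible way to align new data to the training columns using `reindex`.

```python
import pandas as pd

# Hypothetical train/new samples; the new sample lacks "icmp" and adds "sctp"
train_col = pd.Series(["tcp", "udp", "icmp"], name="protocol_type")
new_col = pd.Series(["tcp", "sctp"], name="protocol_type")

train_dummies = pd.get_dummies(train_col)
new_dummies = pd.get_dummies(new_col)
print(train_dummies.columns.tolist())   # ['icmp', 'tcp', 'udp']
print(new_dummies.columns.tolist())     # ['sctp', 'tcp']

# Align the new frame to the training columns: missing ones are filled with
# False and columns unseen during training ("sctp") are dropped
new_aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=False)
print(new_aligned.columns.tolist())     # ['icmp', 'tcp', 'udp']
```

After alignment, the unseen `sctp` row is all False, which mirrors what `OneHotEncoder(handle_unknown="ignore")` does.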


## Scaling the dataset
Before starting, we need to recover the clean dataset and separate the labels from the rest of the data, since we do not necessarily want to apply the same transformations to both.


In [38]:
# Create a copy of the dataset so the original is not altered
X_train_copy = X_train.copy()

As a general rule, ML algorithms do not behave well when the values of the input features have very different scales. **It is important to keep in mind that scaling must not be applied to the labels.**
* **Normalization**: The values of the attribute are rescaled to fall between 0 and 1.
* **Standardization**: The values of the attribute are rescaled to a similar magnitude (typically zero mean and unit variance), but they are not bounded to a specific range.

**It is important to fit these transformations only on the training dataset. Afterwards, they are applied to the test dataset for evaluation.**
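
A minimal sketch of both approaches, using Sklearn's `MinMaxScaler` for normalization and `StandardScaler` for standardization on hypothetical single-feature data (neither class is used in the cells of this notebook). Each scaler is fitted on the training values only and then reused on the test values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature training and test data
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [20.0]])

# Normalization: fit on the training data only, then reuse on the test data
minmax = MinMaxScaler().fit(train)
train_norm = minmax.transform(train)   # training values land in [0, 1]
test_norm = minmax.transform(test)     # test values may fall outside [0, 1]
print(train_norm.ravel(), test_norm.ravel())

# Standardization: zero mean and unit variance on the training data, no fixed range
standard = StandardScaler().fit(train)
train_std = standard.transform(train)
print(train_std.mean(), train_std.std())
```

The test value 20.0 maps to 2.0 under the fitted `MinMaxScaler`, which shows that the 0 to 1 range is only guaranteed for the data the scaler was fitted on.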

In [43]:
from sklearn.preprocessing import RobustScaler

scale_attrs = X_train[["src_bytes", "dst_bytes"]]
robust_scaler = RobustScaler()
X_train_scaled = robust_scaler.fit_transform(scale_attrs)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=["src_bytes", "dst_bytes"])
X_train_scaled

Unnamed: 0,src_bytes,dst_bytes
0,1.324818,101.920000
1,-0.160584,0.000000
2,0.948905,1.211429
3,-0.160584,0.000000
4,-0.131387,0.000000
...,...,...
75578,-0.160584,0.000000
75579,0.605839,1.401905
75580,3.083942,0.624762
75581,0.875912,0.845714
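
`RobustScaler`, used above, relies on the median and the interquartile range instead of the minimum/maximum or the mean/variance, which makes it less sensitive to outliers. The sketch below checks this formula by hand on hypothetical values, assuming the default `RobustScaler` settings.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with one extreme outlier
values = np.array([[0.0], [8.0], [44.0], [210.0], [304.0], [53508.0]])

# Manual robust scaling: subtract the median, divide by the interquartile range
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
manual = (values - median) / (q3 - q1)

# RobustScaler with default settings applies the same formula per column
scaled = RobustScaler().fit_transform(values)
print(np.allclose(manual, scaled))
```

The outlier still maps to a large scaled value, but it does not compress the rest of the values into a narrow band the way min-max scaling would.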


In [47]:
# Now print the original values to compare them with the scaled ones
X_train.head(10)

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
113467,0.0,tcp,http,SF,407.0,53508.0,0,0.0,0.0,0.0,...,9.0,255.0,1.0,0.0,0.11,0.03,0.0,0.0,0.0,0.0
31899,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,255.0,4.0,0.02,0.05,0.0,0.0,1.0,1.0,0.0,0.0
108116,0.0,tcp,http,SF,304.0,636.0,0,0.0,0.0,0.0,...,39.0,255.0,1.0,0.0,0.03,0.06,0.0,0.0,0.0,0.0
89913,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,255.0,15.0,0.06,0.07,0.0,0.0,1.0,1.0,0.0,0.0
106319,0.0,icmp,eco_i,SF,8.0,0.0,0,0.0,0.0,0.0,...,2.0,7.0,1.0,0.0,1.0,0.57,0.0,0.0,0.0,0.0
98007,0.0,udp,domain_u,SF,46.0,139.0,0,0.0,0.0,0.0,...,255.0,254.0,1.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0
16447,0.0,tcp,smtp,SF,1790.0,363.0,0,0.0,0.0,0.0,...,141.0,137.0,0.55,0.04,0.01,0.01,0.0,0.0,0.0,0.0
64957,1.0,tcp,smtp,SF,729.0,329.0,0,0.0,0.0,0.0,...,198.0,181.0,0.65,0.03,0.01,0.01,0.02,0.02,0.0,0.0
100052,0.0,tcp,http,SF,206.0,1492.0,0,0.0,0.0,0.0,...,255.0,255.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
28800,0.0,tcp,ftp_data,SF,334.0,0.0,0,0.0,0.0,0.0,...,8.0,28.0,1.0,0.0,1.0,0.11,0.0,0.0,0.0,0.0
