# DataSet Preparation

This notebook shows some of the most commonly used techniques for transforming the DataSet.

## DataSet 
NSL-KDD is a data set suggested to solve some of the inherent problems of the KDD'99 data set, which are mentioned in [1]. Although this new version of the KDD data set still suffers from some of the problems discussed by McHugh [2] and may not be a perfect representative of existing real networks, the lack of public data sets for network-based IDSs means we believe it can still be applied as an effective benchmark data set to help researchers compare different intrusion detection methods. Furthermore, the number of records in the NSL-KDD train and test sets is reasonable. This advantage makes it affordable to run the experiments on the complete set without the need to randomly select a small portion. Consequently, evaluation results of different research works will be consistent and comparable.

### Data Files

* <span style = "color:green">**KDDTrain+.ARFF** - The full NSL-KDD train set with binary labels in ARFF format </span>

* **KDDTrain+.TXT** - The full NSL-KDD train set including attack-type labels and difficulty level in CSV format

* KDDTrain+_20Percent.ARFF - A 20% subset of the KDDTrain+.arff file

* KDDTrain+_20Percent.TXT - A 20% subset of the KDDTrain+.txt file

* KDDTest+.ARFF - The full NSL-KDD test set with binary labels in ARFF format

* KDDTest+.TXT - The full NSL-KDD test set including attack-type labels and difficulty level in CSV format

* KDDTest-21.ARFF - A subset of the KDDTest+.arff file which does not include records with difficulty level of 21 out of 21

* KDDTest-21.TXT - A subset of the KDDTest+.txt file which does not include records with difficulty level of 21 out of 21

## Imports

In [1]:
import arff
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## Helper functions

In [3]:
def load_kdd_dataset(data_path):
    """Read the NSL-KDD data set from an ARFF file into a DataFrame."""
    with open(data_path, 'r') as train_set:
        dataset = arff.load(train_set)
    attributes = [attr[0] for attr in dataset["attributes"]]
    return pd.DataFrame(dataset["data"], columns=attributes)

In [4]:
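def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
    """Split df into train (60%), validation (20%) and test (20%) sets."""
    # Stratify on the given column, if any, to preserve its class proportions.
    strat = df[stratify] if stratify else None
    # First split: 60% train, 40% held out.
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    strat = test_set[stratify] if stratify else None
    # Second split: halve the held-out 40% into validation and test.
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)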
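The two-stage split used by `train_val_test_split` can be checked on a small example. The sketch below is only an illustration: the `toy` DataFrame and its values are made up, and it simply repeats the same two `train_test_split` calls to show the resulting 60/20/20 sizes and that stratification keeps the `protocol_type` proportions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (hypothetical data): 100 rows, 70% "tcp", 20% "udp", 10% "icmp".
toy = pd.DataFrame({
    "protocol_type": ["tcp"] * 70 + ["udp"] * 20 + ["icmp"] * 10,
    "value": range(100),
})

# Stage 1: hold out 40% of the rows, stratified by protocol.
train, rest = train_test_split(
    toy, test_size=0.4, random_state=42, stratify=toy["protocol_type"])

# Stage 2: split the held-out 40% in half (20% validation, 20% test).
val, test = train_test_split(
    rest, test_size=0.5, random_state=42, stratify=rest["protocol_type"])

sizes = (len(train), len(val), len(test))
train_share = train["protocol_type"].value_counts(normalize=True)

print(sizes)
print(train_share)
```

Comparing `train_share` with the proportions in `toy` is a quick way to confirm that the stratified split did not distort the class balance.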

# 1.- Reading the DataSet

In [5]:
df = load_kdd_dataset("datasets/datasets/NSL-KDD/KDDTrain+.arff")

In [6]:
df

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,class
0,0.0,tcp,ftp_data,SF,491.0,0.0,0,0.0,0.0,0.0,...,25.0,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal
1,0.0,udp,other,SF,146.0,0.0,0,0.0,0.0,0.0,...,1.0,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00,normal
2,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,26.0,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00,anomaly
3,0.0,tcp,http,SF,232.0,8153.0,0,0.0,0.0,0.0,...,255.0,1.00,0.00,0.03,0.04,0.03,0.01,0.00,0.01,normal
4,0.0,tcp,http,SF,199.0,420.0,0,0.0,0.0,0.0,...,255.0,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125968,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,25.0,0.10,0.06,0.00,0.00,1.00,1.00,0.00,0.00,anomaly
125969,8.0,udp,private,SF,105.0,145.0,0,0.0,0.0,0.0,...,244.0,0.96,0.01,0.01,0.00,0.00,0.00,0.00,0.00,normal
125970,0.0,tcp,smtp,SF,2231.0,384.0,0,0.0,0.0,0.0,...,30.0,0.12,0.06,0.00,0.00,0.72,0.00,0.01,0.00,normal
125971,0.0,tcp,klogin,S0,0.0,0.0,0,0.0,0.0,0.0,...,8.0,0.03,0.05,0.00,0.00,1.00,1.00,0.00,0.00,anomaly


# 2.- Splitting the DataSet

In [9]:
train_set, val_set, test_set = train_val_test_split(df, stratify = 'protocol_type')

In [10]:
print('Training set length:', len(train_set))
print('Validation set length:', len(val_set))
print('Test set length:', len(test_set))

Training set length: 75583
Validation set length: 25195
Test set length: 25195


# 3.- Cleaning the Data
Before starting, we need to go back to the clean DataSet and separate the labels from the rest of the data, because we do not necessarily want to apply the same transformations to both sets.

In [11]:
# Separate the input features from the output feature (the label).
X_train = train_set.drop("class", axis=1)
y_train = train_set['class'].copy()

In [12]:
# To illustrate this section, we need to add some null values to some features of the DataSet
X_train.loc[(X_train["src_bytes"] > 400) & (X_train["src_bytes"] < 800), "src_bytes"] = np.nan
X_train.loc[(X_train["src_bytes"] > 500) & (X_train["src_bytes"] < 2000), "src_bytes"] = np.nan
X_train

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
113467,0.0,tcp,http,SF,,53508.0,0,0.0,0.0,0.0,...,9.0,255.0,1.00,0.00,0.11,0.03,0.00,0.00,0.0,0.0
31899,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,255.0,4.0,0.02,0.05,0.00,0.00,1.00,1.00,0.0,0.0
108116,0.0,tcp,http,SF,304.0,636.0,0,0.0,0.0,0.0,...,39.0,255.0,1.00,0.00,0.03,0.06,0.00,0.00,0.0,0.0
89913,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,255.0,15.0,0.06,0.07,0.00,0.00,1.00,1.00,0.0,0.0
106319,0.0,icmp,eco_i,SF,8.0,0.0,0,0.0,0.0,0.0,...,2.0,7.0,1.00,0.00,1.00,0.57,0.00,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64559,0.0,tcp,systat,S0,0.0,0.0,0,0.0,0.0,0.0,...,255.0,20.0,0.08,0.06,0.00,0.00,1.00,1.00,0.0,0.0
67272,0.0,tcp,http,SF,210.0,736.0,0,0.0,0.0,0.0,...,119.0,255.0,1.00,0.00,0.01,0.02,0.02,0.01,0.0,0.0
32452,3.0,tcp,smtp,SF,,328.0,0,0.0,0.0,0.0,...,111.0,155.0,0.64,0.04,0.01,0.01,0.01,0.00,0.0,0.0
112657,0.0,tcp,http,SF,284.0,444.0,0,0.0,0.0,0.0,...,255.0,255.0,1.00,0.00,0.00,0.00,0.00,0.00,0.0,0.0


Most ML algorithms cannot work with features that contain null values, so there are three options for dealing with them:
* Remove the rows that contain null values
* Remove the corresponding attribute (column)
* Fill them in with a given value (zero, mean, median, ...).
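The three options listed above can be sketched on a small example before applying them to the real data. This is a minimal illustration, not the notebook's own code: the `toy` DataFrame and its values are hypothetical, and the median is just one of the possible fill values.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with two missing entries in "src_bytes".
toy = pd.DataFrame({
    "src_bytes": [10.0, np.nan, 30.0, np.nan, 50.0],
    "dst_bytes": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Option 1: drop every row that has a null in "src_bytes".
rows_dropped = toy.dropna(subset=["src_bytes"])

# Option 2: drop the whole attribute (column).
column_dropped = toy.drop("src_bytes", axis=1)

# Option 3: fill the nulls with a fixed statistic, here the median.
median = toy["src_bytes"].median()
filled = toy.fillna({"src_bytes": median})

print(rows_dropped.shape, column_dropped.shape, median)
```

None of these calls modifies `toy` in place, so the three results can be compared side by side. Option 1 loses rows, option 2 loses a feature, and option 3 keeps the shape at the cost of introducing artificial values.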

In [13]:
# Check whether any attribute contains null values
X_train.isna().any()

duration                       False
protocol_type                  False
service                        False
flag                           False
src_bytes                       True
dst_bytes                      False
land                           False
wrong_fragment                 False
urgent                         False
hot                            False
num_failed_logins              False
logged_in                      False
num_compromised                False
root_shell                     False
su_attempted                   False
num_root                       False
num_file_creations             False
num_shells                     False
num_access_files               False
num_outbound_cmds              False
is_host_login                  False
is_guest_login                 False
count                          False
srv_count                      False
serror_rate                    False
srv_serror_rate                False
rerror_rate                    False
...
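Besides asking whether a column contains any null, it is often useful to know how many there are. The sketch below shows one possible way to do this with `isna().sum()` and `isna().mean()`. The `toy` DataFrame is hypothetical and only stands in for `X_train`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for X_train.
toy = pd.DataFrame({
    "src_bytes": [10.0, np.nan, 30.0, np.nan],
    "dst_bytes": [1.0, 2.0, 3.0, 4.0],
})

null_counts = toy.isna().sum()   # number of nulls per column
null_share = toy.isna().mean()   # fraction of nulls per column

# Show only the columns that actually contain nulls.
print(null_counts[null_counts > 0])
print(null_share[null_share > 0])
```

The fraction of nulls helps when choosing among the options: a column that is mostly null is a candidate for removal, while a handful of nulls can usually be dropped or filled.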

In [14]:
# Select the rows that contain null values
filas_valores_nulos = X_train[X_train.isnull().any(axis=1)]
filas_valores_nulos

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
113467,0.0,tcp,http,SF,,53508.0,0,0.0,0.0,0.0,...,9.0,255.0,1.00,0.00,0.11,0.03,0.00,0.00,0.00,0.00
16447,0.0,tcp,smtp,SF,,363.0,0,0.0,0.0,0.0,...,141.0,137.0,0.55,0.04,0.01,0.01,0.00,0.00,0.00,0.00
64957,1.0,tcp,smtp,SF,,329.0,0,0.0,0.0,0.0,...,198.0,181.0,0.65,0.03,0.01,0.01,0.02,0.02,0.00,0.00
53498,0.0,tcp,smtp,SF,,330.0,0,0.0,0.0,0.0,...,255.0,108.0,0.42,0.02,0.00,0.00,0.00,0.01,0.00,0.00
30757,0.0,tcp,ftp_data,SF,,0.0,0,0.0,0.0,0.0,...,188.0,66.0,0.35,0.04,0.35,0.03,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124208,0.0,udp,other,SF,,4.0,0,0.0,0.0,0.0,...,255.0,255.0,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00
96968,0.0,tcp,smtp,SF,,337.0,0,0.0,0.0,0.0,...,113.0,205.0,0.96,0.02,0.01,0.01,0.00,0.00,0.00,0.00
81402,0.0,tcp,ftp_data,SF,,0.0,0,0.0,0.0,0.0,...,137.0,49.0,0.16,0.52,0.15,0.04,0.00,0.00,0.48,0.02
42166,0.0,tcp,smtp,SF,,328.0,0,0.0,0.0,0.0,...,219.0,139.0,0.57,0.02,0.00,0.01,0.00,0.00,0.03,0.04


### Option 1: Remove the rows with null values