# Creating Custom Transformers and Pipelines

This notebook shows how to create custom transformers and pipelines.

## Dataset

## Description
**NSL-KDD** is a data set suggested to solve some of the inherent problems of the KDD'99 data set, which are mentioned in [1]. This new version of the **KDD** data set still suffers from some of the problems discussed by McHugh [2] and may not be a perfect representative of existing real networks. However, because of the lack of public data sets for network-based IDSs, we believe it can still be applied as an effective benchmark data set to help researchers compare different intrusion detection methods. Furthermore, the number of records in the **NSL-KDD** train and test sets is reasonable. This advantage makes it affordable to run the experiments on the complete set without the need to randomly select a small portion. Consequently, evaluation results of different research work will be consistent and comparable.

### Data Files
- <span style="color:green">**KDDTrain+.ARFF** - The full NSL-KDD train set with binary labels in ARFF format</span>
- KDDTrain+.TXT - The full NSL-KDD train set including attack-type labels and difficulty level in CSV format
- KDDTrain+_20Percent.ARFF - A 20% subset of the KDDTrain+.arff file
- KDDTrain+_20Percent.TXT - A 20% subset of the KDDTrain+.txt file
- KDDTest+.ARFF - The full NSL-KDD test set with binary labels in ARFF format
- KDDTest+.TXT - The full NSL-KDD test set including attack-type labels and difficulty level in CSV format
- KDDTest-21.ARFF - A subset of the KDDTest+.arff file which does not include records with difficulty level of 21 out of 21
- KDDTest-21.TXT - A subset of the KDDTest+.txt file which does not include records with difficulty level of 21 out of 21

## Imports

In [5]:
import arff
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

## Helper Functions

In [8]:
def load_kdd_dataset(data_path):
    """Lectura del conjunto de datos NSL-KDD."""
    with open(data_path, 'r') as train_set:
        dataset = arff.load(train_set)
    attributes = [attr[0] for attr in dataset["attributes"]]
    return pd.DataFrame(dataset["data"], columns=attributes)

In [10]:
def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
    """Split df into train (60%), validation (20%) and test (20%) sets."""
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)

# 1.- Reading the Data

In [14]:
df = load_kdd_dataset('datasets/datasets/NSL-KDD/KDDTrain+.arff')
df

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,class
0,0.0,tcp,ftp_data,SF,491.0,0.0,0,0.0,0.0,0.0,...,25.0,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal
1,0.0,udp,other,SF,146.0,0.0,0,0.0,0.0,0.0,...,1.0,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00,normal
2,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,26.0,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00,anomaly
3,0.0,tcp,http,SF,232.0,8153.0,0,0.0,0.0,0.0,...,255.0,1.00,0.00,0.03,0.04,0.03,0.01,0.00,0.01,normal
4,0.0,tcp,http,SF,199.0,420.0,0,0.0,0.0,0.0,...,255.0,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125968,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,25.0,0.10,0.06,0.00,0.00,1.00,1.00,0.00,0.00,anomaly
125969,8.0,udp,private,SF,105.0,145.0,0,0.0,0.0,0.0,...,244.0,0.96,0.01,0.01,0.00,0.00,0.00,0.00,0.00,normal
125970,0.0,tcp,smtp,SF,2231.0,384.0,0,0.0,0.0,0.0,...,30.0,0.12,0.06,0.00,0.00,0.72,0.00,0.01,0.00,normal
125971,0.0,tcp,klogin,S0,0.0,0.0,0,0.0,0.0,0.0,...,8.0,0.03,0.05,0.00,0.00,1.00,1.00,0.00,0.00,anomaly


# 2.- Splitting the Dataset

In [16]:
train_set, val_set, test_set = train_val_test_split(df, stratify = 'protocol_type')

In [18]:
print('Training set length:', len(train_set))
print('Validation set length:', len(val_set))
print('Test set length:', len(test_set))

Training set length: 75583
Validation set length: 25195
Test set length: 25195
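
These sizes follow from the two chained splits: 40% of the 125,973 rows is held out first, and that held-out part is then halved, which gives roughly a 60/20/20 partition. The sketch below reproduces the same logic on a synthetic DataFrame. The `toy` frame and its `feature` column are made up for illustration and are not part of NSL-KDD. It also checks that stratifying on `protocol_type` keeps the class mix of the original frame.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: 1000 rows with a 70/20/10 protocol mix.
toy = pd.DataFrame({
    "feature": range(1000),
    "protocol_type": ["tcp"] * 700 + ["udp"] * 200 + ["icmp"] * 100,
})

# First split: 60% train, 40% held out, stratified on protocol_type.
toy_train, toy_rest = train_test_split(
    toy, test_size=0.4, random_state=42, stratify=toy["protocol_type"])

# Second split: the held-out 40% is halved into validation and test.
toy_val, toy_test = train_test_split(
    toy_rest, test_size=0.5, random_state=42, stratify=toy_rest["protocol_type"])

sizes = (len(toy_train), len(toy_val), len(toy_test))
tcp_share = (toy_train["protocol_type"] == "tcp").mean()
print(sizes)
print(tcp_share)
```

Stratifying only matters when the column's distribution should be preserved across the three sets; passing `stratify=None` falls back to a plain random split.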


## The Sklearn API

Before continuing, here is a brief overview of how the Sklearn APIs work:
* **Estimators**: Any object that can estimate some parameter from data:
    * The estimation itself is performed by the fit() method, which always takes a dataset as an argument (plus the labels, for supervised algorithms).
    * Any other parameter needed to guide the estimation is a hyperparameter; it is set on the instance, usually through the constructor.
* **Transformers**: Estimators that can also transform a dataset (an imputer, for example).
    * The transformation is performed by the transform() method.
    * It receives a dataset as its input parameter and returns the transformed dataset.
* **Predictors**: Estimators that can make predictions.
    * The prediction is performed by the predict() method.
    * It receives a dataset as input.
    * It returns the corresponding predictions.
    * It has a score() method that measures the quality of the predictions on a given dataset.
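
As a rough illustration of these three roles, the sketch below uses two stock scikit-learn classes, `SimpleImputer` as a transformer and `LogisticRegression` as a predictor, on a tiny made-up array. The data and the choice of classes are assumptions for the example only; the notebook itself does not use them at this point.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Toy data for illustration only: one feature with a missing value, binary labels.
X = np.array([[1.0], [2.0], [3.0], [np.nan], [9.0], [10.0]])
y = np.array([0, 0, 0, 0, 1, 1])

# Transformer: "strategy" is a hyperparameter set in the constructor.
# fit() learns the per-column median; transform() fills the gaps with it.
imputer = SimpleImputer(strategy="median")
imputer.fit(X)
X_filled = imputer.transform(X)
print(imputer.statistics_)  # learned parameters carry a trailing underscore

# Predictor: fit() takes data and labels, predict() returns predictions,
# score() returns the mean accuracy on the data passed to it.
clf = LogisticRegression()
clf.fit(X_filled, y)
predictions = clf.predict(X_filled)
accuracy = clf.score(X_filled, y)
print(predictions, accuracy)
```

Transformers also expose `fit_transform()`, which is equivalent to calling `fit()` followed by `transform()` on the same data.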

# 1.- Building Custom Transformers

Writing your own transformers keeps the code cleaner and better structured when preparing data for ML algorithms. It also makes the code easier to reuse in other projects.

Before starting, let's recover the clean dataset and separate the labels from the rest of the data, since the same transformations do not necessarily apply to both.

In [27]:
X_train = train_set.drop("class", axis = 1)
y_train = train_set["class"].copy()

In [34]:
# To illustrate this section, some null values are added to the dataset.
X_train.loc[(X_train["src_bytes"]>400) & (X_train["src_bytes"]<800), "src_bytes"] = np.nan
# Note: this mask is computed on dst_bytes, but the NaN is still written to src_bytes.
X_train.loc[(X_train["dst_bytes"]>500) & (X_train["dst_bytes"]<2000), "src_bytes"] = np.nan
X_train

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
113467,0.0,tcp,http,SF,,53508.0,0,0.0,0.0,0.0,...,9.0,255.0,1.00,0.00,0.11,0.03,0.00,0.00,0.0,0.0
31899,0.0,tcp,private,S0,,0.0,0,0.0,0.0,0.0,...,255.0,4.0,0.02,0.05,0.00,0.00,1.00,1.00,0.0,0.0
108116,0.0,tcp,http,SF,,636.0,0,0.0,0.0,0.0,...,39.0,255.0,1.00,0.00,0.03,0.06,0.00,0.00,0.0,0.0
89913,0.0,tcp,private,S0,,0.0,0,0.0,0.0,0.0,...,255.0,15.0,0.06,0.07,0.00,0.00,1.00,1.00,0.0,0.0
106319,0.0,icmp,eco_i,SF,,0.0,0,0.0,0.0,0.0,...,2.0,7.0,1.00,0.00,1.00,0.57,0.00,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64559,0.0,tcp,systat,S0,,0.0,0,0.0,0.0,0.0,...,255.0,20.0,0.08,0.06,0.00,0.00,1.00,1.00,0.0,0.0
67272,0.0,tcp,http,SF,,736.0,0,0.0,0.0,0.0,...,119.0,255.0,1.00,0.00,0.01,0.02,0.02,0.01,0.0,0.0
32452,3.0,tcp,smtp,SF,,328.0,0,0.0,0.0,0.0,...,111.0,155.0,0.64,0.04,0.01,0.01,0.01,0.00,0.0,0.0
112657,0.0,tcp,http,SF,,444.0,0,0.0,0.0,0.0,...,255.0,255.0,1.00,0.00,0.00,0.00,0.00,0.00,0.0,0.0
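
A quick way to confirm where the nulls ended up is to count them per column. The sketch below applies the same `.loc` pattern to a small made-up frame (the values are assumptions chosen so that each mask matches one row) and then calls `isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame({
    "src_bytes": [100.0, 500.0, 900.0, 50.0],
    "dst_bytes": [0.0, 0.0, 1000.0, 3000.0],
})

# The boolean mask selects the rows; the second argument of .loc
# selects the column that receives NaN.
toy.loc[(toy["src_bytes"] > 400) & (toy["src_bytes"] < 800), "src_bytes"] = np.nan
toy.loc[(toy["dst_bytes"] > 500) & (toy["dst_bytes"] < 2000), "src_bytes"] = np.nan

# Number of null values in each column.
null_counts = toy.isna().sum()
print(null_counts)
```

Running `X_train.isna().sum()` on the real training set gives the same kind of per-column summary.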


### Transformers for Numeric Attributes

In [38]:
# Transformer that removes the rows containing null values
from sklearn.base import BaseEstimator, TransformerMixin

class DeleteNanRows(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y = None):
        # Nothing to learn from the data; return self so calls can be chained
        return self
    def transform(self, X, y = None):
        # Return a copy of X without the rows that contain at least one NaN
        return X.dropna()

In [42]:
delete_nan = DeleteNanRows()
X_train_prep = delete_nan.fit_transform(X_train)
X_train_prep

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
63761,27.0,tcp,ftp,SF,1507.0,4152.0,0,0.0,0.0,30.0,...,144.0,57.0,0.40,0.04,0.01,0.0,0.00,0.00,0.00,0.00
93952,0.0,tcp,http,SF,54540.0,8314.0,0,0.0,0.0,2.0,...,255.0,255.0,1.00,0.00,0.00,0.0,0.00,0.00,0.05,0.05
47322,28.0,tcp,ftp,SF,1486.0,4152.0,0,0.0,0.0,30.0,...,182.0,49.0,0.27,0.08,0.01,0.0,0.02,0.04,0.00,0.00
119596,0.0,tcp,http,SF,54540.0,8314.0,0,0.0,0.0,2.0,...,123.0,123.0,1.00,0.00,0.01,0.0,0.00,0.00,0.05,0.05
40858,18711.0,tcp,IRC,RSTR,2974.0,10399.0,0,0.0,0.0,0.0,...,14.0,6.0,0.43,0.14,0.07,0.0,0.00,0.00,0.36,0.83
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2209,0.0,tcp,http,SF,54540.0,8314.0,0,0.0,0.0,2.0,...,80.0,80.0,1.00,0.00,0.01,0.0,0.01,0.01,0.01,0.01
38473,11271.0,tcp,telnet,SF,2429.0,20456.0,0,0.0,0.0,0.0,...,255.0,93.0,0.36,0.02,0.00,0.0,0.60,0.77,0.00,0.01
73872,0.0,tcp,http,SF,54540.0,8314.0,0,0.0,0.0,2.0,...,73.0,73.0,1.00,0.00,0.01,0.0,0.00,0.00,0.01,0.01
122816,30.0,tcp,ftp,SF,1431.0,4152.0,0,0.0,0.0,30.0,...,137.0,39.0,0.28,0.03,0.01,0.0,0.01,0.03,0.00,0.00
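
Because this transformer removes rows, the labels separated earlier no longer line up with the transformed features unless they are filtered by the surviving index. The sketch below shows one possible way to handle this on made-up data. `DropNanRows` and its `subset` parameter are hypothetical names introduced here to illustrate a constructor hyperparameter; they are not part of the notebook or of scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DropNanRows(BaseEstimator, TransformerMixin):
    """Hypothetical variant of DeleteNanRows with a `subset` hyperparameter."""

    def __init__(self, subset=None):
        # Columns to check for NaN; None means every column.
        self.subset = subset

    def fit(self, X, y=None):
        # Nothing to learn; return self so fit_transform() works.
        return self

    def transform(self, X):
        return X.dropna(subset=self.subset)

# Toy data for illustration only.
X_toy = pd.DataFrame(
    {"src_bytes": [10.0, np.nan, 30.0], "dst_bytes": [1.0, 2.0, np.nan]},
    index=[7, 8, 9])
y_toy = pd.Series(["normal", "anomaly", "normal"], index=[7, 8, 9])

# Only rows with a null in src_bytes are dropped.
X_toy_prep = DropNanRows(subset=["src_bytes"]).fit_transform(X_toy)

# Keep only the labels whose index survived the row deletion.
y_toy_prep = y_toy.loc[X_toy_prep.index]
print(list(X_toy_prep.index), list(y_toy_prep))
```

The same idea applied to the notebook's variables would be `y_train.loc[X_train_prep.index]`.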
