# Evaluating results
This notebook shows techniques for evaluating the predictions made by a machine learning algorithm.

## Dataset
#### Description

NSL-KDD is a data set suggested to solve some of the inherent problems of the KDD'99 data set, which are mentioned in [1]. This new version of the KDD data set still suffers from some of the problems discussed by McHugh and may not be a perfect representative of existing real networks. However, because of the lack of public data sets for network-based IDSs, we believe it can still be applied as an effective benchmark data set to help researchers compare different intrusion detection methods.

Furthermore, the number of records in the NSL-KDD train and test sets is reasonable. This advantage makes it affordable to run the experiments on the complete set without the need to randomly select a small portion. Consequently, evaluation results of different research works will be consistent and comparable.

### Data files

* <span style ="color:green" >**KDDTrain+.ARFF:** The full NSL-KDD train set with binary labels in ARFF format.</span>
* KDDTrain+.TXT: The full NSL-KDD train set including attack-type labels and difficulty level in CSV format.
* KDDTrain+_20Percent.ARFF: A 20% subset of the KDDTrain+.arff file.
* KDDTrain+_20Percent.TXT: A 20% subset of the KDDTrain+.txt file.
* KDDTest+.ARFF: The full NSL-KDD test set with binary labels in ARFF format.
* KDDTest+.TXT: The full NSL-KDD test set including attack-type labels and difficulty level in CSV format.
* KDDTest-21.ARFF: A subset of the KDDTest+.arff file which does not include records with difficulty level of 21 out of 21.
* KDDTest-21.TXT: A subset of the KDDTest+.txt file which does not include records with difficulty level of 21 out of 21.

### Downloading the data files
https://www.unb.ca/cic/datasets/index.html

### Additional references on the dataset
_[1] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A Detailed Analysis of the KDD CUP 99 Data Set,” Submitted to Second IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2009._

## Imports

In [5]:
import arff
import pandas as pd
import numpy as np 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin

## Helper functions

In [6]:
def load_kdd_dataset(data_path):
    """Read the NSL-KDD dataset from an ARFF file into a DataFrame."""
    with open(data_path, 'r') as train_set:
        dataset = arff.load(train_set)
    # Each ARFF attribute is a (name, type) pair; keep only the names
    attributes = [attr[0] for attr in dataset["attributes"]]
    return pd.DataFrame(dataset["data"], columns=attributes)

In [7]:
def train_val_split(df, rstate=42, shuffle=True, stratify=None):
    """Split df into 60% training, 20% validation and 20% test subsets."""
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    # Split the held-out 40% in half: validation and test
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)
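
As a quick check of the proportions produced by the two-step split above, here is a minimal sketch on a toy DataFrame. The data and the column names `x` and `label` are made up for illustration and are not part of NSL-KDD.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the NSL-KDD DataFrame
toy = pd.DataFrame({
    "x": range(100),
    "label": ["normal"] * 60 + ["anomaly"] * 40,
})

# First cut: 60% training, 40% held out
train, rest = train_test_split(
    toy, test_size=0.4, random_state=42, stratify=toy["label"])
# Second cut: the held-out part is halved into validation and test
val, test = train_test_split(
    rest, test_size=0.5, random_state=42, stratify=rest["label"])

sizes = (len(train), len(val), len(test))
normal_share = (train["label"] == "normal").mean()
print(sizes, normal_share)
```

Passing `stratify` keeps the class proportions of the full frame roughly the same in every subset, which matters when one class is much rarer than the other.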

In [8]:
# Build a pipeline for the numeric attributes:
# fill missing values with the median, then scale with RobustScaler
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('rbst_scaler', RobustScaler()),
])
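
To see what the two steps of the numeric pipeline do, the following sketch runs the same kind of pipeline on a tiny made-up column with one missing value and one outlier. The values are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Hypothetical single-feature column: one NaN and one large outlier
X = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # NaN -> median of 1, 2, 4, 100 = 3
    ("rbst_scaler", RobustScaler()),                # (x - median) / IQR
])
X_scaled = pipe.fit_transform(X)
print(X_scaled.ravel())
```

After imputation the column is 1, 2, 3, 4, 100, with median 3 and interquartile range 2, so the outlier stays large while the remaining values land in a narrow band around zero.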

In [10]:
# Transformer that one-hot encodes only the categorical columns and returns a DataFrame
class CustomOneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        # `sparse_output` was named `sparse` before scikit-learn 1.2
        self._oh = OneHotEncoder(sparse_output=False)
        self._columns = None

    def fit(self, X, y=None):
        X_cat = X.select_dtypes(include=['object'])
        # Column names for the encoded output, e.g. protocol_type_tcp
        self._columns = pd.get_dummies(X_cat).columns
        self._oh.fit(X_cat)
        return self

    def transform(self, X, y=None):
        X_copy = X.copy()
        X_cat = X_copy.select_dtypes(include=['object'])
        X_cat_oh = self._oh.transform(X_cat)
        X_cat_oh = pd.DataFrame(X_cat_oh,
                                columns=self._columns,
                                index=X_copy.index)
        # Replace the original categorical columns with the encoded ones
        X_copy.drop(list(X_cat), axis=1, inplace=True)
        return X_copy.join(X_cat_oh)

In [11]:
# Transformer that prepares the whole dataset by calling the pipelines and custom transformers
class DataFramePreparer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._full_pipeline = None
        self._columns = None

    def fit(self, X, y=None):
        num_attribs = list(X.select_dtypes(exclude=['object']))
        cat_attribs = list(X.select_dtypes(include=['object']))
        # Numeric columns go through num_pipeline, categorical ones are one-hot encoded
        self._full_pipeline = ColumnTransformer([
            ("num", num_pipeline, num_attribs),
            ("cat", CustomOneHotEncoder(), cat_attribs),
        ])
        self._full_pipeline.fit(X)
        self._columns = pd.get_dummies(X).columns
        return self

    def transform(self, X, y=None):
        X_copy = X.copy()
        X_prep = self._full_pipeline.transform(X_copy)
        return pd.DataFrame(X_prep,
                            columns=self._columns,
                            index=X_copy.index)

## Reading the dataset

In [12]:
df = load_kdd_dataset("datasets/NSL-KDD/KDDTrain+.arff")

In [13]:
df.head(10)

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,class
0,0.0,tcp,ftp_data,SF,491.0,0.0,0,0.0,0.0,0.0,...,25.0,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0,normal
1,0.0,udp,other,SF,146.0,0.0,0,0.0,0.0,0.0,...,1.0,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,normal
2,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,26.0,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,anomaly
3,0.0,tcp,http,SF,232.0,8153.0,0,0.0,0.0,0.0,...,255.0,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,normal
4,0.0,tcp,http,SF,199.0,420.0,0,0.0,0.0,0.0,...,255.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,normal
5,0.0,tcp,private,REJ,0.0,0.0,0,0.0,0.0,0.0,...,19.0,0.07,0.07,0.0,0.0,0.0,0.0,1.0,1.0,anomaly
6,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,9.0,0.04,0.05,0.0,0.0,1.0,1.0,0.0,0.0,anomaly
7,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,15.0,0.06,0.07,0.0,0.0,1.0,1.0,0.0,0.0,anomaly
8,0.0,tcp,remote_job,S0,0.0,0.0,0,0.0,0.0,0.0,...,23.0,0.09,0.05,0.0,0.0,1.0,1.0,0.0,0.0,anomaly
9,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,13.0,0.05,0.06,0.0,0.0,1.0,1.0,0.0,0.0,anomaly


## Splitting the dataset

In [14]:
# Split the dataset into training, validation and test subsets
train_set, val_set, test_set = train_val_split(df)

In [15]:
print("Length of the training set: ", len(train_set))
print("Length of the validation set: ", len(val_set))
print("Length of the test set: ", len(test_set))

Length of the training set:  75583
Length of the validation set:  25195
Length of the test set:  25195


For each of the subsets, the labels are separated from the input features.

In [16]:
# Full dataset
X_df = df.drop("class", axis = 1)
y_df = df['class'].copy()

In [17]:
# Training subset
X_train = train_set.drop("class", axis = 1)
y_train = train_set["class"].copy()

In [18]:
# Validation subset
X_val = val_set.drop("class", axis = 1)
y_val = val_set["class"].copy()

In [19]:
X_test = test_set.drop("class", axis = 1)
y_test = test_set["class"].copy()

## Preparing the dataset

In [20]:
# Instantiate our custom transformer.
data_preparer = DataFramePreparer()

In [21]:
# Fit on the full dataset so that the transformer sees every possible categorical value.
data_preparer.fit(X_df)
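
The reason for fitting on the full dataset can be illustrated with a small sketch: an encoder fitted on a subset fails when it later meets a category it has never seen. The two toy frames are made up, and `handle_unknown="ignore"` is shown only as one possible alternative, not as what this notebook does.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical subsets: "uucp" appears only in the validation data
train = pd.DataFrame({"service": pd.Series(["http", "smtp"], dtype="object")})
val = pd.DataFrame({"service": pd.Series(["http", "uucp"], dtype="object")})

# Default behaviour: unknown categories raise an error at transform time
strict = OneHotEncoder().fit(train)
try:
    strict.transform(val)
    unseen_raises = False
except ValueError:
    unseen_raises = True

# Alternative: encode unknown categories as an all-zero row
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
val_encoded = lenient.transform(val).toarray()
print(unseen_raises, val_encoded)
```

Fitting on every row avoids the error, at the cost of letting the scaler and encoder see validation and test data before evaluation.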




In [None]:
# Transform the training subset.
X_train_prep = data_preparer.transform(X_train)

In [22]:
X_train.head(10)

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
98320,0.0,icmp,ecr_i,SF,1032.0,0.0,0,0.0,0.0,0.0,...,210.0,65.0,0.31,0.01,0.31,0.0,0.0,0.0,0.0,0.0
8590,0.0,tcp,smtp,SF,1762.0,331.0,0,0.0,0.0,0.0,...,30.0,122.0,0.73,0.07,0.03,0.02,0.0,0.0,0.0,0.0
91385,0.0,icmp,eco_i,SF,8.0,0.0,0,0.0,0.0,0.0,...,2.0,126.0,1.0,0.0,1.0,0.25,0.0,0.0,0.0,0.0
54349,0.0,tcp,csnet_ns,S0,0.0,0.0,0,0.0,0.0,0.0,...,255.0,18.0,0.07,0.07,0.0,0.0,1.0,1.0,0.0,0.0
69568,0.0,tcp,smtp,SF,1518.0,342.0,0,0.0,0.0,0.0,...,83.0,125.0,0.66,0.05,0.01,0.02,0.0,0.0,0.0,0.0
65413,0.0,tcp,http,SF,342.0,1148.0,0,0.0,0.0,0.0,...,255.0,255.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
106434,0.0,tcp,http,SF,247.0,11193.0,0,0.0,0.0,0.0,...,11.0,255.0,1.0,0.0,0.09,0.01,0.0,0.0,0.0,0.0
16874,0.0,tcp,http,SF,314.0,255.0,0,0.0,0.0,0.0,...,25.0,255.0,1.0,0.0,0.04,0.04,0.0,0.0,0.0,0.0
106132,0.0,udp,domain_u,SF,45.0,131.0,0,0.0,0.0,0.0,...,255.0,226.0,0.89,0.02,0.0,0.0,0.0,0.0,0.0,0.0
6183,0.0,tcp,uucp,S0,0.0,0.0,0,0.0,0.0,0.0,...,255.0,15.0,0.06,0.07,0.0,0.0,1.0,1.0,0.0,0.0


In [23]:
X_train_prep.head(5)


In [None]:
# Transform the validation subset.
X_val_prep = data_preparer.transform(X_val)

In [None]:
X_val_prep.head(10)

## Training a logistic regression algorithm

A machine learning algorithm is instantiated in scikit-learn through the methods exposed by the scikit-learn API, as shown in the previous notebooks.
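
The instantiate, fit and predict pattern mentioned above can be sketched on synthetic data. The dataset produced by `make_classification` and the names `clf_demo` and `val_accuracy` are illustrative and unrelated to NSL-KDD.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

clf_demo = LogisticRegression(max_iter=5000)  # 1. instantiate with hyperparameters
clf_demo.fit(X_tr, y_tr)                      # 2. learn the coefficients
y_hat = clf_demo.predict(X_va)                # 3. predict labels for unseen rows
val_accuracy = clf_demo.score(X_va, y_va)     # mean accuracy on the validation rows
print(val_accuracy)
```

Every scikit-learn classifier follows the same three calls, so swapping in another algorithm only changes the first line.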

In [16]:
# Train an algorithm based on logistic regression.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter = 5000)
clf.fit(X_train_prep, y_train)

## Predicting new examples

Make a prediction on the validation subset with the model produced by training the logistic regression algorithm above.

In [None]:
y_pred = clf.predict(X_val_prep)

## 1.- Confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_val, y_pred)
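
To make the layout of the matrix concrete, here is a minimal sketch with six made-up labels. The `labels` argument fixes the row and column order, and treating `anomaly` as the positive class is an assumption made for this example.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = ["normal", "anomaly", "normal", "anomaly", "anomaly", "normal"]
y_hat = ["normal", "anomaly", "anomaly", "anomaly", "normal", "normal"]

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true, y_hat, labels=["normal", "anomaly"])

# With "anomaly" as the positive class, the flattened matrix reads tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)
```

Here one normal connection is flagged as an anomaly (a false positive) and one anomaly is missed (a false negative), while the diagonal holds the four correct predictions.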

In [None]:
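# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(clf, X_val_prep, y_val, values_format = '3g')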