# Evaluation of results

This notebook shows how to evaluate the results of a prediction made with a Machine Learning algorithm.

## Dataset

### Description
NSL-KDD is a data set suggested to solve some of the inherent problems of the KDD'99 data set which are mentioned in. Although this new version of the KDD data set still suffers from some of the problems discussed by McHugh and may not be a perfect representative of existing real networks, because of the lack of public data sets for network-based IDSs, we believe it can still be applied as an effective benchmark data set to help researchers compare different intrusion detection methods. Furthermore, the number of records in the NSL-KDD train and test sets is reasonable. This advantage makes it affordable to run the experiments on the complete set without the need to randomly select a small portion. Consequently, evaluation results of different research work will be consistent and comparable.

## Data files
* <span style="color:green">**KDDTrain+.ARFF**: The full NSL-KDD train set with binary labels in ARFF format</span>
* KDDTrain+.TXT: The full NSL-KDD train set including attack-type labels and difficulty level in CSV format
* KDDTrain+_20Percent.ARFF: A 20% subset of the KDDTrain+.arff file
* KDDTrain+_20Percent.TXT: A 20% subset of the KDDTrain+.txt file
* KDDTest+.ARFF: The full NSL-KDD test set with binary labels in ARFF format
* KDDTest+.TXT: The full NSL-KDD test set including attack-type labels and difficulty level in CSV format
* KDDTest-21.ARFF: A subset of the KDDTest+.arff file which does not include records with difficulty level of 21 out of 21
* KDDTest-21.TXT: A subset of the KDDTest+.txt file which does not include records with difficulty level of 21 out of 21

## Imports 

In [1]:
import arff
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin

## Helper functions

In [2]:
def load_kdd_dataset(data_path):
    """Read the NSL-KDD dataset from an ARFF file into a DataFrame."""
    with open(data_path, 'r') as train_set:
        dataset = arff.load(train_set)
    attributes = [attr[0] for attr in dataset["attributes"]]
    return pd.DataFrame(dataset["data"], columns = attributes)

In [3]:
# Build a function that performs the full partitioning (60% train, 20% validation, 20% test)
def train_val_test_split(df, rstate = 42, shuffle = True, stratify = None):
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size = 0.4, random_state = rstate, shuffle = shuffle, stratify = strat)
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size = 0.5, random_state = rstate, shuffle = shuffle, stratify = strat)
    return (train_set, val_set, test_set)
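
The two chained calls above first hold out 40% of the rows and then halve that hold-out, which gives roughly a 60/20/20 split. The following is a minimal sketch of that arithmetic using only the standard library. `two_stage_split` is a hypothetical name, it does not call `train_test_split`, and Sklearn may round differently when the row count does not divide evenly.

```python
import random

def two_stage_split(rows, seed = 42):
    """Hypothetical pure-Python mimic of the two-stage 60/20/20 split."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    # Stage 1: hold out 40% of the rows, keep the remaining 60% for training
    n_holdout = round(len(rows) * 0.4)
    holdout, train = rows[:n_holdout], rows[n_holdout:]
    # Stage 2: split the hold-out in half into validation and test
    n_test = round(len(holdout) * 0.5)
    test, val = holdout[:n_test], holdout[n_test:]
    return train, val, test

toy_train, toy_val, toy_test = two_stage_split(range(1000))
toy_sizes = (len(toy_train), len(toy_val), len(toy_test))
print(toy_sizes)  # (600, 200, 200)
```

Every row ends up in exactly one of the three subsets, so the sizes always add up to the original length.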

In [4]:
# Build a pipeline for the numeric attributes
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy = "median")),
    ('rbst_scaler', RobustScaler())
])
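
To make the two steps of this pipeline concrete, here is a minimal sketch of what median imputation followed by robust scaling computes on a single column. It uses only the standard library; `impute_median` and `robust_scale` are hypothetical helpers, and missing values are represented by `None` here rather than `NaN`. The scaling follows the default `RobustScaler` behaviour of centring on the median and dividing by the interquartile range.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the known values."""
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    return [median if v is None else v for v in values]

def robust_scale(values):
    """Centre on the median and divide by the interquartile range (Q3 - Q1)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n = 4, method = "inclusive")
    return [(v - median) / (q3 - q1) for v in values]

toy_raw = [1.0, 2.0, None, 4.0, 100.0]
toy_imputed = impute_median(toy_raw)
toy_scaled = robust_scale(toy_imputed)
print(toy_imputed)  # [1.0, 2.0, 3.0, 4.0, 100.0]
print(toy_scaled)   # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier (100.0) stays large after scaling, but it does not distort the scale of the other values, which is the usual reason for preferring the median and interquartile range over the mean and standard deviation.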

In [5]:
# Transformer that one-hot encodes only the categorical columns and returns a DataFrame
class CustomOneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._oh = OneHotEncoder(sparse_output = False)
        self._columns = None
    def fit(self, x, y = None):
        x_cat = x.select_dtypes(include = ['object'])
        # Column names in the form <column>_<category>, taken from get_dummies
        self._columns = pd.get_dummies(x_cat).columns
        self._oh.fit(x_cat)
        return self
    def transform(self, x, y = None):
        x_copy = x.copy()
        x_cat = x_copy.select_dtypes(include = ['object'])
        x_cat_oh = self._oh.transform(x_cat)
        x_cat_oh = pd.DataFrame(x_cat_oh, columns = self._columns, index = x_copy.index)
        # Replace the original categorical columns with their encoded versions
        x_copy.drop(list(x_cat), axis = 1, inplace = True)
        return x_copy.join(x_cat_oh)

In [6]:
# Transformer that prepares the whole dataset by calling custom pipelines and transformers
class DataFramePreparer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._full_pipeline = None
        self._columns = None
    def fit(self, x, y = None):
        num_attribs = list(x.select_dtypes(exclude = ['object']))
        cat_attribs = list(x.select_dtypes(include = ['object']))
        self._full_pipeline = ColumnTransformer([
            ("num", num_pipeline, num_attribs),
            ("cat", CustomOneHotEncoder(), cat_attribs),
        ])
        self._full_pipeline.fit(x)
        self._columns = pd.get_dummies(x).columns
        return self
    def transform(self, x, y = None):
        x_copy = x.copy()
        x_prep = self._full_pipeline.transform(x_copy)
        return pd.DataFrame(x_prep, columns = self._columns, index = x_copy.index)

## Reading the dataset

In [9]:
df = load_kdd_dataset('/home/pako0311/Escritorio/Simulacion/datasets/datasets/NSL-KDD/KDDTest+.arff')
df.head(10)

BadAttributeType: Bad @ATTRIBUTE type, at line 5.
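
The error above points to a specific line of the ARFF file. One way to investigate it is to print the declaration at that line and compare it with the format the parser expects. The following is a minimal sketch using only the standard library. `show_line` is a hypothetical helper, the reported line number is assumed to be 1-indexed, and the demo writes a small temporary ARFF-like file instead of reading the real dataset.

```python
import os
import tempfile

def show_line(path, line_number):
    """Return the 1-indexed line of a text file, without its trailing newline."""
    with open(path, 'r') as handle:
        for current, line in enumerate(handle, start = 1):
            if current == line_number:
                return line.rstrip("\n")
    raise ValueError(f"{path} has fewer than {line_number} lines")

# Toy file used only for this demo; it is not the NSL-KDD data
toy_arff = (
    "@relation demo\n"
    "@attribute duration real\n"
    "@attribute flag {'SF', 'S0'}\n"
    "@data\n"
)
with tempfile.NamedTemporaryFile('w', suffix = '.arff', delete = False) as tmp:
    tmp.write(toy_arff)
try:
    toy_offending = show_line(tmp.name, 3)
finally:
    os.remove(tmp.name)
print(toy_offending)  # @attribute flag {'SF', 'S0'}
```

With the real file, the call would take the dataset path and the line number from the error message.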

## Splitting the dataset

In [None]:
# Split the dataset into the different subsets
train_set, val_set, test_set = train_val_test_split(df)

In [None]:
print("Training set length: ", len(train_set))
print("Validation set length: ", len(val_set))
print("Test set length: ", len(test_set))

For each subset, the labels are separated from the input features.

In [None]:
# Full dataset
x_df = df.drop("class", axis = 1)
y_df = df["class"].copy()

In [None]:
# Training dataset
x_train = train_set.drop("class", axis = 1)
y_train = train_set["class"].copy()

In [None]:
# Validation dataset
x_val = val_set.drop("class", axis = 1)
y_val = val_set["class"].copy()

In [None]:
# Test dataset
x_test = test_set.drop("class", axis = 1)
y_test = test_set["class"].copy()

## Preparing the dataset

In [None]:
# Instantiate our custom transformer
data_preparer = DataFramePreparer()

In [None]:
# Fit on the full dataset so that the transformer learns every possible value
data_preparer.fit(x_df)
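
The reason for fitting on the full dataset can be illustrated with a toy example: if a category appears only outside the training subset, an encoder fitted on that subset alone has no column for it. This is a minimal sketch with made-up rows. `learn_columns` is a hypothetical helper that only mimics the `<column>_<value>` naming of the one-hot columns.

```python
def learn_columns(rows, column):
    """Return the sorted one-hot column names seen for `column`."""
    return sorted({f"{column}_{row[column]}" for row in rows})

# Toy rows for illustration only; 'icmp' never appears in the first two rows
toy_full = [
    {"protocol_type": "tcp"},
    {"protocol_type": "udp"},
    {"protocol_type": "icmp"},
    {"protocol_type": "tcp"},
]
toy_train_only = toy_full[:2]

cols_full = learn_columns(toy_full, "protocol_type")
cols_train = learn_columns(toy_train_only, "protocol_type")
cols_missing = sorted(set(cols_full) - set(cols_train))
print(cols_full)     # ['protocol_type_icmp', 'protocol_type_tcp', 'protocol_type_udp']
print(cols_missing)  # ['protocol_type_icmp']
```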

In [None]:
# Transform the training dataset
x_train_prep = data_preparer.transform(x_train)

In [None]:
x_train.head(10)

In [None]:
x_train_prep.head(10)

In [None]:
# Transform the validation dataset
x_val_prep = data_preparer.transform(x_val)

In [None]:
x_val.head(10)

In [None]:
x_val_prep.head(10)

## Training the Logistic Regression algorithm

A Machine Learning algorithm is instantiated in Sklearn through the methods exposed by the Sklearn API, as presented in other Notebooks.

In [None]:
# Train the algorithm based on Logistic Regression
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter = 5000)  
clf.fit(x_train_prep, y_train)

## Predicting new examples

Make a prediction with the model fitted above, after training the Logistic Regression algorithm.
The validation subset is used.

In [None]:
y_pred = clf.predict(x_val_prep)

## 1.- Confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_val, y_pred)
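
To show how the matrix is laid out, the following minimal sketch counts the (true label, predicted label) pairs by hand on made-up labels. It uses the same layout as `confusion_matrix`: rows are true labels, columns are predicted labels, and labels are in sorted order. `confusion_counts` is a hypothetical helper.

```python
from collections import Counter

def confusion_counts(y_true, y_hat, labels):
    """Rows are true labels, columns are predicted labels, in the order of `labels`."""
    counts = Counter(zip(y_true, y_hat))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Made-up labels for illustration only
toy_true = ['anomaly', 'anomaly', 'anomaly', 'normal', 'normal', 'normal', 'normal', 'anomaly']
toy_hat  = ['anomaly', 'anomaly', 'normal', 'normal', 'normal', 'anomaly', 'normal', 'anomaly']

toy_matrix = confusion_counts(toy_true, toy_hat, labels = ['anomaly', 'normal'])
print(toy_matrix)  # [[3, 1], [1, 3]]
```

Reading the first row: three anomalies were detected and one was missed. Reading the first column: one normal example was wrongly flagged as an anomaly.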

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(clf, x_val_prep, y_val, values_format = '3g')

## 2.- Metrics derived from the confusion matrix

### Precision

In [None]:
from sklearn.metrics import precision_score

print("Precision: ", precision_score(y_val, y_pred, pos_label = 'anomaly'))

### Recall

In [None]:
from sklearn.metrics import recall_score

print("Recall: ", recall_score(y_val, y_pred, pos_label = 'anomaly'))
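
Both metrics above can be computed directly from confusion-matrix counts: precision is TP / (TP + FP) and recall is TP / (TP + FN), where the positive class is `'anomaly'`. This is a minimal sketch with made-up counts. `precision_recall` is a hypothetical helper that assumes both denominators are non-zero.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Made-up counts: 80 anomalies detected, 20 false alarms, 10 anomalies missed
toy_precision, toy_recall = precision_recall(tp = 80, fp = 20, fn = 10)
print(toy_precision)          # 0.8
print(round(toy_recall, 3))   # 0.889
```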

### 3.- ROC curve

In [None]:
from sklearn.metrics import RocCurveDisplay

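# Use 'anomaly' as the positive class, consistent with the precision and recall cells
# (by default the positive class is clf.classes_[1], which here is 'normal')
RocCurveDisplay.from_estimator(clf, x_val_prep, y_val, pos_label = 'anomaly')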
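
A ROC curve plots the true positive rate against the false positive rate as the decision threshold moves. The following is a minimal sketch of that computation on made-up scores, using only plain Python. `roc_points` and `trapezoid_area` are hypothetical helpers, not the Sklearn implementation.

```python
def roc_points(y_true, scores, pos_label):
    """Return (FPR, TPR) pairs, one per distinct threshold, starting at (0, 0)."""
    n_pos = sum(1 for y in y_true if y == pos_label)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse = True):
        # Everything scored at or above the threshold is predicted positive
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == pos_label)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y != pos_label)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up labels and scores (higher score means more likely 'anomaly')
roc_true = ['anomaly', 'anomaly', 'normal', 'anomaly', 'normal']
roc_scores = [0.9, 0.8, 0.7, 0.4, 0.2]

roc_curve_points = roc_points(roc_true, roc_scores, pos_label = 'anomaly')
roc_auc = trapezoid_area(roc_curve_points)
print(roc_curve_points)
print(round(roc_auc, 3))  # 0.833
```

The area equals the fraction of (anomaly, normal) pairs in which the anomaly receives the higher score, here 5 out of 6.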

### PR curve

In [None]:
from sklearn.metrics import PrecisionRecallDisplay

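# Use 'anomaly' as the positive class, consistent with the precision and recall cells
PrecisionRecallDisplay.from_estimator(clf, x_val_prep, y_val, pos_label = 'anomaly')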
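
A PR curve follows the same idea, but tracks precision and recall at each threshold instead of the true and false positive rates. This is a minimal sketch on made-up scores. `pr_points` is a hypothetical helper, not the Sklearn implementation.

```python
def pr_points(y_true, scores, pos_label):
    """Return (recall, precision) pairs, one per distinct threshold."""
    n_pos = sum(1 for y in y_true if y == pos_label)
    points = []
    for threshold in sorted(set(scores), reverse = True):
        predicted = [y for y, s in zip(y_true, scores) if s >= threshold]
        tp = sum(1 for y in predicted if y == pos_label)
        # `predicted` is never empty because each threshold is itself one of the scores
        points.append((tp / n_pos, tp / len(predicted)))
    return points

# Made-up labels and scores (higher score means more likely 'anomaly')
pr_true = ['anomaly', 'anomaly', 'normal', 'anomaly', 'normal']
pr_scores = [0.9, 0.8, 0.7, 0.4, 0.2]

pr_curve_points = pr_points(pr_true, pr_scores, pos_label = 'anomaly')
for recall, precision in pr_curve_points:
    print(round(recall, 2), round(precision, 2))
```

Lowering the threshold never decreases recall, while precision can go up or down, which is why PR curves often have a saw-tooth shape.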

### 4.- Evaluating the model with the test dataset

In [None]:
# Transform the test subset and predict on it

x_test_prep = data_preparer.transform(x_test)
y_pred = clf.predict(x_test_prep)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(clf, x_test_prep, y_test, values_format = '3g')

In [None]:
from sklearn.metrics import f1_score

print("F1_score: ", f1_score(y_test, y_pred, pos_label = 'anomaly'))
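
The F1 score reported above is the harmonic mean of precision and recall, so it is high only when both are high. This is a minimal sketch with made-up values. `f1_from` is a hypothetical helper.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

toy_f1_balanced = f1_from(0.8, 0.5)
toy_f1_lopsided = f1_from(1.0, 0.1)
print(round(toy_f1_balanced, 3))  # 0.615
print(round(toy_f1_lopsided, 3))  # 0.182
```

The second case shows that perfect precision does not compensate for very low recall.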