# Evaluation of Results


This notebook shows how to evaluate the results of a prediction made with a machine learning (ML) algorithm.

## Dataset

### Description

NSL-KDD is a data set suggested to solve some of the inherent problems of the KDD'99 data set which are mentioned in. Although, this new version of the KDD data set still suffers from some of the problems discussed by McHugh and may not be a perfect representative of existing real networks, because of the lack of public data sets for network-based IDSs, we believe it still can be applied as an effective benchmark data set to help researchers compare different intrusion detection methods. Furthermore, the number of records in the NSL-KDD train and test sets are reasonable. This advantage makes it affordable to run the experiments on the complete set without the need to randomly select a small portion. Consequently, evaluation results of different research work will be consistent and comparable.

### Data files

* KDDTrain+.ARFF: The full NSL-KDD train set with binary labels in ARFF format 
* KDDTrain+.TXT: The full NSL-KDD train set including attack-type labels and difficulty level in CSV format.
* KDDTrain+_20Percent.ARFF: A 20% subset of the KDDTrain+.arff file
* KDDTrain+_20Percent.TXT: A 20% subset of the KDDTrain+.txt file
* KDDTest+.ARFF: The full NSL-KDD test set with binary labels in ARFF format
* KDDTest+.TXT: The full NSL-KDD test set including attack-type labels and difficulty level in CSV format
* KDDTest-21.ARFF: A subset of the KDDTest+.arff file which does not include records with difficulty level of 21 out of 21
* KDDTest-21.TXT: A subset of the KDDTest+.txt file which does not include records with difficulty level of 21 out of 21

## Imports

In [1]:
import arff
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer 
from sklearn.base import BaseEstimator, TransformerMixin

# Helper Functions

In [2]:
def load_kdd_dataset(data_path):
    """Read the NSL-KDD dataset from an ARFF file."""
    with open(data_path, 'r') as train_set:
        dataset = arff.load(train_set)
    attributes=[attr[0] for attr in dataset["attributes"]]
    return pd.DataFrame(dataset["data"], columns=attributes) 

In [3]:
# Build a function that performs the full train/validation/test split
def train_val_test_split(df, rstate = 42 , shuffle = True, stratify = None):
    strat=df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df,test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    strat=test_set[stratify] if stratify else None
    val_set,test_set = train_test_split(
        test_set,test_size=0.5,random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)

In [4]:
# Build a pipeline for the numeric attributes
num_pipeline = Pipeline([
    ('imputer',SimpleImputer(strategy="median")),
    ('rbst_scaler', RobustScaler())
])
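
As a rough illustration of what this pipeline does to a single numeric column, the sketch below fills missing values with the median and then applies the robust scaling formula `(x - median) / IQR` using only the standard library. The helper names and sample values are hypothetical, and scikit-learn's own implementation differs in details such as how it handles NaN and constant columns.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    median = statistics.median(values)
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

# Hypothetical column with one missing value and one large outlier
column = [1, 2, None, 4, 100]
imputed = impute_median(column)
scaled = robust_scale(imputed)
print(imputed)
print(scaled)
```

Because the median and the IQR are barely affected by the outlier, the bulk of the values stays in a small range while the outlier remains clearly separated.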

In [5]:
# Transformer that one-hot encodes only the categorical columns and returns a DataFrame
class CustomOneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        # sparse_output=False returns a dense array (requires scikit-learn >= 1.2)
        self._oh = OneHotEncoder(sparse_output=False)
        self._columns = None
    def fit(self, X, y=None):
        X_cat = X.select_dtypes(include=["object"])
        # column names for the encoded features, e.g. protocol_type_tcp
        self._columns = pd.get_dummies(X_cat).columns
        self._oh.fit(X_cat)
        return self
    def transform(self, X, y=None):
        # work on a copy so the input DataFrame is not modified
        X_copy = X.copy()
        X_cat = X_copy.select_dtypes(include=["object"])
        X_cat_oh = self._oh.transform(X_cat)
        X_cat_oh = pd.DataFrame(X_cat_oh,
                                columns=self._columns,
                                index=X_copy.index)
        # drop the original categorical columns and append the encoded ones
        X_copy.drop(list(X_cat), axis=1, inplace=True)
        return X_copy.join(X_cat_oh)

In [6]:
# Transformer that prepares the whole dataset by combining the pipelines
# and the custom transformers
class DataFramePreparer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._full_pipeline = None
        self._columns = None
    def fit(self,X, y = None):
        num_attribs=list(X.select_dtypes(exclude =["object"]))
        cat_attribs=list(X.select_dtypes(include =["object"]))
        self._full_pipeline=ColumnTransformer([
            ("num", num_pipeline,num_attribs),
            # pass an instance, not the class, so ColumnTransformer can clone it
            ("cat", CustomOneHotEncoder(),cat_attribs)
        ])
        self._full_pipeline.fit(X)
        self._columns=pd.get_dummies(X).columns
        return self
    def transform(self, X, y=None):
        X_copy=X.copy()
        X_prep=self._full_pipeline.transform(X_copy)
        return pd.DataFrame(X_prep,columns=self._columns,index=X_copy.index)

In [7]:
df =load_kdd_dataset('datasets/NSL-KDD/KDDTrain+.arff')
df.head(10)

|  | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | tcp | ftp_data | SF | 491.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 25.0 | 0.17 | 0.03 | 0.17 | 0.0 | 0.0 | 0.0 | 0.05 | 0.0 | normal |
| 1 | 0.0 | udp | other | SF | 146.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.6 | 0.88 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |
| 2 | 0.0 | tcp | private | S0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 26.0 | 0.1 | 0.05 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | anomaly |
| 3 | 0.0 | tcp | http | SF | 232.0 | 8153.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 255.0 | 1.0 | 0.0 | 0.03 | 0.04 | 0.03 | 0.01 | 0.0 | 0.01 | normal |
| 4 | 0.0 | tcp | http | SF | 199.0 | 420.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 255.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |
| 5 | 0.0 | tcp | private | REJ | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 19.0 | 0.07 | 0.07 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | anomaly |
| 6 | 0.0 | tcp | private | S0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 9.0 | 0.04 | 0.05 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | anomaly |
| 7 | 0.0 | tcp | private | S0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 15.0 | 0.06 | 0.07 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | anomaly |
| 8 | 0.0 | tcp | remote_job | S0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 23.0 | 0.09 | 0.05 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | anomaly |
| 9 | 0.0 | tcp | private | S0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 13.0 | 0.05 | 0.06 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | anomaly |


# Splitting the Dataset

In [8]:
# Split the dataset into the different subsets
train_set,val_set,test_set=train_val_test_split(df)

In [9]:
print("Training set length:", len(train_set))
print("Validation set length:", len(val_set))
print("Test set length:", len(test_set))

Training set length: 75583
Validation set length: 25195
Test set length: 25195
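
The three lengths above follow from the two chained splits in `train_val_test_split` (`test_size=0.4`, then `test_size=0.5`). The sketch below reproduces the arithmetic with the standard library only. It assumes that a fractional `test_size` is rounded up to a whole number of rows and that the remainder goes to the other subset. The helper `split_sizes` is hypothetical, and the total of 125973 rows is simply the sum of the three printed lengths.

```python
import math

def split_sizes(n_rows, first_test_size=0.4, second_test_size=0.5):
    """Return (train, validation, test) sizes for two chained splits."""
    # First split: hold out 40% of the rows, rounded up
    n_holdout = math.ceil(first_test_size * n_rows)
    n_train = n_rows - n_holdout
    # Second split: half of the held-out rows become the test set
    n_test = math.ceil(second_test_size * n_holdout)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

sizes = split_sizes(125973)
print(sizes)
```

The result is a 60/20/20 split of the full training file.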


For each subset, the labels are separated from the input features.

In [10]:
# Full dataset
X_df=df.drop("class",axis=1)
y_df=df["class"].copy()

In [11]:
# Training set
X_train=train_set.drop("class",axis=1)
y_train=train_set["class"].copy()

In [12]:
# Validation set
X_val=val_set.drop("class",axis=1)
y_val=val_set["class"].copy()

# Preparing the Dataset

In [13]:
# Instantiate the custom transformer
data_preparer=DataFramePreparer()

In [15]:
# Fit on the full dataset so that the preparer sees every possible category value
data_preparer.fit(X_df)
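
The reason for fitting on the full dataset can be shown with a small pure-Python sketch. A one-hot encoder learns its list of categories during fit, and scikit-learn's `OneHotEncoder` by default raises an error when it later meets a category it has not seen. The helpers `fit_categories` and `one_hot` and the sample `flag` values below are hypothetical and only mimic that behaviour.

```python
def fit_categories(values):
    """Learn the sorted list of distinct categories."""
    return sorted(set(values))

def one_hot(value, categories):
    """Encode one value as a 0/1 vector over the learned categories."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]

all_flags = ["SF", "S0", "REJ", "SF"]   # full dataset
train_flags = ["SF", "S0"]              # a subset that happens to miss "REJ"

cats_all = fit_categories(all_flags)
cats_train = fit_categories(train_flags)

encoded = one_hot("REJ", cats_all)      # works: "REJ" was seen during fit

try:
    one_hot("REJ", cats_train)          # fails: "REJ" was never seen
    unseen_rejected = False
except ValueError:
    unseen_rejected = True

print(cats_all, encoded, unseen_rejected)
```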

In [16]:
# Transform the training set
X_train_prep=data_preparer.transform(X_train)

In [None]:
X_train.head(10)

In [None]:
X_train_prep.head(10)

In [None]:
# Transform the validation set
X_val_prep=data_preparer.transform(X_val)

## Training the Logistic Regression Algorithm

Instantiating a machine learning (ML) algorithm with scikit-learn is done through the methods exposed by the scikit-learn API, as shown in previous notebooks.

In [17]:
# Train the logistic regression classifier
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train_prep, y_train)

## Predicting New Examples

Make a prediction with the model generated above, after training the logistic regression algorithm. The validation subset is used.

In [None]:
y_pred=clf.predict(X_val_prep)

In [None]:
# Confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_val,y_pred)
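
To make the layout of the matrix concrete, the sketch below counts the four outcomes by hand on a few hypothetical labels, treating `anomaly` as the positive class. With string labels, scikit-learn sorts the classes alphabetically, so `anomaly` comes first and the matrix reads `[[TP, FN], [FP, TN]]` (rows are true labels, columns are predictions). The helper name and the toy data are assumptions for illustration only.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, pos_label="anomaly"):
    """Count true/false positives and negatives for a binary problem."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pos_label:
            counts["tp" if pred == pos_label else "fn"] += 1
        else:
            counts["fp" if pred == pos_label else "tn"] += 1
    return counts

# Hypothetical labels, not taken from NSL-KDD
y_true_toy = ["anomaly", "anomaly", "anomaly", "normal",
              "normal", "normal", "normal", "anomaly"]
y_pred_toy = ["anomaly", "anomaly", "normal", "normal",
              "normal", "anomaly", "normal", "anomaly"]

counts = confusion_counts(y_true_toy, y_pred_toy)
# Same layout as a matrix whose first row/column is "anomaly"
matrix = [[counts["tp"], counts["fn"]],
          [counts["fp"], counts["tn"]]]
print(matrix)
```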

In [None]:
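# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(clf, X_val_prep, y_val, values_format='3g')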

# Metrics Derived from the Confusion Matrix

## Precision

In [None]:
from sklearn.metrics import precision_score
print("Precision:", precision_score(y_val, y_pred, pos_label='anomaly'))

## Recall

In [None]:
from sklearn.metrics import recall_score
print("Recall:", recall_score(y_val, y_pred, pos_label='anomaly'))
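
Both metrics can be derived directly from the confusion-matrix counts: precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below applies these formulas to hypothetical counts. The numbers are made up for illustration and are not results from this notebook.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that the model found."""
    return tp / (tp + fn)

# Hypothetical counts with "anomaly" as the positive class
tp, fp, fn = 8, 2, 4

precision_value = precision(tp, fp)
recall_value = recall(tp, fn)
print("Precision:", precision_value)
print("Recall:", recall_value)
```

In an intrusion detection setting, low precision means many false alarms, while low recall means attacks that go undetected.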

## ROC and PR Curves

In [None]:
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(clf, X_val_prep, y_val)
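
A ROC curve plots the true positive rate against the false positive rate as the decision threshold varies. The sketch below computes a few such points by hand. The scores, labels, thresholds and the helper `roc_points` are hypothetical, and the scores are assumed to express how strongly each example looks like an `anomaly`.

```python
def roc_points(y_true, scores, thresholds, pos_label="anomaly"):
    """Return (false positive rate, true positive rate) per threshold."""
    n_pos = sum(1 for y in y_true if y == pos_label)
    n_neg = len(y_true) - n_pos
    points = []
    for t in thresholds:
        # An example is predicted positive when its score reaches the threshold
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == pos_label)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y != pos_label)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical labels and anomaly scores
y_true_toy = ["anomaly", "anomaly", "normal", "anomaly", "normal", "normal"]
scores_toy = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

points = roc_points(y_true_toy, scores_toy, thresholds=[1.0, 0.75, 0.5, 0.2, 0.0])
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves the point from the bottom-left corner (nothing flagged) toward the top-right corner (everything flagged). A good classifier rises toward the top-left before drifting right.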