## Splitting the Dataset

This notebook shows some of the most commonly used techniques for splitting a dataset.

#### Dataset

### Description

NSL-KDD is a data set suggested to solve some of the inherent problems of the KDD'99 data set which are mentioned in [1]. Although this new version of the KDD data set still suffers from some of the problems discussed by McHugh and may not be a perfect representative of existing real networks, because of the lack of public data sets for network-based IDSs, we believe it can still be applied as an effective benchmark data set to help researchers compare different intrusion detection methods.

Furthermore, the number of records in the NSL-KDD train and test sets is reasonable. This advantage makes it affordable to run the experiments on the complete set without the need to randomly select a small portion. Consequently, evaluation results of different research work will be consistent and comparable.

### Data Files

* <span style ="color:green" >**KDDTrain+.ARFF:** The full NSL-KDD train set with binary labels in ARFF format.</span>
* <span style ="color:green" >**KDDTrain+.TXT:** The full NSL-KDD train set including attack-type labels and difficulty level in CSV format.</span>
* KDDTrain+_20Percent.ARFF: A 20% subset of the KDDTrain+.arff file.
* KDDTrain+_20Percent.TXT: A 20% subset of the KDDTrain+.txt file.
* KDDTest+.ARFF: The full NSL-KDD test set with binary labels in ARFF format.
* KDDTest+.TXT: The full NSL-KDD test set including attack-type labels and difficulty level in CSV format.
* KDDTest-21.ARFF: A subset of the KDDTest+.arff file which does not include records with difficulty level of 21 out of 21.
* KDDTest-21.TXT: A subset of the KDDTest+.txt file which does not include records with difficulty level of 21 out of 21.

### Downloading the Data Files
https://www.unb.ca/cic/datasets/index.html

### Additional References on the Dataset
_[1] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A Detailed Analysis of the KDD CUP 99 Data Set,” Submitted to Second IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2009._

## 1.- Reading the Dataset

In [3]:
import arff
import pandas as pd

In [4]:
def load_kdd_dataset(data_path):
    """Read the NSL-KDD dataset from an ARFF file into a DataFrame."""
    with open(data_path, 'r') as train_set:
        dataset = arff.load(train_set)
    # Each attribute is a (name, type) pair; keep only the names as column labels.
    attributes = [attr[0] for attr in dataset["attributes"]]
    return pd.DataFrame(dataset["data"], columns=attributes)

In [5]:
load_kdd_dataset('datasets/NSL-KDD/KDDTrain+.arff')

| | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | tcp | ftp_data | SF | 491.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 25.0 | 0.17 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | normal |
| 1 | 0.0 | udp | other | SF | 146.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.00 | 0.60 | 0.88 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | normal |
| 2 | 0.0 | tcp | private | S0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 26.0 | 0.10 | 0.05 | 0.00 | 0.00 | 1.00 | 1.00 | 0.00 | 0.00 | anomaly |
| 3 | 0.0 | tcp | http | SF | 232.0 | 8153.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 255.0 | 1.00 | 0.00 | 0.03 | 0.04 | 0.03 | 0.01 | 0.00 | 0.01 | normal |
| 4 | 0.0 | tcp | http | SF | 199.0 | 420.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 255.0 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | normal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 125968 | 0.0 | tcp | private | S0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 25.0 | 0.10 | 0.06 | 0.00 | 0.00 | 1.00 | 1.00 | 0.00 | 0.00 | anomaly |
| 125969 | 8.0 | udp | private | SF | 105.0 | 145.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 244.0 | 0.96 | 0.01 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | normal |
| 125970 | 0.0 | tcp | smtp | SF | 2231.0 | 384.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 30.0 | 0.12 | 0.06 | 0.00 | 0.00 | 0.72 | 0.00 | 0.01 | 0.00 | normal |
| 125971 | 0.0 | tcp | klogin | S0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 8.0 | 0.03 | 0.05 | 0.00 | 0.00 | 1.00 | 1.00 | 0.00 | 0.00 | anomaly |


In [7]:
df = load_kdd_dataset('datasets/NSL-KDD/KDDTrain+.arff')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125973 entries, 0 to 125972
Data columns (total 42 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   duration                     125973 non-null  float64
 1   protocol_type                125973 non-null  object 
 2   service                      125973 non-null  object 
 3   flag                         125973 non-null  object 
 4   src_bytes                    125973 non-null  float64
 5   dst_bytes                    125973 non-null  float64
 6   land                         125973 non-null  object 
 7   wrong_fragment               125973 non-null  float64
 8   urgent                       125973 non-null  float64
 9   hot                          125973 non-null  float64
 10  num_failed_logins            125973 non-null  float64
 11  logged_in                    125973 non-null  object 
 12  num_compromised              125973 non-null  float64
 13 

## 2.- Splitting the Dataset

The dataset must be split into the subsets needed for the training, validation, and testing processes.
Sklearn implements the **train_test_split** function for this purpose.

In [None]:
# Split the dataset: 60% train_set, 40% test_set
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.4, random_state=42)
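As a minimal sketch of what this call returns, the same split can be tried on a small toy frame. The `toy_df` DataFrame and its values are hypothetical stand-ins for the NSL-KDD data, chosen only so the sizes are easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy DataFrame standing in for the NSL-KDD data.
toy_df = pd.DataFrame({
    "duration": np.arange(10, dtype=float),
    "protocol_type": ["tcp", "udp"] * 5,
})

# 60% train, 40% test; random_state makes the shuffle reproducible.
toy_train, toy_test = train_test_split(toy_df, test_size=0.4, random_state=42)

# Sizes of the two subsets.
print(len(toy_train), len(toy_test))

# The subsets are disjoint and together cover every original row.
print(sorted(toy_train.index.tolist() + toy_test.index.tolist()))
```

Both subsets are DataFrames that keep the original row index, which makes it easy to trace a row back to the source data.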

In [None]:
train_set.info()

In [None]:
test_set.info()

In [None]:
# Split the held-out set: 50% validation set, 50% test set
val_set, test_set = train_test_split(test_set, test_size=0.5, random_state=12)


In [None]:
print("Training set length: ", len(train_set))
print("Validation set length: ", len(val_set))
print("Test set length: ", len(test_set))
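The two-step split above can also be wrapped in a single helper. This is only a sketch: the function name `train_val_test_split`, the single shared seed, and the `demo_df` frame are assumptions for illustration, not part of the notebook or of sklearn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def train_val_test_split(df, random_state=42):
    """Split df into 60% train, 20% validation and 20% test (hypothetical helper)."""
    # First cut: 60% train, 40% held out.
    train, held_out = train_test_split(df, test_size=0.4, random_state=random_state)
    # Second cut: halve the held-out part into validation and test.
    val, test = train_test_split(held_out, test_size=0.5, random_state=random_state)
    return train, val, test


# Hypothetical frame with 1000 rows, so the expected sizes are 600, 200 and 200.
demo_df = pd.DataFrame({"x": range(1000)})
demo_train, demo_val, demo_test = train_val_test_split(demo_df)
print(len(demo_train), len(demo_val), len(demo_test))
```

Keeping both cuts in one function avoids accidentally splitting a variable that has already been overwritten, which is easy to do when notebook cells are re-run out of order.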


## 3.- Random Partitioning and Stratified Sampling


Sklearn implements the **train_test_split** function; however, by default it partitions the dataset randomly, so each run of the script yields different subsets. Even when a seed is set for the random generator, loading a new element into the dataset produces new subsets. As a result, after many attempts the algorithm may end up seeing the whole dataset.
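This behavior can be checked on a toy frame. The sketch below uses hypothetical data (`base_df`, `grown_df`): it splits the same frame twice with the same seed, then appends one row and splits again.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with 100 rows.
base_df = pd.DataFrame({"x": range(100)})

# Same data, same seed: the two test sets contain the same rows.
_, test_first = train_test_split(base_df, test_size=0.4, random_state=42)
_, test_again = train_test_split(base_df, test_size=0.4, random_state=42)
print(test_first.index.equals(test_again.index))

# One extra row, same seed: the shuffle is drawn over 101 rows instead of 100,
# so rows are typically reassigned between the two subsets.
grown_df = pd.concat([base_df, pd.DataFrame({"x": [100]})], ignore_index=True)
_, test_grown = train_test_split(grown_df, test_size=0.4, random_state=42)
moved = set(test_first.index) - set(test_grown.index)
print(len(moved), "rows left the test set after adding one row")
```

The seed therefore guarantees reproducibility only while the data itself stays unchanged.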

To address this problem, sklearn provides the **shuffle** parameter in the **train_test_split** function.

In [None]:
# With shuffle=False, the dataset is not shuffled before it is split
train_set, test_set = train_test_split(df, test_size=0.4, random_state=42, shuffle=False)
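A small sketch of the effect of `shuffle=False`, using a hypothetical ten-row frame (`ordered_df`): the split becomes a plain cut in row order, so the training set is the head of the frame and the test set is the tail.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame whose index makes the row order visible.
ordered_df = pd.DataFrame({"x": range(10)})

# Without shuffling, the first 60% of the rows go to train and the last 40% to test.
head_set, tail_set = train_test_split(ordered_df, test_size=0.4, shuffle=False)

print(head_set.index.tolist())  # [0, 1, 2, 3, 4, 5]
print(tail_set.index.tolist())  # [6, 7, 8, 9]
```

Because the cut follows row order, this only gives representative subsets when the rows are not sorted by class or by time of capture.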

These methods for splitting the dataset are fine when the dataset is very large; when it is not, there is a risk of introducing **Sampling Bias**.
To avoid this, a sampling method called **Stratified Sampling** is used. The population is divided into homogeneous subgroups called **strata**. The goal is that, for one or more particular features, no value present in the dataset is left without representation in any of the subsets. Sklearn implements the **stratify** parameter in the **train_test_split** function to control this behavior.

_The stratify parameter makes a split so that the proportion of values in the sample produced will be the same as the proportion of values provided to parameter stratify._

_For example, if variable y is a binary categorical variable with values 0 and 1 and there are 25% of zeros and 75% of ones, stratify=y will make sure that your random split has 25% of zeros and 75% of ones._


In [None]:
train_set, test_set = train_test_split(df, test_size=0.4, random_state=42, stratify=df["protocol_type"])
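The 25% / 75% example quoted above can be reproduced with a short sketch. The `labels_df` frame and its `y` column are hypothetical; on the NSL-KDD data the same check would compare the `protocol_type` proportions instead.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data matching the quoted example: 25% zeros and 75% ones.
labels_df = pd.DataFrame({"x": range(1000), "y": [0] * 250 + [1] * 750})

strat_train, strat_test = train_test_split(
    labels_df, test_size=0.4, random_state=42, stratify=labels_df["y"]
)

# Share of each class in the full data and in both subsets.
print(labels_df["y"].value_counts(normalize=True).sort_index().tolist())
print(strat_train["y"].value_counts(normalize=True).sort_index().tolist())
print(strat_test["y"].value_counts(normalize=True).sort_index().tolist())
```

All three lines should report the same class shares, which is the property stratification is meant to guarantee.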