# Splitting the DataSet

This notebook shows some of the most commonly used mechanisms for splitting the DataSet.

## DataSet

### Description

NSL-KDD is a data set suggested to solve some of the inherent problems of the KDD'99 data set which are mentioned in. Although, this new version of the KDD data set still suffers from some of the problems discussed by McHugh and may not be a perfect representative of existing real networks, because of the lack of public data sets for network-based IDSs, we believe it still can be applied as an effective benchmark data set to help researchers compare different intrusion detection methods. Furthermore, the number of records in the NSL-KDD train and test sets are reasonable. This advantage makes it affordable to run the experiments on the complete set without the need to randomly select a small portion. Consequently, evaluation results of different research work will be consistent and comparable.

### Data files
* <span style="color:blue">**KDDTrain+.ARFF:** The full NSL-KDD train set with binary labels in ARFF format</span>
* <span style="color:blue">**KDDTrain+.TXT:** The full NSL-KDD train set including attack-type labels and difficulty level in CSV format</span>
* <span style="color:blue">**KDDTrain+_20Percent.ARFF:** A 20% subset of the KDDTrain+.arff file</span>
* <span style="color:blue">**KDDTrain+_20Percent.TXT:** A 20% subset of the KDDTrain+.txt file</span>
* <span style="color:blue">**KDDTest+.ARFF:** The full NSL-KDD test set with binary labels in ARFF format</span>
* <span style="color:blue">**KDDTest+.TXT:** The full NSL-KDD test set including attack-type labels and difficulty level in CSV format</span>
* <span style="color:blue">**KDDTest-21.ARFF:** A subset of the KDDTest+.arff file which does not include records with difficulty level of 21 out of 21</span>
* <span style="color:blue">**KDDTest-21.TXT:** A subset of the KDDTest+.txt file which does not include records with difficulty level of 21 out of 21</span>

# 1.- Reading the DataSet

In [1]:
import arff 
import pandas as pd

In [2]:
def load_kdd_dataset(data_path):
    """Read the NSL-KDD DataSet from an ARFF file into a DataFrame."""
    with open(data_path, 'r') as train_set:
        dataset = arff.load(train_set)
        
    attributes = [attr[0] for attr in dataset["attributes"]]
    return pd.DataFrame(dataset["data"], columns = attributes)

In [5]:
df = load_kdd_dataset('datasets/datasets/NSL-KDD/KDDTrain+.arff')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125973 entries, 0 to 125972
Data columns (total 42 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   duration                     125973 non-null  float64
 1   protocol_type                125973 non-null  object 
 2   service                      125973 non-null  object 
 3   flag                         125973 non-null  object 
 4   src_bytes                    125973 non-null  float64
 5   dst_bytes                    125973 non-null  float64
 6   land                         125973 non-null  object 
 7   wrong_fragment               125973 non-null  float64
 8   urgent                       125973 non-null  float64
 9   hot                          125973 non-null  float64
 10  num_failed_logins            125973 non-null  float64
 11  logged_in                    125973 non-null  object 
 12  num_compromised              125973 non-null  float64
 13 

# 2.- Splitting the DataSet

The DataSet must be split into the subsets needed for the training, validation, and testing processes.
sklearn implements this in the **train_test_split** function.

In [7]:
# Split the DataSet: 60% train_set, 40% test_set
from sklearn.model_selection import train_test_split 

train_set, test_set = train_test_split(df, test_size =0.4, random_state = 42)

In [8]:
# 60% of the data
train_set.info()

<class 'pandas.core.frame.DataFrame'>
Index: 75583 entries, 98320 to 121958
Data columns (total 42 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   duration                     75583 non-null  float64
 1   protocol_type                75583 non-null  object 
 2   service                      75583 non-null  object 
 3   flag                         75583 non-null  object 
 4   src_bytes                    75583 non-null  float64
 5   dst_bytes                    75583 non-null  float64
 6   land                         75583 non-null  object 
 7   wrong_fragment               75583 non-null  float64
 8   urgent                       75583 non-null  float64
 9   hot                          75583 non-null  float64
 10  num_failed_logins            75583 non-null  float64
 11  logged_in                    75583 non-null  object 
 12  num_compromised              75583 non-null  float64
 13  root_shell      

In [9]:
# 40% of the data
test_set.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50390 entries, 378 to 89600
Data columns (total 42 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   duration                     50390 non-null  float64
 1   protocol_type                50390 non-null  object 
 2   service                      50390 non-null  object 
 3   flag                         50390 non-null  object 
 4   src_bytes                    50390 non-null  float64
 5   dst_bytes                    50390 non-null  float64
 6   land                         50390 non-null  object 
 7   wrong_fragment               50390 non-null  float64
 8   urgent                       50390 non-null  float64
 9   hot                          50390 non-null  float64
 10  num_failed_logins            50390 non-null  float64
 11  logged_in                    50390 non-null  object 
 12  num_compromised              50390 non-null  float64
 13  root_shell         

In [12]:
# Split the held-out set: 50% validation_set, 50% test_set
val_set, test_set = train_test_split(test_set, test_size = 0.5, random_state =42)

In [13]:
print("Length of the training_set", len(train_set))
print("Length of the validation_set", len(val_set))
print("Length of the test_set", len(test_set))

Length of the training_set 75583
Length of the validation_set 25195
Length of the test_set 25195
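
The lengths printed above can be checked without the NSL-KDD files. The following is a minimal sketch, assuming a synthetic single-column frame with the same number of rows as the DataSet (the `toy` frame and its `x` column are illustrative only). It applies the same 60/40 split followed by the 50/50 split and confirms that the three subsets do not share rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with as many rows as KDDTrain+ (assumption: only the row count matters here)
toy = pd.DataFrame({"x": np.arange(125973)})

# Same two-step partitioning as in the cells above
train, rest = train_test_split(toy, test_size=0.4, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)

sizes = (len(train), len(val), len(test))
print(sizes)

# The three index sets should be pairwise disjoint and cover the whole frame
covered = len(set(train.index) | set(val.index) | set(test.index))
print(covered == len(toy))
```

Since a fractional `test_size` has to be turned into a whole number of rows, the 40% share is not exactly 50389.2 rows but 50390.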


# 3.- Random partitioning and Stratified Sampling

sklearn implements the **train_test_split** function; by default it shuffles the DataSet and partitions it at random on every run of the script. A fixed seed (**random_state**) makes the split reproducible only while the DataSet itself stays the same: if the DataSet is reloaded with new or reordered records, different subsets are generated, and after many such runs the algorithm may end up "seeing" the whole DataSet.
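
The behavior of the seed can be illustrated on synthetic data. This is a minimal sketch, assuming two toy frames (`toy` with 20 rows and `grown` with 25 rows, both hypothetical) that stand in for the DataSet before and after new records arrive.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"x": np.arange(20)})

# Same data and same seed: the two partitions are identical
first_train, first_test = train_test_split(toy, test_size=0.4, random_state=42)
second_train, second_test = train_test_split(toy, test_size=0.4, random_state=42)
same_split = first_test.index.equals(second_test.index)
print("Identical split with the same seed:", same_split)

# Same seed, but the DataSet has grown: rows may change sides
grown = pd.DataFrame({"x": np.arange(25)})
_, grown_test = train_test_split(grown, test_size=0.4, random_state=42)
moved = sorted(set(first_train.index) & set(grown_test.index))
print("Rows previously in train that are now in test:", moved)
```

Any row listed in `moved` was used for training in the first run and would be used for testing in the second, which is how information can leak across runs.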

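To avoid this dependence on the random shuffle, sklearn provides the **shuffle** parameter in the **train_test_split** function. With `shuffle=False` the rows keep their original order, so the first 60% go to the train set and the last 40% to the test set.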

In [14]:
# If shuffle=False, the DataSet is not shuffled before partitioning.
train_set, test_set = train_test_split(df, test_size =0.40, random_state =42, shuffle = False)
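
The effect of `shuffle=False` is easy to see on a small frame. This is a minimal sketch, assuming a hypothetical ten-row `toy` frame in place of the DataSet.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"x": np.arange(10)})

# Without shuffling, the split is a plain cut: the head is train, the tail is test
train, test = train_test_split(toy, test_size=0.4, shuffle=False)

print(list(train.index))  # [0, 1, 2, 3, 4, 5]
print(list(test.index))   # [6, 7, 8, 9]
```

This only gives representative subsets when the row order carries no pattern. Note also that `train_test_split` raises a `ValueError` if `stratify` is set together with `shuffle=False`, so the two options cannot be combined.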

These splitting methods work well when the DataSet is very large; otherwise there is a risk of introducing **"sampling bias"**.

To avoid this, a sampling method called **stratified sampling** is used. The goal is that no value of one or more chosen features is left unrepresented in any of the subsets.

sklearn provides the **stratify** parameter in the **train_test_split** function to control this behavior.

The **stratify** parameter makes a split so that the proportion of values in the sample produced is the same as the proportion of values in the column passed to **stratify**. For example, if that column contains 25% of 0's and 75% of 1's, the random split will also have 25% of 0's and 75% of 1's.

In [15]:
train_set, test_set = train_test_split(df, test_size = 0.4, random_state = 42, stratify = df["protocol_type"])
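
The 25% / 75% example can be reproduced directly. This is a minimal sketch, assuming a synthetic frame with a binary `label` column (both names are illustrative and not part of NSL-KDD).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 rows: 25 zeros and 75 ones
toy = pd.DataFrame({"x": range(100), "label": [0] * 25 + [1] * 75})

train, test = train_test_split(
    toy, test_size=0.4, random_state=42, stratify=toy["label"])

# Share of each label in both subsets
train_share = train["label"].value_counts(normalize=True).sort_index()
test_share = test["label"].value_counts(normalize=True).sort_index()
print(train_share)
print(test_share)
```

Both subsets keep the 25% / 75% mix of the full frame, which a purely random split only approximates.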

# 4.- Building a partitioning function

In [22]:
# Build a function that performs the complete partitioning (60% train, 20% validation, 20% test)
def train_val_test_split(df, rstate = 42, shuffle = True, stratify = None):
    # Stratify the first split on the requested column, if any
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size = 0.4, random_state = rstate, shuffle = shuffle, stratify = strat)
    
    # Split the held-out 40% in half, stratifying on the same column
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size = 0.5, random_state = rstate, shuffle = shuffle, stratify = strat)
    
    return (train_set, val_set, test_set)

In [23]:
print("Length of the DataSet", len(df))

Length of the DataSet 125973


In [24]:
train_set, val_set, test_set = train_val_test_split(df, stratify = 'protocol_type')

In [25]:
print("Length of the training_set", len(train_set))
print("Length of the validation_set", len(val_set))
print("Length of the test_set", len(test_set))

Length of the training_set 75583
Length of the validation_set 25195
Length of the test_set 25195


#
%matplotlib inline 
import matplotlib.pyplot as plt 
df["protocol_type"].hist()

train_set["protocol_type"].hist()

val_set["protocol_type"].hist()

test_set["protocol_type"].hist()
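
The histograms give a visual check; the same comparison can be made numerically. This is a minimal sketch, assuming a hypothetical helper `proportions_table` and a synthetic `protocol_type` column with an invented 80/15/5 mix of `tcp`, `udp`, and `icmp` (not the real NSL-KDD distribution).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def proportions_table(column, **subsets):
    """Hypothetical helper: share of each category of `column` in every subset."""
    shares = {name: subset[column].value_counts(normalize=True)
              for name, subset in subsets.items()}
    return pd.DataFrame(shares).fillna(0.0)

# Synthetic stand-in for the DataSet (assumed category mix)
toy = pd.DataFrame({"protocol_type": ["tcp"] * 800 + ["udp"] * 150 + ["icmp"] * 50})

# Stratified 60/20/20 partitioning, as in section 4
train, rest = train_test_split(
    toy, test_size=0.4, random_state=42, stratify=toy["protocol_type"])
val, test = train_test_split(
    rest, test_size=0.5, random_state=42, stratify=rest["protocol_type"])

table = proportions_table("protocol_type", full=toy, train=train, val=val, test=test)
print(table)

# Largest difference between any subset and the full DataSet
max_gap = table.sub(table["full"], axis=0).abs().to_numpy().max()
print(max_gap)
```

A `max_gap` close to zero indicates that every subset reproduces the category proportions of the full DataSet.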