## Description

This notebook shows some of the most commonly used techniques for transforming a data set.

### Imports

In [1]:
from scipy.io import arff
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split



### Auxiliary functions

In [2]:
def load_kdd_dataset(data_path):
    data = arff.loadarff(data_path)
    df = pd.DataFrame(data[0])
    return df

In [3]:
def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)

### 1. Reading the data set

In [4]:
df = load_kdd_dataset("../datasets/NSL-KDD/KDDTrain+.arff")

In [5]:
df

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,class
0,0.0,b'tcp',b'ftp_data',b'SF',491.0,0.0,b'0',0.0,0.0,0.0,...,25.0,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,b'normal'
1,0.0,b'udp',b'other',b'SF',146.0,0.0,b'0',0.0,0.0,0.0,...,1.0,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00,b'normal'
2,0.0,b'tcp',b'private',b'S0',0.0,0.0,b'0',0.0,0.0,0.0,...,26.0,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00,b'anomaly'
3,0.0,b'tcp',b'http',b'SF',232.0,8153.0,b'0',0.0,0.0,0.0,...,255.0,1.00,0.00,0.03,0.04,0.03,0.01,0.00,0.01,b'normal'
4,0.0,b'tcp',b'http',b'SF',199.0,420.0,b'0',0.0,0.0,0.0,...,255.0,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,b'normal'
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125968,0.0,b'tcp',b'private',b'S0',0.0,0.0,b'0',0.0,0.0,0.0,...,25.0,0.10,0.06,0.00,0.00,1.00,1.00,0.00,0.00,b'anomaly'
125969,8.0,b'udp',b'private',b'SF',105.0,145.0,b'0',0.0,0.0,0.0,...,244.0,0.96,0.01,0.01,0.00,0.00,0.00,0.00,0.00,b'normal'
125970,0.0,b'tcp',b'smtp',b'SF',2231.0,384.0,b'0',0.0,0.0,0.0,...,30.0,0.12,0.06,0.00,0.00,0.72,0.00,0.01,0.00,b'normal'
125971,0.0,b'tcp',b'klogin',b'S0',0.0,0.0,b'0',0.0,0.0,0.0,...,8.0,0.03,0.05,0.00,0.00,1.00,1.00,0.00,0.00,b'anomaly'


### 2. Splitting of the data set

In [6]:
train_set, val_set, test_set = train_val_test_split(df, stratify='protocol_type')

In [7]:
print("Training Set Length:", len(train_set))
print("Validation Set Length:", len(val_set))
print("Test Set Length:", len(test_set))

Training Set Length: 75583
Validation Set Length: 25195
Test Set Length: 25195
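
The two-step split above (40% held out, then halved) yields roughly a 60/20/20 partition. A minimal sketch on made-up data, assuming a toy frame with an 80/20 mix of two protocol values, illustrates how `stratify` keeps that mix in every partition. The names `toy_df`, `toy_train`, `toy_val` and `toy_test` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 80% "tcp" and 20% "udp"
toy_df = pd.DataFrame({
    "protocol_type": ["tcp"] * 800 + ["udp"] * 200,
    "value": np.arange(1000),
})

# First split: 60% train, 40% held out
toy_train, toy_rest = train_test_split(
    toy_df, test_size=0.4, random_state=42, stratify=toy_df["protocol_type"])
# Second split: the held-out 40% is halved into validation and test
toy_val, toy_test = train_test_split(
    toy_rest, test_size=0.5, random_state=42, stratify=toy_rest["protocol_type"])

# Print the size of each partition and its share of "tcp" rows
for name, part in [("train", toy_train), ("val", toy_val), ("test", toy_test)]:
    tcp_share = (part["protocol_type"] == "tcp").mean()
    print(name, len(part), round(tcp_share, 2))
```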


### 3. Cleaning the data

Before we start, let's take the unmodified training set and separate the labels from the rest of the data, because we don't necessarily want to apply the same transformations to both.

In [None]:
# We separate the input features from the output feature
X_train = train_set.drop("class", axis=1)
y_train = train_set["class"].copy()

In [None]:
# To illustrate this section we are going to add some null values to some features of the dataset
X_train.loc[(X_train["src_bytes"]>400) & (X_train["src_bytes"]<800), "src_bytes"] = np.nan
X_train.loc[(X_train["dst_bytes"]>500) & (X_train["dst_bytes"]<2000), "dst_bytes"] = np.nan
X_train

Most machine learning algorithms cannot work with features that contain null values. There are three common options for dealing with them:

* Delete the corresponding rows
* Delete the corresponding attribute (column)
* Fill them with a given value (zero, mean...)

In [None]:
# Check if there is any attribute with null values
X_train.isna().any()

In [None]:
# Select the rows that contain null values
filas_valores_nulos  = X_train[X_train.isnull().any(axis=1)]
filas_valores_nulos

#### Option 1: We delete the rows with null values

In [None]:
# We copy the dataset so as not to alter the original
X_train_copy = X_train.copy()

In [None]:
# remove rows with null values
X_train_copy.dropna(subset=["src_bytes", "dst_bytes"], inplace=True)
X_train_copy

In [None]:
# Count the number of rows deleted
print("The number of rows removed is:", len(X_train) - len(X_train_copy))

#### Option 2: We remove the attributes with null values

In [None]:
# We copy the dataset so as not to alter the original
X_train_copy = X_train.copy()

In [None]:
# Remove attributes with null values
X_train_copy.drop(["src_bytes", "dst_bytes"], axis=1, inplace=True)
X_train_copy

In [None]:
# Count the number of attributes removed
print("The number of attributes removed is:", len(list(X_train)) - len(list(X_train_copy)))

#### Option 3: We fill the null values with a certain value

In [None]:
# We copy the dataset so as not to alter the original
X_train_copy = X_train.copy()

In [None]:
# We fill the null values with the average of the attribute values
media_srcbytes = X_train_copy["src_bytes"].mean()
media_dstbytes = X_train_copy["dst_bytes"].mean()

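# Assign the result back: fillna(inplace=True) on a selected column is deprecated in recent pandas versions
X_train_copy["src_bytes"] = X_train_copy["src_bytes"].fillna(media_srcbytes)
X_train_copy["dst_bytes"] = X_train_copy["dst_bytes"].fillna(media_dstbytes)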

X_train_copy

In [None]:
# We copy the dataset so as not to alter the original
X_train_copy = X_train.copy()

In [None]:
# A single very high value in the attribute can skew the mean
# We fill the null values with the median instead
mediana_srcbytes = X_train_copy["src_bytes"].median()
mediana_dstbytes = X_train_copy["dst_bytes"].median()

X_train_copy["src_bytes"] = X_train_copy["src_bytes"].fillna(mediana_srcbytes)
X_train_copy["dst_bytes"] = X_train_copy["dst_bytes"].fillna(mediana_dstbytes)

X_train_copy
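
To see why the median can be a safer fill value than the mean, here is a small sketch on invented numbers (the series `toy_bytes` is hypothetical and not taken from the data set): one extreme value pulls the mean far away from the typical values, while the median barely moves.

```python
import numpy as np
import pandas as pd

# Hypothetical byte counts: mostly small values, one missing value, one extreme outlier
toy_bytes = pd.Series([100.0, 120.0, 110.0, np.nan, 1_000_000.0])

print("mean:  ", toy_bytes.mean())    # dominated by the outlier
print("median:", toy_bytes.median())  # stays close to the typical values

# fillna returns a new Series, so the original is left untouched
filled_mean = toy_bytes.fillna(toy_bytes.mean())
filled_median = toy_bytes.fillna(toy_bytes.median())
print(filled_mean.iloc[3], filled_median.iloc[3])
```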

#### An alternative to option 3 is to use sklearn's SimpleImputer class

In [None]:
# We copy the dataset so as not to alter the original
X_train_copy = X_train.copy()

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

In [None]:
# SimpleImputer with the median strategy does not support categorical values, so we keep only the numerical attributes
X_train_copy_num = X_train_copy.select_dtypes(exclude=['object'])
X_train_copy_num.info()

In [None]:
# We pass the numerical attributes to the imputer so that it computes the median of each one
imputer.fit(X_train_copy_num)

In [None]:
# fill in the null values
X_train_copy_num_nonan = imputer.transform(X_train_copy_num)

In [None]:
# We transform the result to a Pandas DataFrame, keeping the original row index
X_train_copy = pd.DataFrame(X_train_copy_num_nonan, columns=X_train_copy_num.columns, index=X_train_copy_num.index)

In [None]:
X_train_copy.head(10)

#### sklearn APIs

* **Estimators**: Any object that can estimate some parameter from a data set:  
    * The estimation is performed by the fit() method, which always takes a data set as an argument.  
    * Any other parameter that guides the estimation is a hyperparameter (for example, the imputer's strategy).  
* **Transformers**: Estimators that can also transform the data set (such as SimpleImputer).  
    * The transformation is done using the transform() method.  
    * It receives a data set as its input parameter.  
* **Predictors**: Estimators that can make predictions.  
    * Prediction is done using the predict() method.
    * It receives a data set as input.
    * It returns a data set with the predictions.
    * They have a score() method to evaluate the quality of the predictions.
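
The three roles can be sketched on a tiny made-up data set. This is only an illustration: `X_toy`, `y_toy`, `toy_imputer` and `clf` are hypothetical names, and `LogisticRegression` is an arbitrary choice of predictor that is not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Toy data (invented for illustration): two features, one missing value
X_toy = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y_toy = np.array([0, 0, 1, 1])

# Estimator: fit() learns parameters from the data.
# "strategy" is a hyperparameter, set in the constructor.
toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(X_toy)
print(toy_imputer.statistics_)  # per-column medians learned by fit()

# Transformer: transform() returns the transformed data set
X_toy_filled = toy_imputer.transform(X_toy)

# Predictor: predict() returns predictions, score() evaluates them
clf = LogisticRegression()
clf.fit(X_toy_filled, y_toy)
predictions = clf.predict(X_toy_filled)
accuracy = clf.score(X_toy_filled, y_toy)
print(predictions, accuracy)
```

Learned parameters are exposed as attributes with a trailing underscore, such as `statistics_` here.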

### 4. Transformation of categorical attributes to numeric

In [None]:
X_train = train_set.drop("class", axis=1)
y_train = train_set["class"].copy()

Machine learning algorithms generally require numerical input. Our data set contains many categorical values, so we need to convert them to numeric ones.

In [None]:
X_train.info()

There are different ways to convert categorical attributes to numeric. Probably the simplest is the one provided by Pandas' **factorize** method, which maps each category to a sequential integer.

In [None]:
protocol_type = X_train['protocol_type']
protocol_type_encoded, categorias = protocol_type.factorize()

In [None]:
# We show on the screen how they have been encoded
for i in range(10):
    print(protocol_type.iloc[i], "=", protocol_type_encoded[i])

In [None]:
print(categorias)

#### Advanced transformations using sklearn

##### Ordinal Encoding  
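Performs an encoding similar to Pandas' **factorize** method, although OrdinalEncoder sorts the categories before numbering them and returns a 2-D array of floats.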

In [None]:
from sklearn.preprocessing import OrdinalEncoder

protocol_type = X_train[['protocol_type']]

ordinal_encoder = OrdinalEncoder()
protocol_type_encoded = ordinal_encoder.fit_transform(protocol_type)

In [None]:
# We show on the screen how they have been encoded
for i in range(10):
    print(protocol_type["protocol_type"].iloc[i], "=", protocol_type_encoded[i])

In [None]:
print(ordinal_encoder.categories_)
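
The two encodings can number the same values differently. The sketch below uses a hypothetical four-row frame (`toy_protocols`, with plain strings rather than the byte strings of the real data) to compare them: `factorize` numbers categories in order of first appearance, whereas `OrdinalEncoder` sorts them first.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample, chosen so that the order of appearance differs from the sorted order
toy_protocols = pd.DataFrame({"protocol_type": ["udp", "tcp", "udp", "icmp"]})

# factorize: 1-D integer codes, categories in order of first appearance
codes_factorize, uniques = toy_protocols["protocol_type"].factorize()

# OrdinalEncoder: 2-D float codes, categories sorted
toy_encoder = OrdinalEncoder()
codes_ordinal = toy_encoder.fit_transform(toy_protocols)

print(codes_factorize, list(uniques))
print(codes_ordinal.ravel(), toy_encoder.categories_)
```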

The problem with this type of encoding is that ML algorithms that measure the similarity of two points by their distance will treat category 1 as closer to category 2 than to category 3, which makes no sense for unordered categorical values such as these. For that reason, other encoding methods are used, such as One-Hot encoding.
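
The distance argument can be checked numerically. This is a minimal sketch on three illustrative category names (plain strings, assumed for the example); `dist` is a hypothetical helper that computes the Euclidean distance between two encoded rows.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Three unordered categories, one per row
protocols = np.array([["icmp"], ["tcp"], ["udp"]], dtype=object)

ordinal_codes = OrdinalEncoder().fit_transform(protocols)
onehot_codes = OneHotEncoder().fit_transform(protocols).toarray()

def dist(a, b):
    """Euclidean distance between two encoded rows."""
    return float(np.linalg.norm(a - b))

# Ordinal codes: the first category looks twice as far from the third as from the second
print(dist(ordinal_codes[0], ordinal_codes[1]), dist(ordinal_codes[0], ordinal_codes[2]))
# One-hot codes: every pair of distinct categories is equally far apart
print(dist(onehot_codes[0], onehot_codes[1]), dist(onehot_codes[0], onehot_codes[2]))
```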

##### One-Hot Encoding  
Generates, for each value of the categorical attribute, a binary vector with one position per category and a 1 in the position of that value's category.

In [None]:
# The sparse matrix only stores the position of values that are not '0' to save memory
from sklearn.preprocessing import OneHotEncoder

protocol_type = X_train[['protocol_type']]

oh_encoder = OneHotEncoder()
protocol_type_oh = oh_encoder.fit_transform(protocol_type)
protocol_type_oh

In [None]:
# Convert the sparse matrix to a Numpy array
protocol_type_oh.toarray()

In [None]:
# We show on the screen how they have been encoded
for i in range(10):
    print(protocol_type["protocol_type"].iloc[i], "=", protocol_type_oh.toarray()[i])

In [None]:
print(oh_encoder.categories_)

On many occasions, when partitioning the data set or making predictions on new examples, category values appear that were not seen when the encoder was fitted, and these produce an error in the transform() method. The OneHotEncoder class provides the handle_unknown parameter to either raise an error or ignore an unknown category encountered during the transformation (the default is to raise an error).  

When this parameter is set to "ignore" and an unknown category is encountered during the transformation, the resulting encoded columns for this feature will be all zeros. In the reverse transformation, an unknown category will be denoted as None.

In [None]:
oh_encoder = OneHotEncoder(handle_unknown='ignore')
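
The behaviour described above can be reproduced on a small invented example. The arrays `train_protocols` and `new_protocols` are hypothetical; the second one contains a value that the encoder never saw during `fit`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_protocols = np.array([["tcp"], ["udp"]], dtype=object)
new_protocols = np.array([["tcp"], ["icmp"]], dtype=object)  # "icmp" is unseen

# Default behaviour (handle_unknown="error"): transform() raises ValueError
strict_encoder = OneHotEncoder()
strict_encoder.fit(train_protocols)
strict_failed = False
try:
    strict_encoder.transform(new_protocols)
except ValueError as err:
    strict_failed = True
    print("transform failed:", err)

# handle_unknown="ignore": the unknown category is encoded as a row of zeros
tolerant_encoder = OneHotEncoder(handle_unknown="ignore")
tolerant_encoder.fit(train_protocols)
encoded = tolerant_encoder.transform(new_protocols).toarray()
print(encoded)

# The reverse transformation maps the all-zero row to None
decoded = tolerant_encoder.inverse_transform(encoded)
print(decoded)
```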

##### Get Dummies  
get_dummies is an easy-to-use Pandas function that applies One-Hot Encoding directly to a Pandas DataFrame or Series.

In [None]:
pd.get_dummies(X_train['protocol_type'])

### 5. Dataset Scaling

In [None]:
X_train = train_set.drop("class", axis=1)
y_train = train_set["class"].copy()

Many machine learning algorithms do not behave well when the features they receive as input have very different ranges. Different scaling techniques are used to address this. It is important to note that these scaling mechanisms should not be applied to the labels.  
* **Normalization**: The attribute values are rescaled so that they fall between 0 and 1.  
* **Standardization**: The mean is subtracted from the attribute values and the result is divided by the standard deviation, so the values end up with zero mean and unit variance but are not bound to a specific range.  
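
As a rough illustration of the two techniques, the sketch below applies sklearn's `MinMaxScaler` (normalization) and `StandardScaler` (standardization) to a made-up column. Neither class is used elsewhere in this notebook, and `toy_values` is a hypothetical array.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature with five evenly spaced values
toy_values = np.array([[0.0], [10.0], [20.0], [30.0], [40.0]])

normalized = MinMaxScaler().fit_transform(toy_values)
standardized = StandardScaler().fit_transform(toy_values)

print(normalized.ravel())    # rescaled to the range [0, 1]
print(standardized.ravel())  # zero mean and unit variance, no fixed bounds
```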

***It is important that the scaler is fitted only on the training data set. The fitted transformation is then applied, unchanged, to the validation and test data sets.***

In [None]:
from sklearn.preprocessing import RobustScaler

scale_attrs = X_train[['src_bytes', 'dst_bytes']]

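# RobustScaler centers on the median and scales by the interquartile range, so it is less sensitive to outliers
robust_scaler = RobustScaler()
X_train_scaled = robust_scaler.fit_transform(scale_attrs)

# Keep the original row index so the scaled values line up with X_train
X_train_scaled = pd.DataFrame(X_train_scaled, columns=['src_bytes', 'dst_bytes'], index=scale_attrs.index)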

In [None]:
X_train_scaled.head(10)

In [None]:
X_train.head(10)
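
The rule about fitting only on the training set can be sketched as follows, using two small invented frames (`toy_train` and `toy_test` are hypothetical). The scaler learns its median and interquartile range from the training rows alone and then reuses them on the other set.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical training and test partitions with their own row indices
toy_train = pd.DataFrame({"src_bytes": [0.0, 100.0, 200.0, 300.0, 400.0]},
                         index=[10, 11, 12, 13, 14])
toy_test = pd.DataFrame({"src_bytes": [200.0, 1000.0]}, index=[20, 21])

toy_scaler = RobustScaler()
toy_scaler.fit(toy_train)  # statistics come from the training set only

# transform() reuses the training statistics; fit() is never called on the test set
toy_train_scaled = pd.DataFrame(toy_scaler.transform(toy_train),
                                columns=toy_train.columns, index=toy_train.index)
toy_test_scaled = pd.DataFrame(toy_scaler.transform(toy_test),
                               columns=toy_test.columns, index=toy_test.index)

print(toy_scaler.center_, toy_scaler.scale_)
print(toy_test_scaled)
```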