# <center>**Pre-Processing**</center>

Many of the requirements for this notebook were already fulfilled in the EDA, so I will reuse the code from the EDA notebook for those parts. First, we import the packages and load the data.

## Import Packages

In [1]:
import os

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [3]:
# Find and set the working directory for this project
os.chdir(r'C:\Users\jdrel\OneDrive\Documents\Data_Science\Springboard\Capstone-2')

## Load the Data

In [5]:
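# Import the data
# Note: the header in the output below looks like a data record, so read_csv
# appears to be using the first record as column names. The names are replaced
# in the next cell.
data = pd.read_csv('./data/raw/Full.data.corrected')
# Look at the data
data.head()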

Unnamed: 0,0,tcp,http,SF,215,45076,0.1,0.2,0.3,0.4,...,0.17,0.00.6,0.00.7,0.00.8,0.00.9,0.00.10,0.00.11,0.00.12,0.00.13,normal.
0,0,tcp,http,SF,162,4528,0,0,0,0,...,1,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,normal.
1,0,tcp,http,SF,236,1228,0,0,0,0,...,2,1.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,normal.
2,0,tcp,http,SF,233,2032,0,0,0,0,...,3,1.0,0.0,0.33,0.0,0.0,0.0,0.0,0.0,normal.
3,0,tcp,http,SF,239,486,0,0,0,0,...,4,1.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,normal.
4,0,tcp,http,SF,238,1282,0,0,0,0,...,5,1.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,normal.


In [7]:
# I copied and pasted the column names from the website into this list
data.columns = ["duration", "protocol_type", "service", "flag", "src_bytes",
                "dst_bytes", "land", "wrong_fragment", "urgent", "hot",
                "num_failed_logins", "logged_in", "num_compromised",
                "root_shell", "su_attempted", "num_root", "num_file_creations",
                "num_shells", "num_access_files", "num_outbound_cmds", "is_host_login",
                "is_guest_login", "count", "srv_count", "serror_rate","srv_serror_rate",
                "rerror_rate", "srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate",
                "dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
                "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate", "dst_host_serror_rate", 
                "dst_host_srv_serror_rate", "dst_host_rerror_rate", "dst_host_srv_rerror_rate", "labels"]
data.head()
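
The header row in the `data.head()` output above looks like a data record, which suggests that the default `pd.read_csv` call used the first record as column names. A minimal sketch of one possible alternative, assuming the file really has no header row: pass `header=None` together with `names=` so that every record is kept as data. The two-record CSV text and the shortened `col_names` list are made up for illustration and are not the real file or the full column list.

```python
import io

import pandas as pd

# Hypothetical two-record sample with no header row (not the real file)
raw_text = "0,tcp,http,SF,215,normal.\n0,tcp,http,SF,162,normal.\n"

# Shortened, illustrative column list; the real list has 42 names
col_names = ["duration", "protocol_type", "service", "flag", "src_bytes", "labels"]

# Default behaviour: the first record is consumed as the header
with_default = pd.read_csv(io.StringIO(raw_text))

# header=None keeps every record as data; names= supplies the column names
with_names = pd.read_csv(io.StringIO(raw_text), header=None, names=col_names)

print(len(with_default), len(with_names))
```

With the default call the sample yields one row fewer than with `header=None`, because the first record ends up in the column index instead of the data.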

Between them, 'num_outbound_cmds' and 'is_host_login' contain only 2 nonzero values, and both belong to observations labeled as normal (no intrusion).

In [12]:
data['num_outbound_cmds'].value_counts()

0    4898430
Name: num_outbound_cmds, dtype: int64

In [11]:
data.loc[data['is_host_login'] == 1, 'labels'].value_counts()

normal.    2
Name: labels, dtype: int64

Since neither variable carries useful information for detecting intrusions, I will drop both.

In [13]:
# Use the drop method to get rid of the constant and near-constant columns
# (columns= already identifies the axis, so axis = 1 is not needed)
data = data.drop(columns = ['num_outbound_cmds', 'is_host_login'])
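
The two columns above were found by inspecting `value_counts()` one column at a time. As a rough sketch of how the same check could be run over every column at once, the snippet below measures the share of rows taken by each column's most common value. The toy frame, its values, and the 0.75 cutoff are all assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame standing in for `data`; the values are made up
toy = pd.DataFrame({
    "num_outbound_cmds": [0, 0, 0, 0, 0],
    "is_host_login": [0, 0, 1, 0, 0],
    "src_bytes": [215, 162, 236, 233, 239],
})

# Share of rows taken by the most common value in each column
# (value_counts sorts in descending order, so the first entry is the largest)
top_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Hypothetical cutoff: flag columns where one value covers at least 75% of rows
near_constant = top_share[top_share >= 0.75].index.tolist()
print(near_constant)
```

On the full dataset a much stricter cutoff would be needed, since `is_host_login` has only 2 nonzero values in millions of rows.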

Now we separate the data into the feature variables (X) and the target variable (y). On the features DataFrame we can use the `pd.get_dummies` function to create dummy features for every category in the categorical columns. This makes it possible to analyze all the features for multicollinearity and then use lasso regularization to determine the most important features.

In [None]:
# Work with the features only so that it is easy to test for multicollinearity
X_data = data.drop('labels', axis = 1)

# Find the categorical columns that need to be made numerical for analysis
cat_cols = list(X_data.select_dtypes(include = 'object').columns)

# Create a fully numerical dataset that is usable for analysis.
# drop_first = True drops one dummy per categorical variable; keeping all of them
# would make the dummies of each variable perfectly collinear (they always sum to 1).
x_num_data = pd.get_dummies(X_data, columns = cat_cols, drop_first = True)
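
To make the effect of `drop_first` concrete, here is a small sketch on a made-up frame (the rows are illustrative, not taken from the dataset). Without `drop_first`, the dummies for one categorical variable always sum to 1 in every row, which is exactly the kind of redundancy a multicollinearity check would flag.

```python
import pandas as pd

# Toy frame with one categorical column; the values are made up
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "src_bytes": [215, 162, 236, 233],
})

# One dummy per category
full = pd.get_dummies(toy, columns=["protocol_type"])

# One dummy fewer: the first category in sorted order (icmp) is dropped
reduced = pd.get_dummies(toy, columns=["protocol_type"], drop_first=True)

# The full set of dummies is redundant: each row sums to exactly 1
dummy_cols = [c for c in full.columns if c.startswith("protocol_type_")]
row_sums = full[dummy_cols].sum(axis=1)

print(list(full.columns))
print(list(reduced.columns))
```

The dropped category becomes the implicit baseline: a row with all remaining dummies equal to 0 belongs to it.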

**Skewed Features**

Features on very different scales are a problem: a feature with an arbitrarily large range (and heavily skewed features often have one) can dominate the math and make it hard for a model to describe interactions between features. To fix this we scale the features with `StandardScaler`, which centers each column at zero and rescales it to unit variance. This puts the features on a comparable scale while preserving the shape of each distribution, so a skewed feature remains skewed after scaling.

In [None]:
# Create the scaler object
scaler = StandardScaler()
# Fit the scaler to the dataset
scaler.fit(x_num_data)
# Scale the dataset, keeping the original column names and row index
x_num_data_scaled = pd.DataFrame(scaler.transform(x_num_data),
                                 columns = x_num_data.columns,
                                 index = x_num_data.index)
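
As a quick sanity check on what standardization does and does not change, the sketch below applies the same per-column formula that `StandardScaler` uses by default (subtract the mean, divide by the population standard deviation) to a made-up right-skewed column. The synthetic data and the `log1p` comparison are assumptions added for illustration; the notebook itself only uses `StandardScaler`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed column standing in for something like src_bytes
rng = np.random.default_rng(0)
toy = pd.DataFrame({"src_bytes": rng.exponential(scale=1000.0, size=1000)})

# Standardize by hand: zero mean, unit (population, ddof=0) standard deviation
scaled = (toy - toy.mean()) / toy.std(ddof=0)

# For comparison only: log1p compresses the long right tail
logged = np.log1p(toy)

skew_raw = toy["src_bytes"].skew()
skew_scaled = scaled["src_bytes"].skew()
skew_logged = logged["src_bytes"].skew()
print(skew_raw, skew_scaled, skew_logged)
```

The skewness of the standardized column matches the raw column, because standardization is only a shift and a positive rescale. If the skew itself turned out to matter for a model, a transform such as `log1p` would be one option to try before scaling.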

Next, combine the features that measure SYN errors into a single column, and do the same for the features that measure REJ errors.

In [None]:
# Define the SYN error columns
serror = ['serror_rate', 'srv_serror_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate']

# Define the REJ error columns
rerror = ['rerror_rate', 'srv_rerror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate']

# Drop the redundant columns
df = x_num_data_scaled.drop(columns = [*serror, *rerror])

# Create the new Syn Error column as the row-wise maximum of the former SYN error columns
df['Syn Error'] = x_num_data_scaled[serror].max(axis = 1)

# Create the new Rej Error column as the row-wise maximum of the former REJ error columns
df['Rej Error'] = x_num_data_scaled[rerror].max(axis = 1)
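
The cell above collapses each group of error-rate columns with a row-wise maximum. The sketch below, on made-up values, shows how that differs from a row-wise mean: the maximum keeps a row flagged whenever any one of the rates is high, while the mean dilutes a single high rate.

```python
import pandas as pd

# Toy error-rate columns; the values are made up
toy = pd.DataFrame({
    "serror_rate": [0.0, 1.0, 0.2],
    "srv_serror_rate": [0.0, 0.0, 0.4],
})
cols = ["serror_rate", "srv_serror_rate"]

# Two ways of collapsing the group into one column
combined = pd.DataFrame({
    "max": toy[cols].max(axis=1),
    "mean": toy[cols].mean(axis=1),
})
print(combined)
```

In the second row, one rate is 1.0 and the other is 0.0: the maximum reports 1.0, while the mean reports 0.5.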

Next, drop the remaining multicollinear columns and combine the two same-service rate columns into one.

In [None]:
# Drop these columns, which are multicollinear but have no clear connection to each other
df = df.drop(['srv_count', 'service_ecr_i', 'dst_host_same_src_port_rate'], axis = 1)

# Drop the non-rate column
df = df.drop('dst_host_srv_count', axis = 1)

# Define the same-service rate columns
srvrate = ['dst_host_same_srv_rate', 'same_srv_rate']

# Create the srv_rate column as the row-wise maximum of those columns
df['srv_rate'] = df[srvrate].max(axis = 1)

# Drop the original same-service rate columns
df = df.drop(srvrate, axis = 1)
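
The multicollinear columns dropped above were identified in the EDA. As a rough sketch of one common way to surface such columns, the snippet below lists every pair of columns whose absolute correlation exceeds a cutoff. The synthetic data and the 0.9 threshold are assumptions for illustration, not the method or the values used in the EDA.

```python
import numpy as np
import pandas as pd

# Synthetic data: srv_count is built as a noisy copy of count, hot is independent
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "count": base,
    "srv_count": base + rng.normal(scale=0.05, size=200),
    "hot": rng.normal(size=200),
})

# Absolute pairwise correlations
corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is excluded
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Hypothetical cutoff for "highly correlated"
threshold = 0.9
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

Each flagged pair is a candidate for dropping one member or merging the two, as was done with the error-rate and same-service rate columns.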

In [None]:
# Flat list of column names, so that df[features] selects these columns
features = ['wrong_fragment', 'hot', 'count', 'srv_diff_host_rate', 'dst_host_count', 'protocol_type_udp',
            'service_eco_i', 'service_ftp_data', 'service_smtp', 'flag_RSTR', 'Syn Error', 'Rej Error']
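
Selecting columns with `df[...]` expects a flat list of column labels, and it raises a `KeyError` if any label is missing. A small sketch of checking a feature list against a DataFrame before indexing, using a made-up frame in which one requested column is deliberately absent:

```python
import pandas as pd

# Toy frame; 'Rej Error' is intentionally left out to show the check
toy = pd.DataFrame({
    "hot": [0.1, 0.2],
    "count": [1.0, 2.0],
    "Syn Error": [0.0, 1.0],
})
features = ["hot", "count", "Syn Error", "Rej Error"]

# Split the requested names into those the frame has and those it lacks
missing = [name for name in features if name not in toy.columns]
present = [name for name in features if name in toy.columns]

# Indexing with only the present names is safe
subset = toy[present]
print(missing)
```

Reporting `missing` up front gives a clearer message than the `KeyError` raised by the indexing step itself.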

Given how long the EDA took to run on the bigger dataset, I have decided to use only the features selected from the smaller DataFrame. Projects like this always have more work to do, but my computer does not have enough power to analyze the larger dataset in a reasonable time frame.

In [None]:
# Build the training DataFrame from the features with nonzero lasso coefficients
X_train = df[features]
# Generate y_train
y_train = data['labels']
# Take a look at the dataset (last expression in the cell, so it is displayed)
X_train.head()