# **Pre-Processing**

The dataset I have worked with so far is only 10% of the original dataset. In this notebook I run the full dataset through the same process I used on the 10% version. I believe that GPU-friendly models will make it feasible to train on the full dataset; if not, I can revert to the 10% version.

Since this is a competition dataset, the data was already split into training and test sets.

#### **Import Packages**

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler

import os

#### **Set Directory**

In [2]:
# Find and set the working directory for this project
os.chdir(r'C:\Users\jdrel\OneDrive\Documents\Data_Science\Springboard\Capstone-2')

#### **Load the Data**

In [13]:
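# Import the data. The column names are assigned later from a list, which suggests the raw
# file has no header row (an assumption); header = None keeps the first record from being
# consumed as a header.
data = pd.read_csv('./data/raw/Full.data.corrected', header = None)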

# Load the columns of small_df
small_df = pd.read_csv('./data/interim/Small_df.csv', nrows = 1)

# Load columns of big_df
big_df = pd.read_csv('./data/interim/Big_df.csv', nrows = 1)
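
Only the column names of the two interim files are needed at this point, not their rows. As a minimal sketch of that idea, the snippet below reads just the header of a CSV and derives the feature names from it. The in-memory CSV and its three columns are made up for illustration; the real interim files are not shown in this notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for an interim CSV such as Small_df.csv
csv_text = "duration,src_bytes,labels\n0,181,normal.\n0,239,normal.\n"

# nrows=0 parses only the header row, so no data rows are loaded
header_only = pd.read_csv(io.StringIO(csv_text), nrows=0)

# The feature names are every column except the target column
feature_names = [col for col in header_only.columns if col != "labels"]
print(feature_names)
```

Reading with `nrows = 1`, as in the cell above, gives the same column names; `nrows=0` simply skips the data row.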

#### **Process Function**

In [23]:
def processing(dfx):
    # I copied and pasted the column names from the website into this list
    dfx.columns = ["duration", "protocol_type", "service", "flag", "src_bytes",
                    "dst_bytes", "land", "wrong_fragment", "urgent", "hot",
                    "num_failed_logins", "logged_in", "num_compromised",
                    "root_shell", "su_attempted", "num_root", "num_file_creations",
                    "num_shells", "num_access_files", "num_outbound_cmds", "is_host_login",
                    "is_guest_login", "count", "srv_count", "serror_rate","srv_serror_rate",
                    "rerror_rate", "srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate",
                    "dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
                    "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate", "dst_host_serror_rate", 
                    "dst_host_srv_serror_rate", "dst_host_rerror_rate", "dst_host_srv_rerror_rate", "labels"]
    
    # 'num_outbound_cmds' and 'is_host_login' contain only 2 nonzero values between them,
    # and both belong to observations labeled normal.
    # Drop these near-constant columns
    dfx = dfx.drop(columns = ['num_outbound_cmds', 'is_host_login'])
    '''
    Now we separate the data into the feature variables (X) and the target variable (y). With the features dataframe we can use 
    pd.get_dummies to create dummy features for every category in the categorical columns. This makes it possible 
    to check all the features for multicollinearity and then use lasso regularization to determine the most important features.
    '''
    # Only use the X data so that it is easy to test for multicollinearity
    X_data = dfx.drop('labels', axis = 1)

    # Find the categorical columns that need to be made numerical for analysis
    cat_cols = list(X_data.select_dtypes(include = 'object').columns)

    # Creating a completely numerical dataset that is usable for analysis
    x_num_data = pd.get_dummies(X_data, columns = cat_cols, 
                            # When testing for multicollinearity it is important to drop one dummy per category;
                            # otherwise the full set of dummies is perfectly collinear and gets flagged
                                drop_first = True)
    
    '''
    Features measured on very different scales are a problem: a feature with large raw values can dominate the math and make it
    hard for models to describe interactions between features. To fix this we standardize the features with StandardScaler, which
    rescales each one to zero mean and unit variance. This preserves the shape of each feature's distribution, so it does not remove skew.
    '''
    # Create the scaler object
    scaler = StandardScaler()
    # fit the scaler to the dataset
    scaler.fit(x_num_data)
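    # Scale the dataset, keeping the original row index so the rows still line up with the labels
    x_num_data_scaled = pd.DataFrame(scaler.transform(x_num_data), columns = x_num_data.columns,
                                     index = x_num_data.index)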

    '''
    Creating the features 'Syn Error' and 'Rej Error'
    '''
    # Define the Syn Error columns
    serror = ['serror_rate', 'srv_serror_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate']

    # Define the Rej Error columns
    rerror = ['rerror_rate', 'srv_rerror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate']

    # Drop the redundant columns
    df = x_num_data_scaled.drop(columns = [*serror, *rerror])

    # Create the new 'Syn Error' column as the row-wise maximum of the former Syn Error columns
    df['Syn Error'] = x_num_data_scaled[serror].max(axis = 1)

    # Create the new 'Rej Error' column as the row-wise maximum of the former Rej Error columns
    df['Rej Error'] = x_num_data_scaled[rerror].max(axis = 1)

    '''
    Get the lists of feature columns used in the small and big dataframes
    '''
    # Create the list of all the features that were in small_df
    small_features = list(small_df.columns)
    # Drop the target column 'labels' from that list
    small_features.remove('labels')
    # Create the list of all the features that were in big_df
    big_features = list(big_df.columns)
    # Drop the target column 'labels' from that list
    big_features.remove('labels')

    '''
    Now it is time to create the dataset
    '''
    #Store as a global variable since it is the end product
    global X_small
    # Generate the training dataframe for features that determine intrusion
    X_small = df[small_features]
    #Store as a global variable since it is the end product
    global X_big
    # Generate the training dataframe for features that determine type of intrusion
    X_big = df[big_features]
    #Store as a global variable since it is the end product
    global y
    # Generate the target from the dataframe that was passed in, rather than relying on the global 'data'
    y = dfx['labels']
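
To make the three main steps of the function easier to follow, here is a small sketch on toy data: one-hot encoding with `drop_first`, standardizing with StandardScaler, and collapsing a group of error-rate columns into a single row-wise maximum. The toy values are invented for illustration, and `dtype=float` is my own addition so the dummies are plain floats before scaling; the function above does not pass it.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (made up): one categorical column and three numeric columns
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "src_bytes": [100, 2000, 50, 300],
    "serror_rate": [0.0, 1.0, 0.0, 0.5],
    "srv_serror_rate": [0.0, 0.9, 0.1, 0.5],
})

# One-hot encode the categorical column; drop_first removes one dummy per category
encoded = pd.get_dummies(toy, columns=["protocol_type"], drop_first=True, dtype=float)

# Standardize every column to zero mean and unit variance
scaled = pd.DataFrame(
    StandardScaler().fit_transform(encoded),
    columns=encoded.columns,
    index=encoded.index,
)

# Replace the group of Syn Error columns with their row-wise maximum
serror = ["serror_rate", "srv_serror_rate"]
combined = scaled.drop(columns=serror)
combined["Syn Error"] = scaled[serror].max(axis=1)

print(combined.columns.tolist())
```

Note that the maximum is taken after scaling, as in the function above, so 'Syn Error' is the largest standardized value in each row rather than the largest raw rate.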

In [None]:
processing(data)
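
Because the function fits its own StandardScaler every time it is called, running it separately on the test split would scale the test data with different statistics than the training data. One common way around this, sketched below under my own assumptions, is to fit the scaler once on the training features and reuse it for the test features. The two small frames are invented stand-ins for the competition splits.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up numeric frames standing in for the training and test splits
train = pd.DataFrame({"src_bytes": [0.0, 100.0, 200.0], "count": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"src_bytes": [100.0, 400.0], "count": [2.0, 5.0]})

# Fit once, on the training split only
scaler = StandardScaler().fit(train)

# Reuse the same fitted scaler for both splits
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

print(test_scaled)
```

With this approach a test value equal to the training mean maps to zero, which would not generally hold if the test split were scaled with its own mean.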