# Wrangling Overview
In this notebook, I wrangle the network intrusion dataset in the following steps:

- Set up a Spark virtual environment
- Ingest the full intrusion detection dataset
- Reduce 'smurf' and 'neptune' attack classes by 95% through isolating their rows and sampling
    * Use `sampleBy` with a Python dict mapping each class to its sampling fraction
- Convert dataset to binary classes by combining all attack classes into category 'anomalous'
- Split the reduced dataset 50%-30%-20% for model training, validation, and testing
    * Stratify the target column between train and validate split
    * Sequester the validate and test splits for later use; only use train split for exploration and model fit
    
# Setup

In [1]:
import pandas as pd

import pyspark
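# import only what the notebook uses; a wildcard import shadows Python builtins such as sum, min, and max
from pyspark.sql.functions import when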
spark = pyspark.sql.SparkSession.builder.getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/01/22 22:22:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Ingest data from CSV

In [2]:
# ingest data
df = spark.read.csv('kddcup.data.corrected', header=True)
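A note on column types: with `header=True` alone, Spark reads every CSV column as a string. The sketch below shows one way to ask Spark to infer numeric types at read time, at the cost of an extra pass over the file. The helper name `read_intrusion_csv` is hypothetical and is not part of this notebook or of `wrangle.py`.

```python
def read_intrusion_csv(spark, path="kddcup.data.corrected"):
    """Read the CSV with a header row and let Spark infer column types.

    `spark` is an existing SparkSession. The function name is a hypothetical
    helper for illustration only.
    """
    # inferSchema=True makes Spark scan the data to guess int/double/string types
    return spark.read.csv(path, header=True, inferSchema=True)
```

Calling `read_intrusion_csv(spark).printSchema()` would show the inferred types.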

# Reduce 'smurf' and 'neptune' attack classes by 95% 

In [3]:
# create fractions dataframe for sampleBy
fraction_df = df.select('target').distinct().withColumn(
    'fraction',
    when((df.target == 'neptune.') | (df.target == 'smurf.'), 0.05).otherwise(1)
)

In [4]:
# convert fractions df to dict, use dict in sampleBy
df = df.sampleBy('target', fraction_df.toPandas().set_index('target').to_dict()['fraction'])
# check work
df.count()

                                                                                

1212502
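`sampleBy` keeps each row independently with its class's probability, so the resulting row count is approximate and changes from run to run unless a seed is passed. The sketch below builds the fractions dict directly from the distinct labels, skipping the pandas round trip, and passes a seed. The helper names `build_fractions` and `downsample` and the seed value are assumptions for illustration, not the code in `wrangle.py`.

```python
def build_fractions(labels, reduced=("neptune.", "smurf."), keep=0.05):
    """Map each class label to its sampling fraction.

    Labels listed in `reduced` keep 5% of their rows; all others keep 100%.
    """
    return {label: (keep if label in reduced else 1.0) for label in labels}


def downsample(df, label_col="target", seed=42):
    """Stratified downsample of a Spark DataFrame (hypothetical helper)."""
    # collect the distinct class labels to the driver (a small list)
    labels = [row[label_col] for row in df.select(label_col).distinct().collect()]
    # a fixed seed makes the sample repeatable for the same input and partitioning
    return df.sampleBy(label_col, build_fractions(labels), seed=seed)
```

With a seed in place, `downsample(df).count()` should return the same value on repeated runs over the same input.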

# Convert 'target' column to binary classes

In [5]:
# convert to binary classes
df = df.withColumn('target', when(df.target != 'normal.', 'anomalous').otherwise('normal'))

# Split the dataset into 50%-30%-20%

In [6]:
# split data
train, validate, test = df.randomSplit([0.5, 0.3, 0.2], seed=42)
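The overview mentions stratifying on the target column, while `randomSplit` on its own assigns rows at random without regard to class. One common way to approximate a stratified split in Spark is to split each class separately and union the pieces. The sketch below assumes that approach; the function name `stratified_split` is hypothetical and this is not necessarily how `wrangle.py` does it.

```python
from functools import reduce


def stratified_split(df, label_col="target", weights=(0.5, 0.3, 0.2), seed=42):
    """Split a Spark DataFrame so each split keeps roughly the same class mix.

    Returns one DataFrame per weight. Proportions are approximate because
    randomSplit assigns rows at random.
    """
    # distinct class labels, collected to the driver
    labels = [row[label_col] for row in df.select(label_col).distinct().collect()]
    # split each class on its own with the same weights
    per_class = [
        df.filter(df[label_col] == label).randomSplit(list(weights), seed=seed)
        for label in labels
    ]
    # recombine: the i-th piece of every class forms the i-th split
    return [
        reduce(lambda left, right: left.union(right), [pieces[i] for pieces in per_class])
        for i in range(len(weights))
    ]
```

Usage would mirror the cell above: `train, validate, test = stratified_split(df)`, followed by `train.groupBy('target').count().show()` to confirm the class mix.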

# Check wrangle.py script

In [7]:
import wrangle

In [8]:
df, train, validate, test = wrangle.prep_explore()

                                                                                

In [9]:
print('Row count of df (expect 1.2m):', df.count())
print('Row count of train (expect 600k):', train.count())
print('\nFirst Observation in Train')
train.show(1, vertical=True)

22/01/22 22:23:11 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


Row count of df (expect 1.2m): 1212357


                                                                                

Row count of train (expect 600k): 606923

First Observation in Train


-RECORD 0-----------------------------
 duration                    | 0      
 protocol_type               | icmp   
 service                     | eco_i  
 flag                        | SF     
 src_bytes                   | 30     
 dst_bytes                   | 0      
 land                        | 0      
 wrong_fragment              | 0      
 urgent                      | 0      
 hot                         | 0      
 num_failed_logins           | 0      
 logged_in                   | 0      
 num_compromised             | 0      
 root_shell                  | 0      
 su_attempted                | 0      
 num_root                    | 0      
 num_file_creations          | 0      
 num_shells                  | 0      
 num_access_files            | 0      
 num_outbound_cmds           | 0      
 is_host_login               | 0      
 is_guest_login              | 0      
 count                       | 1      
 srv_count                   | 1      
 serror_rate             

