## Spark Group Assignment

Group O-2-8

Attacks fall into four main categories:

* DOS: denial-of-service, e.g. syn flood;
* R2L: unauthorized access from a remote machine, e.g. guessing password;
* U2R:  unauthorized access to local superuser (root) privileges, e.g., various buffer overflow attacks;
* probing: surveillance and other probing, e.g., port scanning.

It is important to note that the test data is not from the same probability distribution as the training data, and it includes specific attack types not in the training data.  This makes the task more realistic.  Some intrusion experts believe that most novel attacks are variants of known attacks and the "signature" of known attacks can be sufficient to catch novel variants.  The datasets contain a total of 24 training attack types, with an additional 14 types in the test data only. 

Agenda
1. Load Data
2. Inspect Data
3. Preprocess Data
4. Create A Model
5. Make Predictions
6. Evaluate Predictions

### 1. Spark Setup

In [1]:
import os
print(os.environ['SPARK_HOME'])
dataset_path="/home/ubuntu/challenge_1/"

/usr/local/software/spark


In [2]:
import pandas as pd

In [3]:
#import findspark
#findspark.init()
import pyspark

In [4]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Dataset") \
    .getOrCreate()

In [5]:
spark.version

'2.2.0'

### 2. Data Loading

Data inspection shows that the data does not have a header.

In [6]:
# ---------
# Option 1: Use SparkSession and infer schema, then add a header
# ---------

df = spark.read \
    .option("inferSchema", "true") \
    .csv("file://"+dataset_path+"full.data")

In [7]:
features=["duration", "protocol_type", "service", "flag", "src_bytes","dst_bytes", \
          "land","wrong_fragment","urgent","hot","num_failed_logins","logged_in", \
          "num_compromised","root_shell","su_attempted","num_root","num_file_creations", \
          "num_shells","num_access_files","num_outbound_cmds","is_host_login","is_guest_login", \
          "count","srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate",\
          "same_srv_rate","diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count", \
          "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate", \
          "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate",\
          "dst_host_srv_rerror_rate"]

target=["connection"]

fieldnames=features+target

rawnames=df.schema.names

# Create a small function
def updateColNames(df,oldnames,newnames):
    for i in range(len(newnames)):
        df=df.withColumnRenamed(oldnames[i], newnames[i])
    return df

df=updateColNames(df,rawnames,fieldnames)

In [None]:
# @ADOLFO: Please check if it works for you.

# pd_df = df.toPandas()

### 3. Data Inspection


* How many records do we have?
* What is the schema of our data?
* Is it numerical or categorical?
* Visualize your data

In [9]:
# Print the number of records in the data frame
print('Nb. of records  : %d' % df.count())

Nb. of records  : 4898431


In [10]:
# Check the Schema
df.printSchema()

root
 |-- duration: integer (nullable = true)
 |-- protocol_type: string (nullable = true)
 |-- service: string (nullable = true)
 |-- flag: string (nullable = true)
 |-- src_bytes: integer (nullable = true)
 |-- dst_bytes: integer (nullable = true)
 |-- land: integer (nullable = true)
 |-- wrong_fragment: integer (nullable = true)
 |-- urgent: integer (nullable = true)
 |-- hot: integer (nullable = true)
 |-- num_failed_logins: integer (nullable = true)
 |-- logged_in: integer (nullable = true)
 |-- num_compromised: integer (nullable = true)
 |-- root_shell: integer (nullable = true)
 |-- su_attempted: integer (nullable = true)
 |-- num_root: integer (nullable = true)
 |-- num_file_creations: integer (nullable = true)
 |-- num_shells: integer (nullable = true)
 |-- num_access_files: integer (nullable = true)
 |-- num_outbound_cmds: integer (nullable = true)
 |-- is_host_login: integer (nullable = true)
 |-- is_guest_login: integer (nullable = true)
 |-- count: integer (nullable = true

In [11]:
# Some stats on numerical features
df.select('duration','hot').describe().show()

+-------+-----------------+--------------------+
|summary|         duration|                 hot|
+-------+-----------------+--------------------+
|  count|          4898431|             4898431|
|   mean|48.34243046395876|0.012437656057623349|
| stddev|723.3298112546812|  0.4689781645888025|
|    min|                0|                   0|
|    max|            58329|                  77|
+-------+-----------------+--------------------+



In [None]:
# Somehow this is not working ?!

# Just to check that Pandas is installed and working
# pd_df=df.toPandas()

In [None]:
# Create a table for SQL access
# df.registerTempTable("train_data")

In [None]:
# df.describe().toPandas().to_csv("data_summary")

In [12]:
# How many distinct protocol types we have
df.groupby('protocol_type').count().show()

+-------------+-------+
|protocol_type|  count|
+-------------+-------+
|          tcp|1870598|
|          udp| 194288|
|         icmp|2833545|
+-------------+-------+



In [13]:
# How many distinct services we have
df.groupby('service').count().show()

+---------+-----+
|  service|count|
+---------+-----+
|   telnet| 4277|
|      ftp| 5214|
|     auth| 3382|
| iso_tsap| 1052|
|   systat| 1056|
|     name| 1067|
|  sql_net| 1052|
|    ntp_u| 3833|
|      X11|  135|
|    pop_3| 1981|
|     ldap| 1041|
|  discard| 1059|
|   tftp_u|    3|
|   Z39_50| 1078|
|  daytime| 1056|
| domain_u|57782|
|    login| 1045|
|     smtp|96554|
|http_2784|    1|
|      mtp| 1076|
+---------+-----+
only showing top 20 rows



In [14]:
# How many distinct flags we have
df.groupby('flag').count().show()

+------+-------+
|  flag|  count|
+------+-------+
|RSTOS0|    122|
|    S3|     50|
|    SF|3744328|
|    S0| 869829|
|   OTH|     57|
|   REJ| 268874|
|  RSTO|   5344|
|  RSTR|   8094|
|    SH|   1040|
|    S2|    161|
|    S1|    532|
+------+-------+



In [16]:
df.groupby('connection').count()\
    .orderBy('count', ascending =False)\
    .show(100)

+----------------+-------+
|      connection|  count|
+----------------+-------+
|          smurf.|2807886|
|        neptune.|1072017|
|         normal.| 972781|
|          satan.|  15892|
|        ipsweep.|  12481|
|      portsweep.|  10413|
|           nmap.|   2316|
|           back.|   2203|
|    warezclient.|   1020|
|       teardrop.|    979|
|            pod.|    264|
|   guess_passwd.|     53|
|buffer_overflow.|     30|
|           land.|     21|
|    warezmaster.|     20|
|           imap.|     12|
|        rootkit.|     10|
|     loadmodule.|      9|
|      ftp_write.|      8|
|       multihop.|      7|
|            phf.|      4|
|           perl.|      3|
|            spy.|      2|
+----------------+-------+
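
The label counts above can be rolled up into the four attack categories from the introduction. The sketch below is plain Python rather than Spark, and the label-to-category mapping is an assumption taken from the standard KDD Cup 1999 grouping (it is not defined anywhere in this notebook). The counts are copied from the output above; note that every label ends with a period.

```python
# Counts copied from the groupby('connection') output above.
connection_counts = {
    "smurf.": 2807886, "neptune.": 1072017, "normal.": 972781,
    "satan.": 15892, "ipsweep.": 12481, "portsweep.": 10413,
    "nmap.": 2316, "back.": 2203, "warezclient.": 1020,
    "teardrop.": 979, "pod.": 264, "guess_passwd.": 53,
    "buffer_overflow.": 30, "land.": 21, "warezmaster.": 20,
    "imap.": 12, "rootkit.": 10, "loadmodule.": 9, "ftp_write.": 8,
    "multihop.": 7, "phf.": 4, "perl.": 3, "spy.": 2,
}

# ASSUMPTION: standard KDD Cup 1999 grouping of attack labels.
CATEGORY_MEMBERS = {
    "dos": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "probe": ["ipsweep", "nmap", "portsweep", "satan"],
    "r2l": ["ftp_write", "guess_passwd", "imap", "multihop", "phf",
            "spy", "warezclient", "warezmaster"],
    "u2r": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
}
ATTACK_TO_CATEGORY = {
    attack: category
    for category, attacks in CATEGORY_MEMBERS.items()
    for attack in attacks
}

def category_of(label):
    """Map a raw label such as 'smurf.' to 'dos', 'probe', 'r2l', 'u2r' or 'normal'."""
    name = label.rstrip(".")  # raw labels carry a trailing period
    if name == "normal":
        return "normal"
    return ATTACK_TO_CATEGORY[name]

# Sum the record counts per category.
category_totals = {}
for label, count in connection_counts.items():
    cat = category_of(label)
    category_totals[cat] = category_totals.get(cat, 0) + count

total_records = sum(category_totals.values())
for cat, count in sorted(category_totals.items(), key=lambda kv: -kv[1]):
    print("%-7s %8d  %6.2f%%" % (cat, count, 100.0 * count / total_records))
```

The category totals add up to the 4898431 records counted earlier, which is a quick sanity check on the mapping. They also make the class imbalance visible: the R2L and U2R categories together account for only a little over a thousand records.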



In [None]:
# Visualize , Visualize : do MOOORE PLOTS
# numericCols = ['duration', 'src_bytes','dst_bytes']
# sns.pairplot(df[numericCols], dropna=True)

In [None]:
# 3a. Create an in-memory DataFrame 
# df2.registerTempTable("network_data")

In [None]:
# num_features = [
#     "duration","src_bytes",
#     "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
#     "logged_in","num_compromised","root_shell","su_attempted","num_root",
#     "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
#     "is_host_login","is_guest_login","count","srv_count","serror_rate",
#     "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
#     "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
#     "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
#     "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
#     "dst_host_rerror_rate","dst_host_srv_rerror_rate"
# ]
# features = df[num_features]
# features.describe()

### Preprocess Data

The data inspection shows that our dataset contains categorical variables. 
For example: protocol_type, service, flag

#### Feature Transformation

Since models work over numerical values, we have to transform these variables into a numeric representation. For this transformation process (categorical -> numerical) we will use the following 'functions':

 1. **StringIndexer** 
     https://spark.apache.org/docs/2.2.1/ml-features.html#stringindexer
     StringIndexer encodes a string column of labels to a column of label indices.

 2. **OneHotEncoder**: 
     https://spark.apache.org/docs/2.2.1/ml-features.html#onehotencoder
     OneHotEncoder maps a column of label indices to a column of binary vectors, with at most a single one-value.
     This encoding allows algorithms which expect continuous features, such as Logistic Regression, 
     to use categorical features. Each categorical column will be indexed using the StringIndexer, 
     and then converted into one-hot encoded variables using the OneHotEncoder. 
     The resulting output has the binary vectors appended to the end of each row.
   
 3. **VectorAssembler**: 
     https://spark.apache.org/docs/2.2.1/ml-features.html#vectorassembler
     VectorAssembler combines a given list of columns into a single vector column.

 4. **Pipelines**: 
    Our transformation has more than one 'process' or stage, so we use a **Pipeline** 
    to chain the stages together. This greatly simplifies the code.
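
To make the first two steps concrete, here is a minimal plain-Python sketch of what StringIndexer and OneHotEncoder do to the `flag` column. It is an illustration, not Spark's implementation. It assumes the default settings: indices are assigned by descending frequency, and the encoder drops the last category (`dropLast=True`), so that category becomes an all-zero vector. The helper names `fit_string_index` and `one_hot` are hypothetical, and the alphabetical tie-break is an assumption of the sketch. The counts are copied from the `groupby('flag')` output in the inspection section.

```python
# Flag counts copied from the groupby('flag') output.
flag_counts = {
    "RSTOS0": 122, "S3": 50, "SF": 3744328, "S0": 869829, "OTH": 57,
    "REJ": 268874, "RSTO": 5344, "RSTR": 8094, "SH": 1040, "S2": 161,
    "S1": 532,
}

def fit_string_index(counts):
    """Assign index 0 to the most frequent label, 1 to the next, and so on.

    Ties are broken alphabetically here; that tie-break is an assumption
    of this sketch.
    """
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list; with drop_last the last index maps to all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

flag_index = fit_string_index(flag_counts)
for label in ("SF", "S0", "S3"):
    print(label, flag_index[label], one_hot(flag_index[label], len(flag_index)))
```

With 11 distinct flags the encoded vector has 10 slots. Spark stores these as sparse vectors, so the real output looks different even though it carries the same information.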

In [27]:
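from pyspark.sql.functions import when

# Adding a Boolean column for attack (=1) or normal (=0)
# Labels in this dataset end with a period (e.g. 'normal.'), so compare against 'normal.'
df = df.withColumn('attack', \
              when(df["connection"] == 'normal.', 0).otherwise(1))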

df.select('attack').show(2)

+------+
|attack|
+------+
|     1|
|     1|
+------+
only showing top 2 rows



In [28]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# For demonstration purposes , let's see how the indexer works
a_df=df.select('flag').distinct()
indexer = StringIndexer(inputCol='flag', outputCol='flagIndex')
model = indexer.fit(a_df)
indexed = model.transform(a_df)
indexed.show()

+------+---------+
|  flag|flagIndex|
+------+---------+
|RSTOS0|      4.0|
|    S3|      7.0|
|    SF|      8.0|
|    S0|      3.0|
|   OTH|      5.0|
|   REJ|      2.0|
|  RSTO|      9.0|
|  RSTR|      1.0|
|    SH|      6.0|
|    S2|     10.0|
|    S1|      0.0|
+------+---------+



In [29]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

categoricalColumns = [ \
           "protocol_type", "service", "flag"]

stages = [] # stages in our Pipeline
for col in categoricalColumns:
  
  # Category Indexing with StringIndexer
  indexer = StringIndexer(inputCol=col, outputCol=col+"_index")
   
  # Use OneHotEncoder to convert categorical variables into binary SparseVectors
  encoder = OneHotEncoder(inputCol=col+"_index", outputCol=col+"_vector")
  
  # Add stages.  These are not run here, but will run all at once later on.
  stages += [indexer, encoder]

In [None]:
# @ADOLFO: The professor uses this in Lab 4. I don't think we have to do it since our target variable (=attack)
#          has already been transformed into a Boolean. What do you think?

# Use StringIndexer to ALSO encode the target column to label indices.
# Convert label into label indices using the StringIndexer
# label_stringIdx = StringIndexer(inputCol = "outcome", outputCol = "label")
# stages += [label_stringIdx]

#### VectorAssembler 
VectorAssembler combines all the feature columns into a single vector column. 
It can combine raw features and features generated by different feature transformers 
into a single feature vector, in order to train ML models like logistic regression. 
The output will include both the numeric columns and the one-hot encoded binary vector columns in our dataset.
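
As a rough illustration of the idea, the plain-Python sketch below flattens a mix of vector-valued and scalar columns into one feature list, in the order the input columns are given. The `assemble` helper and the example row are hypothetical; Spark's VectorAssembler does the equivalent on DataFrame columns and may return a sparse vector.

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat feature list."""
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            # Vector-valued column (e.g. a one-hot encoding): append every slot.
            features.extend(float(v) for v in value)
        else:
            # Plain numeric column: append the single value.
            features.append(float(value))
    return features

# Hypothetical row: values are made up for illustration only.
example_row = {
    "protocol_type_vector": [1.0, 0.0],
    "duration": 0,
    "src_bytes": 181,
    "dst_bytes": 5450,
}
example_cols = ["protocol_type_vector", "duration", "src_bytes", "dst_bytes"]
example_features = assemble(example_row, example_cols)
print(example_features)
```

The order of `input_cols` fixes the position of every feature in the vector, which matters later when model coefficients are mapped back to column names.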

In [30]:
# Combine the one-hot encoded categorical columns and the numerical columns into a single vector using VectorAssembler

numericCols = [
    "duration","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate"
]

assemblerInputs = [ col + "_vector" for col in categoricalColumns ] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

In [31]:
print(assemblerInputs)

['protocol_type_vector', 'service_vector', 'flag_vector', 'duration', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate']


In [32]:
# Check the stages of our pipeline
n=0
for s in stages:
    print('stage number %d %s' %(n,s.getOutputCol()))
    n+=1 

stage number 0 protocol_type_index
stage number 1 protocol_type_vector
stage number 2 service_index
stage number 3 service_vector
stage number 4 flag_index
stage number 5 flag_vector
stage number 6 features


### Make Model
 * Create a Pipeline object to group together the stages we defined ( feature transformations )
 * Create the model
 * Split data into train and test data
 * Train the model with train data
 * Test model predictions with test data
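
A note on the split step: a weighted random split assigns each record independently, so the resulting sizes are only approximately proportional to the weights. The plain-Python sketch below shows that behaviour under stated assumptions. The `random_split` helper is hypothetical and uses Python's own random generator, so it will not reproduce the exact rows Spark selects for a given seed.

```python
import random

def random_split(records, weights, seed):
    """Assign each record to one split by an independent weighted random draw."""
    rng = random.Random(seed)
    total = float(sum(weights))

    # Cumulative boundaries, e.g. roughly [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for record in records:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            if draw < bound:
                splits[i].append(record)
                break
        else:
            # Guard against floating-point rounding in the last boundary.
            splits[-1].append(record)
    return splits

train_ids, test_ids = random_split(range(1000), [0.7, 0.3], seed=123)
print(len(train_ids), len(test_ids))
```

Re-running with the same seed gives the same split, which is the point of fixing the seed. The two sizes always add up to the input size but rarely hit exactly 70% and 30%.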

In [34]:
from pyspark.ml import Pipeline
# Create a Pipeline.
pipeline = Pipeline(stages=stages)

# Run the feature transformations.
#  - fit() computes feature statistics as needed.
#  - transform() actually transforms the features.

transformer = pipeline.fit(df)
transformed_df = transformer.transform(df)

# Keep relevant columns
# 'attack' is our 0/1 target column; 'features' is the assembled vector
selection = ["attack", "features"] + assemblerInputs
dataset = transformed_df.select(selection)

KeyboardInterrupt: 

In [35]:
### Randomly split data into training (70%) and test (30%) sets. set seed for reproducibility
(train_data, test_data) = df.randomSplit([0.7, 0.3], seed = 123)
print('Training records : %d' % train_data.count())
print('Test records : %d ' % test_data.count())
train_data.cache()

Training records : 3427798
Test records : 1470633 


DataFrame[duration: int, protocol_type: string, service: string, flag: string, src_bytes: int, dst_bytes: int, land: int, wrong_fragment: int, urgent: int, hot: int, num_failed_logins: int, logged_in: int, num_compromised: int, root_shell: int, su_attempted: int, num_root: int, num_file_creations: int, num_shells: int, num_access_files: int, num_outbound_cmds: int, is_host_login: int, is_guest_login: int, count: int, srv_count: int, serror_rate: double, srv_serror_rate: double, rerror_rate: double, srv_rerror_rate: double, same_srv_rate: double, diff_srv_rate: double, srv_diff_host_rate: double, dst_host_count: int, dst_host_srv_count: int, dst_host_same_srv_rate: double, dst_host_diff_srv_rate: double, dst_host_same_src_port_rate: double, dst_host_srv_diff_host_rate: double, dst_host_serror_rate: double, dst_host_srv_serror_rate: double, dst_host_rerror_rate: double, dst_host_srv_rerror_rate: double, connection: string, attack: int]

In [None]:
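from pyspark.ml.classification import LogisticRegression

# Create initial LogisticRegression model; 'attack' is our 0/1 label column
lr = LogisticRegression(labelCol="attack", featuresCol="features", maxIter=10)

# train_data holds the raw columns, so apply the fitted feature pipeline first
train_features = transformer.transform(train_data)

# Train model with Training Data
model = lr.fit(train_features)

In [None]:
# Make predictions on test data using the transform() method.
# LogisticRegression.transform() will only use the 'features' column,
# so the test data goes through the same feature pipeline first.

predictions = model.transform(transformer.transform(test_data))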

In [None]:
# See model's predictions and probabilities of each prediction class
# You can select any columns in the above schema to view as well. 
# For example's sake we will choose protocol_type & service
selected = predictions.select("attack", "prediction", "probability", "protocol_type", "service")

In [None]:
# Probability : 
# Here the probability column specifies the probability that the label is 0 (normal) or 1 (attack)
# The algorithm selects (= predicts) the outcome with the highest probability
# Only pull a few rows to the driver; the test set has more than a million records
selected.limit(20).toPandas()

#### Evaluation Metrics:
https://spark.apache.org/docs/2.2.1/mllib-evaluation-metrics.html

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate model (the default metric is areaUnderROC)
# Our label column is 'attack', not the default 'label'
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="attack")
score = evaluator.evaluate(predictions)
print('Score is : %.3f' % score )
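
With its default settings, BinaryClassificationEvaluator reports the area under the ROC curve. One way to read that number: it is the probability that a randomly chosen attack record receives a higher score than a randomly chosen normal record, with ties counted as half. The plain-Python sketch below computes it directly from that definition. The `area_under_roc` helper and the toy scores are hypothetical, and the pairwise loop is only practical for small inputs.

```python
def area_under_roc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy example: labels are 1 for attack, 0 for normal; scores are made up.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = area_under_roc(toy_labels, toy_scores)
print("AUC: %.3f" % toy_auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. Because about 80% of the records in this dataset are attacks, a ranking metric like this is more informative than plain accuracy.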