<table style="border: none" align="left">
   <tr style="border: none">
      <th style="border: none"><font face="verdana" size="4" color="black"><b>Use Spark ML and Scala to detect network intrusions</b></font></th>
      <th style="border: none"><img src="https://github.com/pmservice/customer-satisfaction-prediction/blob/master/app/static/images/ml_icon_gray.png?raw=true" alt="Watson Machine Learning icon" height="40" width="40"></th>
   </tr> 
   <tr style="border: none">
       <td style="border: none"><img src="https://github.com/pmservice/wml-sample-models/raw/master/tensorflow/hand-written-digit-recognition/images/experiment_banner.png" width="600" height = "200" alt="Icon"></td>
   </tr>
</table>

This notebook shows you how to build two classification models with the Spark Machine Learning (ML) library to detect network intrusions. It uses the Random Forest (RF) classifier and the Multilayer Perceptron (MLP) classifier to build the models.

<a href="http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html" target="_blank" rel="noopener noreferrer">UCI kddcup data</a> (743 MB) is used in this notebook. This data set is meant to be audited and includes a wide variety of intrusions simulated in a military network environment. It was originally used for **The Third International Knowledge Discovery and Data Mining Tools Competition**, organized for **KDD-99**.


This notebook runs on Scala and Spark and it was tested with Watson Studio Spark Environments.

## Table of contents

1. [Download data](#download)<br>
2. [Load and prepare data](#load)<br>
3. [Build the models](#build)<br>
    3.1 [Set up the Random Forest model](#rf)<br>
    3.2 [Set up the Multilayer Perceptron model](#mlp)<br>
4.  [Summary and next steps](#summary)  


<a id="download"></a>
## 1. Download data

First, download the prerequisite data set from Watson Studio using the following url: <a href="https://dataplatform.ibm.com/exchange-api/v1/entries/1438a61212a64ac435c837ba046efc19/data?accessKey=903188bb984a30f38bb889102a7db39f" target="_blank" rel="noopener noreferrer">https://dataplatform.ibm.com/exchange-api/v1/entries/1438a61212a64ac435c837ba046efc19/data?accessKey=903188bb984a30f38bb889102a7db39f</a> 

Assign this URL to the variable `url`.


In [None]:
import sys.process._

val url = "LINK-TO-DATA-SET-URL"
val filename = "./kddcup.zip"
s"wget $url -O $filename".!
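
The `!` method from `sys.process` returns the exit code of the command instead of throwing when the command fails, so a failed download can go unnoticed. The following is a minimal sketch of a wrapper that stops on a non-zero exit code. The helper name `runOrFail` is hypothetical and not part of the original notebook.

```scala
import scala.sys.process._

// Hypothetical helper (not part of the original notebook):
// run a command and stop with an error if it exits with a non-zero code.
def runOrFail(cmd: Seq[String]): Unit = {
  // `!` blocks until the process finishes and returns its exit code
  val exitCode = cmd.!
  require(exitCode == 0, s"Command failed with exit code $exitCode: ${cmd.mkString(" ")}")
}

// Example usage with the commands from this section:
// runOrFail(Seq("wget", url, "-O", filename))
// runOrFail(Seq("unzip", "./kddcup.zip", "-d", "./kddcup/"))
```

Passing the command as a `Seq[String]` avoids splitting problems when an argument such as the URL contains special characters.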

Create a ```kddcup``` directory and **unzip** the file that you downloaded:

In [None]:
"mkdir ./kddcup".!
"unzip ./kddcup.zip -d ./kddcup/".!

List the content of the unzipped file:

In [None]:
"ls ./kddcup/".!

To use the entire data set ```kddcup.data``` (743 MB), run **gunzip** to decompress the file in place in the same directory:

In [None]:
"gunzip ./kddcup/kddcup.data.gz".!

<a id="load"></a>
## 2. Load and prepare data
Because the data is provided in CSV (comma-separated values) format, you can use the ```SparkSession``` to read it directly into a DataFrame. The file has no header row, so Spark names the columns ```_c0``` through ```_c41```.

In [7]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.
    builder().
    getOrCreate()
val df = spark.
    read.format("csv").
    option("inferSchema", "true").
    load("./kddcup/kddcup.data")
df.show(5)


+---+---+----+---+---+-----+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+-------+
|_c0|_c1| _c2|_c3|_c4|  _c5|_c6|_c7|_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|_c19|_c20|_c21|_c22|_c23|_c24|_c25|_c26|_c27|_c28|_c29|_c30|_c31|_c32|_c33|_c34|_c35|_c36|_c37|_c38|_c39|_c40|   _c41|
+---+---+----+---+---+-----+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+-------+
|  0|tcp|http| SF|215|45076|  0|  0|  0|  0|   0|   1|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   1|   1| 0.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0|   0|   0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|normal.|
|  0|tcp|http| SF|162| 4528|  0|  0|  0|  0|   0|   1|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   2|   2| 0.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0|   1|   1| 1.0| 0.0

spark = org.apache.spark.sql.SparkSession@2510b661
df = [_c0: int, _c1: string ... 40 more fields]


[_c0: int, _c1: string ... 40 more fields]

Now take a look at the schema and at the labels in the last column ```_c41```.

In [8]:
df.printSchema

root
 |-- _c0: integer (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: integer (nullable = true)
 |-- _c5: integer (nullable = true)
 |-- _c6: integer (nullable = true)
 |-- _c7: integer (nullable = true)
 |-- _c8: integer (nullable = true)
 |-- _c9: integer (nullable = true)
 |-- _c10: integer (nullable = true)
 |-- _c11: integer (nullable = true)
 |-- _c12: integer (nullable = true)
 |-- _c13: integer (nullable = true)
 |-- _c14: integer (nullable = true)
 |-- _c15: integer (nullable = true)
 |-- _c16: integer (nullable = true)
 |-- _c17: integer (nullable = true)
 |-- _c18: integer (nullable = true)
 |-- _c19: integer (nullable = true)
 |-- _c20: integer (nullable = true)
 |-- _c21: integer (nullable = true)
 |-- _c22: integer (nullable = true)
 |-- _c23: integer (nullable = true)
 |-- _c24: double (nullable = true)
 |-- _c25: double (nullable = true)
 |-- _c26: double (nullable = true)
 |-- _c27: d

In [9]:
df.select("_c41").groupBy("_c41").count().show()

+----------------+-------+
|            _c41|  count|
+----------------+-------+
|    warezmaster.|     20|
|          smurf.|2807886|
|            pod.|    264|
|           imap.|     12|
|           nmap.|   2316|
|   guess_passwd.|     53|
|        ipsweep.|  12481|
|      portsweep.|  10413|
|          satan.|  15892|
|           land.|     21|
|     loadmodule.|      9|
|      ftp_write.|      8|
|buffer_overflow.|     30|
|        rootkit.|     10|
|    warezclient.|   1020|
|       teardrop.|    979|
|           perl.|      3|
|            phf.|      4|
|       multihop.|      7|
|        neptune.|1072017|
+----------------+-------+
only showing top 20 rows



Following the <a href="http://kdd.ics.uci.edu/databases/kddcup99/training_attack_types" target="_blank" rel="noopener noreferrer">description</a>, recode the labels into five categories (```normal```, ```dos```, ```probe```, ```r2l```, and ```u2r```) using an SQL query. The new column name ```label_s``` stands for *label as string*.

In [10]:
df.createOrReplaceTempView("attack")

val query = """SELECT *, 
    CASE _c41 
        WHEN 'back.' THEN 'dos'
        WHEN 'buffer_overflow.' THEN 'u2r'
        WHEN 'ftp_write.' THEN 'r2l'
        WHEN 'guess_passwd.' THEN 'r2l'
        WHEN 'imap.' THEN 'r2l'
        WHEN 'ipsweep.' THEN 'probe'
        WHEN 'land.' THEN 'dos'
        WHEN 'loadmodule.' THEN 'u2r'
        WHEN 'multihop.' THEN 'r2l'
        WHEN 'neptune.' THEN 'dos'
        WHEN 'nmap.' THEN 'probe'
        WHEN 'perl.' THEN 'u2r'
        WHEN 'phf.' THEN 'r2l'
        WHEN 'pod.' THEN 'dos'
        WHEN 'portsweep.' THEN 'probe'
        WHEN 'rootkit.' THEN 'u2r'
        WHEN 'satan.' THEN 'probe'
        WHEN 'smurf.' THEN 'dos'
        WHEN 'spy.' THEN 'r2l'
        WHEN 'teardrop.' THEN 'dos'
        WHEN 'warezclient.' THEN 'r2l'
        WHEN 'warezmaster.' THEN 'r2l'
        ELSE 'normal'
END AS label_s 
FROM attack"""
val labeled = spark.sql(query)
labeled.select("label_s").groupBy("label_s").count().show()

+-------+-------+
|label_s|  count|
+-------+-------+
|    u2r|     52|
| normal| 972781|
|    r2l|   1126|
|  probe|  41102|
|    dos|3883370|
+-------+-------+
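
The recoding can also be kept as a lookup table in plain Scala, which is handy for checking the mapping without running Spark. This is a sketch only; the names `attackCategories`, `labelToCategory`, and `categoryOf` are illustrative and do not appear in the original notebook.

```scala
// Attack labels grouped by category, mirroring the CASE expression in the SQL query
val attackCategories: Map[String, Seq[String]] = Map(
  "dos"   -> Seq("back", "land", "neptune", "pod", "smurf", "teardrop"),
  "u2r"   -> Seq("buffer_overflow", "loadmodule", "perl", "rootkit"),
  "r2l"   -> Seq("ftp_write", "guess_passwd", "imap", "multihop",
                 "phf", "spy", "warezclient", "warezmaster"),
  "probe" -> Seq("ipsweep", "nmap", "portsweep", "satan")
)

// Invert to label -> category; raw labels in the data carry a trailing "."
val labelToCategory: Map[String, String] = attackCategories.flatMap {
  case (category, labels) => labels.map(label => (label + ".", category))
}

// Anything that is not a known attack label is treated as normal traffic,
// matching the ELSE branch of the query
def categoryOf(rawLabel: String): String =
  labelToCategory.getOrElse(rawLabel, "normal")

println(categoryOf("smurf."))
println(categoryOf("normal."))
```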




Now, build a pipeline to prepare the data before building the models.

**Data preparation pipeline:**

*    StringIndexers: ```_c1```, ```_c2```, and ```_c3``` are categorical strings, so they must be indexed first.
*    OneHotEncoders: After the categorical strings have been indexed, apply one-hot encoding to the indexed columns.
*    VectorAssembler: Select the wanted columns and assemble them into a single feature vector.
*    labelIndexer: Another StringIndexer indexes the ```label_s``` column and outputs it as the ```label``` column.
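
To make the first two steps concrete, here is a plain-Scala sketch of what indexing and one-hot encoding do to a categorical column. It assumes Spark's default behavior (indices assigned by descending frequency, last category dropped from the one-hot vector) and uses a made-up toy sample; it is not Spark's implementation.

```scala
// Toy sample standing in for the `_c1` (protocol) column
val protocols = Seq("tcp", "udp", "tcp", "icmp", "tcp", "udp")

// Step 1 (StringIndexer-like): the most frequent value gets index 0
val index: Map[String, Int] = protocols
  .groupBy(identity)
  .toSeq
  .sortBy { case (_, occurrences) => -occurrences.size }
  .map { case (value, _) => value }
  .zipWithIndex
  .toMap

// Step 2 (OneHotEncoder-like): one slot per category except the last,
// so the last category is encoded as the all-zeros vector
def oneHot(i: Int, numCategories: Int): Vector[Double] =
  Vector.tabulate(numCategories - 1)(slot => if (slot == i) 1.0 else 0.0)

protocols.distinct.foreach { p =>
  println(s"$p -> index ${index(p)} -> ${oneHot(index(p), index.size)}")
}
```

With three categories, each value becomes a vector of length two, which is why a column with k categories contributes k - 1 entries to the assembled feature vector.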

In [11]:
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, OneHotEncoder}
import org.apache.spark.ml.Pipeline

val indexer1 = new StringIndexer()
  .setInputCol("_c1")
  .setOutputCol("i_c1")
//   .setHandleInvalid("skip")

val indexer2 = new StringIndexer()
  .setInputCol("_c2")
  .setOutputCol("i_c2")
//   .setHandleInvalid("skip")

val indexer3 = new StringIndexer()
  .setInputCol("_c3")
  .setOutputCol("i_c3")
//   .setHandleInvalid("skip")

val encoder1 = new OneHotEncoder()
  .setInputCol("i_c1")
  .setOutputCol("v_c1")

val encoder2 = new OneHotEncoder()
  .setInputCol("i_c2")
  .setOutputCol("v_c2")

val encoder3 = new OneHotEncoder()
  .setInputCol("i_c3")
  .setOutputCol("v_c3")

val featurenames = Array("_c0", "v_c1", "v_c2", "v_c3", "_c4", "_c5", "_c6", 
                         "_c7", "_c8", "_c9", "_c10", "_c11", "_c12", "_c13", 
                         "_c14", "_c15", "_c16", "_c17", "_c18", "_c19",
                         "_c22", "_c23", "_c24", "_c25", "_c26", "_c27", 
                         "_c28", "_c29", "_c30", "_c31", "_c32", "_c33", "_c34", 
                         "_c35", "_c36", "_c37", "_c38", "_c39", "_c40")
val assembler = new VectorAssembler()
  .setInputCols(featurenames)
  .setOutputCol("features")

val labelIndexer = new StringIndexer()
  .setInputCol("label_s")
  .setOutputCol("label")

val pipeline_prepare = new Pipeline()
  .setStages(Array(indexer1,indexer2,indexer3,encoder1,encoder2,encoder3,assembler,labelIndexer))


indexer1 = strIdx_a7eb7d249c8b
indexer2 = strIdx_bae7a99525b3
indexer3 = strIdx_d2290e2b7896
encoder1 = oneHot_0ba66b85978c
encoder2 = oneHot_2cc465e87014
encoder3 = oneHot_7587f352e24a
featurenames = Array(_c0, v_c1, v_c2, v_c3, _c4, _c5, _c6, _c7, _c8, _c9, _c10, _c11, _c12, _c13, _c14, _c15, _c16, _c17, _c18, _c19, _c22, _c23, _c24, _c25, _c26, _c27, _c28, _c29, _c30, _c31, _c32, _c33...




[_c0, v_c1, v_c2, v_c3, _c4, _c5, _c6, _c7, _c8, _c9, _c10, _c11, _c12, _c13, _c14, _c15, _c16, _c17, _c18, _c19, _c22, _c23, _c24, _c25, _c26, _c27, _c28, _c29, _c30, _c31, _c32, _c33, _c34, _c35, _c36, _c37, _c38, _c39, _c40]

You can now fit and transform the data.

In [12]:
val data = pipeline_prepare.fit(labeled).transform(labeled)

data = [_c0: int, _c1: string ... 49 more fields]


[_c0: int, _c1: string ... 49 more fields]

<a id="build"></a>
## 3. Build the models
This section describes how to build the models.
Because the data set is large, a 60/40 split still leaves plenty of data for training while holding out a large test set to check that the models do not overfit:
* 60% for the ```training``` set
* 40% for the ```testing``` set

In [13]:
val Array(train, test) = data.randomSplit(Array(0.6, 0.4))

train = [_c0: int, _c1: string ... 49 more fields]
test = [_c0: int, _c1: string ... 49 more fields]


[_c0: int, _c1: string ... 49 more fields]
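
`randomSplit` assigns each row at random, so the resulting sizes are only approximately 60% and 40%, and the split changes between runs unless a seed is fixed. Spark's `randomSplit` accepts an optional `seed` argument for this purpose. The effect of a seed can be illustrated with a plain-Scala analogue; the function `seededSplit` is hypothetical and is not how Spark implements the split.

```scala
import scala.util.Random

// Plain-Scala analogue of a seeded random split (not Spark's implementation):
// each row goes to the first part with probability `trainFraction`
def seededSplit[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
  val rng = new Random(seed)
  rows.partition(_ => rng.nextDouble() < trainFraction)
}

val rows = 1 to 1000
val (trainA, testA) = seededSplit(rows, 0.6, seed = 42L)
val (trainB, testB) = seededSplit(rows, 0.6, seed = 42L)

println(s"train: ${trainA.size}, test: ${testA.size}")
println(s"same split with same seed: ${trainA == trainB}")
```

Fixing the seed makes the training times and test errors reported later easier to reproduce.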

<a id="rf"></a>
### 3.1 Set up the Random Forest model

Because the Random Forest (RF) algorithm is provided by Spark ML, you only have to configure it and then fit it to the training data.

**Note:** There are 70 categories in the ```_c2``` column. This is larger than the default value of ```maxBins``` (32). To avoid errors, ```maxBins``` is set to 72, because it must be at least as large as the number of categories of any categorical feature.


In [14]:
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
val t = System.nanoTime
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(5)
  .setMaxBins(72)

val rf_model = rf.fit(train)
val duration = (System.nanoTime - t) / 1e9d
println(s"Training process takes $duration secs")

Training process takes 124.641922243 secs


t = 4439303327041089
rf = rfc_a37b39f56d8b
rf_model = RandomForestClassificationModel (uid=rfc_a37b39f56d8b) with 5 trees
duration = 124.641922243


124.641922243

Details about the random forest classification model can be printed with ```toDebugString``` and will look something like the following:
```
RandomForestClassificationModel (uid=rfc_a37b39f56d8b) with 5 trees
  Tree 0 (weight 1.0):
    If (feature 1 in {0.0})
     If (feature 104 <= 0.325)
      If (feature 116 <= 0.005)
       If (feature 100 <= 0.375)
        If (feature 113 <= 0.025)
         Predict: 1.0
        Else (feature 113 > 0.025)
         Predict: 2.0
       Else (feature 100 > 0.375)
        If (feature 110 <= 0.14500000000000002)
         Predict: 0.0
        Else (feature 110 > 0.14500000000000002)
         Predict: 0.0
      Else (feature 116 > 0.005)
       If (feature 0 <= 0.5)
        If (feature 105 <= 0.41500000000000004)
         Predict: 0.0
        Else (feature 105 > 0.41500000000000004)
         Predict: 2.0
       Else (feature 0 > 0.5)
        If (feature 110 <= 0.025)
         Predict: 1.0
        Else (feature 110 > 0.025)
         Predict: 2.0
     Else (feature 104 > 0.325)
      If (feature 73 in {0.0})
       If (feature 90 <= 0.5)
        If (feature 83 <= 17.0)
         Predict: 1.0
        Else (feature 83 > 17.0)
         Predict: 1.0
       Else (feature 90 > 0.5)
        If (feature 87 <= 1.5)
         Predict: 1.0
        Else (feature 87 > 1.5)
         Predict: 0.0
      Else (feature 73 not in {0.0})
       If (feature 108 <= 143.5)
        If (feature 111 <= 0.095)
         Predict: 0.0
        Else (feature 111 > 0.095)
         Predict: 0.0
       Else (feature 108 > 143.5)
        If (feature 110 <= 0.045)
         Predict: 1.0
        Else (feature 110 > 0.045)
         Predict: 0.0
    Else (feature 1 not in {0.0})
     If (feature 100 <= 0.01)
      If (feature 111 <= 0.245)
       If (feature 110 <= 0.015)
        Predict: 1.0
       Else (feature 110 > 0.015)
        If (feature 108 <= 18.5)
         Predict: 1.0
        Else (feature 108 > 18.5)
         Predict: 0.0
      Else (feature 111 > 0.245)
       If (feature 99 <= 48.5)
        If (feature 115 <= 0.005)
         Predict: 2.0
        Else (feature 115 > 0.005)
         Predict: 0.0
       Else (feature 99 > 48.5)
        If (feature 108 <= 24.5)
         Predict: 2.0
        Else (feature 108 > 24.5)
         Predict: 0.0
     Else (feature 100 > 0.01)
      Predict: 1.0
      ...
```

In [None]:
rf_model.toDebugString

Then, check the error and accuracy on the test set. In the run shown here, the test error is about 0.6%, which corresponds to an accuracy of about 99.4%.

In [16]:
val rf_predictions = rf_model.transform(test)

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(rf_predictions)
println("Test Error of RF = " + (1.0 - accuracy))

Test Error of RF = 0.006327153168961042


evaluator = mcEval_4e127ff5dadb
accuracy = 0.993672846831039


0.993672846831039
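
Because the classes are heavily imbalanced, it helps to read the accuracy against a simple baseline. The sketch below computes the accuracy of a trivial model that always predicts the most frequent class, using the class counts of the full data set printed in section 2. It assumes the test split has roughly the same class proportions.

```scala
// Class counts of the full data set, as printed in section 2
val classCounts = Map(
  "u2r"    -> 52L,
  "normal" -> 972781L,
  "r2l"    -> 1126L,
  "probe"  -> 41102L,
  "dos"    -> 3883370L
)

val total = classCounts.values.sum
val (majorityClass, majorityCount) = classCounts.maxBy(_._2)

// Accuracy of a trivial model that always predicts the most frequent class
val baselineAccuracy = majorityCount.toDouble / total
println(f"Always predicting '$majorityClass' gives accuracy $baselineAccuracy%.4f")
```

A model is only informative to the extent that it beats this baseline, and per-class metrics are worth checking for the rare ```u2r``` and ```r2l``` categories.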

<a id="mlp"></a>
### 3.2 Set up the Multilayer Perceptron model
Before building the Multilayer Perceptron (MLP) model, you need to know how many nodes are required for the input layer. 

Check the length of feature vector:

In [17]:
train.select("features").show(5)

+--------------------+
|            features|
+--------------------+
|(117,[1,10,72,82,...|
|(117,[1,10,72,82,...|
|(117,[1,10,72,82,...|
|(117,[1,10,72,82,...|
|(117,[1,10,72,82,...|
+--------------------+
only showing top 5 rows



The **input** layer should have ```117``` nodes (the length of the feature vector) and the **output** layer should have ```5``` nodes (one per label category). The definition below also contains one hidden layer with ```10``` nodes. You can change the number and size of the hidden layers.

In [18]:
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
val layers = Array[Int](117, 10, 5)
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(25)

layers = Array(117, 10, 5)
mlp = mlpc_ce78101a560d


mlpc_ce78101a560d
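
The input size of 117 can be cross-checked by hand. The sketch below assumes that one-hot encoding drops the last category of each column, and that ```_c1``` and ```_c3``` have 3 and 11 distinct values respectively; only the 70 categories of ```_c2``` are stated in this notebook, so the other two counts are assumptions. It also counts the trainable parameters of a fully connected network with the chosen layer sizes.

```scala
// Category counts: 70 for `_c2` is stated in this notebook; 3 and 11 are assumptions
val categoriesPerColumn = Map("_c1" -> 3, "_c2" -> 70, "_c3" -> 11)

// 39 assembled columns minus the 3 one-hot vectors leaves 36 numeric columns
val numericColumns = 39 - categoriesPerColumn.size

// One-hot encoding with the last category dropped yields (k - 1) slots per column
val oneHotSlots = categoriesPerColumn.values.map(_ - 1).sum

val inputSize = numericColumns + oneHotSlots
println(s"input layer size: $inputSize")

// Weights plus one bias per node for each pair of consecutive layers
val layerSizes = Seq(inputSize, 10, 5)
val parameterCount = layerSizes
  .sliding(2)
  .map { case Seq(in, out) => (in + 1) * out }
  .sum
println(s"trainable parameters: $parameterCount")
```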

Now train and fit the model to the training data.

In [19]:
val t = System.nanoTime
val mlp_model = mlp.fit(train)
val duration = (System.nanoTime - t) / 1e9d
println(s"Training process takes $duration secs")

Training process takes 89.397066077 secs


t = 4439491555134199
mlp_model = mlpc_ce78101a560d
duration = 89.397066077


89.397066077
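
The timing code is repeated for both models. A small helper can wrap any block and report its duration; the name `timed` is hypothetical and not part of the original notebook.

```scala
// Hypothetical helper: evaluate a block, print how long it took,
// and return both the result and the duration in seconds
def timed[A](label: String)(block: => A): (A, Double) = {
  val start = System.nanoTime
  val result = block
  val seconds = (System.nanoTime - start) / 1e9
  println(s"$label takes $seconds secs")
  (result, seconds)
}

// Stand-alone example with a cheap computation
val (sum, elapsed) = timed("Summing") { (1 to 1000).sum }

// In this notebook it could be used as:
// val (mlpModel, mlpSeconds) = timed("Training process") { mlp.fit(train) }
```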

Now check the performance of the MLP model.

In [20]:
val mlp_predictions = mlp_model.transform(test)
val accuracy = evaluator.evaluate(mlp_predictions)
println("Test Error of MLP = " + (1.0 - accuracy))

Test Error of MLP = 0.015756382401562186


mlp_predictions = [_c0: int, _c1: string ... 52 more fields]
accuracy = 0.9842436175984378


0.9842436175984378

<a id="summary"></a>
## 4. Summary and next steps     
This notebook shows how to build two classification models, a Random Forest and a Multilayer Perceptron, using the Spark environment in Watson Studio. In the run shown here, they reach test accuracies of about 99.4% and 98.4%, respectively. To get started with your own models, provision the Spark environment, create a notebook, and write your code.


### Citations

Dua, D. and Karra Taniskidou, E. (2017). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

### Author

**Bufan Zeng** is a Data Scientist at IBM and a member of the Watson Studio offering management team.

Copyright © IBM Corp. 2018. This notebook and its source code are released under the terms of the MIT License.
