<table style="border: none" align="left">
   <tr style="border: none">
      <th style="border: none"><font face="verdana" size="4" color="black"><b>Use Spark ML and Python to detect network intrusions</b></font></th>
      <th style="border: none"><img src="https://github.com/pmservice/customer-satisfaction-prediction/blob/master/app/static/images/ml_icon_gray.png?raw=true" alt="Watson Machine Learning icon" height="40" width="40"></th>
   </tr> 
   <tr style="border: none">
       <td style="border: none"><img src="https://github.com/pmservice/wml-sample-models/raw/master/tensorflow/hand-written-digit-recognition/images/experiment_banner.png" width="600" height = "200" alt="Icon"></td>
   </tr>
</table>


This notebook shows you how to build two classification models with the Spark Machine Learning (ML) library to detect network intrusions. It uses the Random Forest (RF) classifier and the Multilayer Perceptron (MLP) classifier to build the models.

The <a href="http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html" target="_blank" rel="noopener noreferrer">UCI kddcup data</a> (743 MB) is used in this notebook. This data set provides network connection records to be audited, including intrusions simulated in a military network environment. It was originally used for **The Third International Knowledge Discovery and Data Mining Tools Competition** organized for **KDD-99**.


This notebook runs on Python 3.6 with Spark 2.3 in Watson Studio Spark Environments.


## Table of contents

1. [Download data](#download)<br>
2. [Load and prepare data](#load)<br>
3. [Build the models](#build)<br>
    3.1 [Set up the Random Forest model](#rf)<br>
    3.2 [Set up the Multilayer Perceptron model](#mlp)<br>
4. [Saving models](#save)<br>
5. [Summary and next steps](#summary)  

  
<a id="download"></a>
## 1. Download data

First, download the prerequisite data set from Watson Studio: <a href="https://dataplatform.cloud.ibm.com/exchange/public/entry/view/1438a61212a64ac435c837ba046efc19" target="_blank" rel="noopener noreferrer">UCI: KDD Cup 1999 Data</a> 

In [None]:
url = "https://dataplatform.ibm.com/exchange-api/v1/entries/1438a61212a64ac435c837ba046efc19/data?accessKey=903188bb984a30f38bb889102a7db39f"
filename = "./kddcup.zip"
!wget $url -O $filename

Create a ```kddcup``` directory and **unzip** the file that you downloaded to the directory:

In [2]:
# !rm -rf kddcup
!mkdir kddcup
!unzip kddcup.zip -d ./kddcup/

Archive:  kddcup.zip
  inflating: ./kddcup/KDD-CUP-99 Task Description.html  
   creating: ./kddcup/__MACOSX/
  inflating: ./kddcup/__MACOSX/._KDD-CUP-99 Task Description.html  
  inflating: ./kddcup/kddcup.newtestdata_10_percent_unlabeled.gz  
  inflating: ./kddcup/__MACOSX/._kddcup.newtestdata_10_percent_unlabeled.gz  
  inflating: ./kddcup/kddcup.data_10_percent.gz  
  inflating: ./kddcup/__MACOSX/._kddcup.data_10_percent.gz  
  inflating: ./kddcup/kddcup.names   
  inflating: ./kddcup/__MACOSX/._kddcup.names  
  inflating: ./kddcup/corrected.gz   
  inflating: ./kddcup/__MACOSX/._corrected.gz  
  inflating: ./kddcup/training_attack_types  
  inflating: ./kddcup/__MACOSX/._training_attack_types  
  inflating: ./kddcup/kddcup.testdata.unlabeled_10_percent.gz  
  inflating: ./kddcup/__MACOSX/._kddcup.testdata.unlabeled_10_percent.gz  
  inflating: ./kddcup/kddcup.testdata.unlabeled.gz  
  inflating: ./kddcup/__MACOSX/._kddcup.testdata.unlabeled.gz  
  inflating: ./kddcup/kddcup.data.g
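
The archive also contains `__MACOSX` metadata entries that are not part of the data set. As a minimal sketch, assuming you prefer to extract the archive from Python instead of the shell, the standard-library `zipfile` module can skip those entries. The helper name `extract_archive` and its `skip_prefix` parameter are hypothetical and are not used elsewhere in this notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir, skip_prefix="__MACOSX/"):
    """Extract zip_path into target_dir, skipping entries under skip_prefix.

    Returns the list of extracted member names.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.startswith(skip_prefix):
                continue  # macOS metadata, not part of the data set
            archive.extract(name, str(target))
            extracted.append(name)
    return extracted
```

Calling `extract_archive("kddcup.zip", "./kddcup")` would be the rough equivalent of the `mkdir` and `unzip` commands in the cell above.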

List the contents of the ```kddcup``` directory:

In [3]:
!ls ./kddcup

'KDD Cup 1999 Data.html'	     kddcup.names
'KDD-CUP-99 Task Description.html'   kddcup.newtestdata_10_percent_unlabeled.gz
 __MACOSX			     kddcup.testdata.unlabeled.gz
 corrected.gz			     kddcup.testdata.unlabeled_10_percent.gz
 kddcup.data.gz			     training_attack_types
 kddcup.data_10_percent.gz


To use the entire data set ```kddcup.data``` (743 MB), run **gunzip** to unzip the file to the same directory:

In [4]:
# gunzip replaces the .gz file with the decompressed file in the same directory
!gunzip ./kddcup/kddcup.data.gz


<a id="load"></a>
## 2. Load and prepare data
You can use the ```SparkSession``` to read the data directly into a dataframe because the data is provided in CSV (comma-separated values) format. The file has no header row, so Spark names the columns ```_c0``` through ```_c41```.

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read\
  .format('csv')\
  .option("inferSchema", "true")\
  .load("./kddcup/kddcup.data")
df.show(5)

+---+---+----+---+---+-----+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+-------+
|_c0|_c1| _c2|_c3|_c4|  _c5|_c6|_c7|_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|_c19|_c20|_c21|_c22|_c23|_c24|_c25|_c26|_c27|_c28|_c29|_c30|_c31|_c32|_c33|_c34|_c35|_c36|_c37|_c38|_c39|_c40|   _c41|
+---+---+----+---+---+-----+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+-------+
|  0|tcp|http| SF|215|45076|  0|  0|  0|  0|   0|   1|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   1|   1| 0.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0|   0|   0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|normal.|
|  0|tcp|http| SF|162| 4528|  0|  0|  0|  0|   0|   1|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   2|   2| 0.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0|   1|   1| 1.0| 0.0

Now take a look at the schema and at the labels in the last column, ```_c41```.

In [6]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: integer (nullable = true)
 |-- _c5: integer (nullable = true)
 |-- _c6: integer (nullable = true)
 |-- _c7: integer (nullable = true)
 |-- _c8: integer (nullable = true)
 |-- _c9: integer (nullable = true)
 |-- _c10: integer (nullable = true)
 |-- _c11: integer (nullable = true)
 |-- _c12: integer (nullable = true)
 |-- _c13: integer (nullable = true)
 |-- _c14: integer (nullable = true)
 |-- _c15: integer (nullable = true)
 |-- _c16: integer (nullable = true)
 |-- _c17: integer (nullable = true)
 |-- _c18: integer (nullable = true)
 |-- _c19: integer (nullable = true)
 |-- _c20: integer (nullable = true)
 |-- _c21: integer (nullable = true)
 |-- _c22: integer (nullable = true)
 |-- _c23: integer (nullable = true)
 |-- _c24: double (nullable = true)
 |-- _c25: double (nullable = true)
 |-- _c26: double (nullable = true)
 |-- _c27: d
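
The generic ```_c0``` to ```_c41``` names can be hard to read. The following is a minimal sketch of how readable names could be derived from the ```kddcup.names``` file in the archive. It assumes that the first line of that file lists the attack labels and that every following line has the form `name: type.`. The helper names `parse_names` and `rename_map` and the label column name `label_raw` are hypothetical.

```python
def parse_names(text):
    """Return the feature names from the text of a kddcup.names-style file.

    Assumed format: the first line lists the attack labels, and every
    following non-empty line looks like 'name: type.'.
    """
    names = []
    for line in text.splitlines()[1:]:
        line = line.strip()
        if not line or ":" not in line:
            continue
        names.append(line.split(":", 1)[0].strip())
    return names


def rename_map(names, label_name="label_raw"):
    """Map Spark's default _cN column names to readable names.

    The column after the last feature holds the label.
    """
    mapping = {"_c%d" % i: name for i, name in enumerate(names)}
    mapping["_c%d" % len(names)] = label_name
    return mapping


# Tiny made-up sample in the assumed format, not the real file.
sample = "back,normal,smurf.\nduration: continuous.\nprotocol_type: symbolic.\n"
print(rename_map(parse_names(sample)))
```

Each pair in the mapping could then be applied with `df.withColumnRenamed(old, new)`. The rest of this notebook keeps the default ```_cN``` names, so the later cells would need to be adapted if you rename the columns.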

In [7]:
df.select("_c41").groupBy("_c41").count().show()

+----------------+-------+
|            _c41|  count|
+----------------+-------+
|    warezmaster.|     20|
|          smurf.|2807886|
|            pod.|    264|
|           imap.|     12|
|           nmap.|   2316|
|   guess_passwd.|     53|
|        ipsweep.|  12481|
|      portsweep.|  10413|
|          satan.|  15892|
|           land.|     21|
|     loadmodule.|      9|
|      ftp_write.|      8|
|buffer_overflow.|     30|
|        rootkit.|     10|
|    warezclient.|   1020|
|       teardrop.|    979|
|           perl.|      3|
|            phf.|      4|
|       multihop.|      7|
|        neptune.|1072017|
+----------------+-------+
only showing top 20 rows



According to the <a href="http://kdd.ics.uci.edu/databases/kddcup99/training_attack_types" target="_blank" rel="noopener noreferrer">description</a>, each attack type belongs to one of four categories: ```dos```, ```u2r```, ```r2l```, or ```probe```. The following SQL query recodes the labels into five categories: these four plus ```normal```. The new column name ```label_s``` stands for *label in string*.

In [8]:
df.createOrReplaceTempView("attack")
query = """SELECT *, 
    CASE _c41 
        WHEN 'back.' THEN 'dos'
        WHEN 'buffer_overflow.' THEN 'u2r'
        WHEN 'ftp_write.' THEN 'r2l'
        WHEN 'guess_passwd.' THEN 'r2l'
        WHEN 'imap.' THEN 'r2l'
        WHEN 'ipsweep.' THEN 'probe'
        WHEN 'land.' THEN 'dos'
        WHEN 'loadmodule.' THEN 'u2r'
        WHEN 'multihop.' THEN 'r2l'
        WHEN 'neptune.' THEN 'dos'
        WHEN 'nmap.' THEN 'probe'
        WHEN 'perl.' THEN 'u2r'
        WHEN 'phf.' THEN 'r2l'
        WHEN 'pod.' THEN 'dos'
        WHEN 'portsweep.' THEN 'probe'
        WHEN 'rootkit.' THEN 'u2r'
        WHEN 'satan.' THEN 'probe'
        WHEN 'smurf.' THEN 'dos'
        WHEN 'spy.' THEN 'r2l'
        WHEN 'teardrop.' THEN 'dos'
        WHEN 'warezclient.' THEN 'r2l'
        WHEN 'warezmaster.' THEN 'r2l'
        ELSE 'normal'
END AS label_s 
FROM attack"""

labeled = spark.sql(query)
labeled.select("label_s").groupBy("label_s").count().show()

+-------+-------+
|label_s|  count|
+-------+-------+
|    u2r|     52|
| normal| 972781|
|    r2l|   1126|
|  probe|  41102|
|    dos|3883370|
+-------+-------+
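
The same recoding can be expressed as a plain Python dictionary, which makes the mapping easy to inspect and to check against the counts. This is only a sketch that mirrors the `CASE` expression above; the names `CATEGORY_ATTACKS`, `LABEL_TO_CATEGORY`, and `recode` are hypothetical. The counts in the check are copied from the per-label output shown earlier.

```python
# Attack types grouped by category, mirroring the CASE expression above.
CATEGORY_ATTACKS = {
    "dos": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "u2r": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
    "r2l": ["ftp_write", "guess_passwd", "imap", "multihop", "phf",
            "spy", "warezclient", "warezmaster"],
    "probe": ["ipsweep", "nmap", "portsweep", "satan"],
}

# Raw labels in the data end with a period, for example 'smurf.'.
LABEL_TO_CATEGORY = {
    attack + ".": category
    for category, attacks in CATEGORY_ATTACKS.items()
    for attack in attacks
}


def recode(raw_label):
    """Return the category for a raw label; unlisted labels become 'normal'."""
    return LABEL_TO_CATEGORY.get(raw_label, "normal")


# Counts copied from the per-label output shown earlier in this notebook.
probe_counts = {"ipsweep.": 12481, "nmap.": 2316, "portsweep.": 10413, "satan.": 15892}
u2r_counts = {"buffer_overflow.": 30, "loadmodule.": 9, "perl.": 3, "rootkit.": 10}

probe_total = sum(n for label, n in probe_counts.items() if recode(label) == "probe")
u2r_total = sum(n for label, n in u2r_counts.items() if recode(label) == "u2r")
print(probe_total, u2r_total)  # 41102 52
```

The two totals agree with the ```probe``` and ```u2r``` rows in the output above. Note that, like the `ELSE` branch in the query, the default of `recode` maps any unlisted label to ```normal```, including labels that are misspelled.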



Now, build a pipeline to prepare data before building models.

**Data preparation pipeline:**
* StringIndexers: ```_c1```, ```_c2```, and ```_c3``` are categorical strings. They must be indexed first.
* OneHotEncoders: After the categorical strings have been indexed, you can apply one-hot encoding to the indexed columns.
* VectorAssembler: Select the wanted columns and assemble them into a feature vector.
* labelIndexer: Another ```StringIndexer``` indexes the ```label_s``` column and outputs it as the ```label``` column.
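
To make the first two stages concrete, here is a plain-Python sketch of what indexing and one-hot encoding do to a categorical column. It is an illustration only, not Spark's implementation. It assumes Spark's default behavior: ```StringIndexer``` gives the most frequent value index 0, and ```OneHotEncoder``` drops the last category so that a column with *k* categories becomes a vector of length *k* - 1. The function names and the alphabetical tie-break are assumptions of this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent value first."""
    counts = Counter(values)
    # Ties are broken alphabetically here to keep the sketch deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Return a one-hot list for index; the last category maps to all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if int(index) < size:
        vector[int(index)] = 1.0
    return vector


# Made-up sample values for a protocol column such as _c1.
protocols = ["tcp", "udp", "tcp", "icmp", "tcp", "udp"]
mapping = fit_string_index(protocols)
encoded = [one_hot(mapping[p], len(mapping)) for p in protocols]
print(mapping)     # {'tcp': 0.0, 'udp': 1.0, 'icmp': 2.0}
print(encoded[3])  # [0.0, 0.0]
```

The least frequent value (`icmp` in this sample) is encoded as the all-zeros vector, which is why the encoded vector is one element shorter than the number of categories.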

In [9]:
from pyspark.ml.feature import IndexToString, StringIndexer, VectorAssembler, OneHotEncoder
from pyspark.ml import Pipeline
indexer1 = StringIndexer(inputCol="_c1", outputCol="i_c1")
indexer2 = StringIndexer(inputCol="_c2", outputCol="i_c2")
indexer3 = StringIndexer(inputCol="_c3", outputCol="i_c3")

encoder1 = OneHotEncoder(inputCol="i_c1", outputCol="v_c1")
encoder2 = OneHotEncoder(inputCol="i_c2", outputCol="v_c2")
encoder3 = OneHotEncoder(inputCol="i_c3", outputCol="v_c3")

featurenames = ["_c0", "v_c1", "v_c2", "v_c3", "_c4", "_c5", "_c6", 
                         "_c7", "_c8", "_c9", "_c10", "_c11", "_c12", "_c13", 
                         "_c14", "_c15", "_c16", "_c17", "_c18", "_c19",
                         "_c22", "_c23", "_c24", "_c25", "_c26", "_c27", 
                         "_c28", "_c29", "_c30", "_c31", "_c32", "_c33", "_c34", 
                         "_c35", "_c36", "_c37", "_c38", "_c39", "_c40"]
assembler = VectorAssembler(inputCols=featurenames, outputCol="features")

labelIndexer = StringIndexer(inputCol="label_s", outputCol="label")

pipeline_prepare = Pipeline(stages=[indexer1,indexer2,indexer3,encoder1,encoder2,encoder3,assembler,labelIndexer])


You can now fit the pipeline and transform the data to prepare it for model training.

In [10]:
data = pipeline_prepare.fit(labeled).transform(labeled)

<a id="build"></a>
## 3. Build the models
This section describes how to build the models. Because the data set is large, a 60/40 split leaves ample data for training while reserving a large held-out set for checking how well the models generalize:
* 60% for the ```train``` set
* 40% for the ```test``` set

In [11]:
(train, test) = data.randomSplit([0.6, 0.4])
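
The split above is random, so each run produces different ```train``` and ```test``` sets. Spark's ```randomSplit``` also accepts a ```seed``` argument, for example `data.randomSplit([0.6, 0.4], seed=42)`, if you want repeatable results. The following plain-Python sketch illustrates the idea of a seeded, weighted split in which each row is assigned independently, so the resulting sizes are only approximately 60% and 40%. The function `random_split` is hypothetical and is not Spark's implementation.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one part at random, in proportion to weights."""
    total = float(sum(weights))
    bounds = []
    cumulative = 0.0
    for weight in weights:
        cumulative += weight / total
        bounds.append(cumulative)

    rng = random.Random(seed)  # the same seed gives the same assignment
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last part also catches any floating-point rounding gap.
            if draw < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts


train_rows, test_rows = random_split(range(1000), [0.6, 0.4], seed=42)
print(len(train_rows) + len(test_rows))  # 1000
```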

<a id="rf"></a>
### 3.1 Set up the Random Forest model

Because the Random Forest (RF) algorithm is provided by Spark ML, you only have to set it up.

To use the Watson Machine Learning (WML) service to store and deploy a model, you need to put the classifier into a pipeline object.

**Note:** There are 70 categories in the ```_c2``` column. This is larger than the default value of ```maxBins``` (32). To avoid errors, ```maxBins``` is set to 72, because it must be at least as large as the largest number of categories among all categorical variables.


In [12]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=5, maxBins=72)

rf_pipeline = Pipeline(stages=[rf])

Now train and fit the model to the training data.

In [13]:
import time
start_time = time.time()
rf_model = rf_pipeline.fit(train)
print("Training process takes %s secs" % (time.time() - start_time))

Training process takes 403.03326416015625 secs


Details about the random forest classification model can be printed and will look something like the following:
```
RandomForestClassificationModel (uid=RandomForestClassifier_45c1962e4b7f0c44ea25) with 5 trees
  Tree 0 (weight 1.0):
    If (feature 83 <= 2.0)
     If (feature 102 <= 0.04)
      If (feature 105 <= 0.32)
       If (feature 108 <= 254.5)
        If (feature 98 <= 16.5)
         Predict: 1.0
        Else (feature 98 > 16.5)
         Predict: 0.0
       Else (feature 108 > 254.5)
        If (feature 82 <= 299.5)
         Predict: 2.0
        Else (feature 82 > 299.5)
         Predict: 0.0
      Else (feature 105 > 0.32)
       If (feature 98 <= 5.5)
        If (feature 114 <= 0.05)
         Predict: 1.0
        Else (feature 114 > 0.05)
         Predict: 0.0
       Else (feature 98 > 5.5)
        If (feature 94 <= 0.5)
         Predict: 0.0
        Else (feature 94 > 0.5)
         Predict: 1.0
     Else (feature 102 > 0.04)
      If (feature 4 in {0.0})
       If (feature 75 in {0.0})
        If (feature 74 in {0.0})
         Predict: 0.0
        Else (feature 74 not in {0.0})
         Predict: 1.0
       Else (feature 75 not in {0.0})
        If (feature 6 in {0.0})
         Predict: 2.0
        Else (feature 6 not in {0.0})
         Predict: 1.0
      Else (feature 4 not in {0.0})
       If (feature 110 <= 0.10500000000000001)
        If (feature 105 <= 0.365)
         Predict: 0.0
        Else (feature 105 > 0.365)
         Predict: 2.0
       Else (feature 110 > 0.10500000000000001)
        Predict: 2.0
    Else ...
```

In [14]:
rf = rf_model.stages[0]
print(rf.toDebugString)

RandomForestClassificationModel (uid=RandomForestClassifier_464f80ab08c662d8d1ea) with 5 trees
  Tree 0 (weight 1.0):
    If (feature 89 <= 0.5)
     If (feature 98 <= 16.5)
      If (feature 107 <= 4.5)
       If (feature 72 in {0.0})
        If (feature 110 <= 0.675)
         Predict: 1.0
        Else (feature 110 > 0.675)
         Predict: 2.0
       Else (feature 72 not in {0.0})
        If (feature 105 <= 0.005)
         Predict: 2.0
        Else (feature 105 > 0.005)
         Predict: 1.0
      Else (feature 107 > 4.5)
       If (feature 108 <= 16.5)
        If (feature 115 <= 0.065)
         Predict: 1.0
        Else (feature 115 > 0.065)
         Predict: 2.0
       Else (feature 108 > 16.5)
        If (feature 14 in {0.0})
         Predict: 1.0
        Else (feature 14 not in {0.0})
         Predict: 0.0
     Else (feature 98 > 16.5)
      If (feature 106 <= 0.005)
       If (feature 2 in {0.0})
        If (feature 108 <= 254.5)
         Predict: 1.0
        Else (feature 108 

Now check the test error. In the run recorded below it is about 0.16%, so the model classifies the held-out data very accurately.

In [15]:
rf_prediction = rf_model.transform(test)

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
start_time = time.time()
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
rf_accuracy = evaluator.evaluate(rf_prediction)
print("Evaluating process takes %s secs" % (time.time() - start_time))
print("Test Error of RF = %g " % (1.0 - rf_accuracy))

Evaluating process takes 171.3987214565277 secs
Test Error of RF = 0.00161474 
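
Overall accuracy can hide poor results on rare classes, and in this data set ```u2r``` and ```r2l``` together account for only about 1,200 of several million rows. The following plain-Python sketch shows how per-class recall exposes this. The toy predictions are made up for illustration and are not results from this notebook; the function names are hypothetical.

```python
from collections import Counter


def accuracy(pairs):
    """Fraction of (label, prediction) pairs that match."""
    return sum(1 for label, pred in pairs if label == pred) / len(pairs)


def per_class_recall(pairs):
    """For each true label, the fraction of its rows predicted correctly."""
    totals = Counter(label for label, _ in pairs)
    hits = Counter(label for label, pred in pairs if label == pred)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up example: every rare 'u2r' row is misclassified as 'dos'.
toy_pairs = [("dos", "dos")] * 990 + [("u2r", "dos")] * 10

print(accuracy(toy_pairs))          # 0.99
print(per_class_recall(toy_pairs))  # {'dos': 1.0, 'u2r': 0.0}
```

For the models in this notebook, one way to get the required counts would be `rf_prediction.groupBy("label", "prediction").count()`. The ```MulticlassClassificationEvaluator``` also supports other values for ```metricName```, such as ```f1```, ```weightedPrecision```, and ```weightedRecall```.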


<a id="mlp"></a>
### 3.2 Set up the Multilayer Perceptron model
Before building the Multilayer Perceptron (MLP) model, you need to know how many nodes are required for the input layer. 

Check the length of feature vector:

In [16]:
inputlayer = len(train.select("features").take(1)[0][0])

The **output** layer should have ```5``` nodes, one for each of the 5 label categories. The definition below also contains one hidden layer with ```10``` nodes. You can change the number and size of the hidden layers.

In [17]:
layers = [inputlayer, 10, 5]
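
As a rough sanity check on ```inputlayer```, the length of the assembled feature vector can be estimated by hand: each numeric column contributes one element, and each one-hot encoded column contributes its number of categories minus one. The sketch below uses the 36 numeric columns listed in ```featurenames``` and the 70 categories mentioned for ```_c2```. The category counts of 3 for ```_c1``` and 11 for ```_c3``` are assumptions, so compare the result with the value of ```inputlayer``` in your own run. The function names are hypothetical.

```python
def assembled_length(num_numeric, category_counts, drop_last=True):
    """Length of a feature vector built from numeric and one-hot columns."""
    encoded = sum(c - 1 if drop_last else c for c in category_counts)
    return num_numeric + encoded


def mlp_parameter_count(layers):
    """Number of weights and biases in a fully connected network."""
    # Each transition has (inputs + 1 bias) * outputs parameters.
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


# 3 and 11 are assumed category counts for _c1 and _c3.
n_features = assembled_length(36, [3, 70, 11])
print(n_features)                                # 117
print(mlp_parameter_count([n_features, 10, 5]))  # 1235
```

Under these assumptions the network has only about 1,200 trainable parameters, so increasing the hidden layer size is inexpensive in terms of model size.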

Put the MLP classifier into a pipeline.

In [18]:
from pyspark.ml.classification import MultilayerPerceptronClassifier

mlp = MultilayerPerceptronClassifier(maxIter=25, layers=layers, blockSize=128, seed=1234)

mlp_pipeline = Pipeline(stages=[mlp])

Now train and fit the model to the training data.

In [19]:
start_time = time.time()
mlp_model = mlp_pipeline.fit(train)
print("Training process takes %s secs" % (time.time() - start_time))

Training process takes 329.73344683647156 secs


Now check the performance of the MLP model.

In [20]:
mlp_prediction = mlp_model.transform(test)

start_time = time.time()
mlp_accuracy = evaluator.evaluate(mlp_prediction)
print("Evaluating process takes %s secs" % (time.time() - start_time))
print("Test Error of MLP = %g " % (1.0 - mlp_accuracy))

Evaluating process takes 154.5317394733429 secs
Test Error of MLP = 0.0116643 


<a id="save"></a>
## 4. Saving models

You can use the <a href="https://cloud.ibm.com/catalog/services/machine-learning" target="_blank" rel="noopener noreferrer">Watson Machine Learning (WML) service</a> to save and deploy your models.


Import the WML client and define your credentials. Credentials can be generated and copied from bluemix.net. Then connect to WML using the credentials.

<b>Tip:</b> Authentication information (your credentials) can be found in the <a href="https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-get-wml-credentials.html" target="_blank" rel="noopener noreferrer">Service Credentials</a> tab of the service instance that you created on IBM Cloud.
If you cannot see the instance_id field in Service Credentials, click New credential (+) to generate new authentication information.

<b>Action:</b> Enter your Watson Machine Learning service instance credentials here.

In [22]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient
import json
wml_credentials = {
    "apikey": "...",
    "username": "...",
    "password": "...",
    "instance_id": "...",
    "url": "https://ibm-watson-ml.mybluemix.net"
}
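
If you share this notebook, you may not want to keep credentials in a code cell. As a minimal sketch, assuming you have set environment variables for each field, the dictionary can be built at run time instead. The variable names (`WML_APIKEY`, `WML_USERNAME`, and so on) and the helper `load_wml_credentials` are hypothetical; they are not defined by the WML client.

```python
import os


def load_wml_credentials(env=None, prefix="WML_"):
    """Build the credentials dictionary from environment variables.

    Raises KeyError that names every missing variable.
    """
    env = os.environ if env is None else env
    keys = ["apikey", "username", "password", "instance_id", "url"]
    missing = [prefix + k.upper() for k in keys if prefix + k.upper() not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {k: env[prefix + k.upper()] for k in keys}
```

With the variables set, `wml_credentials = load_wml_credentials()` would produce a dictionary with the same keys as the cell above.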

In [24]:
client = WatsonMachineLearningAPIClient(wml_credentials)

After connecting to the WML client, you can manage the models you have. API details can be found <a href="https://wml-api-pyclient.mybluemix.net/" target="_blank" rel="noopener noreferrer">here</a>.

List the models:

In [25]:
client.repository.list_models()

## To delete the model(s):
#client.repository.delete(GUID)

------------------------------------  -------------------------------------------  ------------------------  -----------------
GUID                                  NAME                                         CREATED                   FRAMEWORK
53b90fb3-0d12-4d66-96c8-b4e4cf96a0ae  Customer churn Spark model                   2019-07-03T15:47:49.776Z  mllib-2.3
ad24c140-f97a-49f2-b02c-f8ce44a58c27  Custom ARIMA estimator for sklearn pipeline  2019-07-03T01:04:33.998Z  scikit-learn-0.19
471b39cc-9c3b-4ff9-a8ea-0297efe0ca5d  Boston house price prediction                2019-05-20T18:19:55.433Z  scikit-learn-0.19
fc5462c8-7eb9-4dda-8b0a-947e2faa30da  WML Product Line Prediction Model            2019-05-17T17:33:53.278Z  mllib-2.3
a57e82a9-076e-4236-8bd2-7465e726c419  WML Product Line Prediction Model            2019-05-17T17:28:36.434Z  mllib-2.3
914b598b-f52e-4a87-bfed-2f4163eba25e  Boston house price prediction                2019-05-13T21:11:04.132Z  scikit-learn-0.19
43dbda1f-34f4-43


Metadata can be added to the model. You then pass the fitted pipeline model, the pipeline object, and the training data to save the model.

In [None]:
rf_props = {client.repository.ModelMetaNames.AUTHOR_NAME: "Bufan", 
               client.repository.ModelMetaNames.NAME: "RF_AttackDetection_PySpark"}

rf_saved_model = client.repository.store_model(model=rf_model, pipeline=rf_pipeline, meta_props=rf_props, training_data=train)

Details of the saved model can be printed and will look something like the following: 
```
{
  "entity": {
    "deployments": {
      "count": 0,
      "url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/40ac6090-70f5-40d9-836c-30dff9242a25/published_models/b365e08c-a480-4195-a920-94d8da0d0001/deployments"
    },
    "label_col": "label",
    "model_type": "mllib-2.3",
    "evaluation_metrics_url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/40ac6090-70f5-40d9-836c-30dff9242a25/published_models/b365e08c-a480-4195-a920-94d8da0d0001/evaluation_metrics",
    "author": {
      "name": "Bufan"
    },
    "name": "RF_AttackDetection_PySpark",
    "learning_iterations_url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/40ac6090-70f5-40d9-836c-30dff9242a25/published_models/b365e08c-a480-4195-a920-94d8da0d0001/learning_iterations",
    "input_data_schema": {
      "type": "struct",
      "fields": [
        {
          "nullable": true,
          "name": "_c0",
          "type": "integer",
          "metadata": {}
        },
        {
          "nullable": true,
          "name": "_c1",
          "type": "string",
          "metadata": {}
        },
        {
          "nullable": true,
          "name": "_c2",
          "type": "string",
          "metadata": {}
        },
        ....
```

In [27]:
rf_model_uid = client.repository.get_model_uid(rf_saved_model)
model_details = client.repository.get_details(rf_model_uid)
print(json.dumps(model_details, indent=2))

{
  "metadata": {
    "guid": "4f46199c-4d91-4495-b14e-ac21f60c0535",
    "url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/b4b6c696-172c-4164-8049-c0b621dbf3c9/published_models/4f46199c-4d91-4495-b14e-ac21f60c0535",
    "created_at": "2019-07-08T23:06:48.768Z",
    "modified_at": "2019-07-08T23:06:48.845Z"
  },
  "entity": {
    "runtime_environment": "spark-2.3",
    "learning_configuration_url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/b4b6c696-172c-4164-8049-c0b621dbf3c9/published_models/4f46199c-4d91-4495-b14e-ac21f60c0535/learning_configuration",
    "author": {
      "name": "Bufan"
    },
    "name": "RF_AttackDetection_PySpark",
    "label_col": "label",
    "learning_iterations_url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/b4b6c696-172c-4164-8049-c0b621dbf3c9/published_models/4f46199c-4d91-4495-b14e-ac21f60c0535/learning_iterations",
    "training_data_schema": {
      "fields": [
        {
          "metadata": {},
          "name": "_c0",
   

Now the model should be in the WML service model list:

In [28]:
client.repository.list_models()

------------------------------------  -------------------------------------------  ------------------------  -----------------
GUID                                  NAME                                         CREATED                   FRAMEWORK
4f46199c-4d91-4495-b14e-ac21f60c0535  RF_AttackDetection_PySpark                   2019-07-08T23:06:48.768Z  mllib-2.3
53b90fb3-0d12-4d66-96c8-b4e4cf96a0ae  Customer churn Spark model                   2019-07-03T15:47:49.776Z  mllib-2.3
ad24c140-f97a-49f2-b02c-f8ce44a58c27  Custom ARIMA estimator for sklearn pipeline  2019-07-03T01:04:33.998Z  scikit-learn-0.19
471b39cc-9c3b-4ff9-a8ea-0297efe0ca5d  Boston house price prediction                2019-05-20T18:19:55.433Z  scikit-learn-0.19
fc5462c8-7eb9-4dda-8b0a-947e2faa30da  WML Product Line Prediction Model            2019-05-17T17:33:53.278Z  mllib-2.3
a57e82a9-076e-4236-8bd2-7465e726c419  WML Product Line Prediction Model            2019-05-17T17:28:36.434Z  mllib-2.3
914b598b-f52e-4a87-bfed-

Once the model is saved to the WML service, it can be loaded through the WML client using its GUID.

In [29]:
rf_loaded_model = client.repository.load(rf_model_uid)

You can print the details of the loaded model, which look something like the following:
```
RandomForestClassificationModel (uid=RandomForestClassifier_45c1962e4b7f0c44ea25) with 5 trees
  Tree 0 (weight 1.0):
    If (feature 83 <= 2.0)
     If (feature 102 <= 0.04)
      If (feature 105 <= 0.32)
       If (feature 108 <= 254.5)
        If (feature 98 <= 16.5)
         Predict: 1.0
        Else (feature 98 > 16.5)
         Predict: 0.0
       Else (feature 108 > 254.5)
        If (feature 82 <= 299.5)
         Predict: 2.0
        Else (feature 82 > 299.5)
         Predict: 0.0
      Else (feature 105 > 0.32)
       If (feature 98 <= 5.5)
        If (feature 114 <= 0.05)
         Predict: 1.0
        Else (feature 114 > 0.05)
         Predict: 0.0
       Else (feature 98 > 5.5)
        If (feature 94 <= 0.5)
         Predict: 0.0
        Else (feature 94 > 0.5)
         Predict: 1.0
     Else (feature 102 > 0.04)
      If (feature 4 in {0.0})
       If (feature 75 in {0.0})
        If (feature 74 in {0.0})
         Predict: 0.0
        Else (feature 74 not in {0.0})
         Predict: 1.0
       Else (feature 75 not in {0.0})
        If (feature 6 in {0.0})
         Predict: 2.0
        Else (feature 6 not in {0.0})
         Predict: 1.0
      Else (feature 4 not in {0.0})
       If (feature 110 <= 0.10500000000000001)
        If (feature 105 <= 0.365)
         Predict: 0.0
        Else (feature 105 > 0.365)
         Predict: 2.0
       Else (feature 110 > 0.10500000000000001)
        Predict: 2.0
        ....
```

In [30]:
loaded_debug_string = rf_loaded_model.stages[0].toDebugString
print(loaded_debug_string)

RandomForestClassificationModel (uid=RandomForestClassifier_464f80ab08c662d8d1ea) with 5 trees
  Tree 0 (weight 1.0):
    If (feature 89 <= 0.5)
     If (feature 98 <= 16.5)
      If (feature 107 <= 4.5)
       If (feature 72 in {0.0})
        If (feature 110 <= 0.675)
         Predict: 1.0
        Else (feature 110 > 0.675)
         Predict: 2.0
       Else (feature 72 not in {0.0})
        If (feature 105 <= 0.005)
         Predict: 2.0
        Else (feature 105 > 0.005)
         Predict: 1.0
      Else (feature 107 > 4.5)
       If (feature 108 <= 16.5)
        If (feature 115 <= 0.065)
         Predict: 1.0
        Else (feature 115 > 0.065)
         Predict: 2.0
       Else (feature 108 > 16.5)
        If (feature 14 in {0.0})
         Predict: 1.0
        Else (feature 14 not in {0.0})
         Predict: 0.0
     Else (feature 98 > 16.5)
      If (feature 106 <= 0.005)
       If (feature 2 in {0.0})
        If (feature 108 <= 254.5)
         Predict: 1.0
        Else (feature 108 

In [31]:
loaded_debug_string == rf.toDebugString

True

<a id="summary"></a>
## 5. Summary and next steps     
This notebook shows how to build two well-performing models using the Spark environment in Watson Studio. It is easy to build models using the Spark API and Watson Studio. Just provision the Spark environment, create the notebook, and you are ready to write your code!




### Citations

Dua, D. and Karra Taniskidou, E. (2017). <a href="http://archive.ics.uci.edu/ml" target="_blank" rel="noopener noreferrer">UCI Machine Learning Repository</a>. Irvine, CA: University of California, School of Information and Computer Science.

## Author

**Bufan Zeng**: a Data Scientist with the Watson Studio offering management team at IBM.

Copyright © IBM Corp. 2018, 2019. This notebook and its source code are released under the terms of the MIT License.
