## Machine Learning Tutorials with Watson Machine Learning
### Part 3 - Decision Trees 

This tutorial was adapted from the Spark documentation at https://spark.apache.org/docs/2.1.1/ml-classification-regression.html#decision-tree-classifier

### 3.1 Add Data

Before you begin, load the text file into DSX.
Download `sample_libsvm_data.txt` from the GitHub repository, then upload it to DSX using "Find and Add Data" in the upper right-hand corner.

In [None]:
# Once uploaded, select "sample_libsvm_data.txt".
# Under Insert to Code, select Insert SparkSession Setup and place that code here.
import ibmos2spark

# @hidden_cell
credentials = {
    'auth_url': 'https://identity.open.softlayer.com',
    'project_id': 'df584add39774a4492e9ac43bbfe2944',
    'region': 'dallas',
    'user_id': '27783e2082124d2e9bd2fe45de4ec98d',
    'username': 'member_6159c7d53ab652ef6d55e41ec4e92bf82ce9db34',
    'password': 'W4x3?,k_2WCn_I-1'
}

configuration_name = 'os_eea0fae16ed84b69a7875db7dcc2ba81_configs'
bmos = ibmos2spark.bluemix(sc, credentials, configuration_name)

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Please read the documentation of PySpark to learn more about the possibilities to load data files.
# PySpark documentation: https://spark.apache.org/docs/2.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession
# The SparkSession object is already initialized for you.
# The following variable contains the path to your file on your Object Storage.
path_1 = bmos.url('OReilly', 'sample_libsvm_data.txt')


In [None]:
# Print the path created for you and copy it into the next cell
path_1

In [None]:
data = spark.read.format("libsvm").load('swift2d://OReilly.os_eea0fae16ed84b69a7875db7dcc2ba81_configs/sample_libsvm_data.txt')
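
For readers unfamiliar with the LIBSVM text format read by `spark.read.format("libsvm")`: each line holds a label followed by sparse `index:value` pairs, and the indices are one-based in the file. The pure-Python sketch below parses one such line to show the layout. The helper name `parse_libsvm_line` and the sample line are illustrative assumptions; neither comes from Spark or from `sample_libsvm_data.txt`.

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM-formatted line into (label, {zero_based_index: value})."""
    parts = line.strip().split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        # Indices are one-based in the file; shift them to zero-based,
        # which is also what Spark does when it builds the feature vector.
        features[int(index) - 1] = float(value)
    return label, features


# Hypothetical sample line, not taken from sample_libsvm_data.txt
label, features = parse_libsvm_line("1 3:0.5 10:2.0")
print("%s %s" % (label, features))
```

Only the non-zero features appear in each line, which is why the loader produces sparse vectors.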


### 3.2 Build Decision Tree Model Pipeline
What differs between building a Decision Tree model and a Linear Regression model?

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [None]:
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))

treeModel = model.stages[2]
# summary only
print(treeModel)
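
The `maxCategories=4` rule applied by the `VectorIndexer` in the cell above can be illustrated without Spark: count the distinct values in each feature column and treat a column as categorical only when that count is at most `maxCategories`. The following is a simplified pure-Python sketch of that decision rule, not Spark's implementation; the function `categorical_feature_indices` and the toy rows are hypothetical.

```python
def categorical_feature_indices(rows, max_categories):
    """Return indices of columns with at most max_categories distinct values."""
    num_features = len(rows[0])
    categorical = []
    for j in range(num_features):
        distinct = set(row[j] for row in rows)
        if len(distinct) <= max_categories:
            categorical.append(j)
    return categorical


# Toy data (hypothetical): column 0 has 2 distinct values, column 1 has 5.
rows = [
    [0.0, 0.1],
    [1.0, 0.2],
    [0.0, 0.3],
    [1.0, 0.4],
    [0.0, 0.5],
]
print(categorical_feature_indices(rows, max_categories=4))  # prints [0]
```

Column 1 has more than four distinct values, so under this rule it would be left as a continuous feature.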

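The test error reported by the evaluation step is one minus the accuracy, where accuracy is the fraction of test rows whose prediction equals the indexed label. A minimal pure-Python sketch of that arithmetic follows; the helper `accuracy_of` and the toy values are hypothetical and stand in for the `prediction` and `indexedLabel` columns.

```python
def accuracy_of(predictions, labels):
    """Fraction of positions where the prediction equals the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return float(correct) / len(labels)


# Toy values (hypothetical): three of four predictions match.
toy_predictions = [0.0, 1.0, 1.0, 0.0]
toy_labels = [0.0, 1.0, 0.0, 0.0]
toy_accuracy = accuracy_of(toy_predictions, toy_labels)
print("Test Error = %g" % (1.0 - toy_accuracy))  # prints Test Error = 0.25
```
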
### 3.3 Exercise - Watson Machine Learning
#### Save this model and publish it to Watson Machine Learning

In [None]:
from repository.mlrepositoryclient import MLRepositoryClient
from repository.mlrepositoryartifact import MLRepositoryArtifact

In [None]:
service_path = 'https://ibm-watson-ml.mybluemix.net'
instance_id = "437c58bc-4c25-4002-98de-e822fa4ec797"
username = "cfd3c4a9-159e-476d-b097-1cd133d884d7"
password = "ca465385-c8e6-4d35-8e82-0e47e6d399bf"
partner_saved_model_uid = '485019ab-37df-4bb5-b391-c1064781a9e7'

    
ml_repository_client = MLRepositoryClient(service_path)
ml_repository_client.authorize(username, password)

model_artifact = MLRepositoryArtifact(model, name='decision tree test', training_data=trainingData)
saved_model = ml_repository_client.models.save(model_artifact)

loadedModel = ml_repository_client.models.get(saved_model.uid)
#print str(loadedModel.name)

loadedModel.model_instance().transform(testData).select('indexedLabel', 'prediction').show()