<center><h1> Predict heart failure with Watson Machine Learning</h1></center>
![alt text](https://www.cdc.gov/dhdsp/images/heart_failure.jpg "Heart failure")
<p>This notebook contains steps and code to create a predictive model to predict heart failure and then deploy that model to Watson Machine Learning so it can be used in an application.</p>
## Learning Goals
The learning goals of this notebook are:
* Load a CSV file into the Object Storage service linked to your Data Science Experience
* Create an Apache® Spark machine learning model
* Train and evaluate a model
* Persist a model in a Watson Machine Learning repository

## 1. Setup

Before you use the sample code in this notebook, you must perform the following setup tasks:
* Create a Watson Machine Learning Service instance (a free plan is offered) and associate it with your project
* Upload heart failure data to the Object Storage service that is part of your Data Science Experience trial


In [None]:
# IMPORTANT Follow the lab instructions to insert authentication and access info here to get access to the data used in this notebook
import ibmos2spark

# @hidden_cell


from pyspark.sql import SparkSession

## 2. Load and explore data
<p>In this section you will load the data as an Apache® Spark DataFrame and perform a basic exploration.</p>

<p>Load the data to the Spark DataFrame from your associated Object Storage instance.</p>

In [None]:
spark = SparkSession.builder.getOrCreate()

# Read data file and create a Data Frame
df_data = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .option('inferSchema', 'true')\
  .load(cos.url('patientdataV6.csv', 'watsonmlintegrationdc6181d94d83494798e5ba1e23e00d1d'))

Explore the loaded data by using the following Apache® Spark DataFrame methods:
* print schema
* print the top records (`show()` displays the first 20 rows by default)
* count all records

In [None]:
df_data.printSchema()

As you can see, the data contains ten fields. The HEARTFAILURE field is the one we would like to predict (the label).

In [None]:
df_data.show()

In [None]:
df_data.describe().show()

In [None]:
df_data.count()

As you can see, the data set contains 10800 records.

## 3. Interactive Visualizations with PixieDust

In [None]:
# To confirm you have the latest version of PixieDust on your system, run this cell
!pip install --user --upgrade pixiedust

If the installer indicates that a restart is needed, restart the kernel, rerun the notebook up to this point, and then continue with the workshop.

In [None]:
import pixiedust

### Simple visualization using bar charts
With PixieDust `display()`, you can visually explore the loaded data using built-in charts such as bar charts, line charts, scatter plots, or maps.
To explore a data set, choose the desired chart type from the drop-down menu, then configure the chart options and the display options.

In [None]:
display(df_data)

## 4. Create an Apache® Spark machine learning model
In this section you will learn how to prepare data, create and train an Apache® Spark machine learning model.

### 4.1: Prepare data
In this subsection you will split your data into train and test data sets.

In [None]:
split_data = df_data.randomSplit([0.8, 0.20], 24)
train_data = split_data[0]
test_data = split_data[1]

# print() with a single argument works on both Python 2 and Python 3
print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))

As you can see our data has been successfully split into two data sets:
* The train data set, which is the larger group (about 80% of the records), is used for training.
* The test data set will be used for model evaluation and is used to test the assumptions of the model.
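
The effect of a seeded random split can be illustrated without Spark. The following is a minimal sketch, not Spark's actual implementation: `seeded_split` is a hypothetical helper that assigns each record independently to one of two groups, so the resulting proportions are only approximately 80/20, and the same seed reproduces the same split.

```python
import random

def seeded_split(records, train_fraction=0.8, seed=24):
    """Hypothetical helper: assign each record to train or test at random."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for record in records:
        # Each record is drawn independently, so sizes are approximate
        if rng.random() < train_fraction:
            train.append(record)
        else:
            test.append(record)
    return train, test

records = list(range(1000))
train, test = seeded_split(records)
train_again, test_again = seeded_split(records)

print(len(train), len(test))
print(train == train_again)  # same seed, same split
```

Fixing the seed matters when you compare several models, because every model is then trained and evaluated on exactly the same records.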

### 4.2: Create pipeline and train a model
In this section you will create an Apache® Spark machine learning pipeline and then train the model.
In the first step you need to import the Apache® Spark machine learning packages that will be needed in the subsequent steps.

A sequence of data processing steps is called a _data pipeline_. Each step in the pipeline processes the data and passes the result to the next step. This allows you to transform the raw input data and fit your model in a single operation.

In [None]:
from pyspark.ml.feature import StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression, NaiveBayes
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

from pyspark.ml import Pipeline, Model
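
The data pipeline idea can be shown in plain Python before using the Spark classes. This is a toy sketch under stated assumptions: `run_pipeline`, `index_sex`, and `assemble_features` are hypothetical names, and the `"M"`/`"F"` mapping is invented for illustration. It only demonstrates how each stage receives the output of the previous one.

```python
def run_pipeline(stages, record):
    """Pass a record through each stage in order."""
    for stage in stages:
        record = stage(record)
    return record

def index_sex(record):
    # Toy stand-in for a string indexer: map a category to a number
    mapping = {"M": 0.0, "F": 1.0}  # illustrative mapping, not from the data set
    return dict(record, SEX_IX=mapping[record["SEX"]])

def assemble_features(record):
    # Toy stand-in for a vector assembler: collect numeric columns in one list
    return dict(record, features=[record["AGE"], record["SEX_IX"]])

result = run_pipeline([index_sex, assemble_features], {"AGE": 44, "SEX": "F"})
print(result["features"])
```

The order of the stages matters: `assemble_features` relies on the `SEX_IX` column that `index_sex` adds, just as the feature vector in this notebook relies on the indexed columns.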

In the following step, convert all the string fields to numeric ones by using the StringIndexer transformer.

In [None]:
stringIndexer_label = StringIndexer(inputCol="HEARTFAILURE", outputCol="label").fit(df_data)
stringIndexer_sex = StringIndexer(inputCol="SEX", outputCol="SEX_IX")
stringIndexer_famhist = StringIndexer(inputCol="FAMILYHISTORY", outputCol="FAMILYHISTORY_IX")
stringIndexer_smoker = StringIndexer(inputCol="SMOKERLAST5YRS", outputCol="SMOKERLAST5YRS_IX")
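
To make the indexing step concrete, here is a plain-Python sketch of the idea behind StringIndexer, assuming its default ordering in which the most frequent value receives index 0. `fit_string_index` is a hypothetical helper, not part of Spark, and the sample values are invented.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically for determinism
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

mapping = fit_string_index(["N", "Y", "N", "N", "Y"])
print(mapping)

# The reverse lookup is what IndexToString does with the fitted labels
labels = sorted(mapping, key=mapping.get)
print(labels[int(mapping["Y"])])
```

Keeping the fitted label order is what later allows numeric predictions to be converted back to the original strings.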


In the following step, create a feature vector by combining all features together.

In [None]:
vectorAssembler_features = VectorAssembler(inputCols=["AVGHEARTBEATSPERMIN","PALPITATIONSPERDAY","CHOLESTEROL","BMI","AGE","SEX_IX","FAMILYHISTORY_IX","SMOKERLAST5YRS_IX","EXERCISEMINPERWEEK"], outputCol="features")

Next, define the estimators you want to use for classification. This example defines Random Forest, Logistic Regression, and Naive Bayes classifiers.

In [None]:
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
#lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr = LogisticRegression()

nb = NaiveBayes(smoothing=1.0)

Finally, convert the indexed labels back to the original labels.

In [None]:
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=stringIndexer_label.labels)

In [None]:
transform_df_pipeline = Pipeline(stages=[stringIndexer_label, stringIndexer_sex, stringIndexer_famhist, stringIndexer_smoker, vectorAssembler_features])
transformed_df = transform_df_pipeline.fit(df_data).transform(df_data)

Let's build the pipeline now. A pipeline consists of transformers and an estimator.

In [None]:
 
pipeline_rf = Pipeline(stages=[stringIndexer_label, stringIndexer_sex, stringIndexer_famhist, stringIndexer_smoker, vectorAssembler_features, rf, labelConverter])
# pipeline_rf = Pipeline(stages=[stringIndexer_label, stringIndexer_sex, stringIndexer_famhist, stringIndexer_smoker, vectorAssembler_features, lr, labelConverter])


pipeline1 = Pipeline(stages=[stringIndexer_label, stringIndexer_sex, 
                             stringIndexer_famhist, stringIndexer_smoker, 
                             vectorAssembler_features, rf, labelConverter])
m1Name = "Random Forest Default"

pipeline2 = Pipeline(stages=[stringIndexer_label, stringIndexer_sex, 
                             stringIndexer_famhist, stringIndexer_smoker, 
                             vectorAssembler_features, lr, labelConverter])
m2Name = "Logistic Regression (first try)"

pipeline3 = Pipeline(stages=[stringIndexer_label, stringIndexer_sex, 
                             stringIndexer_famhist, stringIndexer_smoker, 
                             vectorAssembler_features, nb, labelConverter])
m3Name = "Naive Bayes"



Now you can train your models by using the previously defined **pipelines** and **training data**.

In [None]:
# Fit the Random Forest pipeline; model_rf is the model persisted in section 5
model_rf = pipeline_rf.fit(train_data)

model1 = pipeline1.fit(train_data)
model2 = pipeline2.fit(train_data)
model3 = pipeline3.fit(train_data)

You can check your **model accuracy** now. To evaluate the model, use **test data**.

In [None]:
from pyspark.sql.types import Row
import numpy as np

def getCMEntries(threshold):
    newThresholdDF = spark.sql("select label, p1, prediction as oldPrediction," 
          " case when p1 > " + str(threshold) + " then 1.0 else 0.0 end as newPrediction"
          " from inputToThreshold")

    newThresholdDF.registerTempTable("newThreshold")

    # begin exercise +++++++++++++++++++++++++++++++++ 
    tpA = spark.sql("select label, oldPrediction, newPrediction " 
          " from newThreshold where label = 1 and newprediction = 1 ")

    fpA = spark.sql("select label, oldPrediction, newPrediction " 
          " from newThreshold where label = 0 and newprediction = 1 ")

    fnA = spark.sql("select label, oldPrediction, newPrediction " 
          " from newThreshold where label = 1 and newprediction = 0 ")

    tnA = spark.sql("select label, oldPrediction, newPrediction " 
          " from newThreshold where label = 0 and newprediction = 0  ")

    # end exercise +++++++++++++++++++++++++++++++++ 

    return (tpA.count(), fpA.count(), fnA.count(), tnA.count())




## Compute Metrics as a Function of Threshold

The default classification threshold of 0.5 may not be ideal. Let's explore this possibility and use some standard metrics to evaluate model fitness.
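
As a small illustration of what changing the threshold does, the sketch below counts confusion-matrix entries for a handful of invented labels and probabilities. `confusion_counts` is a hypothetical helper that mirrors the `p1 > threshold` rule used in `getCMEntries`, without Spark.

```python
def confusion_counts(labels, positive_probs, threshold):
    """Return (tp, fp, fn, tn) when predicting 1 if p1 > threshold."""
    tp = fp = fn = tn = 0
    for label, p1 in zip(labels, positive_probs):
        predicted = 1 if p1 > threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Invented example data: true labels and predicted probability of class 1
labels = [1, 1, 0, 0, 1]
probs = [0.9, 0.4, 0.6, 0.2, 0.7]

print(confusion_counts(labels, probs, 0.5))  # (2, 1, 1, 1)
print(confusion_counts(labels, probs, 0.3))  # (3, 1, 0, 1)
```

Lowering the threshold from 0.5 to 0.3 turns the one false negative into a true positive here, which is the kind of trade-off the metrics below quantify.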

True Positive Rate:

$TPR = TP/(TP + FN) = TP/P$

False Positive Rate:
$FPR = FP/(FP + TN) = FP/N$

Matthews Correlation Coefficient:
$\text{MCC} = \frac{ TP \times TN - FP \times FN } {\sqrt{ (TP + FP) ( TP + FN ) ( TN + FP ) ( TN + FN ) } }$
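
The three metrics can be checked by hand with a short function. This is a sketch under one stated assumption: when a denominator is zero, `classification_metrics` (a hypothetical helper) returns 0.0, which is a common convention but not the only one. The counts in the example are invented.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Return (TPR, FPR, MCC) from confusion-matrix counts."""
    tpr = tp / float(tp + fn) if (tp + fn) else 0.0
    fpr = fp / float(fp + tn) if (fp + tn) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report MCC as 0.0 when it is undefined (zero denominator)
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return tpr, fpr, mcc

tpr, fpr, mcc = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(round(tpr, 4), round(fpr, 4), round(mcc, 4))  # 0.6667 0.25 0.4082
```

Unlike accuracy, MCC uses all four entries of the confusion matrix, so it stays informative when the two classes have very different sizes.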


Area Under Curve (AUC):

Using the trapezoid rule for each discrete element, the Area Under the Curve (AUC) for the Receiver Operating Characteristic (ROC) is:

$\int_{x_0}^{x_N} f(x)\, dx \approx \sum_{i=0}^{N-1}(x_{i+1}-x_i) \cdot \left[\frac{f(x_{i+1}) + f(x_{i})}{2} \right]$

Use this formula to compute the AUC for the ROC, which is a plot of TPR (y axis) vs FPR (x axis).
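
The trapezoid rule is easy to verify on a few points. The sketch below uses a hypothetical helper, `trapezoid_auc`, that sorts the ROC points by FPR first, so it does not depend on whether the thresholds were visited in ascending or descending order. The sample points are invented.

```python
def trapezoid_auc(fpr, tpr):
    """Area under the ROC curve by the trapezoid rule."""
    points = sorted(zip(fpr, tpr))  # integrate from left to right along FPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * 0.5 * (y0 + y1)
    return area

# Ascending thresholds produce descending FPR, so these points arrive "backwards"
fpr = [1.0, 0.5, 0.0]
tpr = [1.0, 0.8, 0.0]
print(round(trapezoid_auc(fpr, tpr), 3))  # 0.65

# A classifier on the diagonal scores 0.5
print(trapezoid_auc([0.0, 1.0], [0.0, 1.0]))  # 0.5
```

Summing the same terms in the original (descending FPR) order gives the negative of this area, which is why a sign flip is needed when the points are not sorted.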


In [None]:
import numpy as np
numBins = 10 
thresholds = np.array(range(0, numBins + 1))*1.0/numBins

In [None]:
def getModelThresholdStats(model_df, data):
    tp = np.array([i for i in range(0, numBins + 1 )])
    fp = np.array([i for i in range(0, numBins + 1 )])
    fn = np.array([i for i in range(0, numBins + 1 )])
    tn = np.array([i for i in range(0, numBins + 1 )])


#generate dataframe to be used in thresholding:
    predictionsForROC = model_df.transform(data)
    predictionsForROC.registerTempTable("predictions")
    columnsForCM = spark.sql("select probability, prediction, label from predictions")
    # float() replaces np.asscalar, which is deprecated and removed in recent NumPy releases
    extractedProbability = columnsForCM.rdd.map(lambda x: Row(p1 = float(x[0][1]), prediction=x[1] , label=x[2])).toDF()
    extractedProbability.registerTempTable("inputToThreshold")

# get the total number of positives and negatives in dataset:
    # begin exercise +++++++++++++++++++++++++++++++++ 
    p = spark.sql("select label from predictions where label = 1").count()
    n = spark.sql("select label from predictions where label = 0").count()

# We know the number of true positives, etc. at the threshold edges:
    (tp[0],fp[0],fn[0],tn[0]) = (p, n, 0, 0)
    (tp[-1],fp[-1],fn[-1],tn[-1]) = (0, 0, p, n)
    # end exercise +++++++++++++++++++++++++++++++++    
    
    for (i, threshold) in zip(range(0, numBins + 1),thresholds):
        print(i, threshold)
        if (i>0 and i<numBins): 
            (tp[i],fp[i],fn[i],tn[i]) = getCMEntries(threshold)
        print(tp[i],fp[i],fn[i],tn[i])

# begin exercise +++++++++++++++++++++++++++++++++ 
    tpr = tp*1.0/(tp + fn)
    fpr = fp*1.0/(fp + tn)
    mcc = (tp*tn - fp*fn)*1.0 / np.sqrt((tp + fp)*(tp + fn)*(tn + fp)*(tn + fn))
    accThreshold = (tp + tn)*1.0/(p + n)  # accuracy as a function of threshold
    auc = - np.array([(fpr[i+1]-fpr[i])*0.5*(tpr[i+1]+tpr[i]) for i in range(0,numBins)]).sum()    
# end exercise +++++++++++++++++++++++++++++++++
    return (tpr, fpr, mcc, accThreshold, auc)

In [None]:
print("getting stats for " + m1Name + ":  train")
(tpr1,fpr1,mcc1,acc1,auc1) = getModelThresholdStats(model1, train_data)
print("getting stats for " + m1Name + ":  test")
(tpr1Test,fpr1Test,mcc1Test,acc1Test,auc1Test) = getModelThresholdStats(model1, test_data)

print("getting stats for " + m2Name + ":  train")
(tpr2,fpr2,mcc2,acc2,auc2) = getModelThresholdStats(model2, train_data)
print("getting stats for " + m2Name + ":  test")
(tpr2Test,fpr2Test,mcc2Test,acc2Test,auc2Test) = getModelThresholdStats(model2, test_data)

print("getting stats for " + m3Name + ":  train")
(tpr3,fpr3,mcc3,acc3,auc3) = getModelThresholdStats(model3, train_data)
print("getting stats for " + m3Name + ":  test")
(tpr3Test,fpr3Test,mcc3Test,acc3Test,auc3Test) = getModelThresholdStats(model3, test_data)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (15,5)

plt.subplot(121)
plt.plot( thresholds, mcc1, '-o')
plt.plot(thresholds, mcc2, '-o')
plt.plot(thresholds, mcc3, '-o')
plt.title("Matthews Correlation Coefficient")
plt.xlabel("Threshold")
plt.ylabel("MCC")
plt.ylim(0,0.75)
plt.legend([m1Name + ": AUC (train) = "+ str(round(auc1,3)),
            m2Name + ": AUC (train) = "+ str(round(auc2,3)),
            m3Name + ": AUC (train) = "+ str(round(auc3,3))])

plt.subplot(122)
plt.plot( thresholds, mcc1Test, '-o')
plt.plot( thresholds, mcc2Test, '-o')
plt.plot( thresholds, mcc3Test, '-o')
plt.title("Matthews Correlation Coefficient")
plt.xlabel("Threshold")
plt.ylabel("MCC")
plt.ylim(0,0.75)
plt.legend([m1Name + ": AUC (test) = "+ str(round(auc1Test,3)),
            m2Name + ": AUC (test) = "+ str(round(auc2Test,3)),
            m3Name + ": AUC (test) = "+ str(round(auc3Test,3))])
#plt.figure(figzsize=(18,16))
plt.show()

plt.subplot(121)
plt.plot(thresholds, acc1,'-o')
plt.plot(thresholds, acc2,'-o')
plt.plot(thresholds, acc3,'-o')
plt.title("Accuracy")
plt.ylabel("Accuracy")
plt.xlabel("Threshold")
plt.ylim(0,1.0)
plt.subplot(122)
plt.plot(thresholds, acc1Test,'-o')
plt.plot(thresholds, acc2Test,'-o')
plt.plot(thresholds, acc3Test,'-o')
plt.title("Accuracy")
plt.ylabel("Accuracy")
plt.xlabel("Threshold")
plt.ylim(0,1.0)

plt.show()

plt.subplot(121)
plt.title("Receiver Operating Characteristic")
plt.plot(fpr1, tpr1,'-o')
plt.plot(fpr2, tpr2,'-o')
plt.plot(fpr3, tpr3,'-o')
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.subplot(122)
plt.title("Receiver Operating Characteristic")
# Plot the test-set ROC against the test-set false positive rates
plt.plot(fpr1Test, tpr1Test,'-o')
plt.plot(fpr2Test, tpr2Test,'-o')
plt.plot(fpr3Test, tpr3Test,'-o')
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.show()

You can now tune your model to achieve better accuracy. To keep this example simple, the tuning section is omitted.

## 5. Persist model
In this section you will learn how to store your pipeline and model in the Watson Machine Learning repository by using the Python client libraries.
First, import the client libraries.

In [None]:
from repository.mlrepositoryclient import MLRepositoryClient
from repository.mlrepositoryartifact import MLRepositoryArtifact

Authenticate to Watson Machine Learning service on Bluemix.

## **STOP here !!!!:** 
Enter the authentication information (username and password) from your instance of the Watson Machine Learning service here.

In [None]:
service_path = 'https://ibm-watson-ml.mybluemix.net'
username = 'xxxxxxxxxxxx'
password = 'xxxxxxxxxxxx'

**Tip:** service_path, username and password can be found on Service Credentials tab of the Watson Machine Learning service instance created in Bluemix.

In [None]:
ml_repository_client = MLRepositoryClient(service_path)
ml_repository_client.authorize(username, password)

Create model artifact (abstraction layer).

In [None]:
model_artifact = MLRepositoryArtifact(model_rf, training_data=train_data, name="Heart Failure Prediction Model")

**Tip:** The MLRepositoryArtifact method expects a trained model object, training data, and a model name. (It is this model name that is displayed by the Watson Machine Learning service).

### 5.1: Save pipeline and model
In this subsection you will learn how to save pipeline and model artifacts to your Watson Machine Learning instance.

In [None]:
saved_model = ml_repository_client.models.save(model_artifact)

Get saved model metadata from Watson Machine Learning.
**Tip:** Use `meta.available_props()` to get the list of available props.

In [None]:
saved_model.meta.available_props()

In [None]:
# print() with a single argument works on both Python 2 and Python 3
print("modelType: " + saved_model.meta.prop("modelType"))
print("trainingDataSchema: " + str(saved_model.meta.prop("trainingDataSchema")))
print("creationTime: " + str(saved_model.meta.prop("creationTime")))
print("modelVersionHref: " + saved_model.meta.prop("modelVersionHref"))
print("label: " + saved_model.meta.prop("label"))


### 5.2: Load model to verify that it was saved correctly
You can load your model to make sure that it was saved correctly.

In [None]:
loadedModelArtifact = ml_repository_client.models.get(saved_model.uid)

Print the model name to make sure that the model artifact has been loaded correctly.

In [None]:
print(str(loadedModelArtifact.name))

Congratulations. You've successfully created a predictive model and saved it in the Watson Machine Learning service. You can now switch to the Watson Machine Learning console to deploy the model and then test it in an application.


## 6. Accessing Watson ML Models and Deployments through the API
Instead of switching from your notebook to a web browser, you can manage your models and deployments through a set of APIs.


Recap: saving an existing ML model by using the Python SDK

`pip install watson-machine-learning-client`

[SDK Documentation](https://watson-ml-staging-libs.mybluemix.net/repository-python/index.html)

In [None]:
#Import Python WatsonML Repository SDK
from repository.mlrepositoryclient import MLRepositoryClient
from repository.mlrepositoryartifact import MLRepositoryArtifact

#Specify your username and password credentials for Watson ML
service_path = 'https://ibm-watson-ml.mybluemix.net'
username = 'xxxxxxxxxx'
password = 'xxxxxxxxxx'

#Authenticate
ml_repository_client = MLRepositoryClient(service_path)
ml_repository_client.authorize(username, password)

#Deploy a new model.  I renamed the existing model as it has already been created above
model_artifact = MLRepositoryArtifact(model_rf, training_data=train_data, name="Heart Failure Prediction Model V2")

### 6.1 Get the Watson ML API Token
The Watson ML API authenticates all requests through a token. Start by requesting the token from your Watson ML service.

In [None]:
url = 'https://ibm-watson-ml.mybluemix.net'
username = 'xxxxxxxxxx'
password = 'xxxxxxxxxx'
instance_id = "xxxxxxxxxxx"

In [None]:
import json
import requests
from base64 import b64encode

token_url = url + "/v3/identity/token"

# Encode the credentials to bytes first; b64encode requires bytes on Python 3
headers = {'authorization': "Basic {}".format(b64encode((username + ":" + password).encode("utf-8")).decode("ascii"))}

response = requests.request("GET", token_url, headers=headers)

watson_ml_token = json.loads(response.text)['token']
print(watson_ml_token)
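
The Basic authorization header used in the token request can be isolated in a small function and checked without contacting the service. This is a minimal sketch: `basic_auth_header` is a hypothetical helper and the credentials are placeholders.

```python
from base64 import b64encode

def basic_auth_header(username, password):
    """Build an HTTP Basic authorization header from a username and password."""
    # Encode to bytes first: b64encode requires bytes on Python 3
    credentials = (username + ":" + password).encode("utf-8")
    return {"authorization": "Basic " + b64encode(credentials).decode("ascii")}

header = basic_auth_header("user", "pass")  # placeholder credentials
print(header)  # {'authorization': 'Basic dXNlcjpwYXNz'}
```

Base64 is an encoding, not encryption, so the header should only be sent over HTTPS, as the service URL in this notebook does.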

### 6.2 Preview currently published models

In [None]:
model_url = url + "/v3/wml_instances/" + instance_id + "/published_models"

headers = {'authorization': 'Bearer ' + watson_ml_token }
response = requests.request("GET", model_url, headers=headers)

published_models = json.loads(response.text)
print(json.dumps(published_models, indent=2))

Read the details of any returned models

In [None]:
print('{} model(s) are available in your Watson ML Service'.format(published_models['count']))
for model in published_models['resources']:
    print('\t- name:        {}'.format(model['entity']['name']))
    print('\t  model_id:    {}'.format(model['metadata']['guid']))
    print('\t  deployments: {}'.format(model['entity']['deployments']['count']))

Create a new deployment of the Model

In [None]:
#model_id = 'fceec826-db51-4217-b15b-15ff635fb30e'
model_id = '904c9abc-9460-4f6e-a5a8-20b9579f0913'

deployment_url = url + "/v3/wml_instances/" + instance_id + "/published_models/" + model_id + "/deployments"

payload = "{\"name\": \"Heart Failure Prediction Model Deployment\", \"description\": \"First deployment of Heart Failure Prediction Model\", \"type\": \"online\"}"
headers = {'authorization': 'Bearer ' + watson_ml_token, 'content-type': "application/json" }

response = requests.request("POST", deployment_url, data=payload, headers=headers)

print(response.text)

In [None]:
deployment = json.loads(response.text)

print('Model {} deployed.'.format(model_id))
print('\tname: {}'.format(deployment['entity']['name']))
print('\tdeployment_id: {}'.format(deployment['metadata']['guid']))
print('\tstatus: {}'.format(deployment['entity']['status']))
print('\tscoring url: {}'.format(deployment['entity']['scoring_url']))

Monitor the status of deployment

In [None]:
#deployment_id = "eaa399a5-ce94-42cf-889e-0b9ee5f57642"
deployment_id = "6d38f6d8-efce-4159-b4e0-faa20df57f65"
deployment_details_url = url + "/v3/wml_instances/" + instance_id + "/published_models/" + model_id + "/deployments/" + deployment_id

headers = {'authorization': 'Bearer ' + watson_ml_token, 'content-type': "application/json" }

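# deployment_url lists all deployments of the model, which is what the next cell iterates over;
# deployment_details_url (defined above) addresses this single deployment instead
response = requests.request("GET", deployment_url, headers=headers)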
print(response.text)

In [None]:
deployment_details = json.loads(response.text)

for resources in deployment_details['resources']:
    print('name: {}'.format(resources['entity']['name']))
    print('status: {}'.format(resources['entity']['status']))
    print('scoring url: {}'.format(resources['entity']['scoring_url']))

### 6.3 Invoke prediction model deployment
Define a method to call scoring url. Replace the **scoring_url** in the method below with the scoring_url returned from above.

In [None]:
def get_prediction_ml(ahb, ppd, chol, bmi, age, sex, fh, smoker, exercise_minutes ):
    scoring_url = 'https://ibm-watson-ml.mybluemix.net/v3/wml_instances/024597e5-3b8c-43a1-a2b9-c3295a07bb2f/published_models/' + model_id + '/deployments/' + deployment_id + '/online'
    scoring_payload = { "fields":["AVGHEARTBEATSPERMIN","PALPITATIONSPERDAY","CHOLESTEROL","BMI","AGE","SEX","FAMILYHISTORY","SMOKERLAST5YRS","EXERCISEMINPERWEEK"],"values":[[ahb, ppd, chol, bmi, age, sex, fh, smoker, exercise_minutes]]}
    header = {'authorization': 'Bearer ' + watson_ml_token, 'content-type': "application/json" }
    scoring_response = requests.post(scoring_url, json=scoring_payload, headers=header)
    return (json.loads(scoring_response.text).get("values")[0][18])
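
The fixed position `[18]` in the function above depends on the exact column layout of the scoring response. As a hedged alternative, and assuming the response carries a `fields` list parallel to each row in `values` (the request payload has this shape, but the response shape is an assumption here), a value can be looked up by name. `field_value` is a hypothetical helper and the sample response is invented.

```python
def field_value(scoring_result, field_name, row=0):
    """Look up one value by field name in a fields/values style result."""
    index = scoring_result["fields"].index(field_name)
    return scoring_result["values"][row][index]

# Invented sample in the assumed shape; a real response has more columns
sample_result = {
    "fields": ["AGE", "prediction", "predictedLabel"],
    "values": [[44, 1.0, "Y"]],
}

print(field_value(sample_result, "predictedLabel"))  # Y
```

Looking up the column by name keeps the code working if the pipeline later adds or reorders intermediate columns.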

### Call get_prediction_ml method exercising our prediction model

In [None]:
print('Is a 44 year old female that smokes with a low BMI at risk of Heart Failure?: {}'.format(get_prediction_ml(100,85,242,24,44,"F","Y","Y",125)))