# Predict heart failure with Watson Machine Learning

![](https://www.cdc.gov/vitalsigns/heartdisease-stroke/images/graph4_980px.jpg)

This notebook contains the steps and code to build a model that predicts heart failure and to store that model in Watson Machine Learning so that it can be used in an application.

## Learning Goals

The learning goals of this notebook are:

* Load a CSV file from the Object Storage service linked to your Watson Studio project
* Create an Apache Spark machine learning model
* Train and evaluate a model
* Persist a model in a Watson Machine Learning repository

## 1. Setup

Before you use the sample code in this notebook, you must perform the following setup tasks (also mentioned in the course "Analyzing and Predicting Heart Failure on IBM Cloud"):

* Create a Watson Machine Learning service instance (a free plan is offered) and associate it with your project
* Upload heart failure data to the Object Store service that is part of Watson Studio

We'll be using a few libraries for this exercise:

1. [Machine learning and AI in Watson Studio](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/wml-ai.html): Client library to work with the Watson Machine Learning service on IBM Cloud.
1. [PixieDust](https://github.com/pixiedust/pixiedust): Python helper library for Jupyter Notebooks
1. [ibmos2spark](https://github.com/ibm-watson-data-lab/ibmos2spark): Facilitates Data I/O between Spark and IBM Object Storage services

In [None]:
!pip install -U ibm-watson-machine-learning
!pip install --upgrade pixiedust

Next, import `SparkSession` to confirm that the Spark runtime is available. If everything is set up correctly, the cell runs without raising an error.

In [None]:
try:
    from pyspark.sql import SparkSession
except ImportError:
    # Catch only import failures so that unrelated errors are not hidden.
    print('Error: Spark runtime is missing. If you are using Watson Studio change the notebook runtime to Spark.')
    raise

## 2. Load and explore data

In this section you will load the data as an Apache Spark DataFrame and perform a basic exploration. Load the data to the Spark DataFrame from your associated Object Storage instance.

> **IMPORTANT**: Follow the lab instructions to insert an Apache Spark DataFrame in the cell below.

> **IMPORTANT**: Ensure the DataFrame is named `df_data`.

> **IMPORTANT**: Add `.option('inferSchema','True')\` to the inserted code.

In [None]:

  .option('inferSchema','True')\


Explore the loaded data by using the following Apache Spark DataFrame methods:

* `df_data.printSchema()` to print the data schema
* `df_data.show()` to print the top twenty records
* `df_data.describe().show()` to print summary statistics for each column
* `df_data.count()` to count all records
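
To make the difference between showing rows and describing a column concrete, here is a minimal plain-Python sketch of the kind of summary that `describe()` reports for a numeric column (count, mean, standard deviation, min, max). It does not use Spark; `summarize` and the toy heart-rate values are made up for illustration.

```python
import statistics

def summarize(values):
    """Return describe()-style summary statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

# Hypothetical toy column, not taken from the heart failure data set.
toy_heartbeats = [60, 70, 80]
toy_summary = summarize(toy_heartbeats)
for name, value in toy_summary.items():
    print(name, value)
```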

In [None]:
df_data.printSchema()

As you can see, the data contains ten fields. The `HEARTFAILURE` field is the one we would like to predict (the label).

In [None]:
df_data.show()

In [None]:
df_data.describe().show()

In [None]:
df_data.count()

As you can see, the data set contains 10800 records.

## 3. Interactive visualizations with PixieDust

In [None]:
import pixiedust

### Simple visualization using bar charts

With PixieDust's `display()` method you can visually explore the loaded data using built-in charts such as bar charts, line charts, scatter plots, or maps.
To explore a data set, choose the desired chart type from the drop-down menu, then configure the chart options and the display options.

In [None]:
display(df_data)

## 4. Create a Spark machine learning model

In this section you will learn how to prepare the data and how to create and train a Spark machine learning model.

### 4.1 Prepare data

In this subsection you will split your data into a train data set and a test data set.

In [None]:
split_data = df_data.randomSplit([0.8, 0.20], 24)
train_data = split_data[0]
test_data = split_data[1]

print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))

As you can see, the data has been split into two data sets:

* The train data set, which is the larger of the two, is used to train the model.
* The test data set is held back from training and is used to evaluate how well the model predicts records it has not seen.
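
The record counts printed above are usually close to, but not exactly, 80% and 20% of the data, because `randomSplit` assigns rows by random sampling rather than by cutting at a fixed position. The following plain-Python sketch illustrates the idea under that assumption. `bernoulli_split` is a hypothetical helper, and because it uses Python's random number generator rather than Spark's, it will not reproduce the exact counts that Spark prints.

```python
import random

def bernoulli_split(rows, train_fraction, seed):
    """Assign each row to train or test with one seeded random draw per row."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        # Each row lands in the training set with probability train_fraction.
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

# Stand-in row ids; 10800 matches the record count reported earlier.
toy_rows = list(range(10800))
train_rows, test_rows = bernoulli_split(toy_rows, train_fraction=0.8, seed=24)
print("Number of training records:", len(train_rows))
print("Number of testing records :", len(test_rows))
```

Running the split again with the same seed returns the same two lists, which is why the notebook passes a fixed seed to `randomSplit`.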

### 4.2 Create pipeline and train a model

In this section you will create a Spark machine learning pipeline and then train the model. First, import the Spark machine learning packages that are needed in the subsequent steps. A sequence of data processing steps is called a _data pipeline_. Each step in the pipeline processes the data and passes the result to the next step, so you can fit the model and apply it directly to the raw input data.
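
As a rough illustration of the pipeline idea, the sketch below chains two stages in plain Python. It is not Spark's implementation: `MeanCenter`, `AssembleFeatures`, and `SimplePipeline` are hypothetical classes, and the rows are made-up toy values. Fitting runs each stage in order and feeds its output to the next stage; transforming replays the fitted stages on new data.

```python
class MeanCenter:
    """Estimator-like stage: learns a column mean in fit, subtracts it in transform."""
    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col
        self.mean = None

    def fit(self, rows):
        values = [row[self.input_col] for row in rows]
        self.mean = sum(values) / len(values)
        return self

    def transform(self, rows):
        return [dict(row, **{self.output_col: row[self.input_col] - self.mean})
                for row in rows]


class AssembleFeatures:
    """Transformer-like stage: combines several columns into one feature list."""
    def __init__(self, input_cols, output_col):
        self.input_cols = input_cols
        self.output_col = output_col

    def fit(self, rows):
        return self  # nothing to learn

    def transform(self, rows):
        return [dict(row, **{self.output_col: [row[c] for c in self.input_cols]})
                for row in rows]


class SimplePipeline:
    """Runs stages in order, passing each stage's output to the next."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


toy_train = [{"AGE": 40, "BMI": 22.0}, {"AGE": 60, "BMI": 30.0}]
toy_pipeline = SimplePipeline([
    MeanCenter("AGE", "AGE_CENTERED"),
    AssembleFeatures(["AGE_CENTERED", "BMI"], "features"),
]).fit(toy_train)

toy_new = toy_pipeline.transform([{"AGE": 55, "BMI": 25.0}])
print(toy_new[0]["features"])
```

Note that the mean learned from the training rows is reused when transforming the new row, which mirrors how a fitted pipeline applies what it learned during training to unseen data.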

In [None]:
from pyspark.ml.feature import StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model

In the following step, convert all the string fields to numeric ones by using the StringIndexer transformer.

In [None]:
stringIndexer_label = StringIndexer(inputCol="HEARTFAILURE", outputCol="label").fit(df_data)
stringIndexer_sex = StringIndexer(inputCol="SEX", outputCol="SEX_IX")
stringIndexer_famhist = StringIndexer(inputCol="FAMILYHISTORY", outputCol="FAMILYHISTORY_IX")
stringIndexer_smoker = StringIndexer(inputCol="SMOKERLAST5YRS", outputCol="SMOKERLAST5YRS_IX")
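
By default, `StringIndexer` orders labels by descending frequency, so the most frequent value receives index 0. The plain-Python sketch below imitates that mapping on a made-up column and then reverses it, which is the role `IndexToString` plays later in the pipeline. `fit_string_index` is a hypothetical helper, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically (sketch assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}

# Hypothetical toy column, not taken from the heart failure data set.
toy_sex = ["M", "F", "M", "M", "F"]
toy_mapping = fit_string_index(toy_sex)
toy_indexed = [toy_mapping[value] for value in toy_sex]

# Reverse lookup: position i holds the original label for index i.
toy_labels = sorted(toy_mapping, key=toy_mapping.get)
toy_restored = [toy_labels[int(index)] for index in toy_indexed]

print(toy_mapping)
print(toy_indexed)
print(toy_restored)
```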


In the following step, create a feature vector by combining all features together.

In [None]:
vectorAssembler_features = VectorAssembler(inputCols=["AVGHEARTBEATSPERMIN","PALPITATIONSPERDAY","CHOLESTEROL","BMI","AGE","SEX_IX","FAMILYHISTORY_IX","SMOKERLAST5YRS_IX","EXERCISEMINPERWEEK"], outputCol="features")

Next, define estimators you want to use for classification. Random Forest is used in the following example.

In [None]:
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

Finally, convert the indexed labels back to the original labels.

In [None]:
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=stringIndexer_label.labels)

In [None]:
transform_df_pipeline = Pipeline(stages=[stringIndexer_label, stringIndexer_sex, stringIndexer_famhist, stringIndexer_smoker, vectorAssembler_features])
transformed_df = transform_df_pipeline.fit(df_data).transform(df_data)
transformed_df.show()

Let's build the pipeline now. A pipeline consists of transformers and an estimator.

In [None]:
pipeline_rf = Pipeline(stages=[stringIndexer_label, stringIndexer_sex, stringIndexer_famhist, stringIndexer_smoker, vectorAssembler_features, rf, labelConverter])

Now, you can train your Random Forest model by using the previously defined **pipeline** and **training data**.

In [None]:
model_rf = pipeline_rf.fit(train_data)

You can check your **model accuracy** now. To evaluate the model, use **test data**.

In [None]:
predictions = model_rf.transform(test_data)
evaluatorRF = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluatorRF.evaluate(predictions)
print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))
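
For reference, accuracy is the fraction of records whose predicted label equals the true label, and the test error is one minus that fraction. The sketch below computes both in plain Python on made-up values; `accuracy_score` and the toy lists are hypothetical and are not part of the Spark API.

```python
def accuracy_score(labels, predictions):
    """Return the fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one record is required")
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return correct / len(labels)

# Hypothetical toy values: four of the five predictions are correct.
toy_true = [0.0, 1.0, 1.0, 0.0, 1.0]
toy_pred = [0.0, 1.0, 0.0, 0.0, 1.0]

toy_accuracy = accuracy_score(toy_true, toy_pred)
print("Accuracy = %g" % toy_accuracy)
print("Test Error = %g" % (1.0 - toy_accuracy))
```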

You can now tune your model to achieve better accuracy. To keep this example simple, model tuning is omitted.

## 5. Persist model

In this section you will learn how to store your pipeline and model in the Watson Machine Learning repository by using the Python client library that we installed earlier.



> **IMPORTANT**: Update the `wml_credentials` variable below. Replace the `apikey` value `(Replace Me)` with the API key that you copied earlier in the course. Replace the `url` value with the endpoint for the location of your Machine Learning service:

    Dallas - "https://us-south.ml.cloud.ibm.com"
    London - "https://eu-gb.ml.cloud.ibm.com"
    Frankfurt - "https://eu-de.ml.cloud.ibm.com"
    Tokyo - "https://jp-tok.ml.cloud.ibm.com"
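
If you prefer to select the endpoint by location name instead of pasting the URL, a small lookup like the one below is one option. This is only a sketch: `WML_REGION_URLS` and `build_wml_credentials` are hypothetical names that are not part of the Watson Machine Learning client, and the URLs are the four listed above.

```python
# Hypothetical lookup table built from the endpoints listed above.
WML_REGION_URLS = {
    "dallas": "https://us-south.ml.cloud.ibm.com",
    "london": "https://eu-gb.ml.cloud.ibm.com",
    "frankfurt": "https://eu-de.ml.cloud.ibm.com",
    "tokyo": "https://jp-tok.ml.cloud.ibm.com",
}

def build_wml_credentials(region, apikey):
    """Return a credentials dictionary for the given location name."""
    key = region.strip().lower()
    if key not in WML_REGION_URLS:
        raise ValueError("Unknown location: %r. Expected one of: %s"
                         % (region, ", ".join(sorted(WML_REGION_URLS))))
    return {"url": WML_REGION_URLS[key], "apikey": apikey}

# Replace the placeholder with your own API key before using the result.
example_credentials = build_wml_credentials("London", "(Replace Me)")
print(example_credentials["url"])
```

The returned dictionary has the same two keys as the `wml_credentials` variable in the next cell.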


In [None]:
from ibm_watson_machine_learning import APIClient
wml_credentials = {
    "url": "(Replace Me)",
    "apikey": "(Replace Me)"
}
client = APIClient(wml_credentials)

As a quick check, print the client version.

In [None]:
print(client.version)

> **IMPORTANT**: Update the `space_uid` variable below. Replace the value `(Replace Me)` with the space UID that you copied earlier in the course. You can also look up your space ID by running `client.spaces.list(limit=5)`.

In [None]:
client.spaces.list(limit=5)

In [None]:
space_uid = "(Replace Me)"

Now set it as the default space.

In [None]:
client.set.default_space(space_uid)

Now specify the software specification and the model metadata that the service needs to run the model properly. These values have already been filled in for you.


In [None]:
# Model Metadata
software_spec_uid = client.software_specifications.get_uid_by_name('spark-mllib_2.4-py37')

model_props={
    client.repository.ModelMetaNames.NAME: "Heart failure",
    client.repository.ModelMetaNames.SPACE_UID: space_uid,
    client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: software_spec_uid,
    client.repository.ModelMetaNames.TYPE: "mllib_2.4"
}

In [None]:
published_model = client.repository.store_model(model=model_rf, pipeline=pipeline_rf, meta_props=model_props, training_data=train_data)


### 5.1 Save pipeline and model

In this subsection you will learn how to save pipeline and model artifacts to your Watson Machine Learning instance.

Print the ID of the published model to confirm that it was stored.

In [None]:
published_model_ID = client.repository.get_model_uid(published_model)
print("Model Id: " + str(published_model_ID))


### 5.2 Load model to verify that it was saved correctly

You can load your model to make sure that it was saved correctly.

In [None]:
loaded_model = client.repository.load(published_model_ID)
print(loaded_model)

Run the loaded model against the test data to verify that it was loaded correctly, and examine the top three results.

In [None]:
test_predictions = loaded_model.transform(test_data)
test_predictions.select('probability', 'predictedLabel').show(n=3, truncate=False)

## <font color=green>Congratulations</font>, you've successfully created a predictive model and saved it in the Watson Machine Learning service.

That's about it for the notebook. Please make sure you save your work and then switch back to the course to see how to deploy and integrate this model with your web app.

***