<table style="border: none" align="left">
   <tr style="border: none">
      <th style="border: none"><font face="verdana" size="5" color="black"><b>Build a loan default scoring model and service</b></font></th>
      <th style="border: none"><img src="https://github.com/pmservice/customer-satisfaction-prediction/blob/master/app/static/images/ml_icon_gray.png?raw=true" alt="Watson Machine Learning icon" height="40" width="40"></th>
   </tr>
</table>

This notebook contains steps and code to get a loan dataset, create a predictive model, and start scoring new data. This notebook introduces commands for getting data and for basic data cleaning and exploration, model creation, model training, model persistence to the Open Prediction Service, model deployment, and scoring.

Some familiarity with Python is helpful. This notebook uses Python 3.


## Learning goals

You will learn how to:

-  Load a CSV file into a Pandas DataFrame.
-  Explore data.
-  Prepare data for training and evaluation.
-  Create a scikit-learn machine learning model.
-  Train and evaluate a model.
-  Store and deploy a model in Open Prediction Service.
-  Score sample scoring data using an Open Prediction Service invocation.
-  Explore and visualize the prediction result using the plotly package.


## Contents

This notebook contains the following parts:

1.	[Set up](#setup)
2.	[Load and explore data](#load)
3.	[Create a scikit-learn machine learning model](#model)
4.	[Store the model in the provider of your choice](#provider)
5.	[Use Plotly to visualize data](#plotly)
6.	[Summary and next steps](#summary)

<a id="setup"></a>
## 1. Set up

Before you use the sample code in this notebook, you must perform the following setup tasks:

You will need at least one provider to use your model in ADS.

For a **Watson Machine Learning provider** you must:
- Create a <a href="https://console.ng.bluemix.net/catalog/services/ibm-watson-machine-learning/" target="_blank" rel="noopener noreferrer">Watson Machine Learning (WML) Service</a> instance. A lite plan is offered, and instructions for creating the instance are <a href="https://dataplatform.ibm.com/docs/content/analyze-data/wml-setup.html" target="_blank" rel="noopener noreferrer">here</a>.

For an **Open Prediction Service provider** you must:

**TODO change LINK to documentation LINK**
-  Create a <a href="https://console.ng.bluemix.net/catalog/services/ibm-watson-machine-learning/" target="_blank" rel="noopener noreferrer">Open Prediction Service</a> instance


<a id="load"></a>
## 2. Load and explore data

In this section you will load the data as a Pandas DataFrame and perform a basic exploration.

Use *wget* to download the data file to the notebook's file system (gpfs), then read it into a Pandas DataFrame with the pandas *read_csv* method.

In [None]:
# Install wget if you don't already have it.
!pip install wget

In [None]:
import wget

#link_to_data = 'https://raw.githubusercontent.com/ODMDev/decisions-on-spark/master/data/miniloan/miniloan-riskscore-1K-v1.0.csv'
#link_to_data = 'https://raw.githubusercontent.com/ODMDev/decisions-on-spark/master/data/miniloan/miniloan-payment-default-risk-v2.0.csv'
link_to_data = 'https://raw.githubusercontent.com/ODMDev/decisions-on-spark/master/data/miniloan/miniloan-payment-default-cases-v2.0.csv'

filename = wget.download(link_to_data)

print(filename)

Import the libraries required to create the Pandas DataFrame.

In [None]:
import numpy as np
import pandas as pd

Load the file into a Pandas DataFrame using the code below.

In [None]:
used_names = ['creditScore', 'income', 'loanAmount', 'monthDuration', 'rate', 'yearlyReimbursement', 'paymentDefault']

df = pd.read_csv(
    filename,
    header=0,
    delimiter=r'\s*,\s*',
    engine='python'
).replace(
    [np.inf, -np.inf], np.nan
).dropna().loc[:, used_names]

Explore the loaded data by using the following Pandas DataFrame methods:
-  print types
-  print the first five records
-  count all records

In [None]:
df.dtypes

As you can see, the data contains seven fields. The `paymentDefault` field is the one you would like to predict (the label).

In [None]:
df.head()

In [None]:
print("Number of records: " + str(len(df)))
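
Because the label is binary, it can also help to check how balanced the two classes are before training. The following is a minimal sketch on a small hand-made DataFrame (the values are illustrative and are not taken from the miniloan data set); on the real data you would call the same methods on `df`.

```python
import pandas as pd

# Illustrative values only; not taken from the miniloan data set.
sample = pd.DataFrame({
    "creditScore": [580, 720, 650, 700, 610],
    "loanAmount": [200000, 150000, 300000, 120000, 250000],
    "paymentDefault": [1, 0, 1, 0, 0],
})

# Share of each label value; a strong imbalance makes plain accuracy misleading.
class_share = sample["paymentDefault"].value_counts(normalize=True)
print(class_share)

# Summary statistics (count, mean, std, quartiles) for the numeric columns.
print(sample.describe())
```

If one class dominates, metrics such as balanced accuracy are usually more informative than plain accuracy.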

<a id="model"></a>
## 3. Create a scikit-learn machine learning model

In this section you will learn how to:

- [3.1 Prepare data](#prep)
- [3.2 Create a model](#pipe)
- [3.3 Train a model](#train)

### 3.1 Prepare data<a id="prep"></a>

In this subsection you will split your data into: 
- train data set
- test data set
- predict data set

In [None]:
splitted_data = np.split(df.sample(frac=1, random_state=42), [int(.8*len(df)), int((.8+.18)*len(df))])
train_data = splitted_data[0]
test_data = splitted_data[1]
predict_data = splitted_data[2]

print("Number of training records: " + str(len(train_data)))
print("Number of testing records : " + str(len(test_data)))
print("Number of prediction records : " + str(len(predict_data)))

As you can see your data has been successfully split into three data sets: 

-  The train data set, which is the largest group, is used for training.
-  The test data set will be used for model evaluation and is used to test the assumptions of the model.
-  The predict data set will be used for prediction.
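
As an alternative to slicing a shuffled DataFrame, the same 80% / 18% / 2% split can be sketched with scikit-learn's `train_test_split`, which can also stratify on the label so each subset keeps a similar default rate. This is only a sketch on synthetic data; the `demo` frame and its column values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan DataFrame: 100 rows, 20% defaults.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "income": rng.integers(20000, 200000, size=100),
    "paymentDefault": [1] * 20 + [0] * 80,
})

# First carve out 80% for training, then split the remainder 90/10
# so the overall proportions are 80% / 18% / 2%.
train_part, rest = train_test_split(
    demo, train_size=0.8, stratify=demo["paymentDefault"], random_state=42
)
test_part, predict_part = train_test_split(
    rest, train_size=0.9, stratify=rest["paymentDefault"], random_state=42
)

print(len(train_part), len(test_part), len(predict_part))
```

With a very small final subset, stratification cannot guarantee that both classes appear in it, so a larger prediction set may be preferable in practice.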

### 3.2 Create a model<a id="pipe"></a>

In this section you will create a Scikit-Learn machine learning model and then train the model.

In the first step you need to import the Scikit-Learn machine learning packages that will be needed in the subsequent steps.

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

Now construct the model. A linear model with Stochastic Gradient Descent is used in the following example. We use a pipeline to add an input scaling step.

In [None]:
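# Logistic regression trained with stochastic gradient descent.
# Note: scikit-learn 1.1 and later name this loss "log_loss"; "log" is the name used by older releases.
clf = SGDClassifier(loss="log", penalty="l2", random_state=42, tol=1e-3)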
scaler = StandardScaler()


In [None]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('standardize', scaler),
    ("classifier", clf)
])
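
Scaling matters here because stochastic gradient descent is sensitive to features on very different scales, such as a credit score next to a loan amount. The sketch below, on made-up values, shows what the `StandardScaler` step does to each column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values).
X = np.array([
    [600.0, 100000.0],
    [700.0, 250000.0],
    [800.0, 400000.0],
])

scaled = StandardScaler().fit_transform(X)

# After scaling, each column has mean 0 and unit (population) standard deviation.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

Inside a `Pipeline`, the scaler is fitted on the training data only and then reused unchanged at prediction time.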

### 3.3 Train the model<a id="train"></a>
Now, you can train your SGD classifier model by using the previously defined **pipeline** and **train data**.

In [None]:
train_data.dtypes

In [None]:
x_train_data = train_data.loc[:, used_names[:-1]]
y_train_data = train_data.loc[:, used_names[-1]]

In [None]:
pipeline.fit(x_train_data, y_train_data)

# Record when the model was trained; the timestamp is taken in UTC to match the trailing "Z".
import datetime
ts = datetime.datetime.now(datetime.timezone.utc)
trainedAt = ts.strftime("%Y-%m-%dT%H:%M:%S.000Z")

You can check your **model accuracy** now. Use **test data** to evaluate the model.

In [None]:
x_test_data = test_data.loc[:, used_names[:-1]]
y_test_data = test_data.loc[:, used_names[-1]]

predictions = pipeline.predict(x_test_data)

We define a **metrics** variable to keep track of the metrics values

In [None]:
from sklearn.metrics import mean_squared_error, classification_report, balanced_accuracy_score, accuracy_score, confusion_matrix

metrics = []

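# For a classifier, pipeline.score returns the mean accuracy, not R^2.
name = "Mean accuracy (pipeline.score)"
mean_acc = pipeline.score(x_test_data, y_test_data)
metrics.append({ "name": name, "value": mean_acc })

# RMSE is the square root of the mean squared error.
name = "Root Mean Squared Error (RMSE)"
rmse = float(np.sqrt(mean_squared_error(y_test_data, predictions)))
metrics.append({ "name": name, "value": rmse })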

name = "Accuracy"
acc = accuracy_score(y_test_data, predictions)
metrics.append({ "name": name, "value": acc })

name = "Balanced accuracy"
balanced_acc = balanced_accuracy_score(y_test_data, predictions)
metrics.append({ "name": name, "value": balanced_acc })

name = "Confusion Matrix"
confusion_mat = confusion_matrix(y_test_data, predictions, labels=[0, 1])
metrics.append({ "name": name, "value": str(confusion_mat.tolist()) })

for metric in metrics:
    print(metric["name"], "on test data =", metric["value"])

In [None]:
print(classification_report(y_test_data, predictions))
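
To make the relationship between these metrics more concrete, the following sketch recomputes balanced accuracy by hand from a confusion matrix and derives RMSE from the mean squared error. The label vectors are made up for illustration.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, mean_squared_error

# Made-up labels: four non-defaults and two defaults.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Rows are true labels, columns are predicted labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Balanced accuracy is the mean of the per-class recalls.
recall_negative = tn / (tn + fp)
recall_positive = tp / (tp + fn)
manual_balanced_acc = (recall_negative + recall_positive) / 2

# RMSE is the square root of the mean squared error.
rmse_demo = np.sqrt(mean_squared_error(y_true, y_pred))

print(manual_balanced_acc, balanced_accuracy_score(y_true, y_pred), rmse_demo)
```

For 0/1 labels, the mean squared error equals the fraction of misclassified records, so it carries the same information as accuracy.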

### 3.4 Save as a PMML file

In [None]:
!pip install sklearn2pmml

In [None]:
model_name = type(clf).__name__
scaler_name = type(scaler).__name__

from sklearn2pmml import make_pmml_pipeline, sklearn2pmml

pmml_pipeline = make_pmml_pipeline(
    pipeline,
    active_fields=x_train_data.columns,
    target_fields=['paymentDefault']
)
pmml_filename = "ML-Sample-" + model_name + '-' + scaler_name + "-pmml.xml"
sklearn2pmml(pmml_pipeline, pmml_filename, with_repr = True)
print(pmml_filename)

<a id="provider"></a>
## 4. Store the model in the provider of your choice
In this section you will learn how to use Python client libraries to store your model in the provider of your choice.

**Action** Click the provider you want to use.
1.	[Watson Machine Learning provider](#wml)
2.	[Open Prediction Service provider](#ops)

<a id="wml"></a>
### 4.1 Watson Machine Learning provider

In this section you will learn how to use Python client libraries to store your pipeline and model in WML repository.

- [4.1.1 Import the libraries](#lib)
- [4.1.2 Save model](#save)
- [4.1.3 Invoke model](#local)

#### 4.1.1 Import the libraries<a id="lib"></a>

First, you must install and import the `watson-machine-learning-client` libraries.

**Note**: Python 3.5 and Apache Spark 2.1 are required.

In [None]:
!rm -rf $PIP_BUILD/watson-machine-learning-client

In [None]:
!pip install watson-machine-learning-client --upgrade

Authenticate to the Watson Machine Learning service on IBM Cloud.

**Tip**: Authentication information (your credentials) can be found in the <a href="https://console.bluemix.net/docs/services/service_credentials.html#service_credentials" target="_blank" rel="noopener noreferrer">Service credentials</a> tab of the service instance that you created on IBM Cloud.

If you cannot see the **instance_id** field in **Service Credentials**, click **New credential (+)** to generate new authentication information. 

**Action**: Enter your Watson Machine Learning service instance credentials here.

In [None]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient
wml_credentials = {
  "apikey": "TO REPLACE",
  "iam_apikey_description": "TO REPLACE",
  "iam_apikey_name": "TO REPLACE",
  "iam_role_crn": "TO REPLACE",
  "iam_serviceid_crn": "TO REPLACE",
  "instance_id": "TO REPLACE",
  "url": "TO REPLACE"
}

client = WatsonMachineLearningAPIClient( wml_credentials )

#### 4.1.2 Save the pipeline and deploy model<a id="save"></a>

In this subsection you will learn how to save pipeline and model artifacts to your Watson Machine Learning instance.

Publish model based on PMML file.

In [None]:
metadata = {
    client.repository.ModelMetaNames.NAME: 'Loan Fraud Detection - PMML',
    client.repository.ModelMetaNames.FRAMEWORK_NAME: 'pmml',
    client.repository.ModelMetaNames.FRAMEWORK_VERSION: '4.4',
    client.repository.ModelMetaNames.RUNTIME_NAME: 'java',
    client.repository.ModelMetaNames.RUNTIME_VERSION: '1.8',
    client.repository.ModelMetaNames.EVALUATION_METHOD: 'multiclass',
    client.repository.ModelMetaNames.EVALUATION_METRICS: metrics
}

published_model_details = client.repository.store_model(pmml_filename, meta_props=metadata, training_data=None)

Publish model directly from pipeline.

In [None]:
# metadata = {
#     client.repository.ModelMetaNames.NAME: 'Loan Fraud Detection - Scikit Learn',
#     client.repository.ModelMetaNames.FRAMEWORK_NAME: 'scikit-learn',
#     client.repository.ModelMetaNames.FRAMEWORK_VERSION: '0.20',
#     client.repository.ModelMetaNames.RUNTIME_NAME: 'python',
#     client.repository.ModelMetaNames.RUNTIME_VERSION: '3.6',
#     client.repository.ModelMetaNames.LABEL_FIELD: 'paymentDefault',
#     client.repository.ModelMetaNames.EVALUATION_METRICS: metrics
# }

# published_model_details = client.repository.store_model(model=pipeline, meta_props=metadata, training_data=x_train_data)


In [None]:
model_uid = client.repository.get_model_uid( published_model_details )

print( "model_uid: ", model_uid )

In [None]:
deployment_name  = "fraud prediction"
deployment_desc  = "Online deployment of Loan payment default predictive service"
deployment       = client.deployments.create( model_uid, deployment_name, deployment_desc )
scoring_endpoint = client.deployments.get_scoring_url( deployment )
print( "scoring_endpoint: ", scoring_endpoint )

**Tip**: Use `client.repository.ModelMetaNames.show()` to get the list of available props.

In [None]:
client.repository.ModelMetaNames.show()

<a id="local"></a>
#### 4.1.3 Invoke model


In this subsection you will score the *predict_data* data set.
You will learn how to invoke a saved model from a specified instance of Watson Machine Learning.

In [None]:
# json is needed below to pretty-print the scoring response
import json

x_predict_data = predict_data.loc[:, used_names[:-1]]
y_predict_data = predict_data.loc[:, used_names[-1]]

scoring_payload = {
    "fields": x_predict_data.columns.values.tolist(),
    "values": x_predict_data.values.tolist()
}
predictions_predict_data = client.deployments.score(scoring_endpoint, scoring_payload)

print(json.dumps(predictions_predict_data, indent=4))

Preview some result metrics.

In [None]:
label_predictions = []
for result in predictions_predict_data['values']:
    # This cell treats result[0] as the probability of class 0 (no default).
    if result[0] >= 0.5:
        label_predictions.append(0)
    else:
        label_predictions.append(1)
        
balanced_acc = balanced_accuracy_score(y_predict_data, label_predictions)

confusion_mat = confusion_matrix(y_predict_data, label_predictions, labels=[0, 1])

acc = accuracy_score(y_predict_data, label_predictions)

print('Accuracy', acc)
print('Balanced accuracy', balanced_acc)
print('Confusion Matrix', confusion_mat)
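
The thresholding step can also be written in vectorized form. The sketch below assumes, as the loop above does, that the first value of each scored row is the probability of class 0; the probabilities are made up, and the actual layout of the scoring response should be checked against what your deployment returns.

```python
import numpy as np

# Made-up scores; each row is assumed to be [P(class 0), P(class 1)].
probabilities = np.array([
    [0.9, 0.1],
    [0.3, 0.7],
    [0.5, 0.5],
])

# Label 0 when P(class 0) reaches the 0.5 threshold, label 1 otherwise.
thresholded_labels = np.where(probabilities[:, 0] >= 0.5, 0, 1)
print(thresholded_labels.tolist())
```

Lowering the threshold on class 0 would flag fewer loans as defaults, so the cut-off is a business choice as much as a technical one.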

Your provider is now set up and ready to be used in ADS.
You can now go to the section
[Use Plotly to visualize data](#plotly) (work in progress).

<a id="ops"></a>
### 4.2 Open Prediction Service provider
In this section you will learn how to use Python client libraries to store your model in your Open Prediction Service.

- [4.2.1 Set up](#lib)
- [4.2.2 Deploy model](#save)
- [4.2.3 Invoke the model](#load)

#### 4.2.1 Set up <a id="lib"></a>

To save your model in your Open Prediction Service, you must first:

- Check that your Open Prediction Service is up and running.
- Define a model configuration.
- Save your model in a pickle file.

Start by checking that your Open Prediction Service is up and running.

**Action**: Enter your Open Prediction Service instance URL here.

In [None]:
OPS_REQUEST_URL = 'http://localhost:8080/v1/'

In [None]:
from urllib.parse import urljoin, urlparse
import json, requests

# Check that the Open Prediction Service is up and running
parsedUrl = urlparse(OPS_REQUEST_URL)
statusUrl = urljoin(OPS_REQUEST_URL, urljoin(parsedUrl.path, 'status'))
r = requests.get(statusUrl)

status = r.status_code == requests.codes.ok

if status:
    print('Open Prediction Service is up and running')
    print(json.loads(r.text)[u'model_count'], 'models are already deployed')
else:
    print('An error occurred when reaching out to your Open Prediction Service instance', r.status_code, r.text)
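
The endpoint URLs in this section are built with `urljoin`, whose behavior depends on whether the base URL ends with a slash. The sketch below needs no running service; it only shows how the status URL is assembled from the same base value as `OPS_REQUEST_URL`, and what happens when the trailing slash is missing.

```python
from urllib.parse import urljoin, urlparse

base_url = 'http://localhost:8080/v1/'  # same value as OPS_REQUEST_URL above
base_path = urlparse(base_url).path     # the path component, '/v1/'

# With a trailing slash on the base, the endpoint name is appended.
status_url = urljoin(base_url, urljoin(base_path, 'status'))

# Without it, urljoin replaces the last path segment ('v1') instead of appending.
no_slash = urljoin('http://localhost:8080/v1', 'status')

print(status_url)
print(no_slash)
```

Keeping the trailing slash in `OPS_REQUEST_URL` is therefore important for the `status`, `models`, and `invocations` URLs to resolve under `/v1/`.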

Next you need to define a configuration for your model.
**TODO LINK TO DOC ?**

**Action**: Complete all required data in the following variables

In [None]:
# Parameters that uniquely identify the model:
MODEL_NAME = "loan-risk"
MODEL_VERSION = "v0"

# Complementary parameters
METHOD_NAME = "predict_proba"
# For classification problems
CLASS_NAMES = {
    "0": "False",
    "1": "True"
}

# Metadata
METADATA_DESCRIPTION = "Sample loan risk predictive model"
METADATA_AUTHOR = "ADD_YOUR_NAME"


The following cells automate the generation of the input and output schemas.

In [None]:
from pandas.io.json import build_table_schema

mappingToOPSSchema = {
    'integer': 'int64',
    'number': 'float64'
}

def getInputSchema(dataFrame):
    inputSchema = build_table_schema(dataFrame, index=False, version=False)

    for index, field in enumerate(inputSchema['fields']):
        inputSchema['fields'][index]['type'] = mappingToOPSSchema[field['type']]
        inputSchema['fields'][index]['order'] = index
    return inputSchema['fields']

# attributes schema for regression models for example
# (probabilities)
predictionAsFloatAttributesSchema = [
    {
        "name": "prediction",
        "type": "float"
    }
]

# attributes schema for classification models
# (label and probabilities)
predictionAsStringOutputSchema = [
    {
        "name": "prediction",
        "type": "string"
    },
    {
        "name": "probabilities",
        "type": "[Probability]"
    }
]

In [None]:
# Retrieving input and output schema
inputSchema = getInputSchema(x_train_data)

outputSchema = {
    "attributes": predictionAsFloatAttributesSchema
}

if METHOD_NAME == 'predict_proba':
    outputSchema['attributes'] = predictionAsStringOutputSchema


We finally have a complete configuration object to be bundled with the model.

In [None]:
model_configuration = {
  "name": MODEL_NAME,
  "version": MODEL_VERSION,
  "method_name": METHOD_NAME,
  "input_schema": inputSchema,
  "output_schema": outputSchema,
  "metadata": {
    "class_names": CLASS_NAMES,
    "description": METADATA_DESCRIPTION,
    "author": METADATA_AUTHOR,
    "trained_at": trainedAt,
    "metrics": metrics
  }
}

print(json.dumps(model_configuration, indent=4))

Save your model in a pickle file

In [None]:
import pickle

In [None]:
def save_model_pickle(pickle_filename, model, model_configuration):
    # Bundle the model and its configuration into a single pickle archive
    with open(pickle_filename, 'wb') as f:
        pickle.dump({
            'model': model,
            'model_config': model_configuration
        }, f)

In [None]:
pickle_filename = MODEL_NAME + '-' + MODEL_VERSION + '-archive.pkl'

In [None]:
save_model_pickle(pickle_filename, pipeline, model_configuration)
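
Before uploading, it can be useful to confirm that the archive loads back with both of its keys. The sketch below does a round trip with placeholder objects in a temporary directory; in the notebook, the archive would contain `pipeline` and `model_configuration` instead.

```python
import os
import pickle
import tempfile

# Placeholder objects; in the notebook these would be `pipeline` and `model_configuration`.
archive = {
    "model": {"kind": "placeholder"},
    "model_config": {"name": "loan-risk", "version": "v0"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "loan-risk-v0-archive.pkl")
    # Write the archive, then read it back from disk.
    with open(path, "wb") as f:
        pickle.dump(archive, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored["model_config"])
```

Keep in mind that unpickling executes code from the file, so only load archives from sources you trust, and load them with library versions compatible with those used for training.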

#### 4.2.2 Deploy model<a id="save"></a>

In [None]:
modelUrl = urljoin(OPS_REQUEST_URL, urljoin(parsedUrl.path, 'models'))
print(modelUrl)

In [None]:
# Open the archive in a context manager so the file handle is closed after the upload.
with open(pickle_filename, 'rb') as f:
    r = requests.post(modelUrl, files={'file': f})

status = r.status_code == requests.codes.ok

if status:
    print('Model was successfully deployed.')
else:
    print('Model was not deployed:', r.status_code, r.text)
    print('You might want to check whether a model with the same name and version already exists.')

#### 4.2.3 Invoke the model<a id="load"></a>

In [None]:
invokeUrl = urljoin(OPS_REQUEST_URL, urljoin(parsedUrl.path, 'invocations'))
print(invokeUrl)

In [None]:
x_predict_data = predict_data.loc[:, used_names[:-1]]
y_predict_data = predict_data.loc[:, used_names[-1]]

In [None]:
raw_predict_data = x_predict_data.to_numpy()

In [None]:
from copy import deepcopy

data = {
  "model_name": MODEL_NAME,
  "model_version": MODEL_VERSION,
  "params": []
}
featureLabels = x_predict_data.columns

predictions_np = []

for row in raw_predict_data:
    tmpData = deepcopy(data)
    for index, value in enumerate(row):
        tmpData['params'].append({
            "name": featureLabels[index],
            "value": value
        })
    print(json.dumps(tmpData, indent=4))
    r = requests.post(invokeUrl,  data=json.dumps(tmpData))
    status = r.status_code == requests.codes.ok
    if status:
        result = json.loads(r.text)[u'prediction']
        predictions_np.append(result)
    else:
        print('Model was not invoked:', r.status_code, r.text)
        break

predictions_np = np.array(predictions_np, dtype=object)
predictions_np = predictions_np == "True"

predictions = pd.DataFrame(data=predictions_np, columns=["prediction"])

predictions.head()
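
The request body sent for each row can be inspected without calling the service. The following sketch builds the body for one row of a made-up DataFrame; `to_plain` and `build_payload` are hypothetical helpers written for this illustration, and the body layout simply mirrors the `data` dictionary used above.

```python
import json
import pandas as pd

# Made-up feature values.
features = pd.DataFrame({"creditScore": [600, 720], "rate": [0.05, 0.07]})


def to_plain(value):
    """Convert a NumPy scalar to a plain Python number so json.dumps accepts it."""
    return value.item() if hasattr(value, "item") else value


def build_payload(row, model_name="loan-risk", model_version="v0"):
    """Build one invocation body from a row (hypothetical helper)."""
    return {
        "model_name": model_name,
        "model_version": model_version,
        "params": [
            {"name": name, "value": to_plain(value)}
            for name, value in row.items()
        ],
    }


first_payload = build_payload(features.iloc[0])
print(json.dumps(first_payload, indent=4))
```

Building the body in a function avoids copying and mutating a shared template for every row, and converting values to plain Python numbers guards against NumPy integer types that `json.dumps` does not serialize.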

In [None]:
y_predict_data_Bool = y_predict_data.astype({"paymentDefault": bool})

print(y_predict_data_Bool.head(5))

In [None]:
balanced_acc = balanced_accuracy_score(y_predict_data_Bool, predictions)

# Use a distinct name so the imported confusion_matrix function is not overwritten.
confusion_mat = confusion_matrix(y_predict_data_Bool, predictions, labels=[0, 1])

acc = accuracy_score(y_predict_data_Bool, predictions)

print('Accuracy', acc)
print('Balanced accuracy', balanced_acc)
print('Confusion Matrix', confusion_mat)

<a id="plotly"></a>
## 5. Use Plotly to visualize data

In this subsection you will use the Plotly package to explore the prediction results. Plotly is an online analytics and data visualization tool.

First, you need to install the required packages. You can do it by running the following code. Run it one time only.

In [None]:
!pip install plotly "notebook>=5.3" "ipywidgets>=7.2"

Import Plotly and the other required packages.

In [None]:
import sys
import pandas
import plotly.graph_objects as go

In [None]:
predict_data.index.equals(predictions.index)

In [None]:
if not predict_data.index.equals(predictions.index):
    predict_data = predict_data.reset_index()
    predictions = pd.concat([predictions, predict_data], axis=1)
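
The index check matters because `pd.concat` with `axis=1` aligns rows on index labels rather than on position. The sketch below, on made-up frames, shows the difference; unlike the cell above it uses `reset_index(drop=True)`, which discards the old labels instead of keeping them as an extra `index` column.

```python
import pandas as pd

# Rows that kept their shuffled original labels (like predict_data), next to
# a freshly built predictions frame labeled 0..n-1.
held_out = pd.DataFrame({"loanAmount": [100, 200, 300]}, index=[17, 4, 42])
preds = pd.DataFrame({"prediction": [True, False, True]})

# concat aligns on index labels, so mismatched labels produce NaN-padded rows.
misaligned = pd.concat([preds, held_out], axis=1)

# Dropping the old labels lines the rows up by position instead.
aligned = pd.concat([preds, held_out.reset_index(drop=True)], axis=1)

print(misaligned.shape, aligned.shape)
```

Resetting by position is only valid when both frames list the records in the same order, which is the case here because the predictions were requested row by row from `predict_data`.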

In [None]:
cumulative_stats = predictions.groupby(['prediction']).count()
product_data = [go.Pie(labels=cumulative_stats.index, values=cumulative_stats['income'])]
product_layout = go.Layout(title='Predicted default income distribution')

fig = go.Figure(data=product_data, layout=product_layout)
fig.show()

With this data set, you might want to do some analysis of the mean loan amount by using a bar chart.

In [None]:
age_data = [go.Bar(y=predictions.groupby(['prediction']).mean()["loanAmount"], x=cumulative_stats.index)]

age_layout = go.Layout(
    title='Mean loanAmount per predicted default',
    xaxis=dict(title = "default", showline=False),
    yaxis=dict(title = "loanAmount"))

fig = go.Figure(data=age_data, layout=age_layout)
fig.show()

Based on the bar plot you created, you might draw the following conclusion: the mean amount of loans predicted to default is 100k higher than that of loans predicted not to default.

<a id="summary"></a>
## 6. Summary and next steps
You successfully completed this notebook! 
 
You learned how to use the scikit-learn machine learning API as well as the Open Prediction Service for model creation and deployment.
 
Check out our [Online Documentation](https://dataplatform.ibm.com/docs/content/analyze-data/wml-setup.html) for more samples, tutorials, documentation, how-tos, and blog posts. 

### Authors

This notebook was inspired by an original notebook written by Pierre Feillet using Apache Spark and Watson Machine Learning.
It was adapted for Scikit Learn and Open Prediction Service by Marine Collery.