<table style="border: none" align="left">
   <tr style="border: none">
      <th style="border: none"><font face="verdana" size="5" color="black"><b>Build a Loan default scoring model in OPS</b></font></th>
   </tr>
</table>

This notebook is a simple example of how to build and deploy a machine learning model ready to be used in Automation Decision Service.
The deployed model is stored in an Open Prediction Service (OPS) endpoint. You can find information about OPS implementations in this [documentation](https://github.com/IBM/open-prediction-service-hub).
This other [notebook](https://github.com/icp4a/automation-decision-services-samples/tree/master/samples/MLNotebooks/Predict%20loan%20default%20with%20scikit-learn%20in%20WML.ipynb)
builds the same model and stores it in Watson Machine Learning.

Note that this model is built on a small synthetic dataset to serve as an example; its predictions are not realistic.

After running this notebook, you can use the deployed model in a decision project in Automation Decision Service. You can find a detailed description of this kind of integration in the [ML Start tutorial](https://github.com/icp4a/automation-decision-services-samples/tree/21.0.1/samples/MLStart).

Some familiarity with Python is helpful. This notebook uses Python 3.

## Prerequisites

- You need to have an Open Prediction Service instance up and running.
- To use the notebooks, follow the documentation [Creating a project](https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/projects.html?audience=wdp).
- Other notebooks are available in this [Samples documentation](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-samples-overview.html).

## Learning goals

You will learn how to:

-  Load a CSV file into a Pandas DataFrame.
-  Explore data.
-  Prepare data for training and evaluation.
-  Create a scikit-learn machine learning model.
-  Train and evaluate a model.
-  Store a machine learning model in the Open Prediction Service provider.


## Contents

This notebook contains the following parts:

1.	[Load and explore data](#load)
2.	[Create a scikit-learn machine learning model](#model)
3.	[Store the model in Open Prediction Service provider](#provider)
4.	[Summary and next steps](#summary)

<a id="load"></a>
## 1. Load and explore data

In this section you will load the data as a Pandas DataFrame and perform a basic exploration.

Use *wget* to download the data file to the notebook file system (gpfs), then use the pandas *read_csv* method to load it into a DataFrame.

In [None]:
# Install wget if you don't already have it.
!pip install wget

In [None]:
!pip uninstall --yes scikit-learn 

In [None]:
!pip install 'numpy>=1.19.5'
!pip install 'pandas>=1.1.2'
!pip install 'scikit-learn==0.23.2'
!pip install 'requests>=2.25.1'

In [None]:
import wget
link_to_data = 'https://raw.githubusercontent.com/ODMDev/decisions-on-spark/master/data/miniloan/miniloan-payment-default-cases-v2.0.csv'
filename = wget.download(link_to_data)

print(filename)

Import the libraries required to create the pandas DataFrame.

In [None]:
import numpy as np
import pandas as pd

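Load the file into a pandas DataFrame by using the code below. The regular-expression delimiter strips whitespace around the commas, infinite values are replaced with NaN, and rows with missing values are dropped.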

In [None]:
used_names = ['creditScore', 'income', 'loanAmount', 'monthDuration', 'rate', 'yearlyReimbursement', 'paymentDefault']

df = pd.read_csv(
    filename,
    header=0,
    delimiter=r'\s*,\s*',
    engine='python'
).replace(
    [np.inf, -np.inf], np.nan
).dropna().loc[:, used_names]
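
To make the effect of the `delimiter=r'\s*,\s*'` pattern and of the `replace(...).dropna()` chain more concrete, here is a minimal sketch on made-up data. The sample CSV text and the `sample_csv`, `sample_df`, `dirty` and `clean` names are hypothetical and are not part of the miniloan dataset.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample: note the stray spaces around some commas.
sample_csv = "creditScore , income,loanAmount\n700, 50000 ,10000\n650,42000,20000\n"

# The regular-expression delimiter swallows whitespace around each comma,
# so column names and values come out without padding.
sample_df = pd.read_csv(
    io.StringIO(sample_csv),
    header=0,
    delimiter=r'\s*,\s*',
    engine='python',
)
print(list(sample_df.columns))

# Infinite values are first turned into NaN, then every row with a NaN is dropped.
dirty = pd.DataFrame({"income": [50000.0, np.inf, np.nan], "rate": [0.05, 0.07, 0.06]})
clean = dirty.replace([np.inf, -np.inf], np.nan).dropna()
print(len(dirty), "rows before cleaning,", len(clean), "after")
```

Only the first row of `dirty` survives, because the other two contain an infinite or a missing income.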

You can now explore the loaded data

In [None]:
# Convert all columns of the DataFrame to float to avoid scaler warnings
df = df.astype(np.float64)
df.dtypes

As you can see, the data contains seven fields: six features and the paymentDefault field, which is the one you would like to predict (the label).

In [None]:
df.head()

In [None]:
print("Number of records: " + str(len(df)))

<a id="model"></a>
## 2. Create a scikit-learn machine learning model

In this section you will learn how to:

- [2.1 Prepare data](#prep)
- [2.2 Create a model](#pipe)
- [2.3 Train the model](#train)

### 2.1 Prepare data<a id="prep"></a>

In this subsection you will split your data into: 
- train data set
- test data set
- predict data set

In [None]:
splitted_data = np.split(df.sample(frac=1, random_state=42), [int(.7*len(df)), int((.7+.2)*len(df))])
train_data = splitted_data[0]
test_data = splitted_data[1]
predict_data = splitted_data[2]

print("Number of training records: " + str(len(train_data)))
print("Number of testing records : " + str(len(test_data)))
print("Number of prediction records : " + str(len(predict_data)))

As you can see your data has been successfully split into three data sets: 

-  The train data set, which is the largest group, is used for training.
-  The test data set will be used for model evaluation and is used to test the assumptions of the model.
-  The predict data set will be used for prediction.
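
The split above relies on `np.split` with two cut points, at roughly 70% and 90% of the shuffled records. The following sketch shows the same mechanism on a plain NumPy array; the `records` array is a hypothetical stand-in for the shuffled DataFrame.

```python
import numpy as np

# One hundred stand-in records (hypothetical); the notebook uses the shuffled DataFrame.
records = np.arange(100)
n = len(records)

# Same cut-point expressions as in the cell above.
cut_points = [int(.7 * n), int((.7 + .2) * n)]

# np.split returns three consecutive, non-overlapping slices:
# [0, first cut), [first cut, second cut) and [second cut, end).
train_part, test_part, predict_part = np.split(records, cut_points)

print("Cut points:", cut_points)
print("Sizes:", len(train_part), len(test_part), len(predict_part))
```

Because the cut points are computed with floating-point arithmetic and then truncated by `int`, the sizes can differ by one record from an exact 70/20/10 split.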

### 2.2 Create a model<a id="pipe"></a>

In this section you will create a Scikit-Learn machine learning model and then train the model.

In the first step you need to import the Scikit-Learn machine learning packages that will be needed in the subsequent steps.

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

Now construct the model. The following example uses a linear classifier trained with stochastic gradient descent (SGD). With `loss="log"`, it is a logistic regression model. A pipeline adds an input scaling step in front of the classifier.

In [None]:
clf = SGDClassifier(loss="log", penalty="l2", random_state=42, tol=1e-3)
scaler = StandardScaler()

In [None]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('standardize', scaler),
    ("classifier", clf)
])
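
To illustrate what the `standardize` step of the pipeline does, here is a small sketch that applies `StandardScaler` on its own. The `features` values and the `scaler_demo` name are hypothetical and chosen only to show two columns on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: a credit score column and an income column.
features = np.array([[600.0, 20000.0],
                     [700.0, 50000.0],
                     [800.0, 80000.0]])

scaler_demo = StandardScaler()
scaled = scaler_demo.fit_transform(features)

# After scaling, each column has mean 0 and unit standard deviation.
print("Column means:", scaled.mean(axis=0))
print("Column standard deviations:", scaled.std(axis=0))
```

Inside a pipeline, the scaler is fitted on the training data only, and the same learned mean and scale are reused when the pipeline predicts on new rows.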

### 2.3 Train the model<a id="train"></a>
Now, you can train your SGD classifier by using the previously defined **pipeline** and **train data**.

In [None]:
train_data.dtypes

In [None]:
x_train_data = train_data.loc[:, used_names[:-1]]
y_train_data = train_data.loc[:, used_names[-1]]

In [None]:
pipeline.fit(x_train_data, y_train_data)

# Keep track of when the model was trained, in UTC as the trailing "Z" indicates
import datetime
ts = datetime.datetime.now(datetime.timezone.utc)
trainedAt = ts.strftime("%Y-%m-%dT%H:%M:%S.000Z")

You can check your **model accuracy** now. Use **test data** to evaluate the model.

In [None]:
x_test_data = test_data.loc[:, used_names[:-1]]
y_test_data = test_data.loc[:, used_names[-1]]

predictions = pipeline.predict(x_test_data)

We define a **metrics** variable to keep track of the metrics values

In [None]:
from sklearn.metrics import mean_squared_error, classification_report, balanced_accuracy_score, accuracy_score, confusion_matrix

metrics = []

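# For a classifier, pipeline.score returns the mean accuracy, not R^2
name = "Mean accuracy (pipeline.score)"
score = pipeline.score(x_test_data, y_test_data)
metrics.append({ "name": name, "value": score })

# mean_squared_error returns the MSE; take its square root to get the RMSE
name = "Root Mean Squared Error (RMSE)"
rmse = np.sqrt(mean_squared_error(y_test_data, predictions))
metrics.append({ "name": name, "value": rmse })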

name = "Accuracy"
acc = accuracy_score(y_test_data, predictions)
metrics.append({ "name": name, "value": acc })

name = "Balanced accuracy"
balanced_acc = balanced_accuracy_score(y_test_data, predictions)
metrics.append({ "name": name, "value": balanced_acc })

name = "Confusion Matrix"
confusion_mat = confusion_matrix(y_test_data, predictions, labels=[0, 1])
metrics.append({ "name": name, "value": str(confusion_mat.tolist()) })

for metric in metrics:
    print(metric["name"], "on test data =", metric["value"])

In [None]:
print(classification_report(y_test_data, predictions))
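
The metrics above can be hard to compare at a glance, so here is a hand-computed sketch on a tiny, imbalanced example. The labels are made up, and the arithmetic is written out in plain Python for illustration; it is not how scikit-learn implements these functions.

```python
import math

# Hypothetical labels: four non-defaults (0) and two defaults (1).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

# Confusion matrix counts, with rows = true class and columns = predicted class.
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
confusion = [[tn, fp], [fn, tp]]

# Accuracy is the fraction of correct predictions.
accuracy = (tn + tp) / len(y_true)
# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = (tn / (tn + fp) + tp / (tp + fn)) / 2
# With 0/1 labels, the mean squared error is the fraction of wrong predictions.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
rmse = math.sqrt(mse)

print("Confusion matrix:", confusion)
print("Accuracy:", accuracy)
print("Balanced accuracy:", balanced_accuracy)
print("RMSE:", rmse)
```

In this example the accuracy looks high because most records belong to the majority class, while the balanced accuracy is lower because half of the defaults are missed.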

<a id="provider"></a>
## 3. Store the model in Open Prediction Service provider


In this section you will learn how to use Python client libraries to store your model in your Open Prediction Service.

- [3.1 Set up](#lib)
- [3.2 Deploy model](#save)
- [3.3 Invoke the model](#load)

### 3.1 Set up <a id="lib"></a>

To save your model in your Open Prediction Service, you must first:

- Check that your Open Prediction Service is up and running
- Define a model configuration
- Save your model in a pickle file

Let's check that your Open Prediction Service is up and running.

**Action**: Enter the URL of your Open Prediction Service instance in the cell below. Change its type to code.

In [None]:
OPS_REQUEST_URL = 'PUT OPS URL '  # For local test: 'http://localhost:8080'

In [None]:
from urllib.parse import urljoin, urlparse
import json, requests

# Check that the Open Prediction Service is up and running
parsedUrl = urlparse(OPS_REQUEST_URL)
statusUrl = urljoin(OPS_REQUEST_URL, urljoin(parsedUrl.path, 'info'))
r = requests.get(statusUrl)

status = r.status_code == requests.codes.ok
if status:
    # Parse the response only after a successful request
    versions = json.loads(r.text)['info']['libraries']
    print('Open Prediction Service is up and running.')
    print('OPS scikit-learn version is : ' + versions['scikit-learn'])
else:
    print('An error occurred when reaching out to your Open Prediction Service instance', r.status_code, r.text)

Here is an example of a working output:
Open Prediction Service is up and running.
OPS scikit-learn version is : 0.23.2
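
The URLs in this section are built with two nested `urljoin` calls, and the result depends on whether the OPS URL ends with a slash. The following sketch shows this behavior with the standard library only. The `ops_url` helper and the `example.com` hosts are hypothetical and used for illustration.

```python
from urllib.parse import urljoin, urlparse


def ops_url(base_url, relative_path):
    """Join a relative path to the base URL the same way the status check does."""
    return urljoin(base_url, urljoin(urlparse(base_url).path, relative_path))


# No path in the base URL: the relative path is appended at the root.
root_case = ops_url('http://localhost:8080', 'info')

# Base path with a trailing slash: the base path is kept.
slash_case = ops_url('http://example.com/ops/', 'info')

# Base path without a trailing slash: the last path segment is replaced.
no_slash_case = ops_url('http://example.com/ops', 'info')

print(root_case)
print(slash_case)
print(no_slash_case)
```

If your Open Prediction Service is served under a base path, ending `OPS_REQUEST_URL` with a slash keeps that base path in the generated URLs.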

Next, you need to define a configuration for your model.

**Action**: Complete all required data in the following variables.

In [None]:
# Parameters that uniquely identify the model:
MODEL_NAME = "loan-risk"
MODEL_VERSION = "v0"

# Metadata
METADATA_DESCRIPTION = "Sample loan risk predictive model"
METADATA_AUTHOR = "ADD_YOUR_NAME"

Here we automate the generation of the input schema from the training DataFrame.

In [None]:
from pandas.io.json import build_table_schema

mappingToOPSSchema = {
    'integer': 'int64',
    'number': 'float64'
}

def getInputSchema(dataFrame):
    inputSchema = build_table_schema(dataFrame, index=False, version=False)

    for index, field in enumerate(inputSchema['fields']):
        inputSchema['fields'][index]['type'] = mappingToOPSSchema[field['type']]
        inputSchema['fields'][index]['order'] = index
    return inputSchema['fields']

In [None]:
# Retrieving input schema
inputSchema = getInputSchema(x_train_data)
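
As an illustration of what the schema generation produces, the sketch below runs the same steps on a two-column frame. The `sample` DataFrame and the `type_mapping`, `raw_fields` and `ops_fields` names are hypothetical; the mapping mirrors `mappingToOPSSchema` from the cell above.

```python
import pandas as pd
from pandas.io.json import build_table_schema

# Hypothetical two-column frame: one integer column and one float column.
sample = pd.DataFrame({"monthDuration": [12, 24], "rate": [0.05, 0.07]})

# build_table_schema describes each column with a Table Schema type name
# such as 'integer' or 'number'.
raw_fields = build_table_schema(sample, index=False, version=False)["fields"]
print(raw_fields)

# Same renaming as getInputSchema: Table Schema type names become dtype names,
# and each field records its column position.
type_mapping = {"integer": "int64", "number": "float64"}
ops_fields = [
    {"name": field["name"], "type": type_mapping[field["type"]], "order": position}
    for position, field in enumerate(raw_fields)
]
print(ops_fields)
```

Columns with other types, such as strings or booleans, are not covered by the mapping and would raise a `KeyError`, so the mapping would need more entries for such data.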


We finally have a complete configuration object to be bundled with the model.

In [None]:
model_configuration = {
  "name": MODEL_NAME,
  "version": MODEL_VERSION,
  "input_schema": inputSchema,
  "metadata": {
    "description": METADATA_DESCRIPTION,
    "author": METADATA_AUTHOR,
    "trained_at": trainedAt,
    "metrics": metrics
  }
}

print(json.dumps(model_configuration, indent=4))

### 3.2 Deploy model<a id="save"></a>

First, add the model configuration.

In [None]:
modelUrl = urljoin(OPS_REQUEST_URL, urljoin(parsedUrl.path, 'models'))
body = json.dumps(model_configuration)

r = requests.post(modelUrl, data=body)
status = r.status_code == requests.codes.ok

if status:
    # Read the model id only after a successful request; it is needed for the upload below
    model_id = json.loads(r.text)['id']
    print('Model configuration was successfully added.')
else:
    print('Model configuration was not added:', r.status_code, r.text)

Save your model in a pickle file.

In [None]:
import pickle

def save_model_pickle(pickle_filename, model):
    # Serialize the trained model to a binary pickle file
    with open(pickle_filename, 'wb') as f:
        pickle.dump(model, f)

In [None]:
pickle_filename = MODEL_NAME + '-' + MODEL_VERSION + '-archive.pkl'

In [None]:
save_model_pickle(pickle_filename, pipeline)
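
Before uploading a pickle file, it can be useful to confirm that it loads back correctly. Here is a minimal round-trip sketch that uses only the standard library. The `model_stub` dictionary is a hypothetical stand-in for the fitted pipeline, and the file is written to a temporary directory.

```python
import os
import pickle
import tempfile

# Stand-in object (hypothetical); in the notebook this is the fitted pipeline.
model_stub = {"name": "loan-risk", "coefficients": [0.1, -0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "loan-risk-v0-archive.pkl")

    # Write the object in binary mode, as save_model_pickle does.
    with open(path, "wb") as f:
        pickle.dump(model_stub, f)

    # Read it back and compare it with the original object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print("Round trip succeeded:", restored == model_stub)
```

A pickled scikit-learn model is generally expected to be loaded with the same library version that created it, which is presumably why this notebook pins scikit-learn and prints the version reported by OPS. Only unpickle files from sources you trust.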

In [None]:
modelUrl = urljoin(OPS_REQUEST_URL, urljoin(parsedUrl.path, 'models/'+ model_id ))
print(modelUrl)

In [None]:
# Open the pickle file in a context manager so that it is closed after the upload
with open(pickle_filename, 'rb') as f:
    r = requests.post(
        modelUrl,
        data={'format': 'pickle', 'input_data_structure': 'DataFrame', 'output_data_structure': 'ndarray'},
        files={'file': f}
    )
if r.status_code == 201:
    print("Model was successfully added.")
else:
    print('Model was not deployed:', r.status_code, r.text)
    print('You might want to check whether a model with the same name and version already exists.')

### 3.3 Invoke the model<a id="load"></a>

In [None]:
# Use a relative path, as for 'info' and 'models', so that any base path in the OPS URL is kept
invokeUrl = urljoin(OPS_REQUEST_URL, urljoin(parsedUrl.path, 'predictions'))
print(invokeUrl)

In [None]:
x_predict_data = predict_data.loc[:, used_names[:-1]]
y_predict_data = predict_data.loc[:, used_names[-1]]

raw_predict_data = x_predict_data.to_numpy()

from copy import deepcopy

data = {
  "target": [
    {
      "rel": "endpoint",
      "href": "/endpoints/"+ model_id
    }
  ],
  "parameters": []
}
featureLabels = x_predict_data.columns

predictions_np = []

for row in raw_predict_data:
    tmpData = deepcopy(data)
    for index, value in enumerate(row):
        tmpData['parameters'].append({
            "name": featureLabels[index],
            "value": value
        })
    print(json.dumps(tmpData, indent=4))
    r = requests.post(invokeUrl,  data=json.dumps(tmpData))
    status = r.status_code == requests.codes.ok
    if status:
        result = json.loads(r.text)[u'result'][u'predictions']
        predictions_np.append(result)
    else:
        print('Model was not invoked:', r.status_code, r.text)
        break

predictions_np = np.array(predictions_np, dtype=int)

predictions = pd.DataFrame(data=predictions_np, columns=["prediction"]).astype({"prediction": bool})
# y_predict_data is a Series, so the target type can be passed directly
y_predict_data = y_predict_data.astype(bool)

predictions.head()
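
The request body sent for each row has a fixed shape: a target that points to the model endpoint and a list of named parameters. The following sketch isolates that construction in a small helper. The `build_prediction_payload` function, the model id and the feature values are hypothetical; the payload layout mirrors the loop above.

```python
import json


def build_prediction_payload(model_id, feature_row):
    """Build one request body in the shape used by the invocation loop.

    model_id is the id returned when the model configuration was added,
    and feature_row maps feature names to values for a single record.
    """
    return {
        "target": [{"rel": "endpoint", "href": "/endpoints/" + model_id}],
        "parameters": [
            {"name": name, "value": value} for name, value in feature_row.items()
        ],
    }


# Hypothetical model id and feature values, for illustration only.
example_row = {"creditScore": 700.0, "income": 50000.0, "loanAmount": 10000.0}
payload = build_prediction_payload("example-model-id", example_row)
print(json.dumps(payload, indent=4))
```

Building a fresh dictionary for each row avoids the need for `deepcopy`. With a DataFrame, the rows could be obtained as dictionaries with `x_predict_data.to_dict(orient='records')`.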

In [None]:
print(y_predict_data.head(5))

In [None]:
balanced_acc = balanced_accuracy_score(y_predict_data, predictions)

# Use a separate variable name so that the imported confusion_matrix function is not overwritten
confusion_mat = confusion_matrix(y_predict_data, predictions, labels=[0, 1])

acc = accuracy_score(y_predict_data, predictions)

print('Accuracy', acc)
print('Balanced accuracy', balanced_acc)
print('Confusion Matrix', confusion_mat)

<a id="summary"></a>
## 4. Summary and next steps
You successfully completed this notebook!

You learned how to use the scikit-learn machine learning API as well as Open Prediction Service for model creation and deployment.

Now you can use this model deployment in a predictive model in Automation Decision Service. You can find a detailed description of this kind of integration in the [ML Start tutorial](https://github.com/icp4a/automation-decision-services-samples/tree/21.0.1/samples/MLStart).

### Authors

This notebook was inspired by an original notebook written by Pierre Feillet using Apache Spark and Watson Machine Learning.
It was adapted for scikit-learn and Open Prediction Service by Amel Ben Othmane.
