# Operationalizing Machine Learning
**Project 2**
[[View Rubric](https://review.udacity.com/#!/rubrics/2893/view)]

## Initialization



In [None]:
#  Not needed when running notebook on azure:
!pip install --upgrade -q -r requirements.txt
!python --version

In [11]:
import logging
import os
import csv
import json

import pickle
import pkg_resources
import requests

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

import azureml.core

from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.dataset import Dataset
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.core.experiment import Experiment
from azureml.core.run import Run
from azureml.core.workspace import Workspace
from azureml.pipeline.core import Pipeline, PipelineData, TrainingOutput
from azureml.pipeline.core.run import PipelineRun
from azureml.pipeline.steps import AutoMLStep
from azureml.train.automl import AutoMLConfig
from azureml.widgets import RunDetails

# Check core SDK version number
print("Azure SDK version:", azureml.core.VERSION)

Python 3.9.1


## Authentication

In [None]:
# Skipped granting rights because the Azure environment provided by Udacity is used



## Azure Initialization

In [None]:
ws = Workspace.from_config()
# ws = Workspace.get(name="quick-starts-ws-128192") # UPDATE THIS LINE WITH EACH NEW VM INSTANCE!

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

# DON'T FORGET TO CLICK THE LOGIN LINK!

## Prepare Dataset

In [None]:
# Try to load the dataset from the Workspace. Otherwise, create it from the file
found = False
ds_key = "Bank-marketing"

if ds_key in ws.datasets.keys(): 
        found = True
        ds = ws.datasets[ds_key] 

if not found:
        # Create AML Dataset and register it into Workspace
        # Create TabularDataset using TabularDatasetFactory
        # https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py  
        # The _train.csv file is downloaded and imported as-is, so no further splitting is necessary
        example_data = 'https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv'
        # Bug fix: the original passed the undefined name `dataset_path`
        ds = TabularDatasetFactory.from_delimited_files(path=example_data)
        #Register Dataset in Workspace
        ds = ds.register(workspace=ws,
                        name=ds_key,
                        description="Bank Marketing DataSet for Udacity Course 2")


In [None]:
# data cleaning like in project-01

def clean_data(data):
    # Dict for cleaning data
    months = {"jan":1, "feb":2, "mar":3, "apr":4, "may":5, "jun":6, "jul":7, "aug":8, "sep":9, "oct":10, "nov":11, "dec":12}
    weekdays = {"mon":1, "tue":2, "wed":3, "thu":4, "fri":5, "sat":6, "sun":7}

    # Clean and one hot encode data
    x_df = data.to_pandas_dataframe().dropna()
    jobs = pd.get_dummies(x_df.job, prefix="job")
    x_df.drop("job", inplace=True, axis=1)
    x_df = x_df.join(jobs)
    x_df["marital"] = x_df.marital.apply(lambda s: 1 if s == "married" else 0)
    x_df["default"] = x_df.default.apply(lambda s: 1 if s == "yes" else 0)
    x_df["housing"] = x_df.housing.apply(lambda s: 1 if s == "yes" else 0)
    x_df["loan"] = x_df.loan.apply(lambda s: 1 if s == "yes" else 0)
    contact = pd.get_dummies(x_df.contact, prefix="contact")
    x_df.drop("contact", inplace=True, axis=1)
    x_df = x_df.join(contact)
    education = pd.get_dummies(x_df.education, prefix="education")
    x_df.drop("education", inplace=True, axis=1)
    x_df = x_df.join(education)
    x_df["month"] = x_df.month.map(months)
    x_df["day_of_week"] = x_df.day_of_week.map(weekdays)
    x_df["poutcome"] = x_df.poutcome.apply(lambda s: 1 if s == "success" else 0)

    y_df = x_df.pop("y").apply(lambda s: 1 if s == "yes" else 0)

    return x_df, y_df

found_clean = False
if ds_key +"-clean" in ws.datasets.keys(): 
        found_clean = True
        ds_clean = ws.datasets[ds_key +"-clean"] 

if not found_clean:
    # Use the clean_data function to clean your data.
    x, y = clean_data(ds)
    df_clean = x.join(y)

    #Register cleaned Dataset in Workspace
    ds_clean = TabularDatasetFactory.register_pandas_dataframe(df_clean, ws.get_default_datastore(), ds_key +"-clean",
                                                                description="Cleaned Bank Marketing DataSet for Udacity Course 2")

In [None]:

df_clean = ds_clean.to_pandas_dataframe()
df_clean.describe()

In [None]:
df_clean.head(5)

Screenshot of “Registered Datasets” in ML Studio showing that the Bankmarketing dataset is available:
![registered_datasets](images/registered_dataset.jpg)

## Automated ML Experiment

<p>Run the experiment using <em>Classification</em> and ensure <em>Explain best model</em> is checked. <br> Under Exit criterion, reduce the default (3 hours) to 1 hour and reduce the <em>Concurrency</em> from the default to 5 (this number should always be less than the number of nodes in the compute cluster). <br><br> Note: This process takes about 15 minutes, at roughly 5 minutes per iteration.</p>
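
The two limits described above can be sanity-checked before a run is submitted. The following is a minimal sketch, not part of the Azure ML SDK: the function name `validate_automl_limits` and its parameters are hypothetical, and the check only encodes the rule of thumb stated above (concurrency strictly below the cluster size).

```python
def validate_automl_limits(timeout_minutes, max_concurrent_iterations, max_nodes):
    """Return a list of warnings; an empty list means the limits look consistent.

    All names here are illustrative assumptions, not Azure ML SDK API.
    """
    warnings = []
    if timeout_minutes <= 0:
        warnings.append("timeout must be a positive number of minutes")
    # Rule of thumb from the text: concurrency should stay below the node count.
    if max_concurrent_iterations >= max_nodes:
        warnings.append(
            f"concurrency {max_concurrent_iterations} is not below "
            f"the cluster size of {max_nodes} nodes"
        )
    return warnings
```

For example, `validate_automl_limits(20, 5, 4)` returns one warning, because 5 concurrent iterations do not fit below a 4-node cluster.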


In [None]:
# create an experiment using Automated ML

experiment_name = 'ml-experiment-1'
exp = Experiment(workspace=ws, name=experiment_name)
# run = exp.start_logging()

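Configure a compute cluster

Create a compute cluster with VM size "Standard_DS12_v2" and a minimum of 1 node. Note that the cell below reuses a cluster named `auto-ml` if one exists; otherwise it creates one with `STANDARD_D2_V2` and at most 4 nodes.
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-cluster?tabs=python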

In [None]:
# Choose a name for your CPU cluster
cpu_cluster_name = "auto-ml"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',# for GPU, use "STANDARD_NC6"
                                                           #vm_priority = 'lowpriority', # optional
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True) # , min_node_count = 1, timeout_in_minutes = 10
# For a more detailed view of current AmlCompute status, use get_status().


In [None]:
# Set parameters for AutoMLConfig
# NOTE: DO NOT CHANGE THE experiment_timeout_minutes PARAMETER OR YOUR INSTANCE WILL TIME OUT.
# If you wish to run the experiment longer, you will need to run this notebook in your own
# Azure tenant, which will incur personal costs.

# Bug fix: the original passed the undefined name `compute_target`; use the cluster created above
automl_config = AutoMLConfig(compute_target=cpu_cluster,
                             task = "classification",
                             training_data=ds_clean,
                             label_column_name="y",   
                             path = './pipeline-project',
                             enable_early_stopping= True,
                             featurization= 'auto',
                             debug_log = "automl_errors.log",
                             experiment_timeout_minutes = 20,
                             max_concurrent_iterations = 5,
                             # n_cross_validations=3,
                             primary_metric = "AUC_weighted" #  or "accuracy"
                            )


metrics_output_name = 'metrics_output'
metrics_data = PipelineData(name='metrics_data',
                           datastore=ws.get_default_datastore(),
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))

best_model_output_name = 'best_model_output'
best_model_data = PipelineData(name='model_data',
                           datastore=ws.get_default_datastore(),
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))

pipeline = Pipeline(
    description="pipeline_with_automlstep",
    workspace=ws,    
    steps=[AutoMLStep(
            name='automl_module',
            automl_config=automl_config,
            outputs=[metrics_data, best_model_data],
            allow_reuse=True)
          ])

# Submit the pipeline; the AutoML step runs on the compute target set in automl_config
pipeline_run = exp.submit(pipeline)

# Show the progress of the run in the notebook widget
RunDetails(pipeline_run).show()


In [None]:
pipeline_run.wait_for_completion()


Screenshot showing that the experiment is shown as completed:
![experiment overview](images/experiments_overview.jpg)
![completed run](images/completed_run.jpg)


## Deploy the best model
After the experiment run completes, a summary of all the models and their metrics is shown, including explanations. The Best Model is shown in the Details tab. In the Models tab, it appears first (at the top). Make sure you select the best model for deployment.

Deploying the Best Model allows you to interact with the model through an HTTP API service by sending data in POST requests.

1. Select the <strong>best</strong> model for deployment
2. Deploy the model and enable "Authentication"
3. Deploy the model using Azure Container Instance (ACI)
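
The steps above can also be scripted with the Azure ML SDK. This is a sketch under several assumptions that this notebook does not satisfy yet: `model` is a registered `azureml.core.Model` (the registration cell below is commented out), `inference_config` is an `InferenceConfig` with a scoring script and environment, and the service name `bankmarketing-aci` is a hypothetical placeholder.

```python
def deploy_to_aci(ws, model, inference_config, service_name="bankmarketing-aci"):
    """Deploy a registered model to Azure Container Instance with key authentication.

    `service_name` is a hypothetical placeholder; `model` and `inference_config`
    must be created separately and are assumptions of this sketch.
    """
    # Imported lazily so the sketch can be loaded without the Azure ML SDK installed.
    from azureml.core.model import Model
    from azureml.core.webservice import AciWebservice

    # auth_enabled=True switches on key-based authentication for the endpoint.
    aci_config = AciWebservice.deploy_configuration(
        cpu_cores=1,
        memory_gb=1,
        auth_enabled=True,
    )
    service = Model.deploy(
        workspace=ws,
        name=service_name,
        models=[model],
        inference_config=inference_config,
        deployment_config=aci_config,
    )
    service.wait_for_deployment(show_output=True)
    return service
```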

Docs: [PipelineRun Class](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinerun?view=azure-ml-py)

In [None]:
#download pipeline output about metrics (of child runs) and examine them
metrics_portref = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_portref.download('.', show_progress=True)

with open(metrics_portref._path_on_datastore) as f:
    metrics = f.read()
    
pd.DataFrame(json.loads(metrics))

In [None]:
# download pipeline output about the best model and examine it
best_model_portref = pipeline_run.get_pipeline_output(best_model_output_name)
num_file_downloaded = best_model_portref.download('.', show_progress=True)

with open(best_model_portref._path_on_datastore, "rb" ) as f:
    best_model = pickle.load(f)

# show best model
best_model

In [None]:
best_model.steps

In [None]:
# register best model
# best_automl_model_reg = best_automl_run.register_model(model_name='best_automl_model', model_path=best_model_portref._path_on_datastore) #'outputs/model.pkl')

In [None]:
# examine metrics of best model
# print('Best Run Id: ', best_automl_run.id)
# print('Accuracy:', best_automl_run_metrics['accuracy'])
# print('Metrics:', best_automl_run_metrics)
# print("Model",best_automl_model)

Screenshot of the best model after the experiment completes:
![best model](images/best_model.jpg)
![best model steps](images/best_model_steps.jpg)

## Testing the best model

In [None]:
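# Load test data
# NOTE: this URL points at the training file (bankmarketing_train.csv), so the confusion matrix
# below measures fit on data the model has already seen, not held-out performance.
ds_test = TabularDatasetFactory.from_delimited_files(path='https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv')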

x, y = clean_data(ds_test)
df_test = x.join(y)
df_test = df_test[pd.notnull(df_test['y'])]

y_test = df_test['y']
X_test = df_test.drop(['y'], axis=1)

# predict
y_test_pred = best_model.predict(X_test)

# Visualize via confusion matrix
pd.DataFrame(confusion_matrix(y_test, y_test_pred)).style.background_gradient(cmap='Blues', low=0, high=0.9)
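
A confusion matrix like the one above can be condensed into a few scalar metrics. The sketch below assumes the binary layout that `sklearn.metrics.confusion_matrix` returns (rows are true labels, columns are predicted labels, so the matrix reads `[[tn, fp], [fn, tp]]`). The function name and the example counts are made up for illustration and are not results from this experiment.

```python
def summarize_confusion_matrix(cm):
    """Compute accuracy, precision and recall from a 2x2 confusion matrix.

    Expects [[tn, fp], [fn, tp]]; works with nested lists or a 2x2 NumPy array.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        # Guard against division by zero when a class is never predicted or never present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Illustrative counts only (not taken from the notebook's run).
example = summarize_confusion_matrix([[90, 10], [5, 15]])
```

With the illustrative counts, accuracy is 0.875, precision 0.6 and recall 0.75. Because the bank marketing labels are imbalanced, precision and recall for the positive class are usually more informative than accuracy alone.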

## Deployment

In [None]:
published_pipeline = pipeline_run.publish_pipeline(
    name="Bankmarketing Train", description="Training bankmarketing pipeline", version="1.0")

published_pipeline


Authenticate once again to retrieve the `auth_header`, so that the endpoint can be used.

In [None]:
auth_header = InteractiveLoginAuthentication().get_authentication_header()

## Enable logging / Application Insights
Now that the Best Model is deployed, enable Application Insights and retrieve logs. Although this is configurable at deploy time with a check-box, it is useful to be able to run code that will enable it for you.
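
Enabling Application Insights from code might look like the sketch below. It assumes `service` is an already deployed web service object, for example obtained with `Webservice(workspace=ws, name="<your-service-name>")` from `azureml.core.webservice`; the helper name `enable_app_insights` is hypothetical and the service name is a placeholder you need to fill in.

```python
def enable_app_insights(service):
    """Turn on Application Insights for a deployed web service and return its logs.

    `service` is assumed to expose `update(enable_app_insights=...)` and
    `get_logs()`, as Azure ML web service objects do.
    """
    service.update(enable_app_insights=True)
    return service.get_logs()


# Usage against a real workspace (left commented out because it needs Azure access):
# from azureml.core.webservice import Webservice
# service = Webservice(workspace=ws, name="<your-service-name>")
# print(enable_app_insights(service))
```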


In [None]:
# Ensure `az` is installed, as well as the Python SDK for Azure
# Create a new virtual environment with Python3
# Write and run code to enable Application Insights
# Use the provided code `logs.py` to view the logs

TODO: Take a screenshot showing that "Application Insights" is enabled in the Details tab of the endpoint.

TODO: Take a screenshot showing logs by running the provided <code>logs.py</code> script

## Swagger Documentation
In this step, you will consume the deployed model using Swagger.

Azure provides a Swagger JSON file for deployed models. Head to the Endpoints section and find your deployed model there; it should be the first one on the list.

A few things you need to pay attention to:

`swagger.sh` downloads the latest Swagger container and runs it on port 80. If you don't have permissions for port 80 on your computer, update the script to use a higher port number (above 9000 is a good idea).

`serve.py` starts a Python server on port 8000. This script needs to be located right next to the downloaded `swagger.json` file. NOTE: this will not work if `swagger.json` is not in the same directory.
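
To make the role of the file server concrete, here is a minimal sketch of what such a script might do. It is not the provided `serve.py`; the names `CORSRequestHandler` and `make_server` are hypothetical. The Swagger UI page is loaded from a different port than the one serving `swagger.json`, so the browser treats the download as a cross-origin request and expects a CORS header.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler


class CORSRequestHandler(SimpleHTTPRequestHandler):
    """Serve files from the current directory and allow cross-origin reads."""

    def end_headers(self):
        # Let the Swagger UI, which runs on another port, fetch swagger.json.
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()


def make_server(port=8000):
    """Create (but do not start) a local file server on the given port."""
    return HTTPServer(("localhost", port), CORSRequestHandler)
```

Calling `make_server().serve_forever()` from the directory that contains `swagger.json` would then expose the file at `http://localhost:8000/swagger.json`.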



In [None]:
# Download the swagger.json file
# Run `swagger.sh` and `serve.py`
# Interact with the Swagger instance running with the documentation for the HTTP API of the model.
# Display the contents of the API for the model

TODO: Take a screenshot showing that swagger runs on localhost showing the HTTP API methods and responses for the model


## Consume model endpoints
Once the model is deployed, use the endpoint.py script provided to interact with the trained model. In this step, you need to run the script, modifying both the scoring_uri and the key to match the key for your service and the URI that was generated after deployment.

Hint: This URI can be found in the Details tab, above the Swagger URI.




In [None]:
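# URL for the web service, should be similar to:
# 'http://8530a665-66f3-49c8-a953-b82a2d312917.eastus.azurecontainer.io/score' # URI that was generated after deployment
# NOTE: published_pipeline.endpoint triggers pipeline runs; it is not the scoring URI of the deployed model.
scoring_uri = ''  # paste the scoring URI of the deployed model here
# If the service is authenticated, set the key or token for your service.
# NOTE: auth_header is a dict ({"Authorization": "Bearer ..."}), not a key string, so it cannot be used here.
key = ''  # paste the key of the deployed service here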

# Two sets of data to score, so we get two results back
data = {"data":
        [
          {
            "age": 17,
            "campaign": 1,
            "cons.conf.idx": -46.2,
            "cons.price.idx": 92.893,
            "contact": "cellular",
            "day_of_week": "mon",
            "default": "no",
            "duration": 971,
            "education": "university.degree",
            "emp.var.rate": -1.8,
            "euribor3m": 1.299,
            "housing": "yes",
            "job": "blue-collar",
            "loan": "yes",
            "marital": "married",
            "month": "may",
            "nr.employed": 5099.1,
            "pdays": 999,
            "poutcome": "failure",
            "previous": 1
          },
          {
            "age": 87,
            "campaign": 1,
            "cons.conf.idx": -46.2,
            "cons.price.idx": 92.893,
            "contact": "cellular",
            "day_of_week": "mon",
            "default": "no",
            "duration": 471,
            "education": "university.degree",
            "emp.var.rate": -1.8,
            "euribor3m": 1.299,
            "housing": "yes",
            "job": "blue-collar",
            "loan": "yes",
            "marital": "married",
            "month": "may",
            "nr.employed": 5099.1,
            "pdays": 999,
            "poutcome": "failure",
            "previous": 1
          },
      ]
    }
# Convert to JSON string
input_data = json.dumps(data)
with open("data.json", "w") as _f:
    _f.write(input_data)

# Set the content type
headers = {'Content-Type': 'application/json'}
# If authentication is enabled, set the authorization header
headers['Authorization'] = f'Bearer {key}'

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.json())

# output should be similar to this: {"result": ["yes", "no"]}

**Alternatively trigger a run via notebook:**

Get the REST URL from the `endpoint` property of the published pipeline object. You can also find the REST URL in your workspace in the portal. Build an HTTP POST request to the endpoint, specifying your authentication header. Additionally, add a JSON payload object with the experiment name and the batch size parameter. As a reminder, the process_count_per_node is passed through to ParallelRunStep because it is defined as a PipelineParameter object in the step configuration.

Make the request to trigger the run. Access the Id key from the response dict to get the value of the run id.
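
The JSON payload described above can be assembled with a small helper. This is a sketch: the function name `build_pipeline_payload` and the parameter name `batch_size` are hypothetical, and the `ParameterAssignments` key is only meaningful if the pipeline actually defines matching `PipelineParameter` objects. The pipeline in this notebook defines none, so the experiment name alone is sufficient here.

```python
def build_pipeline_payload(experiment_name, parameters=None):
    """Build the JSON body for a POST request to a published pipeline endpoint."""
    payload = {"ExperimentName": experiment_name}
    if parameters:
        # Only include parameter overrides when the caller provides some.
        payload["ParameterAssignments"] = dict(parameters)
    return payload


# Payload matching this notebook's pipeline, which has no pipeline parameters.
basic_payload = build_pipeline_payload("pipeline-rest-endpoint")

# Hypothetical payload for a pipeline that defines a `batch_size` parameter.
parameterized_payload = build_pipeline_payload(
    "pipeline-rest-endpoint", {"batch_size": 20}
)
```

Either dictionary can be passed as the `json=` argument of `requests.post`, together with the authentication header.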


In [None]:
rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint, 
                         headers=auth_header, 
                         json={"ExperimentName": "pipeline-rest-endpoint"}
                        )
try:
    response.raise_for_status()
except Exception:    
    raise Exception("Received bad response from the endpoint: {}\n"
                    "Response Code: {}\n"
                    "Headers: {}\n"
                    "Content: {}".format(rest_endpoint, response.status_code, response.headers, response.content))

run_id = response.json().get('Id')
print('Submitted pipeline run: ', run_id)

# Use run id to monitor status of new run. This will take 10-15 min, looks similar to previous pipeline run, so you can skip watching full output.
published_pipeline_run = PipelineRun(ws.experiments["pipeline-rest-endpoint"], run_id)
RunDetails(published_pipeline_run).show()

TODO: Take a screenshot showing that the `endpoint.py` script runs against the API producing JSON output from the model.

## Optional: Benchmarking
The following is an optional step to benchmark the endpoint using Apache Benchmark (`ab`). You will not be graded on it, but I encourage you to try it out.



In [None]:
# Make sure you have the Apache Benchmark command-line tool installed and available in your path
# In `endpoint.py`, replace the key and URI again
# Run `endpoint.py`. A data.json file should appear
# Run the `benchmark.sh` file. The output should look similar to the text below

In [None]:
!ab -n 10 -v 4 -p data.json -T 'application/json' -H 'Authorization: Bearer REPLACE_WITH_KEY' http://REPLACE_WITH_API_URL/score

TODO: Take a screenshot showing that Apache Benchmark (ab) runs against the HTTP API using authentication keys to retrieve performance results

Run Apache Benchmark with 10 requests, producing output similar to:

 ```
 This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
 Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
 Licensed to The Apache Software Foundation, http://www.apache.org/

 Benchmarking 8530a665-66f3-49c8-a953-b82a2d312917.eastus.azurecontainer.io (be patient)...INFO: POST header ==
 ---
 POST /score HTTP/1.0
 Content-length: 812
 Content-type: application/json
 Authorization: Bearer Agb3D23IygXXXXXXXXXXXXXXXXXXXXXXXXX
 Host: 8530a665-66f3-49c8-a953-b82a2d312917.eastus.azurecontainer.io
 User-Agent: ApacheBench/2.3
 Accept: */*


 ---
 LOG: header received:
 HTTP/1.0 200 OK
 Content-Length: 33
 Content-Type: application/json
 Date: Thu, 30 Jul 2020 12:33:34 GMT
 Server: nginx/1.10.3 (Ubuntu)
 X-Ms-Request-Id: babfc511-a0f0-4ecb-a243-b3010a76b8b9
 X-Ms-Run-Function-Failed: False

 "{\"result\": [\"yes\", \"no\"]}"
 LOG: Response code = 200
 LOG: header received:
 HTTP/1.0 200 OK
 Content-Length: 33
 Content-Type: application/json
 Date: Thu, 30 Jul 2020 12:33:34 GMT
 Server: nginx/1.10.3 (Ubuntu)
 X-Ms-Request-Id: b48dd8da-0b4e-44fd-a1e5-04043bfa77f1
 X-Ms-Run-Function-Failed: False

 
 "{\"result\": [\"yes\", \"no\"]}"
 LOG: Response code = 200
 LOG: header received:
 HTTP/1.0 200 OK
 Content-Length: 33
 Content-Type: application/json
 Date: Thu, 30 Jul 2020 12:33:34 GMT
 Server: nginx/1.10.3 (Ubuntu)
 X-Ms-Request-Id: b48dd8da-0b4e-44fd-a1e5-04043bfa77f1
 X-Ms-Run-Function-Failed: False

 "{\"result\": [\"yes\", \"no\"]}"
 LOG: Response code = 200
 ..done


 Server Software:        nginx/1.10.3
 Server Hostname:        8530a665-66f3-49c8-a953-b82a2d312917.eastus.azurecontainer.io
 Server Port:            80

 Document Path:          /score
 Document Length:        33 bytes

 Concurrency Level:      1
 Time taken for tests:   1.599 seconds
 Complete requests:      10
 Failed requests:        0
 Total transferred:      2600 bytes
 Total body sent:        10560
 HTML transferred:       330 bytes
 Requests per second:    6.25 [#/sec] (mean)
 Time per request:       159.918 [ms] (mean)
 Time per request:       159.918 [ms] (mean, across all concurrent requests)
 Transfer rate:          1.59 [Kbytes/sec] received
                         6.45 kb/s sent
                         8.04 kb/s total

 Connection Times (ms)
               min  mean[+/-sd] median   max
 Connect:       21   23   0.8     23      24
 Processing:    92  137  28.3    151     176
 Waiting:       92  137  28.3    151     176
 Total:        114  160  28.0    172     199
```
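
The headline numbers in the sample output follow directly from the request count and the total time. As a worked check (the helper name `ab_summary` is hypothetical): requests per second is the number of completed requests divided by the time taken, and the mean time per request is the concurrency level times the time taken, divided by the number of requests.

```python
def ab_summary(complete_requests, seconds_taken, concurrency=1):
    """Recompute ab's 'Requests per second' and mean 'Time per request' (in ms)."""
    requests_per_second = complete_requests / seconds_taken
    mean_ms_per_request = concurrency * seconds_taken * 1000 / complete_requests
    return requests_per_second, mean_ms_per_request


# Values taken from the sample output above: 10 requests in 1.599 seconds.
rps, ms_per_request = ab_summary(complete_requests=10, seconds_taken=1.599)
```

This gives about 6.25 requests per second and about 159.9 ms per request; the sample output shows 159.918 ms because `ab` computes it from the unrounded duration.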

## Summary: Create and publish a pipeline
You must make sure to update the notebook to have the same keys, URI, dataset, cluster, and model names already created.

In [None]:

# Upload the Jupyter Notebook aml-pipelines-with-automated-machine-learning-step.ipynb to the Azure ML studio

# Update all the variables that are noted to match your environment

# Make sure a `config.json` has been downloaded and is available in the current working directory

# Run through the cells

# Verify the pipeline has been created and shows in Azure ML studio, in the Pipelines section

# Verify that the pipeline has been scheduled to run or is running


TODO: Please take the following screenshots to show your work:
- The pipeline section of Azure ML studio, showing that the pipeline has been created
- The pipelines section in Azure ML Studio, showing the Pipeline Endpoint
- The Bankmarketing dataset with the AutoML module
- The “Published Pipeline overview”, showing a REST endpoint and a status of ACTIVE
- In Jupyter Notebook, showing that the “Use RunDetails Widget” shows the step runs
- In ML studio showing the scheduled run


## Documentation

### Screencast
In this project, you need to record a screencast that shows the entire process of the working ML application. The screencast should meet the following format criteria: 1-5 minutes in length, clear and understandable audio, at least Full HD (16:9), and readable text.

The screencast should show the following:
- Working deployed ML model endpoint
- Deployed pipeline
- Available AutoML model
- Successful API requests to the endpoint with a JSON payload

If you are unable to provide audio, you can include a written description of your script in your README file instead.


In [None]:
# insert link to youtube here

## Readme
An important part of your project submission is a README file that describes the project and documents the main steps. Please use the provided README.md template as a starting point. The README should include the following areas:

- Project overview
- Architectural diagram
- Short description of how to improve the project in the future
- All screenshots mentioned above, with short descriptions
- Link to the screencast video on YouTube (or similar)

In [None]:
# insert link to readme here

## Cleanup
Not required, but included because I think it's important.

In [None]:
cpu_cluster.delete()