# ~ Capstone Project ~

# Azure Machine Learning Engineer 

## Completed by Audrey Tan
 ___
 
- [1. Project Overview](#intro)
- [2. Environment Setup](#env-setup)
    - [2.1 Import dependencies](#env1)
    - [2.2 Workspace and experiment setup](#env2)
    - [2.3 Create a compute cluster](#env3)
- [3. HyperDrive Experiment Submission](#hd-exp)
    - [3.1 Dataset validation](#hd-ds)
    - [3.2 Define a conda environment YAML file](#hd-env)
    - [3.3 Create a sklearn AML environment](#hd-sklearn)
    - [3.4 HyperDrive config setup](#hd-setup)
    - [3.5 HyperDrive run](#hd-run)
    - [3.6 Monitor HyperDrive run](#hd-watch)
    - [3.7 Examine the best HyperDrive model details](#hd-model)
    - [3.8 Save and register the best HyperDrive model](#hd-reg)
- [4. Model Deployment](#deploy)
    - [4.1 Deployment setup](#dply1)
    - [4.2 Deploy the model as a web service](#dply2)
    - [4.3 Testing the web service](#dply3)
    - [4.4 Enable Application Insights](#dply4)
    - [4.5 Printing the logs of the web service](#dply5)
    - [4.6 Active web service endpoint demo](#dply6)
- [5. Cleanup](#clean)
- [6. Citations](#cita)
 ___

## Part II - Custom Model Training with HyperDrive
#### This notebook contains the HyperDrive setup, training, and deployment steps using the Azure ML SDK. See the `automl` notebook for Part I - AutoML Model Training.
 ___

<a id='intro'></a>
## 1. Project Overview

> In this project, we use a Loan Prediction dataset from Kaggle to build a loan application classifier. The classification goal is to predict whether a loan application will be approved or denied, given the applicant's credit history and other socioeconomic and demographic data.
>
> We build two versions of the classifier: one using AutoML and one custom model. AutoML trains and selects the best model on its own, while the custom model relies on HyperDrive to tune the training hyperparameters. The better-performing model from the AutoML and HyperDrive experiment runs is selected for deployment. Scoring requests can then be sent to the deployment endpoint to test the deployed model. The diagram below provides an overview of the workflow.

![png](assets/MLworkflow.png)

<a id='env-setup'></a>

## 2. Environment Setup

This entails the following tasks:
> * Import all dependencies required to complete the project
>
> * Initialize workspace and create a new Experiment
>
> * Create a compute target for training 
>

<a id='env1'></a>
### 2.1 Import dependencies
#### import all the packages needed for the project

In [1]:
import logging
import os
import csv
import json
import requests

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core import ScriptRunConfig
from azureml.core.dataset import Dataset

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import LocalTarget, ComputeTargetException

from azureml.data.dataset_factory import TabularDatasetFactory
from sklearn.model_selection import train_test_split
from train import clean_data

from azureml.widgets import RunDetails
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice

from azureml.core import Model
from azureml.core import Webservice

from azureml.core import Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.19.0


In [2]:
print(pd.__version__)

0.25.3


<a id='env2'></a>
### 2.2 Workspace and experiment setup
#### Display the workspace details and set up an experiment

#### Initialize workspace
Initialize a workspace object from the persisted configuration. Make sure `config.json` is present in the current directory (`./config.json`).
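
Before calling `Workspace.from_config()`, it can help to confirm that the file is complete. The sketch below is a minimal standard-library check. It assumes the usual layout of a downloaded `config.json` with the keys `subscription_id`, `resource_group`, and `workspace_name`; the helper names `missing_config_keys` and `check_config_file` are hypothetical and not part of the Azure ML SDK.

```python
import json
from pathlib import Path

# Keys a workspace config.json is assumed to contain (the usual downloaded layout).
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(config):
    """Return the required keys that are absent or empty in a parsed config dict."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


def check_config_file(path="./config.json"):
    """Load the config file and return its missing keys (hypothetical helper)."""
    config_path = Path(path)
    if not config_path.is_file():
        raise FileNotFoundError(f"{config_path} not found")
    with config_path.open() as fh:
        return missing_config_keys(json.load(fh))


# Example with an incomplete, made-up config
print(missing_config_keys({"subscription_id": "abc", "resource_group": "rg"}))
```

An empty list means all three keys are present, so `Workspace.from_config()` has what it needs to locate the workspace.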

In [3]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

quick-starts-ws-130725
aml-quickstarts-130725
southcentralus
c463503f-66c4-48b5-9bb5-b66fec87c814


In [4]:
exp = Experiment(workspace=ws, name="capstone-hpdr-exp")

<a id='env3'></a>
### 2.3 Create a compute cluster

#### look for an available compute cluster in the workspace or create a new one to use

In [5]:
clist = ComputeTarget.list(workspace=ws)

In [6]:
clist

[{
   "id": "/subscriptions/c463503f-66c4-48b5-9bb5-b66fec87c814/resourceGroups/aml-quickstarts-130725/providers/Microsoft.MachineLearningServices/workspaces/quick-starts-ws-130725/computes/notebook130725",
   "name": "notebook130725",
   "location": "southcentralus",
   "tags": null,
   "properties": {
     "description": null,
     "computeType": "ComputeInstance",
     "computeLocation": "southcentralus",
     "resourceId": null,
     "provisioningErrors": null,
     "provisioningState": "Succeeded",
     "properties": {
       "vmSize": "STANDARD_DS3_V2",
       "applications": [
         {
           "displayName": "Jupyter",
           "endpointUri": "https://notebook130725.southcentralus.instances.azureml.ms"
         },
         {
           "displayName": "Jupyter Lab",
           "endpointUri": "https://notebook130725.southcentralus.instances.azureml.ms/lab"
         },
         {
           "displayName": "RStudio",
           "endpointUri": "https://notebook130725-8787.sout

In [7]:
clist[0]

| Name | Workspace | State | Location | VmSize | Application URI | Docs |
| --- | --- | --- | --- | --- | --- | --- |
| notebook130725 | quick-starts-ws-130725 | Running | southcentralus | STANDARD_DS3_V2 | Jupyter JupyterLab RStudio | Doc |


In [8]:
clist[1]

AmlCompute(workspace=Workspace.create(name='quick-starts-ws-130725', subscription_id='c463503f-66c4-48b5-9bb5-b66fec87c814', resource_group='aml-quickstarts-130725'), name=std-ds3-v2, id=/subscriptions/c463503f-66c4-48b5-9bb5-b66fec87c814/resourceGroups/aml-quickstarts-130725/providers/Microsoft.MachineLearningServices/workspaces/quick-starts-ws-130725/computes/std-ds3-v2, type=AmlCompute, provisioning_state=Succeeded, location=southcentralus, tags=None)

In [9]:
len(clist)

2

#### Create a new compute cluster in the workspace for the HyperDrive experiment run.

In [10]:
cluster_name = 'hd-ds3-v2' 

In [11]:
# Create compute cluster
# Use the default vm_size = "Standard_DS3_V2".
# max_nodes should be no greater than 4.

# Test cluster exists

try:
    compute_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print(f'compute cluster {cluster_name} already exists')
except ComputeTargetException:
    print(f'creating a new compute cluster {cluster_name} ...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS3_V2',
                                                           max_nodes=4)
    compute_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

compute_cluster.wait_for_completion(show_output=True)

creating a new compute cluster hd-ds3-v2 ...
Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


<a id='hd-exp'></a>
## 3. HyperDrive Experiment Submission

This entails the following tasks:
> * Validate dataset is cleaned and split correctly for the HyperDrive experiment run    
>
> * Define a conda environment YAML file
>
> * Create a sklearn AML environment
> 
> * Setup the HyperDrive config
>
> * Submit the HyperDrive experiment 
>
> * Monitor the HyperDrive run
>
> * Save the best HyperDrive model
>

<a id='hd-ds'></a>
### 3.1 Dataset validation

#### Dataset overview
The **external** dataset is the `train_u6lujuX_CVtuZ9i.csv` file from this [Kaggle Loan Prediction Problem Dataset](https://www.kaggle.com/altruistdelhite04/loan-prediction-problem-dataset), which I downloaded and staged in this [GitHub repo](https://raw.githubusercontent.com/atan4583/datasets/master/train.csv).

The dataset has 614 records and 13 columns. The **classification goal is to predict if a loan will be approved**. The input variables are the columns carrying the applicants' credit history and other socioeconomic and demographic attributes. The output variable, the `Loan_Status` column, indicates whether a loan application was approved or denied, i.e. True (1) or False (0).


The block of code cells below performs these validation tasks:
> 1. Checks whether the dataset is registered in the workspace; if not, creates it from the [GitHub repo](https://raw.githubusercontent.com/atan4583/datasets/master/train.csv) CSV and registers it
> 2. Loads it into a pandas dataframe for a quick exploration of the data
> 3. Runs the dataset through the cleaning function in `train.py` to generate the `x` and `y` dataframes
> 4. Checks that all columns in `x` and `y` are numeric with no missing values
> 5. Calls the sklearn `train_test_split` utility to split `x` and `y` into training and test sets
> 6. Creates training and validation dataframes and checks that all columns are numeric with no missing values

During the HyperDrive experiment run, the Python script `train.py`, placed in the same folder as this notebook, performs the same cleaning and splitting steps as listed above. It then calls the scikit-learn `LogisticRegression` algorithm to train the custom model, with `Azure HyperDrive` (a hyperparameter tuning engine) searching for the hyperparameters that produce the best classifier.

#### check whether the dataset exists in the workspace; if not, create and register it from the GitHub CSV. Then load it into a dataframe for a quick data exploration

In [12]:
key = "loan prediction dataset"
description_text = "loan prediction dataset for MLEMAND Capstone Project"

if key in ws.datasets:
    # Reuse the dataset already registered in the workspace
    dataset = ws.datasets[key]
else:
    # Create an AML tabular dataset from the CSV and register it in the workspace
    example_data = 'https://raw.githubusercontent.com/atan4583/datasets/master/train.csv'
    dataset = Dataset.Tabular.from_delimited_files(example_data)
    dataset = dataset.register(workspace=ws,
                               name=key,
                               description=description_text)


df = dataset.to_pandas_dataframe()
df.describe()

| | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History |
| --- | --- | --- | --- | --- | --- |
| count | 614.0 | 612.0 | 592.0 | 600.0 | 564.0 |
| mean | 5403.459283 | 1624.906863 | 146.412162 | 342.0 | 0.842199 |
| std | 6109.041673 | 2930.199261 | 85.587325 | 65.12041 | 0.364878 |
| min | 150.0 | 0.0 | 9.0 | 12.0 | 0.0 |
| 25% | 2877.5 | 0.0 | 100.0 | 360.0 | 1.0 |
| 50% | 3812.5 | 1211.5 | 128.0 | 360.0 | 1.0 |
| 75% | 5795.0 | 2303.0 | 168.0 | 360.0 | 1.0 |
| max | 81000.0 | 41667.0 | 700.0 | 480.0 | 1.0 |


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               601 non-null object
Married              611 non-null object
Dependents           599 non-null object
Education            614 non-null object
Self_Employed        582 non-null object
ApplicantIncome      614 non-null int64
CoapplicantIncome    612 non-null float64
LoanAmount           592 non-null float64
Loan_Amount_Term     600 non-null float64
Credit_History       564 non-null float64
Property_Area        614 non-null object
Loan_Status          614 non-null bool
dtypes: bool(1), float64(4), int64(1), object(7)
memory usage: 58.3+ KB


#### perform quick data exploration steps to see which columns of the raw data are non-numeric or have missing values

In [14]:
df.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | LP001002 | Male | False | 0 | Graduate | False | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | True |
| 1 | LP001003 | Male | True | 1 | Graduate | False | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | False |
| 2 | LP001005 | Male | True | 0 | Graduate | True | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | True |
| 3 | LP001006 | Male | True | 0 | Not Graduate | False | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | True |
| 4 | LP001008 | Male | False | 0 | Graduate | False | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | True |


In [15]:
df.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     2
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
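
To put these counts in proportion, the short sketch below converts them to shares of the 614 rows. The numbers are copied from the outputs above; nothing here touches the workspace, and the variable names are only illustrative.

```python
# Row count and per-column null counts copied from the outputs above.
n_rows = 614
null_counts = {
    "Gender": 13, "Married": 3, "Dependents": 15, "Self_Employed": 32,
    "CoapplicantIncome": 2, "LoanAmount": 22, "Loan_Amount_Term": 14,
    "Credit_History": 50,
}

# Share of missing values per column, largest first.
missing_share = {col: n / n_rows for col, n in null_counts.items()}
for col, share in sorted(missing_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<18} {share:6.1%}")
```

`Credit_History` has the largest gap at roughly 8% of rows, so every affected column still has most of its values available when the cleaning step runs.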

#### validate dataset download and data cleaning in `train.py` is working as expected

In [16]:
wurl='https://raw.githubusercontent.com/atan4583/datasets/master/train.csv'
ds = TabularDatasetFactory.from_delimited_files(wurl)

In [17]:
# clean the dataset
x, y = clean_data(ds)

In [18]:
# check column data type is numeric
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 11 columns):
Gender               614 non-null float64
Married              614 non-null float64
Dependents           614 non-null float64
Education            614 non-null int64
Self_Employed        614 non-null float64
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           614 non-null float64
Loan_Amount_Term     614 non-null float64
Credit_History       614 non-null float64
Property_Area        614 non-null int64
dtypes: float64(8), int64(3)
memory usage: 52.9 KB


In [19]:
#check column data type is numeric
y.to_frame().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 1 columns):
y    614 non-null int64
dtypes: int64(1)
memory usage: 4.9 KB


#### check no missing value in all columns

In [20]:
print(f'x null chk: \n{x.isnull().sum()}\n \ny null chk: \n{y.isnull().sum()}\n')

x null chk: 
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
dtype: int64
 
y null chk: 
0



#### split `x` and `y` into train and test sets

In [21]:
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, random_state=42)
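
The call above does not pass `test_size`, so scikit-learn's default of 0.25 applies. The sketch below reproduces only the size arithmetic, assuming the test count is rounded up and the train count is rounded down; the helper name `default_split_sizes` is hypothetical, and the stratification and shuffling done by `train_test_split` are not modeled.

```python
import math


def default_split_sizes(n_samples, test_size=0.25):
    """Train/test row counts for a fractional test_size.

    Assumes the test count is rounded up and the train count rounded down.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = math.floor((1 - test_size) * n_samples)
    return n_train, n_test


n_train, n_test = default_split_sizes(614)
print(n_train, n_test)
```

For 614 rows this gives 460 training rows and 154 test rows, which is what the `info()` outputs in the following cells report.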

#### validate no missing value in all columns 

In [22]:
print(f'x_train null chk: \n{x_train.isnull().sum()}\n \ny_train null chk: \n{y_train.isnull().sum()}\n')

x_train null chk: 
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
dtype: int64
 
y_train null chk: 
0



In [23]:
print(f'x_test null chk: \n{x_test.isnull().sum()}\n \ny_test null chk: \n{y_test.isnull().sum()}\n')

x_test null chk: 
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
dtype: int64
 
y_test null chk: 
0



In [24]:
x_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 460 entries, 1 to 27
Data columns (total 11 columns):
Gender               460 non-null float64
Married              460 non-null float64
Dependents           460 non-null float64
Education            460 non-null int64
Self_Employed        460 non-null float64
ApplicantIncome      460 non-null int64
CoapplicantIncome    460 non-null float64
LoanAmount           460 non-null float64
Loan_Amount_Term     460 non-null float64
Credit_History       460 non-null float64
Property_Area        460 non-null int64
dtypes: float64(8), int64(3)
memory usage: 43.1 KB


In [25]:
y_train.to_frame().info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 460 entries, 1 to 27
Data columns (total 1 columns):
y    460 non-null int64
dtypes: int64(1)
memory usage: 7.2 KB


In [26]:
x_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 154 entries, 194 to 557
Data columns (total 11 columns):
Gender               154 non-null float64
Married              154 non-null float64
Dependents           154 non-null float64
Education            154 non-null int64
Self_Employed        154 non-null float64
ApplicantIncome      154 non-null int64
CoapplicantIncome    154 non-null float64
LoanAmount           154 non-null float64
Loan_Amount_Term     154 non-null float64
Credit_History       154 non-null float64
Property_Area        154 non-null int64
dtypes: float64(8), int64(3)
memory usage: 14.4 KB


In [27]:
y_test.to_frame().info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 154 entries, 194 to 557
Data columns (total 1 columns):
y    154 non-null int64
dtypes: int64(1)
memory usage: 2.4 KB


#### combine `x_train` &  `y_train`, `x_test` & `y_test` into a training and validation dataframe respectively 

In [28]:
xt=pd.concat([x_train, y_train], axis=1)

In [29]:
xt.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 460 entries, 1 to 27
Data columns (total 12 columns):
Gender               460 non-null float64
Married              460 non-null float64
Dependents           460 non-null float64
Education            460 non-null int64
Self_Employed        460 non-null float64
ApplicantIncome      460 non-null int64
CoapplicantIncome    460 non-null float64
LoanAmount           460 non-null float64
Loan_Amount_Term     460 non-null float64
Credit_History       460 non-null float64
Property_Area        460 non-null int64
y                    460 non-null int64
dtypes: float64(8), int64(4)
memory usage: 46.7 KB


In [30]:
xt.head()

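| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1.0 | 1.0 | 1.0 | 1 | 0.0 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | 0 | 0 |
| 394 | 1.0 | 1.0 | 2.0 | 1 | 0.0 | 3100 | 1400.0 | 113.0 | 360.0 | 1.0 | 2 | 1 |
| 316 | 1.0 | 1.0 | 2.0 | 1 | 0.0 | 3717 | 0.0 | 120.0 | 360.0 | 1.0 | 1 | 1 |
| 62 | 1.0 | 1.0 | 0.0 | 0 | 1.0 | 2609 | 3449.0 | 165.0 | 180.0 | 0.0 | 0 | 0 |
| 158 | 1.0 | 0.0 | 0.0 | 1 | 0.0 | 2980 | 2083.0 | 120.0 | 360.0 | 1.0 | 0 | 1 |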


In [31]:
xt.isnull().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
y                    0
dtype: int64

In [32]:
xv=pd.concat([x_test, y_test], axis=1)

In [33]:
xv.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 154 entries, 194 to 557
Data columns (total 12 columns):
Gender               154 non-null float64
Married              154 non-null float64
Dependents           154 non-null float64
Education            154 non-null int64
Self_Employed        154 non-null float64
ApplicantIncome      154 non-null int64
CoapplicantIncome    154 non-null float64
LoanAmount           154 non-null float64
Loan_Amount_Term     154 non-null float64
Credit_History       154 non-null float64
Property_Area        154 non-null int64
y                    154 non-null int64
dtypes: float64(8), int64(4)
memory usage: 15.6 KB


In [34]:
xv.head()

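| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 194 | 1.0 | 0.0 | 0.0 | 1 | 0.0 | 4191 | 0.0 | 120.0 | 360.0 | 1.0 | 0 | 1 |
| 428 | 1.0 | 1.0 | 0.0 | 1 | 0.0 | 2920 | 0.0 | 87.0 | 360.0 | 1.0 | 0 | 1 |
| 444 | 1.0 | 1.0 | 0.0 | 1 | 0.0 | 7333 | 8333.0 | 175.0 | 300.0 | 0.0 | 0 | 1 |
| 34 | 1.0 | 0.0 | 3.0 | 1 | 0.0 | 12500 | 3000.0 | 320.0 | 360.0 | 1.0 | 0 | 0 |
| 164 | 1.0 | 1.0 | 0.0 | 1 | 0.0 | 9323 | 0.0 | 75.0 | 180.0 | 1.0 | 2 | 1 |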


In [35]:
xv.isnull().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
y                    0
dtype: int64

#### save training and validation dataframes as csv files

In [36]:
os.makedirs('./data', exist_ok=True)

In [37]:
xt.to_csv('data/hd-trn.csv', index = False)

In [38]:
xv.to_csv('data/hd-val.csv', index = False)

<a id='hd-env'></a>
### 3.2 Define a conda environment YAML file
#### The file lists the training script dependencies and is used to set up both the AML training environment and the deployment environment

In [39]:
%%writefile conda_env.yml
dependencies:
- python=3.6.2
- pip:
  - azureml-train-automl-runtime==1.18.0
  - inference-schema
  - azureml-interpret==1.18.0
  - azureml-defaults==1.18.0
- numpy>=1.16.0,<1.19.0
- pandas==0.25.1
- scikit-learn==0.22.1
- py-xgboost<=0.90
- fbprophet==0.5
- holidays==0.9.11
- psutil>=5.2.2,<6.0.0
channels:
- anaconda
- conda-forge

Writing conda_env.yml


<a id='hd-sklearn'></a>
### 3.3 Create a sklearn AML environment
#### for training the custom model

In [40]:
sklearn_env = Environment.from_conda_specification(name = 'sklearn_env', file_path = './conda_env.yml')

<a id='hd-setup'></a>
### 3.4 HyperDrive config setup
#### specify early termination policy, parameter sampler, create a ScriptRunConfig and hyperdrive config

In [41]:
# Early termination policy: stops poorly performing runs to save compute
policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

# Random sampling supports both discrete and continuous hyperparameters, and works with early termination
ps = RandomParameterSampling({'--C': uniform(0.1, 1.0), '--max_iter': choice(50,100,200)})


if "training" not in os.listdir():
    os.mkdir("./training")


cluster = ws.compute_targets[cluster_name]
max_run = 30
max_thread = 4

# Create a ScriptRunConfig for use with train.py

src = ScriptRunConfig(source_directory='.',
                      compute_target=cluster,
                      script='train.py',
                      arguments=['--C', 1.0, '--max_iter', 100],
                      environment=sklearn_env)

# Create a HyperDriveConfig using the ScriptRunConfig, hyperparameter sampler, and policy.
hyperdrive_config = HyperDriveConfig(hyperparameter_sampling=ps,
                                     primary_metric_name='Accuracy',
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                                     max_total_runs=max_run,
                                     max_concurrent_runs=max_thread,
                                     policy=policy,
                                     run_config=src)
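
The sampler and the early-termination rule configured above can be illustrated in plain Python. This is a stand-in written for explanation, not the Azure ML implementation: the function names are hypothetical, the slack rule follows the documented behaviour for a metric that is maximized (a run is stopped when it falls below the best metric divided by `1 + slack_factor`), and `evaluation_interval` is not modeled.

```python
import random


def sample_hyperparameters(rng):
    """Draw one configuration from the same search space as the sampler above."""
    return {
        "--C": rng.uniform(0.1, 1.0),              # continuous range
        "--max_iter": rng.choice([50, 100, 200]),  # discrete choices
    }


def bandit_should_terminate(run_metric, best_metric, slack_factor=0.1):
    """For a maximized metric: stop a run that is below best / (1 + slack_factor)."""
    return run_metric < best_metric / (1 + slack_factor)


rng = random.Random(0)
print(sample_hyperparameters(rng))

# With a best accuracy of 0.80 the cut-off is 0.80 / 1.1, about 0.727.
print(bandit_should_terminate(0.70, 0.80))
print(bandit_should_terminate(0.75, 0.80))
```

With `slack_factor=0.1`, a run only needs to stay within roughly 9% of the best accuracy seen so far to keep running, which is a fairly tolerant setting for a search capped at 30 runs.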

<a id='hd-run'></a>
### 3.5 HyperDrive run
#### submit the experiment

In [42]:
hyperdrive_run = exp.submit(config=hyperdrive_config,show_output=True)

<a id='hd-watch'></a>
### 3.6 Monitor HyperDrive run
#### use the `RunDetails` widget to follow the progress of the child runs

In [43]:
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

In [44]:
hyperdrive_run

| Experiment | Id | Type | Status | Details Page | Docs Page |
| --- | --- | --- | --- | --- | --- |
| capstone-hpdr-exp | HD_36adb28b-c77b-4b93-9b37-609567949b80 | hyperdrive | Running | Link to Azure Machine Learning studio | Link to Documentation |


#### Wait for hyperdrive run to complete

In [45]:
hyperdrive_run.wait_for_completion(show_output=True)

RunId: HD_36adb28b-c77b-4b93-9b37-609567949b80
Web View: https://ml.azure.com/experiments/capstone-hpdr-exp/runs/HD_36adb28b-c77b-4b93-9b37-609567949b80?wsid=/subscriptions/c463503f-66c4-48b5-9bb5-b66fec87c814/resourcegroups/aml-quickstarts-130725/workspaces/quick-starts-ws-130725

Streaming azureml-logs/hyperdrive.txt

<START>[2020-12-16T15:45:48.128221][API][INFO]Experiment created<END>
<START>[2020-12-16T15:45:48.569991][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space<END>
<START>[2020-12-16T15:45:48.739877][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.<END>
<START>[2020-12-16T15:45:49.5038553Z][SCHEDULER][INFO]The execution environment is being prepared. Please be patient as it can take a few minutes.<END>

Execution Summary
RunId: HD_36adb28b-c77b-4b93-9b37-609567949b80
Web View: https://ml.azure.com/experiments/capstone-hpdr-exp/runs/HD_36adb28b-c77b-4b93-9b37-609567949b80?wsid=/subscriptions/c

{'runId': 'HD_36adb28b-c77b-4b93-9b37-609567949b80',
 'target': 'hd-ds3-v2',
 'status': 'Completed',
 'startTimeUtc': '2020-12-16T15:45:47.522053Z',
 'endTimeUtc': '2020-12-16T16:16:15.263935Z',
 'properties': {'primary_metric_config': '{"name": "Accuracy", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '30e2cd8c-3b3f-44f7-a143-5a05c67facb0',
  'score': '0.8051948051948052',
  'best_child_run_id': 'HD_36adb28b-c77b-4b93-9b37-609567949b80_1',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://mlstrg130725.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_36adb28b-c77b-4b93-9b37-609567949b80/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=Q8eatg%2Fn9goqwIMV9MXE9O56LYUOeNcMFL3X0NXNZ18%3D&st=2020-12-16T16%3A06%3A23Z&se=2020-12-17T00%3A16%3A23Z&sp=r'}}

<a id='hd-model'></a>
### 3.7 Examine the best HyperDrive model details
#### retrieve the best HyperDrive model, print all the relevant properties and metrics 

In [46]:
best_hdrun = hyperdrive_run.get_best_run_by_primary_metric()

In [47]:
# print best model metrics
best_hdrun.get_metrics()

{'Regularization Strength:': 0.9434119869123043,
 'Max iterations:': 200,
 'Accuracy': 0.8051948051948052}
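
The reported accuracy can be read as a count of correct predictions. The sketch below assumes the metric was computed on a 154-row held-out split, the size obtained in section 3.1, since `train.py` is described as performing the same split; that script is not shown here, so treat the row count as an assumption.

```python
accuracy = 0.8051948051948052  # reported by the best run
n_test = 154                   # assumption: size of the held-out split from section 3.1

# Recover the number of correct and incorrect predictions.
n_correct = round(accuracy * n_test)
n_wrong = n_test - n_correct
print(n_correct, n_wrong)
```

Under that assumption the best model classifies 124 of 154 applications correctly and misclassifies 30.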

In [48]:
best_hdrun

| Experiment | Id | Type | Status | Details Page | Docs Page |
| --- | --- | --- | --- | --- | --- |
| capstone-hpdr-exp | HD_36adb28b-c77b-4b93-9b37-609567949b80_1 | azureml.scriptrun | Completed | Link to Azure Machine Learning studio | Link to Documentation |


In [49]:
best_hdrun.get_tags()

{'_aml_system_ComputeTargetStatus': '{"AllocationState":"steady","PreparingNodeCount":0,"RunningNodeCount":0,"CurrentNodeCount":0}'}

In [50]:
# print best model run details
best_hdrun.get_details()

{'runId': 'HD_36adb28b-c77b-4b93-9b37-609567949b80_1',
 'target': 'hd-ds3-v2',
 'status': 'Completed',
 'startTimeUtc': '2020-12-16T16:01:12.686404Z',
 'endTimeUtc': '2020-12-16T16:03:38.084911Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '30e2cd8c-3b3f-44f7-a143-5a05c67facb0',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'train.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': ['--C',
   '1',
   '--max_iter',
   '100',
   '--C',
   '0.9434119869123043',
   '--max_iter',
   '200'],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'hd-ds3-v2',
  'dataReferences': {},
  'data': {},
  'outputData': {},
  'jobName': None,
  'maxRunDurationSeconds': 2592000,
  'nodeCount': 1,
  'priority': None,
  'credentialPassthrough': False,
  'environment':

In [51]:
# print best model run metrics and parameters
print(f'best hyperdrive run metrics: \n{best_hdrun.get_metrics()}\n')
print(f'best hyperdrive run parameters: \n{best_hdrun.get_details()["runDefinition"]["arguments"]}\n')

best hyperdrive run metrics: 
{'Regularization Strength:': 0.9434119869123043, 'Max iterations:': 200, 'Accuracy': 0.8051948051948052}

best hyperdrive run parameters: 
['--C', '1', '--max_iter', '100', '--C', '0.9434119869123043', '--max_iter', '200']
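
Each flag appears twice in this list: first the defaults passed to `ScriptRunConfig`, then the values sampled by HyperDrive. The sketch below shows why the sampled values are the ones that take effect, assuming `train.py` reads its arguments with `argparse` (the script is not shown here, so the parser definition is hypothetical).

```python
import argparse

# Hypothetical parser mirroring the two arguments train.py is assumed to accept.
parser = argparse.ArgumentParser()
parser.add_argument("--C", type=float, default=1.0)
parser.add_argument("--max_iter", type=int, default=100)

# Argument list copied from the run details above.
argv = ['--C', '1', '--max_iter', '100',
        '--C', '0.9434119869123043', '--max_iter', '200']

# When a flag is repeated, argparse keeps the last value it sees.
args = parser.parse_args(argv)
print(args.C, args.max_iter)
```

The parsed values are `C = 0.9434119869123043` and `max_iter = 200`, matching the `Regularization Strength:` and `Max iterations:` metrics logged by the run.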



In [52]:
best_runid = best_hdrun.id
best_acc = best_hdrun.get_metrics()['Accuracy']
# The sampled values come last in the argument list: [..., '--C', <C>, '--max_iter', <max_iter>]
best_param_c = best_hdrun.get_details()["runDefinition"]["arguments"][-3]
best_param_mxitr = best_hdrun.get_details()["runDefinition"]["arguments"][-1]

In [53]:
# print best model run id, accuracy and the two hyperparameters used in training
print(f'best hyperdrive run job id: {best_runid}\n')
print(f'best hyperdrive run Accuracy: {best_acc}\n')
print(f'best hyperdrive run parameter C: {best_param_c}\n')
print(f'best hyperdrive run parameter max_iter: {best_param_mxitr}\n')

best hyperdrive run job id: HD_36adb28b-c77b-4b93-9b37-609567949b80_1

best hyperdrive run Accuracy: 0.8051948051948052

best hyperdrive run parameter C: 0.9434119869123043

best hyperdrive run parameter max_iter: 200



In [54]:
# print the best experiment run file paths and names
best_hdrun.get_file_names()

['azureml-logs/55_azureml-execution-tvmps_f59dbe5f8e8468bf75b4f7537a1b0739c16687e8099c7558b3bb08adeacc616d_d.txt',
 'azureml-logs/65_job_prep-tvmps_f59dbe5f8e8468bf75b4f7537a1b0739c16687e8099c7558b3bb08adeacc616d_d.txt',
 'azureml-logs/70_driver_log.txt',
 'azureml-logs/75_job_post-tvmps_f59dbe5f8e8468bf75b4f7537a1b0739c16687e8099c7558b3bb08adeacc616d_d.txt',
 'azureml-logs/process_info.json',
 'azureml-logs/process_status.json',
 'logs/azureml/102_azureml.log',
 'logs/azureml/dataprep/backgroundProcess.log',
 'logs/azureml/dataprep/backgroundProcess_Telemetry.log',
 'logs/azureml/dataprep/engine_spans_aa7ffc3e-d4d5-4e2f-9bb6-b1db8b6e2533.jsonl',
 'logs/azureml/dataprep/python_span_3b439807-c824-4c8d-8a30-ad181cde0c2e.jsonl',
 'logs/azureml/dataprep/python_span_aa7ffc3e-d4d5-4e2f-9bb6-b1db8b6e2533.jsonl',
 'logs/azureml/job_prep_azureml.log',
 'logs/azureml/job_release_azureml.log',
 'outputs/model.pkl']

<a id='hd-reg'></a>
### 3.8 Save and register the best HyperDrive model
#### Save the best model

In [55]:
os.makedirs('./hdmodel', exist_ok=True)
# use the run-relative file name exactly as listed by get_file_names()
best_hdrun.download_file('outputs/model.pkl', os.path.join('./hdmodel', 'hd_best_model.pkl'))

In [56]:
for f in best_hdrun.get_file_names():
    if f.startswith('outputs/model'):
        output_file_path = os.path.join('./hdmodel', f.split('/')[-1])
        print(f'Downloading from {f} to {output_file_path} ...')
        best_hdrun.download_file(name=f, output_file_path=output_file_path)

Downloading from outputs/model.pkl to ./hdmodel/model.pkl ...


#### Register the best model

In [57]:
# see best_hdrun.get_file_names() for the model file location and extension (.pkl or .joblib)
model=best_hdrun.register_model(
    model_name = 'hd_bestmodel', 
    model_path = './outputs/model.pkl',
    model_framework=Model.Framework.SCIKITLEARN,
    tags={'accuracy': best_acc},
    description='Loan Application Prediction'
)
model

Model(workspace=Workspace.create(name='quick-starts-ws-130725', subscription_id='c463503f-66c4-48b5-9bb5-b66fec87c814', resource_group='aml-quickstarts-130725'), name=hd_bestmodel, id=hd_bestmodel:1, version=1, tags={'accuracy': '0.8051948051948052'}, properties={})

<a id='deploy'></a>
## 4. Model Deployment

This entails the following tasks:
> * Deployment setup  
>
> * Deploy the model as a web service
>
> * Testing the web service 
>
> * Enable Application Insights
>
> * Printing the logs of the web service
>

### Note: The tasks in this section were skipped because the best-performing model was produced by the AutoML run. See the `automl` notebook for the model deployment tasks that were performed.

<a id='dply1'></a>
### 4.1 Deployment setup
#### use the conda environment YAML file to create a deployment environment, the scoring script to set up the inference config, and then set the ACI config

In [None]:
myenv = Environment.from_conda_specification(name = 'myenv',
                                             file_path = 'conda_env.yml')
myenv

In [None]:
# set inference config
inference_config = InferenceConfig(entry_script= 'score.py',
                                   environment=myenv)

In [None]:
# set Aci Webservice config
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1, auth_enabled=True)

<a id='dply2'></a>
### 4.2 Deploy the model as a web service
#### start model deployment and wait for the deployment to finish

In [None]:
service = Model.deploy(workspace=ws, 
                       name='best-hd-model', 
                       models=[model], 
                       inference_config=inference_config,
                       deployment_config=aci_config,
                       overwrite=True)

In [None]:
service

In [None]:
# wait for deployment to finish and print the scoring uri and swagger uri
service.wait_for_deployment(show_output=True)
print(f'\nservice state: {service.state}\n')

print(f'scoring URI: \n{service.scoring_uri}\n')
print(f'swagger URI: \n{service.swagger_uri}\n')

In [None]:
# print the primary authentication key for the deployed webservice
pkey, skey = service.get_keys()
print(f'primary key: {pkey}')

<a id='dply3'></a>
### 4.3 Testing the web service
#### randomly select 2 samples from the validation dataframe and send a request to the web service endpoint

In [None]:
# select 2 random samples from validation dataframe xv
scoring_sample = xv.sample(2)
y_label = scoring_sample.pop('y')

In [None]:
# serialize the sample records to a JSON string
scoring_json = json.dumps({'data': scoring_sample.to_dict(orient='records')})
print(f'{scoring_json}')
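
For reference, `to_dict(orient='records')` yields one dict per row, so the payload is a JSON object whose `data` key holds a list of row dicts. The sketch below reproduces that shape with the standard library only; the column names and values are hypothetical and are not taken from the notebook's validation dataframe.

```python
import json

# Hypothetical rows, shaped like the output of DataFrame.to_dict(orient='records').
records = [
    {"ApplicantIncome": 5000, "Credit_History": 1.0},
    {"ApplicantIncome": 2500, "Credit_History": 0.0},
]

# The scoring payload is a JSON string with the rows under the 'data' key.
payload = json.dumps({"data": records})

# Decoding the payload recovers the same list of row dicts.
decoded = json.loads(payload)
```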

In [None]:
# Set the content type
headers = {"Content-Type": "application/json"}

In [None]:
# set the authorization header
headers["Authorization"] = f"Bearer {pkey}"

In [None]:
# post a request to the scoring uri
resp = requests.post(service.scoring_uri, scoring_json, headers=headers)
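
The pieces of the authenticated request above (JSON body, content type, and bearer key) can also be put together in one place. This is a minimal sketch using only the standard library; `build_scoring_request`, the example URI, and the key are hypothetical, and the request is constructed but never sent.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, key, records):
    """Build (but do not send) a POST request for a key-authenticated scoring endpoint."""
    body = json.dumps({"data": records}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {key}",
    }
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")


# Hypothetical URI and key, for illustration only.
req = build_scoring_request(
    "http://example.invalid/score",
    "hypothetical-key",
    [{"Credit_History": 1.0}],
)
```

Building the request separately from sending it makes the headers and body easy to inspect before any call reaches the endpoint.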

In [None]:
# print the scoring results
print(resp.json())

In [None]:
# compare the scoring results with the corresponding y label values
print(f'True Values: {y_label.values}')
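
Instead of comparing the two printouts by eye, the predictions and true labels can be compared programmatically. The helper below is a small sketch; `accuracy` is a hypothetical name, and it assumes the service response has already been unpacked into a plain list of labels, which depends on what `score.py` returns.

```python
def accuracy(predictions, true_values):
    """Return the fraction of predictions that match the true labels."""
    if len(predictions) != len(true_values):
        raise ValueError("predictions and true_values must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    matches = sum(1 for p, t in zip(predictions, true_values) if p == t)
    return matches / len(true_values)
```

With only two or three random samples the result is a spot check of the endpoint, not a measure of model quality.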

#### another way to test the web service: call `service.run()` through the SDK instead of sending an HTTP request with a key

In [None]:
print(f'Prediction: {service.run(scoring_json)}')

<a id='dply4'></a>
### 4.4 Enable Application Insights
#### update web service to enable Application Insights and wait for the deployment to finish

In [None]:
service.update(enable_app_insights=True)

In [None]:
service.wait_for_deployment(show_output=True)
print(f'\nservice state: {service.state}\n')

<a id='dply5'></a>
### 4.5 Printing the logs of the web service
#### print the logs by calling the get_logs() function of the web service 

In [None]:
print(f'webservice logs: \n{service.get_logs()}\n')

<a id='dply6'></a>
### 4.6 Active web service endpoint demo
#### randomly select 3 samples from the validation dataframe and send a request to the web service endpoint

In [None]:
# select 3 random samples from the xv dataframe
scoring_sample = xv.sample(3)
y_label = scoring_sample.pop('y')

In [None]:
# serialize the sample records to a JSON string
scoring_json = json.dumps({'data': scoring_sample.to_dict(orient='records')})
print(f'{scoring_json}')

In [None]:
# send a request to the scoring uri
resp = requests.post(service.scoring_uri, scoring_json, headers=headers)

In [None]:
# print the scoring results
print(f'Prediction: {resp.json()}')

In [None]:
# compare the scoring results with the corresponding y label values
print(f'True Values: {y_label.values}')

In [None]:
# another way to test the scoring uri without sending a request with a key
print(f'Prediction: {service.run(scoring_json)}')

<a id='clean'></a>
## 5. Cleanup

### delete the web service

In [None]:
# clean up: delete the deployed web service (delete() returns None, so print a message separately)
service.delete()
print('service deleted')

In [None]:
try:
    print(f'service state: {service.state}')
except Exception:
    print('service not found')

<a id='cita'></a>
## 6. Citations

### Project Starter Code
[Udacity Github Repo](https://github.com/udacity/nd00333-capstone/tree/master/starter_file)

### MLEMAND ND Using Azure Machine Learning 
[Lesson 6.3 - Exercise: Hyperparameter Tuning with HyperDrive](https://youtu.be/SfFqgN1oebM)

### MLEMAND ND Machine Learning Operations 
[Lesson 2.5 - Exercise: Enable Security and Authentication](https://youtu.be/rsECJolX2Ns)

[Lesson 2.10 - Exercise: Deploy an Azure Machine learning Model](https://youtu.be/_RKfF1D6W24)

[Lesson 2.15 - Exercise: Enable Application Insights](https://youtu.be/EXGfNMMTuMY)

### Azure Machine Learning Documentation and Example Code Snippets
[List all ComputeTarget objects within the workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.computetarget?view=azure-ml-py#list-workspace-)

[Model Registration and Deployment](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/deployment/deploy-to-cloud/model-register-and-deploy.ipynb)

[Using environments](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/training/using-environments/using-environments.ipynb)

[AciWebservice Class](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.webservice.aciwebservice?view=azure-ml-py#deploy-configuration-cpu-cores-none--memory-gb-none--tags-none--properties-none--description-none--location-none--auth-enabled-none--ssl-enabled-none--enable-app-insights-none--ssl-cert-pem-file-none--ssl-key-pem-file-none--ssl-cname-none--dns-name-label-none--primary-key-none--secondary-key-none--collect-model-data-none--cmk-vault-base-url-none--cmk-key-name-none--cmk-key-version-none--vnet-name-none--subnet-name-none-)

[What is Application Insights?](https://docs.microsoft.com/en-us/azure/azure-monitor/app/app-insights-overview)

### External Dataset
[Kaggle Loan Prediction Dataset](https://www.kaggle.com/altruistdelhite04/loan-prediction-problem-dataset?select=train_u6lujuX_CVtuZ9i.csv)
