# Hyperparameter Tuning using HyperDrive

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [None]:
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources

import logging
import os
import csv

from matplotlib import pyplot as plt

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset
from azureml.core import Environment, ScriptRunConfig
from azureml.core.model import Model
from azureml.train.hyperdrive.parameter_expressions import choice, uniform
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.sklearn import SKLearn
from azureml.train.estimator import Estimator
from azureml.widgets import RunDetails

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

## Dataset

TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [None]:
# The data for this project comes from a Zindi Africa hackathon organized by
# Data Science Nigeria on predicting insurance claims from data on selected buildings.
# The dataset has 7,161 observations.
# Link: https://zindi.africa/competitions/data-science-nigeria-2019-challenge-1-insurance-prediction

In [None]:
ws = Workspace.from_config()
# Experiment names may only contain letters, numbers, underscores and dashes (no spaces)
experiment_name = 'hyperdrive-insurance'

experiment = Experiment(ws, experiment_name)


# Get the insurance dataset, registering it in the workspace if it is not already there
key = "insurance data"
description_text = "Data from Zindi Africa Data hackathon"
found = False

if key in ws.datasets.keys(): 
    found = True
    dataset = ws.datasets[key] 

if not found:
    # Create AML Dataset and register it into Workspace
    example_data = 'https://raw.githubusercontent.com/Ogbuchi-Ikechukwu/Azure-ML-Nanodegree-Capstone/master/starter_file/train_data_cleaned.csv'
    dataset = Dataset.Tabular.from_delimited_files(example_data)
    #Register Dataset in Workspace
    dataset = dataset.register(workspace=ws,
                               name=key,
                               description=description_text)


df = dataset.to_pandas_dataframe()
df.describe()

In [None]:
#Creating compute for hyper drive

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
# TODO: Create compute cluster
# max_nodes should be no greater than 4.

# choose a name for your cluster
cluster_name = "drive-compute"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS3_V2', 
                                                           max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

# can poll for a minimum number of nodes and for a specific timeout. 
# if no min node count is provided it uses the scale settings for the cluster
compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=30)
    
# use get_status() to get a detailed status for the current cluster.
print(compute_target.get_status().serialize())

In [None]:
# Set up the project folder and the environment for the training script
import os

project_folder = './'
os.makedirs(project_folder, exist_ok=True)
# config.json is expected to be in the project folder ('./') already, so it is not copied;
# shutil.copy('config.json', './') would raise shutil.SameFileError.

sklearn_env = Environment.from_conda_specification(name='sklearn-env', file_path='./conda_dependencies.yml')

## Hyperdrive Configuration

TODO: Explain the model you are using and the reason for choosing the different hyperparameters, termination policy and config settings.

I use a Logistic Regression model from scikit-learn to predict, from a building's features, whether that building will have an insurance claim over a given period of time. HyperDrive samples different values for two hyperparameters of the algorithm:

"C", the inverse of regularization strength (smaller values mean stronger regularization), and "max_iter", the maximum number of iterations allowed for the solver to converge. Different combinations of these hyperparameters lead to different results, and the goal is to find the best values.

I chose Random Sampling, in which hyperparameter values are randomly selected from the defined search space. "C" is drawn from a uniform distribution between 0.002 and 1.0, while "max_iter" is sampled from one of three values: 50, 150 and 1000.
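
As a rough illustration of what random sampling does with this search space, the sketch below draws a few configurations in plain Python. It does not use the Azure ML SDK: `sample_config` and the fixed seed are assumptions made only for this example, and the ranges are the ones passed to `RandomParameterSampling` in the configuration cell.

```python
import random

# Search space mirroring the configuration cell:
# C ~ uniform(0.002, 1.0), max_iter in {50, 150, 1000}
C_RANGE = (0.002, 1.0)
MAX_ITER_CHOICES = [50, 150, 1000]


def sample_config(rng):
    """Draw one hyperparameter configuration at random (hypothetical helper)."""
    return {
        "--C": rng.uniform(*C_RANGE),
        "--max_iter": rng.choice(MAX_ITER_CHOICES),
    }


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
samples = [sample_config(rng) for _ in range(5)]
for s in samples:
    print(s)
```

Each printed dictionary corresponds to one candidate child run: a continuous value for "C" and one of the three discrete values for "max_iter".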

For early termination I chose a Bandit Policy with a slack factor of 0.1. With "Accuracy" as the primary metric to maximize, any run whose best accuracy falls below the best performing run's accuracy divided by 1.1 (roughly 91% of it) is terminated.
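
The slack-factor rule can be sketched as a small function. This is a simplified illustration, assuming the metric is maximized and compared at a single evaluation interval; `should_terminate` is a hypothetical helper, not part of the Azure ML SDK, and the accuracy values are made up for the example.

```python
def should_terminate(run_metric, best_metric, slack_factor=0.1):
    """Return True if a run falls outside the slack allowed by a Bandit-style rule.

    Assumes the primary metric is maximized: a run is cut when its metric
    is below best_metric / (1 + slack_factor).
    """
    cutoff = best_metric / (1.0 + slack_factor)
    return run_metric < cutoff


best_accuracy = 0.80  # illustrative value, not a result from this experiment
for accuracy in (0.78, 0.74, 0.70):
    print(accuracy, should_terminate(accuracy, best_accuracy))
```

With a best accuracy of 0.80 the cutoff is 0.80 / 1.1, about 0.727, so only the run at 0.70 would be stopped in this example.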

A training script, train.py, has been added to the project folder.
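
The contents of train.py are not shown in this notebook. As a hedged sketch only, a script driven by HyperDrive would typically read the sampled values from the command line, since the sampling keys "--C" and "--max_iter" are passed to it as arguments. The parser below is an assumption about how such a script might begin, not the actual train.py; the default values are illustrative.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters that HyperDrive passes on the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--C", type=float, default=1.0,
                        help="Inverse of regularization strength")
    parser.add_argument("--max_iter", type=int, default=100,
                        help="Maximum number of solver iterations")
    return parser.parse_args(argv)


# Simulate the arguments one sampled configuration would produce
args = parse_args(["--C", "0.25", "--max_iter", "150"])
print(args.C, args.max_iter)
```

In a real training script these values would then be handed to the model constructor, and the resulting accuracy logged under the name "Accuracy" so that it matches `primary_metric_name`.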

In [None]:
# Create an early termination policy. 
early_termination_policy = BanditPolicy(slack_factor = 0.1)

# Create the different params that you will be using during training
param_sampling = RandomParameterSampling( {
    '--C': uniform(0.002, 1.0),
    '--max_iter': choice(50, 150, 1000)
} )

# Create the script run configuration and the HyperDrive config
src = ScriptRunConfig(source_directory=project_folder,
                      script='train.py',
                      compute_target=compute_target,
                      environment=sklearn_env)

hyperdrive_config = HyperDriveConfig(run_config=src,
                                     hyperparameter_sampling=param_sampling,
                                     policy=early_termination_policy,
                                     primary_metric_name="Accuracy",
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=100)

In [None]:

#TODO: Submit your experiment
hyperdrive_run = experiment.submit(hyperdrive_config)

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [None]:
RunDetails(hyperdrive_run).show()

## Best Model

TODO: In the cell below, get the best model from the hyperdrive experiments and display all the properties of the model.

In [None]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()

In [None]:
print(best_run.get_details()['runId'])

In [None]:
print(best_run.get_details()['runDefinition']['arguments'])

In [None]:
# Model Metrics
print(best_run.get_metrics())

In [None]:
best_run.get_details()

In [None]:
print(best_run.get_file_names())

In [None]:
# Save the best model to the current working directory
# (a leading '/' would write to the filesystem root)
best_run.download_file('outputs/model.joblib', 'model.joblib')


In [None]:
# Register the best model
model = best_run.register_model(model_name='sklearn-titanic',
                                model_path='outputs/model.joblib',
                                model_framework=Model.Framework.SCIKITLEARN)

## Model Deployment

Remember that you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [None]:
service_name = 'ikechukwu-service'
service = Model.deploy(ws, service_name, [model])
service.wait_for_deployment(show_output=True)

TODO: In the cell below, send a request to the web service you deployed to test it.

In [None]:
print(service.state)

TODO: In the cell below, print the logs of the web service and delete the service

In [None]:
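import json

# NOTE: the fields below are Titanic passenger features, not columns of the
# insurance dataset; replace them with the feature columns the model was trained on.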
data = {"data":
        [
          {
            "PassengerId": 812,
            "Pclass": 2,
            "Age": 23.0,
            "SibSp": 0,
            "Parch": 0, 
            "Fare": 13.0,
            "Q": 0,
            "S": 1,
            "male": 1
          },
          {
            "PassengerId": 813,
            "Pclass": 1,
            "Age": 35.0,
            "SibSp": 0,
            "Parch": 0, 
            "Fare": 512.3292,
            "Q": 0,
            "S": 0,
            "male": 1
          }
      ]
    }

# Convert to JSON string
input_data = json.dumps(data)

In [None]:
output = service.run(input_data)

In [None]:
print(output)