# Data Science with Azure ML Services

## Logging into the ML Service Resource and Creating a Workspace

In [None]:
from azureml.core import Workspace, Experiment, Run

ws = Workspace.get(name='junior_ds',
                      subscription_id='idontthinkso', 
                      resource_group='Jason' 
                     )

A prompt will appear: "To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code FXKD2FUZQ to authenticate." Copy the code, open the link, and paste the code where prompted. Once you are connected, the notebook reports: "Interactive authentication successfully completed."

To create a workspace from scratch, run the cell below.
**Only run it if the workspace has not been created yet.**

In [None]:
import azureml.core
print(azureml.core.VERSION)

from azureml.core import Workspace

ws = Workspace.create(name='myworkspace',
                      subscription_id='<azure-subscription-id>', 
                      resource_group='myresourcegroup',
                      create_resource_group=True,
                      location='eastus2' # Or other supported Azure region   
                     )

In [None]:
# get workspace details
ws.get_details()

In [None]:
# Save the details of your workspace to a JSON configuration file in the current directory.
# Create the configuration file.
ws.write_config()
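
As a rough illustration of what such a configuration file holds, the sketch below writes and reads a JSON file with the three values that identify a workspace. The key names (`subscription_id`, `resource_group`, `workspace_name`) and the file name `config.json` are assumptions about the file layout, and the values are placeholders; in practice `ws.write_config()` creates the file for you.

```python
import json
import os
import tempfile

# Placeholder values; the key names are an assumption about the config layout.
workspace_details = {
    "subscription_id": "<azure-subscription-id>",
    "resource_group": "myresourcegroup",
    "workspace_name": "myworkspace",
}

# Write the details to a JSON file in a scratch directory.
config_dir = tempfile.mkdtemp()
config_path = os.path.join(config_dir, "config.json")
with open(config_path, "w") as f:
    json.dump(workspace_details, f, indent=4)

# Read the file back, as a later session would before reconnecting.
with open(config_path) as f:
    loaded = json.load(f)

print(loaded["workspace_name"])
```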

In [None]:
# logging information to your workspace
from azureml.core import Experiment

# create an experiment
exp = Experiment(workspace=ws, name='trial_exp')

# start a run
run = exp.start_logging()

# log a number
run.log('trial', 30)

# log a list (Fibonacci numbers)
run.log_list('my list', [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]) 

# finish the run
run.complete()

print('Execution complete')

#print results
print(run.get_portal_url())

In [None]:
# delete resources
# only run this when you are finished: the later cells still use ws
ws.delete(delete_dependent_resources=True)

## Regression Model Example

In [None]:
# import packages
import os
import urllib.request

# create a folder for the dataset
os.makedirs('./data', exist_ok = True)
# download the dataset into the folder; the training and test sets are downloaded separately
urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz', filename='./data/train-images.gz')
urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz', filename='./data/train-labels.gz')
urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz', filename='./data/test-images.gz')
urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz', filename='./data/test-labels.gz')

print('Code executed')

In [None]:
# functions needed by downstream code...

# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

import gzip
import numpy as np
import struct

# load compressed MNIST gz files and return numpy arrays
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        struct.unpack('I', gz.read(4))
        n_items = struct.unpack('>I', gz.read(4))
        if not label:
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items[0], n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
            res = res.reshape(n_items[0], 1)
    return res


# one-hot encode a 1-D array
def one_hot_encode(array, num_of_classes):
    return np.eye(num_of_classes)[array.reshape(-1)]

# load the data...

# To help the model to converge faster , you shrink the intensity values (X) from 0-255 to 0-1
X_train = load_data('./data/train-images.gz', False) / 255.0
y_train = load_data('./data/train-labels.gz', True).reshape(-1)

X_test = load_data('./data/test-images.gz', False) / 255.0
y_test = load_data('./data/test-labels.gz', True).reshape(-1)

print('Code executed')
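
The `load_data` helper above reads a small binary header before the pixel or label bytes: a 4-byte magic number, a 4-byte big-endian item count, and (for images) 4-byte row and column counts. The sketch below builds a tiny gzip file in memory with that layout and parses the header fields in the same order. The 2x2 "images" and their byte values are made up for illustration, and the magic number 2051 is the value assumed here for image files.

```python
import gzip
import io
import struct

# Build a made-up image file: magic number, 3 items, 2 rows, 2 columns.
header = struct.pack('>IIII', 2051, 3, 2, 2)
pixels = bytes(range(12))  # 3 items * 2 rows * 2 cols = 12 bytes
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode='wb') as gz:
    gz.write(header + pixels)

# Parse it back, reading the header fields in the same order as load_data.
buffer.seek(0)
with gzip.GzipFile(fileobj=buffer, mode='rb') as gz:
    magic = struct.unpack('>I', gz.read(4))[0]
    n_items = struct.unpack('>I', gz.read(4))[0]
    n_rows = struct.unpack('>I', gz.read(4))[0]
    n_cols = struct.unpack('>I', gz.read(4))[0]
    data = gz.read(n_items * n_rows * n_cols)

# Split the flat byte string into one list of pixel values per item.
row_len = n_rows * n_cols
images = [list(data[i * row_len:(i + 1) * row_len]) for i in range(n_items)]
print(magic, n_items, n_rows, n_cols)
print(images)
```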

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# now let's show some randomly chosen images from the training set.
count = 0
sample_size = 30
plt.figure(figsize = (16, 6))
for i in np.random.permutation(X_train.shape[0])[:sample_size]:
    count = count + 1
    plt.subplot(1, sample_size, count)
    plt.axhline('')
    plt.axvline('')
    plt.text(x=10, y=-10, s=y_train[i], fontsize=18)
    plt.imshow(X_train[i].reshape(28, 28), cmap=plt.cm.Greys)
plt.show()

In [None]:
# running model from local machine
from sklearn.linear_model import LogisticRegression

# create the model
clf = LogisticRegression()
#fit model
clf.fit(X_train, y_train)

#evaluate model using test set
y_hat = clf.predict(X_test)
#print the accuracy
print(np.average(y_hat == y_test))

print('Code executed')

Depending on your local machine's capacity, fitting models over many parameter values (for example, searching for the k that gives the best results with a KNN model) can take a long time. In that case, try running the model on a remote cluster.

#### Run and Experiment Overview

Before you begin, familiarize yourself with two concepts: run and experiment.

A run, in the context of the Azure Machine Learning service, is a single execution of Python code for a specific task, such as training a model or tuning hyperparameters. A run logs metrics and uploads results to the Azure platform, which gives you a natural way to keep track of jobs in your Workspace.

An experiment is a collection of runs. For example, you might have one run for the logistic regression model and another for a KNN model; together they make up an experiment whose results you can compare.
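
To make the relationship concrete, here is a toy model in plain Python, not the Azure ML SDK: an experiment is a named collection of runs, and each run is a record of logged metrics. The class names `ToyRun` and `ToyExperiment` and the accuracy values are hypothetical.

```python
class ToyRun:
    """Hypothetical stand-in for a run: stores metrics logged by one job."""
    def __init__(self, name):
        self.name = name
        self.metrics = {}

    def log(self, key, value):
        self.metrics[key] = value


class ToyExperiment:
    """Hypothetical stand-in for an experiment: groups related runs."""
    def __init__(self, name):
        self.name = name
        self.runs = []

    def start_run(self, run_name):
        new_run = ToyRun(run_name)
        self.runs.append(new_run)
        return new_run

    def best_run(self, metric):
        # Return the run with the highest value for the given metric.
        return max(self.runs, key=lambda r: r.metrics[metric])


toy_exp = ToyExperiment('compare-classifiers')
toy_exp.start_run('logistic-regression').log('accuracy', 0.92)  # made-up value
toy_exp.start_run('knn').log('accuracy', 0.95)                  # made-up value
print(toy_exp.best_run('accuracy').name)
```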

In [None]:
# imports used by the experiment cells below
# make sure you are connected to a workspace first
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

import azureml
from azureml.core import Workspace, Experiment, Run

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

In [None]:
#Create an experiment
experiment = Experiment(workspace = ws, name = "my-first-experiment")
#Create a run
run = experiment.start_logging()
run.log("trial",1)
run.complete()
print('Code executed')

In [None]:
#view logged results - go to link provided
print(run.get_portal_url())

## AML Compute as our compute resource

In [None]:
# import packages
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os

# Step 1: name the cluster and set the minimum and maximum number of nodes
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "cpucluster")
# environment variables are strings, so convert the node counts to int
min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0))
max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 3))

# Step 2: choose the VM size
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_D2_V2")


provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,
                                                                min_nodes = min_nodes, 
                                                                max_nodes = max_nodes)

# create the cluster and wait until provisioning finishes
compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
compute_target.wait_for_completion(show_output=True)
print('Code executed')

In [None]:
#upload data using get_default_datastore()
ds = ws.get_default_datastore()

ds.upload(src_dir='./data', target_path='mnist', overwrite=True, show_progress=True)
print('Code executed')

In [None]:
#create folder
folder_training_script= './trial_model_mnist'
os.makedirs(folder_training_script, exist_ok=True)

In [None]:
%%writefile $folder_training_script/train.py

# functions needed by downstream code...

# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

import gzip
import numpy as np
import struct

# load compressed MNIST gz files and return numpy arrays
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        struct.unpack('I', gz.read(4))
        n_items = struct.unpack('>I', gz.read(4))
        if not label:
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items[0], n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
            res = res.reshape(n_items[0], 1)
    return res


# one-hot encode a 1-D array
def one_hot_encode(array, num_of_classes):
    return np.eye(num_of_classes)[array.reshape(-1)]
#

# end of functions needed by downstream code...

import argparse
import os
import numpy as np

from sklearn.linear_model import LogisticRegression
# sklearn.externals.joblib was removed from newer scikit-learn releases; import joblib directly
import joblib

from azureml.core import Run
# from utils import load_data

# let user feed in 2 parameters, the location of the data files (from datastore), and the regularization rate of the logistic regression model
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
parser.add_argument('--regularization', type=float, dest='reg', default=0.01, help='regularization rate')
args = parser.parse_args()

data_folder = os.path.join(args.data_folder, 'mnist')
print('Data folder:', data_folder)

# load train and test set into numpy arrays
# note we scale the pixel intensity values to 0-1 (by dividing it with 255.0) so the model can converge faster.
X_train = load_data(os.path.join(data_folder, 'train-images.gz'), False) / 255.0
X_test = load_data(os.path.join(data_folder, 'test-images.gz'), False) / 255.0
y_train = load_data(os.path.join(data_folder, 'train-labels.gz'), True).reshape(-1)
y_test = load_data(os.path.join(data_folder, 'test-labels.gz'), True).reshape(-1)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape, sep = '\n')

# get hold of the current run
run = Run.get_context()

print('Train a logistic regression model with regularization rate of', args.reg)
clf = LogisticRegression(C=1.0/args.reg, random_state=42)
clf.fit(X_train, y_train)

print('Predict the test set')
y_hat = clf.predict(X_test)

# calculate accuracy on the prediction
acc = np.average(y_hat == y_test)
print('Accuracy is', acc)

# np.float was removed from newer NumPy releases; the built-in float is equivalent here
run.log('regularization rate', float(args.reg))
run.log('accuracy', float(acc))

os.makedirs('outputs', exist_ok=True)
# note file saved in the outputs folder is automatically uploaded into experiment record
joblib.dump(value=clf, filename='outputs/sklearn_mnist_model.pkl')

print('Code executed')

The helper functions for loading data are already included at the top of train.py, so no separate utils script is needed. Next, create an estimator so that it's easier to scale the work in the future. An estimator object is used to submit the run. Create your estimator by running the following code, which defines these items:

- The name of the estimator object, est.
- The directory that contains your scripts. All the files in this directory are uploaded into the cluster nodes for execution.
- The compute target. In this case, you use the Azure Machine Learning compute cluster you created.
- The training script name, train.py.
- Parameters required from the training script.
- Python packages needed for training.
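
The parameters required by the training script are delivered as command-line arguments, which train.py reads with `argparse`. The sketch below parses a hand-written argument list in the same way; the path `/mnt/datastore` is a made-up placeholder for whatever location the datastore mount resolves to on the cluster. It also shows how the script turns the regularization rate into scikit-learn's inverse-strength parameter `C`.

```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
parser.add_argument('--regularization', type=float, dest='reg', default=0.01, help='regularization rate')

# Hand-written argument list; the path is a hypothetical placeholder.
args = parser.parse_args(['--data-folder', '/mnt/datastore', '--regularization', '0.8'])

data_folder = os.path.join(args.data_folder, 'mnist')
C = 1.0 / args.reg  # a larger regularization rate gives a smaller C

print(data_folder)
print(C)
```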

In [None]:
from azureml.train.estimator import Estimator

script_params = {
    '--data-folder': ds.as_mount(),
    '--regularization': 0.8
}

# scikit-learn is installed into the training environment through conda_packages
est = Estimator(source_directory=folder_training_script,
                script_params=script_params,
                compute_target=compute_target,
                entry_script='train.py',
                conda_packages=['scikit-learn'])

In [None]:
# submit the run
run = experiment.submit(config=est)
print(run)

print('Code executed')

In [None]:
# monitor the run
from azureml.widgets import RunDetails
RunDetails(run).show()

In [None]:
#get result
print(run.get_metrics())

## Model Deployment

- Azure Container Instances and Kubernetes: https://docs.microsoft.com/azure/machine-learning/service/how-to-deploy-and-where#aci
- Azure IoT Edge: https://docs.microsoft.com/azure/machine-learning/service/how-to-deploy-and-where#iotedge
- Field Programmable Gate Array (FPGA): https://docs.microsoft.com/azure/machine-learning/service/how-to-deploy-and-where#fpga

In [None]:
#create a model
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import joblib

X, y = load_diabetes(return_X_y = True)
columns = ['age', 'gender', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
data = {
    "train":{"X": X_train, "y": y_train},        
    "test":{"X": X_test, "y": y_test}
}

print ("Data contains", len(data['train']['X']), "training samples and",len(data['test']['X']), "test samples")

# Create, fit, and test the scikit-learn Ridge regression model
regression_model = Ridge(alpha=0.03)
regression_model.fit(data['train']['X'], data['train']['y'])
preds = regression_model.predict(data['test']['X'])

# Output the Mean Squared Error to the notebook and to the run
print('Mean Squared Error is', mean_squared_error(data['test']['y'], preds))

In [None]:
# save the model as a pickle file
import joblib

joblib.dump(value=regression_model, filename='sklearn_regression_model.pkl')

In [None]:
#register the model
from azureml.core.model import Model
model = Model.register(model_path = "sklearn_regression_model.pkl",
                       model_name = "sklearn_regression_model.pkl",
                       tags = {'area': "diabetes", 'type': "regression"},
                       description = "Ridge regression model to predict diabetes",
                       workspace = ws)

After registering the model, you can open the Azure portal to see your registered models under the Model tab.

#### Create a Scoring Script

Container images let you deploy models reliably, because a machine learning model always depends on other software such as PyTorch. Packaging the model in a container avoids dependency issues.

A container image has the following items packaged, which you need to prepare:

- The model itself
- The inference engine, such as PyTorch
- The scoring file (score.py) or other application consuming the model
- Any dependencies needed

The first step is to create the score.py file that consumes the model, as shown below. You only need to define two functions: init, which loads the model, and run, which performs the inference:

In [None]:
%%writefile score.py
import pickle
import json
import numpy
import joblib
from sklearn.linear_model import Ridge
from azureml.core.model import Model

def init():
    global model
    model_path = Model.get_model_path('sklearn_regression_model.pkl')
    # deserialize the model file back into a sklearn model
    model = joblib.load(model_path)
    
# note you can pass in multiple rows for scoring
def run(raw_data):
    try:
        data = json.loads(raw_data)['data']
        data = numpy.array(data)
        result = model.predict(data)
        # you can return any datatype if it is JSON-serializable
        return result.tolist()
    except Exception as e:
        error = str(e)
        return error
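
To see the init/run contract without deploying anything, the sketch below exercises the same two-function shape locally. `StubModel`, `stub_init`, and `stub_run` are hypothetical names: the stub sums each input row and replaces the registered scikit-learn model purely so the example runs without Azure or a model file.

```python
import json

class StubModel:
    """Hypothetical stand-in for the real model: predicts the sum of each row."""
    def predict(self, rows):
        return [sum(row) for row in rows]

def stub_init():
    # The real init() loads the registered model; here we create the stub.
    global stub_model
    stub_model = StubModel()

def stub_run(raw_data):
    try:
        data = json.loads(raw_data)['data']
        result = stub_model.predict(data)
        return result  # must be JSON-serializable
    except Exception as e:
        # Errors are returned as a string, mirroring score.py.
        return str(e)

stub_init()
payload = json.dumps({'data': [[1, 2, 3], [4, 5, 6]]})
print(stub_run(payload))
print(stub_run('not json'))
```

Passing several rows in one request works because `data` is a list of rows, which matches the note in score.py about scoring multiple rows.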

The second step is to make sure the dependencies are included in the image. Azure Machine Learning does that by creating a conda dependency file:

In [None]:
from azureml.core.conda_dependencies import CondaDependencies 
myenv = CondaDependencies.create(conda_packages=['numpy','scikit-learn'])
with open("myenv.yml","w") as f:
    f.write(myenv.serialize_to_string())

The third step is preparing the container image by running the code below. This may take a few minutes. You should see several output messages display as it creates the image.

In [None]:
from azureml.core.image import Image, ContainerImage

image_config = ContainerImage.image_configuration(runtime= "python",
                                 execution_script="score.py",
                                 conda_file="myenv.yml",
                                 tags = {'area': "diabetes", 'type': "regression"},
                                 description = "Image with ridge regression model")

image = Image.create(name = "myimage1",
                     # this is the model object 
                     models = [model],
                     image_config = image_config, 
                     workspace = ws)

image.wait_for_creation(show_output = True)

print('Execution complete')

After the code has run, you can view the image in the Azure Machine Learning service portal.

#### Deploy Model

Define the deployment configuration. For example, the following code defines a container that uses 1 CPU and 1 GB of memory:

In [None]:
from azureml.core.webservice import AciWebservice

aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, 
                                               memory_gb=1, 
                                               tags={'sample name': 'AML 101'}, 
                                               description='This is a great example.')

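To deploy the model as a web service, you can use code similar to the code below. `deploy_from_model` takes the registered model together with the image configuration and the deployment configuration defined earlier. This may take a few minutes to finish.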

In [None]:
%%time
from azureml.core.webservice import Webservice

# Create the webservice 
service = Webservice.deploy_from_model(name='my-aci-svc3',
                                       deployment_config=aciconfig,
                                       models=[model],
                                       image_config=image_config,
                                       workspace=ws)

# Wait for the service deployment to complete while displaying log output.  This can take several minutes.
service.wait_for_deployment(show_output=True)

print('Execution complete')

#### Scoring Data w/ Deployed Model

Since the model is deployed as a web service that exposes a REST API, it can be tested with many tools. The Azure Machine Learning SDK has a built-in testing tool that works with the deployed web service, as shown below.

In [None]:
import json

# take the first row from the test set.
test_samples = json.dumps({"data": X_test[0:1, :].tolist()})

print(test_samples) # here is what we are sending to the service.

#score on our service
service.run(input_data = test_samples)
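
Because the service is a plain REST endpoint, any HTTP client can call it. The sketch below only builds the POST request with the standard library and does not send it. The URL is a made-up placeholder (with a real deployment you would use the service's own scoring URI), and the row of ten zeros stands in for a real test sample.

```python
import json
import urllib.request

# Hypothetical placeholder; a real deployment exposes its own scoring URI.
scoring_uri = 'http://example.invalid/score'

# Same payload shape that score.py expects: {"data": [[...], ...]}
body = json.dumps({'data': [[0.0] * 10]}).encode('utf-8')

request = urllib.request.Request(
    scoring_uri,
    data=body,
    headers={'Content-Type': 'application/json'},
    method='POST',
)

print(request.get_method(), request.full_url)
# To call a live service you would pass the request to urllib.request.urlopen.
```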

## Pipelines

You can use the Azure Machine Learning SDK for Python to create ML pipelines, as well as to submit and track individual pipeline runs.

## Auto ML

AutoML provides an automated solution: it lets Azure Machine Learning Services run all the models concurrently, compare the results, and recommend the best model for the job.

In [None]:
#Create an experiment
experiment = Experiment(workspace = ws, name = "my-third-experiment")

In [None]:
# insert automl parameters
from azureml.train.automl import AutoMLConfig
import logging

automl_config = AutoMLConfig(task = 'regression',
                             iteration_timeout_minutes = 6,
                             iterations = 3,
                             primary_metric = 'spearman_correlation',
                             n_cross_validations = 5,
                             debug_log = 'automl.log',
                             verbosity = logging.INFO,
                             X = X_train, 
                             y = y_train)

print('Execution Complete')

In [None]:
# run automl
local_run = experiment.submit(automl_config, show_output = True)

print('Execution Complete')

In [None]:
#retrieve best model
from azureml.widgets import RunDetails
RunDetails(local_run).show()

In [None]:
#another way to retrieve best model w/o widget
best_run, fitted_model = local_run.get_output()

print(best_run)
print(fitted_model)

## Using HyperDrive to auto-tune parameters

HyperDrive is a built-in service that automatically launches multiple experiments in parallel, each with a different parameter configuration.
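
As a conceptual illustration only (this is plain Python, not the HyperDrive API), the sketch below draws several random parameter configurations of the kind such a service would launch as separate runs, then picks the best one. The parameter names, ranges, and the scoring function are made up; a real search would train a model for each configuration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def sample_config():
    # Hypothetical search space: a continuous rate and a categorical choice.
    return {
        'regularization': rng.uniform(0.01, 1.0),
        'solver': rng.choice(['lbfgs', 'saga']),
    }

def toy_score(config):
    # Made-up objective standing in for a trained model's accuracy.
    return 1.0 - abs(config['regularization'] - 0.5)

configs = [sample_config() for _ in range(5)]
scores = [toy_score(c) for c in configs]
best_index = max(range(len(configs)), key=lambda i: scores[i])
best = configs[best_index]

print(best)
```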

## Manage and Monitor Models

In the Azure Machine Learning service, you can track experiments and monitor metrics with a few lines of code. The key tasks are:

- how to add logs into your scripts so Azure Machine Learning can track the key metrics
- how to submit the experiments with model monitoring enabled
- how to monitor the progress of running jobs
- how to view the model's results

To explore the Azure Machine Learning service further with in-depth code samples, follow the link: https://github.com/Azure/MachineLearningNotebooks

For full details and documentation follow the link: https://docs.microsoft.com/azure/machine-learning/service/how-to-enable-data-collection

In [None]:
#query models
regression_models = Model.list(workspace=ws, tags=['area'])
for m in regression_models:
    print("Name:", m.name,"\tVersion:", m.version, "\tDescription:", m.description, m.tags)

The SDK allows data scientists to code within any Python environment including Jupyter Notebooks, PyCharm, and Visual Studio Code.

In [None]:
!pip install azureml-monitoring

In [None]:
#enable data collection
#The first step is enabling data collection. This can be done using either the SDK or the portal. In the SDK,
#pass the two flags below when you define the deployment configuration (they do not go in score.py).
#enable_app_insights=True also enables Application Insights.

aciconfig = AciWebservice.deploy_configuration(collect_model_data=True, enable_app_insights=True)


#sample of where to insert
'''from azureml.core.webservice import AciWebservice

aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, 
                                               memory_gb=1, 
                                               tags={'sample name': 'AML 101'}, 
                                               description='This is a great example.',
                                               collect_model_data=True,
                                               enable_app_insights=True)'''

##### Collecting Model Data

There are four steps to use the SDK to collect model data.

In [None]:
#step 1: import package
from azureml.monitoring import ModelDataCollector

In [None]:
#step 2: add the following code to the init() function
global inputs_dc, prediction_dc
inputs_dc = ModelDataCollector("best_model", identifier="inputs", feature_names=["feat1", "feat2", "feat3", "feat4", "feat5", "Feat6"])
prediction_dc = ModelDataCollector("best_model", identifier="predictions", feature_names=["prediction1", "prediction2"])

#sample of code to input
'''%%writefile score.py
import pickle
import json
import numpy
import joblib
from sklearn.linear_model import Ridge
from azureml.core.model import Model
from azureml.monitoring import ModelDataCollector

def init():
    global model
    model_path = Model.get_model_path('sklearn_regression_model.pkl')
    # deserialize the model file back into a sklearn model
    model = joblib.load(model_path)
    #inserted code
    global inputs_dc, prediction_dc
    inputs_dc = ModelDataCollector("best_model", identifier="inputs", feature_names=["feat1", "feat2", "feat3", "feat4", "feat5", "Feat6"])
    prediction_dc = ModelDataCollector("best_model", identifier="predictions", feature_names=["prediction1", "prediction2"])
    
# note you can pass in multiple rows for scoring
def run(raw_data):
    try:
        data = json.loads(raw_data)['data']
        data = numpy.array(data)
        result = model.predict(data)
        # you can return any datatype if it is JSON-serializable
        return result.tolist()
    except Exception as e:
        error = str(e)
        return error'''

- Identifier is later used to build the folder structure in your Blob storage. It can be used to separate "raw" data from "processed" data.
- CorrelationId is an optional parameter. You do not need to set it if your model doesn't require it. Having a correlationId in place makes it easier to map the collected data to other data. (Examples include LoanNumber, CustomerId, etc.)
- Feature Names need to be listed in the same order as your features so that the columns are named correctly when the .csv is created.
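
The point about feature names can be illustrated with the standard `csv` module: column headers are matched to values purely by position, so the names must be listed in the same order as the features. This is a sketch of the idea, not the collector's actual implementation; the feature names and values are made up.

```python
import csv
import io

feature_names = ['age', 'bmi', 'bp']                 # hypothetical names, in feature order
rows = [[0.04, 0.06, 0.02], [-0.01, -0.05, -0.03]]   # made-up input rows

out = io.StringIO()
writer = csv.writer(out, lineterminator='\n')
writer.writerow(feature_names)  # header row
writer.writerows(rows)          # one line per collected input

# Read it back: each value is labelled by the header in the same position.
out.seek(0)
records = list(csv.DictReader(out))
print(records[0]['bmi'])
```

If the names were listed in a different order from the features, the values would still be written, but under the wrong column headers.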

In [None]:
#step 3: add the collection function calls, lines below, to your run function.
inputs_dc.collect(data)
prediction_dc.collect(result)

In [None]:
#step 4: update the myenv.yml file with the modules the scoring script needs, including azureml-monitoring.
from azureml.core.conda_dependencies import CondaDependencies 
myenv = CondaDependencies.create(conda_packages=['numpy','scikit-learn'])
myenv.add_pip_package("azureml-monitoring")
with open("myenv.yml","w") as f:
    f.write(myenv.serialize_to_string())

##### Where Can I View the Collected Data?

The output of the monitoring service is saved in the Azure Blob storage account associated with your Azure Machine Learning service workspace. The file path follows this pattern:

/modeldata/<subscriptionid>/<resourcegroup>/<workspace>/<webservice>/<model>/<version>/<identifier>/<year>/<month>/<day>/data.csv

An example would be:

/modeldata/1a2b3c4d-5e6f-7g8h-9i10-j11k12l13m14/myresourcegrp/myWorkspace/aks-w-collv9/best_model/10/inputs/2018/12/31/data.csv
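
The pattern can be assembled mechanically from its parts. The helper below is a hypothetical convenience, not part of the SDK; it reproduces the example path above, and zero-padding the month and day to two digits is an assumption.

```python
from datetime import date

def model_data_path(subscription_id, resource_group, workspace, webservice,
                    model, version, identifier, day):
    # Hypothetical helper: join the documented path components in order.
    parts = [
        'modeldata', subscription_id, resource_group, workspace, webservice,
        model, str(version), identifier,
        f'{day.year}', f'{day.month:02d}', f'{day.day:02d}', 'data.csv',
    ]
    return '/' + '/'.join(parts)

path = model_data_path(
    '1a2b3c4d-5e6f-7g8h-9i10-j11k12l13m14', 'myresourcegrp', 'myWorkspace',
    'aks-w-collv9', 'best_model', 10, 'inputs', date(2018, 12, 31),
)
print(path)
```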