# Enabling Data Collection for Models in Production
With this notebook, you can learn how to collect model input data from your Azure Machine Learning service in Azure Blob storage. Once enabled, the collected data lets you:

* Monitor data drifts as production data enters your model
* Make better decisions on when to retrain or optimize your model
* Retrain your model with the data collected

## What data is collected?
* Model input data (voice, images, and video are not supported) from services deployed in Azure Kubernetes Service (AKS)
* Model predictions using production input data

**Note:** Pre-aggregation and pre-calculation on this data are the user's responsibility.
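Because aggregation is left to you, a first pass over a collected file can be as simple as computing per-column means. The following is a minimal sketch using only the standard library. The helper `column_means`, the column names, and the sample rows are made up for illustration, and the real collected .csv may contain additional columns.

```python
import csv
import io
from statistics import mean

def column_means(csv_text, columns):
    """Return the mean of each requested numeric column in a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {col: mean(float(row[col]) for row in rows) for col in columns}

# Hypothetical sample standing in for a downloaded data.csv
sample = "feat1,feat2\n1.0,10.0\n3.0,30.0\n"
print(column_means(sample, ["feat1", "feat2"]))
```

The same pattern extends to other summaries (minimum, maximum, counts) that you may want to track over time.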

## Import your dependencies

In [None]:
from azureml.core import Workspace, Run
from azureml.core.compute import AksCompute, ComputeTarget
from azureml.core.webservice import Webservice, AksWebservice
from azureml.core.image import Image
from azureml.core.model import Model

import azureml.core
print("Azure ML SDK Version: ", azureml.core.VERSION)

## Initialize Workspace
Initialize a workspace object from persisted configuration.

In [None]:
ws = Workspace.from_config()
print("Resource group: ", ws.resource_group)
print("Location: ", ws.location)
print("Workspace name: ", ws.name)

## Create a project directory
Create a directory to hold the code and files from your local machine that you will need on the remote resource. This includes the training script and any additional files it depends on.

In [None]:
import os

project_folder = '../projects/model_monitoring'
os.makedirs(project_folder, exist_ok=True)

## Register Model
Register an existing trained model and add a description and tags.

In [None]:
model = Model.register(
    model_path = "./resources/models/sklearn_regression_model.pkl", # this points to a local file
    model_name = "sklearn_regression_model", # this is the name the model is registered as
    tags = {'area': "diabetes", 'type': "regression"},
    description = "Ridge regression model to predict diabetes",
    workspace = ws
)

## *Update your scoring file with Data Collection*
### a. Import the module
```python
from azureml.monitoring import ModelDataCollector
```
### b. In your init function add:
```python
global inputs_dc, prediction_dc
inputs_dc = ModelDataCollector("best_model", identifier="inputs", feature_names=["feat1", "feat2", "feat3", "feat4", "feat5", "Feat6"])
prediction_dc = ModelDataCollector("best_model", identifier="predictions", feature_names=["prediction1", "prediction2"])
```
    
* Identifier: Used later to build the folder structure in your Blob. You can use it to separate "raw" data from "processed" data.
* CorrelationId: An optional parameter; you do not need to set it if your model doesn't require it. Having a correlationId in place makes it easier to map the collected data to other data. (Examples include LoanNumber, CustomerId, etc.)
* Feature Names: List these in the same order as your features so that the columns are named correctly when the .csv is created.
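To see why the order of the feature names matters, the sketch below pairs a list of names with rows of values purely by position, the way a CSV header lines up with its rows. It is only an illustration built on the standard library, not how `ModelDataCollector` is implemented; `rows_to_csv` is a hypothetical helper.

```python
import csv
import io

def rows_to_csv(feature_names, rows):
    """Write rows under a header; names map to values purely by position."""
    for row in rows:
        if len(row) != len(feature_names):
            raise ValueError("each row needs one value per feature name")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(feature_names)
    writer.writerows(rows)
    return buffer.getvalue()

# Swapping the two names would silently mislabel both columns.
print(rows_to_csv(["feat1", "feat2"], [[59, 32.1], [48, 21.6]]))
```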

### c. In your run function add:
```python
inputs_dc.collect(data)
prediction_dc.collect(result)
```

In [None]:
%%writefile score_diabetes.py
import pickle
import json
import numpy 
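try:
    import joblib
except ImportError:
    # older scikit-learn releases bundled joblib under sklearn.externals
    from sklearn.externals import joblib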
from sklearn.linear_model import Ridge
from azureml.core.model import Model

from azureml.monitoring import ModelDataCollector
import time

def init():
    global model
    print("model initialized " + time.strftime("%H:%M:%S"))
    # note here "sklearn_regression_model" is the name of the model registered under the workspace
    # this call should return the path to the model.pkl file on the local disk.
    model_path = Model.get_model_path(model_name = 'sklearn_regression_model')
    # deserialize the model file back into a sklearn model
    model = joblib.load(model_path)
    
    global inputs_dc, prediction_dc
    # this setup will help us save our inputs under the "inputs" path in our Azure Blob
    inputs_dc = ModelDataCollector(model_name="sklearn_regression_model", identifier="inputs", feature_names=["feat1", "feat2"]) 
    # this setup will help us save our predictions under the "predictions" path in our Azure Blob
    prediction_dc = ModelDataCollector("sklearn_regression_model", identifier="predictions", feature_names=["prediction1", "prediction2"])

def run(raw_data):
    global inputs_dc, prediction_dc
    try:
        data = json.loads(raw_data)['data']
        data = numpy.array(data)
        print("saving input data " + time.strftime("%H:%M:%S"))
        inputs_dc.collect(data)  # this call is saving our input data into our blob

        result = model.predict(data)
        print("saving prediction data " + time.strftime("%H:%M:%S"))
        prediction_dc.collect(result)  # this call is saving our prediction data into our blob
        
        # you can return any data type as long as it is JSON-serializable
        return result.tolist()
    except Exception as e:
        error = str(e)
        print(error + " " + time.strftime("%H:%M:%S"))
        return error

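The JSON handling in `run` can be checked locally before an image is built. The sketch below mirrors only the parsing step and uses plain lists instead of NumPy so that it has no dependencies. `parse_payload` is a hypothetical helper and is not part of the scoring script above.

```python
import json

def parse_payload(raw_data):
    """Extract the 'data' rows from a request body and check they are rectangular."""
    if isinstance(raw_data, bytes):
        raw_data = raw_data.decode("utf8")
    rows = json.loads(raw_data)["data"]
    # An empty payload or rows of differing length would not form a 2-D array
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        raise ValueError("all rows must have the same number of features")
    return rows

body = json.dumps({"data": [[1, 2, 3], [4, 5, 6]]})
print(parse_payload(body))
```
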
## *Update your myenv.yml file with the required module*

In [None]:
from azureml.core.conda_dependencies import CondaDependencies 

myenv = CondaDependencies.create(conda_packages=['numpy','scikit-learn'])
myenv.add_pip_package("azureml-monitoring")

with open(os.path.join(project_folder, "myenv.yml"),"w") as f:
    f.write(myenv.serialize_to_string())

## Create your new Image

In [None]:
from azureml.core.image import ContainerImage

image_config = ContainerImage.image_configuration(
    execution_script = "score_diabetes.py",
    runtime = "python",
    conda_file = os.path.join(project_folder, "myenv.yml"),
    description = "Image with ridge regression model",
    tags = {'area': "diabetes", 'type': "regression"}
)

image = ContainerImage.create(
    name = "diabetes-model",
    models = [model],
    image_config = image_config,
    workspace = ws
)

image.wait_for_creation(show_output = True)

In [None]:
print(model.name, model.description, model.version)

## Deploy to AKS service

### Create AKS compute

In [None]:
from azureml.core.compute import AksCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

aks_name = 'myaks'

try:
    aks_target = AksCompute(workspace=ws, name=aks_name)
    print('found existing:', aks_target.name)
except ComputeTargetException:
    print('creating new.')

    # AKS configuration
    prov_config = AksCompute.provisioning_configuration(
        agent_count=3,
        vm_size="Standard_B4ms"
    )
    
    # Create the cluster
    aks_target = ComputeTarget.create(
        workspace = ws, 
        name = aks_name, 
        provisioning_configuration = prov_config
    )

In [None]:
%%time
aks_target.wait_for_completion(show_output = True)
print(aks_target.provisioning_state)
print(aks_target.provisioning_errors)

### a. *Activate Data Collection and App Insights through updating AKS Webservice configuration*
To enable Data Collection and App Insights in your service, set both options in your AKS web service deployment configuration:

In [None]:
aks_config = AksWebservice.deploy_configuration(
    collect_model_data=True, 
    enable_app_insights=True
)

### b. Deploy your service

In [None]:
%%time
aks_service_name = 'diabetes-aks-svc'

aks_service = Webservice.deploy_from_image(
    workspace = ws, 
    name = aks_service_name,
    image = image,
    deployment_config = aks_config,
    deployment_target = aks_target
)

aks_service.wait_for_deployment(show_output = True)
print(aks_service.state)

## Test your service and send some data
**Note**: It can take around 15 minutes for your data to appear in your blob.
The data will appear in your Azure Blob in the following format:

/modeldata/subscriptionid/resourcegroupname/workspacename/webservicename/modelname/modelversion/identifier/year/month/day/data.csv 
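When post-processing the container, it can help to split a blob path back into its parts. The helper below is a sketch that assumes a path has exactly the segments listed in the format above; `parse_blob_path` and the sample values in `example` are hypothetical.

```python
FIELDS = [
    "subscriptionid", "resourcegroupname", "workspacename", "webservicename",
    "modelname", "modelversion", "identifier", "year", "month", "day",
]

def parse_blob_path(path):
    """Map each segment of a collected-data blob path to its field name."""
    parts = path.strip("/").split("/")
    # Expect: "modeldata", the ten fields above, then the file name
    if len(parts) != len(FIELDS) + 2 or parts[0] != "modeldata":
        raise ValueError("path does not match the expected layout")
    info = dict(zip(FIELDS, parts[1:-1]))
    info["filename"] = parts[-1]
    return info

# Made-up values, used only to exercise the helper
example = "/modeldata/sub-id/my-rg/my-ws/diabetes-aks-svc/sklearn_regression_model/1/inputs/2019/01/15/data.csv"
print(parse_blob_path(example)["identifier"])
```

Grouping parsed paths by `identifier` is one way to keep inputs and predictions apart when downloading files in bulk.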

In [None]:
%%time
import json

test_sample = json.dumps({'data': [
    [1,2,3,4,54,6,7,8,88,10], 
    [10,9,8,37,36,45,4,33,2,1]
]})
test_sample = bytes(test_sample, encoding='utf8')

prediction = aks_service.run(input_data = test_sample)
print(prediction)