# Verify Python libraries are installed
Your Synapse Spark pool already includes all the libraries required to run this notebook. They were installed during pool creation from a requirements.txt file.

The libraries installed are:
```
numpy==1.17.1
pandas==0.24.2
idna==2.5
scipy==1.3.1
azureml-sdk==1.3.0
azureml-automl-core==1.3.0
azureml-automl-runtime==1.2.0
```

Synapse Spark pools already include the libraries required to connect to the Cosmos DB transactional and analytical stores.

In [4]:
import azureml.core
from azureml.core import Run, Workspace
from azureml.core.model import Model
from azureml.core.experiment import Experiment

import scipy

# Verify versions of key libraries
# view version history at https://pypi.org/project/azureml-sdk/#history 
print("Azure ML SDK Version:", azureml.core.VERSION)
print("SciPy Version: ", scipy.__version__)

Azure ML SDK Version: 1.6.0
SciPy Version:  1.1.0
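
The versions printed above differ from the pinned list (Azure ML SDK 1.6.0 versus 1.3.0, SciPy 1.1.0 versus 1.3.1), so it can help to compare all pins in one pass. The following is a minimal sketch, assuming Python 3.8 or later for `importlib.metadata`. The helper name `find_version_mismatches` and the subset of pins checked are illustrative choices, not part of the lab.

```python
from importlib import metadata

# Pins taken from the list above (a subset, for brevity).
EXPECTED_VERSIONS = {
    "numpy": "1.17.1",
    "pandas": "0.24.2",
    "scipy": "1.3.1",
    "azureml-sdk": "1.3.0",
}


def find_version_mismatches(expected):
    """Return {package: (expected, installed)} for every pin that is not met.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in find_version_mismatches(EXPECTED_VERSIONS).items():
    print(f"{name}: expected {wanted}, found {installed or 'not installed'}")
```

A mismatch is not necessarily a problem, but it is worth knowing about before loading a pickled model that was trained against specific library versions.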

# Configure access to the Azure Machine Learning resources

Configure Service Principal authentication by following the instructions here: [Setup Service Principal Authentication](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?view=azure-ml-py#set-up-service-principal-authentication).

Use the JSON output from the commands in the above link to retrieve the values needed for `tenant_id`, `service_principal_id`, and `service_principal_password` in the next cell.

Note: if the Azure account you are using has access to multiple Azure subscriptions, **make sure you run CLI commands in the correct Azure subscription**. You can set the default subscription to the one you are using for the lab/demo with the Azure CLI command `az account set`.

Reference: https://docs.microsoft.com/cli/azure/account#az-account-set
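
Instead of copying the three values by hand, the JSON output can be parsed directly. This is a minimal sketch, assuming the output contains the `tenantId`, `clientId`, and `clientSecret` keys named in the variables cell of this notebook. The function name and the sample values are placeholders invented for illustration.

```python
import json

REQUIRED_KEYS = ("tenantId", "clientId", "clientSecret")


def parse_service_principal_json(sp_json):
    """Return (tenant_id, service_principal_id, service_principal_password)."""
    data = json.loads(sp_json)
    missing = [key for key in REQUIRED_KEYS if not data.get(key)]
    if missing:
        raise ValueError("Missing values in service principal JSON: " + ", ".join(missing))
    return data["tenantId"], data["clientId"], data["clientSecret"]


# Placeholder values only; paste your own CLI output instead.
sample_json = """
{
  "clientId": "00000000-0000-0000-0000-000000000001",
  "clientSecret": "placeholder-secret",
  "tenantId": "00000000-0000-0000-0000-000000000002"
}
"""

sample_tenant, sample_client, sample_secret = parse_service_principal_json(sample_json)
print(sample_tenant, sample_client)
```

Avoid saving real secrets in the notebook itself; clear the pasted JSON after the values have been read.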

# Variables

Provide values for the following variables, which are used throughout the rest of this notebook.

In [33]:
# Provide the Subscription ID of the Azure subscription you are using for the lab/demo
subscription_id = "220fc532-6091-423c-8ba0-66c2397d591b"

# Resource Group name where your lab/demo resources are deployed
resource_group = "pz-iot-demo-20200618-1"

# Azure Machine Learning Workspace name and Azure region
# Get these from the Azure ML workspace Overview in your Resource Group
workspace_name = "Cosmos-DB-IoT-ML-vhdozdcujguho"
workspace_region = "East US"

# Values from `Setup Service Principal Authentication` in the above cell
# For reference, SP name you created (not needed in a variable): pz-ml-auth
tenant_id = "" # Use "tenantId" value
service_principal_id = "" # Use "clientId" value
service_principal_password = "" # Use "clientSecret" value

# Pre-trained ML model
# Update for final release
# pkl_url = "https://github.com/AzureCosmosDB/scenario-based-labs/blob/master/IoT/deploy/modelv3.pkl?raw=true"
pkl_url = "https://github.com/plzm/scenario-based-labs/blob/iot-2020/IoT/deploy/modelv3.pkl?raw=true"
local_folder = "models"
local_path = local_folder+"/modelv3.pkl"
model_name = "batt-cycles-7"

# Cosmos DB
# Change this to the Azure region to which you deployed your lab/demo
cosmos_db_region = "East US"
cosmos_db_database = "ContosoAuto"
cosmos_db_container_metadata = "metadata"
cosmos_db_container_maintenance = "maintenance"

synapse_cosmos_db_linked_service = "CosmosDbIoTLab"


# Batch Scoring data
In this notebook, you will use a forecasting model to determine whether a vehicle's battery will need replacement within the next 30 days.

In [9]:
from azureml.core.authentication import ServicePrincipalAuthentication

sp = ServicePrincipalAuthentication(
    tenant_id=tenant_id,
    service_principal_id=service_principal_id,
    service_principal_password=service_principal_password)

In [10]:
# Workspace.get returns a reference to an existing workspace; it raises an exception if the workspace is not found
from azureml.core import Workspace

ws = Workspace.get(
    name=workspace_name,
    auth=sp,
    subscription_id=subscription_id,
    resource_group=resource_group)

ws.get_details()

{'id': '/subscriptions/220fc532-6091-423c-8ba0-66c2397d591b/resourceGroups/pz-iot-demo-20200618-1/providers/Microsoft.MachineLearningServices/workspaces/Cosmos-DB-IoT-ML-vhdozdcujguho', 'name': 'Cosmos-DB-IoT-ML-vhdozdcujguho', 'location': 'eastus', 'type': 'Microsoft.MachineLearningServices/workspaces', 'sku': 'Basic', 'workspaceid': 'aa21de48-1752-4fd5-85c1-308ec85899bb', 'description': '', 'friendlyName': 'Cosmos-DB-IoT-ML-vhdozdcujguho', 'creationTime': '2020-06-18T12:38:32.9520027+00:00', 'keyVault': '/subscriptions/220fc532-6091-423c-8ba0-66c2397d591b/resourcegroups/pz-iot-demo-20200618-1/providers/microsoft.keyvault/vaults/iot-vault-vhdozdcujguho', 'applicationInsights': '/subscriptions/220fc532-6091-423c-8ba0-66c2397d591b/resourcegroups/pz-iot-demo-20200618-1/providers/microsoft.insights/components/cosmos-db-iot-insights-vhdozdcujguho', 'identityPrincipalId': 'aaade69b-2856-40ca-9c0f-bd1643b4ae1e', 'identityTenantId': '72f988bf-86f1-41af-91ab-2d7cd011db47', 'identityType': 'Sys

## Retrieve the pre-trained model
A pre-trained model is available at the public URL stored in `pkl_url`. Run the following cell to download the model and register it in your Azure Machine Learning workspace.

In [13]:
import os
import urllib.request
from azureml.core import Model

print("Downloading the pre-trained model...")
os.makedirs(local_folder, exist_ok=True)

urllib.request.urlretrieve(pkl_url, local_path)

print("Download complete.")

print("Uploading and registering model...")
registered_model = Model.register(
    model_path=local_path, 
    model_name=model_name, 
    workspace=ws)

Downloading the pre-trained model...
Download complete.
Uploading and registering model...
Registering model batt-cycles-7
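
A download can succeed while still producing the wrong file, for example an HTML error page saved under the `.pkl` name. The following is a minimal sanity check, assuming the model was saved uncompressed with pickle protocol 2 or later, in which case the file starts with the byte `0x80`. A compressed file would start differently and the check would need adapting. The helper name `looks_like_pickle` is hypothetical, and the demonstration uses throwaway files rather than the real model.

```python
import os
import pickle
import tempfile

PICKLE_PROTO_OPCODE = b"\x80"  # first byte of a pickle written with protocol 2 or later


def looks_like_pickle(path):
    """Cheap sanity check: a non-empty file whose first byte is the PROTO opcode."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as handle:
        return handle.read(1) == PICKLE_PROTO_OPCODE


# Demonstration with throwaway files (not the real model).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pkl")
    bad = os.path.join(tmp, "bad.pkl")
    with open(good, "wb") as handle:
        pickle.dump({"weights": [1, 2, 3]}, handle)
    with open(bad, "wb") as handle:
        handle.write(b"<!DOCTYPE html><html>Not Found</html>")
    print(looks_like_pickle(good), looks_like_pickle(bad))
```

In the notebook, calling `looks_like_pickle(local_path)` before `Model.register` would catch a bad download before it is uploaded to the workspace.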

Run the following to retrieve the model from your Azure Machine Learning workspace, and inspect some of its properties.

In [15]:
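from azureml.core.model import Model
from azureml.train import automl

try:
    import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
except ImportError:
    from sklearn.externals import joblib  # fallback for older scikit-learn versions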

model_path = Model.get_model_path(model_name=model_name, _workspace=ws)
print("Model saved to ", model_path)
model = joblib.load(model_path)
print("Model loaded.")

Model saved to  azureml-models/batt-cycles-7/7/modelv3.pkl
Model loaded.

## Load the data from Cosmos DB to batch score it
Run the following cells to query Cosmos DB Analytical store, prepare the data using SQL queries and then surface the data as temporary views.

### Register Temp View
Now register the view required to create the dataset that will be used to make the predictions. Notice that you can now join data from multiple Cosmos DB containers.


In [18]:
# vehicle_metadata_df = spark.read.cosmos_olap('metadata').createOrReplaceTempView("metadata")

vehicle_metadata_df = spark.read\
    .format("cosmos.olap")\
    .option("spark.synapse.linkedService", synapse_cosmos_db_linked_service)\
    .option("spark.cosmos.container", cosmos_db_container_metadata)\
    .load()

In [32]:
print(vehicle_metadata_df.count())

vehicle_metadata_df.printSchema()

187136
root
 |-- _rid: string (nullable = true)
 |-- _ts: long (nullable = true)
 |-- id: string (nullable = true)
 |-- _etag: string (nullable = true)
 |-- partitionKey: string (nullable = true)
 |-- entityType: string (nullable = true)
 |-- vin: string (nullable = true)
 |-- lastServiceDate: string (nullable = true)
 |-- batteryAgeDays: long (nullable = true)
 |-- batteryRatedCycles: long (nullable = true)
 |-- lifetimeBatteryCyclesUsed: double (nullable = true)
 |-- averageDailyTripDuration: double (nullable = true)
 |-- batteryFailurePredicted: boolean (nullable = true)
 |-- stateVehicleRegistered: string (nullable = true)
 |-- customer: string (nullable = true)
 |-- description: integer (nullable = true)
 |-- status: string (nullable = true)
 |-- deliveryDueDate: string (nullable = true)
 |-- packages: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- timestamp: string (nullable = true)
 |-- tripId: string (nullable = true)
 |-- consignmentId: string (nu

In [21]:
vehicle_metadata_df.createOrReplaceTempView("metadata")

In [22]:
metadata = spark.sql("""
    SELECT * FROM metadata LIMIT 10
    """)

metadata.printSchema()

AnalysisException: 'java.lang.RuntimeException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD, https://synsavhdozdcujguho.dfs.core.windows.net/workspace/tmp/hive?upn=false&timeout=90;'

### Generate Scoring dataset
Now we are ready to use the previously created view to generate the final dataset.

In [24]:
trips_clean = spark.sql("""
    SELECT  vin, 
            to_utc_timestamp(tripEnded, \"yyyy-MM-dd'T'HH:mm:ss.SSSX'Z'\") as tripEnded, 
            to_utc_timestamp(tripStarted, \"yyyy-MM-dd'T'HH:mm:ss.SSSX'Z'\") as tripStarted, 
            ((unix_timestamp(to_utc_timestamp(tripEnded, \"yyyy-MM-dd'T'HH:mm:ss.SSSX'Z'\")) - 
                unix_timestamp(to_utc_timestamp(tripStarted, \"yyyy-MM-dd'T'HH:mm:ss.SSSX'Z'\")))/60.0) as tripDurationMinutes
    FROM metadata
    WHERE   entityType = 'Trip' 
            AND (tripStarted is not null AND tripStarted <> '0' AND tripStarted <> '') 
            AND (tripEnded is not null AND tripEnded <> '0' AND tripEnded <> '')
    """)

trips_clean.createOrReplaceTempView("trips_clean")
trips_clean.printSchema()

AnalysisException: 'java.lang.RuntimeException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD, https://synsavhdozdcujguho.dfs.core.windows.net/workspace/tmp/hive?upn=false&timeout=90;'
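
The `tripDurationMinutes` column in the query above is the difference between two Unix timestamps (whole seconds) divided by 60. The same arithmetic can be sketched in plain Python. This assumes the trip timestamps are ISO 8601 strings with milliseconds and a trailing `Z`; the sample values are made up, and the function names are illustrative.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # assumed input format


def to_unix_seconds(timestamp_text):
    """Parse a UTC timestamp string and return whole seconds since the epoch."""
    parsed = datetime.strptime(timestamp_text, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def trip_duration_minutes(trip_started, trip_ended):
    """Difference between the two timestamps in minutes, as in the SQL query."""
    return (to_unix_seconds(trip_ended) - to_unix_seconds(trip_started)) / 60.0


print(trip_duration_minutes("2020-06-18T12:00:00.000Z", "2020-06-18T12:45:30.000Z"))
```

Dividing by `60.0` rather than `60` keeps fractional minutes, which is why the query uses a decimal literal.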

In [25]:
vehicles_raw = spark.sql("""
    SELECT vin, batteryAgeDays, batteryRatedCycles, lifetimeBatteryCyclesUsed 
    FROM metadata 
    WHERE entityType ='Vehicle'
    """)

vehicles_raw.createOrReplaceTempView("vehicles_raw")
vehicles_raw.printSchema()

AnalysisException: 'java.lang.RuntimeException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD, https://synsavhdozdcujguho.dfs.core.windows.net/workspace/tmp/hive?upn=false&timeout=90;'

In [26]:
vehicles_batch = spark.sql("""
    SELECT  v.vin as vin, 
            to_date(t.tripEnded, 'yyyy-MM-dd') as tripEnded, 
            t.tripDurationMinutes, 
            v.batteryAgeDays, 
            v.batteryRatedCycles, 
            v.lifetimeBatteryCyclesUsed 
    FROM    vehicles_raw v 
    INNER JOIN trips_clean t 
        ON v.vin = t.vin
    """)

vehicles_batch.createOrReplaceTempView("vehicles_batch")
vehicles_batch.printSchema()

AnalysisException: 'java.lang.RuntimeException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD, https://synsavhdozdcujguho.dfs.core.windows.net/workspace/tmp/hive?upn=false&timeout=90;'

Run the following cells to convert the Spark DataFrame to a Pandas DataFrame for use with the pre-created model.

In [None]:
import pandas as pd

spark_df = spark.sql("""
    SELECT  cast(tripEnded as string) as date,
            batteryAgeDays as battery_Age_Days,
            tripDurationMinutes as daily_Trip_Duration,
            lifetimeBatteryCyclesUsed,
            batteryRatedCycles,
            vin
    FROM    vehicles_batch v
    """)
pd_df = spark_df.toPandas()
pd_df['date'] = pd.to_datetime(pd_df['date']) # Added to address Spark Date to Pandas date conversion

## Define the scoring logic
The following cell will apply the model and return a prediction for whether or not maintenance is required.

Run the following cell to define the helper method.

In [None]:
def predict_maintenance(row):
    # from azureml.train import automl
    from sklearn.linear_model import LinearRegression
    import pandas as pd
    import numpy as np
    from datetime import datetime
    predict_needs_service = 0
    
    startday = row["battery_Age_Days"]
    dailytripduration = row["daily_Trip_Duration"]
    current_cycles = row["lifetimeBatteryCyclesUsed"]
    rated_lifetime_cycles = row["batteryRatedCycles"]

    # Build one feature row per day for the next 30 days
    dayslist = range(startday, startday+30)
    pds_df = pd.DataFrame({'battery_Age_Days': dayslist, 'daily_Trip_Duration': dailytripduration})

    # `model` is the pre-trained model loaded earlier with joblib.load
    y_Pred = model.predict(np.array(pds_df))
    # Use the prediction for the last of the 30 days (index 29)
    total_cycles_next_30_days = y_Pred[[29,]][0][0]

    # Flag the vehicle if the projected cycle count exceeds the battery's rated cycles
    if current_cycles + total_cycles_next_30_days > rated_lifetime_cycles:
        predict_needs_service = 1

    return predict_needs_service
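
The decision rule in the helper above can be tried without Spark or the real model. The following is a minimal sketch using a hypothetical `StubModel` whose prediction formula is invented purely for illustration. Only the rule itself (flag the vehicle when current cycles plus the day-30 prediction exceed the rated cycles) comes from the helper.

```python
class StubModel:
    """Hypothetical stand-in for the pre-trained model.

    Predicts cumulative cycles for each forecast day as
    (days ahead) * (daily trip minutes) / 20. The formula is invented.
    """

    def predict(self, rows):
        return [[(offset + 1) * minutes / 20] for offset, (_, minutes) in enumerate(rows)]


def needs_service(model, battery_age_days, daily_trip_minutes, current_cycles, rated_cycles):
    """Return 1 if projected cycles after 30 days exceed the rated cycles, else 0."""
    rows = [(day, daily_trip_minutes) for day in range(battery_age_days, battery_age_days + 30)]
    predictions = model.predict(rows)
    cycles_next_30_days = predictions[29][0]  # prediction for the last forecast day
    return int(current_cycles + cycles_next_30_days > rated_cycles)


stub = StubModel()
print(needs_service(stub, 400, 60, 150, 200))  # 150 + 90 exceeds 200
print(needs_service(stub, 400, 60, 100, 200))  # 100 + 90 does not exceed 200
```

Testing the rule against a stub like this makes it easier to spot indexing mistakes, such as reading the wrong forecast day, before running the real model over every row.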

Calculate the predictions by running the following cell.

In [None]:
predictions = pd_df.apply(predict_maintenance, axis=1)

Now, run the following cell to examine the predictions by `VIN`.

In [None]:
import pandas as pd
batch_predictions_pdf = pd.DataFrame({"vin": pd_df["vin"], "serviceRequired":predictions})
batch_predictions_pdf

## Write the predictions back to Cosmos DB
Now you will save the previously created predictions DataFrame back to the `maintenance` collection in Cosmos DB.

Run the following cells to do so.

In [None]:
# Retrieve the connection string and key from the linked service
import re

from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate()
token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary

connection_string = token_library.getConnectionString(synapse_cosmos_db_linked_service)
matchObj = re.match( r'AccountEndpoint=(.*);Database=(.*);AccountKey="(.*)";', connection_string, re.M|re.I)
if matchObj is None:
    # Fail early with a clear message instead of an AttributeError on the next line
    raise ValueError("Unexpected connection string format for linked service " + synapse_cosmos_db_linked_service)
endpoint = matchObj.group(1)
masterkey = matchObj.group(3)
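
The parsing step can be tested on its own, without a Synapse session. This is a minimal sketch, assuming the connection string has the `AccountEndpoint=...;Database=...;AccountKey="...";` shape implied by the regular expression in the cell above. The sample string and the function name are made up for illustration.

```python
import re

# Named groups and negated character classes avoid greedy matches across fields.
CONNECTION_STRING_PATTERN = re.compile(
    r'AccountEndpoint=(?P<endpoint>[^;]+);'
    r'Database=(?P<database>[^;]+);'
    r'AccountKey="(?P<key>[^"]+)";',
    re.IGNORECASE,
)


def parse_cosmos_connection_string(connection_string):
    """Return (endpoint, database, key) or raise ValueError on an unexpected format."""
    match = CONNECTION_STRING_PATTERN.match(connection_string)
    if match is None:
        raise ValueError("Connection string is not in the expected format")
    return match.group("endpoint"), match.group("database"), match.group("key")


# Made-up sample; a real string comes from the linked service.
sample = (
    'AccountEndpoint=https://example.documents.azure.com:443/;'
    'Database=ContosoAuto;AccountKey="placeholder-key==";'
)
sample_endpoint, sample_database, sample_key = parse_cosmos_connection_string(sample)
print(sample_endpoint, sample_database)
```

Raising a `ValueError` on a failed match gives a clearer message than the `AttributeError` that results from calling `.group()` on `None`.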

In [None]:
maintReadConfig = {
    "Endpoint" : endpoint,
    "Masterkey" : masterkey,
    "Database" : cosmos_db_database,
    "Collection" : cosmos_db_container_maintenance
}

maint = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**maintReadConfig).load()
maint.createOrReplaceTempView("maintenance")

writeConfig = {
    "Endpoint" : endpoint,
    "Masterkey" : masterkey,
    "Database" : cosmos_db_database,
    "Collection" : cosmos_db_container_maintenance,
    "Upsert" : "false"
}

# Schema used by the maintenance collection
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, BooleanType
maintSchema = StructType([
  StructField("vin",StringType(),True),
  StructField("serviceRequired",IntegerType(),True),
  StructField("id",StringType(),True),
  StructField("_attachments",StringType(),True),
  StructField("_etag",StringType(),True),
  StructField("_rid",StringType(),True),
  StructField("_self",StringType(),True),
  StructField("_ts",IntegerType(),True),
])

In [None]:
# delete any existing maintenance predictions
from azure.cosmos import CosmosClient, PartitionKey, exceptions

client = CosmosClient(endpoint, credential=masterkey)
database = client.get_database_client(cosmos_db_database)
container = database.get_container_client(cosmos_db_container_maintenance)

for item in container.query_items(query='SELECT * FROM c',
                                  enable_cross_partition_query=True):
    print('Deleting Document Id: {0}'.format(item['id']))
    container.delete_item(item, partition_key=item['vin'])

In [None]:
# write the new prediction out to Cosmos DB
batch_predictions = spark.createDataFrame(batch_predictions_pdf)
batch_predictions.write.format("com.microsoft.azure.cosmosdb.spark").mode("overwrite").options(**writeConfig).save()