# Calculate Ratings from Implicit Activity
This notebook uses the implicit events captured in the `events` collection in Cosmos DB to estimate how a user would rate a given item, based on their actions. In other words, it converts a user's `buy`, `addToCart`, and `details` actions into a numeric score for the item. The resulting user-to-item ratings matrix is saved to the `ratings` collection in Cosmos DB.
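
As a rough illustration of the first half of that conversion, the sketch below counts event types per item for one user in plain Python. The field names `userId`, `contentId`, and `event` match those queried later in this notebook; the sample documents and the helper name `count_events_for_user` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample events; real documents come from the `events` collection.
sample_events = [
    {"userId": "u1", "contentId": "42", "event": "details"},
    {"userId": "u1", "contentId": "42", "event": "details"},
    {"userId": "u1", "contentId": "42", "event": "buy"},
    {"userId": "u1", "contentId": "7", "event": "addToCart"},
    {"userId": "u2", "contentId": "42", "event": "details"},
]

def count_events_for_user(events, user_id):
    """Count how often each event type occurred per item for one user."""
    counts = defaultdict(lambda: defaultdict(int))
    for e in events:
        if e["userId"] == user_id:
            counts[str(e["contentId"])][e["event"]] += 1
    return counts

u1_counts = count_events_for_user(sample_events, "u1")
print({item: dict(c) for item, c in u1_counts.items()})
```

The notebook itself performs this grouping with Spark SQL rather than a Python loop; the sketch only shows the shape of the intermediate data.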

Run the following cell to retrieve the shared configuration values that point to your instance of Cosmos DB.

In [2]:
%run "./Includes/Shared-Configuration"

In [3]:
readEventsConfig = {
"Endpoint" : cosmos_db_endpoint,
"Masterkey" : cosmos_db_master_key,
"Database" : cosmos_db_database,
"Collection" : "events",
"SamplingRatio" : "1.0",
"schema_samplesize" : "1000",
"query_pagesize" : "2147483647",
}

writeRatingsConfig = {
"Endpoint" : cosmos_db_endpoint,
"Masterkey" : cosmos_db_master_key,
"Database" : cosmos_db_database,
"Collection" : "ratings"
}

Whenever you write data back to Cosmos DB, you need to provide a schema for the DataFrame to apply when writing. Run the following cell to define this schema object.

In [5]:
# Schema used by the ratings collection
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
ratingsSchema = StructType([
  StructField("userId",StringType(),True),
  StructField("itemId",StringType(),True),
  StructField("rating",DoubleType(),True),
  StructField("ratingTimeStamp",StringType(),True),
  StructField("_attachments",StringType(),True),
  StructField("_etag",StringType(),True),
  StructField("_rid",StringType(),True),
  StructField("_self",StringType(),True),
  StructField("_ts",IntegerType(),True),
])

In addition to the Spark Connector for Cosmos DB, this notebook also uses the Azure Cosmos DB Python SDK. Run the following cell to install it.

In [7]:
# Install the Cosmos DB Python SDK
dbutils.library.installPyPI('azure-cosmos', version='3.1.1')

## Define the ratings calculation logic

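The following cell defines the logic for calculating the rating a user would give each item, based on that user's historical activity for the item. Each `buy`, `addToCart`, and `details` event contributes a weighted amount to the item's score, and the scores are then scaled so that the user's highest-scoring item receives a rating of 10.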
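
A minimal, Spark-free sketch of this calculation may help make the arithmetic concrete. The weights are the ones used in the cell below; the function name `implicit_ratings`, the sample counts, and the choice to return an empty result when a user has no weighted events are assumptions made for illustration.

```python
# Weights as defined in the notebook cell: buy, addToCart, details.
W_BUY, W_CART, W_DETAILS = 100, 50, 15

def implicit_ratings(counts_by_item):
    """Weight per-item event counts and scale so the user's top item scores 10."""
    raw = {
        item: W_BUY * c.get("buy", 0)
        + W_CART * c.get("addToCart", 0)
        + W_DETAILS * c.get("details", 0)
        for item, c in counts_by_item.items()
    }
    top = max(raw.values(), default=0)
    if top == 0:
        # Assumption: a user with no weighted events gets no ratings at all.
        return {}
    return {item: 10.0 * score / top for item, score in raw.items()}

# Hypothetical event counts for one user.
example_counts = {
    "42": {"buy": 1, "details": 2},  # raw score 100 + 2 * 15 = 130
    "7": {"addToCart": 1},           # raw score 50
}
ratings = implicit_ratings(example_counts)
print(ratings)
```

Here item `42` has the highest raw score and so receives 10.0, while item `7` receives 10 × 50 / 130, or roughly 3.85. Because the scaling is per user, a rating of 10 means "this user's most-engaged item", not an absolute level of activity.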

Run the following cell to define the logic.

In [9]:
import os
import datetime
from datetime import date, timedelta
from collections import defaultdict

# Weights applied to each event type when scoring an item
w1 = 100  # buy
w2 = 50   # addToCart
w3 = 15   # details

def truncate_collection(config):
    # delete any existing ratings
    from azure.cosmos import cosmos_client
    database_link = 'dbs/' + config['Database']
    collection_link = database_link + '/colls/' + config['Collection']
    client = cosmos_client.CosmosClient(url_connection=config['Endpoint'], auth={'masterKey': config['Masterkey']})

    documentlist = list(client.ReadItems(collection_link, {'maxItemCount':10}))

    print('Found {0} documents'.format(len(documentlist)))

    options = {}
    options['enableCrossPartitionQuery'] = True
    options['maxItemCount'] = 5

    for doc in documentlist:
        print('Deleting Document Id: {0}'.format(doc['id']))
        docLink = collection_link + '/docs/' + doc['id']
        options['partitionKey'] = doc['itemId']
        client.DeleteItem(docLink, options)


def query_log_for_users():
    return spark.sql("SELECT DISTINCT userId FROM events")


def query_aggregated_log_data_for_user(userId):

    # Count each event type per item for the given user
    user_events_summary = spark.sql("SELECT userId, contentId as itemId, event, count(event) as count FROM events WHERE userId = '{0}' GROUP BY userId, itemId, event".format(userId))

    return user_events_summary


def calculate_implicit_ratings_for_user(userId):

    print("calculate_implicit_ratings_for_user : entered")

    data = query_aggregated_log_data_for_user(userId)
  
    #print("calculate_implicit_ratings_for_user : data = {0}".format(data))

    agg_data = dict()
    max_rating = 0

    for row in data.collect():
        itemId = str(row['itemId'])
        if itemId not in agg_data:
            agg_data[itemId] = defaultdict(int)

        agg_data[itemId][row['event']] = row['count']

    #print("calculate_implicit_ratings_for_user : agg_data = {0}".format(agg_data))
 
    ratings = dict()
    for k, v in agg_data.items():

        rating = w1 * v['buy'] + w2 * v['addToCart'] + w3 * v['details']
        max_rating = max(max_rating, rating)

        ratings[k] = rating

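    #print("calculate_implicit_ratings_for_user : userId = {0}, max_rating = {1}".format(userId, max_rating))

    # Scale ratings so the top item scores 10; skip when max_rating is 0 to avoid dividing by zero
    if max_rating > 0:
        for itemId in ratings.keys():
            ratings[itemId] = 10.0 * float(ratings[itemId]) / max_rating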

    #print("calculate_implicit_ratings_for_user : ratings = {0}".format(ratings))

    return ratings


def save_ratings(ratings, userId, config):
    print("saving ratings for {}".format(userId))
    
    from pyspark.sql import Row
    newRows = []
    
    for itemId, rating in ratings.items():     
        if rating > 0:
            newRows.append( 
              #userId:string, itemId:string, rating:double, ratingTimeStamp:string, _attachments:string, _etag:string, _rid:string, _self:string, _ts:integer
              Row(userId, itemId, rating, str(datetime.datetime.utcnow()), None,None,None,None,None)
            )

    parallelizeRows = spark.sparkContext.parallelize(newRows)
    new_documents = spark.createDataFrame(parallelizeRows, ratingsSchema)
    new_documents.write.format("com.microsoft.azure.cosmosdb.spark").mode("overwrite").options(**config).save()


def calculate_ratings(writeRatingsConfig):
    rows = query_log_for_users()
    #display(rows)

    for row in rows.collect():
        userId = row['userId']
        print(userId)
        ratings = calculate_implicit_ratings_for_user(userId)
        save_ratings(ratings, userId, writeRatingsConfig)
        





Run the following cell to calculate the ratings and store them in the `ratings` collection within Cosmos DB.

In [11]:
print("Deleting existing implicit ratings...")
truncate_collection(writeRatingsConfig)

# Connect via Spark connector to create Spark DataFrame
events_df = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**readEventsConfig).load()
events_df.createOrReplaceTempView("events")

print("Calculating implicit ratings...")
calculate_ratings(writeRatingsConfig)

ratings_df = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**writeRatingsConfig).load()
ratings_df.createOrReplaceTempView("ratings")

Run the following cell to view summary statistics for the rating values that were generated. Notice that the values range from just above 0 up to 10.

In [13]:
display(ratings_df.describe("rating"))
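
For readers who want to see what these summary statistics represent, here is a plain-Python sketch that computes the same five figures (`count`, `mean`, `stddev`, `min`, `max`) over a small list. The sample values are hypothetical, and the sketch assumes that Spark's `describe` reports the sample standard deviation.

```python
import statistics

# Hypothetical rating values standing in for the `rating` column.
sample_ratings = [10.0, 3.85, 1.15, 10.0, 5.0]

summary = {
    "count": len(sample_ratings),
    "mean": statistics.mean(sample_ratings),
    "stddev": statistics.stdev(sample_ratings),  # sample standard deviation
    "min": min(sample_ratings),
    "max": max(sample_ratings),
}

for name, value in summary.items():
    print(name, value)
```

Since only ratings greater than 0 are saved and each user's top item is scaled to 10, the `min` should be a small positive number and the `max` should be exactly 10.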

Run the following cell to query the temporary view representing the ratings data using Spark SQL.

In [15]:
%sql
SELECT itemId, userId, rating FROM ratings

You are finished with this notebook and can return to the lab guide.