# Calculate Item to Item Similarity
This notebook calculates the similarity between items and stores the result in the `similarity` collection of Cosmos DB.

Run the following cell to retrieve the shared configuration values that point to your instance of Cosmos DB.

In [2]:
%run "./Includes/Shared-Configuration"

Run the following cell to create the read and write configurations to use when interacting with Cosmos DB using the Spark Connector.

In [4]:
readRatingsConfig = {
"Endpoint" : cosmos_db_endpoint,
"Masterkey" : cosmos_db_master_key,
"Database" : cosmos_db_database,
"Collection" : "ratings",
"SamplingRatio" : "1.0",
"schema_samplesize" : "1000",
"query_pagesize" : "2147483647",
}

writeSimilarityConfig = {
"Endpoint" : cosmos_db_endpoint,
"Masterkey" : cosmos_db_master_key,
"Database" : cosmos_db_database,
"Collection" : "similarity"
}

Whenever you write data back to Cosmos DB, you need to provide a schema for the DataFrame to apply when writing. Run the following cell to define this schema object.

In [6]:
# Schema used by the similarity collection
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
similaritySchema = StructType([
  StructField("sourceItemId",StringType(),True),
  StructField("targetItemId",StringType(),True),
  StructField("similarity",DoubleType(),True),
  StructField("_attachments",StringType(),True),
  StructField("_etag",StringType(),True),
  StructField("_rid",StringType(),True),
  StructField("_self",StringType(),True),
  StructField("_ts",IntegerType(),True),
])

## Define the item similarity calculation logic

The logic below uses the previously created user-to-item ratings to calculate a score indicating the similarity between a source item and a target item.

The process begins by loading the ratings matrix and, for each user-to-item rating, calculating a normalized rating that adjusts for the user's bias.
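As a minimal sketch of this normalization step, the example below subtracts each user's mean rating and divides by that user's rating range. The ratings table and the helper name `normalize_user` are made up for illustration, and the sketch uses a plain mean, which is a simplification of the `normalize` function defined later in this notebook.

```python
import pandas as pd

# Hypothetical ratings, for illustration only
ratings = pd.DataFrame({
    "userId": ["u1", "u1", "u1", "u2", "u2"],
    "itemId": ["a", "b", "c", "a", "b"],
    "rating": [1.0, 3.0, 5.0, 4.0, 4.0],
})

def normalize_user(x):
    # A user with a single rating, or with identical ratings everywhere,
    # expresses no relative preference, so every value becomes zero
    if len(x) == 1 or x.max() == x.min():
        return x * 0.0
    # Center on the user's mean and scale by the user's rating range
    return (x - x.mean()) / (x.max() - x.min())

ratings["avg"] = ratings.groupby("userId")["rating"].transform(normalize_user)
print(ratings)
```

In this toy data, user `u1` uses the whole scale while user `u2` gives every item the same rating, so only `u1` contributes non-zero normalized values.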

An overlap matrix is calculated that identifies, for any pair of items, how many users rated both items. First, the normalized ratings matrix is converted to a boolean matrix: if a user has a rating for an item (regardless of its value), the cell has a value of 1; otherwise it is zero. Then the dot product of this boolean matrix with its transpose is calculated. This yields a simpler matrix in which each cell contains the number of users who rated both items. Cells for item pairs without any overlap have a value of zero.
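The overlap calculation can be sketched on a small item-by-user matrix. The values below are hypothetical, and a zero stands for "not rated" in this toy example.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical item-by-user matrix of normalized ratings (3 items, 4 users)
normalized = coo_matrix(np.array([
    [0.5, -0.5, 0.0,  0.2],
    [0.3,  0.0, 0.0, -0.1],
    [0.0,  0.4, 0.1,  0.0],
]))

# Convert to 0/1 indicators: 1 where a rating exists, 0 otherwise
rated = normalized.astype(bool).astype(int)

# Multiplying by the transpose counts, for every pair of items,
# the users who rated both
overlap = rated.dot(rated.transpose())

print(overlap.toarray())
```

The diagonal of the result holds the number of users who rated each item, and the off-diagonal cells hold the pairwise overlaps.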

Separately, the cosine similarity of the normalized ratings matrix is computed. 
It's easiest to understand the cosine similarity calculation as being done between an item `i` and another item `j`. 
The cosine similarity is a ratio:
- The numerator is computed as the sum, over all users who have provided ratings, of the normalized rating of item `i` multiplied by the normalized rating of item `j`.
- The denominator is computed as the square root of the sum of the squares of the normalized ratings of item `i`, multiplied by the square root of the sum of the squares of the normalized ratings of item `j`.
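Written as a formula, the ratio described above is shown below. The notation is introduced here for illustration only: $r_{u,i}$ denotes the normalized rating that user $u$ gave item $i$, taken as zero when the user did not rate the item.

$$
\mathrm{sim}(i, j) = \frac{\sum_{u} r_{u,i} \, r_{u,j}}{\sqrt{\sum_{u} r_{u,i}^{2}} \; \sqrt{\sum_{u} r_{u,j}^{2}}}
$$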

In Python, the logic uses the `cosine_similarity` function from scikit-learn to compute the similarity between items, passing it the normalized ratings matrix with one row per item and one column per user.
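The following sketch calls `cosine_similarity` on a small, made-up item-by-user matrix and checks one entry against the ratio computed by hand. The matrix values are assumptions chosen for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item-by-user matrix of normalized ratings
item_user = np.array([
    [0.5, -0.5,  0.0],
    [0.5,  0.0, -0.5],
    [0.0,  0.5,  0.5],
])

# Each row is one item, so the result is an item-by-item matrix
sim = cosine_similarity(csr_matrix(item_user), dense_output=False)

# Manual computation of the same ratio for items 0 and 1
i, j = item_user[0], item_user[1]
manual = i.dot(j) / (np.sqrt((i ** 2).sum()) * np.sqrt((j ** 2).sum()))

print(sim[0, 1], manual)
```

Passing `dense_output=False` keeps the result sparse, which matters when the number of items is large.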

The result is then filtered so that only item pairs whose similarity score exceeds the configured minimum similarity, and whose count in the overlap matrix exceeds the configured minimum overlap, are kept.
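A minimal sketch of this filtering, assuming small hypothetical similarity and overlap matrices and illustrative thresholds, multiplies the similarity matrix element-wise by boolean masks:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical similarity and overlap matrices for three items
sim = csr_matrix(np.array([
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.4],
    [0.1, 0.4, 1.0],
]))
overlap = csr_matrix(np.array([
    [5, 4, 4],
    [4, 5, 1],
    [4, 1, 5],
]))

min_sim, min_overlap = 0.2, 3  # illustrative thresholds

# Element-wise multiplication by a boolean mask zeroes out rejected pairs
filtered = sim.multiply(sim > min_sim).multiply(overlap > min_overlap)

print(filtered.toarray())
```

Here the pair of items 0 and 2 is dropped because its similarity is too low, and the pair of items 1 and 2 is dropped because too few users rated both.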

Just before saving, any resulting similarities with scores less than the configured minimum similarity are removed, so that weaker similarities are not recommended.
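One way to express this final pass, shown here only as a sketch with a hypothetical matrix and item lookup, is to walk the stored entries of the sparse matrix and keep the pairs that are not self-pairs and that meet the minimum similarity:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical filtered similarity matrix and item lookup
sim = coo_matrix(np.array([
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]))
index = {0: "item-a", 1: "item-b", 2: "item-c"}
min_sim = 0.2  # illustrative threshold

# Walk the non-zero entries, skipping self-pairs and weak similarities
rows = [
    (index[x], index[y], float(v))
    for x, y, v in zip(sim.row, sim.col, sim.data)
    if x != y and v >= min_sim
]

print(rows)
```

Each remaining tuple corresponds to one document with a source item, a target item and a similarity score.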

Run the following cell to define the logic.

In [8]:
import os
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import coo_matrix
from datetime import datetime

class ItemSimilarityMatrixBuilder(object):

    def __init__(self, min_overlap=15, min_sim=0.2):
        self.min_overlap = min_overlap
        self.min_sim = min_sim


    def build(self, ratings, config):

        print("Calculating similarities ... using {} ratings".format(len(ratings)))
        start_time = datetime.now()

        print("Normalizing ratings matrix")
        ratings['rating'] = ratings['rating'].astype(float)
        ratings['avg'] = ratings.groupby('userId')['rating'].transform(lambda x: normalize(x))

        ratings['avg'] = ratings['avg'].astype(float)
        ratings['userId'] = ratings['userId'].astype('category')
        ratings['itemId'] = ratings['itemId'].astype('category')

        #print("build : ratings for user '400004' = {0}".format( ratings[ratings['userId'] == '400004'] ))

        coo = coo_matrix((ratings['avg'].astype(float),
                          (ratings['itemId'].cat.codes.copy(),
                           ratings['userId'].cat.codes.copy())))

        print("Calculating overlaps between the items")
        overlap_matrix = coo.astype(bool).astype(int).dot(coo.transpose().astype(bool).astype(int))

        number_of_overlaps = (overlap_matrix > self.min_overlap).count_nonzero()
        print("Overlap matrix leaves {} out of {} item pairs with more than {} overlapping ratings".format(
            number_of_overlaps, overlap_matrix.count_nonzero(), self.min_overlap))

        print("Rating matrix (size {}x{}) finished, in {} seconds".format(coo.shape[0],
                                                                                 coo.shape[1],
                                                                                 datetime.now() - start_time))

        sparsity_level = 1 - (ratings.shape[0] / (coo.shape[0] * coo.shape[1]))
        print("Sparsity level is {}".format(sparsity_level))

        start_time = datetime.now()
        cor = cosine_similarity(coo, dense_output=False)
        cor = cor.multiply(cor > self.min_sim)
        cor = cor.multiply(overlap_matrix > self.min_overlap)

        items = dict(enumerate(ratings['itemId'].cat.categories))
        print('Correlation is finished, done in {} seconds'.format(datetime.now() - start_time))

        self.save_similarities(cor, items, config)

        return cor, items

    def save_similarities(self, sm, index, config):
        from pyspark.sql import Row

        created=datetime.utcnow()
        newRows = []
        no_saved = 0

        coo = coo_matrix(sm)
        csr = coo.tocsr()

        #print(f'{coo.count_nonzero()} similarities to save')
        xs, ys = coo.nonzero()
        for x, y in zip(xs, ys):

            # skip the similarity of an item with itself
            if x == y:
                continue

            sim = float(csr[x, y])

            # skip similarities weaker than the configured minimum
            if sim < self.min_sim:
                continue

            newRows.append(
                #sourceItemId:string, targetItemId:string, similarity:double, _attachments:string, _etag:string, _rid:string, _self:string, _ts:integer
                Row(index[x], index[y], float(sim), None,None,None,None,None)
            )

        parallelizeRows = spark.sparkContext.parallelize(newRows)
        new_documents = spark.createDataFrame(parallelizeRows, similaritySchema)
        new_documents.write.format("com.microsoft.azure.cosmosdb.spark").mode("overwrite").options(**config).save()
        print("Similarities saved")


def calculateItemSimilarity(writeSimilarityConfig):
  
    print("Truncating similarity collection")
    truncate_collection(writeSimilarityConfig, "sourceItemId")
  
    print("Calculation of item similarity")

    all_ratings = load_all_ratings()

    ItemSimilarityMatrixBuilder(min_overlap=3, min_sim=0.0).build(all_ratings, writeSimilarityConfig)

def normalize(x):
    x = x.astype(float)
    x_sum = x.sum()
    x_num = x.astype(bool).sum()
    x_mean = x_sum / x_num

    if x_num == 1 or x.std() == 0:
        return 0.0
    return (x - x_mean) / (x.max() - x.min())


def truncate_collection(config, partitionKey):
    # delete any existing documents in the target collection
    from azure.cosmos import cosmos_client
    database_link = 'dbs/' + config['Database']
    collection_link = database_link + '/colls/' + config['Collection']
    client = cosmos_client.CosmosClient(url_connection=config['Endpoint'], auth={'masterKey': config['Masterkey']})

    documentlist = list(client.ReadItems(collection_link, {'maxItemCount':10}))

    print('Found {0} documents'.format(len(documentlist)))

    options = {}
    options['enableCrossPartitionQuery'] = True
    options['maxItemCount'] = 5

    for doc in documentlist:
        print('Deleting Document Id: {0}'.format(doc['id']))
        docLink = collection_link + '/docs/' + doc['id']
        options['partitionKey'] = doc[partitionKey]
        client.DeleteItem(docLink, options)
        
  
def load_all_ratings(min_ratings=3):
    print("load_all_ratings : entered")

    ratings_df = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**readRatingsConfig).load()
    ratings_df.createOrReplaceTempView("ratings")
    
    ratings = ratings_df.toPandas()

    # create a dataframe that, for each user, provides the number of items rated
    
    #user_count = ratings[['user_id', 'movie_id']].groupby('user_id').count()
    user_count = spark.sql("SELECT userId, count(itemId) numReviews FROM ratings GROUP BY userId")
    user_count.createOrReplaceTempView("user_count")
    
    #print("load_all_ratings : user_count = {0}".format(ratings[['userId', 'itemId']]))

    # select just the user ids that have enough ratings 
    user_ids = [row['userId'] for row in spark.sql("SELECT userId FROM user_count where numReviews > {0}".format(min_ratings)).collect()] 
    
    #print("load_all_ratings : user_ids ={0}".format(user_ids))

    # selects just the ratings from those users having enough ratings
    ratings = ratings[ratings['userId'].isin(user_ids)]
    ratings['rating'] = ratings['rating'].astype(float)

    #print("load_all_ratings : ratings ={0}".format(ratings))

    return ratings




In addition to the Spark Connector for Cosmos DB, this notebook also uses the Azure Cosmos DB Python SDK. Run the following cell to install it.

In [10]:
# install the Cosmos DB Python SDK
dbutils.library.installPyPI('azure-cosmos', version='3.1.1')

Execute the similarity calculation by running the following cell.

In [12]:
calculateItemSimilarity(writeSimilarityConfig)

You are finished with this notebook and can return to the lab guide.