# Calculate Association Rules
This notebook examines the `events` data to find items that tend to be purchased together, and creates a matrix that reflects the strength of each relationship. This matrix is stored in the `associations` collection.

Run the following cell to retrieve the shared configuration values that point to your instance of Cosmos DB.

In [2]:
%run "./Includes/Shared-Configuration"

Run the following cell to create the read and write configurations to use when interacting with Cosmos DB using the Spark Connector.

In [4]:
readEventsConfig = {
"Endpoint" : cosmos_db_endpoint,
"Masterkey" : cosmos_db_master_key,
"Database" : cosmos_db_database,
"Collection" : "events",
"SamplingRatio" : "1.0",
"schema_samplesize" : "1000",
"query_pagesize" : "2147483647",
}

writeAssociationsConfig = {
"Endpoint" : cosmos_db_endpoint,
"Masterkey" : cosmos_db_master_key,
"Database" : cosmos_db_database,
"Collection" : "associations"
}

Whenever you write data back to Cosmos DB, you need to provide a schema for the DataFrame to apply when writing. Run the following cell to define this schema object.

In [6]:
# Schema used by the associations collection
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
associationsSchema = StructType([
  StructField("created",StringType(),True),
  StructField("source",StringType(),True),
  StructField("target",StringType(),True),
  StructField("support",DoubleType(),True),
  StructField("confidence",DoubleType(),True),
  StructField("_attachments",StringType(),True),
  StructField("_etag",StringType(),True),
  StructField("_rid",StringType(),True),
  StructField("_self",StringType(),True),
  StructField("_ts",IntegerType(),True),
])

## Execute the association rules calculation logic
The goal of this algorithm is to compute two metrics that indicate the strength of a relationship between a source item and a target item based on event history, and then save that matrix to the `associations` collection in Cosmos DB.

The algorithm begins by grouping events with a `buy` action into transactions, keyed by `sessionId`. Each transaction holds the set of items bought together.

For example, a transaction with two items would look like:
`'404973': ['5512872', '4172430']` where `404973` is the sessionId that is used as the transactionId, and the array contains the ids of the items bought ('5512872' and '4172430').
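
The grouping step can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: the `buy_events` list is made-up sample data (the second session, `404974`, is hypothetical), and only the `sessionId` and `contentId` field names come from the `events` collection.

```python
from collections import defaultdict

# Hypothetical buy events; only the field names mirror the `events` collection.
buy_events = [
    {"sessionId": "404973", "contentId": "5512872"},
    {"sessionId": "404973", "contentId": "4172430"},
    {"sessionId": "404974", "contentId": "5512872"},
]

# Group the purchased item ids by session; each session becomes one transaction.
transactions = defaultdict(list)
for event in buy_events:
    transactions[event["sessionId"]].append(event["contentId"])

print(dict(transactions))
```

Using a `defaultdict(list)` avoids checking whether the session has been seen before appending to it.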

Next, the algorithm uses these transactions to count the number of times each individual item appears across all transactions. The output of this looks like:

`{frozenset({'3521164'}): 24, frozenset({'4846340'}): 21, frozenset({'4034354'}): 27}`

The above shows how many buy events include each item; for example, item '3521164' was bought 24 times. This set is then filtered to keep only items with more than a minimum number of buys. The threshold is 1% of the number of transactions, which works out to around 24 buys for this data set.
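
As a rough sketch of the counting and filtering step, assuming a tiny made-up set of transactions and a much higher threshold ratio than the 0.01 used in this notebook (so that the filter has a visible effect on four transactions):

```python
from collections import Counter

# Hypothetical transactions: session id -> purchased item ids.
transactions = {
    "s1": ["A", "B"],
    "s2": ["A", "C"],
    "s3": ["A", "B"],
    "s4": ["B"],
}
min_sup = 0.5  # assumption for this toy data; the notebook uses 0.01

N = len(transactions)

# Count how often each item appears, keyed by a one-element frozenset.
item_counts = Counter(
    frozenset({item}) for items in transactions.values() for item in items
)

# Keep only the items bought more than min_sup * N times.
threshold = min_sup * N
one_itemsets = {key: count for key, count in item_counts.items() if count > threshold}

print(one_itemsets)
```

Here items `A` and `B` are each bought 3 times and pass the threshold of 2, while `C` is bought once and is dropped.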

Then the algorithm computes every pairwise combination of the distinct items in each transaction. For example:

`items: ['1985949', '4624424', '4048272'], combinations[('1985949', '4624424'), ('1985949', '4048272'), ('4624424', '4048272')]`

In the above, a transaction included the items '1985949', '4624424', '4048272'. This would result in the combinations ('1985949', '4624424'), ('1985949', '4048272') and ('4624424', '4048272'). The algorithm filters out combinations where the individual items did not have enough purchases on their own. Then it tallies how many times the remaining pairs of items were purchased together. For example:

`frozenset({'4425200', '1289401'}): 2, frozenset({'4052882', '1289401'}): 1`

In the above intermediate result, items '4425200' and '1289401' were bought together 2 times, whereas items '4052882' and '1289401' were only bought together once.
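
The pair tallying can be sketched as follows. The transactions and the `supported` set are hypothetical stand-ins for the real data and for the items that passed the single-item filter.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical transactions: session id -> purchased item ids.
transactions = {
    "s1": ["A", "B", "C"],
    "s2": ["A", "B"],
    "s3": ["B", "C", "C"],
}
supported = {"A", "B"}  # assumption: items that passed the single-item filter

pair_counts = defaultdict(int)
for items in transactions.values():
    unique_items = sorted(set(items))  # drop repeat purchases within a session
    for pair in combinations(unique_items, 2):
        # Only count pairs where both items have enough support on their own.
        if all(item in supported for item in pair):
            pair_counts[frozenset(pair)] += 1

print(dict(pair_counts))
```

Keying the tally by `frozenset` means the pair `('A', 'B')` and the pair `('B', 'A')` share one counter, so the order in which items appear in a transaction does not matter.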

Finally, the algorithm loops over each item in the one-item set, that is, each item that met the minimum number of buys. It treats that item as the source item and the other item in each two-item pair containing it as the target item, then computes the support and confidence (the strength of the relationship) for the pair. This results in rules that look like:

`(datetime.datetime(2019, 9, 21, 16, 55, 3, 33313), '4630562', '1211837', 0.04, 0.0004065040650406504)`

The above indicates that source item '4630562' has a relationship with target item '1211837'. The tuple lists the confidence before the support: the confidence (the number of times the pair is bought together divided by the number of times the source item is bought) is 0.04, and the support (the fraction of all transactions in which the pair is bought together) is 0.0004065040650406504.
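
To make the two metrics concrete, here is a small worked sketch. The function name `support_and_confidence` and the counts passed to it are hypothetical; the two formulas follow the calculation logic defined in the next cell (pair count divided by the number of transactions, and pair count divided by the source item's count).

```python
def support_and_confidence(pair_count, source_count, n_transactions):
    """Return (support, confidence) for a source -> target rule."""
    # Support: share of all transactions that contain the pair.
    support = pair_count / n_transactions
    # Confidence: share of the source item's buys that also include the target.
    confidence = pair_count / source_count
    return support, confidence


# Hypothetical counts: the pair was bought together 3 times, the source item
# was bought 30 times, and there are 3000 transactions in total.
support, confidence = support_and_confidence(
    pair_count=3, source_count=30, n_transactions=3000
)
print("support:", support, "confidence:", confidence)
```

Because an item is rarely bought more often than there are transactions, the confidence of a rule is normally at least as large as its support.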

Run the following cell to define calculation logic.

In [8]:
import os
from collections import defaultdict
from itertools import combinations
from datetime import datetime


def build_association_rules(writeConfig):
    data = retrieve_buy_events()
    data = generate_transactions(data)

    data = calculate_support_confidence(data, 0.01)
    
    save_rules(data, writeConfig)


def retrieve_buy_events():
    print("retrieving buy events")
    data = spark.sql("SELECT * FROM events WHERE event ='buy'")
    return data


def generate_transactions(data):
    print("generating transactions")
    transactions = dict()

    for row in data.collect():
        transaction_id = row["sessionId"]
        if transaction_id not in transactions:
            transactions[transaction_id] = []
        transactions[transaction_id].append(row["contentId"])

    # print("transactions: ", transactions)
    return transactions

def calculate_support_confidence(transactions, min_sup=0.01):
    print("calculating support confidence")
    N = len(transactions)
    
    one_itemsets = calculate_itemsets_one(transactions, min_sup)
    # print("one_itemsets",one_itemsets)
    
    two_itemsets = calculate_itemsets_two(transactions, one_itemsets)
    # print("two_itemsets", two_itemsets)
    
    rules = calculate_association_rules(one_itemsets, two_itemsets, N)
    print("rules: ", rules[0:2])
    return sorted(rules)


def calculate_itemsets_one(transactions, min_sup=0.01):

    N = len(transactions)

    temp = defaultdict(int)
    one_itemsets = dict()

    for key, items in transactions.items():
        for item in items:
            # using a frozenset enables the set to be used as a key in the dictionary
            inx = frozenset({item})
            temp[inx] += 1

    # Keep only items with enough support (more than min_sup * N buys).
    print("Removing items with {0} or fewer buys".format(min_sup * N))
    for key, count in temp.items():
        if count > min_sup * N:
            one_itemsets[key] = count

    return one_itemsets


def calculate_itemsets_two(transactions, one_itemsets):
    two_itemsets = defaultdict(int)

    for key, items in transactions.items():
        items = list(set(items))  # remove duplicates

        # Count every pair of distinct items in the transaction whose members both have
        # enough support. combinations() yields nothing when fewer than two items remain.
        # print("items: {0}, combinations{1}".format(items, list(combinations(items, 2))))
        for pair in combinations(items, 2):
            if has_support(pair, one_itemsets):
                two_itemsets[frozenset(pair)] += 1
    return two_itemsets


def calculate_association_rules(one_itemsets, two_itemsets, N):
    timestamp = datetime.now()

    rules = []
    for source, source_freq in one_itemsets.items():
        for key, group_freq in two_itemsets.items():
            if source.issubset(key):
                target = key.difference(source)                
                support = float(group_freq) / N
                confidence = float(group_freq) / source_freq
                #print("group_freq:",group_freq,"N:",N,"source_freq:",source_freq, "support:",support, "confidence", confidence)
                rules.append((timestamp, next(iter(source)), next(iter(target)),
                              confidence, support))
    return rules


def has_support(perm, one_itemsets):
    return frozenset({perm[0]}) in one_itemsets and \
           frozenset({perm[1]}) in one_itemsets


def save_rules(rules, config):
    from pyspark.sql import Row

    print("saving rules...")
    newRows = []
    for rule in rules:
        # Each rule tuple is (created, source, target, confidence, support), but the schema
        # expects support before confidence, so rule[4] is passed ahead of rule[3].
        # The timestamp is converted to a string to match the StringType `created` field.
        newRows.append(
            #created:string, source:string, target:string, support:double, confidence:double, _attachments:string, _etag:string, _rid:string, _self:string, _ts:integer
            Row(str(rule[0]), rule[1], rule[2], rule[4], rule[3], None, None, None, None, None)
        )
    parallelizeRows = spark.sparkContext.parallelize(newRows)
    new_documents = spark.createDataFrame(parallelizeRows, associationsSchema)
    new_documents.createOrReplaceTempView("newdocs")
    new_documents.write.format("com.microsoft.azure.cosmosdb.spark").mode("overwrite").options(**config).save()
    print("Associations saved")

def truncate_collection(config, partitionKey):
    # Delete every existing document in the target collection
    from azure.cosmos import cosmos_client
    database_link = 'dbs/' + config['Database']
    collection_link = database_link + '/colls/' + config['Collection']
    client = cosmos_client.CosmosClient(url_connection=config['Endpoint'], auth={'masterKey': config['Masterkey']})

    documentlist = list(client.ReadItems(collection_link, {'maxItemCount':10}))

    print('Found {0} documents'.format(len(documentlist)))

    options = {}
    options['enableCrossPartitionQuery'] = True
    options['maxItemCount'] = 5

    for doc in documentlist:
        print('Deleting Document Id: {0}'.format(doc['id']))
        docLink = collection_link + '/docs/' + doc['id']
        options['partitionKey'] = doc[partitionKey]
        client.DeleteItem(docLink, options)

Run the following cell to calculate the associations and save them to Cosmos DB. The association rules calculated will be used later in the website to drive the online (realtime) calculation of item recommendations.

In [10]:
# import the Cosmos DB Python SDK
dbutils.library.installPyPI('azure-cosmos', version='3.1.1')

# Delete any existing association rules before recalculating them
truncate_collection(writeAssociationsConfig, "source")

# Connect via Spark connector to create Spark DataFrame
events_df = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**readEventsConfig).load()
events_df.createOrReplaceTempView("events")

print("Calculating association rules...")
build_association_rules(writeAssociationsConfig)

You are finished with this notebook and can return to the lab guide.