# Generate Simulated User Activity
This notebook simulates browsing and purchasing activity for six users with different preferences and saves the results to the `events` collection in Cosmos DB.

Run the following cell to retrieve the shared configuration values that point to your instance of Cosmos DB.

In [2]:
%run "./Includes/Shared-Configuration"

Run the following cell to create the read and write configurations to use when interacting with Cosmos DB using the Spark Connector.

In [4]:
readConfig = {
  "Endpoint" : cosmos_db_endpoint,
  "Masterkey" : cosmos_db_master_key,
  "Database" : cosmos_db_database,
  "Collection" : "events",
  "SamplingRatio" : "1.0",
  "schema_samplesize" : "1000",
  "query_pagesize" : "2147483647",
}

writeConfig = {
  "Endpoint" : cosmos_db_endpoint,
  "Masterkey" : cosmos_db_master_key,
  "Database" : cosmos_db_database,
  "Collection" : "events",
  "Upsert" : "false"
}

Run the following cell to query the `events` collection from Cosmos DB and access the results through a Spark DataFrame. Initially this DataFrame should be empty; you are only validating connectivity here.

In [6]:
# Connect via Spark connector to create Spark DataFrame
documents = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**readConfig).load()
display(documents)

Whenever you write data back to Cosmos DB, you need to provide a schema for the DataFrame to apply when writing. Run the following cell to define this schema object.

In [8]:
# Schema used by the events collection
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
eventsSchema = StructType([
  StructField("contentId",StringType(),True),
  StructField("created",StringType(),True),
  StructField("event",StringType(),True),
  StructField("sessionId",StringType(),True),
  StructField("userId",StringType(),True),
  StructField("_attachments",StringType(),True),
  StructField("_etag",StringType(),True),
  StructField("_rid",StringType(),True),
  StructField("_self",StringType(),True),
  StructField("_ts",IntegerType(),True),
])

## Execute the event generation logic

The movies have been pre-selected and sorted into the genres of comedy, drama, and action. While the actual movie selection and the action taken are random, they are weighted by the user's preference for each genre, so the resulting distribution mirrors that user's taste.

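For example, user `400001` is created as `User(400001, 20, 30, 50)`, and the constructor takes its weights in the order action, drama, comedy. That gives a preference of `20` for action, `30` for drama, and `50` for comedy, so this user logs more activity with comedy movies.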
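
To make the effect of the weights concrete, the sketch below draws genres with the standard library's `random.choices` and compares the observed share of each genre with its weight. It is an illustration only and not the sampling routine this notebook uses; the `preferences` dictionary and the `draw_genres` helper are hypothetical names, and the weights belong to a made-up user rather than one of the six simulated here.

```python
import random
from collections import Counter

# Hypothetical preference weights for one user (not one of the notebook's users).
preferences = {"comedy": 10, "drama": 30, "action": 60}

def draw_genres(weights, n, seed=0):
    """Draw n genres at random, each with probability proportional to its weight."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    genres = list(weights)
    return rng.choices(genres, weights=[weights[g] for g in genres], k=n)

n_draws = 10_000
counts = Counter(draw_genres(preferences, n_draws))
total = sum(preferences.values())

# Compare each genre's expected share (weight / total) with what was drawn.
for genre, weight in preferences.items():
    print(f"{genre}: expected {weight / total:.0%}, observed {counts[genre] / n_draws:.1%}")
```

With enough draws, the observed shares settle close to the weights, which is why a user with a high action weight ends up with mostly action events.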

Run the following cell to generate 4000 events, save them to Cosmos DB and take a peek at the data created.

In [10]:
import datetime
import os
import random

SEED = 0

films = {'comedy':
                 [
                  '0475290'
                ,'1289401'
                ,'1292566'
                ,'1473832'
                ,'1489889'
                ,'1608290'
                ,'1679335'
                ,'1700841'
                ,'1711525'
                ,'1860213'
                ,'1878870'
                ,'1985949'
                ,'2005151'
                ,'2277860'
                ,'2387499'
                ,'2709768'
                ,'2823054'
                ,'2869728'
                ,'2937696'
                ,'3110958'
                ,'3381008'
                ,'3470600'
                ,'3521164'
                ,'3553442'
                ,'3783958'
                ,'3874544'
                ,'4034354'
                ,'4048272'
                ,'4136084'
                ,'4139124'
                ,'4438848'
                ,'4501244'
                ,'4513674'
                ,'4624424'
                ,'4651520'
                ,'4698684'
                ,'4901306'
                ,'5247022'
                ,'5512872'],
             'drama': ['2119532'
                ,'2543164'
                ,'3783958'
                ,'3315342'
                ,'3263904'
                ,'4034228'
                ,'3040964'
                ,'3741834'
                ,'2140479'
                ,'1179933'
                ,'1355644'
                ,'4550098'
                ,'2582782'
                ,'4975722'
                ,'2674426'
                ,'2005151'
                ,'4846340'
                ,'1860357'
                ,'3640424'
                ,'3553976'
                ,'2241351'
                ,'4052882'
                ,'2671706'
                ,'3774114'
                ,'5512872'
                ,'4172430'
                ,'3544112'
                ,'4513674'
                ,'0490215'
                ,'1619029'
                ,'4572514'
                ,'1878870'
                ,'1083452'
                ,'2025690'
                ,'1219827'
                ,'1972591'
                ,'4276820'
                ,'2381991'
                ,'3416532'
                ,'2547584'
             ], 'action': [
            '1431045', '2975590'
            , '1386697'
            , '3498820'
            , '3315342'
            , '1211837'
            , '2948356'
            , '3748528'
            , '3385516'
            , '3110958'
            , '4196776'
            , '4425200'
            , '3896198'
            , '2404435'
            , '3731562'
            , '1860357'
            , '4630562'
            , '0803096'
            , '2660888'
            , '3640424'
            , '3300542'
            , '0918940'
            , '2094766'
            , '5700672'
            , '1289401'
            , '1628841'
            , '3393786'
            , '4172430'
            , '4094724'
            , '2025690'
            , '4116284'
            , '3381008'
            , '1219827'
            , '1972591'
            , '2381991'
            , '2034800'
            , '2267968'
            , '2869728'
            , '3949660'
            , '3410834'
        ,'2250912']}

class User:
    """A simulated visitor with a current session and per-genre preference weights."""

    def __init__(self, user_id, action, drama, comedy):
        self.sessionId = random.randint(0, 1000000)
        self.userId = user_id
        self.likes = {'action': action, 'drama': drama, 'comedy': comedy}
        self.events = {self.sessionId: []}

    def get_session_id(self):
        # Roughly 1 call in 10 starts a new session with an empty purchase list
        if random.randint(0, 100) > 90:
            self.sessionId += 1
            self.events[self.sessionId] = []

        return self.sessionId

    def select_genre(self):
        return sample(self.likes)


def select_film(user):
    # Pick a random film from a preferred genre, skipping films already bought this session
    genre = user.select_genre()
    interested_films = films[genre]
    film_id = ''
    while film_id == '':
        film_candidate = interested_films[random.randint(0, len(interested_films) - 1)]
        if film_candidate not in user.events[user.sessionId]:
            film_id = film_candidate

    return film_id


def select_action(user):
    actions = {'details': 70, 'addToCart': 29, 'buy': 1}

    return sample(actions)


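def sample(dictionary):
    # Weighted random choice: each key is picked with probability value / total.
    # Drawing from 1..total (not 0..100) keeps zero-weight keys from ever being
    # chosen and still works when the weights do not sum to exactly 100.
    total = sum(dictionary.values())
    random_number = random.randint(1, total)
    index = 0
    for key, value in dictionary.items():
        index += value

        if random_number <= index:
            return key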


def main():
    
    random.seed(SEED)

    number_of_events = 4000

    print("Generating Data")
    users = [
        User(400001, 20, 30, 50),
        User(400002, 50, 20, 40),
        User(400003, 20, 30, 50),
        User(400004, 100, 0, 0),
        User(400005, 0, 100, 0),
        User(400006, 0, 0, 100),
    ]
    print("Simulating " + str(len(users)) + " visitors")

    from pyspark.sql import Row
    newRows = []
    
    for x in range(0, number_of_events):
        randomuser_id = random.randint(0, len(users) - 1)
        user = users[randomuser_id]
        selected_film = select_film(user)
        action = select_action(user)
        if action == 'buy':
            user.events[user.sessionId].append(selected_film)
        print(str(x) + " user id " + str(user.userId) + " selects film " + str(selected_film) + " and " + action)

        newRows.append( 
          #contentId:string, created:string, event:string, sessionId:string, userId:string, _attachments:string, _etag:string, _rid:string, _self:string, _ts:integer
          Row(selected_film,datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), action, str(user.get_session_id()), str(user.userId), None,None,None,None,None)
        )
    
    parallelizeRows = spark.sparkContext.parallelize(newRows)
    new_documents = spark.createDataFrame(parallelizeRows, eventsSchema)
    new_documents.write.format("com.microsoft.azure.cosmosdb.spark").mode("overwrite").options(**writeConfig).save()
    display(new_documents)
    return

if __name__ == '__main__':
    print("Starting Event Log Population script...")
    main()


The previous output displayed the data held in the local Spark DataFrame. To verify that the data was actually written to Cosmos DB, create a new DataFrame by re-querying Cosmos DB and view the result.

In [12]:
# Re-query the collection to verify the write
documents = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**readConfig).load()
display(documents)

Once you have the data from Cosmos DB in a DataFrame, you can create a temporary view that enables querying using Spark SQL. Run the following cell to create this view and then the cell that follows to issue a SQL query against it.

In [14]:
documents.createOrReplaceTempView("events")

In [15]:
%sql
SELECT userId, event, count(*) FROM events
GROUP BY userId, event 
ORDER BY userId
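
For readers less familiar with SQL, the aggregation above can be mirrored in plain Python. This is a rough sketch of what the `GROUP BY` computes, not a replacement for the Spark SQL query; the `rows` list is hypothetical sample data shaped like the `userId` and `event` columns, not output from Cosmos DB.

```python
from collections import Counter

# Hypothetical sample of (userId, event) pairs shaped like the events collection.
rows = [
    ("400001", "details"),
    ("400001", "details"),
    ("400001", "addToCart"),
    ("400002", "details"),
    ("400002", "buy"),
]

# GROUP BY userId, event with count(*): tally each distinct (userId, event) pair.
counts = Counter(rows)

# ORDER BY userId: sorting the tuples orders by userId first. It also orders by
# event within each user, which the SQL query does not guarantee.
for (user_id, event), n in sorted(counts.items()):
    print(user_id, event, n)
```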

You are finished with this notebook and can return to the lab guide.