### Aim

Create a **user context** from the user events history

**User Context** - The last N sessions of a user, counted back from the current date.<br>
N is a hyperparameter that can be tuned during model training; here we set it to 10.

**Assumptions** - Let one session be defined as one day of user events
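
As a minimal sketch of this assumption, the snippet below groups a toy event log into one session per visitor per calendar day. The frame `toy_events` and its values are hypothetical and used only for illustration; the column names mirror those of the events file.

```python
import pandas as pd

# Toy events (hypothetical values); timestamps are milliseconds since the epoch
toy_events = pd.DataFrame({
    "timestamp": [1433221332117, 1433224214164, 1433395158782],
    "visitorid": [1, 1, 1],
    "itemid": [10, 11, 12],
})

# One session = one calendar day of events for a visitor
toy_events["date"] = pd.to_datetime(toy_events["timestamp"], unit="ms").dt.date
sessions = toy_events.groupby(["visitorid", "date"])["itemid"].apply(list)
n_sessions = len(sessions)
print(sessions)
```

The first two events fall on the same day and form one session; the third event forms a second session.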

In [1]:
import pandas as pd
from collections import defaultdict
import datetime
import numpy as np

In [2]:
datapath = '../../datasets/user-items-recsys/'
events_file = 'events.csv'
category_tree = 'category_tree.csv'
item_props_1 = 'item_properties_part1.csv'
item_props_2 = 'item_properties_part2.csv'

In [3]:
df_events = pd.read_csv(datapath+events_file)
df_events.shape

(2756101, 5)

In [4]:
df_events.head()

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid
0,1433221332117,257597,view,355908,
1,1433224214164,992329,view,248676,
2,1433221999827,111016,view,318965,
3,1433221955914,483717,view,253185,
4,1433221337106,951259,view,367447,


In [5]:
## Convert the millisecond timestamp to a datetime and a date
df_events['datetime'] = pd.to_datetime(df_events['timestamp'],unit='ms')
df_events['date'] = df_events['datetime'].dt.date

#### Let us see if a user's history gives any useful insights

We will analyze the items this user viewed, and their categories, to look for patterns.

User ID = **992329**

In [6]:
user_id = 992329

In [7]:
df_user = df_events.query('visitorid == @user_id')

In [8]:
print("Number of events by this user:",df_user.shape[0])
df_user.head()

Number of events by this user: 30


Unnamed: 0,timestamp,visitorid,event,itemid,transactionid,datetime,date
1,1433224214164,992329,view,248676,,2015-06-02 05:50:14.164,2015-06-02
20559,1433224672007,992329,view,193150,,2015-06-02 05:57:52.007,2015-06-02
44215,1433225555976,992329,view,246453,,2015-06-02 06:12:35.976,2015-06-02
50030,1433395158782,992329,view,8775,,2015-06-04 05:19:18.782,2015-06-04
64989,1433395205712,992329,view,8775,,2015-06-04 05:20:05.712,2015-06-04


Let us explore the item IDs to see whether the user was browsing similar items

In [9]:
items_list = df_user['itemid'].unique()

In [10]:
# Load item properties (part 1)
df_items1 = pd.read_csv(datapath+item_props_1)
# Load item properties (part 2)
df_items2 = pd.read_csv(datapath+item_props_2)

In [11]:
## Convert timestamp to date
df_items1['datetime'] = pd.to_datetime(df_items1['timestamp'],unit='ms')
df_items1['date'] = pd.to_datetime(df_items1['timestamp'],unit='ms').dt.date
df_items1.head()

## Convert timestamp to date
df_items2['datetime'] = pd.to_datetime(df_items2['timestamp'],unit='ms')
df_items2['date'] = pd.to_datetime(df_items2['timestamp'],unit='ms').dt.date
df_items2.head()

Unnamed: 0,timestamp,itemid,property,value,datetime,date
0,1433041200000,183478,561,769062,2015-05-31 03:00:00,2015-05-31
1,1439694000000,132256,976,n26.400 1135780,2015-08-16 03:00:00,2015-08-16
2,1435460400000,420307,921,1149317 1257525,2015-06-28 03:00:00,2015-06-28
3,1431831600000,403324,917,1204143,2015-05-17 03:00:00,2015-05-17
4,1435460400000,230701,521,769062,2015-06-28 03:00:00,2015-06-28


In [12]:
df_items = pd.concat([df_items1,df_items2])

**Filter for the items viewed by the user**

In [13]:
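# Note: only part 1 of the item properties (df_items1) is searched here;
# querying the concatenated df_items would also cover part 2.
df_item_properties = df_items1.query("itemid in @items_list")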
df_item_categories = df_item_properties.query("property=='categoryid'")
print("Number of items viewed by the user:",len(items_list))
print("Number of items for which we have a category:",len(df_item_categories['itemid'].unique()))

Number of items viewed by the user: 24
Number of items for which we have a category: 15


This means 9 of the 24 items have no categoryid in the properties we searched.<br>
We ignore those and analyze the categories of the remaining 15 items.

**Now let us explore the items**

1. We first load the category hierarchy table.
2. Then, we build a tree data structure from the hierarchy dataframe.
3. Finally, we analyze which items belong to a common root category.

*A tree data structure lets us easily find the common ancestor of two categories*<br>
*A root category is the category at the root node of the hierarchy*
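
The `CategoryTree` module used below is not shown in this notebook, so here is a rough sketch of the underlying idea under stated assumptions: a parent map built from a `categoryid`/`parentid` table, and a walk up the parent links to reach the root. The names `toy_tree`, `parent_of` and `find_root` are hypothetical and are not part of that module.

```python
import pandas as pd

# Toy hierarchy (hypothetical ids); a missing parentid marks a root category
toy_tree = pd.DataFrame({
    "categoryid": [1, 2, 3, 4, 5],
    "parentid": [None, 1, 1, 2, None],
})

# Map each non-root category to its parent
parent_of = {
    int(row.categoryid): int(row.parentid)
    for row in toy_tree.itertuples()
    if pd.notna(row.parentid)
}

def find_root(category_id):
    """Walk up the parent links until a category without a parent is reached."""
    current = int(category_id)  # category values may arrive as strings
    while current in parent_of:
        current = parent_of[current]
    return current

roots = {int(c): find_root(c) for c in toy_tree["categoryid"]}
print(roots)
```

Two categories share a common ancestor exactly when `find_root` returns the same root for both, which is the property the analysis in Step 3 relies on.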

#### Step 1: Load category hierarchy data into a dataframe

In [14]:
# Load category tree
df_category_tree = pd.read_csv(datapath+category_tree)
df_category_tree.head()

Unnamed: 0,categoryid,parentid
0,1016,213.0
1,809,169.0
2,570,9.0
3,1691,885.0
4,536,1691.0


#### Step 2: Create Tree Data structure from category tree

In [15]:
import CategoryTree
import importlib
importlib.reload(CategoryTree)

<module 'CategoryTree' from 'C:\\Users\\Samarth\\Documents\\ml_projects\\rec-items\\RetailRocket\\CategoryTree.py'>

In [16]:
from CategoryTree import CategoryTree

In [17]:
categoryTree = CategoryTree(df_category_tree)

In [18]:
categoryTree.build_trees()

In [19]:
print("Number of independent trees:",len(categoryTree.trees))
print("This indicates that there are {} independent root categories not linked to each other".format(len(categoryTree.trees)))

Number of independent trees: 25
This indicates that there are 25 independent root categories not linked to each other


#### Printing a category tree

Let us print out a tree to visualize what it looks like

In [20]:
print("Printing Tree with Root Category 679")
categoryTree.print_tree(categoryTree.trees[679],0)

Printing Tree with Root Category 679
Node: 679
	Node: 1424
		Node: 365
		Node: 421
		Node: 1143
		Node: 553
		Node: 1008
			Node: 992
			Node: 202
		Node: 245
			Node: 1520
		Node: 230
		Node: 1105
		Node: 1215
			Node: 1587
			Node: 91
			Node: 1199
			Node: 1461
			Node: 1204
			Node: 1359
			Node: 417
		Node: 281
	Node: 1139
		Node: 310
		Node: 258
		Node: 550
		Node: 449
	Node: 630
		Node: 1217
		Node: 1662
		Node: 752
		Node: 233
		Node: 103
		Node: 1112
	Node: 901
		Node: 1231
		Node: 368
		Node: 833
	Node: 313
		Node: 1436
		Node: 560
		Node: 559
	Node: 1544
	Node: 491
	Node: 869


#### Step 3: Let us now search for the root categories of items viewed by the user

In [21]:
rootCatsMap = defaultdict(list)
for category in df_item_categories['value'].values:
    rootCatsMap[categoryTree.get_root_category(category)].append(category)
    
print("The following dictionary shows the list of items belonging to a root category")
print("(Key-Value pair. Key: Root Category, Value: List of categories of items viewed by user)")
dict(rootCatsMap)

The following dictionary shows the list of items belonging to a root category
(Key-Value pair. Key: Root Category, Value: List of categories of items viewed by user)


{140: ['1219',
  '1258',
  '473',
  '471',
  '1248',
  '1219',
  '1173',
  '34',
  '531',
  '279'],
 395: ['411', '1483'],
 679: ['1662', '833'],
 1224: ['612']}

*We see that 10 of the 15 categorized items viewed by the user fall under Root Category **140**.
This suggests recommending items from this category to the user*
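
As a small worked check of that observation, the sketch below tallies the per-root counts read off the dictionary output above and computes the share of the dominant root. The counts are copied by hand from that output; nothing is recomputed from the dataset.

```python
from collections import Counter

# Number of viewed-item categories under each root, copied from the output above
root_counts = Counter({140: 10, 395: 2, 679: 2, 1224: 1})

total = sum(root_counts.values())
top_root, top_count = root_counts.most_common(1)[0]
share = top_count / total
print(f"Root {top_root}: {top_count}/{total} = {share:.0%}")
```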

### User session

Let us take one session, i.e., one day of user events

In [22]:
print("Number of events for this user:",df_user.shape[0])
session_dates = df_user['date'].unique()
print("Number of sessions for this user:",len(session_dates))

Number of events for this user: 30
Number of sessions for this user: 15


In [23]:
print("Here are the session dates")
print(session_dates)

Here are the session dates
[datetime.date(2015, 6, 2) datetime.date(2015, 6, 4)
 datetime.date(2015, 6, 5) datetime.date(2015, 6, 8)
 datetime.date(2015, 6, 7) datetime.date(2015, 6, 10)
 datetime.date(2015, 6, 13) datetime.date(2015, 6, 11)
 datetime.date(2015, 6, 22) datetime.date(2015, 6, 26)
 datetime.date(2015, 5, 5) datetime.date(2015, 5, 28)
 datetime.date(2015, 7, 6) datetime.date(2015, 7, 10)
 datetime.date(2015, 7, 30)]


#### Let us create a session for 5th June

In [24]:
df_session = df_user[df_user['date']==datetime.date(2015,6,5)]
df_session

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid,datetime,date
82514,1433480383896,992329,view,225942,,2015-06-05 04:59:43.896,2015-06-05
92903,1433480149682,992329,view,437767,,2015-06-05 04:55:49.682,2015-06-05
93057,1433480524891,992329,view,314515,,2015-06-05 05:02:04.891,2015-06-05


In [25]:
print("Number of events on 5th June:",df_session.shape[0])

Number of events on 5th June: 3


### User Context

*Stack up the last N sessions to create a context*

1. Create the input for a single user session
2. Stack up the last N sessions to form a context

In [26]:
context_N = 10

#### Some basic statistics

In [27]:
n_items = df_items['itemid'].unique()
print("Number of unique items:",len(n_items))

Number of unique items: 417053


In [28]:
event_items = df_events['itemid'].unique()
print("Number of unique items in the events table:",len(event_items))

Number of unique items in the events table: 235061


*Number of events*: 2,756,101 (about 2.76M)<br>
*Number of unique items*: 417,053<br>
*Items having an event*: 235,061

In [29]:
print("Number of items having 0 events:",len(n_items)-len(event_items))

Number of items having 0 events: 181992


#### Step 1: Creating input for single user session

Vectorize the itemid of each event to build the session input.<br>
Each event in a session is represented by its itemid as a one-hot vector, which will be passed to an embedding layer to get the item embedding
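
To illustrate why a one-hot vector and an embedding layer fit together, the sketch below shows that multiplying a one-hot matrix by an embedding matrix gives the same result as looking up rows by item index. The sizes and the random `embedding` matrix are hypothetical toy values, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, emb_dim = 6, 4                          # toy sizes, not the dataset's item count
embedding = rng.normal(size=(n_items, emb_dim))  # hypothetical embedding matrix

session_idx = np.array([2, 5, 2])                # item indices of a 3-event session
one_hot = np.eye(n_items)[session_idx]           # shape (3, n_items)

via_matmul = one_hot @ embedding                 # dense one-hot route
via_lookup = embedding[session_idx]              # index lookup route
same = np.allclose(via_matmul, via_lookup)
print(one_hot.shape, via_matmul.shape, same)
```

With 235,061 items, a dense float64 one-hot matrix for just 3 events already takes about 5.6 MB (3 × 235,061 × 8 bytes), so integer item indices may be a lighter input when the embedding layer accepts them.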

In [30]:
from sklearn.preprocessing import OneHotEncoder

In [31]:
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`, and 1.4 removed `sparse`
tokenizer = OneHotEncoder(sparse=False)

In [32]:
df_session

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid,datetime,date
82514,1433480383896,992329,view,225942,,2015-06-05 04:59:43.896,2015-06-05
92903,1433480149682,992329,view,437767,,2015-06-05 04:55:49.682,2015-06-05
93057,1433480524891,992329,view,314515,,2015-06-05 05:02:04.891,2015-06-05


In [33]:
event_items = np.array(event_items).reshape((-1,1))
tokenizer.fit(event_items)

OneHotEncoder(sparse=False)

In [34]:
X_session = tokenizer.transform(np.array(df_session['itemid']).reshape(-1,1))
print("Shape of session containing 3 events:",X_session.shape)

Shape of session containing 3 events: (3, 235061)


#### Step 2: Let us concatenate last 10 available user sessions

In [35]:
curr_date = datetime.date(2022,2,15)
lastN = 10

In [36]:
## Sort the user's events chronologically
df_user = df_user.sort_values(by="datetime")
df_user.head()

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid,datetime,date
1492085,1430802384830,992329,view,2711,,2015-05-05 05:06:24.830,2015-05-05
1981137,1432781848232,992329,view,340825,,2015-05-28 02:57:28.232,2015-05-28
1972814,1432782322169,992329,view,446522,,2015-05-28 03:05:22.169,2015-05-28
1,1433224214164,992329,view,248676,,2015-06-02 05:50:14.164,2015-06-02
20559,1433224672007,992329,view,193150,,2015-06-02 05:57:52.007,2015-06-02


In [37]:
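def get_last_n_sessions_for_user(user_id, curr_date, n=5):
    """Return the events from the user's last n sessions (days) before curr_date."""
    # Keep only this user's events strictly before curr_date, oldest first
    df_user = df_events.query("visitorid == @user_id and date < @curr_date").sort_values(by='datetime')
    user_session_dates = df_user['date'].unique()
    # The n most recent session dates
    lastN_dates = sorted(user_session_dates)[-n:]
    return df_user.query("date in @lastN_dates")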

In [38]:
df_recent_user_history = get_last_n_sessions_for_user(user_id, curr_date, lastN)
print("Number of events in last 10 sessions:",df_recent_user_history.shape[0])
df_recent_user_history

Number of events in last 10 sessions: 18


Unnamed: 0,timestamp,visitorid,event,itemid,transactionid,datetime,date
145189,1433703062882,992329,view,48030,,2015-06-07 18:51:02.882,2015-06-07
133651,1433729380709,992329,view,352788,,2015-06-08 02:09:40.709,2015-06-08
133782,1433729489683,992329,view,169213,,2015-06-08 02:11:29.683,2015-06-08
133552,1433729786622,992329,view,46971,,2015-06-08 02:16:26.622,2015-06-08
145597,1433730019241,992329,view,225942,,2015-06-08 02:20:19.241,2015-06-08
140216,1433730231553,992329,view,395071,,2015-06-08 02:23:51.553,2015-06-08
141987,1433730402675,992329,view,377484,,2015-06-08 02:26:42.675,2015-06-08
206958,1433945313290,992329,view,299003,,2015-06-10 14:08:33.290,2015-06-10
198995,1433946626515,992329,view,381721,,2015-06-10 14:30:26.515,2015-06-10
235412,1434035218038,992329,view,457496,,2015-06-11 15:06:58.038,2015-06-11


In [39]:
X_user_context = tokenizer.transform(np.array(df_recent_user_history['itemid']).reshape(-1,1))
print("Shape of context containing {} sessions: {}".format(context_N, X_user_context.shape))

Shape of context containing 10 sessions: (18, 235061)


The method above fetches the last N sessions of a user and converts them into a NumPy array. This array can serve as the user context input for the model.

#### Next Steps

1. This input should be passed to an embedding layer to fetch item embeddings<br>
2. The embeddings need to be aggregated (the YouTube paper uses averaging), since the number of events varies from user to user<br>
3. Other features can then be concatenated with the aggregated embeddings and passed to a Dense layer
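
The steps above can be sketched with plain NumPy, as an illustration only. The `embedding` matrix is random and untrained, and `user_events` and `other_features` are hypothetical toy values; a real model would learn the embeddings inside a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, emb_dim = 6, 4
embedding = rng.normal(size=(n_items, emb_dim))  # hypothetical, untrained

# Two users with different numbers of events (toy item indices)
user_events = {"a": np.array([0, 3, 3, 5]), "b": np.array([1])}

# Step 1 and 2: look up the item embeddings, then mean-pool them so that
# every user gets a vector of the same length
user_vectors = {u: embedding[idx].mean(axis=0) for u, idx in user_events.items()}

# Step 3: concatenate with other (hypothetical) features before a dense layer
other_features = np.array([0.5, 1.0])
dense_input = np.concatenate([user_vectors["a"], other_features])
print(user_vectors["a"].shape, user_vectors["b"].shape, dense_input.shape)
```

Averaging keeps the context vector at a fixed size of `emb_dim` regardless of how many events a user has, which is what allows it to be concatenated with other fixed-size features.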

The aim of this notebook is achieved. The next steps will be carried out in a separate notebook.