# Practical Work in AI

In [1]:
# installs necessary libraries

#!pip install pandas

In [2]:
# necessary imports for notebook to run
import pandas as pd

## 1. Dataset analysis

### 1.1 Structure of processed.csv

As a first step, we compute descriptive statistics on the given dataset to see whether it can feasibly be sessionized and then used to train sequence-aware recommender system models [a], such as GRU4Rec. In our work, a sessionized dataset is one that has been split into interaction sessions, in each of which a specific user interacts with items. A fixed time threshold determines when a gap between consecutive interactions starts a new session. A session always belongs to exactly one user, and one user can have multiple sessions.

For the analysis, we are working with the "processed.csv" dataset, which contains user-item interactions and represents a pre-processed version of the "new_release_stream.csv" file. A single row in this file looks like the following:

`28,60,188807,1,[0.0],train`

Here, the user with ID 28 listened to the track with ID 60 at timestamp 188807 (measured in seconds from first consumption). The user listened to more than 80% of the track's length (y = 1). The relational interval, i.e. the time (measured in hours) between this interaction of user 28 with item 60 and the previous ones, is 0.0, meaning this is the first interaction of user 28 with track 60. Based on the train-test-val split done during pre-processing (see `preprocess.py`), this interaction is part of the training set.
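To make the row layout concrete, the following minimal sketch parses two such rows with pandas. The header names are taken from the DataFrame output shown later in this notebook, so they are an assumption about the real file; `ast.literal_eval` is one possible way to turn the relational interval string into a Python list.

```python
import ast
import io

import pandas as pd

# Two rows in the layout described above. The header line is an assumption
# based on the column names that appear in later DataFrame outputs.
raw = io.StringIO(
    "userId,itemId,timestamp,y,relational_interval,set\n"
    "28,60,188807,1,[0.0],train\n"
    '28,60,188977,1,"[0.04722222222222222, 0.0]",train\n'
)

# literal_eval converts the string "[0.0]" into the Python list [0.0]
rows = pd.read_csv(raw, converters={"relational_interval": ast.literal_eval})

first = rows.iloc[0]
print(first["userId"], first["itemId"], first["relational_interval"])
```

Without the converter, the `relational_interval` column is read as plain strings.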

Repetitive behavior is measured via the relational interval column (here [0.0]). Hence, the following lines from the dataset:

`28,60,188807,1,[0.0],train`

`28,60,188977,1,"[0.04722222222222222, 0.0]",train`

`28,60,189155,0,"[0.09666666666666666, 0.049444444444444444]",train`

have the following meaning with regards to the repetitions:

`28,60,188807,1,[0.0],train` = User 28 consumes item 60 for the first time, so the relational interval contains no previous consumptions

`28,60,188977,1,"[0.04722222222222222, 0.0]",train` = User 28 first consumed item 60 188977 - 188807 = 170 s ago, i.e. 170 / 60 / 60 ≈ 0.047222 h ago

`28,60,189155,0,"[0.09666666666666666, 0.049444444444444444]",train` = User 28 last consumed item 60 189155 - 188977 = 178 s ago, i.e. 178 / 60 / 60 ≈ 0.049444 h ago, and first consumed it 0.049444444444444444 + 0.04722222222222222 = 0.09666666666666666 h ago
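The arithmetic above can be reproduced with a short sketch. The rule used here is an assumption inferred from the sample rows, not taken from `preprocess.py`: each list holds the hours elapsed since every earlier play with y = 1, plus 0.0 for the current play when its own y is 1. The function name `relational_intervals` is hypothetical.

```python
def relational_intervals(plays):
    """plays: list of (timestamp_seconds, y) for one user-item pair, in time order.

    Returns one list per play: hours elapsed since each earlier play with
    y == 1, plus 0.0 for the current play when its own y is 1.
    (Assumed rule, inferred from the sample rows.)
    """
    result = []
    positives = []  # timestamps of plays with y == 1 seen so far
    for timestamp, y in plays:
        if y == 1:
            positives.append(timestamp)
        result.append([(timestamp - earlier) / 3600 for earlier in positives])
    return result


plays = [(188807, 1), (188977, 1), (189155, 0)]
for intervals in relational_intervals(plays):
    print([round(hours, 6) for hours in intervals])
# [0.0]
# [0.047222, 0.0]
# [0.096667, 0.049444]
```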

### 1.2 Statistical Analysis

The analysis of the dataset is separated into two parts:

- **Global measures**: e.g. total count of user-item interactions per set, average count of user-item interactions per item, etc.
- **Session-based measures**: e.g. average number of sessions per user, average number of interactions per session, etc.

Global measures describe the dataset as a whole, without (yet) any focus on sessions. Session-based measures, as the name suggests, are evaluated on a previously sessionized dataset.

We start by reading in the data:

In [3]:
file_path = './data/processed.csv' # adjust as needed
df = pd.read_csv(file_path)

First, we focus on the global measures.

#### 1.2.1 Global measures

We observe the number of interactions, unique users and unique items in the dataset:

In [4]:
print("Number of user-item interactions in total: ", len(df))
print("Number of unique users: ", df.userId.nunique())
print("Number of unique songs: ", df.itemId.nunique())

Number of user-item interactions in total:  1583815
Number of unique users:  3623
Number of unique songs:  879


These numbers are consistent with those reported in [b], after the authors applied a pre-processing step with thresholds $k^{item}$ and $k^{user}$, where each user has to have interacted with at least $k^{item}$ items and every item has to have been consumed by at least $k^{user}$ users. They chose $k^{item} = k^{user} = 30$. For details, see [b], p. 975.
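As an illustration of such a filtering step, here is a minimal sketch in pandas. It is not the authors' implementation: the function name `filter_min_interactions` is hypothetical, and repeating the filter until no row is dropped is an assumption (removing a user can push an item below its threshold, and vice versa).

```python
import pandas as pd


def filter_min_interactions(df, k_item=30, k_user=30):
    """Keep users with >= k_item distinct items and items with >= k_user distinct users.

    The filter is repeated until nothing changes (an assumption of this sketch).
    """
    while True:
        items_per_user = df.groupby("userId")["itemId"].transform("nunique")
        users_per_item = df.groupby("itemId")["userId"].transform("nunique")
        keep = (items_per_user >= k_item) & (users_per_item >= k_user)
        if keep.all():
            return df
        df = df[keep]


# Toy data with small thresholds: user 3 and item 12 do not reach them
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "itemId": [10, 11, 10, 11, 12],
})
filtered = filter_min_interactions(toy, k_item=2, k_user=2)
print(sorted(filtered["userId"].unique()))  # [1, 2]
```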

##### **Number of total user-item interactions per set**

The following table shows the number of interactions per set (training, validation, and test). The sets are pre-determined by the `set` column, which is filled during the pre-processing step.

In [5]:
# count number of rows per set, aka count size of training, test and validation dataset
set_counts = df['set'].value_counts()

# create df out of series and rename columns accordingly
set_counts = set_counts.reset_index(name='n_interactions')

print("Number of user-item interactions for training-, test-, and validation set:")
set_counts


Number of user-item interactions for training-, test-, and validation set:


| | index | n_interactions |
|---|---|---|
| 0 | train | 1423927 |
| 1 | test | 80166 |
| 2 | val | 79722 |


As we can see, after splitting the data, we have roughly 1.42M training, 80.2k test, and 79.7k validation interactions.

##### **Interest score distribution across sets**

We can now break the number of interactions down by various criteria. For example, the following table shows the distribution of the interest variable across the different sets. The authors use the binary variable $y$ as the target for interest, where $y=1$ if the user has listened to more than 80% of the track, and $y=0$ otherwise. That is, if $y=1$, the interest score for that interaction is equal to 1, meaning the user is interested in the item. These values serve as the target variables in the test set. During training, the model produces interest scores ranging from 0 to 1. For evaluation, these are converted to 1 if the score is >= 0.5 and to 0 otherwise, and then compared to the target interest scores.
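The thresholding step can be sketched as follows. The scores are made-up values rather than model outputs, and accuracy is used only as an illustrative comparison; the text above does not say which metric the authors report.

```python
import numpy as np

predicted_scores = np.array([0.91, 0.42, 0.50, 0.08])  # hypothetical interest scores
targets = np.array([1, 0, 1, 1])                        # y values of the same interactions

# scores >= 0.5 become 1 (interested), everything else 0
predicted_labels = (predicted_scores >= 0.5).astype(int)

# share of interactions where the binarized prediction matches the target
accuracy = (predicted_labels == targets).mean()

print(predicted_labels.tolist())  # [1, 0, 1, 0]
print(accuracy)                   # 0.75
```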

In [6]:
listening_events = df.groupby(['set', 'y']).size().reset_index(name='n_interactions')
listening_events

| | set | y | n_interactions |
|---|---|---|---|
| 0 | test | 0 | 36564 |
| 1 | test | 1 | 43602 |
| 2 | train | 0 | 664080 |
| 3 | train | 1 | 759847 |
| 4 | val | 0 | 36453 |
| 5 | val | 1 | 43269 |


We observe a fairly even distribution of the binary interest variable across the sets, with slightly more "interested" interactions than "not interested" interactions.
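To quantify "slightly more", the counts from the table can be turned into shares per set. This is a small sketch that re-enters the numbers by hand instead of reading them from `df`.

```python
import pandas as pd

# counts copied from the table above
listening_events = pd.DataFrame({
    "set": ["test", "test", "train", "train", "val", "val"],
    "y": [0, 1, 0, 1, 0, 1],
    "n_interactions": [36564, 43602, 664080, 759847, 36453, 43269],
})

# total interactions of the set each row belongs to
totals = listening_events.groupby("set")["n_interactions"].transform("sum")
listening_events["share"] = (listening_events["n_interactions"] / totals).round(3)

interested = listening_events[listening_events["y"] == 1][["set", "share"]]
print(interested)
```

In all three sets, the share of interactions with $y=1$ lies between roughly 53% and 54%.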

##### **Average number of interactions per user, in total and per set**

The interactions can also be averaged per user or per item, and further split into a more useful statistic - the average number of interactions per user/item for each set. This lets us know if each set has enough data per user/item in it. We also observe how many tracks the user is interested in within all interactions.

In [7]:
user_interaction_counts = df.groupby('userId').size() # per userId, count the interactions
avg_user_counts = user_interaction_counts.mean() # average interaction counts over all users

print(f"On average, a user has {round(avg_user_counts, 1)} item interactions in total with the following interest distribution:")

user_interaction_counts_interest_scores = df.groupby(['userId', 'y']).size().reset_index(name='n_interactions')
user_interaction_counts_interest_scores.groupby('y')['n_interactions'].mean().round(1).reset_index(name='n_interactions')

On average, a user has 437.2 item interactions in total with the following interest distribution:


| | y | n_interactions |
|---|---|---|
| 0 | 0 | 203.4 |
| 1 | 1 | 233.7 |


On average, users have around 437 interactions and actually listen to (= are interested in) slightly more than half of them (about 53%).

In [8]:
user_interaction_counts_per_set = df.groupby(['set', 'userId', 'y']).size().reset_index(name='n_interactions')

user_interaction_counts_per_set = user_interaction_counts_per_set.groupby(['set', 'y'])['n_interactions'].mean().round(1).reset_index(name='n_interactions')

print("On average, a user has the following numbers of interactions per set:")
user_interaction_counts_per_set

On average, a user has the following numbers of interactions per set:


| | set | y | n_interactions |
|---|---|---|---|
| 0 | test | 0 | 10.1 |
| 1 | test | 1 | 12.0 |
| 2 | train | 0 | 183.3 |
| 3 | train | 1 | 209.7 |
| 4 | val | 0 | 10.1 |
| 5 | val | 1 | 11.9 |


##### **Average number of interactions per item, in total and per set**

In [9]:
item_interaction_counts = df['itemId'].value_counts() # per track, count the interactions
avg_item_counts = item_interaction_counts.mean() # average interaction counts over all items

print(f"On average, an item is interacted with {round(avg_item_counts, 1)} times, split into the following interest distribution:")

item_interaction_counts_interest_scores = df.groupby(['itemId', 'y']).size().reset_index(name='n_interactions')
item_interaction_counts_interest_scores.groupby('y')['n_interactions'].mean().round(1).reset_index(name='n_interactions')

On average, an item is interacted with 1801.8 times, split into the following interest distribution:


| | y | n_interactions |
|---|---|---|
| 0 | 0 | 838.6 |
| 1 | 1 | 963.3 |


In [53]:
item_interaction_counts_per_set = df.groupby(['set', 'itemId', 'y']).size()

item_interaction_counts_per_set = item_interaction_counts_per_set.groupby(['set', 'y']).mean().round(1).reset_index(name='n_interactions')

print("On average, an item is interacted with the following number of times per set:")
item_interaction_counts_per_set

On average, an item is interacted with the following number of times per set:


| | set | y | n_interactions |
|---|---|---|---|
| 0 | test | 0 | 44.7 |
| 1 | test | 1 | 53.3 |
| 2 | train | 0 | 755.5 |
| 3 | train | 1 | 864.4 |
| 4 | val | 0 | 45.0 |
| 5 | val | 1 | 53.4 |


Here we also observe that a slight majority of consumed tracks are listened to for more than 80% of their duration.

##### **Average count of repetitions across all users, in total and per set**

Since our work focuses on a user's repetitive behavior, another statistic that we observe is the number of repetitions, i.e. how often the same user-item pair occurs in the dataset.

In [68]:
user_item_same_pairs_counts = df.groupby(['userId', 'itemId', 'y']).size().reset_index(name='n_interactions') # counts the number of rows per same user-item pair

average_repetitions = user_item_same_pairs_counts.groupby('y')['n_interactions'].mean().round(1).reset_index(name='n_interactions') # averages them

print("On average, a user repeats one particular track with the following distribution:")
average_repetitions

On average, a user repeats one particular track with the following distribution:


| | y | n_interactions |
|---|---|---|
| 0 | 0 | 5.1 |
| 1 | 1 | 5.8 |


In [12]:
user_item_same_pairs_counts_per_set = df.groupby(['set', 'userId', 'itemId']).size() # counts the number of same user-item pairs per set

average_repetitions_per_set = user_item_same_pairs_counts_per_set.groupby(level='set').mean().round(1) # averages them for each set

print("Per set, on average, a user repeats one particular track the following number of times: \n", average_repetitions_per_set)

Per set, on average, a user repeats one particular track the following number of times: 
 set
test     11.1
train    10.8
val      11.0
dtype: float64


#### 1.2.2 Session-based measures

In [13]:
# define cutoff value for sessionizing (here, 30 mins = 1800 s)
THRESHOLD = 1800

def detect_sessions(unique_user_interactions, threshold=THRESHOLD):
    """Assign a per-user session_id to the interactions of a single user."""
    # sort interactions chronologically (should already hold for the dataset, this is just a precaution)
    unique_user_interactions = unique_user_interactions.sort_values('timestamp')

    # time differences between consecutive timestamps: timestamp_j - timestamp_i
    time_diff = unique_user_interactions['timestamp'].diff()

    # a gap larger than the threshold starts a new session; the cumulative sum of
    # these session starts yields a running ID (0, 1, 2, ...) that marks which
    # rows of the sorted interactions belong to which session
    sessions = (time_diff > threshold).cumsum()
    unique_user_interactions['session_id'] = sessions
    return unique_user_interactions
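The `diff`/`cumsum` idea behind the function can be traced on a handful of made-up timestamps. This is a minimal sketch; the values are invented and only the 1800 s threshold is taken from the notebook.

```python
import pandas as pd

timestamps = pd.Series([0, 600, 1500, 4000, 4100, 9000])  # seconds, one user

gaps = timestamps.diff()        # NaN, 600, 900, 2500, 100, 4900
new_session = gaps > 1800       # NaN compares as False, so the first row stays in session 0
session_id = new_session.cumsum()

print(session_id.tolist())  # [0, 0, 0, 1, 1, 2]
```

Because the comparison is strict, a gap of exactly 1800 s does not start a new session.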

In [14]:
# assign session id's based on unique user's interactions and threshold
sessionized_df_overall = df.groupby(['userId']).apply(detect_sessions).reset_index(drop=True)

# for each set, group each user (unique users in training set, unique users in val set, unique users in test set) and detect their sessions
sessionized_df = df.groupby(['set', 'userId']).apply(detect_sessions).reset_index(drop=True)

In [15]:
sessionized_df

| | userId | itemId | timestamp | y | relational_interval | set | session_id |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 14 | 1296206 | 1 | [0.0] | test | 0 |
| 1 | 0 | 298 | 1326351 | 1 | [0.0] | test | 1 |
| 2 | 0 | 14 | 1449841 | 1 | [42.67638888888889, 0.0] | test | 2 |
| 3 | 0 | 14 | 1450319 | 0 | [42.80916666666667, 0.13277777777777777] | test | 2 |
| 4 | 0 | 14 | 2895492 | 1 | [444.24611111111113, 401.5697222222222, 0.0] | test | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1583810 | 3622 | 559 | 9968460 | 1 | [597.9216666666666, 593.5372222222222, 572.214... | val | 18 |
| 1583811 | 3622 | 559 | 10140227 | 1 | [645.6347222222222, 641.2502777777778, 619.927... | val | 19 |
| 1583812 | 3622 | 584 | 10144708 | 1 | [641.2966666666666, 595.9677777777778, 595.891... | val | 20 |
| 1583813 | 3622 | 559 | 10391098 | 1 | [715.3211111111111, 710.9366666666666, 689.614... | val | 21 |


In [16]:
# Filter the sessionized DataFrame for user with userId 3
user_sessions = sessionized_df_overall[sessionized_df_overall['userId'] == 3]

# Display all sessions for this user
print(user_sessions)

      userId  itemId  timestamp  y                       relational_interval  \
2411       3       3       8076  0                                        []   
2412       3       6     196538  0                                        []   
2413       3      49     242578  1                                     [0.0]   
2414       3      30     301625  0                                        []   
2415       3      88     317156  0                                        []   
...      ...     ...        ... ..                                       ...   
3676       3      61    5152873  0                      [1011.2397222222222]   
3677       3      61    5152952  0                      [1011.2616666666667]   
3678       3     298    5153050  0                      [1011.3916666666667]   
3679       3     278    5153053  0  [458.37694444444446, 18.959722222222222]   
3680       3     298    5153134  0                                [1011.415]   

        set  session_id  
2411  train  

##### **Total number of sessions in the dataset**

In [17]:
# count distinct (userId, session_id) pairs to get the total number of sessions - session IDs restart at 0 for every user, so the same ID may occur for several users
n_sessions_total = len(sessionized_df_overall.groupby(['userId', 'session_id']).size())

print(f"In total, the dataset consists of {n_sessions_total} sessions.")

In total, the dataset consists of 311509 sessions.


##### **Average number of sessions per user, in total and per set**

In [18]:
# count id's across users and calculate average of that count across all users
session_counts_user_overall = sessionized_df_overall.groupby('userId')['session_id'].nunique().mean().round(1)

print(f"On average, a user has {session_counts_user_overall} sessions.")

On average, a user has 86.0 sessions.


In [19]:
# count number of sessions per user per set
session_counts = sessionized_df.groupby(['set', 'userId'])['session_id'].nunique()

# averaging
average_session_counts_per_set = session_counts.groupby(level='set').mean().round(1)

print("Per set, a user has on average the following number of sessions: \n", average_session_counts_per_set)

Per set, a user has on average the following number of sessions: 
 set
test     16.0
train    82.6
val      15.9
Name: session_id, dtype: float64


In [20]:
# create a global session id: session ids are unique per user but not globally,
# so we combine userId and session_id - each global session id then describes
# exactly one user in one specific session and is never shared across users
sessionized_df_overall['global_session_id'] = sessionized_df_overall['userId'].astype(str) + "_" + sessionized_df_overall['session_id'].astype(str)

# group by itemId and count the sessions for each item
item_sessions_count = sessionized_df_overall.groupby('itemId')['global_session_id'].nunique()

# average sessions per item
avg_sessions_per_item = item_sessions_count.mean().round(1)

print(f"On average, each item appears in {avg_sessions_per_item} sessions.")


On average, each item appears in 1341.8 sessions.


In [21]:
# each global_set_session_id describes one user in a specific session in a specific set
# (the "_" between userId and session_id keeps IDs distinct, e.g. user 1/session 23 vs. user 12/session 3)
sessionized_df_overall['global_set_session_id'] = sessionized_df_overall['set'].astype(str) + "_" + sessionized_df_overall['userId'].astype(str) + "_" + sessionized_df_overall['session_id'].astype(str)

# group by itemid, then by set, and count sessions
item_session_count_per_set = sessionized_df_overall.groupby(['itemId', 'set'])['global_set_session_id'].nunique()

avg_item_session_count_per_set = item_session_count_per_set.groupby(level='set').mean().round(1)

print("On average, for each set, each item appears in the following number of sessions: \n", avg_item_session_count_per_set)

On average, for each set, each item appears in the following number of sessions: 
 set
test       74.1
train    1203.0
val        74.4
Name: global_set_session_id, dtype: float64


In [22]:
""" file_path_testdata = './data/testdata.csv' # adjust as needed
df_testdata = pd.read_csv(file_path_testdata)

#sessionized
sessionized_testdata = df_testdata.groupby(['userId']).apply(detect_sessions).reset_index(drop=True)

#globally sessionized
sessionized_testdata['global_session_id'] = sessionized_testdata['userId'].astype(str) + "_" + sessionized_testdata['session_id'].astype(str)

sessionized_testdata """


In [23]:
# intra-session repetitions: number of times each user plays each item within a single session
reps_per_item_per_session_intra = sessionized_df_overall.groupby(['userId', 'itemId', 'global_session_id']).size().reset_index(name='n_reps')
reps_per_item_per_session_intra

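| | userId | itemId | global_session_id | n_reps |
|---|---|---|---|---|
| 0 | 0 | 0 | 0_0 | 1 |
| 1 | 0 | 0 | 0_117 | 3 |
| 2 | 0 | 0 | 0_122 | 1 |
| 3 | 0 | 0 | 0_25 | 1 |
| 4 | 0 | 0 | 0_35 | 1 |
| ... | ... | ... | ... | ... |
| 1179425 | 3622 | 592 | 3622_22 | 1 |
| 1179426 | 3622 | 592 | 3622_23 | 1 |
| 1179427 | 3622 | 592 | 3622_4 | 1 |
| 1179428 | 3622 | 592 | 3622_7 | 2 |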


In [24]:
reps_per_session = reps_per_item_per_session_intra.groupby('global_session_id')['n_reps'].sum().reset_index(name='n_reps')
reps_per_session

| | global_session_id | n_reps |
|---|---|---|
| 0 | 0_0 | 1 |
| 1 | 0_1 | 1 |
| 2 | 0_10 | 1 |
| 3 | 0_100 | 29 |
| 4 | 0_101 | 60 |
| ... | ... | ... |
| 311504 | 9_59 | 2 |
| 311505 | 9_6 | 8 |
| 311506 | 9_7 | 5 |
| 311507 | 9_8 | 32 |


In [25]:
intra_session_rep_rate = reps_per_session['n_reps'].mean()
intra_session_rep_rate

5.08433143183664
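Summing `n_reps` over the items of a session gives the total number of plays in that session, so the mean above equals the average number of interactions per session (1583815 interactions / 311509 sessions ≈ 5.08). The sketch below reproduces this chain on a small made-up table. As a purely hypothetical variant that is not part of the analysis above, it also counts only the plays beyond the first one of each item (`n_reps - 1`).

```python
import pandas as pd

# made-up interactions: user 1 plays item 7 twice in session 1_0
toy = pd.DataFrame({
    "userId": [1, 1, 1, 1, 2],
    "itemId": [7, 7, 8, 7, 7],
    "global_session_id": ["1_0", "1_0", "1_0", "1_1", "2_0"],
})

# plays of each item within each session (same grouping as in the cell above)
plays = (
    toy.groupby(["userId", "itemId", "global_session_id"])
    .size()
    .reset_index(name="n_reps")
)

# total plays per session: 1_0 -> 3, 1_1 -> 1, 2_0 -> 1
plays_per_session = plays.groupby("global_session_id")["n_reps"].sum()

# hypothetical variant: count only plays beyond the first one of each item
plays["n_repeats"] = plays["n_reps"] - 1
repeats_per_session = plays.groupby("global_session_id")["n_repeats"].sum()

print(plays_per_session.mean())    # 5 plays / 3 sessions
print(repeats_per_session.mean())  # 1 repeat / 3 sessions
```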

In [26]:
# intra-session repetitions per set: plays of each user-item pair within a session, split by set
reps_per_item_per_session_intra_per_set = sessionized_df_overall.groupby(['set', 'userId', 'itemId', 'global_session_id']).size().reset_index(name='n_reps')
reps_per_item_per_session_intra_per_set

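| | set | userId | itemId | global_session_id | n_reps |
|---|---|---|---|---|---|
| 0 | test | 0 | 14 | 0_115 | 1 |
| 1 | test | 0 | 14 | 0_123 | 3 |
| 2 | test | 0 | 14 | 0_26 | 1 |
| 3 | test | 0 | 14 | 0_35 | 2 |
| 4 | test | 0 | 14 | 0_65 | 1 |
| ... | ... | ... | ... | ... | ... |
| 1179425 | val | 3622 | 584 | 3622_23 | 1 |
| 1179426 | val | 3622 | 584 | 3622_4 | 3 |
| 1179427 | val | 3622 | 584 | 3622_6 | 1 |
| 1179428 | val | 3622 | 584 | 3622_7 | 2 |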


In [27]:
reps_per_session_per_set = reps_per_item_per_session_intra_per_set.groupby(['set', 'global_session_id'])['n_reps'].sum().reset_index(name='n_reps')
reps_per_session_per_set

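| | set | global_session_id | n_reps |
|---|---|---|---|
| 0 | test | 0_101 | 1 |
| 1 | test | 0_115 | 1 |
| 2 | test | 0_123 | 3 |
| 3 | test | 0_131 | 1 |
| 4 | test | 0_26 | 1 |
| ... | ... | ... | ... |
| 405466 | val | 9_13 | 2 |
| 405467 | val | 9_28 | 3 |
| 405468 | val | 9_30 | 3 |
| 405469 | val | 9_33 | 2 |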


In [28]:
intra_session_rep_rate_per_set = reps_per_session_per_set.groupby('set')['n_reps'].mean().reset_index(name='mean_reps')
intra_session_rep_rate_per_set

| | set | mean_reps |
|---|---|---|
| 0 | test | 1.474155 |
| 1 | train | 4.790931 |
| 2 | val | 1.479704 |


In [29]:
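# inter-session repetitions: plays of each user-item pair per session, summed per user across sessions further below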
reps_per_item_per_session_inter = sessionized_df_overall.groupby(['userId', 'itemId', 'global_session_id']).size().reset_index(name='reps')
reps_per_item_per_session_inter

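| | userId | itemId | global_session_id | reps |
|---|---|---|---|---|
| 0 | 0 | 0 | 0_0 | 1 |
| 1 | 0 | 0 | 0_117 | 3 |
| 2 | 0 | 0 | 0_122 | 1 |
| 3 | 0 | 0 | 0_25 | 1 |
| 4 | 0 | 0 | 0_35 | 1 |
| ... | ... | ... | ... | ... |
| 1179425 | 3622 | 592 | 3622_22 | 1 |
| 1179426 | 3622 | 592 | 3622_23 | 1 |
| 1179427 | 3622 | 592 | 3622_4 | 1 |
| 1179428 | 3622 | 592 | 3622_7 | 2 |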


In [30]:
reps_per_user_across_sessions = reps_per_item_per_session_inter.groupby('userId')['reps'].sum().reset_index(name='reps')
reps_per_user_across_sessions

| | userId | reps |
|---|---|---|
| 0 | 0 | 1629 |
| 1 | 1 | 209 |
| 2 | 2 | 573 |
| 3 | 3 | 1270 |
| 4 | 4 | 788 |
| ... | ... | ... |
| 3618 | 3618 | 537 |
| 3619 | 3619 | 548 |
| 3620 | 3620 | 457 |
| 3621 | 3621 | 602 |


In [31]:
inter_session_rep_rate = reps_per_user_across_sessions['reps'].mean().round(2)
inter_session_rep_rate

437.16

In [32]:
# inter-session repetitions per set: plays of each user-item pair per session, split by set
reps_per_item_per_session_inter_per_set = sessionized_df_overall.groupby(['set', 'userId', 'itemId', 'global_session_id']).size().reset_index(name='reps')
reps_per_item_per_session_inter_per_set

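| | set | userId | itemId | global_session_id | reps |
|---|---|---|---|---|---|
| 0 | test | 0 | 14 | 0_115 | 1 |
| 1 | test | 0 | 14 | 0_123 | 3 |
| 2 | test | 0 | 14 | 0_26 | 1 |
| 3 | test | 0 | 14 | 0_35 | 2 |
| 4 | test | 0 | 14 | 0_65 | 1 |
| ... | ... | ... | ... | ... | ... |
| 1179425 | val | 3622 | 584 | 3622_23 | 1 |
| 1179426 | val | 3622 | 584 | 3622_4 | 3 |
| 1179427 | val | 3622 | 584 | 3622_6 | 1 |
| 1179428 | val | 3622 | 584 | 3622_7 | 2 |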


In [33]:
reps_per_user_across_sessions_per_set = reps_per_item_per_session_inter_per_set.groupby(['set', 'userId'])['reps'].sum().reset_index(name='reps')
reps_per_user_across_sessions_per_set

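| | set | userId | reps |
|---|---|---|---|
| 0 | test | 0 | 14 |
| 1 | test | 1 | 13 |
| 2 | test | 2 | 22 |
| 3 | test | 3 | 21 |
| 4 | test | 4 | 14 |
| ... | ... | ... | ... |
| 10864 | val | 3618 | 14 |
| 10865 | val | 3619 | 41 |
| 10866 | val | 3620 | 27 |
| 10867 | val | 3621 | 15 |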


In [34]:
inter_session_rep_rate_per_set = reps_per_user_across_sessions_per_set.groupby('set')['reps'].mean().round(2)
inter_session_rep_rate_per_set

set
test      22.13
train    393.02
val       22.00
Name: reps, dtype: float64

In [35]:
#avg length per SINGLE user session 

In [36]:
#avg number of interactions per SINGLE user session
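One possible way to compute these two remaining measures is sketched below on made-up data. It assumes that the length of a session is the time between its first and last interaction (so a single-interaction session has length 0, since track durations are not part of the dataset), and it uses the `userId`, `session_id` and `timestamp` columns as they appear in the sessionized DataFrame.

```python
import pandas as pd

# made-up sessionized interactions
toy = pd.DataFrame({
    "userId":     [1, 1, 1, 1, 2, 2],
    "session_id": [0, 0, 0, 1, 0, 0],
    "timestamp":  [0, 120, 300, 5000, 50, 650],  # seconds
})

# first/last timestamp and number of interactions of every user session
stats = toy.groupby(["userId", "session_id"])["timestamp"].agg(["min", "max", "count"])

# assumed definition of session length: last timestamp minus first timestamp
stats["duration_s"] = stats["max"] - stats["min"]

print(stats["duration_s"].mean())  # 300.0 (durations 300, 0, 600)
print(stats["count"].mean())       # 2.0   (3, 1 and 2 interactions)
```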

- Total count of user-item interactions per set -> DONE
- Average count of user-item interactions per user -> DONE
- Average count of user-item interactions per user per set -> DONE
- Average count of user-item interactions per item -> DONE
- Average count of user-item interactions per item per set -> DONE
- How many repetitions in the whole dataset
- How many repetitions on average per user -> DONE

Session-based measures:

- Total number of sessions
- Average number of sessions per user
- Average number of sessions per user, per set
- Average number of sessions an item is in
- Average number of sessions an item is in, per set
- How many repetitions per user-session -> INTRA SESSION
- How many repetitions per user across sessions -> INTER SESSION
- Average length of user-session (timestamp)
- Average number of interactions within one user session

## References

[a] Session-aware recommendation paper

[b] Ex2Vec paper