# Practical Work in AI

In [2]:
# installs necessary libraries

#!pip install pandas

In [3]:
# necessary imports for notebook to run
import pandas as pd

## 1. Dataset analysis

### 1.1 Structure of processed.csv

As a first step, certain statistics have to be observed on the given dataset in order to see if it is feasible to be sessionized and further used for training sequence-aware recommender systems models [a], such as GRU4Rec. 

For the analysis, we are working with the "processed.csv" dataset, which contains user-item interactions and represents a pre-processed version of the "new_release_stream.csv" file. A single row in this file looks like the following:

`28,60,188807,1,[0.0],train`

Here, the user with ID 28 listened to the track with ID 60 at timestamp 188807 (measured in seconds from first consumption). The track was listened to over 80% of its length (y = 1). The time interval (measured in hours) between when user 28 interacted with item 60 this time, versus the previous times, is 0.0, meaning it is the first interaction from user 28 to track 60. Based on the train-test-val split done during pre-processing (see `preprocess.py`), this interaction is part of the training set. 

Repetitive behavior is measured via the relational interval column (here [0.0]). Hence, the following lines from the dataset:

`28,60,188807,1,[0.0],train`

`28,60,188977,1,"[0.04722222222222222, 0.0]",train`

`28,60,189155,0,"[0.09666666666666666, 0.049444444444444444]",train`

have the following meaning with regards to the repetitions:

`28,60,188807,1,[0.0],train` = User 28 consumes item 60 for the first time, thus no previous consumptions logged in the relational interval

`28,60,188977,1,"[0.04722222222222222, 0.0]",train` = User 28 consumed item 60 for the first time 188977-188807 = 170s / 60 / 60 = 0.047222hrs ago

`28,60,189155,0,"[0.09666666666666666, 0.049444444444444444]",train` = User 28 consumed item 60 since the last time 189155-188977 = 178s / 60 / 60 = 0.049444hrs ago and for the first time 0.049444444444444444 + 0.04722222222222222 = 0.09666666666666666 hrs ago

### 1.2 Statistical Analysis

The analysis of the dataset is separated into two parts:

- **Global measures**: e.g. average count of user-item interactions per set, average count of user-item interactions per item, etc.
- **Session-based measures**: e.g. average number of sessions per user, average number of interactions per session, etc.

We start by reading in the data:

In [4]:
file_path = './data/processed.csv' # adjust as needed
df = pd.read_csv(file_path)

#### 1.2.1 Global measures

Firstly, we observe the number of interactions, unique users and unique items in the dataset:

In [5]:
print("Number of user-item interactions in total: ", len(df))
print("Number of unique users: ", df.userId.nunique())
print("Number of unique songs: ", df.itemId.nunique())

Number of user-item interactions in total:  1583815
Number of unique users:  3623
Number of unique songs:  879


These numbers are consistent with the reportings from [b], after the authors implemented a $k^{item}$ and a $k^{user}$ pre-processing step, where each user has to have interacted with at least $k^{item}$ items and every item has to have been consumed by $k^{user}$ users. They decided on $k^{item} = k^{user} = 30$.

##### **Number of user-item interactions per set**

In [6]:
# count number of rows per set, aka count size of training, test and validation dataset
set_counts = df['set'].value_counts()

# create df outof series and rename columns accordingly
set_counts = set_counts.reset_index()
set_counts.columns = ['set', 'count']

print("Number of user-item interactions for training-, test-, and validation set:")
set_counts


Number of user-item interactions for training-, test-, and validation set:


Unnamed: 0,set,count
0,train,1423927
1,test,80166
2,val,79722


As we can see, after splitting the data, we have numbers of interactions of train = ~1.4m, test = ~80.1k and validation = ~79.7k. 

##### **Average number of interactions per user, in total and per set**

In [7]:
user_interaction_counts = df['userId'].value_counts() # per userId, count the interactions
avg_user_counts = user_interaction_counts.mean() # average interaction counts over all users

print(f"On average, a user has {round(avg_user_counts, 1)} item interactions in total.")

On average, a user has 437.2 item interactions in total.


In [8]:
user_interaction_counts_per_set = df.groupby(['set', 'userId']).size()

user_interaction_counts_per_set = user_interaction_counts_per_set.groupby(level='set').mean().round(1)

print("Per set, a user has on average the following numbers of interactions: \n", user_interaction_counts_per_set)

Per set, a user has on average the following numbers of interactions: 
 set
test      22.1
train    393.0
val       22.0
dtype: float64


##### **Average number of interactions per item, in total and per set**

In [9]:
item_interaction_counts = df['itemId'].value_counts() # per track, count the interactions
avg_item_counts = item_interaction_counts.mean() # average interaction counts over all items

print(f"On average, an item is interacted with {round(avg_item_counts, 1)} times.")

On average, an item is interacted with 1801.8 times.


In [10]:
item_interaction_counts_per_set = df.groupby(['set', 'itemId']).size()

item_interaction_counts_per_set = item_interaction_counts_per_set.groupby(level='set').mean().round(1)

print("Per set, on average, an item is interacted with the following number of times: \n", item_interaction_counts_per_set)

Per set, on average, an item is interacted with the following number of times: 
 set
test       98.0
train    1619.9
val        98.4
dtype: float64


##### **Average count of repetitions across all users, in total and per set**

In [11]:
user_item_same_pairs_counts = df.groupby(['userId', 'itemId']).size() # counts the number of rows per same user-item pair

average_repetitions = user_item_same_pairs_counts.mean() # averages them

print(f"On average, a user repeats a particular track {round(average_repetitions, 1)} times.")

On average, a user repeats a particular track 10.9 times.


In [12]:
user_item_same_pairs_counts_per_set = df.groupby(['set', 'userId', 'itemId']).size() # counts the number of same user-item pairs per set

average_repetitions_per_set = user_item_same_pairs_counts_per_set.groupby(level='set').mean().round(1) # averages them for each set

print("Per set, on average, a user repeats a particular item the following number of times: \n", average_repetitions_per_set)

Per set, on average, a user repeats a particular item the following number of times: 
 set
test     11.1
train    10.8
val      11.0
dtype: float64


#### 1.2.2 Session-based measures

In [13]:
# define cutoff value for sessionizing (here, 30 mins (1800s) is taken)
THRESHOLD = 1800
def detect_sessions(unique_user_interactions, threshold=THRESHOLD):
    # sort interactions chronologically (should be given by dataset anyway, this is just a precaution)
    unique_user_interactions = unique_user_interactions.sort_values('timestamp')

    # calc time differences between consecutive timestamps timestamp_j - timestamp_i
    time_diff = unique_user_interactions['timestamp'].diff()

    # sums up separate sessions - splits to next group when a new session starts (time difference > 30 mins), otherwise it stays the same - this can be used as IDs/index to mark which rows in sorted unique_user_interactions belong to which session
    sessions = (time_diff > threshold).cumsum()
    unique_user_interactions['session_id'] = sessions
    return unique_user_interactions

In [14]:
# assign session id's based on unique user's interactions and threshold
sessionized_df_overall = df.groupby(['userId']).apply(detect_sessions).reset_index(drop=True)

# for each set, group each user (unique users in training set, unique users in val set, unique users in test set) and detect their sessions
sessionized_df = df.groupby(['set', 'userId']).apply(detect_sessions).reset_index(drop=True)

In [15]:
sessionized_df

Unnamed: 0,userId,itemId,timestamp,y,relational_interval,set,session_id
0,0,14,1296206,1,[0.0],test,0
1,0,298,1326351,1,[0.0],test,1
2,0,14,1449841,1,"[42.67638888888889, 0.0]",test,2
3,0,14,1450319,0,"[42.80916666666667, 0.13277777777777777]",test,2
4,0,14,2895492,1,"[444.24611111111113, 401.5697222222222, 0.0]",test,3
...,...,...,...,...,...,...,...
1583810,3622,559,9968460,1,"[597.9216666666666, 593.5372222222222, 572.214...",val,18
1583811,3622,559,10140227,1,"[645.6347222222222, 641.2502777777778, 619.927...",val,19
1583812,3622,584,10144708,1,"[641.2966666666666, 595.9677777777778, 595.891...",val,20
1583813,3622,559,10391098,1,"[715.3211111111111, 710.9366666666666, 689.614...",val,21


In [16]:
# Filter the sessionized DataFrame for user with userId 2
user_sessions = sessionized_df_overall[sessionized_df_overall['userId'] == 3]

# Display all sessions for this user
print(user_sessions)

      userId  itemId  timestamp  y                       relational_interval  \
2411       3       3       8076  0                                        []   
2412       3       6     196538  0                                        []   
2413       3      49     242578  1                                     [0.0]   
2414       3      30     301625  0                                        []   
2415       3      88     317156  0                                        []   
...      ...     ...        ... ..                                       ...   
3676       3      61    5152873  0                      [1011.2397222222222]   
3677       3      61    5152952  0                      [1011.2616666666667]   
3678       3     298    5153050  0                      [1011.3916666666667]   
3679       3     278    5153053  0  [458.37694444444446, 18.959722222222222]   
3680       3     298    5153134  0                                [1011.415]   

        set  session_id  
2411  train  

##### **Totel number of sessions in the dataset**

In [17]:
# group number of sessions per user and sum them up to get total amount of sessions - this way, duplicate session IDs across users are permitted
n_sessions_total = len(sessionized_df_overall.groupby(['userId', 'session_id']).size())

print(f"In total, the dataset consists of {n_sessions_total} sessions.")

In total, the dataset consists of 311509 sessions.


##### **Average number of sessions per user, in total and per set**

In [18]:
# count id's across users and calculate average of that count across all users
session_counts_user_overall = sessionized_df_overall.groupby('userId')['session_id'].nunique().mean().round(1)

print(f"On average, a user has {session_counts_user_overall} sessions.")

On average, a user has 86.0 sessions.


In [19]:
# count number of sessions per user per set
session_counts = sessionized_df.groupby(['set', 'userId'])['session_id'].nunique()

# averaging
average_session_counts_per_set = session_counts.groupby(level='set').mean().round(1)

print("Per set, a user has on average the following number of sessions: \n", average_session_counts_per_set)

Per set, a user has on average the following number of sessions: 
 set
test     16.0
train    82.6
val      15.9
Name: session_id, dtype: float64


In [20]:
# create global session id to account for the fact that while session id's are unique per user, they are not unique globally, each global session id exactly describes one user in a specific session and is not the same across multiple users
sessionized_df_overall['global_session_id'] = sessionized_df_overall['userId'].astype(str) + "_" + sessionized_df_overall['session_id'].astype(str)

# group by itemId and count the sessions for each item
item_sessions_count = sessionized_df_overall.groupby('itemId')['global_session_id'].nunique()

# average sessions per item
avg_sessions_per_item = item_sessions_count.mean().round(1)

print(f"On average, each item appears in {avg_sessions_per_item} sessions.")


On average, each item appears in 1341.8 sessions.


In [21]:
# each global_set_session_id describes one user in a specific session in a specific set
sessionized_df_overall['global_set_session_id'] = sessionized_df_overall['set'].astype(str) + "_" + sessionized_df_overall['userId'].astype(str) + sessionized_df_overall['session_id'].astype(str)

# group by itemid, then by set, and count sessions
item_session_count_per_set = sessionized_df_overall.groupby(['itemId', 'set'])['global_set_session_id'].nunique()

avg_item_session_count_per_set = item_session_count_per_set.groupby(level='set').mean().round(1)

print("On average, for each set, each item appears in the following number of sessions: \n", avg_item_session_count_per_set)

On average, for each set, each item appears in the following number of sessions: 
 set
test       74.1
train    1203.0
val        74.4
Name: global_set_session_id, dtype: float64


In [24]:
sessionized_df_overall

Unnamed: 0,userId,itemId,timestamp,y,relational_interval,set,session_id,global_session_id,global_set_session_id
0,0,0,0,0,[],train,0,0_0,train_00
1,0,7,15690,0,[],train,1,0_1,train_01
2,0,15,38426,0,[],train,2,0_2,train_02
3,0,5,45670,1,[0.0],train,3,0_3,train_03
4,0,20,77618,0,[],train,4,0_4,train_04
...,...,...,...,...,...,...,...,...,...
1583810,3622,579,10398504,1,"[711.3533333333334, 665.9480555555556, 644.134...",train,23,3622_23,train_362223
1583811,3622,590,10398803,0,"[711.3533333333334, 665.9480555555556, 644.133...",train,23,3622_23,train_362223
1583812,3622,591,10398827,1,"[711.2280555555556, 665.8227777777778, 644.008...",train,23,3622_23,train_362223
1583813,3622,592,10398988,1,"[711.2280555555556, 665.8227777777778, 644.008...",train,23,3622_23,train_362223


In [30]:
# count repetitions of each item per session
intra_session_reps = sessionized_df_overall.groupby(['global_session_id', 'itemId']).size().reset_index(name='n_reps')

# average rep counts within a particular session across all items
avg_intra_session_reps_per_item = intra_session_reps.groupby('global_session_id')['n_reps'].mean()

# average across all sessions
avg_intra_session_reps = avg_intra_session_reps_per_item.mean().round(1)

print(f"On average, an item is repeated {avg_intra_session_reps} times within one session.")

On average, an item is repeated 1.3 times within one session.


In [44]:
sessionized_df_overall


Unnamed: 0,userId,itemId,timestamp,y,relational_interval,set,session_id,global_session_id,global_set_session_id
0,0,0,0,0,[],train,0,0_0,train_00
1,0,7,15690,0,[],train,1,0_1,train_01
2,0,15,38426,0,[],train,2,0_2,train_02
3,0,5,45670,1,[0.0],train,3,0_3,train_03
4,0,20,77618,0,[],train,4,0_4,train_04
...,...,...,...,...,...,...,...,...,...
1583810,3622,579,10398504,1,"[711.3533333333334, 665.9480555555556, 644.134...",train,23,3622_23,train_362223
1583811,3622,590,10398803,0,"[711.3533333333334, 665.9480555555556, 644.133...",train,23,3622_23,train_362223
1583812,3622,591,10398827,1,"[711.2280555555556, 665.8227777777778, 644.008...",train,23,3622_23,train_362223
1583813,3622,592,10398988,1,"[711.2280555555556, 665.8227777777778, 644.008...",train,23,3622_23,train_362223


In [None]:
# for each set, group each user (unique users in training set, unique users in val set, unique users in test set) and detect their sessions
sessionized_df = df.groupby(['set', 'userId']).apply(detect_sessions).reset_index(drop=True)

# Now group by set, user, and session_id to get number of sessions per set
session_counts = sessionized_df.groupby(['set', 'userId']).size()



print("Number of interactions per user session:")
print(session_counts)

Number of interactions per user session:
set   userId
test  0         14
      1         13
      2         22
      3         21
      4         14
                ..
val   3618      14
      3619      41
      3620      27
      3621      15
      3622      28
Length: 10869, dtype: int64


In [None]:
# only consider sessions with at least 'interaction_threshold' number of interactions
interaction_threshold = 20
filtered_sessions = session_counts[session_counts >= interaction_threshold]

print("Filtered sessions:")
print(filtered_sessions)

# check number of reps, how many items are repeated in the same 
# how many sessions of user on average and length of sessions
# avg interactions per session + per item + per user


Filtered sessions:
set    userId  session_id
test   341     1             20
       2078    7             27
       2796    4             33
       2988    0             22
train  0       24            31
                             ..
       3622    21            32
               22            30
val    1343    4             21
       2146    2             31
       2758    3             23
Length: 9753, dtype: int64


In [None]:
filtered_sessions['train']
# avg number of sessions per user

#min items per user 5, min sessions per user 5, maybe reduce it to 3

#accumulative approach - weighted average, mean, etc. weighted sum

# partial sequences 
# combination of embeddings

#put into tables

#compare latent spaces

userId  session_id
0       24            31
        25            20
        26            29
        27            31
        28            20
                      ..
3622    1             29
        5             23
        9             22
        21            32
        22            30
Length: 9746, dtype: int64

•	Total count of user-item interactions per set -> DONE
•	Average count of user-item interactions per user -> DONE
•	Average count of user-item interactions per user per set -> DONE
•	Average count of user-item interactions per item -> DONE
•	Average count of user-item interactions per item per set -> DONE
•	How many repetitions in the whole dataset
•	How many repetitions on average per user -> DONE
Session-based measures
•	Total number of sessions
•	Average amount of sessions per user
•	Average amount of sessions per user, per set
•	Average amount of sessions an item is in
•	Average amount of sessions an item is in, per session
•	How many repetitions per user-session -> INTRA SESSION
•	How many repetitions per user across sessions -> INTER SESSION
•	Average length of user-session (timestamp wise)
•	Average number of interactions within one user session

## References

[a] Session-aware recommendation paper

[b] Ex2Vec paper