# Practical Work in AI

In [1]:
# installs necessary libraries

#!pip install pandas

In [2]:
# necessary imports for notebook to run
import pandas as pd

## 1. Dataset analysis

### 1.1 Structure of processed.csv

As a first step, we compute descriptive statistics on the given dataset to check whether it can be sessionized and then used to train sequence-aware recommender system models [a], such as GRU4Rec.

For the analysis, we are working with the "processed.csv" dataset, which contains user-item interactions and represents a pre-processed version of the "new_release_stream.csv" file. A single row in this file looks like the following:

`28,60,188807,1,[0.0],train`

Here, the user with ID 28 listened to the track with ID 60 at timestamp 188807 (measured in seconds from first consumption). The user listened to more than 80% of the track's length (y = 1). The relational interval, i.e. the time (in hours) between this interaction and the user's previous interactions with the same item, is [0.0], meaning that this is the first interaction of user 28 with track 60. Based on the train-test-val split done during pre-processing (see `preprocess.py`), this interaction is part of the training set.
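To make the row layout concrete, the following minimal sketch parses one raw line into named fields. The column names are taken from the `nunique()` output further down in this notebook; the helper `parse_row` is hypothetical and not part of `preprocess.py`.

```python
import ast
import csv
import io

# column names as they appear in the notebook's DataFrame
COLUMNS = ["userId", "itemId", "timestamp", "y", "relational_interval", "set"]

def parse_row(line):
    """Parse one raw CSV line into a dict with typed values (illustrative helper)."""
    # csv.reader handles the quoted list field, which itself contains commas
    fields = next(csv.reader(io.StringIO(line)))
    record = dict(zip(COLUMNS, fields))
    for key in ("userId", "itemId", "timestamp", "y"):
        record[key] = int(record[key])
    # the relational interval is stored as the text of a Python list
    record["relational_interval"] = ast.literal_eval(record["relational_interval"])
    return record

row = parse_row('28,60,188977,1,"[0.04722222222222222, 0.0]",train')
print(row)
```

Note that `pd.read_csv` loads the `relational_interval` column as plain strings, so a similar `ast.literal_eval` step would be needed before working with the individual values.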

Repetitive behavior is measured via the relational interval column (here [0.0]). Hence, the following lines from the dataset:

`28,60,188807,1,[0.0],train`

`28,60,188977,1,"[0.04722222222222222, 0.0]",train`

`28,60,189155,0,"[0.09666666666666666, 0.049444444444444444]",train`

have the following meaning with regards to the repetitions:

`28,60,188807,1,[0.0],train` = User 28 consumes item 60 for the first time, so no previous consumptions are logged in the relational interval.

`28,60,188977,1,"[0.04722222222222222, 0.0]",train` = User 28 consumed item 60 for the first time 188977 - 188807 = 170 s ago, i.e. 170 / 60 / 60 ≈ 0.047222 h ago.

`28,60,189155,0,"[0.09666666666666666, 0.049444444444444444]",train` = User 28 last consumed item 60 189155 - 188977 = 178 s ago, i.e. 178 / 60 / 60 ≈ 0.049444 h ago, and consumed it for the first time 0.049444444444444444 + 0.04722222222222222 = 0.09666666666666666 h ago.
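The arithmetic above can be reproduced with a short sketch. The function `hours_since_previous` is a hypothetical helper, not the code from `preprocess.py`: it assumes that each interaction stores the elapsed hours to every earlier consumption of the same item (oldest first) and that a first consumption is encoded as `[0.0]`. It does not attempt to reproduce the trailing `0.0` seen in the second example row.

```python
def hours_since_previous(timestamps):
    """For each consumption, list the hours elapsed since every earlier consumption.

    `timestamps` are the consumption times (in seconds) of one user-item pair,
    in chronological order.
    """
    result = []
    for i, current in enumerate(timestamps):
        earlier = timestamps[:i]
        if not earlier:
            # assumption: a first consumption is encoded as [0.0]
            result.append([0.0])
        else:
            # seconds -> hours
            result.append([(current - t) / 3600 for t in earlier])
    return result

intervals = hours_since_previous([188807, 188977, 189155])
for entry in intervals:
    print(entry)
```

For the third consumption this yields the two values from the example row: about 0.096667 h since the first consumption and about 0.049444 h since the second.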

### 1.2 Statistical Analysis

The analysis of the dataset is separated into two parts:

- **Global measures**: e.g. average count of user-item interactions per set, average count of user-item interactions per item, etc.
- **Session-based measures**: e.g. average number of sessions per user, average number of interactions per session, etc.

We start by reading in the data:

In [49]:
file_path = './data/processed.csv' # adjust as needed
df = pd.read_csv(file_path)

#### 1.2.1 Global measures

Firstly, we observe the number of interactions, unique users and unique items in the dataset:

In [50]:
print("Number of user-item interactions in total: ", len(df))
print("Number of unique users: ", df.userId.nunique())
print("Number of unique songs: ", df.itemId.nunique())

Number of user-item interactions in total:  1583815
Number of unique users:  3623
Number of unique songs:  879


These numbers are consistent with those reported in [b], where the authors apply a pre-processing step with two thresholds, $k^{item}$ and $k^{user}$: every user must have interacted with at least $k^{item}$ items, and every item must have been consumed by at least $k^{user}$ users. They chose $k^{item} = k^{user} = 30$.
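As an illustration of such a filtering step, here is a minimal sketch under stated assumptions. The function `filter_sparse` is hypothetical and is not the authors' implementation from [b]. In particular, it assumes that the thresholds refer to distinct items and distinct users, and that the filter is repeated until both constraints hold at the same time.

```python
import pandas as pd

def filter_sparse(frame, k_item=30, k_user=30):
    """Drop users with fewer than k_item distinct items and items with fewer
    than k_user distinct users, repeating until nothing changes (assumption)."""
    out = frame
    while True:
        items_per_user = out.groupby("userId")["itemId"].nunique()
        users_per_item = out.groupby("itemId")["userId"].nunique()
        keep_users = items_per_user[items_per_user >= k_item].index
        keep_items = users_per_item[users_per_item >= k_user].index
        filtered = out[out["userId"].isin(keep_users) & out["itemId"].isin(keep_items)]
        if len(filtered) == len(out):
            return filtered
        out = filtered

# toy data with thresholds of 2 instead of 30
toy_interactions = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3],
    "itemId": [10, 11, 10, 11, 12, 10],
})
toy_filtered = filter_sparse(toy_interactions, k_item=2, k_user=2)
print(toy_filtered)
```

In the toy data, item 12 is removed first because only user 3 consumed it; user 3 is then left with a single item and is removed in the next pass.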

##### **Number of user-item interactions per set**

In [29]:
# count number of rows per set, aka count size of training, test and validation dataset
set_counts = df['set'].value_counts()

# create a DataFrame out of the Series and rename the columns accordingly
set_counts = set_counts.reset_index()
set_counts.columns = ['set', 'count']

print("Number of user-item interactions for training-, test-, and validation set:")
set_counts


Number of user-item interactions for training-, test-, and validation set:


| | set | count |
|---|---|---|
| 0 | train | 1423927 |
| 1 | test | 80166 |
| 2 | val | 79722 |


As we can see, after splitting the data, the training set contains ~1.42M interactions, the test set ~80.2k, and the validation set ~79.7k.
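To put these counts into proportion, the following sketch computes the share of each set from the numbers printed above. It only uses the reported counts and makes no assumption about how `preprocess.py` performs the split.

```python
# interaction counts per set, copied from the output above
set_sizes = {"train": 1423927, "test": 80166, "val": 79722}

total = sum(set_sizes.values())
shares = {name: size / total for name, size in set_sizes.items()}

print("Total interactions:", total)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The three counts add up to the 1583815 interactions reported earlier, and the resulting split is roughly 90% training, 5% test, and 5% validation.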

##### **Average number of interactions, per user, and per item, in total and per set**

In [45]:
user_interaction_counts = df['userId'].value_counts() # per userId, count the interactions
avg_user_counts = user_interaction_counts.mean() # average interaction counts over all users

print(f"On average, a user has {int(avg_user_counts)} interactions in total.")

On average, a user has 437 interactions in total.


In [55]:
user_interaction_counts_per_set = df.groupby(['set', 'userId']).size()

user_interaction_counts_per_set = user_interaction_counts_per_set.groupby(level='set').mean()

print("Per set, a user has on average the following numbers of interactions: \n", user_interaction_counts_per_set)

Per set, a user has on average the following numbers of interactions: 
 set
test      22.126967
train    393.024289
val       22.004416
dtype: float64


In [59]:
item_interaction_counts = df['itemId'].value_counts() # per track, count the interactions
avg_item_counts = item_interaction_counts.mean() # average interaction counts over all items

print(f"On average, an item is interacted with {int(avg_item_counts)} times.")

On average, an item is interacted with 1801 times.


In [60]:
item_interaction_counts_per_set = df.groupby(['set', 'itemId']).size()

item_interaction_counts_per_set = item_interaction_counts_per_set.groupby(level='set').mean()

print("Per set, on average, an item is interacted with the following number of times: \n", item_interaction_counts_per_set)

Per set, on average, an item is interacted with the following number of times: 
 set
test       98.002445
train    1619.939704
val        98.422222
dtype: float64
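The per-set means above are averages over the users (or items) that actually occur in each set, so dividing the set size by the mean recovers how many distinct users or items each set contains. The sketch below does this with the printed values; because these are rounded, the results are rounded to the nearest integer. The helper `implied_group_count` is hypothetical.

```python
# values copied from the outputs above
set_sizes = {"train": 1423927, "test": 80166, "val": 79722}
mean_per_user = {"train": 393.024289, "test": 22.126967, "val": 22.004416}
mean_per_item = {"train": 1619.939704, "test": 98.002445, "val": 98.422222}

def implied_group_count(total, mean):
    """Number of groups implied by a total and a per-group mean (mean = total / n)."""
    return round(total / mean)

users_per_set = {s: implied_group_count(set_sizes[s], mean_per_user[s]) for s in set_sizes}
items_per_set = {s: implied_group_count(set_sizes[s], mean_per_item[s]) for s in set_sizes}

print("Distinct users per set:", users_per_set)
print("Distinct items per set:", items_per_set)
```

Under this reading, all 3623 users appear in every set, whereas only the training set covers all 879 items; the test and validation sets contain 818 and 810 distinct items, respectively.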


#### 1.2.2 Session-based measures

In [2]:
# Assign session IDs to the interactions of one group. A gap of more than
# `threshold` seconds (default: 1800 s = 30 min) starts a new session.
def detect_sessions(group, threshold=1800):
    # sort by timestamp so that session boundaries are detected correctly
    group = group.sort_values('timestamp')
    # difference between consecutive timestamps: timestamp_j - timestamp_i
    time_diff = group['timestamp'].diff()
    # running count of session starts: it increases by one whenever a new session
    # begins and stays constant otherwise, so it serves as a session ID within the group
    sessions = (time_diff > threshold).cumsum()
    group['session_id'] = sessions
    return group

# for each set (train, val, test), group the interactions by user and detect each user's sessions
sessionized_df = df.groupby(['set', 'userId']).apply(detect_sessions).reset_index(drop=True)

# group by set, user, and session_id to get the number of interactions per session
session_counts = sessionized_df.groupby(['set', 'userId', 'session_id']).size()

print("Number of interactions per user session:")
print(session_counts)

Number of interactions per user session:
set   userId  session_id
test  0       0             1
              1             1
              2             2
              3             1
              4             1
                           ..
val   3622    18            1
              19            1
              20            1
              21            1
              22            1
Length: 414420, dtype: int64
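The sessionizing rule can be checked on a toy example. The sketch below is an alternative, vectorized formulation that avoids `groupby(...).apply`; the function `add_session_ids` and the toy data are hypothetical, and the 1800 s threshold is the same as above.

```python
import pandas as pd

def add_session_ids(frame, threshold=1800):
    # sort so that the rows of one (set, user) pair are in time order
    out = frame.sort_values(["set", "userId", "timestamp"]).copy()
    # gap to the previous interaction of the same user in the same set (NaN for the first row)
    gaps = out.groupby(["set", "userId"])["timestamp"].diff()
    # a gap above the threshold marks the first interaction of a new session
    new_session = (gaps > threshold).astype(int)
    # cumulative count of session starts per (set, user) gives 0-based session IDs
    out["session_id"] = new_session.groupby([out["set"], out["userId"]]).cumsum()
    return out

toy = pd.DataFrame({
    "set": ["train"] * 7,
    "userId": [1, 1, 1, 1, 1, 2, 2],
    "timestamp": [0, 100, 2000, 2100, 5000, 10, 4000],
})
toy_sessions = add_session_ids(toy)
print(toy_sessions)
```

User 1 has gaps of 100 s, 1900 s, 100 s, and 2900 s, so the session IDs are 0, 0, 1, 1, 2; user 2 has a single gap of 3990 s and therefore two sessions with IDs 0 and 1.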


In [5]:
sessionized_df.nunique()

userId                    3623
itemId                     879
timestamp              1376164
y                            2
relational_interval    1202141
set                          3
session_id                 426
dtype: int64

In [3]:
# only consider sessions with at least 'interaction_threshold' number of interactions
interaction_threshold = 20
filtered_sessions = session_counts[session_counts >= interaction_threshold]

print("Filtered sessions:")
print(filtered_sessions)

# check number of reps, how many items are repeated in the same 
# how many sessions of user on average and length of sessions
# avg interactions per session + per item + per user


Filtered sessions:
set    userId  session_id
test   341     1             20
       2078    7             27
       2796    4             33
       2988    0             22
train  0       24            31
                             ..
       3622    21            32
               22            30
val    1343    4             21
       2146    2             31
       2758    3             23
Length: 9753, dtype: int64
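A few summary figures follow directly from the outputs above. The sketch below only does arithmetic on the reported numbers (total interactions, total sessions, and sessions with at least 20 interactions); it does not recompute anything from the data.

```python
# values copied from the outputs above
total_interactions = 1583815   # rows in processed.csv
total_sessions = 414420        # length of session_counts
long_sessions = 9753           # sessions with >= 20 interactions

# every interaction belongs to exactly one session
avg_interactions_per_session = total_interactions / total_sessions
share_long_sessions = long_sessions / total_sessions

print(f"Average interactions per session: {avg_interactions_per_session:.2f}")
print(f"Share of sessions with >= 20 interactions: {share_long_sessions:.2%}")
```

With a 30-minute cutoff, a session contains fewer than four interactions on average, and only about 2.4% of all sessions reach the threshold of 20 interactions.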


In [11]:
filtered_sessions['train']
# avg number of sessions per user

# min items per user 5, min sessions per user 5, maybe reduce it to 3
# accumulative approach - weighted average, mean, etc. weighted sum
# partial sequences
# combination of embeddings
# put into tables
# compare latent spaces

userId  session_id
0       24            31
        25            20
        26            29
        27            31
        28            20
                      ..
3622    1             29
        5             23
        9             22
        21            32
        22            30
Length: 9746, dtype: int64

## References

[a] Session-aware recommendation paper

[b] Ex2Vec paper