In [1]:
import pandas as pd

This notebook investigates two train-test split strategies: a temporal user split and a temporal global split.

In [23]:
# loading transformed data
df = pd.read_parquet('../data/podcast_data_transformed.parquet')
df.head()

Unnamed: 0,user_id,prd_number,series_title,unique_title,platform,device_type,pub_date,episode_duration,genre,branding_channel,mother_channel,category,content_time_spent,date,time,completion_rate
0,000065a7ec329b0fc01a779ead0e8d38d987b070300113...,11032421443,Brinkmanns briks,Brinkmanns briks: Vi skal tale om pillerne_110...,web,PC,2024-10-30,3422,Fakta og debat,DR P1,DR P1,Oplysning og kultur,3423,2024-11-01,08:56:00,1.0
1,000065a7ec329b0fc01a779ead0e8d38d987b070300113...,11032422442,Hjernekassen på P1,Hjernekassen på P1: Forebyggelse_11032422442,web,PC,2024-10-29,3363,Fakta og debat,DR P1,DR P1,Oplysning og kultur,359,2024-11-01,11:19:00,0.10675
2,000065a7ec329b0fc01a779ead0e8d38d987b070300113...,11162405447,Ubegribeligt,Ubegribeligt: Vand_11162405447,web,PC,2024-10-31,3417,Fakta og debat,DR P1,DR P1,Aktualitet og debat,5160,2024-11-01,09:53:00,1.0
3,000065a7ec329b0fc01a779ead0e8d38d987b070300113...,11802437044,Stjerner og striber,"Stjerner og striber: Joken, der ikke vil dø_11...",web,PC,2024-11-01,2847,Aktualitet,DR P1,-,Nyheder,2847,2024-11-01,08:40:00,1.0
4,000065a7ec329b0fc01a779ead0e8d38d987b070300113...,11802451178,Tiden,"Tiden: Skraldemanden Trump, spansk oversvømmel...",web,PC,2024-11-01,947,Nyheder,DR Lyd,-,Nyheder,610,2024-11-01,09:27:00,0.644139


In [24]:
# grouping by prd_number and counting the number of plays for each episode
prd_grp_df = df.groupby('prd_number')['user_id'].count().sort_values(ascending=True)

# filtering away episodes with fewer than 10 interactions
filtered_df = df[df['prd_number'].isin(prd_grp_df[prd_grp_df >= 10].index)]

# number of unique users 
n_users = len(set(filtered_df['user_id']))

# number of interactions (rows in the df)
n_interactions = len(filtered_df)

# printing the summary statistics
print(f"Number of unique users: {n_users}")
print(f"Number of interactions: {n_interactions}")

Number of unique users: 142344
Number of interactions: 2951510


#### Temporal User Split

This strategy splits each user's interactions chronologically into a training set and a test set according to a fixed percentage. An 80-20 split (80% training data, 20% test data) is commonly recommended in the literature.

Investigating the consequences of implementing an 80-20 split.

In [25]:
# grouping by user_id and counting the number of prd_numbers for each user
df_grouped = filtered_df.groupby('user_id')['prd_number'].count().reset_index()
df_grouped.rename(columns={'prd_number': 'prd_count'}, inplace=True)

# number of users with at least 3 prd_numbers
df_grouped = df_grouped[df_grouped['prd_count'] >= 3]

# number of users left
users_set = set(df_grouped['user_id'])
n_users_usplit = len(users_set)

# number of interactions left (rows in df_filtered)
df_filtered = filtered_df[filtered_df['user_id'].isin(users_set)]
n_interactions_usplit = len(df_filtered)

# printing the results
print(f"Number of users with at least 3 prd_numbers: {n_users_usplit} ({n_users_usplit/n_users:.1%} of users are kept)")
print(f"Number of interactions left: {n_interactions_usplit} ({n_interactions_usplit/n_interactions:.1%} of interactions are kept)")

Number of users with at least 3 prd_numbers: 95395 (67.0% of users are kept)
Number of interactions left: 2887658 (97.8% of interactions are kept)
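
The cell above only measures how many users survive the minimum-interaction filter. The split itself could be sketched as follows. This is a minimal illustration, not the notebook's implementation: the function name `temporal_user_split` and the toy data are hypothetical, and it assumes that the `date` and `time` columns sort chronologically (ISO-formatted strings or datetime values).

```python
import pandas as pd


def temporal_user_split(interactions, train_frac=0.8,
                        user_col="user_id", time_cols=("date", "time")):
    """Split each user's interactions chronologically into train and test."""
    # order every user's history from oldest to newest
    ordered = interactions.sort_values([user_col, *time_cols])

    # 0-based position of each interaction within its user's history
    position = ordered.groupby(user_col).cumcount()

    # number of interactions of the user that owns each row
    history_len = ordered[user_col].map(ordered[user_col].value_counts())

    # the first floor(train_frac * n) interactions of a user go to train
    n_train = (history_len * train_frac).astype(int)
    is_train = position < n_train
    return ordered[is_train], ordered[~is_train]


# toy data (hypothetical): user "a" has 5 plays, user "b" has 3
toy = pd.DataFrame({
    "user_id": ["a"] * 5 + ["b"] * 3,
    "prd_number": [1, 2, 3, 4, 5, 1, 2, 3],
    "date": ["2024-11-05", "2024-11-01", "2024-11-03", "2024-11-02",
             "2024-11-04", "2024-11-02", "2024-11-01", "2024-11-03"],
    "time": ["08:00:00"] * 8,
})
train_part, test_part = temporal_user_split(toy)
```

With this rounding, a user with a single interaction ends up only in the test part, which is one reason a minimum-interaction filter like the one above is useful before splitting.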


#### Temporal Global Split
Define a single global timestamp that works as the boundary between train and test data for all users/interactions.

I want to make the cut on the `date` attribute, choosing a Monday as the split date so that no week is divided between the training and test data. This seems appropriate because podcast listening and publication follow weekly patterns, an assumption that still has to be checked in the EDA and in the related-work section on podcast listening patterns.
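
As a small sanity check, the candidate split dates can be verified to be Mondays, or generated directly with a weekly frequency anchored on Mondays. The sketch below uses the four dates tested in this notebook. The variable names `is_monday` and `generated` are only illustrative.

```python
import pandas as pd

# the candidate split dates tested in this notebook
split_dates = ['2024-10-28', '2024-11-04', '2024-11-11', '2024-11-18']

# dayofweek is 0 for Monday, so every entry should be True
is_monday = pd.to_datetime(split_dates).dayofweek == 0
print(is_monday.all())

# the same list can be generated with a weekly frequency anchored on Mondays
generated = (
    pd.date_range(start=split_dates[0], end=split_dates[-1], freq="W-MON")
    .strftime("%Y-%m-%d")
    .tolist()
)
print(generated)
```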

In [None]:
# comparing the sizes of the train and test data for different split dates
split_dates = ['2024-10-28', '2024-11-04', '2024-11-11', '2024-11-18']

data = {'date': split_dates,
        'train%': [],
        'test%': [],
        'users%': [],
        'interactions%': [],
        }

for split_date in split_dates:
    # splitting the data
    train_df = filtered_df[filtered_df['date'] < split_date]
    test_df = filtered_df[filtered_df['date'] >= split_date]

    # number of unique users both in the train and test data
    common_users = set(train_df['user_id']).intersection(set(test_df['user_id']))
    n_common_users = len(common_users)

    # filter df according to the common users
    train_df_common = train_df[train_df['user_id'].isin(common_users)]
    test_df_common = test_df[test_df['user_id'].isin(common_users)]

    # computing train and test percentages
    train_interactions = len(train_df_common)
    test_interactions = len(test_df_common)
    total_interactions = train_interactions + test_interactions
    train_perc = train_interactions / total_interactions
    test_perc = test_interactions / total_interactions
    data["train%"].append(train_perc)
    data["test%"].append(test_perc)

    # computing the percentage of users
    perc_users = n_common_users / n_users
    data["users%"].append(perc_users)

    # computing the percentage of total interactions
    perc_interactions = total_interactions / n_interactions
    data["interactions%"].append(perc_interactions)

# generating a dataframe from the gathered data
global_split_df = pd.DataFrame(data)
print(global_split_df)

         date    train%     test%    users%  interactions%
0  2024-10-28  0.603096  0.396904  0.459640       0.880902
1  2024-11-04  0.675683  0.324317  0.449313       0.873870
2  2024-11-11  0.747187  0.252813  0.426621       0.858558
3  2024-11-18  0.821092  0.178908  0.386992       0.825525


         date    train%     test%    users%  interactions%
0  2024-10-28  0.603076  0.396924  0.454199       0.871050
1  2024-11-04  0.675668  0.324332  0.443994       0.864097
2  2024-11-11  0.747176  0.252824  0.421571       0.848957
3  2024-11-18  0.821085  0.178915  0.382410       0.816293


November 11, 2024, seems like a reasonable global split date, as the balance between train and test data is close to 75-25 and more users are kept compared to November 18, 2024.

Next, I assess the number of episodes per user in the training set, as I might want to filter out users with fewer than x episodes for training.

In [35]:
# applying the temporal global split
split_date = '2024-11-11'
train_df = filtered_df[filtered_df['date'] < split_date]
test_df = filtered_df[filtered_df['date'] >= split_date]

# number of unique users both in the train and test data
common_users = set(train_df['user_id']).intersection(set(test_df['user_id']))
n_common_users = len(common_users)

# filter df according to the common users
train_df_common = train_df[train_df['user_id'].isin(common_users)]
test_df_common = test_df[test_df['user_id'].isin(common_users)]

# number of interactions
n_interactions_train = len(train_df_common)
n_interactions_test = len(test_df_common)
n_interactions_org = n_interactions_train + n_interactions_test

# grouping by user_id and counting the number of prd_numbers for each user in the train data
df_grouped_train = train_df_common.groupby('user_id')['prd_number'].count().reset_index()

# providing a name for the count column
df_grouped_train.rename(columns={'prd_number': 'prd_count'}, inplace=True)

# thresholds to test
user_thresholds = [2, 3, 4, 5]

data = {'threshold': user_thresholds,
        'train%': [],
        'test%': [],
        'users_total%': [],
        'users_split%': [],
        'interactions%': [],
        }

for threshold in user_thresholds:
    # filtering the train data according to the threshold
    df_grouped_train_filtered = df_grouped_train[df_grouped_train['prd_count'] >= threshold]
    users_set = set(df_grouped_train_filtered['user_id'])
    
    # keeping only the users that meet the threshold in the train and test data
    train_df_common_filtered = train_df_common[train_df_common['user_id'].isin(users_set)]
    test_df_common_filtered = test_df_common[test_df_common['user_id'].isin(users_set)]

    # computing train and test percentages
    train_interactions = len(train_df_common_filtered)
    test_interactions = len(test_df_common_filtered)
    total_interactions = train_interactions + test_interactions
    train_perc = train_interactions / n_interactions_train
    test_perc = test_interactions / n_interactions_test
    data["train%"].append(train_perc)
    data["test%"].append(test_perc)

    # computing the percentage of users
    perc_users = len(users_set) / n_users
    perc_users_split = len(users_set) / n_common_users
    data["users_total%"].append(perc_users)
    data["users_split%"].append(perc_users_split)

    # computing the percentage of total interactions
    perc_interactions = total_interactions / n_interactions_org
    data["interactions%"].append(perc_interactions)
    
# generating a dataframe from the gathered data
user_split_df = pd.DataFrame(data)
print(user_split_df)

   threshold    train%     test%  users_total%  users_split%  interactions%
0          2  0.998015  0.982172      0.400221      0.938116       0.994010
1          3  0.994801  0.965374      0.378843      0.888007       0.987361
2          4  0.990688  0.950548      0.360605      0.845258       0.980540
3          5  0.985839  0.936243      0.344482      0.807466       0.973301


`train%` and `test%` indicate the share of interactions in the train and test sets, respectively, that remain after filtering on the number of plays per user. The baseline is each set after the global split, but before the filter on plays per user.

`users_total%` indicates the percentage of the original users (before any splitting or filtering) that are kept. Before filtering on the number of plays per user, this was 42.2% for the global split using 2024-11-11 as the split date.

`users_split%` indicates the percentage of the users remaining after the global split that are kept after filtering on the number of plays per user.

`interactions%` indicates the percentage of interactions across both the train and test sets (after the global split) that are kept after filtering on the number of plays per user.

#### Decision
The user split keeps a larger proportion of the users and interactions, but it is a less realistic evaluation setup for the collaborative filtering algorithms. Because each user has their own cut-off point, some training interactions of one user occur later in time than test interactions of another user.

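The global split discards more users and interactions: with 2024-11-11 as the split date it keeps roughly 42-43% of users and 85-86% of interactions, compared with 67.0% and 97.8% for the user split. However, it enforces a more realistic splitting strategy, which is why I favor it over the temporal user split.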

For the temporal global split, I'll use 2024-11-11 as the split date and keep only users with at least 2 plays in the training data.
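
The chosen configuration could be wrapped in a single helper, sketched below. This is a minimal illustration under stated assumptions rather than the final implementation: the function name `temporal_global_split`, its parameter names and the toy data are hypothetical, and the `date` column is assumed to hold ISO-formatted strings (or datetime values) so that the comparison with the split date is chronological.

```python
import pandas as pd


def temporal_global_split(interactions, split_date="2024-11-11",
                          min_train_plays=2,
                          user_col="user_id", date_col="date"):
    """Global split at split_date, keeping users present on both sides."""
    # everything before the boundary is train, the rest is test
    before = interactions[date_col] < split_date
    train, test = interactions[before], interactions[~before]

    # number of plays per user in the training period
    train_plays = train[user_col].value_counts()
    enough_history = set(train_plays[train_plays >= min_train_plays].index)

    # users need enough training plays and at least one test play
    kept_users = enough_history & set(test[user_col])

    train = train[train[user_col].isin(kept_users)]
    test = test[test[user_col].isin(kept_users)]
    return train, test


# toy data (hypothetical): only u1 has >= 2 train plays and a test play
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2", "u3", "u3", "u4", "u4"],
    "prd_number": [1, 2, 3, 1, 3, 1, 2, 3, 4],
    "date": ["2024-11-05", "2024-11-08", "2024-11-12",
             "2024-11-06", "2024-11-13",
             "2024-11-01", "2024-11-02",
             "2024-11-11", "2024-11-12"],
})
train_df, test_df = temporal_global_split(toy)
print(sorted(set(train_df["user_id"])))
```

In the toy data, `u2` is dropped for having a single training play, `u3` for having no test plays, and `u4` for having no training plays.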