In [1]:
import pandas as pd

This notebook is used to investigate different train-test split strategies.

In [2]:
# loading my transformed data
df = pd.read_parquet('../data/podcast_data_transformed.parquet')
df.head()

Unnamed: 0,user_id,prd_number,series_title,unique_title,platform,device_type,pub_date,episode_duration,genre,branding_channel,mother_channel,category,content_time_spent,date,time,completion_rate
0,000065a7ec329b0fc01a779ead0e8d38d987b070300113...,11032421443,Brinkmanns briks,Brinkmanns briks: Vi skal tale om pillerne_110...,web,PC,2024-10-30,3422,Fakta og debat,DR P1,DR P1,Oplysning og kultur,3423,2024-11-01,08:56:00,1.0
1,000065a7ec329b0fc01a779ead0e8d38d987b070300113...,11032422442,Hjernekassen på P1,Hjernekassen på P1: Forebyggelse_11032422442,web,PC,2024-10-29,3363,Fakta og debat,DR P1,DR P1,Oplysning og kultur,359,2024-11-01,11:19:00,0.10675
2,000065a7ec329b0fc01a779ead0e8d38d987b070300113...,11162405447,Ubegribeligt,Ubegribeligt: Vand_11162405447,web,PC,2024-10-31,3417,Fakta og debat,DR P1,DR P1,Aktualitet og debat,5160,2024-11-01,09:53:00,1.0
3,000065a7ec329b0fc01a779ead0e8d38d987b070300113...,11802437044,Stjerner og striber,"Stjerner og striber: Joken, der ikke vil dø_11...",web,PC,2024-11-01,2847,Aktualitet,DR P1,-,Nyheder,2847,2024-11-01,08:40:00,1.0
4,000065a7ec329b0fc01a779ead0e8d38d987b070300113...,11802451178,Tiden,"Tiden: Skraldemanden Trump, spansk oversvømmel...",web,PC,2024-11-01,947,Nyheder,DR Lyd,-,Nyheder,610,2024-11-01,09:27:00,0.644139


In [3]:
# number of unique users
n_users = df['user_id'].nunique()

# number of interactions (rows in the df)
n_interactions = len(df)

# printing the summary statistics
print(f"Number of unique users: {n_users}")
print(f"Number of interactions: {n_interactions}")

Number of unique users: 144047
Number of interactions: 2984717


The transformed data contains 144,047 unique users and 2,984,717 interactions.

#### Temporal User Split

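This strategy splits each user's interactions chronologically according to a fixed percentage: the earliest share of a user's interactions forms the training set and the remainder forms the test set. An 80-20 split (80% training, 20% test) is commonly recommended in the literature.

The cell below investigates the consequences of an 80-20 split when only users with at least 3 interactions are kept.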

In [4]:
# grouping by user_id and counting the interactions (rows with a prd_number) of each user
df_grouped = df.groupby('user_id')['prd_number'].count().reset_index()
df_grouped.rename(columns={'prd_number': 'prd_count'}, inplace=True)

# keeping only users with at least 3 interactions
df_grouped = df_grouped[df_grouped['prd_count'] >= 3]

# number of users left
users_set = set(df_grouped['user_id'])
n_users_usplit = len(users_set)

# number of interactions left (rows in df_filtered)
df_filtered = df[df['user_id'].isin(users_set)]
n_interactions_usplit = len(df_filtered)

# printing the results
print(f"Number of users with at least 3 prd_numbers: {n_users_usplit} ({n_users_usplit/n_users:.1%} of users are kept)")
print(f"Number of interactions left: {n_interactions_usplit} ({n_interactions_usplit/n_interactions:.1%} of interactions are kept)")

Number of users with at least 3 prd_numbers: 96288 (66.8% of users are kept)
Number of interactions left: 2919820 (97.8% of interactions are kept)
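
The cell above only measures how much data survives the filter; it does not perform the split itself. A minimal sketch of how a per-user 80-20 temporal split could look is given below. The function name `temporal_user_split`, its parameters, and the toy data are hypothetical, and the sketch assumes that `date` and `time` sort chronologically (as ISO-formatted strings do).

```python
import pandas as pd

def temporal_user_split(interactions, train_frac=0.8, min_interactions=3):
    """Hypothetical helper: split each user's history chronologically."""
    # order every user's interactions by time (stable sort keeps ties in input order)
    ordered = interactions.sort_values(["user_id", "date", "time"], kind="stable")
    by_user = ordered.groupby("user_id")

    # history length of the user, and 0-based position of each row in that history
    n = by_user["user_id"].transform("size")
    pos = by_user.cumcount()

    # users with too few interactions are dropped entirely
    keep = n >= min_interactions
    # the first floor(train_frac * n) interactions of each user go to train
    is_train = pos < (n * train_frac).astype(int)
    return ordered[keep & is_train], ordered[keep & ~is_train]

# toy data (not from the podcast dataset): user "a" has 5 interactions, user "b" only 2
toy = pd.DataFrame({
    "user_id": ["a"] * 5 + ["b"] * 2,
    "date": ["2024-11-01", "2024-11-02", "2024-11-03", "2024-11-04",
             "2024-11-05", "2024-11-01", "2024-11-02"],
    "time": ["08:00:00"] * 7,
})
train_u, test_u = temporal_user_split(toy)
```

With 3 interactions and an 80-20 split, a user contributes 2 training rows and 1 test row, so every kept user appears in both sets.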


#### Temporal Global Split
Define a single global timestamp that works as the boundary between train and test data for all users/interactions.

I want to make the cut on the `date` attribute and place it on a Monday, so that the boundary between training and test data falls between two calendar weeks. This might be an appropriate decision since podcast listening and publication follow weekly patterns. This assumption must be investigated in the EDA and in the related work section on podcast listening patterns.
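
As a quick sanity check, the weekday of each candidate split date used in this notebook can be verified with pandas. This is a small sketch; the variable names are illustrative only.

```python
import pandas as pd

# candidate split dates investigated in this notebook
candidate_dates = ["2024-10-28", "2024-11-04", "2024-11-11", "2024-11-18"]

# map each date to its weekday name (English by default)
weekdays = {d: pd.Timestamp(d).day_name() for d in candidate_dates}
for d, name in weekdays.items():
    print(d, name)
```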

In [6]:
# testing the size of the train and test data when splitting on different dates
split_dates = ['2024-10-28', '2024-11-04', '2024-11-11', '2024-11-18']

data = {'date': split_dates,
        'train%': [],
        'test%': [],
        'users%': [],
        'interactions%': [],
        }

for split_date in split_dates:
    # splitting the data
    train_df = df[df['date'] < split_date]
    test_df = df[df['date'] >= split_date]

    # number of unique users both in the train and test data
    common_users = set(train_df['user_id']).intersection(set(test_df['user_id']))
    n_common_users = len(common_users)

    # filter df according to the common users
    train_df_common = train_df[train_df['user_id'].isin(common_users)]
    test_df_common = test_df[test_df['user_id'].isin(common_users)]

    # computing train and test percentages
    train_interactions = len(train_df_common)
    test_interactions = len(test_df_common)
    total_interactions = train_interactions + test_interactions
    train_perc = train_interactions / total_interactions
    test_perc = test_interactions / total_interactions
    data["train%"].append(train_perc)
    data["test%"].append(test_perc)

    # computing the percentage of users
    perc_users = n_common_users / n_users
    data["users%"].append(perc_users)

    # computing the percentage of total interactions
    perc_interactions = total_interactions / n_interactions
    data["interactions%"].append(perc_interactions)


In [None]:
# generating a dataframe from the gathered data
# (stored under a new name so the interaction data in df is not overwritten)
split_summary = pd.DataFrame(data)
print(split_summary)

         date    train%     test%    users%  interactions%
0  2024-10-28  0.603254  0.396746  0.457115       0.880448
1  2024-11-04  0.675782  0.324218  0.446729       0.873404
2  2024-11-11  0.747212  0.252788  0.424112       0.858137
3  2024-11-18  0.821110  0.178890  0.384708       0.825251


#### Decision
The user split keeps a larger share of the data (66.8% of users and 97.8% of interactions), but it is a less realistic evaluation setup for the collaborative filtering algorithms. Because each user gets an individual cutoff, some users' training interactions occur later than other users' test interactions, so the model can be trained on data that lies in the future relative to part of the test set.
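
The overlap can be illustrated with a toy example (made-up data, not the podcast dataset): two users with five interactions each, where user "b" is active a month later than user "a". Under a per-user 80-20 split, the latest training interaction is later than the earliest test interaction.

```python
import pandas as pd

# toy data: user "a" listens in October, user "b" in November
toy_overlap = pd.DataFrame({
    "user_id": ["a"] * 5 + ["b"] * 5,
    "date": pd.to_datetime([
        "2024-10-01", "2024-10-02", "2024-10-03", "2024-10-04", "2024-10-05",
        "2024-11-01", "2024-11-02", "2024-11-03", "2024-11-04", "2024-11-05",
    ]),
}).sort_values(["user_id", "date"])

# per-user 80-20 split: the first 4 of 5 interactions of each user go to train
pos = toy_overlap.groupby("user_id").cumcount()
n = toy_overlap.groupby("user_id")["date"].transform("size")
is_train = pos < (0.8 * n).astype(int)

latest_train = toy_overlap.loc[is_train, "date"].max()     # from user "b"
earliest_test = toy_overlap.loc[~is_train, "date"].min()   # from user "a"
overlap = latest_train > earliest_test
```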

The global split discards more users and interactions: with a cutoff on November 11, 2024, it keeps 42.4% of users and 85.8% of interactions. However, it enforces a more realistic splitting strategy, which is why I favor it over the temporal user split.

November 11, 2024, seems like a reasonable global split date: the balance between train and test data is close to 75-25 (74.7% vs. 25.3%), and more users are kept than with November 18, 2024 (42.4% vs. 38.5%).
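
A minimal sketch of how the chosen global split could be applied is shown below. The function name `temporal_global_split` and the toy data are hypothetical; the sketch mirrors the loop above by keeping only users that appear on both sides of the cutoff, and it parses `date` explicitly instead of relying on string comparison.

```python
import pandas as pd

def temporal_global_split(interactions, split_date):
    """Hypothetical helper: one global cutoff for all users."""
    dates = pd.to_datetime(interactions["date"])
    cutoff = pd.Timestamp(split_date)

    # everything before the cutoff is train, everything from the cutoff on is test
    train = interactions[dates < cutoff]
    test = interactions[dates >= cutoff]

    # keep only users that occur in both train and test
    common = set(train["user_id"]) & set(test["user_id"])
    return train[train["user_id"].isin(common)], test[test["user_id"].isin(common)]

# toy data (not from the podcast dataset): only user "a" is active on both sides
toy_global = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "c"],
    "date": ["2024-11-08", "2024-11-10", "2024-11-11", "2024-11-09", "2024-11-12"],
})
train_g, test_g = temporal_global_split(toy_global, "2024-11-11")
```

An interaction dated exactly on the split date lands in the test set, matching the `>=` comparison used in the investigation above.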