In [1]:
import pandas as pd

This notebook investigates different train-test split strategies.

#### Temporal Global Split
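Define a single global timestamp that serves as the boundary between train and test data for all users and interactions: every interaction before the timestamp goes to the training data, and every interaction on or after it goes to the test data.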

In [22]:
# loading my transformed data
df = pd.read_csv('raw_data/transformed_data.csv')
df.head()

Unnamed: 0,user_id,prd_number,series_title,unique_title,platform,device_type,pub_date,episode_duration,genre,branding_channel,mother_channel,category,content_time_spent,date,time,completion_rate
0,00005776ec874bc9ab8ca964cf274858,11032415372,Kampen om historien,Kampen om historien: Jihadismen og kampen mod ...,app,Mobile Phone,2024-09-10,3416,Historie,DR P1,DR P1,Oplysning og kultur,1140,2024-09-10,14:44:00,0.333724
1,00005776ec874bc9ab8ca964cf274858,11032415392,Kampen om historien,Kampen om historien: Midt i en ny mellemkrigst...,app,Mobile Phone,2024-09-24,3416,Historie,DR P1,DR P1,Oplysning og kultur,1257,2024-10-08,10:51:00,0.367974
2,00005776ec874bc9ab8ca964cf274858,11032415412,Kampen om historien,Kampen om historien: Gaskamrene er en detalje ...,app,Tablet,2024-10-08,3393,Historie,DR P1,DR P1,Oplysning og kultur,3395,2024-11-28,09:42:00,1.0
3,00005776ec874bc9ab8ca964cf274858,11032415422,Kampen om historien,Kampen om historien: Libanons lange krig - kri...,app,Mobile Phone,2024-10-15,3427,Historie,DR P1,DR P1,Oplysning og kultur,3425,2024-10-30,09:39:00,0.999416
4,00005776ec874bc9ab8ca964cf274858,11032415432,Kampen om historien,Kampen om historien: Hvad har amerikanerne lær...,app,Mobile Phone,2024-10-22,3365,Historie,DR P1,DR P1,Oplysning og kultur,4681,2024-12-01,10:27:00,1.0


The transformed data contains 144,047 unique users.  

I want to make the cut on the `date` attribute, using a split date that falls on a Monday, so that no week is divided between the training and test data. This seems appropriate because podcast listening and publication are expected to follow weekly patterns. That assumption still needs to be checked in the EDA and against the related work on podcast listening patterns.
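
As a quick sanity check on the Monday requirement, the sketch below verifies that each candidate split date really is a Monday. The dates are the ones tried further down in this notebook; the helper name `is_monday` is only illustrative and not part of the original analysis.

```python
import pandas as pd

# Candidate split dates tried in this notebook.
candidate_dates = ["2024-10-28", "2024-11-04", "2024-11-11", "2024-11-18"]


def is_monday(date_str: str) -> bool:
    """Return True if the ISO-formatted date falls on a Monday."""
    # pandas encodes Monday as 0 and Sunday as 6 in Timestamp.dayofweek.
    return pd.Timestamp(date_str).dayofweek == 0


# Map every candidate date to whether it is a Monday.
monday_check = {d: is_monday(d) for d in candidate_dates}
print(monday_check)
```

If any value in `monday_check` is `False`, the corresponding split would cut through the middle of a week.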

In [28]:
# testing the size of the train and test data when splitting on different dates
split_date = "2024-10-28"

# splitting the data
train_df = df[df['date'] < split_date]
test_df = df[df['date'] >= split_date]

# number of unique users both in the train and test data
common_users = set(train_df['user_id']).intersection(set(test_df['user_id']))
n_common_users = len(common_users)

# filter df according to the common users
train_df_common = train_df[train_df['user_id'].isin(common_users)]
test_df_common = test_df[test_df['user_id'].isin(common_users)]

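# printing the size of the train and test data
# note: the train/test shares are relative to the interactions of the common users,
# not to the full dataset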
train_interactions = len(train_df_common)
test_interactions = len(test_df_common)
total_interactions = train_interactions + test_interactions
df_interactions = len(df)
perc_interactions = total_interactions / df_interactions
print(f"% of total interactions {perc_interactions} ({total_interactions})")
train_perc = train_interactions / total_interactions
test_perc = test_interactions / total_interactions
print("The training data contains", train_interactions, "interactions, which is", train_perc, "of the total data")
print("The test data contains", test_interactions, "interactions, which is", test_perc, "of the total data")  

total_users = len(set(df['user_id']))
perc_users = n_common_users / total_users
print("Number of common users: ", n_common_users) 
print(f"% of total users {perc_users}")

% of total interactions 0.880448297108235 (2627889)
The training data contains 1585284 interactions, which is 0.6032537903998229 of the total data
The test data contains 1042605 interactions, which is 0.3967462096001772 of the total data
Number of common users:  65846
% of total users 0.45711469173255953


In [29]:
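# results collected by hand from running the cell above with each candidate split date
data = {
    'date': ['2024-10-28', '2024-11-04', '2024-11-11', '2024-11-18'],
    'train%': [60.3, 67.6, 74.7, 82.1],
    'test%': [39.7, 32.4, 25.3, 17.9],
    '%users': [45.7, 44.7, 42.4, 38.5],
    '%interactions': [88.0, 87.3, 85.8, 82.5]
}

# use a separate name so the interaction data in `df` is not overwritten
summary_df = pd.DataFrame(data)
print(summary_df)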

         date  train%  test%  %users  %interactions
0  2024-10-28    60.3   39.7    45.7           88.0
1  2024-11-04    67.6   32.4    44.7           87.3
2  2024-11-11    74.7   25.3    42.4           85.8
3  2024-11-18    82.1   17.9    38.5           82.5
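
The table above was filled in by hand. One way to produce the same kind of summary programmatically is sketched below. The function name `split_summary` and the toy data are hypothetical; the sketch only assumes a frame with `user_id` and `date` columns, as in the transformed data, and follows the same logic as the split cell (keep only users that appear on both sides of the cut).

```python
import pandas as pd


def split_summary(interactions: pd.DataFrame, split_date: str) -> dict:
    """Summarise a global temporal split restricted to users seen on both sides."""
    dates = pd.to_datetime(interactions["date"])
    cutoff = pd.Timestamp(split_date)

    # Interactions strictly before the cutoff are train, the rest are test.
    train = interactions[dates < cutoff]
    test = interactions[dates >= cutoff]

    # Users with at least one interaction on each side of the cutoff.
    common = set(train["user_id"]) & set(test["user_id"])
    n_train = int(train["user_id"].isin(common).sum())
    n_test = int(test["user_id"].isin(common).sum())
    kept = n_train + n_test

    return {
        "date": split_date,
        # Shares of the kept (common-user) interactions.
        "train%": round(100 * n_train / kept, 1) if kept else float("nan"),
        "test%": round(100 * n_test / kept, 1) if kept else float("nan"),
        # Shares of the full dataset that survive the common-user filter.
        "%users": round(100 * len(common) / interactions["user_id"].nunique(), 1),
        "%interactions": round(100 * kept / len(interactions), 1),
    }


# Toy data for illustration only, not taken from the real dataset.
toy = pd.DataFrame({
    "user_id": ["a", "a", "b", "b", "c", "d"],
    "date": ["2024-10-21", "2024-10-29", "2024-10-22",
             "2024-11-05", "2024-10-23", "2024-11-06"],
})

summary = pd.DataFrame(
    [split_summary(toy, d) for d in ["2024-10-28", "2024-11-04"]]
)
print(summary)
```

Calling `split_summary` on the full interaction frame for each candidate Monday should yield a table with the same columns as the one above, without retyping the numbers after every run.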
