<h1>Sampling a rating dataset from 500 users of the ml-25m dataset</h1>


<B>CAVEAT:</B> The sampling requires a copy of the ratings.csv file from an official source (see also [X] in the code below), for instance https://grouplens.org/datasets/movielens/25m/ (last accessed: January 14, 2025).

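We group the ml-25m benchmark dataset by userId and order the users by their number of rating entries. By default, every user has at least 20 entries. We then split the ordered list of users into 500 contiguous chunks of (nearly) equal size and sample one user per chunk, so that the sample covers the whole range from very active to barely active users.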
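The ordering and chunking steps described above can be illustrated on a toy table. This is only a minimal sketch under stated assumptions: the data, the names `toy`, `ordered_users`, `picked` and `toy_sample`, and the use of `numpy.random.default_rng` for drawing are illustrative choices and not part of the notebook, which draws with `pandas.Series.sample` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: user "a" has 4 ratings, "b" 3, "c" 2, "d" 1.
toy = pd.DataFrame({
    "userId": list("aaaabbbccd"),
    "rating": [4.0, 3.5, 5.0, 2.0, 3.0, 4.5, 1.0, 2.5, 4.0, 3.0],
})

# Order users by number of ratings, most active first.
ordered_users = toy["userId"].value_counts().index.to_numpy()

# Split the ordered users into n contiguous, nearly equal-sized chunks.
n = 2
chunks = np.array_split(ordered_users, n)

# Draw one user per chunk (assumption: a NumPy Generator instead of Series.sample).
rng = np.random.default_rng(1234)
picked = [rng.choice(chunk) for chunk in chunks]

# Keep all ratings of the picked users.
toy_sample = toy[toy["userId"].isin(picked)]
```

With two chunks, one picked user comes from the more active half ("a" or "b") and one from the less active half ("c" or "d"), which is the stratification effect the 500-chunk split aims for on the full dataset.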

Permission to use a sample of the ml-25m dataset was granted on 11 November 2022 (see /dec-cf-sim/data/ml-25m/2022-11-11_Permission to use a sample from the ml-25m dataset.pdf).

In [1]:
import pandas as pd
import numpy as np

def sample_n_users(df, n, random_state=1234):
    # sort userIds by number of rating entries (descending)
    sorted_userIds = df.value_counts("userId").index
    # split userIds into n (nearly) equal-sized chunks
    userId_chunks = np.array_split(sorted_userIds, n)

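    # sample userIds (one per chunk); because the same random_state is reused,
    # chunks of equal length yield the user at the same position within the chunk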
    sample_userIds = [pd.Series(chunk.values).sample(n=1, random_state=random_state).values[0] for chunk in userId_chunks]
    sample_df = df[df.loc[:,"userId"].isin(sample_userIds)]
    return sample_df

In [2]:
# READ DATA
#############

# Some statistics on the ml-25m dataset we sample from:
# total lengths
# tag_df 1093360 rows × 4 columns
# num unique users: 14592
######
# 25m_df 25000095 rows × 4 columns
# num unique users: 162541

# ratings
ml25_ratings = '/home/teichinger/dec-cf-sim/data/ml-25m/ratings.csv' # [X] CAVEAT: You have to obtain a copy of this .csv file from an official source to proceed.
df = pd.read_csv(ml25_ratings, dtype = {"userId": object, "movieId": object, "rating": float, "timestamp": int})
# rename column: 'movieId' -> 'itemId' (all other columns keep their names)
df = df.rename(columns={"movieId": "itemId"})

# tags
ml25_tags = '/home/teichinger/dec-cf-sim/data/ml-25m/tags.csv'
df_tags = pd.read_csv(ml25_tags, dtype = {"userId":object, "movieId": object, "tag": object, "timestamp": int})

# SAMPLE DF
##############
sample_df = sample_n_users(df, 500, random_state=1234)

# SAVE DF
##############
output_filename = './samples/ml-25m:n=500:rseed=1234.csv'
sample_df.to_csv(output_filename, index=False)

In [3]:
# SELECT ALL SAMPLED USERS 
sample_userIds = sample_df.loc[:,"userId"].value_counts().index.tolist()


def blank_concat(_s):
    """ Takes a pandas.Series <_s> and concatenates all strings with a blank in between. """
    
    return " ".join(_s)


#items for which predictions are due:      158   
#items for which predictions are feasible: 8355
#items for which predictions are feasible in the test set: 156