# SASRec Model

We chose SASRec (Self-Attentive Sequential Recommendation) because it is designed for sequential interaction data, which is exactly what short-video recommendation needs. Users on platforms like Kuaishou watch videos one after another, so the order of their interactions carries signal about what they will want next. Unlike methods that model only overall user-item relationships (for example, matrix factorization), SASRec applies Transformer-style self-attention to each user's time-ordered watch history and predicts the next item at every position. It has been reported to perform well on similar next-item recommendation tasks, and because self-attention processes all positions of a sequence in parallel, it trains efficiently and scales well, which makes it a good fit for our project's goals.
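
To make the next-item objective concrete, here is a minimal sketch of how a single watch history can be turned into the shifted input/target pair that SASRec-style models train on. The function name `make_training_pair`, the `max_len` value, and the use of 0 as the padding ID are assumptions for illustration, not code from this notebook.

```python
def make_training_pair(sequence, max_len=5, pad_id=0):
    """Build (inputs, targets) lists for next-item prediction.

    Hypothetical helper: at each position the model sees the items up to
    that point and is trained to predict the item that follows.
    """
    # Input is everything except the last item; target is shifted by one.
    inputs = sequence[:-1]
    targets = sequence[1:]

    # Keep only the most recent max_len positions.
    inputs = inputs[-max_len:]
    targets = targets[-max_len:]

    # Left-pad so the most recent item always sits at the last position.
    pad = [pad_id] * (max_len - len(inputs))
    return pad + inputs, pad + targets


# Toy example using the first video IDs from user 14's history.
inputs, targets = make_training_pair([148, 183, 3649, 5262])
print(inputs)   # [0, 0, 148, 183, 3649]
print(targets)  # [0, 0, 183, 3649, 5262]
```

Each target is the video watched right after the corresponding input position, which is how the watch order enters the training signal.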

These articles were very informative and gave us a broader understanding of SASRec:

https://medium.com/biased-algorithms/contrastive-learning-for-sequential-recommendation-f4744d75128a

https://medium.com/@rohan.chaudhury.rc/paper-review-self-attentive-sequential-recommendation-a4efd2185a61

In [3]:
import pandas as pd

# Load the interaction data
df = pd.read_csv('data_final_project/KuaiRec 2.0/data/small_matrix.csv')

# Display the first few rows to understand its structure
print(df.head())


   user_id  video_id  play_duration  video_duration                     time  \
0       14       148           4381            6067  2020-07-05 05:27:48.378   
1       14       183          11635            6100  2020-07-05 05:28:00.057   
2       14      3649          22422           10867  2020-07-05 05:29:09.479   
3       14      5262           4479            7908  2020-07-05 05:30:43.285   
4       14      8234           4602           11000  2020-07-05 05:35:43.459   

         date     timestamp  watch_ratio  
0  20200705.0  1.593898e+09     0.722103  
1  20200705.0  1.593898e+09     1.907377  
2  20200705.0  1.593898e+09     2.063311  
3  20200705.0  1.593898e+09     0.566388  
4  20200705.0  1.593899e+09     0.418364  


In [4]:
from collections import defaultdict

# sort the dataframe by user_id and timestamp
df.sort_values(by=['user_id', 'timestamp'], inplace=True)

# Build user -> [video_id1, video_id2, ...] dict
user_sequences = defaultdict(list)

for row in df.itertuples():
    user_sequences[row.user_id].append(row.video_id)

# Preview one user's sequence
for uid, vids in list(user_sequences.items())[:1]:
    print(f"User {uid} : {vids}")
print(f"Number of unique user sequences: {len(user_sequences)}")

User 14 : [148, 183, 3649, 5262, 8234, 6789, 1963, 175, 1973, 171, 6803, 3634, 6787, 1951, 179, 5266, 5241, 6782, 6788, 8220, 6801, 3647, 6771, 9588, 186, 6812, 3684, 206, 211, 1988, 3672, 9595, 8242, 8248, 6829, 217, 9570, 139, 8160, 3669, 6846, 2007, 6839, 2000, 3654, 2008, 1898, 203, 5261, 256, 6854, 8289, 3702, 5326, 9660, 229, 5315, 262, 2040, 254, 2024, 5353, 6767, 8251, 8295, 5328, 8201, 9569, 2029, 223, 6865, 9653, 3706, 3630, 5331, 286, 2074, 2081, 3699, 9683, 1986, 2052, 285, 280, 5252, 8298, 5365, 9678, 275, 3719, 3586, 8212, 5367, 9592, 8319, 290, 145, 3737, 6904, 3722, 6749, 279, 147, 289, 5381, 1903, 9670, 8222, 296, 8316, 297, 2075, 2093, 9697, 2084, 3694, 3698, 5339, 2077, 258, 9645, 265, 2082, 1943, 5237, 3650, 5297, 8279, 3747, 307, 2113, 9659, 103, 5265, 8340, 5251, 6879, 3734, 288, 180, 5228, 5290, 8302, 8228, 8342, 6834, 5374, 9684, 5375, 3778, 6930, 9704, 340, 8323, 2121, 1922, 8359, 282, 3770, 5416, 5421, 5413, 9706, 9711, 3714, 3764, 9716, 318, 3772, 8311, 8354,

In [14]:
import os

sas_rec_data_dir = "data_final_project/KuaiRec 2.0/sas_rec_data/"
# Create the folder if it does not exist
os.makedirs(sas_rec_data_dir, exist_ok=True)

# Write one space-separated video_id sequence per line (one line per user)
with open(os.path.join(sas_rec_data_dir, "sasrec_sequences.txt"), "w") as f:
    for sequence in user_sequences.values():
        f.write(" ".join(map(str, sequence)) + "\n")
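
SASRec implementations commonly reserve index 0 for padding and expect item IDs to be contiguous integers starting at 1. Whether the raw `video_id` values already satisfy this depends on the dataset, so the following is only a sketch of one way to remap them. The helper `build_item_index` and the toy `sequences` dict are hypothetical and stand in for `user_sequences`.

```python
def build_item_index(sequences):
    """Map each raw item ID to a contiguous index starting at 1.

    Index 0 is left unused so it can serve as the padding ID (an assumption
    about the downstream model, not something this notebook states).
    """
    item_to_index = {}
    for items in sequences.values():
        for item in items:
            if item not in item_to_index:
                item_to_index[item] = len(item_to_index) + 1
    return item_to_index


# Toy stand-in for user_sequences.
sequences = {14: [148, 183, 3649], 19: [183, 0, 148]}
item_to_index = build_item_index(sequences)
remapped = {
    user: [item_to_index[item] for item in items]
    for user, items in sequences.items()
}
print(remapped)  # {14: [1, 2, 3], 19: [2, 4, 1]}
```

If a remapping like this is used, the same `item_to_index` mapping has to be kept to translate model predictions back to the original video IDs.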


### Split Logic (Per User)

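- Train: all videos except the last two
- Valid: the second-to-last video (used to validate predictions made from the training sequence)
- Test: the last video (used to evaluate final recommendation quality)

Users with fewer than three interactions are skipped, since they cannot fill all three splits.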
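
As a small worked example of this leave-one-out split, the sketch below applies the same slicing to a toy sequence. The function name `leave_one_out_split` is hypothetical and only illustrates the rule described above.

```python
def leave_one_out_split(sequence):
    """Return (train, valid, test) lists, or None if the sequence is too short."""
    if len(sequence) < 3:
        return None
    # Everything but the last two items, then the second-to-last, then the last.
    return sequence[:-2], sequence[-2:-1], sequence[-1:]


print(leave_one_out_split([148, 183, 3649, 5262, 8234]))
# ([148, 183, 3649], [5262], [8234])
print(leave_one_out_split([148, 183]))
# None
```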

In [15]:
train_data = {}
valid_data = {}
test_data = {}

for user, seq in user_sequences.items():
    if len(seq) < 3:
        continue
    train_data[user] = seq[:-2]
    valid_data[user] = seq[-2:-1]
    test_data[user] = seq[-1:]

# Count the users in each split
train_users = len(train_data)
valid_users = len(valid_data)
test_users = len(test_data)
print(f"Number of users in train set: {train_users}")
print(f"Number of users in valid set: {valid_users}")
print(f"Number of users in test set: {test_users}")

Number of users in train set: 1411
Number of users in valid set: 1411
Number of users in test set: 1411


In [16]:
def save_sequences(path, data_dict):
    """Write one "user item" pair per line to `path` inside sas_rec_data_dir."""
    with open(os.path.join(sas_rec_data_dir, path), "w") as f:
        for user, items in data_dict.items():
            for item in items:
                f.write(f"{user} {item}\n")


save_sequences("train.txt", train_data)
save_sequences("valid.txt", valid_data)
save_sequences("test.txt", test_data)
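
To check that the saved files can be read back as expected, a loader along the following lines could be used. This is a sketch: `load_sequences` is a hypothetical helper, and the demo writes to a temporary file that mimics the "user item" layout rather than touching the real `train.txt`.

```python
import os
import tempfile
from collections import defaultdict


def load_sequences(file_path):
    """Read "user item" lines back into a {user: [item, ...]} dict, in file order."""
    sequences = defaultdict(list)
    with open(file_path) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            user, item = line.split()
            sequences[int(user)].append(int(item))
    return dict(sequences)


# Round-trip check on a temporary file with the same layout as train.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "train_demo.txt")
    with open(demo_path, "w") as f:
        f.write("14 148\n14 183\n19 3649\n")
    loaded = load_sequences(demo_path)

print(loaded)  # {14: [148, 183], 19: [3649]}
```

Because each user's items are written in sequence order, reading the file top to bottom preserves the chronological order within each user.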