## Load and clean all the important files

*While loading, `eval` is applied to each entry of the corresponding column, converting string representations of Python literals into actual Python objects.*

Cleaning here is straightforward: it only removes null values and duplicates.

In [1]:
from load_clean import load_and_clean_data

big_matrix, small_matrix, social_network, item_categories, user_features, item_daily_features = load_and_clean_data()

Loading big and small matrices...
Loading social network...
Loading item features...
Loading user features...
Loading items' daily features...
All data loaded.
Cleaning data...
Data cleaned.
Big matrix: 7.71% cleaned
Small matrix: 3.89% cleaned
Social network: 0.00% cleaned
Item categories: 0.00% cleaned
User features: 3.86% cleaned
Item daily features: 30.11% cleaned
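
The `load_clean` module itself is not shown in this notebook. Purely as an illustration, the sketch below shows one way the cleaning described above could be done with pandas. The function names, the toy frame, and its `feat` column are assumptions, and the "% cleaned" figure is assumed to mean the share of rows removed. It also uses `ast.literal_eval` in place of `eval`, since it only accepts Python literals and cannot run arbitrary code.

```python
import ast
import pandas as pd

def clean_frame(df: pd.DataFrame):
    """Drop rows containing nulls and exact duplicate rows.

    Returns the cleaned frame and the percentage of rows removed
    (hypothetical helper, not the notebook's actual implementation).
    """
    before = len(df)
    cleaned = df.dropna().drop_duplicates().reset_index(drop=True)
    pct_removed = 100.0 * (before - len(cleaned)) / before if before else 0.0
    return cleaned, pct_removed

def parse_literal_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Turn strings such as "[27, 9]" into real Python objects."""
    out = df.copy()
    out[column] = out[column].apply(ast.literal_eval)
    return out

# Toy data: one duplicate row and one row with a missing value.
raw = pd.DataFrame({
    "video_id": [1, 1, 2, 3],
    "feat": ["[8]", "[8]", "[27, 9]", None],
})

# Deduplicate while the column still holds (hashable) strings, then parse.
cleaned, pct = clean_frame(raw)
cleaned = parse_literal_column(cleaned, "feat")
print(cleaned["feat"].tolist(), f"{pct:.2f}% cleaned")
```

Parsing after deduplication matters in this sketch: once the column holds lists, `drop_duplicates` can no longer hash its values.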


## Binarize data

We consider an interaction positive when `watch_ratio >= 2`.

This comes from the KuaiRec paper: "A user–video pair is considered positive if the user’s cumulative watch time is at least twice the video’s duration (i.e. watch_ratio >= 2). This threshold ensures we capture strong signals (rewatches or full watches), and ignore casual or accidental views."

In [None]:
for df in (big_matrix, small_matrix):
    df['interaction'] = (df['watch_ratio'] >= 2).astype(int)

print("Big  :", big_matrix.shape, "  positives:", big_matrix['interaction'].sum())
print("Small:", small_matrix.shape, "  positives:", small_matrix['interaction'].sum())
big_matrix.head()

Big  : (11564987, 9)   positives: 3925531
Small: (4494578, 9)   positives: 1471862


|   | user_id | video_id | play_duration | video_duration | time | date | timestamp | watch_ratio | interaction |
|---|---------|----------|---------------|----------------|------|------|-----------|-------------|-------------|
| 0 | 0 | 3649 | 13838 | 10867 | 2020-07-05 00:08:23.438 | 20200705 | 1593879000.0 | 1.273397 | 1 |
| 1 | 0 | 9598 | 13665 | 10984 | 2020-07-05 00:13:41.297 | 20200705 | 1593879000.0 | 1.244082 | 1 |
| 2 | 0 | 5262 | 851 | 7908 | 2020-07-05 00:16:06.687 | 20200705 | 1593879000.0 | 0.107613 | 0 |
| 3 | 0 | 1963 | 862 | 9590 | 2020-07-05 00:20:26.792 | 20200705 | 1593880000.0 | 0.089885 | 0 |
| 4 | 0 | 8234 | 858 | 11000 | 2020-07-05 00:43:05.128 | 20200705 | 1593881000.0 | 0.078 | 0 |


## 2. Popularity Recommender

A super-simple baseline:  
1.  Count how many times each `video_id` was **positively** watched in the **train** set.  
2.  For each user, recommend the top-K most popular videos they **haven’t** already seen.


In [4]:
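# Keep only the positive interactions of the training data
positives = big_matrix[big_matrix['interaction']==1]

# Number of positive interactions per video, most popular first
popularity = (
    positives
      .groupby('video_id')['interaction']
      .sum()
      .sort_values(ascending=False)
)
pop_list = popularity.index.tolist()

# Videos each user has already watched positively (used to filter recommendations);
# note that non-positive views are not counted as "seen" here
seen = positives.groupby('user_id')['video_id'].apply(set).to_dict()

def recommend_pop(user_id, K=10):
    """Return the K most popular videos that are not in the user's positive history."""
    recs = []
    seen_u = seen.get(user_id, set())
    for vid in pop_list:
        if vid not in seen_u:
            recs.append(vid)
        if len(recs)==K:
            break
    return recs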
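
To see how this baseline behaves, here is a self-contained toy version of the same two steps that uses only the standard library. The names `build_popularity` and `recommend_popular` and the toy interactions are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def build_popularity(positive_pairs):
    """Rank items by their number of positive interactions, most popular first."""
    counts = Counter(item for _, item in positive_pairs)
    return [item for item, _ in counts.most_common()]

def recommend_popular(user, ranked_items, seen_by_user, k=10):
    """Walk the global ranking and keep the first k items the user has not seen."""
    seen = seen_by_user.get(user, set())
    recs = []
    for item in ranked_items:
        if item not in seen:
            recs.append(item)
            if len(recs) == k:
                break
    return recs

# Toy positive interactions as (user, item) pairs: a has 3, b has 2, c has 1.
positive_pairs = [("u1", "a"), ("u2", "a"), ("u3", "a"),
                  ("u1", "b"), ("u2", "b"), ("u3", "c")]

ranked = build_popularity(positive_pairs)
seen_by_user = {}
for user, item in positive_pairs:
    seen_by_user.setdefault(user, set()).add(item)

print(ranked)                                               # ['a', 'b', 'c']
print(recommend_popular("u1", ranked, seen_by_user, k=2))   # ['c']
print(recommend_popular("new", ranked, seen_by_user, k=2))  # ['a', 'b']
```

Every user without history receives exactly the same list, which is why this baseline is not personalized beyond the "already seen" filter.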


## 3. Matrix Factorization with SVD

We use the [`Surprise`](https://surprise.readthedocs.io/) library’s SVD on our **implicit** binary data:
- Treat `interaction` (0/1) as the “rating.”  
- 5-fold CV for RMSE/MAE.  
- Then fit on the full train set and produce top-K scores.
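
For context, Surprise's SVD estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. It is trained by stochastic gradient descent on a regularized squared error, and estimates are clipped to the rating scale. The pure-Python sketch below illustrates that prediction rule and a single SGD update. The function names are made up for illustration and are not Surprise API; the learning rate 0.005 and regularization 0.02 mirror the library's documented defaults.

```python
def raw_estimate(mu, b_u, b_i, p_u, q_i):
    """Unclipped estimate: global mean + user bias + item bias + p_u . q_i."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

def predict(mu, b_u, b_i, p_u, q_i, lo=0.0, hi=1.0):
    """Estimate clipped to the rating scale, here (0, 1) as in this notebook."""
    return min(hi, max(lo, raw_estimate(mu, b_u, b_i, p_u, q_i)))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One SGD update on the regularized squared error for a single rating."""
    err = r_ui - raw_estimate(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# Toy parameters: one observed positive interaction (label 1).
mu, b_u, b_i = 0.3, 0.0, 0.0
p_u, q_i = [0.1, 0.1], [0.1, 0.1]

err_before = 1.0 - raw_estimate(mu, b_u, b_i, p_u, q_i)
b_u, b_i, p_u, q_i = sgd_step(1.0, mu, b_u, b_i, p_u, q_i)
err_after = 1.0 - raw_estimate(mu, b_u, b_i, p_u, q_i)
print(err_before, err_after)  # the error shrinks slightly after one step
```

Because estimates are clipped to the rating scale, many items can end up tied at the top score, which is worth keeping in mind when these scores are sorted into a top-K list.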


In [5]:
# install surprise if needed
# !pip install scikit-surprise
# IMPORTANT: if the import fails with an error about the numpy version, try downgrading it:
#   pip uninstall numpy && pip install "numpy<2.0"

from surprise import Dataset, Reader, SVD
from surprise.model_selection import cross_validate

# 3.1 load into Surprise
reader = Reader(rating_scale=(0,1))
train_data = Dataset.load_from_df(big_matrix[['user_id','video_id','interaction']], reader)

# 3.2 5-fold CV
algo = SVD(n_factors=50, n_epochs=10, verbose=True)
cv_results = cross_validate(algo, train_data, measures=['RMSE','MAE'], cv=5, verbose=True)

# 3.3 fit on all train data
trainset = train_data.build_full_trainset()
algo.fit(trainset)


Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Evaluating RMSE, MAE of algorithm SVD on 5 split(s

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1037af820>

## 4. Evaluation: Ranking Metrics

We now define:
- **Precision@K**  
- **Recall@K**  
- **NDCG@K**  
- **MAP@K**  
- **Accuracy** (fraction of correctly predicted interactions over all test pairs, thresholding the SVD score at 0.5)

and a helper to get **top-K** recommendations from our SVD.


In [7]:
import numpy as np

# 4.1 helper: top-K from SVD
all_items = big_matrix['video_id'].unique().tolist()
def recommend_svd(user_id, K=10):
    scores = [ (iid, algo.predict(user_id, iid).est) for iid in all_items ]
    scores.sort(key=lambda x: x[1], reverse=True)
    return [iid for iid,_ in scores[:K]]

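# 4.2 metric functions
# recs: ranked list of recommended video ids; actual: the user's positive video ids in the test set
def precision_at_k(recs, actual, K):
    # fraction of the top-K recommendations that are relevant
    return len(set(recs[:K]) & set(actual)) / K

def recall_at_k(recs, actual, K):
    # fraction of the user's relevant items that appear in the top-K
    if not actual: return 0.0
    return len(set(recs[:K]) & set(actual)) / len(actual)

def dcg_at_k(recs, actual, K):
    # binary-relevance DCG: a hit at 0-based rank idx contributes 1/log2(idx+2)
    relevant = set(actual)  # constant-time membership tests
    return sum((1 if rec in relevant else 0)/np.log2(idx+2)
               for idx, rec in enumerate(recs[:K]))

def ndcg_at_k(recs, actual, K):
    # normalize by the ideal DCG, where the first min(K, len(actual)) ranks are all hits
    ideal = dcg_at_k(actual, actual, min(K,len(actual)))
    return dcg_at_k(recs, actual, K) / ideal if ideal>0 else 0.0

def average_precision(recs, actual, K):
    # sum of precision@(i+1) over the hit ranks, normalized by min(len(actual), K)
    relevant = set(actual)
    hits, sum_prec = 0, 0.0
    for i, r in enumerate(recs[:K]):
        if r in relevant:
            hits += 1
            sum_prec += hits/(i+1)
    return sum_prec / min(len(actual), K) if actual else 0.0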

# 4.3 accuracy: predict interaction if score >= 0.5
def accuracy_svd(test_df):
    preds = [ algo.predict(u, v).est>=0.5
              for u,v in zip(test_df.user_id, test_df.video_id) ]
    return np.mean((test_df.interaction==1) == preds)
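
A small worked example can make these definitions concrete. The block below re-implements the four ranking metrics with the standard library only (so it runs on its own) and evaluates them on a toy ranking; the item names are arbitrary placeholders.

```python
import math

def precision_at_k(recs, actual, k):
    return len(set(recs[:k]) & set(actual)) / k

def recall_at_k(recs, actual, k):
    return len(set(recs[:k]) & set(actual)) / len(actual) if actual else 0.0

def ndcg_at_k(recs, actual, k):
    relevant = set(actual)
    # A hit at 0-based rank i contributes 1 / log2(i + 2).
    dcg = sum(1.0 / math.log2(i + 2)
              for i, r in enumerate(recs[:k]) if r in relevant)
    # Ideal ranking: the first min(k, |relevant|) positions are all hits.
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

def average_precision(recs, actual, k):
    relevant = set(actual)
    hits, total = 0, 0.0
    for i, r in enumerate(recs[:k]):
        if r in relevant:
            hits += 1
            total += hits / (i + 1)
    return total / min(len(relevant), k) if relevant else 0.0

recs = ["a", "b", "c", "d"]   # ranked recommendations
actual = ["a", "c", "e"]      # relevant items; hits are at ranks 1 and 3

print(f"Precision@4: {precision_at_k(recs, actual, 4):.4f}")    # 2/4           = 0.5000
print(f"Recall@4   : {recall_at_k(recs, actual, 4):.4f}")       # 2/3           = 0.6667
print(f"NDCG@4     : {ndcg_at_k(recs, actual, 4):.4f}")         # 1.5 / 2.1309 = 0.7039
print(f"AP@4       : {average_precision(recs, actual, 4):.4f}") # (1 + 2/3) / 3 = 0.5556
```

Note that recall is capped by the number of relevant items: with three relevant items and two hits it reaches 2/3 here, but a user with far more relevant items than K can never get close to 1.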


## 5. Evaluate on the **small_matrix** (test)

For **each user** in `small_matrix`:
1.  Collect the list of **positives** in the test set.  
2.  Generate top-K recs from **Popularity** and **SVD**.  
3.  Compute all metrics.  
4.  Average across users.


In [None]:
from tqdm import tqdm
import numpy as np

Ks = [1, 10, 100]

users   = small_matrix['user_id'].unique()
actuals = (small_matrix[small_matrix['interaction']==1]
           .groupby('user_id')['video_id']
           .apply(list)
           .to_dict())

# prepare a metrics dict: metrics[K][model][metric_name] = list of scores
metrics = {
    K: {
        'pop': {'prec': [], 'rec': [], 'ndcg': [], 'map': []},
        'svd': {'prec': [], 'rec': [], 'ndcg': [], 'map': []}
    } for K in Ks
}

for K in Ks:
    for u in tqdm(users, desc=f"Evaluating @K={K}"):
        act      = actuals.get(u, [])
        pop_recs = recommend_pop(u, K)
        svd_recs = recommend_svd(u, K)

        # populate pop metrics
        metrics[K]['pop']['prec'].append( precision_at_k(pop_recs,    act, K) )
        metrics[K]['pop']['rec'].append(  recall_at_k(pop_recs,       act, K) )
        metrics[K]['pop']['ndcg'].append(ndcg_at_k(pop_recs,          act, K) )
        metrics[K]['pop']['map'].append(average_precision(pop_recs,   act, K) )

        # populate svd metrics
        metrics[K]['svd']['prec'].append( precision_at_k(svd_recs,    act, K) )
        metrics[K]['svd']['rec'].append(  recall_at_k(svd_recs,       act, K) )
        metrics[K]['svd']['ndcg'].append(ndcg_at_k(svd_recs,          act, K) )
        metrics[K]['svd']['map'].append(average_precision(svd_recs,   act, K) )

for K in Ks:
    print(f"\n=== Results @K={K} ===")
    for model in ('pop', 'svd'):
        print(f"-- {model.upper():4s} --")
        print(f"Precision@{K}: {np.mean(metrics[K][model]['prec']):.4f}")
        print(f"Recall   @{K}: {np.mean(metrics[K][model]['rec']):.4f}")
        print(f"NDCG     @{K}: {np.mean(metrics[K][model]['ndcg']):.4f}")
        print(f"MAP      @{K}: {np.mean(metrics[K][model]['map']):.4f}")

print("\nOverall SVD Accuracy on test set:", accuracy_svd(small_matrix))


Evaluating @K=1: 100%|██████████| 1411/1411 [00:33<00:00, 42.29it/s]
Evaluating @K=10: 100%|██████████| 1411/1411 [00:32<00:00, 42.93it/s]
Evaluating @K=100: 100%|██████████| 1411/1411 [00:36<00:00, 39.15it/s]



=== Results @K=1 ===
-- POP  --
Precision@1: 0.7052
Recall   @1: 0.0007
NDCG     @1: 0.7052
MAP      @1: 0.7052
-- SVD  --
Precision@1: 0.0191
Recall   @1: 0.0000
NDCG     @1: 0.0191
MAP      @1: 0.0191

=== Results @K=10 ===
-- POP  --
Precision@10: 0.7587
Recall   @10: 0.0080
NDCG     @10: 0.7536
MAP      @10: 0.6419
-- SVD  --
Precision@10: 0.0189
Recall   @10: 0.0001
NDCG     @10: 0.0187
MAP      @10: 0.0091

=== Results @K=100 ===
-- POP  --
Precision@100: 0.6947
Recall   @100: 0.0723
NDCG     @100: 0.7040
MAP      @100: 0.5280
-- SVD  --
Precision@100: 0.0121
Recall   @100: 0.0013
NDCG     @100: 0.0129
MAP      @100: 0.0024

Overall SVD Accuracy on test set: 0.7688445945314555


The baseline results reveal a stark contrast between a trivial popularity heuristic and a standard SVD model:  
- The **Popularity** recommender achieves deceptively high **Precision@K** and **NDCG@K** (roughly 0.70 to 0.76 at every K) by surfacing the same globally popular videos to everyone. Its **Recall** stays very low (0.0007 at K=1, 0.0723 at K=100), largely because each test user has on the order of a thousand true positives (1,471,862 positives over 1,411 users), so Recall@K cannot exceed roughly K divided by that count.  
- Conversely, vanilla **SVD** barely recovers any relevant items in the top-K, yielding both low precision (below 0.02) and near-zero recall despite an overall accuracy of ~77%. That accuracy says little on its own: about 67% of the test pairs are negatives, so always predicting 0 would already score about 67%.  

In short, **high precision** here is driven by a small set of “easy” hits, while **low recall** exposes that nearly all genuine preferences lie outside the narrow top-K window. These findings underscore two fundamental limitations: (1) evaluating at small K hides the bulk of user interests, and (2) classical pointwise factorization is ill-suited to capture the long tail of positive signals.  
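
The ceiling on Recall@K can be made concrete with a little arithmetic. The sketch below uses only figures printed earlier in this notebook (1,471,862 test positives and 1,411 evaluated users) and assumes, as a simplification, a hypothetical "average" user. Per-user ceilings will differ, and `max_recall_at_k` is an illustrative helper, not part of the notebook.

```python
def max_recall_at_k(n_positives: int, k: int) -> float:
    """Best possible Recall@K for a user with n_positives relevant items."""
    if n_positives == 0:
        return 0.0
    return min(k, n_positives) / n_positives

# Figures taken from the outputs printed earlier in this notebook.
total_test_positives = 1_471_862
n_test_users = 1_411
avg_positives = total_test_positives / n_test_users  # roughly 1043 per user

for k in (1, 10, 100):
    ceiling = max_recall_at_k(round(avg_positives), k)
    print(f"K={k:>3}: recall ceiling for an average user = {ceiling:.4f}")
```

Under that simplification even a perfect recommender would stay below 0.1 recall at K=100, so the popularity baseline's 0.0723 is close to what is attainable at that cutoff.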

**Key takeaways & next steps**  
- Adopt ranking-centric or implicit-feedback methods (e.g. BPR, ALS) that directly optimize top-N retrieval.  
- Incorporate side information (video tags, user features, social graph) or hard-negative sampling to diversify recommendations and boost recall without sacrificing precision.  
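
To illustrate the difference between pointwise and ranking-centric objectives, here is a minimal sketch of the pairwise loss used by BPR, which trains on (user, positive item, negative item) triples and penalizes the model when the positive item is not scored above the negative one. The function names and scores are illustrative assumptions; a real model would produce the scores from learned user and item factors.

```python
import math

def pointwise_squared_error(score: float, label: float) -> float:
    """Pointwise objective: each (user, item) score is pushed toward its 0/1 label."""
    return (label - score) ** 2

def bpr_loss(score_pos: float, score_neg: float) -> float:
    """Pairwise BPR loss: -log(sigmoid(score_pos - score_neg))."""
    diff = score_pos - score_neg
    return math.log1p(math.exp(-diff))  # equals -log(sigmoid(diff))

# Only the score difference matters for BPR, not the absolute values.
print(bpr_loss(5.0, 4.0) == bpr_loss(1.0, 0.0))

# Ranking the positive above the negative gives a lower loss than the reverse.
print(bpr_loss(2.0, 0.0) < bpr_loss(0.0, 2.0))

# The pointwise loss, by contrast, depends on how close each score is to its label.
print(pointwise_squared_error(0.8, 1.0), pointwise_squared_error(0.8, 0.0))
```

Because the pairwise loss depends only on the relative order of scores, it targets the quantity that top-K metrics such as Precision@K and NDCG@K actually measure.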
