## ALS Model for Implicit Feedback

We now implement **Alternating Least Squares (ALS)** to factorize our implicit, binary interaction data into user and item latent factors.  
ALS is optimized for **Top-N recommendation** with confidence weighting, making it well-suited to our short‐video watch/rewatch signals.


In [33]:
# !pip install implicit

import pandas as pd
import numpy as np
from scipy.sparse import coo_matrix
from implicit.als import AlternatingLeastSquares


## Load and clean every important files

*Applying `eval` to each entry in the corresponding column, converting string representations of Python literals into their actual Python objects.*

Here, cleaning is quite straightforward: it is just about removing null values and duplicates.

In [34]:
from load_clean import load_and_clean_data

big_matrix, small_matrix, social_network, item_categories, user_features, item_daily_features = load_and_clean_data()

Loading big and small matrices...
Loading social network...
Loading item features...
Loading user features...
Loading items' daily features...
All data loaded.
Cleaning data...
Data cleaned.
Big matrix: 7.71% cleaned
Small matrix: 3.89% cleaned
Social network: 0.00% cleaned
Item categories: 0.00% cleaned
User features: 3.86% cleaned
Item daily features: 30.11% cleaned


### Pre-filter Data

We consider **positive** interactions when `watch_ratio >= 2`, according to KuaiRec Paper, then map `user_id` and `video_id` into zero-based matrix indices.


In [64]:
train_df = big_matrix.copy()
train_df['interaction'] = (train_df['watch_ratio'] >= 2).astype(int)

# map to indices
user_ids = train_df['user_id'].unique().tolist()
item_ids = train_df['video_id'].unique().tolist()
user2idx = {u:i for i,u in enumerate(user_ids)}
item2idx = {v:i for i,v in enumerate(item_ids)}

n_items = len(item_ids)
n_users = len(user_ids)

train_df['u_idx'] = train_df['user_id'].map(user2idx)
train_df['i_idx'] = train_df['video_id'].map(item2idx)

print(f"Users: {len(user_ids)}, Items: {len(item_ids)}, Interactions: {len(train_df)}")

Users: 7176, Items: 10728, Interactions: 11564987


### 6.3 Build the Confidence Matrix

We follow the **implicit feedback** formulation:  
\[
C_{u,i} = 1 + \alpha \times R_{u,i},
\]  
where \(R_{u,i}\in\{0,1\}\) is our binary ‘interaction’, and \(\alpha\) scales confidence.


In [65]:
# Build confidence matrix: implicit format expects (item x user)
alpha = 40
# rows = items, cols = users (implicit expects item-user matrix)
conf_mat = coo_matrix(
    (1 + alpha * train_df['interaction'],
     (train_df['i_idx'], train_df['u_idx'])),
    shape=(n_items, n_users)
)

print(f"Sparse matrix shape: {conf_mat.shape}")

Sparse matrix shape: (10728, 7176)


### 6.4 Train the ALS Model

We fit ALS on the **item-user** sparse matrix.  
Key hyperparameters:
- `factors=50`: latent dimension  
- `regularization=0.1`: prevents overfitting  
- `iterations=15`: number of ALS sweeps  


In [66]:
# convert to CSR for speed
conf_csr = conf_mat.tocsr()

# initialize & fit ALS
als = AlternatingLeastSquares(
    factors=50,
    regularization=0.1,
    iterations=15,
    dtype=np.float32,
    use_gpu=False
)
als.fit(conf_csr)


100%|██████████| 15/15 [00:24<00:00,  1.63s/it]


### 6.5 Generate Top-K Recommendations

We use ALS’s `.recommend()` to produce Top-K recs **excluding** items the user already interacted with.


In [69]:
# build user_items (for filtering)
user_items = conf_mat.T.tocsr() 

def recommend_als(user_id, K=10):
    uidx = user2idx[user_id]
    print("conf_mat shape:", conf_mat.shape)      # should be (10728, 7176)
    print("user_items shape:", user_items.shape)  # should be (7176, 10728)

    row = user_items[uidx]
    print("row shape:", row.shape) 
    recs = als.recommend(
        userid=uidx,
        user_items=row,
        N=K,
        filter_already_liked_items=True
    )
    print(f"Recommended items for user {uidx}: {recs}")
    return [item_ids[i] for i in recs[0]]


### 6.6 Evaluate ALS with Top-K Metrics

We reuse our evaluation functions (Precision@K, Recall@K, NDCG@K, MAP@K)  
on the **test set** (`small_matrix.csv`), mapped into the same index space.


In [70]:
test_df = small_matrix.copy()
test_df['interaction'] = (test_df['watch_ratio'] >= 2).astype(int)
# keep only users/items seen in train
test_df = test_df[test_df['user_id'].isin(user2idx) & test_df['video_id'].isin(item2idx)]
test_df['u_idx'] = test_df['user_id'].map(user2idx)
test_df['i_idx'] = test_df['video_id'].map(item2idx)

# build ground truth
gt = test_df[test_df['interaction']==1].groupby('u_idx')['i_idx'].apply(set).to_dict()

# metrics definitions
import math
def precision_at_k(recs, actual, k):
    return len(set(recs[:k]) & set(actual))/k
def recall_at_k(recs, actual, k):
    return len(set(recs[:k]) & set(actual))/len(actual) if actual else 0
def dcg_at_k(recs, actual, k):
    return sum((1 if r in actual else 0)/math.log2(i+2) for i,r in enumerate(recs[:k]))
def ndcg_at_k(recs, actual, k):
    idcg = sum(1/math.log2(i+2) for i in range(min(len(actual),k)))
    return dcg_at_k(recs,actual,k)/idcg if idcg>0 else 0
def map_at_k(recs, actual, k):
    hits=0; sum_prec=0
    for i,r in enumerate(recs[:k]):
        if r in actual:
            hits+=1
            sum_prec+=hits/(i+1)
    return sum_prec/min(len(actual),k) if actual else 0

# evaluate
K=10
metrics = {'prec':[], 'rec':[], 'ndcg':[], 'map':[]}
for uidx, actual in gt.items():
    recs = [ item2idx.get(v,-1) for v in recommend_als(idx2user[uidx], K) ]
    metrics['prec'].append( precision_at_k(recs,actual,K) )
    metrics['rec' ].append( recall_at_k(recs,actual,K) )
    metrics['ndcg'].append( ndcg_at_k(recs,actual,K) )
    metrics['map' ].append( map_at_k(recs,actual,K) )

print("ALS @K=10:")
print(f"Precision@10: {np.mean(metrics['prec']):.4f}")
print(f"Recall@10   : {np.mean(metrics['rec'] ):.4f}")
print(f"NDCG@10     : {np.mean(metrics['ndcg']):.4f}")
print(f"MAP@10      : {np.mean(metrics['map'] ):.4f}")


conf_mat shape: (10728, 7176)
user_items shape: (7176, 10728)
row shape: (1, 10728)
Recommended items for user 14: (array([1863, 1399,  993, 4409, 2709,  600, 4084,  333,  734, 4964],
      dtype=int32), array([1.2909622, 1.2656503, 1.2611493, 1.2561815, 1.2517598, 1.2426412,
       1.2396802, 1.2316259, 1.2299379, 1.2267133], dtype=float32))
conf_mat shape: (10728, 7176)
user_items shape: (7176, 10728)
row shape: (1, 10728)
Recommended items for user 19: (array([ 600, 6721, 6633, 1821,  958, 5723, 4007, 3911, 7045, 5923],
      dtype=int32), array([1.3828309, 1.3656793, 1.3581164, 1.3182485, 1.3064888, 1.2996873,
       1.2876625, 1.2856709, 1.28192  , 1.280205 ], dtype=float32))
conf_mat shape: (10728, 7176)
user_items shape: (7176, 10728)
row shape: (1, 10728)
Recommended items for user 21: (array([6454, 4946, 6884, 1991, 1745,  826, 4731, 6341, 5821, 6784],
      dtype=int32), array([1.2248569, 1.1918578, 1.1856631, 1.1761225, 1.1646575, 1.1590784,
       1.1578197, 1.1576061, 1.15

IndexError: index 7224 is out of bounds for axis 1 with size 7176