# MovieLens Recommender Walkthrough
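This walkthrough loads the MovieLens 100K ratings and movie metadata, merges them, computes basic statistics, builds train/test rating matrices, computes cosine-based user and item matrices, generates user- and item-based collaborative filtering predictions, evaluates them with RMSE, and finishes with an SVD-based matrix factorization.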


## Setup
Import core libraries: NumPy for numerical ops and Pandas for data handling.


In [30]:
import numpy as np
import pandas as pd

## Load ratings data
Read the user–item ratings file (`u.data`), which is tab-separated and has no header. Columns: user_id, item_id, rating, timestamp.


In [31]:
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=column_names)
print(f"Ratings loaded: {df.shape[0]} rows, {df.shape[1]} columns")
print(df.head())

Ratings loaded: 100000 rows, 4 columns
   user_id  item_id  rating  timestamp
0      196      242       3  881250949
1      186      302       3  891717742
2       22      377       1  878887116
3      244       51       2  880606923
4      166      346       1  886397596


## Load movie metadata and merge
Read `u.item` (pipe-separated, latin-1) with movie info and 19 genre flags, then merge into the ratings dataframe on `item_id`.


In [32]:
movie_titles = pd.read_csv('u.item', sep='|', encoding='latin-1', header=None, 
                           names=['item_id', 'movie_title', 'release_date', 'video_release_date', 
                                  'imdb_url'] + [f'genre_{i}' for i in range(19)])
print(f"Movies loaded: {movie_titles.shape[0]} titles, {movie_titles.shape[1]} columns")

# Merge movie metadata into the ratings dataframe
df = pd.merge(df, movie_titles, on='item_id')
print(f"Merged dataframe shape: {df.shape}")

Movies loaded: 1682 titles, 24 columns
Merged dataframe shape: (100000, 27)


## Compute unique counts and sparsity
Count distinct users/items and compute how sparse the rating matrix is (fraction of missing ratings).


In [33]:
n_users = df.user_id.nunique()  # 943 users
n_items = df.item_id.nunique()  # 1682 movies

# Calculate sparsity
sparsity = 1.0 - len(df)/(n_users * n_items)
# Result: 93.7% sparse (most user-movie pairs have no rating)

## Display basic dataset stats
Print the number of unique users, unique movies, and overall sparsity of the user–item matrix.


In [34]:
print(f"Unique users: {n_users}")
print(f"Unique movies: {n_items}")
print(f"Matrix sparsity: {sparsity:.4f} (~{sparsity*100:.1f}% missing)")

Unique users: 943
Unique movies: 1682
Matrix sparsity: 0.9370 (~93.7% missing)
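
To make the sparsity formula concrete, here is a minimal sketch on a toy matrix. The array `toy_ratings` and its values are illustrative assumptions, not part of the dataset; a zero stands for "not rated", as in the rest of this walkthrough.

```python
import numpy as np

# Toy ratings matrix: 3 users x 4 items, 0 means "not rated" (illustrative data only)
toy_ratings = np.array([
    [5, 0, 0, 1],
    [0, 3, 0, 0],
    [0, 0, 0, 4],
])

n_observed = np.count_nonzero(toy_ratings)    # 4 ratings are present
n_possible = toy_ratings.size                 # 3 * 4 = 12 possible user-item pairs
toy_sparsity = 1.0 - n_observed / n_possible  # fraction of pairs with no rating

print(f"Observed: {n_observed}, possible: {n_possible}, sparsity: {toy_sparsity:.4f}")
# Observed: 4, possible: 12, sparsity: 0.6667
```

The same arithmetic on the real data gives 1 - 100000 / (943 * 1682), which rounds to 0.9370.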


## Train/test split
Split the merged ratings into training (70%) and test (30%) sets for offline evaluation, using a fixed random seed so the split is reproducible.


In [35]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.30, random_state=42)
print(f"Train size: {len(train_data)}, Test size: {len(test_data)}")

Train size: 70000, Test size: 30000
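
The idea behind a seeded random split can be sketched with NumPy alone. This is not how `train_test_split` is implemented internally; `seeded_split` is a hypothetical helper that only illustrates why fixing the seed makes the partition repeatable.

```python
import numpy as np

def seeded_split(n_rows, test_fraction, seed):
    """Return (train_indices, test_indices) from a seeded shuffle of row positions."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)           # random order of 0..n_rows-1
    n_test = int(round(n_rows * test_fraction))  # number of held-out rows
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = seeded_split(n_rows=10, test_fraction=0.30, seed=42)
print(len(train_idx), len(test_idx))  # 7 3

# Calling again with the same seed reproduces the same partition
train_again, test_again = seeded_split(n_rows=10, test_fraction=0.30, seed=42)
print(np.array_equal(train_idx, train_again))  # True
```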


## Build training rating matrix
Create a dense user–item matrix from the training split; unrated pairs stay at zero.


In [36]:
# Training matrix: 943 users × 1682 movies (IDs are 1-based, so subtract 1 to index)
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    train_data_matrix[line[1]-1, line[2]-1] = line[3]

print(f"Training matrix shape: {train_data_matrix.shape}")
print(f"Non-zero training entries: {np.count_nonzero(train_data_matrix)}")

Training matrix shape: (943, 1682)
Non-zero training entries: 70000
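
The row-by-row loop can also be written as a single indexed assignment. The sketch below uses small hypothetical arrays (`user_ids`, `item_ids`, `ratings`) in place of the dataframe columns, and assumes, as the loop above does, that IDs are 1-based and contiguous and that each user-item pair appears at most once.

```python
import numpy as np

# Hypothetical miniature of the ratings columns (1-based IDs, as in u.data)
user_ids = np.array([1, 1, 2, 3])
item_ids = np.array([1, 3, 2, 3])
ratings = np.array([5.0, 2.0, 4.0, 1.0])
toy_n_users, toy_n_items = 3, 3

# Loop version, mirroring the cell above
loop_matrix = np.zeros((toy_n_users, toy_n_items))
for u, i, r in zip(user_ids, item_ids, ratings):
    loop_matrix[u - 1, i - 1] = r

# Vectorized version: one fancy-indexing assignment fills every rated cell
vec_matrix = np.zeros((toy_n_users, toy_n_items))
vec_matrix[user_ids - 1, item_ids - 1] = ratings

print(np.array_equal(loop_matrix, vec_matrix))  # True
```

On the real data the equivalent would index with the `user_id`, `item_id`, and `rating` columns converted to NumPy arrays.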


## Build test rating matrix
Construct a dense user–item matrix for the held-out test set; unobserved pairs remain zeros.


In [37]:
# Test matrix: same structure
test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]

print(f"Test matrix shape: {test_data_matrix.shape}")
print(f"Non-zero test entries: {np.count_nonzero(test_data_matrix)}")

Test matrix shape: (943, 1682)
Non-zero test entries: 30000


## Compute similarity matrices
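Compute user–user and item–item matrices from the training matrix with `pairwise_distances(..., metric='cosine')`. Note that this function returns cosine distance (1 − cosine similarity), so the arrays named `user_similarity` and `item_similarity` hold distances: two identical rating vectors score 0, not 1. The later cells use these values directly as the weights in `predict`.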


In [38]:
from sklearn.metrics.pairwise import pairwise_distances

user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')

print(f"User similarity matrix shape: {user_similarity.shape}")
print(f"Item similarity matrix shape: {item_similarity.shape}")

User similarity matrix shape: (943, 943)
Item similarity matrix shape: (1682, 1682)
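
The relationship between cosine similarity and cosine distance is easy to check on a toy matrix. The sketch below computes similarity with NumPy only; `cosine_similarity_matrix` is a hypothetical helper, and its handling of all-zero rows (similarity 0) is an assumption of this sketch.

```python
import numpy as np

def cosine_similarity_matrix(m):
    """Row-wise cosine similarity; all-zero rows are given similarity 0."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    safe_norms = np.where(norms == 0, 1.0, norms)  # avoid division by zero
    unit_rows = m / safe_norms
    return unit_rows @ unit_rows.T

toy = np.array([
    [5.0, 0.0, 1.0],
    [5.0, 0.0, 1.0],  # identical to row 0
    [0.0, 4.0, 0.0],  # shares no rated item with rows 0 and 1
])

sim = cosine_similarity_matrix(toy)
dist = 1.0 - sim  # the quantity a cosine *distance* metric reports

print(np.round(sim, 3))
print(np.round(dist, 3))
```

Rows 0 and 1 have similarity 1 and distance 0, while rows 0 and 2 have similarity 0 and distance 1. If similarity weights are wanted, `1 - pairwise_distances(..., metric='cosine')` gives them.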


## Prediction helper
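Given a ratings matrix and a similarity matrix, `predict` returns a full matrix of predicted ratings. The user-based branch (the default) mean-centers each user's ratings to correct for user bias before taking the similarity-weighted average; the item-based branch takes a weighted average of the user's ratings on other items without centering. Because unrated entries are stored as zeros, they are included in the per-user means and in the weighted sums.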


In [39]:
def predict(ratings, similarity, type='user'):
    if type == 'user':
        # Adjust for user's average rating bias
        mean_user_rating = ratings.mean(axis=1)
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        
        # Weighted average of similar users' ratings
        pred = mean_user_rating[:, np.newaxis] + \
               similarity.dot(ratings_diff) / \
               np.array([np.abs(similarity).sum(axis=1)]).T
               
    elif type == 'item':
        # Weighted average of similar items' ratings
        pred = ratings.dot(similarity) / \
               np.array([np.abs(similarity).sum(axis=1)])
    
    return pred
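
A small worked example may help in following the user-based branch. The matrices below are illustrative assumptions: two users, three items, and a hypothetical weight matrix in which each user weights both users equally.

```python
import numpy as np

ratings = np.array([
    [4.0, 2.0, 0.0],  # user 0 (0 = unrated, but it still enters the mean)
    [5.0, 1.0, 3.0],  # user 1
])
# Hypothetical weights: every user weights both users equally
weights = np.array([
    [1.0, 1.0],
    [1.0, 1.0],
])

user_mean = ratings.mean(axis=1, keepdims=True)  # [[2.], [3.]]
centered = ratings - user_mean                   # deviations from each user's mean
weighted = (weights @ centered) / np.abs(weights).sum(axis=1, keepdims=True)
pred = user_mean + weighted                      # add each user's own mean back

print(pred)
# [[4. 1. 1.]
#  [5. 2. 2.]]
```

Both users receive the same averaged deviation `[2, -1, -1]`, shifted by their own mean, which is the bias correction the user-based branch applies.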

## Generate user- and item-based predictions
Apply the similarity matrices to the training data to create full prediction matrices for both user-based and item-based collaborative filtering.


In [40]:
# Generate predictions
item_prediction = predict(train_data_matrix, item_similarity, type='item')
user_prediction = predict(train_data_matrix, user_similarity, type='user')

print(f"User-based prediction matrix shape: {user_prediction.shape}")
print(f"Item-based prediction matrix shape: {item_prediction.shape}")

User-based prediction matrix shape: (943, 1682)
Item-based prediction matrix shape: (943, 1682)


## Evaluation metric (RMSE)
Define a helper that computes Root Mean Squared Error only on ratings that exist in the ground-truth test matrix.


In [41]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    # Score only the entries that were actually rated in the test set
    rated = ground_truth.nonzero()
    # scikit-learn's argument order is (y_true, y_pred)
    return sqrt(mean_squared_error(ground_truth[rated], prediction[rated]))

# Results of rmse(<prediction>, test_data_matrix) on the held-out ratings:
# User-based CF RMSE: 3.13 (typical error of about 3 points on the 1-5 scale)
# Item-based CF RMSE: 3.46 (higher error than user-based)
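
The masking step is the important part of this metric, so here is a minimal NumPy-only sketch of it. `masked_rmse` is a hypothetical stand-in for the `rmse` helper above, and the two small matrices are made-up values.

```python
import numpy as np

def masked_rmse(prediction, ground_truth):
    """RMSE over the cells where ground_truth holds a rating (non-zero)."""
    mask = ground_truth != 0
    errors = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

truth = np.array([
    [5.0, 0.0],
    [0.0, 3.0],
])
pred = np.array([
    [4.0, 2.5],
    [1.0, 5.0],
])

# Only two cells are scored: (4 - 5) and (5 - 3); the other predictions are ignored
print(masked_rmse(pred, truth))  # sqrt((1 + 4) / 2) = sqrt(2.5), about 1.58
```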

## Matrix factorization with SVD
Use truncated SVD to learn 20 latent factors, reconstruct the full rating matrix, and evaluate with RMSE on the test set.


In [42]:
from scipy.sparse.linalg import svds

# Decompose training matrix into 20 hidden features
u, s, vt = svds(train_data_matrix, k=20)

# Reconstruct matrix to predict all ratings
s_diag_matrix = np.diag(s)
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)

# Evaluate
print('SVD RMSE: ' + str(rmse(X_pred, test_data_matrix)))
# Result: about 2.80 (lower than both memory-based methods)

SVD RMSE: 2.8007957748553167
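
To see what a rank-k reconstruction does, the sketch below applies the same idea to a 4 × 4 toy matrix. It swaps in `np.linalg.svd` (a full SVD, truncated by slicing) for `svds`, and `rank_k_approximation` is a hypothetical helper, so treat it as an illustration rather than the cell's exact computation.

```python
import numpy as np

def rank_k_approximation(matrix, k):
    """Keep the k largest singular values and rebuild the matrix from them."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)  # s is sorted descending
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# Illustrative ratings: users 0-1 like items 0-1, users 2-3 like items 2-3
toy = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

approx = rank_k_approximation(toy, k=2)
print(np.round(approx, 2))
print("rank:", np.linalg.matrix_rank(approx))
print("reconstruction error:", round(float(np.linalg.norm(toy - approx)), 4))
```

The zero cells of `toy` receive non-zero values in `approx`; in the walkthrough those filled-in values are what get scored against the held-out test ratings.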
