## 1. Load and Prepare the Data

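First, we load the ratings data (`u.data`) and movie titles (`u.item`). The core of collaborative filtering is the **user-item matrix**, where rows represent users, columns represent movies, and the values are the ratings. We will create this matrix using a `pivot_table`. Because we pivot on `title`, any movies that share a title are merged into a single column, and `pivot_table` averages their ratings by default.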

In [1]:
import pandas as pd
import numpy as np

# Define column names for the ratings data
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('u.data', sep='\t', names=r_cols, encoding='latin-1')

# Define column names for the movie titles data
m_cols = ['movie_id', 'title']
movies = pd.read_csv('u.item', sep='|', names=m_cols, usecols=range(2), encoding='latin-1')

# Merge the two dataframes
data = pd.merge(ratings, movies, on='movie_id')

# Create the user-item matrix
user_item_matrix = data.pivot_table(index='user_id', columns='title', values='rating')

print("Shape of user-item matrix:", user_item_matrix.shape)
user_item_matrix.head()

Shape of user-item matrix: (943, 1664)


| user_id | 'Til There Was You (1997) | 1-900 (1994) | 101 Dalmatians (1996) | 12 Angry Men (1957) | 187 (1997) | 2 Days in the Valley (1996) | 20,000 Leagues Under the Sea (1954) | 2001: A Space Odyssey (1968) | 3 Ninjas: High Noon At Mega Mountain (1998) | 39 Steps, The (1935) | ... | Yankee Zulu (1994) | Year of the Horse (1997) | You So Crazy (1994) | Young Frankenstein (1974) | Young Guns (1988) | Young Guns II (1990) | Young Poisoner's Handbook, The (1995) | Zeus and Roxanne (1997) | unknown | Á köldum klaka (Cold Fever) (1994) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | 2.0 | 5.0 | NaN | NaN | 3.0 | 4.0 | NaN | NaN | ... | NaN | NaN | NaN | 5.0 | 3.0 | NaN | NaN | NaN | 4.0 | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN | ... | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | 4.0 | NaN |
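
As the preview suggests, most cells in a user-item matrix are empty. One way to quantify this is to count the non-missing cells and divide by the total number of cells. The sketch below does that on a small, made-up ratings table (the `toy` frame and its values are hypothetical, not taken from MovieLens); the same two lines should apply unchanged to `user_item_matrix`.

```python
import pandas as pd

# Hypothetical toy ratings in long format (not MovieLens data)
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "title": ["A", "B", "B", "A", "C", "D"],
    "rating": [5, 3, 4, 2, 5, 1],
})

# Same pivot as in the notebook: users as rows, titles as columns
toy_matrix = toy.pivot_table(index="user_id", columns="title", values="rating")

# Density = share of cells that actually hold a rating
n_rated = int(toy_matrix.notna().sum().sum())
density = n_rated / toy_matrix.size
print(f"{n_rated} of {toy_matrix.size} cells are rated (density {density:.0%})")
```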


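The resulting matrix is very **sparse**: most users have not rated most movies, so most entries are `NaN`. `TruncatedSVD` in `scikit-learn` cannot handle `NaN` values, so we first center each user's ratings by subtracting that user's mean rating and then fill the remaining `NaN` entries with `0`. After centering, a `0` stands for "this user's average rating", which is a more neutral placeholder than a raw rating of 0.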

In [2]:
# Normalize the matrix by subtracting the mean rating of each user
user_item_matrix_normalized = user_item_matrix.subtract(user_item_matrix.mean(axis=1), axis='rows')

# Fill NaNs with 0
user_item_matrix_filled = user_item_matrix_normalized.fillna(0)
user_item_matrix_filled.head()

| user_id | 'Til There Was You (1997) | 1-900 (1994) | 101 Dalmatians (1996) | 12 Angry Men (1957) | 187 (1997) | 2 Days in the Valley (1996) | 20,000 Leagues Under the Sea (1954) | 2001: A Space Odyssey (1968) | 3 Ninjas: High Noon At Mega Mountain (1998) | 39 Steps, The (1935) | ... | Yankee Zulu (1994) | Year of the Horse (1997) | You So Crazy (1994) | Young Frankenstein (1974) | Young Guns (1988) | Young Guns II (1990) | Young Poisoner's Handbook, The (1995) | Zeus and Roxanne (1997) | unknown | Á köldum klaka (Cold Fever) (1994) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.0 | -1.605166 | 1.394834 | 0.0 | 0.0 | -0.605166 | 0.394834 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.394834 | -0.605166 | 0.0 | 0.0 | 0.0 | 0.394834 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -2.704918 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | -0.773585 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | -0.874286 | 0.0 | 0.0 | 0.0 | 0.0 | 1.125714 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.125714 | 0.0 | 0.0 | 0.0 | 0.0 | 1.125714 | 0.0 |
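
One consequence of centering before filling is worth spelling out: a `0` in the centered matrix corresponds to the user's own mean rating in the original scale. The following sketch checks this on a tiny hypothetical matrix (the `toy` frame is made up for illustration) by comparing "center, then fill with 0" against "fill with the user mean, then center".

```python
import numpy as np
import pandas as pd

# Hypothetical 3-user x 3-movie matrix with missing ratings
toy = pd.DataFrame(
    {"A": [5.0, np.nan, 2.0], "B": [3.0, 4.0, np.nan], "C": [np.nan, 2.0, 5.0]},
    index=pd.Index([1, 2, 3], name="user_id"),
)
user_means = toy.mean(axis=1)  # NaNs are skipped by default

# Route 1 (as in the notebook): center each row, then fill NaN with 0
centered_filled = toy.sub(user_means, axis=0).fillna(0)

# Route 2: impute each missing cell with the user's mean, then center
imputed = pd.DataFrame(
    np.where(toy.isna(), user_means.to_numpy()[:, None], toy.to_numpy()),
    index=toy.index,
    columns=toy.columns,
)
imputed_centered = imputed.sub(user_means, axis=0)

# Both routes give the same matrix
print(np.allclose(centered_filled, imputed_centered))
```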


## 2. Matrix Factorization with Truncated SVD

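We use `TruncatedSVD` from `scikit-learn` to approximate our user-item matrix with a low-rank factorization. A full SVD factors a matrix into three matrices (user factors, singular values, and movie factors); `TruncatedSVD` keeps only the largest `n_components` singular values, returns the user factors already scaled by those singular values from `transform`, and stores the movie factors in `components_`. The resulting latent features often line up with things like genre or theme, but they are learned from the ratings alone and are not guaranteed to be interpretable.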
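
To make the factorization concrete, here is a minimal sketch on a small random matrix (hypothetical data, chosen only for illustration). It builds a rank-`k` approximation from NumPy's full SVD and checks that `TruncatedSVD` produces the same approximation. The `algorithm="arpack"` setting is used here only because it is effectively exact on a tiny matrix; it is not what the notebook uses.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # hypothetical stand-in for a ratings matrix
k = 2

# Full SVD: X = U @ diag(s) @ Vt, then keep the k largest singular values
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = (U[:, :k] * s[:k]) @ Vt[:k]

# TruncatedSVD: fit_transform returns U_k * s_k, components_ holds Vt_k
svd_toy = TruncatedSVD(n_components=k, algorithm="arpack", random_state=0)
user_features_toy = svd_toy.fit_transform(X)
X_k_sklearn = user_features_toy @ svd_toy.components_

print(np.allclose(X_k, X_k_sklearn, atol=1e-6))
```

Individual factors may differ in sign between the two routes, which is why the comparison is made on the reconstructed matrix rather than on the factors themselves.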



The parameter `n_components` determines the number of latent features to find. It's a hyperparameter that can be tuned.

In [3]:
from sklearn.decomposition import TruncatedSVD

# Instantiate TruncatedSVD
svd = TruncatedSVD(n_components=50, random_state=42)

# Fit SVD on the user-item matrix
svd.fit(user_item_matrix_filled)

# Transform the data
user_features = svd.transform(user_item_matrix_filled)

print("Shape of user features matrix:", user_features.shape)

Shape of user features matrix: (943, 50)
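
A rough way to get a feel for `n_components` is to look at how much variance the retained components explain as the number grows. The sketch below does this on a synthetic matrix (a made-up rank-5 signal plus noise, standing in for the centered ratings); on the real data one would loop over `user_item_matrix_filled` instead. Explained variance only describes reconstruction quality, so it is a starting point rather than a substitute for held-out evaluation.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical data: rank-5 structure plus a little noise
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 80))
X += 0.1 * rng.normal(size=X.shape)

explained = {}
for k in (2, 5, 10, 20):
    model = TruncatedSVD(n_components=k, random_state=42).fit(X)
    # Sum of per-component ratios = share of total variance retained
    explained[k] = float(model.explained_variance_ratio_.sum())
    print(f"n_components={k:>2}: explained variance ratio = {explained[k]:.3f}")
```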


## 3. Reconstruct Ratings and Evaluate the Model

After decomposition, we can reconstruct the full ratings matrix by multiplying the resulting matrices back together. This new matrix contains predicted ratings for every user-movie pair.

We will evaluate our model by calculating the **Root Mean Squared Error (RMSE)** between our predicted ratings and the actual ratings that were present in the original dataset.

In [4]:
from sklearn.metrics import mean_squared_error

# Reconstruct the ratings matrix
predicted_ratings_normalized = np.dot(user_features, svd.components_)

# Add back the user mean to get the final predicted ratings
predicted_ratings = pd.DataFrame(predicted_ratings_normalized, index=user_item_matrix.index, columns=user_item_matrix.columns) \
                       .add(user_item_matrix.mean(axis=1), axis='rows')
                       
# --- Evaluation ---
# Get original ratings that are not NaN
original_ratings = user_item_matrix.stack().reset_index()
original_ratings.columns = ['user_id', 'title', 'rating']

# Get predicted ratings for the same user-movie pairs
predicted_ratings_stacked = predicted_ratings.stack().reset_index()
predicted_ratings_stacked.columns = ['user_id', 'title', 'predicted_rating']

# Merge original and predicted ratings
evaluation_df = pd.merge(original_ratings, predicted_ratings_stacked, on=['user_id', 'title'])

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(evaluation_df['rating'], evaluation_df['predicted_rating']))

print(f"Model RMSE: {rmse:.4f}")

Model RMSE: 0.7263


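This RMSE value tells us how well our model can reconstruct the ratings it was trained on. A lower value is better. Because every rating used in this comparison was also used to fit the model, the figure is a training error and is likely an optimistic estimate of how well the model predicts unseen ratings. A proper evaluation would involve a train-test split, but this method is sufficient for demonstrating the mechanism.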
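
A held-out evaluation of the kind mentioned above could look roughly like the following. This is a sketch on synthetic ratings, not the MovieLens data: the matrix sizes, noise level, 20% hold-out fraction, and `n_components=5` are all assumptions chosen for illustration. The key point is that the held-out cells are hidden before computing user means and fitting the SVD.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
n_users, n_items = 60, 40

# Hypothetical ratings: low-rank preferences plus noise, rounded to 1..5
signal = 3 + rng.normal(size=(n_users, 3)) @ rng.normal(size=(3, n_items))
ratings = np.clip(np.rint(signal + 0.3 * rng.normal(size=signal.shape)), 1, 5)

# About half of the cells are "rated"; roughly 20% of those are held out
observed = rng.random((n_users, n_items)) < 0.5
test_mask = observed & (rng.random((n_users, n_items)) < 0.2)
train_mask = observed & ~test_mask

# Center with user means computed from training ratings only
# (assumes every user keeps at least one training rating)
train = np.where(train_mask, ratings, np.nan)
user_means = np.nanmean(train, axis=1, keepdims=True)
centered = np.where(train_mask, train - user_means, 0.0)

# Fit on the training matrix and reconstruct predictions for all cells
svd_split = TruncatedSVD(n_components=5, random_state=42)
features = svd_split.fit_transform(centered)
predicted = features @ svd_split.components_ + user_means

def rmse(mask):
    """RMSE over the cells selected by a boolean mask."""
    return float(np.sqrt(np.mean((ratings[mask] - predicted[mask]) ** 2)))

train_rmse, test_rmse = rmse(train_mask), rmse(test_mask)
print(f"train RMSE: {train_rmse:.4f}, held-out RMSE: {test_rmse:.4f}")
```

The held-out RMSE is typically higher than the training RMSE, and it is the more honest number to compare across values of `n_components`.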

## 4. Generate Top-N Movie Recommendations

Now for the fun part! We can use our reconstructed ratings matrix to provide personalized recommendations. The logic is:
1.  Take a `user_id` as input.
2.  Get all predicted ratings for that user from our final `predicted_ratings` matrix.
3.  Remove movies the user has already rated.
4.  Sort the remaining movies by their predicted rating and return the top `N`.

In [5]:
def get_recommendations(user_id, n=10):
    """Generate top N movie recommendations for a given user."""
    # Get the user's predicted ratings
    user_predictions = predicted_ratings.loc[user_id].sort_values(ascending=False)
    
    # Get the movies the user has already rated
    user_rated_movies = user_item_matrix.loc[user_id].dropna().index
    
    # Filter out movies the user has already rated
    recommendations = user_predictions[~user_predictions.index.isin(user_rated_movies)]
    
    return recommendations.head(n)

# --- Get recommendations for a sample user ---
sample_user_id = 100
top_10_movies = get_recommendations(sample_user_id, n=10)

print(f"Top 10 Movie Recommendations for User {sample_user_id}:")
display(top_10_movies)

Top 10 Movie Recommendations for User 100:


title
Braveheart (1995)                         3.347183
Devil's Advocate, The (1997)              3.322446
Ransom (1996)                             3.314819
Beavis and Butt-head Do America (1996)    3.283486
In the Line of Fire (1993)                3.278534
Schindler's List (1993)                   3.278312
Time to Kill, A (1996)                    3.276504
Saint, The (1997)                         3.262674
Twelve Monkeys (1995)                     3.256012
Sound of Music, The (1965)                3.255490
Name: 100, dtype: float64

## Conclusion

We have successfully built a recommendation system from the ground up using matrix factorization. By decomposing the user-item matrix, we were able to predict ratings for unseen movies and generate a personalized list of recommendations.

**Potential Improvements:**
* **Hyperparameter Tuning**: The number of components in SVD (`n_components`) is critical. Experimenting with different values can significantly impact performance.
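* **Different Matrix Filling Strategy**: Because the ratings are mean-centered before filling, the `0` fill used here is equivalent to imputing each user's own average rating. Alternatives include imputing the movie's average rating or the global average rating.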
* **Proper Train-Test Split**: For a more robust evaluation, the data should be split into training and testing sets *before* creating the user-item matrix.