# Movie Recommendation System

## Project Overview

This project demonstrates how to build a robust movie recommendation system using the classic MovieLens 1M dataset. The analysis moves beyond a single modeling technique to explore and build a **Content-Based** model, a **Collaborative Filtering** model, and a **Hybrid System** that combines the strengths of both.

## The Challenge: Sparsity & The "Cold Start" Problem

A primary challenge for recommendation engines is data sparsity: most users have rated only a tiny fraction of the available movies. This project explores how different models cope with sparsity and with the related "cold start" problem (recommending to new users, or recommending new items, that have no rating history yet).

### 1. Library Imports and Setup

In [3]:
# Install the 'surprise' library for recommendation systems if it is not already installed.
# (Uncomment the line below to run the install from inside the notebook.)
# %pip install scikit-surprise

In [4]:
# Import all necessary libraries.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Reader, Dataset, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy
from collections import defaultdict

### 2. Data Loading and Preparation

In [6]:
# --- Define Column Names ---
# Create headers, as the .dat files do not have them.
m_cols = ['MovieID', 'Title', 'Genres']
r_cols = ['UserID', 'MovieID', 'Rating', 'Timestamp']
u_cols = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']

# --- Load the Datasets ---
# Read the .dat files using '::' as the separator.
movies = pd.read_csv('movies.dat', sep='::', names=m_cols, engine='python', encoding='latin-1')
ratings = pd.read_csv('ratings.dat', sep='::', names=r_cols, engine='python', encoding='latin-1')
users = pd.read_csv('users.dat', sep='::', names=u_cols, engine='python', encoding='latin-1')

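# --- Merge DataFrames ---
# Combine all data into a single DataFrame for easier analysis.
# The join keys are stated explicitly: movies and ratings share 'MovieID', ratings and users share 'UserID'.
movie_ratings = pd.merge(movies, ratings, on='MovieID')
df = pd.merge(movie_ratings, users, on='UserID')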

print("--- Merged DataFrame Head ---")
print(df.head())

--- Merged DataFrame Head ---
   MovieID             Title                       Genres  UserID  Rating  \
0        1  Toy Story (1995)  Animation|Children's|Comedy       1       5   
1        1  Toy Story (1995)  Animation|Children's|Comedy       6       4   
2        1  Toy Story (1995)  Animation|Children's|Comedy       8       4   
3        1  Toy Story (1995)  Animation|Children's|Comedy       9       5   
4        1  Toy Story (1995)  Animation|Children's|Comedy      10       5   

   Timestamp Gender  Age  Occupation Zip-code  
0  978824268      F    1          10    48067  
1  978237008      F   50           9    55117  
2  978233496      M   25          12    11413  
3  978225952      M   25          17    61614  
4  978226474      F   35           1    95370  


### 3. Exploratory Data Analysis: The User-Item Matrix

In [8]:
# --- Create the User-Item Matrix ---
# This matrix will have users as rows, movies as columns, and ratings as values.
# It is fundamental for our recommendation algorithms.
user_item_matrix = df.pivot_table(index='UserID', columns='Title', values='Rating')

print("\n--- User-Item Matrix (first 5 rows/columns) ---")
print(user_item_matrix.iloc[:5, :5])

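# --- Calculate Sparsity ---
# Sparsity is the fraction of user-movie cells that are empty (NaN values).
# Count the filled cells with notna(); NaN cells are the missing ratings.
n_rated = user_item_matrix.notna().sum().sum()
sparsity = 1.0 - (n_rated / user_item_matrix.size)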
print(f"\nShape of the User-Item Matrix: {user_item_matrix.shape}")
print(f"Sparsity of the matrix: {sparsity:.2%}")


--- User-Item Matrix (first 5 rows/columns) ---
Title   $1,000,000 Duck (1971)  'Night Mother (1986)  \
UserID                                                 
1                          NaN                   NaN   
2                          NaN                   NaN   
3                          NaN                   NaN   
4                          NaN                   NaN   
5                          NaN                   NaN   

Title   'Til There Was You (1997)  'burbs, The (1989)  \
UserID                                                  
1                             NaN                 NaN   
2                             NaN                 NaN   
3                             NaN                 NaN   
4                             NaN                 NaN   
5                             NaN                 NaN   

Title   ...And Justice for All (1979)  
UserID                                 
1                                 NaN  
2                                 NaN 
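The same sparsity figure can be computed without building the full pivot table, since every row of the ratings data corresponds to one filled cell. A minimal sketch on made-up toy ratings (the `toy_ratings` list and the `sparsity_from_ratings` helper are hypothetical names, not part of the notebook):

```python
def sparsity_from_ratings(rating_triples):
    """Fraction of (user, item) cells that have no rating.

    Assumes each (user, item) pair appears at most once.
    """
    users = {u for u, _, _ in rating_triples}
    items = {i for _, i, _ in rating_triples}
    total_cells = len(users) * len(items)
    return 1.0 - len(rating_triples) / total_cells


# Toy data: 3 users x 4 movies = 12 cells, 5 of them filled.
toy_ratings = [
    (1, "A", 5), (1, "B", 3),
    (2, "C", 4),
    (3, "A", 2), (3, "D", 4),
]
toy_sparsity = sparsity_from_ratings(toy_ratings)
print(f"Sparsity: {toy_sparsity:.2%}")
```

On the real data, the user and item counts would come from the distinct `UserID` and `MovieID` values in `ratings`.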

### 4. Model 1: Content-Based Filtering

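This model recommends movies based on their features (here, only their genres). The logic is simple: if a user likes a movie, recommend other movies with similar genres. Because it needs no rating history for the movie being recommended, this approach is well suited to the item-side "cold start" problem: a new movie with no ratings can still be recommended as long as its genres are known. It is not personalized, however, since every user who picks the same movie gets the same list.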
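To make the idea concrete, here is a minimal sketch of genre-based similarity. It assumes plain binary genre indicators instead of TF-IDF weights, and `genre_cosine` is a hypothetical helper, not part of the notebook. Only the Toy Story genre string comes from the dataset; the other genre strings are made up for illustration.

```python
import math


def genre_cosine(genres_a, genres_b):
    """Cosine similarity between two '|'-separated genre strings,
    treating each movie as a binary vector over genres."""
    set_a = {g for g in genres_a.split("|") if g}
    set_b = {g for g in genres_b.split("|") if g}
    if not set_a or not set_b:
        return 0.0
    shared = len(set_a & set_b)
    return shared / math.sqrt(len(set_a) * len(set_b))


toy_story = "Animation|Children's|Comedy"
same_genres = "Animation|Children's|Comedy"   # illustrative
two_shared = "Animation|Children's"           # illustrative
no_overlap = "Action|Crime|Thriller"          # illustrative

sim_same = genre_cosine(toy_story, same_genres)
sim_partial = genre_cosine(toy_story, two_shared)
sim_none = genre_cosine(toy_story, no_overlap)
print(sim_same, round(sim_partial, 4), sim_none)
```

Movies with identical genre sets score 1, partially overlapping sets fall in between, and disjoint sets score 0.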

In [10]:
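# --- Create TF-IDF (Term Frequency-Inverse Document Frequency) Matrix ---
# Convert the text-based genres into a numeric matrix, one row per movie.
# The default tokenizer lowercases and splits on non-word characters,
# so 'Sci-Fi' becomes the two tokens 'sci' and 'fi'.
tfidf = TfidfVectorizer(stop_words='english')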

# Replace any NaN values in the 'Genres' column with an empty string.
movies['Genres'] = movies['Genres'].fillna('')

# Create the TF-IDF matrix by fitting and transforming the 'Genres' data.
tfidf_matrix = tfidf.fit_transform(movies['Genres'])

# --- Calculate Cosine Similarity ---
# Calculate the similarity score between every pair of movies based on their genres.
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

print("--- Cosine Similarity Matrix Shape ---")
print(cosine_sim.shape)

--- Cosine Similarity Matrix Shape ---
(3883, 3883)


In [11]:
# Create a mapping from movie titles to their index in the dataframe.
indices = pd.Series(movies.index, index=movies['Title']).drop_duplicates()

In [12]:
# Define the content-based recommendation function.
def get_content_based_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title.
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie.
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Drop the query movie itself, then sort the rest by similarity score.
    sim_scores = [s for s in sim_scores if s[0] != idx]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies.
    sim_scores = sim_scores[:10]

    # Get the movie indices.
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies.
    return movies['Title'].iloc[movie_indices]

In [13]:
# --- Get a Recommendation Example ---
# Test the function with a classic movie.
print("\n--- Recommendations For 'Toy Story (1995)' ---")
print(get_content_based_recommendations('Toy Story (1995)'))


--- Recommendations For 'Toy Story (1995)' ---
1050            Aladdin and the King of Thieves (1996)
2072                          American Tail, An (1986)
2073        American Tail: Fievel Goes West, An (1991)
2285                         Rugrats Movie, The (1998)
2286                              Bug's Life, A (1998)
3045                                Toy Story 2 (1999)
3542                             Saludos Amigos (1943)
3682                                Chicken Run (2000)
3685    Adventures of Rocky and Bullwinkle, The (2000)
12                                        Balto (1995)
Name: Title, dtype: object


### 5. Model 2: Collaborative Filtering with SVD

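This model recommends movies based on the principle that people who agreed in the past will agree in the future. It uses the **Singular Value Decomposition (SVD)** algorithm from the Surprise library, a matrix-factorization method popularized by Simon Funk during the Netflix Prize. Rather than decomposing the full (mostly empty) matrix, it learns latent user and item factors from the observed ratings only.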
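A minimal sketch of the idea behind this kind of matrix factorization, assuming the biased model `r_hat = mu + b_u + b_i + dot(p_u, q_i)` trained by stochastic gradient descent. It is a toy illustration on made-up ratings, not Surprise's implementation; `train_mf` and every hyperparameter value below are assumptions chosen for illustration.

```python
import random


def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + dot(p_u, q_i) with plain SGD.

    ratings: list of (user, item, rating) triples.
    Returns a predict(user, item) function.
    """
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    b_u = {u: 0.0 for u in users}  # user biases
    b_i = {i: 0.0 for i in items}  # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        # Unknown users or items fall back to whichever terms are known.
        est = mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)
        if u in p and i in q:
            est += sum(pf * qf for pf, qf in zip(p[u], q[i]))
        return est

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return predict


# Made-up ratings: (user, movie, rating).
toy = [(1, 10, 5), (1, 20, 4), (2, 10, 4), (2, 30, 1), (3, 20, 5), (3, 30, 2)]
predict = train_mf(toy)

toy_mean = sum(r for _, _, r in toy) / len(toy)
baseline_rmse = (sum((r - toy_mean) ** 2 for _, _, r in toy) / len(toy)) ** 0.5
train_rmse = (sum((r - predict(u, i)) ** 2 for u, i, r in toy) / len(toy)) ** 0.5
print(f"baseline RMSE: {baseline_rmse:.3f}, fitted RMSE: {train_rmse:.3f}")
```

The fitted model should beat the "always predict the global mean" baseline on the training ratings, and a user and movie it has never seen receive the global mean as their estimate.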

In [15]:
# --- Load Data for Surprise ---
# The 'surprise' library requires data in a specific format.
reader = Reader(rating_scale=(1,5))
data = Dataset.load_from_df(ratings[['UserID', 'MovieID', 'Rating']], reader)

# --- Train-Test Split ---
# Split the data for model training and evaluation.
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

In [16]:
# --- Train the SVD Model ---
svd_model = SVD(random_state=42)

print("Training the SVD model...")
# Train the model on the training set.
svd_model.fit(trainset)

# --- Evaluate the Model ---
# Make predictions on the test set.
predictions = svd_model.test(testset)

# Calculate and print the Root Mean Squared Error (RMSE).
rmse = accuracy.rmse(predictions)
print(f"\nModel RMSE: {rmse:.4f}")

Training the SVD model...
RMSE: 0.8729

Model RMSE: 0.8729
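For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings, so it is expressed in the same units as the ratings (stars). A minimal sketch on made-up numbers (the `rmse` helper here is hypothetical and separate from `surprise.accuracy.rmse`):

```python
import math


def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))


# Squared errors: 0.25, 0.25, 0.0, 1.0 -> mean 0.375
toy_pairs = [(5, 4.5), (3, 3.5), (4, 4.0), (1, 2.0)]
toy_rmse = rmse(toy_pairs)
print(f"RMSE: {toy_rmse:.4f}")
```

Because errors are squared before averaging, the single one-star miss contributes far more than the two half-star misses.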


In [17]:
# Define the collaborative filtering recommendation function.
def get_collaborative_filtering_recommendations(user_id, model=svd_model, n=10):
    # Get a list of all movie IDs.
    all_movie_ids = ratings['MovieID'].unique()

    # Get the list of movies the user has already rated.
    rated_movie_ids = ratings[ratings['UserID'] == user_id]['MovieID'].unique()

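    # Get the list of movies the user has NOT rated (a set makes the lookup fast).
    rated = set(rated_movie_ids)
    unrated_movie_ids = [movie_id for movie_id in all_movie_ids if movie_id not in rated]

    # Predict the ratings for all unrated movies.
    # The third value is a placeholder "true" rating required by test(); it does not affect the estimate.
    test_set_for_user = [[user_id, movie_id, 4.] for movie_id in unrated_movie_ids]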
    user_predictions = model.test(test_set_for_user)

    # Sort the predictions by the estimated rating.
    user_predictions.sort(key=lambda x: x.est, reverse=True)

    # Get the top N recommended movie IDs.
    recommended_movie_ids = [pred.iid for pred in user_predictions[:n]]

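    # Get the titles of recommended movies.
    # Note: isin() keeps the row order of 'movies', so the result is the top-N set,
    # not a list sorted by predicted rating.
    recommended_movies = movies[movies['MovieID'].isin(recommended_movie_ids)]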

    return recommended_movies['Title']

# --- Get Recommendation Example ---
print("\n--- Collaborative Filtering Recommendations for User 1 ---")
print(get_collaborative_filtering_recommendations(1))


--- Collaborative Filtering Recommendations for User 1 ---
108                                     Braveheart (1995)
315                      Shawshank Redemption, The (1994)
847                                 Godfather, The (1972)
892                                    Rear Window (1954)
910         Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)
1186                            Lawrence of Arabia (1962)
1256                                Cool Hand Luke (1967)
1950    Seven Samurai (The Magnificent Seven) (Shichin...
2953                                  General, The (1927)
3352                                  Animal House (1978)
Name: Title, dtype: object


### 6. Model 3: The Hybrid System

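This final model combines the outputs of the previous two models with a simple weighted vote: each title earns 1.0 point for appearing in the content-based list and 1.5 points for appearing in the collaborative list, and titles are ranked by total score. A title recommended by both models therefore ranks first. When the two lists do not overlap, as in the example below, the top 10 is simply the collaborative list.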

In [19]:
# Define the hybrid recommendation function.
def get_hybrid_recommendations(user_id, movie_title, n=10):
    """
    Generate hybrid recommendations by combining both models.
    """
    # Get recommendations from the content-based model.
    content_recs = get_content_based_recommendations(movie_title, cosine_sim)

    # Get recommendations from the collaborative filtering model.
    collaborative_recs = get_collaborative_filtering_recommendations(user_id, svd_model)

    # Combine and rank the recommendations, giving more weight to collaborative results.
    content_set = set(content_recs)
    collaborative_set = set(collaborative_recs)
    hybrid_recs = {}
    for movie in content_set:
        hybrid_recs[movie] = hybrid_recs.get(movie, 0) + 1.0 # Base score

    for movie in collaborative_set:
        hybrid_recs[movie] = hybrid_recs.get(movie, 0) + 1.5 # Higher score for taste alignment

    # Sort the recommendations by their combined score.
    # Titles with equal scores keep set-iteration order, which is arbitrary.
    sorted_hybrid_recs = sorted(hybrid_recs.items(), key=lambda x: x[1], reverse=True)
    final_recommendations = [rec[0] for rec in sorted_hybrid_recs]

    # Return the top N recommendations.
    return final_recommendations[:n]

In [20]:
# --- Get a Hybrid Recommendation Example ---
# Recommend movies for User 1 who liked 'Toy Story (1995)'
recommendations = get_hybrid_recommendations(user_id=1, movie_title='Toy Story (1995)')
print("\n--- Top 10 Hybrid Recommendations ---")
for i, rec in enumerate(recommendations, 1):
    print(f"{i}. {rec}")


--- Top 10 Hybrid Recommendations ---
1. Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)
2. Braveheart (1995)
3. Godfather, The (1972)
4. Animal House (1978)
5. Lawrence of Arabia (1962)
6. Cool Hand Luke (1967)
7. Rear Window (1954)
8. Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)
9. Shawshank Redemption, The (1994)
10. General, The (1927)
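The vote above uses only list membership. One common alternative, sketched here under stated assumptions, blends the actual scores: rescale the predicted rating to [0, 1] and mix it with the content similarity using a weight `alpha`. This is not what the notebook does; `blend_scores`, `alpha`, and the toy numbers are hypothetical.

```python
def blend_scores(cf_estimates, content_sims, alpha=0.6, rating_scale=(1, 5)):
    """Blend collaborative estimates with content similarities.

    cf_estimates: {title: predicted rating on rating_scale}
    content_sims: {title: similarity in [0, 1] to a movie the user liked}
    alpha: weight on the collaborative score; (1 - alpha) goes to content.
    Titles missing from one source contribute 0 for that component.
    """
    low, high = rating_scale
    blended = {}
    for title in set(cf_estimates) | set(content_sims):
        if title in cf_estimates:
            cf = (cf_estimates[title] - low) / (high - low)
        else:
            cf = 0.0
        sim = content_sims.get(title, 0.0)
        blended[title] = alpha * cf + (1 - alpha) * sim
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)


toy_cf = {"Movie A": 4.6, "Movie B": 4.2, "Movie C": 3.0}
toy_sim = {"Movie B": 0.9, "Movie C": 1.0, "Movie D": 0.8}

ranking = blend_scores(toy_cf, toy_sim)
for title, score in ranking:
    print(f"{score:.2f}  {title}")
```

Here "Movie B" ranks first because it scores well on both signals, while "Movie A" (collaborative only) and "Movie D" (content only) fall lower. The choice of `alpha` is a tuning decision that would need to be validated on held-out data.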


### 7. Final Evaluation - Precision and Recall @ k

To finalize the project, evaluate the performance of the core collaborative filtering model. This provides a quantitative measure of how good the recommendation lists are.

- **Precision@k:** Of the top `k` movies that the model predicts at or above the rating threshold, what proportion did the user actually rate at or above it?
- **Recall@k:** Of all the test-set movies the user rated at or above the threshold, what proportion appear among those top-`k` recommendations?
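As a worked example for a single user, following the threshold-based definitions used in this section (the ratings below are made up, and `precision_recall_single` is a hypothetical helper that returns `None` for undefined cases):

```python
def precision_recall_single(est_true_pairs, k=3, threshold=4.0):
    """Precision@k and recall@k for one user from (estimated, true) pairs."""
    ranked = sorted(est_true_pairs, key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]

    # Relevant: true rating at or above the threshold (over all of the user's items).
    n_rel = sum(true_r >= threshold for _, true_r in ranked)
    # Recommended: estimate at or above the threshold, within the top k.
    n_rec_k = sum(est >= threshold for est, _ in top_k)
    # Hits: both recommended and relevant, within the top k.
    n_hit = sum(est >= threshold and true_r >= threshold for est, true_r in top_k)

    precision = n_hit / n_rec_k if n_rec_k else None  # undefined: nothing recommended
    recall = n_hit / n_rel if n_rel else None          # undefined: nothing relevant
    return precision, recall


# (estimated rating, true rating) for one user's held-out movies.
toy_user = [(4.8, 5), (4.5, 3), (4.1, 4), (3.9, 5), (3.5, 4), (3.2, 2)]
precision, recall = precision_recall_single(toy_user, k=3)
print(f"Precision@3: {precision:.3f}, Recall@3: {recall:.3f}")
```

All three top-ranked items clear the threshold and two of them were truly liked, so precision is 2/3. The user liked four movies in total, of which two were recommended, so recall is 2/4.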

In [22]:
# Define the function to calculate precision and recall at k.
def precision_recall_at_k(predictions, k=10, threshold=4.0):
    """
    Return precision and recall at k for each user.
    """
    # Map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():
        # Sort user ratings by estimated value.
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Count number of relevant items.
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Count number of recommended items in top k.
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Count number of relevant and recommended items in top k.
        n_rel_and_rec_k = sum(
            ((true_r >= threshold) and (est >= threshold))
            for (est, true_r) in user_ratings[:k]
        )

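        # Calculate Precision@k. If no top-k estimate reaches the threshold,
        # precision is undefined; it is counted as 1 here, which raises the average.
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1

        # Calculate Recall@k. If the user has no relevant items in the test set,
        # recall is undefined and is likewise counted as 1.
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1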

    return precisions, recalls

In [23]:
# --- Evaluate our SVD model with the new metrics ---
# The 'predictions' variable is from our SVD model's test set.
print("--- Evaluating Collaborative Filtering (SVD) Model ---")
precisions, recalls = precision_recall_at_k(predictions, k=10, threshold=4.0)

# Average the scores across all users.
avg_precision = sum(p for p in precisions.values()) / len(precisions)
avg_recall = sum(r for r in recalls.values()) / len(recalls)

print(f"Average Precision@10: {avg_precision:.4f}")
print(f"Average Recall@10: {avg_recall:.4f}")

--- Evaluating Collaborative Filtering (SVD) Model ---
Average Precision@10: 0.8734
Average Recall@10: 0.3679


### 8. Project Conclusion

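This project built and evaluated a recommendation system using multiple techniques. On the held-out test set, the core collaborative filtering model (SVD) reached an RMSE of 0.8729, a **Precision@10 of 87.3%**, and a Recall@10 of 36.8% at a relevance threshold of 4.0. The precision figure indicates that most of the top-10 test items the model scored at or above 4.0 were in fact rated 4 or higher by the user. Two caveats apply: the metrics are computed only over movies each user actually rated, and users with undefined precision or recall are counted as 1, so the averages are somewhat optimistic.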

Furthermore, a hybrid system was developed that combines the content-based and collaborative lists through a simple weighted vote. It was not evaluated quantitatively here, but it illustrates how content features can cover the "cold start" case for new movies while collaborative filtering supplies personalization.