# Movie Recommendation System with Data Enrichment

## Project Overview

This notebook demonstrates the development of a movie recommendation system using the MovieLens dataset, enriched with data from The Movie Database (TMDB) API. We use collaborative filtering techniques, specifically the Singular Value Decomposition (SVD) algorithm, to create personalized movie recommendations.

## Data Processing and Enrichment

1. Load MovieLens data (movies and ratings)
2. Fetch additional data from TMDB API for the first 1000 movies
3. Normalize TMDB ratings and merge with MovieLens data
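
As a rough illustration of steps 2 and 3, the sketch below splits a MovieLens-style title into a name and a release year, and rescales a TMDB `vote_average` (reported on a 0 to 10 scale) by halving it. The helper names `split_title_year` and `rescale_tmdb_rating` are hypothetical and do not appear in the notebook, which slices the year out of the title string directly.

```python
import re

# Hypothetical helper pattern: MovieLens titles usually end in "(YYYY)".
_TITLE_YEAR = re.compile(r"^(?P<title>.*)\((?P<year>\d{4})\)\s*$")

def split_title_year(raw_title):
    """Return (title, year); year is None when there is no trailing "(YYYY)"."""
    cleaned = raw_title.strip()
    match = _TITLE_YEAR.match(cleaned)
    if match is None:
        return cleaned, None
    return match.group("title").strip(), int(match.group("year"))

def rescale_tmdb_rating(vote_average):
    """Halve a TMDB vote average (0-10) so it tops out at 5, like MovieLens ratings."""
    return vote_average / 2

print(split_title_year("Toy Story (1995)"))  # ('Toy Story', 1995)
print(split_title_year("Untitled Film"))     # ('Untitled Film', None)
print(rescale_tmdb_rating(7.0))              # 3.5
```

Halving maps TMDB's range onto 0 to 5, which only approximately matches the 0.5 to 5.0 MovieLens scale. An unrated TMDB entry (vote average 0) lands below the smallest possible MovieLens rating.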

## Model Development

We use the Surprise library to implement our SVD model:

1. Load enriched data
2. Create Surprise dataset
3. Split data into training and test sets
4. Train SVD model
5. Evaluate model performance
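
To make step 4 more concrete, here is a minimal plain-Python sketch of the biased matrix-factorization prediction that Surprise documents for its `SVD` class, together with one stochastic gradient step on a single rating. It is an illustration under stated assumptions, not Surprise's actual implementation. The function names and the toy numbers are made up, and the defaults `lr=0.005` and `reg=0.02` mirror the `lr_all` and `reg_all` values used in this notebook.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Predicted rating: global mean + user bias + item bias + q_i . p_u."""
    return mu + b_u + b_i + sum(q * p for q, p in zip(q_i, p_u))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One regularized SGD update for a single observed rating r_ui."""
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the old values of p_u and q_i.
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# Toy values (illustrative only): two latent factors instead of the notebook's 100.
mu, b_u, b_i = 3.5, 0.25, -0.5
p_u, q_i = [0.5, 0.0], [1.0, 2.0]

before = predict(mu, b_u, b_i, p_u, q_i)
print(before)  # 3.75

b_u2, b_i2, p_u2, q_i2 = sgd_step(4.0, mu, b_u, b_i, p_u, q_i)
after = predict(mu, b_u2, b_i2, p_u2, q_i2)
print(abs(4.0 - after) < abs(4.0 - before))  # True: the step moved toward the true rating
```

Training repeats this update over every rating in the training set for `n_epochs` passes.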

## Results

Our enriched SVD model achieved:
- RMSE on test set: 0.7861
- MAE on test set: 0.5894

Compared to our original models:
- SVD: RMSE 0.7856, MAE 0.5889
- NMF: RMSE 0.8712, MAE 0.6626
- BaselineOnly: RMSE 0.8630, MAE 0.6573
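
For reference, the two metrics reported above can be computed as in the sketch below. The notebook itself relies on `surprise.accuracy`; the `(true, predicted)` pairs here are made-up values, not results from this project.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, predicted_rating) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, predicted_rating) pairs."""
    return sum(abs(t - p) for t, p in pairs) / len(pairs)

# Made-up example: errors of 1, 0, 2 and -1 stars.
pairs = [(4.0, 3.0), (3.0, 3.0), (5.0, 3.0), (2.0, 3.0)]

print(f"RMSE: {rmse(pairs):.4f}")  # RMSE: 1.2247
print(f"MAE:  {mae(pairs):.4f}")   # MAE:  1.0000
```

RMSE squares each error before averaging, so the single 2-star miss weighs more heavily than it does in MAE. This is why RMSE is always at least as large as MAE on the same predictions.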

## Interpretation

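The enriched SVD model performs almost identically to the original SVD model (RMSE 0.7861 vs 0.7856, MAE 0.5894 vs 0.5889) and still outperforms the NMF and BaselineOnly models by a wide margin. Note, however, that the Surprise dataset is built from the `userId`, `movieId`, and `rating` columns only, so the merged TMDB columns never reach the SVD model. The 0.0005 RMSE difference is therefore more plausibly explained by run-to-run variation, such as the random initialization of the SVD factors or a different data split, than by the enrichment. These results do not yet show whether TMDB features help or hurt predictions.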

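Training error is lower than test error (RMSE: 0.6766 vs 0.7861, MAE: 0.5109 vs 0.5894), a gap of roughly 0.11 RMSE. Some gap is expected because the model was fit on the training ratings. Its size suggests moderate rather than severe overfitting, and the test error remains well below that of the NMF and BaselineOnly models, so the model still generalizes reasonably well to unseen data.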
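
As a quick arithmetic check on the train/test comparison, the snippet below computes the absolute and relative gaps from the four figures reported in this notebook. The variable names are illustrative.

```python
# Figures copied from the notebook's outputs.
train_rmse, test_rmse = 0.6766, 0.7861
train_mae, test_mae = 0.5109, 0.5894

rmse_gap = test_rmse - train_rmse
mae_gap = test_mae - train_mae

print(f"RMSE gap: {rmse_gap:.4f} ({rmse_gap / train_rmse:.1%} above training)")
print(f"MAE gap:  {mae_gap:.4f} ({mae_gap / train_mae:.1%} above training)")
# RMSE gap: 0.1095 (16.2% above training)
# MAE gap:  0.0785 (15.4% above training)
```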

## Conclusion

Our SVD-based recommendation system proves to be robust and effective for movie recommendations. Because the merged TMDB columns are not yet used during training, future work could explore ways of actually feeding external data into the model, or experiment with hybrid models, to further improve performance.

In [None]:
import pandas as pd
import requests
import time

# Load MovieLens data
movies_data = pd.read_csv('data/movies.csv', encoding='latin-1')  # Movie data
ratings_data = pd.read_csv('data/ratings.csv', encoding='latin-1')  # Rating data

# Display the first few rows of both datasets
print("Ratings Data:")
print(ratings_data.head())

print("\nMovies Data:")
print(movies_data.head())

# Check the shape of the datasets
print("\nShape of Ratings Data:", ratings_data.shape)
print("Shape of Movies Data:", movies_data.shape)

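# TMDB API key: read it from the environment rather than hardcoding it in the notebook
# (the variable name TMDB_API_KEY is an assumed convention; set it before running this cell)
import os
api_key = os.environ['TMDB_API_KEY']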

# Function to get TMDB data
def get_tmdb_info(movie_title, year):
    try:
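        # Pass the query as params so requests URL-encodes titles containing '&', '#', etc.
        response = requests.get(
            "https://api.themoviedb.org/3/search/movie",
            params={'api_key': api_key, 'query': movie_title, 'year': year},
            timeout=10,  # Avoid hanging indefinitely on a stalled connection
        )
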
        # Check if the response is successful
        if response.status_code == 200:
            data = response.json()
            if data['results']:
                movie_info = data['results'][0]
                return {
                    'tmdb_id': movie_info['id'],
                    'tmdb_rating': movie_info['vote_average'],
                    'tmdb_vote_count': movie_info['vote_count']
                }
    except Exception as e:
        print(f"Error fetching data for {movie_title} ({year}): {e}")
    return None

# Fetch TMDB data for the first 1000 movies
sample_movies = movies_data.head(1000)
tmdb_info_list = []

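for _, row in sample_movies.iterrows():
    movie_title = row['title'].strip()
    year_text = movie_title[-5:-1]  # MovieLens titles normally end in "(YYYY)"
    if not year_text.isdigit():
        continue  # Skip titles without a trailing year instead of crashing on int()
    year = int(year_text)
    tmdb_info = get_tmdb_info(movie_title, year)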
    if tmdb_info:
        tmdb_info_list.append({
            'movieId': row['movieId'],
            'tmdb_id': tmdb_info['tmdb_id'],
            'tmdb_rating': tmdb_info['tmdb_rating'],
            'tmdb_vote_count': tmdb_info['tmdb_vote_count']
        })
    time.sleep(0.5)  # Delay to avoid hitting the API rate limit

# Convert the list to a DataFrame
tmdb_df = pd.DataFrame(tmdb_info_list)
print("TMDB Data Sample:")
print(tmdb_df.head())

# Step 3: Normalize TMDB Ratings and Merge with MovieLens Data
# Rescale the TMDB vote average (0 to 10) to a 0-5 range, comparable to MovieLens ratings (0.5 to 5.0)
tmdb_df['tmdb_rating_normalized'] = tmdb_df['tmdb_rating'] / 2

# Merge TMDB data with MovieLens ratings data
enriched_ratings = pd.merge(ratings_data, tmdb_df[['movieId', 'tmdb_rating_normalized', 'tmdb_vote_count']], on='movieId', how='left')

# Display the first few rows to check the merged data with normalized ratings
print("Ratings Data with Normalized TMDB Ratings:")
print(enriched_ratings.head())

# Save the enriched ratings data to a CSV file
enriched_ratings.to_csv('enriched_ratings.csv', index=False)
print("Enriched ratings data saved to 'enriched_ratings.csv'")

In [2]:
import pandas as pd
import pickle
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

# Load the enriched data
ratings_data = pd.read_csv('enriched_ratings.csv') 

# Create a Reader object
reader = Reader(rating_scale=(0.5, 5))

# Create a Surprise dataset from the pandas DataFrame
data = Dataset.load_from_df(ratings_data[['userId', 'movieId', 'rating']], reader)

# Split the data into training and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)


# Create and train the SVD model
svd_model = SVD(n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02)
svd_model.fit(trainset)

# Evaluate the model on the test set
predictions = svd_model.test(testset)
rmse = accuracy.rmse(predictions)
mae = accuracy.mae(predictions)

print(f"RMSE on the test set: {rmse:.4f}")
print(f"MAE on the test set: {mae:.4f}")


RMSE: 0.7861
MAE:  0.5894
RMSE on the test set: 0.7861
MAE on the test set: 0.5894


In [3]:
# Save the trained model
with open('svd_model_enriched_RS.pkl', 'wb') as file:
    pickle.dump(svd_model, file)

In [4]:
# Additional: Evaluate on the training set for comparison
train_predictions = svd_model.test(trainset.build_testset())
train_rmse = accuracy.rmse(train_predictions)
train_mae = accuracy.mae(train_predictions)

print(f"RMSE on the training set: {train_rmse:.4f}")
print(f"MAE on the training set: {train_mae:.4f}")

RMSE: 0.6766
MAE:  0.5109
RMSE on the training set: 0.6766
MAE on the training set: 0.5109
