# Project 2: Movie Recommendation System

This notebook demonstrates how to build a simple item-based collaborative filtering recommendation system. We will use the MovieLens 100K dataset to find movies whose ratings are most similar to those of a given movie.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

## 1. Load and Prepare the Data

First, we load the `u.data` file, which contains user ratings, and the `u.item` file, which contains movie information. Make sure these files are in a `data/` subdirectory.

In [None]:
try:
    # Define column names for the ratings data
    r_cols = ['user_id', 'item_id', 'rating', 'timestamp']
    ratings = pd.read_csv('data/u.data', sep='\t', names=r_cols, encoding='latin-1')

    # Define column names for the movie titles data
    m_cols = ['item_id', 'title'] + [f'col{i}' for i in range(22)]
    movies = pd.read_csv('data/u.item', sep='|', names=m_cols, usecols=['item_id', 'title'], encoding='latin-1')

    # Merge the two dataframes
    df = pd.merge(ratings, movies, on='item_id')
    print("Data loaded successfully.")
    print(df.head())

except FileNotFoundError:
    print("Data files not found. Please download the dataset and place them in the 'data/' directory as specified in the README.")
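
A quick sanity check on the merged frame can catch a wrong separator or file path early. The sketch below assumes the column names used in the loading cell; `summarize_ratings` is a hypothetical helper, not a pandas function, and the toy frame is made-up data standing in for `df`. On the full MovieLens 100K data, the summary should report 100,000 ratings from 943 users on 1,682 movies, with ratings between 1 and 5.

```python
import pandas as pd

def summarize_ratings(df):
    """Return basic counts for a merged ratings/movies DataFrame."""
    return {
        'n_ratings': len(df),
        'n_users': df['user_id'].nunique(),
        'n_items': df['item_id'].nunique(),
        'min_rating': df['rating'].min(),
        'max_rating': df['rating'].max(),
        # Titles are missing only if the merge keys did not line up
        'n_missing_titles': int(df['title'].isna().sum()),
    }

# Tiny made-up frame standing in for the merged `df`
toy_df = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'item_id': [10, 20, 10, 30],
    'rating': [5, 3, 4, 1],
    'title': ['A (1990)', 'B (1991)', 'A (1990)', 'C (1992)'],
})
toy_summary = summarize_ratings(toy_df)
print(toy_summary)
```

Calling `summarize_ratings(df)` right after the merge gives the same summary for the real data.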

## 2. Exploratory Data Analysis (EDA)

Let's explore the data to see which movies have the most ratings and what the distribution of ratings looks like.

In [None]:
# Calculate mean rating and number of ratings for each movie
movie_stats = df.groupby('title').agg(mean_rating=('rating', 'mean'),
                                     num_of_ratings=('rating', 'count')).reset_index()

print("Top 5 movies by number of ratings:")
print(movie_stats.sort_values('num_of_ratings', ascending=False).head())

In [None]:
# Plot histograms of rating distributions
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.histplot(movie_stats['num_of_ratings'], bins=70)
plt.title('Distribution of Number of Ratings per Movie')

plt.subplot(1, 2, 2)
sns.histplot(movie_stats['mean_rating'], bins=70)
plt.title('Distribution of Mean Ratings per Movie')

plt.tight_layout()
plt.show()

## 3. Building the Item-Based Collaborative Filter

Now we'll create a user-item matrix with one row per user and one column per movie. Each cell holds that user's rating for the movie, or NaN if the user has not rated it. We will then use this matrix to compute the Pearson correlation between movies based on user ratings.

In [None]:
# Create the user-item matrix
user_movie_matrix = df.pivot_table(index='user_id', columns='title', values='rating')

print("User-Item Matrix preview:")
print(user_movie_matrix.head())

In [None]:
# Choose a movie to get recommendations for
movie_to_recommend = 'Star Wars (1977)'

# Get the ratings for the chosen movie
target_movie_ratings = user_movie_matrix[movie_to_recommend]

# Compute the correlation with other movies
similar_movies = user_movie_matrix.corrwith(target_movie_ratings)

# Create a dataframe of the results and drop movies whose correlation is undefined
corr_df = similar_movies.to_frame(name='Correlation').dropna()

print(f"Correlations with '{movie_to_recommend}':")
print(corr_df.head())
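
To see what the pivot and `corrwith` steps do, here is a minimal sketch on made-up ratings (four users, three movies named A, B, and C; none of these values come from MovieLens). Each correlation is computed only over the users who rated both movies, so user 4, who did not rate C, is ignored when comparing A with C.

```python
import pandas as pd

# Made-up ratings: users 1-4 rate movies A, B, and C (user 4 skips C)
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    'title':   ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B'],
    'rating':  [5, 4, 1, 3, 2, 3, 1, 1, 5, 4, 3],
})

# Rows are users, columns are movies, missing ratings become NaN
toy_matrix = toy.pivot_table(index='user_id', columns='title', values='rating')

# Pearson correlation of every movie column with movie A
toy_corr = toy_matrix.corrwith(toy_matrix['A'])

print(toy_matrix)
print(toy_corr)
```

In this toy data, B moves with A (correlation close to 1), while C moves in exactly the opposite direction (correlation of -1).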

## 4. Generating Recommendations

To improve the quality of our recommendations, we'll filter out movies that have a low number of ratings. A movie rated by only a handful of users can show a high correlation simply because the one or two people who rated both movies gave them similar scores. Requiring a minimum number of ratings makes the correlations more reliable, although it does not by itself guarantee statistical significance.

In [None]:
# Join the correlation data with the movie stats (which includes num_of_ratings)
corr_df = corr_df.join(movie_stats.set_index('title'))

# Keep only movies with more than 100 ratings
recommendations = corr_df[corr_df['num_of_ratings'] > 100].sort_values('Correlation', ascending=False)

print(f"Top recommendations for '{movie_to_recommend}':")
# Exclude the movie itself by label, rather than assuming it is the first row
print(recommendations.drop(movie_to_recommend, errors='ignore').head())
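
The filter in this section counts every rating a movie has received, not the number of users who rated both that movie and the target. A stricter variant, sketched here on made-up data, counts those co-ratings directly. The names `co_ratings` and `min_common`, and the cutoff of 3, are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up user-item matrix; NaN means the user did not rate the movie
toy_matrix = pd.DataFrame(
    {
        'A': [5, 3, 1, 4],
        'B': [4, np.nan, 1, 3],
        'C': [2, np.nan, 1, np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name='user_id'),
)
target = 'A'

# Keep only users who rated the target, then count ratings per movie
co_ratings = toy_matrix[toy_matrix[target].notna()].notna().sum()

# Correlation of every movie with the target, as in the main notebook
corr = toy_matrix.corrwith(toy_matrix[target])

# Keep movies with enough co-ratings and drop the target itself
min_common = 3  # assumed cutoff for this toy example
reliable = corr[co_ratings >= min_common].drop(target)

print(co_ratings)
print(reliable)
```

Here movie C correlates perfectly with A, but only two users rated both, so the co-rating cutoff removes it and only B remains.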

## 5. Conclusion

This notebook demonstrated a basic item-based collaborative filtering approach. The recommendations seem reasonable, with other popular sci-fi movies appearing at the top of the list for 'Star Wars'.

### Potential Next Steps

- **Experiment with Thresholds:** Try different values for the minimum number of ratings to see how it affects recommendation quality.
- **Try Different Similarity Metrics:** Use other metrics like cosine similarity instead of Pearson correlation.
- **Implement Model-Based Approaches:** Explore more advanced techniques such as matrix factorization (e.g., Singular Value Decomposition, or SVD), which can often yield better and more personalized results.
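
As a starting point for the similarity-metric experiment, the following is a minimal sketch of cosine similarity between one movie and all others, using only NumPy and pandas. `cosine_similar_items` is a hypothetical helper, and filling missing ratings with 0 is an assumption; other choices, such as mean-centering each user's ratings first, will produce different rankings.

```python
import numpy as np
import pandas as pd

def cosine_similar_items(user_item, target, fill_value=0.0):
    """Cosine similarity between `target` and every column of `user_item`.

    Missing ratings are replaced by `fill_value` before the computation.
    """
    filled = user_item.fillna(fill_value)
    values = filled.to_numpy(dtype=float)
    target_vec = filled[target].to_numpy(dtype=float)

    # Dot product of each movie column with the target column
    dots = values.T @ target_vec
    norms = np.linalg.norm(values, axis=0) * np.linalg.norm(target_vec)

    # Columns with zero norm get NaN instead of a division by zero
    sims = np.divide(dots, norms, out=np.full_like(dots, np.nan), where=norms > 0)
    sims = pd.Series(sims, index=user_item.columns, name='Cosine')
    return sims.sort_values(ascending=False)

# Made-up user-item matrix for illustration
toy_matrix = pd.DataFrame({
    'A': [5, 3, np.nan, 4],
    'B': [4, 2, np.nan, 3],
    'C': [np.nan, 1, 5, np.nan],
})
toy_sims = cosine_similar_items(toy_matrix, 'A')
print(toy_sims)
```

On the real data, `cosine_similar_items(user_movie_matrix, movie_to_recommend)` could be joined with `movie_stats` and filtered by rating count in the same way as the correlation results.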