# Movie Recommender: Model Training and Evaluation
## 1. Introduction
This notebook focuses on implementing and evaluating various recommendation algorithms. We will explore different approaches, from simple popularity-based recommendations to more sophisticated content-based and collaborative filtering methods. The goal is to understand their strengths and weaknesses and identify the most suitable model(s) for our system.

## 2. Setup and Imports
We'll import the necessary libraries, including pandas for data manipulation, numpy for numerical operations, scikit-learn for various ML utilities (e.g., pairwise.cosine_similarity, model_selection.train_test_split), and matplotlib/seaborn for visualization.

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from scipy.sparse import csr_matrix # For efficient sparse matrix operations

# Configure seaborn for better aesthetics
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 7)
plt.rcParams['font.size'] = 12
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams['legend.fontsize'] = 12

## 3. Load Processed Data
We will load the preprocessed data saved from the _01_data_exploration_preprocessing.ipynb_ notebook. This includes the merged DataFrame containing user, movie, and rating information, and the _movies_encoded_ DataFrame which has one-hot encoded genres.

In [15]:
# Define paths to processed data
processed_data_path = '../data/processed/'
raw_data_path = '../data/raw/ml-1m/' # Path to original raw data for movies.dat

merged_df_path = os.path.join(processed_data_path, 'merged_movie_data.csv')
movies_encoded_df_path = os.path.join(processed_data_path, 'movies_encoded.csv')
original_movies_file = os.path.join(raw_data_path, 'movies.dat') # Path to original movies.dat

# Load the DataFrames
try:
    df = pd.read_csv(merged_df_path)
    movies_encoded_df = pd.read_csv(movies_encoded_df_path)
    # Load the original movies_df for title lookups
    movies_df = pd.read_csv(original_movies_file, sep='::', engine='python', encoding='latin-1',
                            names=['MovieID', 'Title', 'Genres'])

    print(f"Loaded merged_movie_data.csv with {df.shape[0]} rows and {df.shape[1]} columns.")
    print(f"Loaded movies_encoded.csv with {movies_encoded_df.shape[0]} rows and {movies_encoded_df.shape[1]} columns.")
    print(f"Loaded original movies_df with {movies_df.shape[0]} rows and {movies_df.shape[1]} columns.")
except FileNotFoundError as err:
    # Re-raise with a clearer message: exit() is unreliable inside a notebook kernel
    # and would hide which file was missing.
    raise FileNotFoundError(
        "Processed or raw data files not found. Please ensure '01_data_exploration_preprocessing.ipynb' has been run and data saved, and raw data is in place."
    ) from err

# Convert Timestamp back to datetime if needed for further time-based analysis
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

print("\n--- Merged DataFrame Head ---")
print(df.head())
print("\n--- Movies Encoded DataFrame Head ---")
print(movies_encoded_df.head())
print("\n--- Original Movies DataFrame Head ---")
print(movies_df.head())


Loaded merged_movie_data.csv with 1000209 rows and 10 columns.
Loaded movies_encoded.csv with 3883 rows and 21 columns.
Loaded original movies_df with 3883 rows and 3 columns.

--- Merged DataFrame Head ---
   UserID  MovieID  Rating           Timestamp  \
0       1     1193       5 2000-12-31 22:12:40   
1       1      661       3 2000-12-31 22:35:09   
2       1      914       3 2000-12-31 22:32:48   
3       1     3408       4 2000-12-31 22:04:35   
4       1     2355       5 2001-01-06 23:38:11   

                                    Title                        Genres  \
0  One Flew Over the Cuckoo's Nest (1975)                         Drama   
1        James and the Giant Peach (1996)  Animation|Children's|Musical   
2                     My Fair Lady (1964)               Musical|Romance   
3                  Erin Brockovich (2000)                         Drama   
4                    Bug's Life, A (1998)   Animation|Children's|Comedy   

  Gender  Age  Occupation Zip-code  
0   

## 4. Model 1: Popularity-Based Recommender
This is the simplest form of a recommender system. It recommends movies that are most popular among all users. Popularity can be defined by the number of ratings or the average rating (with a minimum threshold to avoid highly-rated but rarely-watched movies).

This model serves as a baseline to compare more complex algorithms against.
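
To see why the minimum-ratings threshold matters, here is a minimal sketch on made-up ratings. The three movies, their ratings, and the cutoff of 3 are illustrative assumptions, not MovieLens values. Without the filter, a movie with a single 5-star rating would top the list.

```python
import pandas as pd

# Toy ratings (hypothetical): movie 2 has one perfect rating, movie 1 has four good ones
toy_ratings = pd.DataFrame({
    "MovieID": [1, 1, 1, 1, 2, 3, 3, 3],
    "Rating":  [4, 5, 4, 5, 5, 3, 4, 3],
})

# Per-movie average rating and rating count
toy_stats = toy_ratings.groupby("MovieID")["Rating"].agg(
    avg_rating="mean", num_ratings="count"
)

# Ranking by average alone picks the movie with a single rating
unfiltered_top = toy_stats["avg_rating"].idxmax()

# Requiring at least 3 ratings (illustrative cutoff) removes that noisy entry first
toy_filtered = toy_stats[toy_stats["num_ratings"] >= 3]
filtered_top = toy_filtered["avg_rating"].idxmax()

print(f"Top movie without threshold: {unfiltered_top}")
print(f"Top movie with threshold:    {filtered_top}")
```

The cell below applies the same idea to the full dataset with a cutoff of 50 ratings.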

In [16]:
# Calculate average rating and number of ratings for each movie
movie_popularity = df.groupby('MovieID').agg(
    avg_rating=('Rating', 'mean'),
    num_ratings=('Rating', 'count')
).reset_index()

# Merge with movie titles (movies_df is now loaded at the beginning of this notebook)
movie_popularity = pd.merge(movie_popularity, movies_df[['MovieID', 'Title']], on='MovieID', how='left')

# Define a minimum number of ratings to be considered "popular"
min_ratings_threshold = 50 # Heuristic cutoff: averages over very few ratings are too noisy to rank on

# Filter out movies below the threshold
popular_movies_filtered = movie_popularity[movie_popularity['num_ratings'] >= min_ratings_threshold]

# Sort by average rating (descending) and then by number of ratings (descending)
top_popular_movies = popular_movies_filtered.sort_values(by=['avg_rating', 'num_ratings'], ascending=[False, False])

print(f"\n--- Top 10 Popular Movies (min. {min_ratings_threshold} ratings) ---")
print(top_popular_movies.head(10))

def get_popular_recommendations(num_recommendations=10):
    return top_popular_movies.head(num_recommendations)['Title'].tolist()

print("\nExample Popular Recommendations:")
print(get_popular_recommendations(5))



--- Top 10 Popular Movies (min. 50 ratings) ---
      MovieID  avg_rating  num_ratings  \
2698     2905    4.608696           69   
1839     2019    4.560510          628   
309       318    4.554558         2227   
802       858    4.524966         2223   
708       745    4.520548          657   
49         50    4.517106         1783   
513       527    4.510417         2304   
1066     1148    4.507937          882   
861       922    4.491489          470   
1108     1198    4.477725         2514   

                                                  Title  
2698                                     Sanjuro (1962)  
1839  Seven Samurai (The Magnificent Seven) (Shichin...  
309                    Shawshank Redemption, The (1994)  
802                               Godfather, The (1972)  
708                               Close Shave, A (1995)  
49                           Usual Suspects, The (1995)  
513                             Schindler's List (1993)  
1066                    

### Evaluation of Popularity-Based Recommender:
Because this model returns the same ranked list for every user, we do not run a train/test split or compute rating-prediction metrics such as RMSE for it here. Its "evaluation" is simply inspecting the most popular items. Its main drawback is the lack of personalization.

## 5. Model 2: Content-Based Recommender (Genre-Based)
A content-based recommender suggests items similar to those a user has liked in the past. Here, we'll use movie genres as the content features.

**Approach:**
1. For a given user, identify the movies they have rated highly.
2. Create a "user profile" by averaging the genre vectors of these highly-rated movies.
3. Calculate the cosine similarity between this user profile and all other movies (that the user hasn't rated yet).
4. Recommend the top N movies with the highest similarity scores.
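
The four steps above can be traced by hand on a toy example. This is only a sketch: the three-genre vocabulary and the movies are made up, and the cosine similarity is written out with NumPy so the arithmetic is visible. It is the same normalized dot product that `cosine_similarity` computes.

```python
import numpy as np

# Hypothetical genre vocabulary, in column order: Action, Comedy, Drama
# Steps 1-2: genre vectors of the movies the user rated highly, averaged into a profile
liked = np.array([
    [1, 0, 1],  # an Action + Drama movie
    [0, 0, 1],  # a Drama movie
], dtype=float)
user_profile = liked.mean(axis=0)  # [0.5, 0.0, 1.0]

# Step 3: cosine similarity between the profile and each unrated candidate
candidates = np.array([
    [0, 0, 1],  # candidate 0: Drama
    [0, 1, 0],  # candidate 1: Comedy
    [1, 0, 0],  # candidate 2: Action
], dtype=float)
# Note: a candidate with no genres at all would have zero norm and needs special handling
norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(user_profile)
scores = candidates @ user_profile / norms

# Step 4: rank candidates from most to least similar
ranking = np.argsort(-scores)
print(scores.round(3), ranking)
```

The Drama candidate ranks first because Drama carries the largest weight in the profile. The Comedy candidate scores zero because the user liked no comedies.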

In [17]:
# Prepare the genre features for similarity calculation
# Drop 'MovieID', 'Title', 'Genres' from movies_encoded_df to get only genre columns
genre_features_df = movies_encoded_df.drop(columns=['MovieID', 'Title', 'Genres'])
genre_features_df = genre_features_df.set_index(movies_encoded_df['MovieID']) # Set MovieID as index for easy lookup

print("\n--- Genre Features DataFrame Head ---")
print(genre_features_df.head())

# Function to get content-based recommendations
def get_content_based_recommendations(user_id, df_merged, movies_encoded_df, num_recommendations=10, min_rating=4):
    """
    Generates content-based recommendations for a given user based on their highly-rated movies' genres.

    Args:
        user_id (int): The ID of the user for whom to generate recommendations.
        df_merged (pd.DataFrame): The merged DataFrame containing ratings, movies, and user info.
        movies_encoded_df (pd.DataFrame): DataFrame with movies and their one-hot encoded genres.
        num_recommendations (int): The number of recommendations to return.
        min_rating (int): Minimum rating a user must have given for a movie to be considered for their profile.

    Returns:
        list: A list of recommended movie titles.
    """
    # Get movies rated by the user
    user_ratings = df_merged[df_merged['UserID'] == user_id]

    # Get movies the user has rated highly
    highly_rated_movies = user_ratings[user_ratings['Rating'] >= min_rating]

    if highly_rated_movies.empty:
        print(f"User {user_id} has no movies rated {min_rating} or higher. Cannot generate content-based recommendations.")
        return []

    # Get genre features for highly-rated movies
    highly_rated_movie_ids = highly_rated_movies['MovieID'].tolist()
    highly_rated_genres = movies_encoded_df[movies_encoded_df['MovieID'].isin(highly_rated_movie_ids)]

    # Drop non-genre columns to get only genre features
    highly_rated_genre_features = highly_rated_genres.drop(columns=['MovieID', 'Title', 'Genres'])

    # Create user profile by averaging genre features of highly-rated movies
    user_profile = highly_rated_genre_features.mean(axis=0).values.reshape(1, -1)

    # Get movies not yet rated by the user
    rated_movie_ids = user_ratings['MovieID'].tolist()
    unrated_movies = movies_encoded_df[~movies_encoded_df['MovieID'].isin(rated_movie_ids)]

    if unrated_movies.empty:
        print(f"User {user_id} has rated all available movies. No new recommendations.")
        return []

    # Prepare genre features for unrated movies
    unrated_genre_features = unrated_movies.drop(columns=['MovieID', 'Title', 'Genres'])

    # Calculate cosine similarity between the user profile and the unrated movies.
    # Both are built from the same genre columns of movies_encoded_df, so they already
    # share a feature space; reindexing against the full genre list makes that
    # column alignment explicit.
    all_genres_list = movies_encoded_df.columns.drop(['MovieID', 'Title', 'Genres']).tolist()

    user_profile_df = pd.DataFrame(user_profile, columns=all_genres_list)
    unrated_genre_features_aligned = unrated_genre_features.reindex(columns=all_genres_list, fill_value=0)


    similarities = cosine_similarity(user_profile_df, unrated_genre_features_aligned)
    similarity_scores = list(enumerate(similarities[0]))

    # Sort movies by similarity score
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    # Get top N recommended movie IDs, in ranked order
    recommended_movie_indices = [i[0] for i in similarity_scores[:num_recommendations]]
    recommended_movie_ids = unrated_movies.iloc[recommended_movie_indices]['MovieID'].tolist()

    # Look up titles in ranked order; filtering movies_df with isin() would
    # return them in movies_df row order and lose the similarity ranking
    title_by_id = movies_df.set_index('MovieID')['Title']
    recommended_movie_titles = title_by_id.loc[recommended_movie_ids].tolist()

    return recommended_movie_titles

# Example Content-Based Recommendations for UserID 1
user_id_example = 1
print(f"\nExample Content-Based Recommendations for User {user_id_example}:")
content_recs = get_content_based_recommendations(user_id_example, df, movies_encoded_df, num_recommendations=5)
print(content_recs)

# Example for a user with potentially fewer high ratings or different taste
user_id_example_2 = 500
print(f"\nExample Content-Based Recommendations for User {user_id_example_2}:")
content_recs_2 = get_content_based_recommendations(user_id_example_2, df, movies_encoded_df, num_recommendations=5)
print(content_recs_2)



--- Genre Features DataFrame Head ---
         Action  Adventure  Animation  Children's  Comedy  Crime  Documentary  \
MovieID                                                                         
1             0          0          1           1       1      0            0   
2             0          1          0           1       0      0            0   
3             0          0          0           0       1      0            0   
4             0          0          0           0       1      0            0   
5             0          0          0           0       1      0            0   

         Drama  Fantasy  Film-Noir  Horror  Musical  Mystery  Romance  Sci-Fi  \
MovieID                                                                         
1            0        0          0       0        0        0        0       0   
2            0        1          0       0        0        0        0       0   
3            0        0          0       0        0        0        1

### Evaluation of Content-Based Recommender:
Evaluating content-based systems with traditional metrics can be tricky: the model outputs similarity scores rather than predicted ratings, and there is no explicit set of "correct" recommendations. However, we can still use a train-test split to predict ratings or to measure precision/recall for top-N recommendations. For simplicity, this notebook focuses on the recommendation logic. A more robust evaluation would involve:

1. Splitting each user's historical ratings into training and test sets.
2. Building the user profile from the training set only.
3. Measuring how many of the user's test-set movies appear in the top N recommendations.
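
Step 3 can be sketched as a precision/recall-at-N computation. The helper name `precision_recall_at_n` and the movie IDs below are hypothetical. The sketch assumes the split and profile building (steps 1 and 2) have already produced a ranked recommendation list and a set of held-out movies the user liked.

```python
def precision_recall_at_n(recommended_ids, held_out_liked_ids, n):
    """Return (precision@n, recall@n) for one user.

    recommended_ids: movie IDs ranked best-first by the recommender.
    held_out_liked_ids: movie IDs from the user's test set that count as relevant.
    """
    top_n = list(recommended_ids)[:n]
    relevant = set(held_out_liked_ids)
    hits = sum(1 for movie_id in top_n if movie_id in relevant)
    precision = hits / n if n else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up IDs for illustration only
recommended = [10, 20, 30, 40, 50]
held_out_liked = [20, 50, 60]

precision, recall = precision_recall_at_n(recommended, held_out_liked, n=5)
print(f"precision@5 = {precision:.2f}, recall@5 = {recall:.2f}")
```

Averaging these values over all users with a non-empty test set gives a single score that can be compared against the popularity baseline.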