# Movie Recommendation System: Item-Based Collaborative Filtering

This Jupyter Notebook contains the core logic for a movie recommendation system based on Item-Based Collaborative Filtering. It covers data loading, preprocessing, calculation of item similarity, and generation of movie recommendations for a given user.

**Dataset:** MovieLens Latest Small (Assumed to be in XLSX format based on previous discussions and screenshots).

**Author:** Your Name (Optional)
**Date:** June 30, 2025
"""

In [16]:
# Import necessary libraries
import pandas as pd
import os
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

DATA_FOLDER_PATH = "C:/Users/admin/OneDrive/Desktop/ALL/Folders/internship/RISE/movie_recommender/ml-latest-small/ml-latest-small/"

print(f"Configured data path: {DATA_FOLDER_PATH}")

Configured data path: C:/Users/admin/OneDrive/Desktop/ALL/Folders/internship/RISE/movie_recommender/ml-latest-small/ml-latest-small/


## 1. Data Loading and Preprocessing

This section defines a function to load the movie ratings and metadata. It then transforms this data into a user-movie matrix and calculates the item-item similarity matrix using cosine similarity.

We are using `pd.read_excel` here because your files were identified as `XLS Worksheet` type. If you later convert them to `.csv` files, you would change `pd.read_excel` to `pd.read_csv` and update the file extensions accordingly.
"""

In [14]:
import os
print(os.getcwd())

C:\Users\admin\OneDrive\Desktop\ALL\Folders\internship\RISE\movie_recommender


In [17]:
import os
DATA_FOLDER_PATH = "C:/Users/admin/OneDrive/Desktop/ALL/Folders/internship/RISE/movie_recommender/ml-latest-small/ml-latest-small/"
print(f"Checking contents of: {DATA_FOLDER_PATH}")
if os.path.exists(DATA_FOLDER_PATH):
    print("Path exists. Listing contents:")
    for item in os.listdir(DATA_FOLDER_PATH):
        print(f"- {item}")
else:
    print("Error: The DATA_FOLDER_PATH itself does not exist.")

Checking contents of: C:/Users/admin/OneDrive/Desktop/ALL/Folders/internship/RISE/movie_recommender/ml-latest-small/ml-latest-small/
Path exists. Listing contents:
- links.csv
- movies.csv
- ratings.csv
- README.txt
- tags.csv


In [18]:
def load_and_process_data_jupyter(data_path):
    """
    Loads movie ratings and metadata, then processes them to create
    a user-movie matrix and an item-item similarity matrix.
    This version includes print statements for better visibility in Jupyter.
    """
    try:
        # !!! CHANGE THIS if your files are actually .csv !!!
        # Load the ratings and movies data using pd.read_csv for .csv files
        ratings = pd.read_csv(os.path.join(data_path, 'ratings.csv')) # Changed to .csv
        movies = pd.read_csv(os.path.join(data_path, 'movies.csv'))   # Changed to .csv

        print("Data loaded successfully.")
        print("\n--- Ratings Data Sample ---")
        print(ratings.head())
        print(f"Shape: {ratings.shape}")
        print("\n--- Movies Data Sample ---")
        print(movies.head())
        print(f"Shape: {movies.shape}")

        # Create the user-item matrix:
        # Rows are 'userId', columns are 'movieId', values are 'rating'.
        # .fillna(0) replaces NaN (Not a Number) values (where a user hasn't
        # rated a movie) with 0.
        user_movie_matrix = ratings.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)
        print("\n--- User-Movie Matrix Sample (first 5x5) ---")
        print(user_movie_matrix.iloc[:5, :5])
        print(f"Shape: {user_movie_matrix.shape}")

        # Transpose the user-movie matrix to get a movie-user matrix.
        # For Item-Based Collaborative Filtering, we need movies as rows
        # and users as columns to calculate similarity between movies.
        movie_user_matrix = user_movie_matrix.T
        print("\n--- Movie-User Matrix Sample (first 5x5) ---")
        print(movie_user_matrix.iloc[:5, :5])
        print(f"Shape (transposed): {movie_user_matrix.shape}")

        # Calculate Cosine Similarity between all pairs of movies.
        # Cosine similarity measures the cosine of the angle between two vectors.
        # A value closer to 1 indicates higher similarity.
        item_similarity_matrix = cosine_similarity(movie_user_matrix)
        print(f"\nItem Similarity Matrix computed: {item_similarity_matrix.shape}")

        # Convert the numpy array similarity matrix back into a Pandas DataFrame.
        # This makes it easy to look up similarities using movie IDs as index and columns.
        item_similarity_df = pd.DataFrame(item_similarity_matrix,
                                          index=movie_user_matrix.index,
                                          columns=movie_user_matrix.index)
        print("\n--- Item Similarity DataFrame Sample (first 5x5) ---")
        print(item_similarity_df.iloc[:5, :5])
        print(f"Shape: {item_similarity_df.shape}")

        return ratings, movies, user_movie_matrix, item_similarity_df

    except FileNotFoundError as e:
        print(f"Error: Data files not found. Please check path and file names (e.g., ratings.csv, movies.csv). Details: {e}") # Updated error message
        print("Ensure 'ml-latest-small' folder is correctly placed and DATA_FOLDER_PATH is accurate.")
        return None, None, None, None
    except Exception as e:
        print(f"An unexpected error occurred during data loading or processing: {e}")
        return None, None, None, None

# Execute the data loading and processing
ratings_df, movies_df, user_movie_matrix_filled, item_similarity_df = load_and_process_data_jupyter(DATA_FOLDER_PATH)

# Check if data loaded successfully before proceeding
if ratings_df is None or movies_df is None:
    print("\nData loading failed. Please review the error messages and ensure your data files are correct.")

Data loaded successfully.

--- Ratings Data Sample ---
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931
Shape: (100836, 4)

--- Movies Data Sample ---
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
Shape: (9742, 3)

--- User-Movie Matrix Sample (first 5x5) ---
movieId    1

## 2. Recommendation Function

This function `get_recommendations_item_based` takes a `user_id` and the desired number of recommendations as input. It identifies movies previously rated by the user, finds similar movies to those rated, aggregates scores based on similarity and the user's own ratings, and then returns the top recommended movie titles.
"""

In [19]:
def get_recommendations_item_based(user_id, num_recommendations=10):
    """
    Generates movie recommendations for a given user ID using item-based
    collaborative filtering.
    """
    if user_id not in user_movie_matrix_filled.index:
        print(f"User ID {user_id} not found in the dataset.")
        return []

    user_ratings = user_movie_matrix_filled.loc[user_id]
    rated_movies_by_user = user_ratings[user_ratings > 0] # Filter for movies the user actually rated

    if rated_movies_by_user.empty:
        print(f"User ID {user_id} has not rated any movies in the dataset.")
        return []

    recommendation_scores = {}

    for movie_id, rating in rated_movies_by_user.items():
        if movie_id in item_similarity_df.columns:
            similar_movies = item_similarity_df[movie_id]

            for sim_movie_id, similarity_score in similar_movies.items():
                # Only consider movies that the user HAS NOT already rated and exist in movies_df
                if sim_movie_id not in rated_movies_by_user.index and \
                   sim_movie_id in movies_df['movieId'].values:
                    # Calculate a weighted score: similarity * user's rating.
                    # Movies similar to highly-rated movies by the user get a higher score.
                    weighted_score = similarity_score * rating

                    # Add this weighted score to the total recommendation score for the similar movie.
                    recommendation_scores.setdefault(sim_movie_id, 0)
                    recommendation_scores[sim_movie_id] += weighted_score

    # Remove movies with zero or negative scores (e.g., from negative similarity values or no valid contributions)
    recommendation_scores = {k: v for k, v in recommendation_scores.items() if v > 0}

    if not recommendation_scores:
        print(f"Could not generate recommendations for User {user_id} (not enough similar movies or valid scores).")
        return []

    # Sort the recommended movies by their aggregated scores in descending order.
    sorted_recommendations = sorted(recommendation_scores.items(), key=lambda x: x[1], reverse=True)

    # Extract only the movie IDs for the top N recommendations.
    recommended_movie_ids = [movie_id for movie_id, score in sorted_recommendations[:num_recommendations]]

    # Retrieve the actual movie titles using the 'movies_df'.
    recommended_movie_titles = movies_df[movies_df['movieId'].isin(recommended_movie_ids)]['title'].tolist()

    return recommended_movie_titles

## 3. Example Usage

This section demonstrates how to use the `get_recommendations_item_based` function to get movie recommendations for a few sample users from the dataset.
"""

In [20]:
# Ensure data was loaded successfully before running examples
if ratings_df is not None and movies_df is not None and user_movie_matrix_filled is not None and item_similarity_df is not None:
    # Get a list of unique user IDs from your dataset
    all_user_ids = sorted(ratings_df['userId'].unique().tolist())

    # --- Example for a specific user (e.g., User ID 1) ---
    test_user_id = 1
    num_recommendations_to_get = 10

    print(f"\n--- Demonstrating Recommendations for User ID {test_user_id} ---")
    print(f"Movies User {test_user_id} has rated:")
    user_rated_movies = ratings_df[ratings_df['userId'] == test_user_id].merge(
        movies_df[['movieId', 'title']], on='movieId'
    )['title'].tolist()
    if user_rated_movies:
        for i, title in enumerate(user_rated_movies[:10]): # Display up to 10 rated movies
            print(f"- {title}")
        if len(user_rated_movies) > 10:
            print(f"... and {len(user_rated_movies) - 10} more.")
    else:
        print("This user has not rated any movies in the dataset.")

    recommended_movies = get_recommendations_item_based(test_user_id, num_recommendations_to_get)

    if recommended_movies:
        print(f"\nTop {num_recommendations_to_get} Recommended Movies for User {test_user_id}:")
        for i, movie_title in enumerate(recommended_movies):
            print(f"{i+1}. {movie_title}")
    else:
        print(f"No recommendations generated for User {test_user_id}.")

    # --- Example for a few random users ---
    print("\n" + "="*50)
    print("--- Testing with a few other users ---")
    import random
    if len(all_user_ids) > 5:
        sample_users = random.sample(all_user_ids, 5) # Get 5 random users
    else:
        sample_users = all_user_ids # If less than 5 users, use all

    for user_id in sample_users:
        if user_id == test_user_id: # Skip if already processed
            continue

        print(f"\n--- Recommendations for User ID {user_id} ---")
        user_rated_movies_sample = ratings_df[ratings_df['userId'] == user_id].merge(
            movies_df[['movieId', 'title']], on='movieId'
        )['title'].tolist()
        if user_rated_movies_sample:
            print("Movies User has rated (sample):")
            for i, title in enumerate(user_rated_movies_sample[:5]):
                print(f"- {title}")
            if len(user_rated_movies_sample) > 5:
                print(f"... and {len(user_rated_movies_sample) - 5} more.")
        else:
            print("This user has not rated any movies in the dataset.")

        recommended_movies = get_recommendations_item_based(user_id, 10)
        if recommended_movies:
            print(f"\nTop 10 Recommended Movies for User {user_id}:")
            for i, movie_title in enumerate(recommended_movies):
                print(f"{i+1}. {movie_title}")
        else:
            print(f"No recommendations generated for User {user_id}.")
else:
    print("Cannot run examples. Data was not loaded successfully in previous steps.")


--- Demonstrating Recommendations for User ID 1 ---
Movies User 1 has rated:
- Toy Story (1995)
- Grumpier Old Men (1995)
- Heat (1995)
- Seven (a.k.a. Se7en) (1995)
- Usual Suspects, The (1995)
- From Dusk Till Dawn (1996)
- Bottle Rocket (1996)
- Braveheart (1995)
- Rob Roy (1995)
- Canadian Bacon (1995)
... and 222 more.

Top 10 Recommended Movies for User 1:
1. Terminator 2: Judgment Day (1991)
2. 2001: A Space Odyssey (1968)
3. Die Hard (1988)
4. Aliens (1986)
5. Mars Attacks! (1996)
6. Fifth Element, The (1997)
7. Breakfast Club, The (1985)
8. Austin Powers: The Spy Who Shagged Me (1999)
9. Sixth Sense, The (1999)
10. Ferris Bueller's Day Off (1986)

--- Testing with a few other users ---

--- Recommendations for User ID 168 ---
Movies User has rated (sample):
- Taxi Driver (1976)
- Species (1995)
- Like Water for Chocolate (Como agua para chocolate) (1992)
- Pulp Fiction (1994)
- Muriel's Wedding (1994)
... and 89 more.

Top 10 Recommended Movies for User 168:
1. Blade Runner (