# Movie Recommendation System

This notebook builds a simple movie recommendation system from scratch using collaborative filtering based on user ratings.

## 1. Load Data

**Subtask:** Load the movie ratings and movie title data.

**Reasoning:**
Download and unzip the MovieLens 1M dataset, then load the ratings and movie information into appropriate data structures.

In [68]:
# Download and unzip the dataset
!wget -nc http://files.grouplens.org/datasets/movielens/ml-1m.zip
!unzip -n ml-1m.zip

import numpy as np
from collections import defaultdict

# Define variables and dataset parameters
data_path = '/content/ml-1m/ratings.dat'
movies_path = '/content/ml-1m/movies.dat'
n_users = 6040
n_movies = 3706

def load_rating_data(data_path, n_users, n_movies):
    """
    Loads movie rating data from file, and returns the number of
    ratings for each movie and a movie ID to index mapping.
    """
    data = np.zeros([n_users, n_movies], dtype=np.uint8)
    movie_id_mapping = {}
    movie_n_rating = defaultdict(int)

    with open(data_path, 'r') as file:
        for line in file.readlines():
            line_split = line.strip().split('::')
            user_id, movie_id, rating = line_split[0], line_split[1], line_split[2]

            # Convert to zero-based index
            user_id = int(user_id) - 1

            # Create movie ID mapping if it doesn't exist
            if movie_id not in movie_id_mapping:
                movie_id_mapping[movie_id] = len(movie_id_mapping)

            # Convert rating to integer
            rating = int(rating)

            # Store the rating in the data matrix
            data[user_id, movie_id_mapping[movie_id]] = rating

            if rating > 0:
                movie_n_rating[movie_id] += 1

    return data, movie_n_rating, movie_id_mapping

def load_movie_titles(movies_path):
    """
    Loads movie titles from file and returns a dictionary mapping movie ID to title.
    """
    movie_id_to_title = {}
    with open(movies_path, 'r', encoding='ISO-8859-1') as file:
        for line in file.readlines():
            line_split = line.strip().split('::')
            movie_id, movie_title = line_split[0], line_split[1]
            movie_id_to_title[int(movie_id)] = movie_title
    return movie_id_to_title

# Load dataset
data, movie_n_rating, movie_id_mapping = load_rating_data(data_path, n_users, n_movies)
movie_id_to_title = load_movie_titles(movies_path)

print('Shape of user-item matrix:', data.shape)
print('Number of movies loaded:', len(movie_id_to_title))

File ‘ml-1m.zip’ already there; not retrieving.

Archive:  ml-1m.zip
Shape of user-item matrix: (6040, 3706)
Number of movies loaded: 3883


## 2. Prepare Data
**Subtask:** Display dataset statistics and create a binary rating column.

**Reasoning:**
Display the number of users and movies loaded. Count the occurrences of each rating value (0-5) in the `user_item_matrix`. Create a new binary column where the value is 1 if the rating is 4 or 5, and 0 otherwise.

In [69]:
# Display the number of users and movies
print(f"Number of users: {user_item_matrix.shape[0]}")
print(f"Number of movies: {user_item_matrix.shape[1]}")

# Count the occurrences of each rating value
rating_counts = {}
for rating in range(6): # Ratings are 0 through 5
    count = np.sum(user_item_matrix == rating)
    rating_counts[rating] = count

print("\nRating Counts:")
for rating, count in rating_counts.items():
    print(f"Rating {rating}: {count}")

# Create a new binary column for high ratings (>= 4)
# We'll create a new matrix for this to avoid modifying the original user_item_matrix
high_rating_matrix = np.zeros_like(user_item_matrix, dtype=np.uint8)
high_rating_matrix[user_item_matrix >= 4] = 1

print("\nShape of high rating matrix:", high_rating_matrix.shape)

# Print a small slice of the high_rating_matrix to see its contents
print("\nSample of high rating matrix (first 5 users and first 10 movies):")
print(high_rating_matrix[0:5, 0:10])

Number of users: 6040
Number of movies: 3706

Rating Counts:
Rating 0: 21384031
Rating 1: 56174
Rating 2: 107557
Rating 3: 261197
Rating 4: 348971
Rating 5: 226310

Shape of high rating matrix: (6040, 3706)

Sample of high rating matrix (first 5 users and first 10 movies):
[[1 0 0 1 1 0 1 1 1 1]
 [1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 1]]


## 2. Prepare Data (Continued)

**Subtask:** Preprocess the data for building a recommendation model by creating a user-item matrix.

**Reasoning:**
The `data` variable loaded in the previous step already represents the user-item matrix with 0s for missing ratings. We will use this directly and confirm its shape.

In [70]:
# The 'data' variable already represents the user-item matrix with 0s for missing ratings.
# So, we will directly use the 'data' variable.
user_item_matrix = data

# Display the shape of the user-item matrix
print('Shape of user-item matrix:', user_item_matrix.shape)

Shape of user-item matrix: (6040, 3706)


## 3. Build Recommendation Model

**Subtask:** Build a recommendation model using the user-item matrix.

**Reasoning:**
We will use TF-IDF on the user-item matrix to create features that represent how users rate movies. Each row will be treated as a document, and the ratings (0-5) will be the terms.

In [71]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Convert each row of the user-item matrix to a space-separated string
user_item_matrix_strings = [' '.join(map(str, row)) for row in user_item_matrix]

# Instantiate TfidfVectorizer with a vocabulary including '0', ngram_range=(1,1), and a simple tokenizer
vectorizer = TfidfVectorizer(vocabulary=['0', '1', '2', '3', '4', '5'], ngram_range=(1, 1), tokenizer=lambda x: x.split())

# Apply TfidfVectorizer to the user-item matrix strings
user_features_sparse = vectorizer.fit_transform(user_item_matrix_strings)

# Convert the sparse matrix to a dense NumPy array
user_features_dense = user_features_sparse.toarray()

# Transpose the user features to get movie features (each column is a movie)
movie_features_dense = user_features_dense.T

# Print the shape of the dense feature array
print('Shape of movie features (dense):', movie_features_dense.shape)

Shape of movie features (dense): (6, 6040)


## 4. Get User Input

**Subtask:** Prompt the user to enter a movie name.

**Reasoning:**
Use the `input()` function to get the movie name from the user and store it in a variable.

In [72]:
movie_name = input('Please enter a movie name: ')
print(f'You entered: {movie_name}')

Please enter a movie name: Batman Forever
You entered: Batman Forever


## 5. Find Similar Movies

**Subtask:** Find movies similar to the one entered by the user.

**Reasoning:**
Find the movie ID for the user's input, then its index in the movie mapping, calculate cosine similarity with all other movie features, find the top similar movies, and get their titles.

In [73]:
# Find the movie ID from the user's input movie name
movie_id_input = None
for movie_id, title in movie_id_to_title.items():
    if movie_name.lower() in title.lower():
        movie_id_input = movie_id
        print(f'Found movie ID: {movie_id} for "{title}"')
        break

top_similar_movies = [] # Initialize as empty list

if movie_id_input is None:
    print(f"Movie '{movie_name}' not found in the dataset.")
else:
    # Find the index of the user's chosen movie in the movie_id_mapping
    if str(movie_id_input) in movie_id_mapping:
        movie_index_input = movie_id_mapping[str(movie_id_input)]

        # Select the column corresponding to the input movie index from the movie features
        input_movie_features = movie_features_dense[:, movie_index_input].reshape(1, -1)

        # Calculate cosine similarity between the input movie's feature vector and all other movie feature vectors
        similarity_scores = cosine_similarity(input_movie_features, movie_features_dense.T)

        # Get the indices of the movies with the highest similarity scores (excluding the input movie itself)
        similar_movies_indices = similarity_scores.flatten().argsort()[::-1]

        # Exclude the input movie's index from the list of similar movies
        similar_movies_indices = similar_movies_indices[similar_movies_indices != movie_index_input]

        # Get the top N similar movies (e.g., top 10)
        top_n = 10
        top_similar_movies_indices = similar_movies_indices[:top_n]

        # Retrieve the movie titles and their similarity scores
        for index in top_similar_movies_indices:
            # Find the movie ID corresponding to the index in movie_id_mapping
            movie_id_similar = None
            for movie_id, movie_index in movie_id_mapping.items():
                if movie_index == index:
                    movie_id_similar = int(movie_id)
                    break

            if movie_id_similar is not None and movie_id_similar in movie_id_to_title:
                movie_title_similar = movie_id_to_title[movie_id_similar]
                similarity_score = similarity_scores[0, index]
                top_similar_movies.append((movie_title_similar, similarity_score))

    else:
        print(f"Movie ID {movie_id_input} not found in the movie_id_mapping.")

Found movie ID: 153 for "Batman Forever (1995)"


## 6. Recommend Movies

**Subtask:** Present the recommended movies to the user.

**Reasoning:**
Print the list of top similar movies found in the previous step.

In [74]:
# Present the recommended movies to the user
if top_similar_movies:
    print("\nRecommended movies based on your input:")
    for title, score in top_similar_movies:
        print(f"- {title}")
else:
    print("No recommendations could be generated for the entered movie.")


Recommended movies based on your input:
- Sleepless in Seattle (1993)
- Orlando (1993)
- Age of Innocence, The (1993)
- Brother from Another Planet, The (1984)


## 7. Finish task

**Subtask:** Summarize the process and celebrate your working recommendation system!

**Reasoning:**
Provide an exciting summary of the steps taken to build the movie recommendation system and encourage further interaction.

🎉 **Congratulations!** 🎉

You've successfully built a movie recommendation system from scratch! We've gone on a journey:

1.  **Loaded and Prepared Data**: We gathered and cleaned the movie ratings and titles.
2.  **Built the Model**: We used a clever technique (TF-IDF on user ratings) to understand movie relationships.
3.  **Got Your Input**: We listened to what movie you like.
4.  **Found Hidden Gems**: We unearthed similar movies based on how people rate them.
5.  **Delivered Recommendations**: We presented your personalized movie suggestions!

Now the fun really begins! Go back to the "Get User Input" step (Cell `5eb4f697`) and try entering different movie names to discover more cinematic treasures! Happy watching! 🎬🍿