<a href="https://colab.research.google.com/github/Orth33/movie-recommendation/blob/main/Movies_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Recommendation System
**Objective:** Build a recommendation system using collaborative filtering on the MovieLens 100K Dataset.

In [1]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error

In [10]:
!pip install numpy==1.26.4
import numpy as np



In [5]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("rajmehra03/movielens100k")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'movielens100k' dataset.
Path to dataset files: /kaggle/input/movielens100k


In [6]:
# The CSVs are read from the directory returned by kagglehub above (`path`).
# Outside Colab/Kaggle, point `path` at the folder that holds movies.csv and ratings.csv.
import os

try:
    movies = pd.read_csv(os.path.join(path, 'movies.csv'))
    ratings = pd.read_csv(os.path.join(path, 'ratings.csv'))
    print("Data loaded successfully!")
except FileNotFoundError:
    print("Please upload movies.csv and ratings.csv to your Colab environment.")
    raise  # stop here; the cells below need `movies` and `ratings`

# Take a quick look at the data structure
display(movies.head(3))
display(ratings.head(3))

Data loaded successfully!


,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182


## 1. Creating the User-Item Matrix
To compare users, we need to transform our ratings data into a matrix where rows represent `userId` and columns represent `movieId`.

In [None]:
# Create the pivot table
user_item_matrix = ratings.pivot(index='userId', columns='movieId', values='rating')

# Fill NaNs with 0.
# (In advanced systems, you might subtract the user's mean rating to handle bias, but 0 is standard for baseline Cosine Similarity).
user_item_matrix_filled = user_item_matrix.fillna(0)

print(f"User-Item Matrix Shape: {user_item_matrix_filled.shape}")
user_item_matrix_filled.head()

User-Item Matrix Shape: (671, 9066)


userId / movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
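
The comment in the cell above mentions subtracting each user's mean rating before filling the gaps. A minimal sketch of that idea, using a hypothetical toy frame (`toy`, `toy_matrix`, and `centered_filled` are illustrative names, not part of this notebook's pipeline):

```python
import pandas as pd

# Hypothetical toy ratings (not taken from the MovieLens data)
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 2],
    'movieId': [10, 20, 10, 20, 30],
    'rating':  [4.0, 2.0, 5.0, 5.0, 2.0],
})

toy_matrix = toy.pivot(index='userId', columns='movieId', values='rating')

# Mean over the movies each user actually rated (NaNs are skipped by default)
user_means = toy_matrix.mean(axis=1)

# Subtract each user's own mean from that user's row
centered = toy_matrix.sub(user_means, axis=0)

# After centering, 0 means "average for this user", so filling NaN with 0 is neutral
centered_filled = centered.fillna(0)
print(centered_filled)
```

With centered ratings, a 0 no longer reads as "rated far below everything else", which is the bias the plain zero-fill introduces.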


## 2. Computing User Similarity Scores
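We will use **Cosine Similarity** from Scikit-Learn to measure the angle between the rating vectors of different users. Because the zero-filled ratings are non-negative, scores fall between 0 and 1: a score of 1 means two users' rating vectors point in the same direction, and 0 means the users have no rated movies in common.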

In [None]:
# Calculate cosine similarity between all users
user_similarity = cosine_similarity(user_item_matrix_filled)

# Convert to a DataFrame for easier user-ID referencing
user_similarity_df = pd.DataFrame(
    user_similarity,
    index=user_item_matrix.index,
    columns=user_item_matrix.index
)

print("User Similarity Matrix:")
display(user_similarity_df.head())

User Similarity Matrix:


userId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
1,1.0,0.0,0.0,0.074482,0.016818,0.0,0.083884,0.0,0.012843,0.0,...,0.0,0.0,0.014474,0.043719,0.0,0.0,0.0,0.062917,0.0,0.017466
2,0.0,1.0,0.124295,0.118821,0.103646,0.0,0.212985,0.11319,0.113333,0.043213,...,0.477306,0.063202,0.077745,0.164162,0.466281,0.425462,0.084646,0.02414,0.170595,0.113175
3,0.0,0.124295,1.0,0.08164,0.151531,0.060691,0.154714,0.249781,0.134475,0.114672,...,0.161205,0.064198,0.176134,0.158357,0.177098,0.124562,0.124911,0.080984,0.136606,0.170193
4,0.074482,0.118821,0.08164,1.0,0.130649,0.079648,0.319745,0.191013,0.030417,0.137186,...,0.114319,0.047228,0.136579,0.25403,0.121905,0.088735,0.068483,0.104309,0.054512,0.211609
5,0.016818,0.103646,0.151531,0.130649,1.0,0.063796,0.095888,0.165712,0.086616,0.03237,...,0.191029,0.021142,0.146173,0.224245,0.139721,0.058252,0.042926,0.038358,0.062642,0.225086
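
To make the score concrete, here is a small sketch that computes cosine similarity by hand with NumPy. The three rating vectors are hypothetical, and `cosine` is an illustrative helper rather than the Scikit-Learn function used above:

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0

# Hypothetical zero-filled rating vectors over the same four movies
alice = np.array([5.0, 0.0, 3.0, 0.0])
bob   = np.array([4.0, 0.0, 0.0, 2.0])
carol = np.array([0.0, 4.0, 0.0, 0.0])

print(cosine(alice, bob))    # they share only the first movie
print(cosine(alice, carol))  # no co-rated movies -> 0.0
print(cosine(alice, alice))  # same direction -> approximately 1.0
```

Note that with zero-filling, every movie that only one of the two users rated still adds to the norms in the denominator, so users with many non-overlapping ratings look less similar.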


## 3. Recommending Top-Rated Unseen Movies
We will build a function that:
1. Finds users similar to our target user.
2. Identifies movies those similar users rated highly.
3. Removes movies the target user has already seen.
4. Returns the top recommendations.

In [None]:
def recommend_movies(target_user_id, user_item_matrix, similarity_df, movies_df, top_n=5):
    # 1. Get the target user's ratings
    user_ratings = user_item_matrix.loc[target_user_id]

    # 2. Find movies the user HAS NOT seen (where rating is NaN)
    unseen_movies = user_ratings[user_ratings.isna()].index

    # 3. Get similar users (excluding the user themselves)
    similar_users = similarity_df[target_user_id].drop(target_user_id)

    # 4. Predict ratings for unseen movies based on weighted average of similar users' ratings
    predicted_ratings = {}

    for movie_id in unseen_movies:
        # Who has rated this unseen movie?
        users_who_rated = user_item_matrix[movie_id].dropna().index

        if len(users_who_rated) == 0:
            continue

        # Get similarity scores and ratings of those specific users
        sim_scores = similar_users.loc[users_who_rated]
        ratings_given = user_item_matrix.loc[users_who_rated, movie_id]

        # Calculate weighted sum
        numerator = np.dot(sim_scores, ratings_given)
        denominator = sim_scores.sum()

        if denominator > 0:
            predicted_ratings[movie_id] = numerator / denominator

    # 5. Sort the predicted ratings in descending order
    recommended_movie_ids = sorted(predicted_ratings, key=predicted_ratings.get, reverse=True)[:top_n]

    # 6. Map movie IDs back to titles.
    # Note: isin() keeps movies_df's row order, so the result is not sorted by predicted rating.
    recommendations = movies_df[movies_df['movieId'].isin(recommended_movie_ids)]

    return recommendations[['movieId', 'title', 'genres']]

# Let's test it for User ID 1!
recommend_movies(target_user_id=1,
                 user_item_matrix=user_item_matrix,
                 similarity_df=user_similarity_df,
                 movies_df=movies,
                 top_n=5)

,movieId,title,genres
632,759,Maya Lin: A Strong Clear Vision (1994),Documentary
1844,2330,Hands on a Hard Body (1996),Comedy|Documentary
2504,3112,'night Mother (1986),Drama
2683,3357,East-West (Est-ouest) (1999),Drama|Romance
3499,4428,"Misfits, The (1961)",Comedy|Drama|Romance|Western
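
The weighted average in step 4 can be dominated by a single neighbour: if only one user rated a movie and gave it 5.0, the prediction is 5.0 however weak the similarity is. That may be one reason obscure titles rise to the top. The sketch below isolates the formula and adds a hypothetical `min_raters` threshold, which is an assumption for illustration and not part of `recommend_movies`:

```python
import numpy as np

def predict_rating(sim_scores, ratings_given, min_raters=1):
    """Similarity-weighted average; returns None when support is too thin."""
    sim_scores = np.asarray(sim_scores, dtype=float)
    ratings_given = np.asarray(ratings_given, dtype=float)
    if len(ratings_given) < min_raters or sim_scores.sum() <= 0:
        return None
    return float(np.dot(sim_scores, ratings_given) / sim_scores.sum())

# One weakly similar neighbour rated the movie 5.0: the weight cancels out
single = predict_rating([0.02], [5.0])

# Three neighbours: (0.5*4 + 0.3*3 + 0.2*5) / (0.5 + 0.3 + 0.2)
several = predict_rating([0.5, 0.3, 0.2], [4.0, 3.0, 5.0])

# Requiring at least 3 raters (hypothetical threshold) drops the first case
filtered = predict_rating([0.02], [5.0], min_raters=3)

print(single, several, filtered)
```

A support threshold like this trades coverage for reliability; a suitable value would need to be tuned on held-out data.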


## 4. Evaluation: Train/Test Split & Precision@K
To truly evaluate our system, we need to hide some data. We will:
1. Split the dataset so a portion of each user's ratings is kept as a "test set" (the ground truth).
2. Train our similarity matrix *only* on the training set.
3. Generate top K recommendations for a user.
4. Check how many of those recommendations appear in their test set with a high rating (e.g., >= 4.0).

In [None]:
from sklearn.model_selection import train_test_split

# 1. Train/Test Split (80% train, 20% test)
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)

# Create train user-item matrix
train_user_item = train_data.pivot(index='userId', columns='movieId', values='rating')
train_matrix_filled = train_user_item.fillna(0)

# Calculate similarity strictly on train data
train_user_sim = cosine_similarity(train_matrix_filled)
train_user_sim_df = pd.DataFrame(train_user_sim, index=train_user_item.index, columns=train_user_item.index)

def evaluate_precision_at_k(user_id, k=5, threshold=4.0):
    # Get movies the user actually liked in the test set (Rating >= threshold)
    user_test_data = test_data[test_data['userId'] == user_id]
    actual_liked = user_test_data[user_test_data['rating'] >= threshold]['movieId'].tolist()

    if not actual_liked:
        return None # User didn't have highly rated movies in the test set

    # Get recommendations using the function we built earlier, using TRAIN data
    recs = recommend_movies(target_user_id=user_id,
                            user_item_matrix=train_user_item,
                            similarity_df=train_user_sim_df,
                            movies_df=movies,
                            top_n=k)

    recommended_ids = recs['movieId'].tolist()

    # Calculate Precision
    hits = set(actual_liked).intersection(set(recommended_ids))
    precision = len(hits) / k
    return precision

# Test on a few users and average the precision
k = 2
precisions = []
for uid in train_user_item.index[:50]: # Testing on the first 50 users for speed
    p = evaluate_precision_at_k(uid, k=k)
    if p is not None:
        precisions.append(p)

print(f"Average Precision@{k} for sample users: {np.mean(precisions):.4f}")

Average Precision@2 for sample users: 0.0000
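
For reference, Precision@K on its own is a short calculation. The sketch below uses hypothetical movie IDs and an illustrative helper `precision_at_k` that is separate from `evaluate_precision_at_k` above:

```python
def precision_at_k(recommended_ids, relevant_ids, k):
    """Fraction of the top-k recommendations that appear in the relevant set."""
    top_k = list(recommended_ids)[:k]
    hits = len(set(top_k) & set(relevant_ids))
    return hits / k

# Hypothetical example: 5 ranked recommendations, 3 held-out liked movies
recommended = [318, 50, 296, 2571, 1]
liked_in_test = [296, 1, 999]

print(precision_at_k(recommended, liked_in_test, k=5))  # 2 hits / 5 = 0.4
print(precision_at_k(recommended, liked_in_test, k=2))  # 0 hits / 2 = 0.0
```

Because each user's test set holds only a handful of liked movies while recommendations are drawn from thousands of unseen titles, very low values are plausible for a baseline like this one, especially at small K.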


---
## BONUS 1: Item-Based Collaborative Filtering
Instead of asking "Which users are similar to User A?", Item-Based CF asks "Which movies are similar to the movies User A likes?".

Item-to-item similarities are often more stable than user-to-user similarities, because a movie's rating profile tends to shift more slowly than an individual user's preferences. We calculate them by transposing our user-item matrix so that movies are the rows.

In [None]:
# Transpose the matrix so movies are rows and users are columns
item_user_matrix = user_item_matrix_filled.T

# Calculate cosine similarity between movies
item_similarity = cosine_similarity(item_user_matrix)

# Create a DataFrame for item similarity
item_similarity_df = pd.DataFrame(
    item_similarity,
    index=item_user_matrix.index,
    columns=item_user_matrix.index
)

def get_similar_movies(movie_id, top_n=5):
    # Sort the similarity scores for the given movie
    similar_scores = item_similarity_df[movie_id].sort_values(ascending=False)

    # Drop the movie itself and get top N
    similar_movie_ids = similar_scores.drop(movie_id).head(top_n).index

    return movies[movies['movieId'].isin(similar_movie_ids)][['movieId', 'title', 'genres']]

# Let's find movies similar to Movie ID 1 ('Toy Story (1995)' in this dataset)
print("Because you watched Movie ID 1:")
get_similar_movies(movie_id=1, top_n=5)

Because you watched Movie ID 1:


,movieId,title,genres
232,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi
321,356,Forrest Gump (1994),Comedy|Drama|Romance|War
644,780,Independence Day (a.k.a. ID4) (1996),Action|Adventure|Sci-Fi|Thriller
1019,1265,Groundhog Day (1993),Comedy|Fantasy|Romance
2506,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy
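
The cell above looks up neighbours of a single movie. To turn item similarities into per-user recommendations, one common approach is to score each unseen movie by its similarity to the movies the user has rated, weighted by those ratings. A minimal NumPy sketch on a hypothetical 4x4 matrix (`R`, `item_sim`, and `scores` are illustrative names):

```python
import numpy as np

# Hypothetical zero-filled user-item matrix: 4 users (rows) x 4 movies (columns)
R = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
    [5.0, 0.0, 0.0, 0.0],   # target user: has only rated movie 0
])

# Item-item cosine similarity: normalise each movie column, then take dot products
norms = np.linalg.norm(R, axis=0)
norms[norms == 0] = 1.0            # avoid division by zero for unrated movies
R_unit = R / norms
item_sim = R_unit.T @ R_unit       # shape (n_movies, n_movies)

# Score every movie as a similarity-weighted sum of the target user's ratings
target = R[3]
scores = item_sim @ target
scores[target > 0] = -np.inf       # mask movies the user has already rated

best = int(np.argmax(scores))
print(best)  # movie 1, which is co-rated with movie 0 by users 0 and 1
```

In a full system the sum would typically be normalised and restricted to the nearest few item neighbours; this sketch skips both for brevity.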


## BONUS 2: SVD with the Surprise Library
Classical SVD needs a fully specified matrix, so on sparse rating data the missing entries must be filled (for example with 0), which skews the reconstruction. To avoid this, we will use the `scikit-surprise` library, whose `SVD` class implements the matrix factorization approach popularized by Simon Funk (**Funk SVD**). It fits the model only on the observed ratings, avoiding the bias introduced by zero-filling.
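
To illustrate what "optimizing only for the observed ratings" means, here is a simplified NumPy sketch of the idea. It is not Surprise's implementation: it omits the per-user and per-item bias terms, and the function name, toy data, and hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

def funk_svd_sketch(triples, n_users, n_items, n_factors=2,
                    lr=0.01, reg=0.02, n_epochs=200, seed=42):
    """Fit user/item factors by SGD over observed (user, item, rating) triples only."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0.0, 0.1, size=(n_users, n_factors))  # user factors
    Q = rng.normal(0.0, 0.1, size=(n_items, n_factors))  # item factors
    mu = float(np.mean([r for _, _, r in triples]))      # global mean rating
    for _ in range(n_epochs):
        for u, i, r in triples:
            err = r - (mu + P[u] @ Q[i])                 # error on one observed rating
            p_old = P[u].copy()
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, P, Q

def rmse_on(triples, predict):
    """Root mean squared error of predict(u, i) over the given triples."""
    return float(np.sqrt(np.mean([(r - predict(u, i)) ** 2 for u, i, r in triples])))

# Hypothetical observed ratings: (user index, item index, rating). Missing pairs are never touched.
toy = [(0, 0, 5), (0, 1, 5), (0, 2, 1),
       (1, 0, 5), (1, 3, 1),
       (2, 2, 5), (2, 3, 5), (2, 0, 1),
       (3, 2, 5), (3, 1, 1)]

mu, P, Q = funk_svd_sketch(toy, n_users=4, n_items=4)
baseline_rmse = rmse_on(toy, lambda u, i: mu)
train_rmse = rmse_on(toy, lambda u, i: mu + P[u] @ Q[i])
print(f"baseline RMSE: {baseline_rmse:.3f}, factor model RMSE: {train_rmse:.3f}")
```

The loop only ever visits the ten observed triples, so the six unobserved user-item pairs contribute nothing to the loss. That is the key difference from factorizing a zero-filled matrix.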

In [7]:
# Install the library
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp312-cp312-linux_x86_64.whl size=2554971 sha256=19cc419d7a8cc10b5484d2fcaa34619247744b54f8efa518676fb85816d71078
  Stored in directory: /root/.cache/pip/wheels/75/fa/bc/739bc2cb1fbaab6061854e6cfbb81a0ae52c92a502a7fa454b
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Succes

In [8]:
from surprise import Dataset, Reader, SVD
# Aliased so it does not shadow sklearn's train_test_split imported in Section 4
from surprise.model_selection import train_test_split as surprise_train_test_split
from surprise import accuracy

# 1. Define the rating scale for Surprise
reader = Reader(rating_scale=(0.5, 5.0))

# 2. Load the dataset directly from our existing pandas DataFrame
# Surprise requires strictly three columns in this exact order: user, item, rating
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# 3. Create an 80/20 train/test split
trainset, testset = surprise_train_test_split(data, test_size=0.2, random_state=42)

# 4. Initialize the SVD algorithm (n_factors is the number of latent features)
# We use 50 latent factors here; Surprise's default is 100
algo = SVD(n_factors=50, random_state=42)

# 5. Train the model on the training set
print("Training the Funk SVD model...")
algo.fit(trainset)
print("Training complete.")

# 6. Test the model on the testing set
predictions = algo.test(testset)

# 7. Calculate the proper RMSE
print("\n======================================")
# Surprise has a built-in RMSE function that handles everything cleanly
rmse = accuracy.rmse(predictions)
print("======================================")

Training the Funk SVD model...
Training complete.

RMSE: 0.8990
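
RMSE is the square root of the mean squared difference between predicted and held-out ratings, so it is expressed in stars. A small worked example with hypothetical numbers (unrelated to the model's actual predictions):

```python
import numpy as np

# Hypothetical (true rating, predicted rating) pairs
actual    = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.5, 3.0, 4.0, 3.0])

errors = actual - predicted                       # [0.5, 0.0, 1.0, -1.0]
rmse_manual = float(np.sqrt(np.mean(errors ** 2)))
print(rmse_manual)  # sqrt((0.25 + 0 + 1 + 1) / 4) = sqrt(0.5625) = 0.75
```

Squaring penalises large misses more than small ones, so a few badly wrong predictions raise RMSE more than many slightly wrong ones.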


**Generating Top-K Recommendations with Funk SVD**

Now that the model reaches a test RMSE of about 0.90, let's build the final recommendation function.

This function will:
1. Identify all the movies a specific user has **not** seen yet.
2. Predict the star rating the user would give to each unseen movie using our trained Funk SVD model.
3. Sort those predictions from highest to lowest.
4. Return the Top K movies along with their titles and expected ratings.

In [11]:
def get_top_n_surprise(user_id, movies_df, ratings_df, trained_algo, n=5):
    # 1. Get a list of all movie IDs in the dataset
    all_movie_ids = movies_df['movieId'].unique()

    # 2. Get a list of movie IDs the target user has already rated
    user_rated_movies = ratings_df[ratings_df['userId'] == user_id]['movieId'].unique()

    # 3. Find the unseen movies (All Movies - Rated Movies)
    unseen_movies = np.setdiff1d(all_movie_ids, user_rated_movies)

    # 4. Predict the user's rating for every unseen movie
    predictions = []
    for movie_id in unseen_movies:
        # algo.predict returns a Prediction object containing the estimated rating (est)
        pred = trained_algo.predict(user_id, movie_id)
        predictions.append((movie_id, pred.est))

    # 5. Sort the predictions by estimated rating in descending order
    predictions.sort(key=lambda x: x[1], reverse=True)

    # 6. Extract the Top N movie IDs and their predicted ratings
    top_n_movie_ids = [pred[0] for pred in predictions[:n]]
    top_n_est_ratings = [pred[1] for pred in predictions[:n]]

    # 7. Put the results into a clean DataFrame
    recommendations = pd.DataFrame({
        'movieId': top_n_movie_ids,
        'Predicted_Rating': top_n_est_ratings
    })

    # Merge with the movies DataFrame to get titles and genres
    recommendations = recommendations.merge(movies_df, on='movieId', how='left')

    return recommendations[['movieId', 'title', 'genres', 'Predicted_Rating']]

# Let's generate the final Top 5 recommendations for User ID 1!
print("==========================================================")
print("Final Funk SVD Recommendations for User 1:")
print("==========================================================")
final_recs = get_top_n_surprise(user_id=1,
                                movies_df=movies,
                                ratings_df=ratings,
                                trained_algo=algo, # The Funk SVD model trained above
                                n=5)
display(final_recs)

Final Funk SVD Recommendations for User 1:


,movieId,title,genres,Predicted_Rating
0,318,"Shawshank Redemption, The (1994)",Crime|Drama,3.859072
1,4993,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy,3.78898
2,5952,"Lord of the Rings: The Two Towers, The (2002)",Adventure|Fantasy,3.78103
3,899,Singin' in the Rain (1952),Comedy|Musical|Romance,3.762335
4,1136,Monty Python and the Holy Grail (1975),Adventure|Comedy|Fantasy,3.749256
