<a href="https://colab.research.google.com/github/Orth33/music-genre-classification/blob/main/Movies_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Recommendation System
**Objective:** Build a recommendation system using collaborative filtering on the MovieLens 100K Dataset.

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error

In [3]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("rajmehra03/movielens100k")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'movielens100k' dataset.
Path to dataset files: /kaggle/input/movielens100k


In [4]:
# Build the file paths from the `path` returned by kagglehub above.
# If you are not using kagglehub, upload movies.csv and ratings.csv to Colab
# (or mount Google Drive) and point `path` at that folder instead.
import os

try:
    movies = pd.read_csv(os.path.join(path, 'movies.csv'))
    ratings = pd.read_csv(os.path.join(path, 'ratings.csv'))
    print("Data loaded successfully!")
except FileNotFoundError:
    print("Please upload movies.csv and ratings.csv to your Colab environment.")
    raise  # Stop here: every cell below needs both DataFrames

# Take a quick look at the data structure
display(movies.head(3))
display(ratings.head(3))

Data loaded successfully!


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182


## 1. Creating the User-Item Matrix
To compare users, we need to transform our ratings data into a matrix where rows represent `userId` and columns represent `movieId`.
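
As a small illustration of that reshaping, the sketch below pivots a handful of made-up ratings (the `toy_ratings` values are invented for this example, not taken from the dataset). It shows that user-movie pairs without a rating come out as `NaN`, and how to measure what share of the matrix is empty.

```python
import pandas as pd

# Toy ratings in the same long format as ratings.csv (values are made up).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 3.0, 5.0, 2.5],
})

# One row per user, one column per movie; unrated pairs become NaN.
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating")
print(toy_matrix)

# Share of user-movie cells that have no rating.
sparsity = toy_matrix.isna().to_numpy().mean()
print(f"Sparsity: {sparsity:.2f}")
```

Note that `pivot` raises an error if the same user rated the same movie twice; the notebook's data has one rating per pair, so that case does not arise here.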

In [5]:
# Create the pivot table
user_item_matrix = ratings.pivot(index='userId', columns='movieId', values='rating')

# Fill NaNs with 0.
# (In advanced systems, you might subtract the user's mean rating to handle bias, but 0 is standard for baseline Cosine Similarity).
user_item_matrix_filled = user_item_matrix.fillna(0)

print(f"User-Item Matrix Shape: {user_item_matrix_filled.shape}")
user_item_matrix_filled.head()

User-Item Matrix Shape: (671, 9066)


userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 2. Computing User Similarity Scores
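We will use **Cosine Similarity** from Scikit-Learn to measure the angle between the rating vectors of different users. A score of 1 means the two vectors point in the same direction (the same movies rated in the same proportions); a score of 0 means the two users have no rated movies in common. Because ratings are non-negative and unrated movies are filled with 0, every score here falls between 0 and 1.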
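
To make the score concrete, here is a minimal sketch on three invented users. The `cosine` helper is a hypothetical hand-written version used only to cross-check `cosine_similarity`; the rating values are made up.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up users rating four movies; 0 stands for "not rated".
toy = np.array([
    [5.0, 4.0, 0.0, 0.0],   # user A
    [2.5, 2.0, 0.0, 0.0],   # user B: same pattern as A at half the scale
    [0.0, 0.0, 3.0, 4.0],   # user C: no movies in common with A or B
])

def cosine(u, v):
    """Cosine of the angle between two vectors: u.v / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine_similarity(toy)
print(np.round(sim, 3))
print("A vs B by hand:", round(cosine(toy[0], toy[1]), 3))
```

Users A and B get a similarity of 1 even though B rates everything lower, because cosine similarity ignores the length of the vectors. A and C share no rated movies, so their similarity is 0.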

In [6]:
# Calculate cosine similarity between all users
user_similarity = cosine_similarity(user_item_matrix_filled)

# Convert to a DataFrame for easier user-ID referencing
user_similarity_df = pd.DataFrame(
    user_similarity,
    index=user_item_matrix.index,
    columns=user_item_matrix.index
)

print("User Similarity Matrix:")
display(user_similarity_df.head())

User Similarity Matrix:


userId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
1,1.0,0.0,0.0,0.074482,0.016818,0.0,0.083884,0.0,0.012843,0.0,...,0.0,0.0,0.014474,0.043719,0.0,0.0,0.0,0.062917,0.0,0.017466
2,0.0,1.0,0.124295,0.118821,0.103646,0.0,0.212985,0.11319,0.113333,0.043213,...,0.477306,0.063202,0.077745,0.164162,0.466281,0.425462,0.084646,0.02414,0.170595,0.113175
3,0.0,0.124295,1.0,0.08164,0.151531,0.060691,0.154714,0.249781,0.134475,0.114672,...,0.161205,0.064198,0.176134,0.158357,0.177098,0.124562,0.124911,0.080984,0.136606,0.170193
4,0.074482,0.118821,0.08164,1.0,0.130649,0.079648,0.319745,0.191013,0.030417,0.137186,...,0.114319,0.047228,0.136579,0.25403,0.121905,0.088735,0.068483,0.104309,0.054512,0.211609
5,0.016818,0.103646,0.151531,0.130649,1.0,0.063796,0.095888,0.165712,0.086616,0.03237,...,0.191029,0.021142,0.146173,0.224245,0.139721,0.058252,0.042926,0.038358,0.062642,0.225086


## 3. Recommending Top-Rated Unseen Movies
We will build a function that:
1. Finds users similar to our target user.
2. Identifies movies those similar users rated highly.
3. Removes movies the target user has already seen.
4. Returns the top recommendations.
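
The scoring behind step 2 can be sketched as a similarity-weighted average. The numbers below are invented; `sim_scores` and `neighbour_ratings` stand in for the similarities and ratings of the users who rated one unseen movie.

```python
import numpy as np

# Hypothetical similarities between the target user and three neighbours
# who all rated the same unseen movie, plus the ratings they gave it.
sim_scores = np.array([0.8, 0.5, 0.1])
neighbour_ratings = np.array([5.0, 3.0, 1.0])

# Similarity-weighted average: sum(sim * rating) / sum(sim)
predicted = np.dot(sim_scores, neighbour_ratings) / sim_scores.sum()
print(f"Predicted rating: {predicted:.2f}")

# With a single neighbour the weight cancels out, so the prediction is just
# that neighbour's rating, however weak the similarity is.
lone_sim = np.array([0.05])
lone_rating = np.array([5.0])
lone_prediction = np.dot(lone_sim, lone_rating) / lone_sim.sum()
print(f"Single-neighbour prediction: {lone_prediction:.2f}")
```

The second case is worth keeping in mind when reading the results: a movie rated 5.0 by one barely similar user gets a predicted rating of 5.0, which tends to push rarely rated titles to the top of the list. Requiring a minimum number of neighbours per movie is one common way to soften this.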

In [7]:
def recommend_movies(target_user_id, user_item_matrix, similarity_df, movies_df, top_n=5):
    # 1. Get the target user's ratings
    user_ratings = user_item_matrix.loc[target_user_id]

    # 2. Find movies the user HAS NOT seen (where rating is NaN)
    unseen_movies = user_ratings[user_ratings.isna()].index

    # 3. Get similar users (excluding the user themselves)
    similar_users = similarity_df[target_user_id].drop(target_user_id)

    # 4. Predict ratings for unseen movies based on weighted average of similar users' ratings
    predicted_ratings = {}

    for movie_id in unseen_movies:
        # Who has rated this unseen movie?
        users_who_rated = user_item_matrix[movie_id].dropna().index

        if len(users_who_rated) == 0:
            continue

        # Get similarity scores and ratings of those specific users
        sim_scores = similar_users.loc[users_who_rated]
        ratings_given = user_item_matrix.loc[users_who_rated, movie_id]

        # Calculate weighted sum
        numerator = np.dot(sim_scores, ratings_given)
        denominator = sim_scores.sum()

        if denominator > 0:
            predicted_ratings[movie_id] = numerator / denominator

    # 5. Sort the predicted ratings in descending order
    recommended_movie_ids = sorted(predicted_ratings, key=predicted_ratings.get, reverse=True)[:top_n]

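    # 6. Map movie IDs back to titles, keeping the ranked order
    #    (isin() alone returns rows in movies_df order, not by predicted rating)
    rank = {movie_id: position for position, movie_id in enumerate(recommended_movie_ids)}
    recommendations = movies_df[movies_df['movieId'].isin(recommended_movie_ids)]
    recommendations = recommendations.sort_values('movieId', key=lambda ids: ids.map(rank))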

    return recommendations[['movieId', 'title', 'genres']]

# Let's test it for User ID 1!
recommend_movies(target_user_id=1,
                 user_item_matrix=user_item_matrix,
                 similarity_df=user_similarity_df,
                 movies_df=movies,
                 top_n=5)

Unnamed: 0,movieId,title,genres
632,759,Maya Lin: A Strong Clear Vision (1994),Documentary
1844,2330,Hands on a Hard Body (1996),Comedy|Documentary
2504,3112,'night Mother (1986),Drama
2683,3357,East-West (Est-ouest) (1999),Drama|Romance
3499,4428,"Misfits, The (1961)",Comedy|Drama|Romance|Western


## 4. True Evaluation: Train/Test Split & Precision@K
To truly evaluate our system, we need to hide some data. We will:
1. Randomly hold out 20% of all ratings as a "test set" (the ground truth). The split is not stratified by user, so some users may end up with few or no test ratings.
2. Train our similarity matrix *only* on the training set.
3. Generate top K recommendations for a user.
4. Check how many of those recommendations appear in their test set with a high rating (e.g., >= 4.0).
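
Step 4 boils down to a set intersection. Here is a minimal sketch with a hypothetical `precision_at_k` helper and invented movie IDs:

```python
def precision_at_k(recommended_ids, relevant_ids, k):
    """Fraction of the top-k recommendations that appear in the relevant set."""
    top_k = list(recommended_ids)[:k]
    hits = len(set(top_k) & set(relevant_ids))
    return hits / k

# Made-up IDs: ranked recommendations, and held-out movies the user rated >= 4.0.
recommended = [260, 296, 1197, 1210, 2028]
liked_in_test = [1197, 1200, 1210, 2406]

print(precision_at_k(recommended, liked_in_test, k=5))
print(precision_at_k(recommended, liked_in_test, k=2))
```

Two of the five recommendations are in the liked set, so Precision@5 is 0.4, while neither of the top two is, so Precision@2 is 0.0. The value of K has to match between the computation and the reported label, and the recommendations must be in ranked order for the cutoff to be meaningful.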

In [9]:
from sklearn.model_selection import train_test_split

# 1. Train/Test Split (80% train, 20% test)
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)

# Create train user-item matrix
train_user_item = train_data.pivot(index='userId', columns='movieId', values='rating')
train_matrix_filled = train_user_item.fillna(0)

# Calculate similarity strictly on train data
train_user_sim = cosine_similarity(train_matrix_filled)
train_user_sim_df = pd.DataFrame(train_user_sim, index=train_user_item.index, columns=train_user_item.index)

def evaluate_precision_at_k(user_id, k=5, threshold=4.0):
    # Get movies the user actually liked in the test set (Rating >= threshold)
    user_test_data = test_data[test_data['userId'] == user_id]
    actual_liked = user_test_data[user_test_data['rating'] >= threshold]['movieId'].tolist()

    if not actual_liked:
        return None # User didn't have highly rated movies in the test set

    # Get recommendations using the function we built earlier, using TRAIN data
    recs = recommend_movies(target_user_id=user_id,
                            user_item_matrix=train_user_item,
                            similarity_df=train_user_sim_df,
                            movies_df=movies,
                            top_n=k)

    recommended_ids = recs['movieId'].tolist()

    # Calculate Precision
    hits = set(actual_liked).intersection(set(recommended_ids))
    precision = len(hits) / k
    return precision

# Test on a few users and average the precision
K = 2  # Cutoff passed to evaluate_precision_at_k and reported in the label below
precisions = []
for uid in train_user_item.index[:50]: # Testing on the first 50 users for speed
    p = evaluate_precision_at_k(uid, k=K)
    if p is not None:
        precisions.append(p)

print(f"Average Precision@{K} for sample users: {np.mean(precisions):.4f}")

Average Precision@2 for sample users: 0.0000


---
## BONUS 1: Item-Based Collaborative Filtering
Instead of asking "Which users are similar to User A?", Item-Based CF asks "Which movies are similar to the movies User A likes?".

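Item-item similarities tend to be more stable than user-user similarities: a movie's rating profile is built from many users and shifts slowly, whereas a single user's profile is short and changes with every new rating. We calculate item similarity by transposing our user-item matrix so that movies are the rows.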
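
The cell that follows finds movies similar to a given movie. As a complement, the sketch below shows one way item similarities could be turned into a score for a specific user: each unseen movie is scored by the user's own ratings, weighted by how similar the rated movies are to it. The `score_unseen` helper and the toy matrix are assumptions made for this illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy zero-filled user-item matrix (rows: users, columns: movie IDs; values made up).
toy = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [4.0, 5.0, 1.0],
     [0.0, 1.0, 5.0]],
    index=[1, 2, 3], columns=[10, 20, 30],
)

# Item-item similarity: transpose so each movie is a row of user ratings.
item_sim = pd.DataFrame(cosine_similarity(toy.T), index=toy.columns, columns=toy.columns)

def score_unseen(user_id):
    """Similarity-weighted average of the user's own ratings, per unseen movie."""
    user_row = toy.loc[user_id]
    rated = user_row[user_row > 0]
    scores = {}
    for movie_id in toy.columns.difference(rated.index):
        weights = item_sim.loc[movie_id, rated.index]
        if weights.sum() > 0:
            scores[movie_id] = float(np.dot(weights.to_numpy(), rated.to_numpy()) / weights.sum())
    return pd.Series(scores, dtype=float).sort_values(ascending=False)

print(score_unseen(1))
```

User 1 has not rated movie 30, so it is the only movie scored; user 2 has rated everything, so the result for that user is empty.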

In [10]:
# Transpose the matrix so movies are rows and users are columns
item_user_matrix = user_item_matrix_filled.T

# Calculate cosine similarity between movies
item_similarity = cosine_similarity(item_user_matrix)

# Create a DataFrame for item similarity
item_similarity_df = pd.DataFrame(
    item_similarity,
    index=item_user_matrix.index,
    columns=item_user_matrix.index
)

def get_similar_movies(movie_id, top_n=5):
    # Sort the similarity scores for the given movie
    similar_scores = item_similarity_df[movie_id].sort_values(ascending=False)

    # Drop the movie itself and get top N
    similar_movie_ids = similar_scores.drop(movie_id).head(top_n).index

    return movies[movies['movieId'].isin(similar_movie_ids)][['movieId', 'title', 'genres']]

# Let's find movies similar to Movie ID 1 ('Toy Story (1995)' in this dataset)
print("Because you watched Movie ID 1:")
get_similar_movies(movie_id=1, top_n=5)

Because you watched Movie ID 1:


Unnamed: 0,movieId,title,genres
232,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi
321,356,Forrest Gump (1994),Comedy|Drama|Romance|War
644,780,Independence Day (a.k.a. ID4) (1996),Action|Adventure|Sci-Fi|Thriller
1019,1265,Groundhog Day (1993),Comedy|Fantasy|Romance
2506,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy


---
## BONUS 2: Matrix Factorization (Singular Value Decomposition - SVD)
SVD is a powerful mathematical technique that extracts "latent features" (hidden patterns like genre preferences or movie tone) from the user-item matrix.

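By decomposing the matrix, keeping only the strongest latent features, and multiplying the factors back together, we get a dense matrix with a predicted score for every user-movie pair, including the ones that were blank (0). How accurate those scores are still has to be measured, which Sections 5 and 6 do.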
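
One detail matters a lot in practice: how the blanks are treated before the decomposition. The sketch below is a variant, not the notebook's code. It assumes unrated cells are kept as `NaN`, computes each user's mean over rated movies only, and uses `np.linalg.svd` on a tiny invented matrix instead of `svds`.

```python
import numpy as np

# Toy ratings with NaN for "not rated" (values are made up).
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 1.0, 1.0],
    [1.0, 1.0, 5.0, np.nan],
    [np.nan, 1.0, 4.0, 5.0],
])
rated = ~np.isnan(R)

# Mean of each user's observed ratings only; averaging a zero-filled row
# would count every unrated movie as a 0 and drag the mean toward 0.
user_means = np.nanmean(R, axis=1, keepdims=True)

# Centre the observed ratings; unrated cells become 0, i.e. "at the user's mean".
R_centered = np.where(rated, R - user_means, 0.0)

# Rank-k truncated SVD and reconstruction.
k = 2
U, s, Vt = np.linalg.svd(R_centered, full_matrices=False)
R_hat = (U[:, :k] * s[:k]) @ Vt[:k, :] + user_means

print(np.round(R_hat, 2))
```

With this centring, a blank cell starts from the user's typical rating rather than from 0, so the reconstructed values stay on the rating scale instead of clustering near 0.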

In [11]:
from scipy.sparse.linalg import svds

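# 1. Convert user-item matrix to a numpy array, and normalize it by subtracting user means
# Normalization is meant to handle "harsh" vs "generous" raters.
# Caveat: R is zero-filled, so np.mean averages over every movie column (unrated
# movies count as 0), not just the movies each user rated. The means sit close to 0.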
R = user_item_matrix_filled.values
user_ratings_mean = np.mean(R, axis=1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)

# 2. Apply SVD
# k represents the number of latent features. 50 is a common starting point.
U, sigma, Vt = svds(R_demeaned, k=50)

# Convert sigma to a diagonal matrix
sigma = np.diag(sigma)

# 3. Reconstruct the matrix to get predicted ratings
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)

# Convert back to DataFrame
preds_df = pd.DataFrame(all_user_predicted_ratings, columns=user_item_matrix_filled.columns, index=user_item_matrix_filled.index)

def recommend_movies_svd(user_id, preds_df, original_ratings_df, movies_df, top_n=5):
    # Get and sort the user's predictions
    sorted_user_predictions = preds_df.loc[user_id].sort_values(ascending=False)

    # Get the user's original ratings (to filter out movies already seen)
    user_data = original_ratings_df[original_ratings_df.userId == user_id]

    # Recommend movies not in the user's previously seen list
    recommendations = movies_df[~movies_df['movieId'].isin(user_data['movieId'])]

    # Merge with predictions and sort
    recommendations = recommendations.merge(
        pd.DataFrame(sorted_user_predictions).reset_index(),
        on='movieId'
    ).rename(columns={user_id: 'Predictions'}).sort_values('Predictions', ascending=False)

    return recommendations[['movieId', 'title', 'genres', 'Predictions']].head(top_n)

# Test SVD Recommendations for User ID 1
print("SVD Recommendations for User 1:")
recommend_movies_svd(user_id=1, preds_df=preds_df, original_ratings_df=ratings, movies_df=movies, top_n=5)

SVD Recommendations for User 1:


Unnamed: 0,movieId,title,genres,Predictions
1103,1374,Star Trek II: The Wrath of Khan (1982),Action|Adventure|Sci-Fi|Thriller,0.274864
1503,1954,Rocky (1976),Drama,0.214021
2379,2987,Who Framed Roger Rabbit? (1988),Adventure|Animation|Children|Comedy|Crime|Fant...,0.213012
2533,3175,Galaxy Quest (1999),Adventure|Comedy|Sci-Fi,0.201658
2759,3479,Ladyhawke (1985),Adventure|Fantasy|Romance,0.179425


## 5. Evaluating SVD & Sanity Check
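Since basic User-User CF struggles with sparsity, let's evaluate our Matrix Factorization (SVD) model. We will also print out exactly what is in the test set versus what is being recommended to understand why offline Precision@K is such a harsh metric. One caveat: `preds_df` was built from the full `user_item_matrix_filled`, which includes the ratings in `test_data`, so the precision reported here is optimistic. A stricter evaluation would refit SVD on `train_matrix_filled` only.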
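
The `preds_df` used below comes from the SVD step, which was fit on the full ratings matrix. A leakage-free variant would fit on the training ratings only. The sketch below shows the idea on invented data; `fit_svd_predictions`, `toy_train`, and `toy_test` are hypothetical names, and the centring uses each user's mean over rated movies, which differs from the notebook's zero-filled mean.

```python
import numpy as np
import pandas as pd
from scipy.sparse.linalg import svds

def fit_svd_predictions(train_ratings, k):
    """Hypothetical helper: rank-k SVD predictions from training ratings only."""
    matrix = train_ratings.pivot(index="userId", columns="movieId", values="rating")
    user_means = matrix.mean(axis=1)                  # NaNs are skipped by default
    centered = matrix.sub(user_means, axis=0).fillna(0.0)
    U, sigma, Vt = svds(centered.to_numpy(), k=k)     # needs 1 <= k < min(matrix.shape)
    preds = U @ np.diag(sigma) @ Vt + user_means.to_numpy().reshape(-1, 1)
    return pd.DataFrame(preds, index=matrix.index, columns=matrix.columns)

# Toy long-format ratings (made up), split by hand into train and test.
toy_train = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4, 4],
    "movieId": [10, 20, 10, 30, 20, 30, 10, 40],
    "rating":  [5.0, 4.0, 4.0, 1.0, 2.0, 5.0, 3.0, 4.5],
})
toy_test = pd.DataFrame({"userId": [1, 3], "movieId": [30, 10], "rating": [1.5, 2.0]})

toy_preds = fit_svd_predictions(toy_train, k=2)

# The held-out pairs were never seen during fitting, so these are true predictions.
for row in toy_test.itertuples(index=False):
    print(row.userId, row.movieId, round(float(toy_preds.loc[row.userId, row.movieId]), 2))
```

In the notebook, the equivalent call would pass `train_data` in place of `toy_train`, and the resulting frame would replace `preds_df` in the evaluation functions.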

In [12]:
def evaluate_svd_precision(user_id, preds_df, test_data, train_matrix, movies_df, k=5, threshold=4.0, verbose=False):
    # 1. Get the actual highly-rated movies in the test set
    user_test_data = test_data[test_data['userId'] == user_id]
    actual_liked_ids = user_test_data[user_test_data['rating'] >= threshold]['movieId'].tolist()

    if len(actual_liked_ids) == 0:
        return None

    # 2. Get the movies the user has already seen in the TRAIN set so we don't recommend them
    seen_in_train = train_matrix.columns[train_matrix.loc[user_id] > 0].tolist()

    # 3. Get SVD predictions for this user
    user_preds = preds_df.loc[user_id].drop(index=seen_in_train, errors='ignore').sort_values(ascending=False)

    # 4. Get the Top K recommended movie IDs
    recommended_ids = user_preds.head(k).index.tolist()

    # 5. Calculate Precision
    hits = set(actual_liked_ids).intersection(set(recommended_ids))
    precision = len(hits) / k

    if verbose:
        print(f"\n--- USER {user_id} ---")
        actual_movies = movies_df[movies_df['movieId'].isin(actual_liked_ids)]['title'].tolist()
        print(f"Actually Liked in Test Set ({len(actual_movies)} movies):")
        for m in actual_movies[:5]: print(f" - {m}")
        if len(actual_movies) > 5: print(f"   ...and {len(actual_movies)-5} more.")

        rec_movies = movies_df[movies_df['movieId'].isin(recommended_ids)]['title'].tolist()
        print(f"\nTop {k} SVD Recommendations:")
        for m in rec_movies: print(f" - {m}")
        print(f"-> Precision@{k}: {precision}")

    return precision

# Let's run a verbose sanity check on a single user first
evaluate_svd_precision(user_id=10,
                       preds_df=preds_df, # From our previous SVD step
                       test_data=test_data,
                       train_matrix=train_user_item,
                       movies_df=movies,
                       k=5,
                       verbose=True)

# Now let's calculate the average for the first 100 users using SVD
svd_precisions = []
for uid in train_user_item.index[:100]:
    p = evaluate_svd_precision(uid, preds_df, test_data, train_user_item, movies, k=5)
    if p is not None:
        svd_precisions.append(p)

print(f"\n======================================")
print(f"Average SVD Precision@5: {np.mean(svd_precisions):.4f}")
print(f"======================================")


--- USER 10 ---
Actually Liked in Test Set (7 movies):
 - Princess Bride, The (1987)
 - Aliens (1986)
 - Star Wars: Episode VI - Return of the Jedi (1983)
 - Romancing the Stone (1984)
 - Analyze This (1999)
   ...and 2 more.

Top 5 SVD Recommendations:
 - Star Wars: Episode IV - A New Hope (1977)
 - Pulp Fiction (1994)
 - Princess Bride, The (1987)
 - Star Wars: Episode VI - Return of the Jedi (1983)
 - Saving Private Ryan (1998)
-> Precision@5: 0.4

Average SVD Precision@5: 0.4186


## 6. Evaluating Accuracy: Root Mean Squared Error (RMSE)
Because an offline dataset only records the movies a user happened to rate, top-K hit rates understate how good a set of recommendations is. A standard complement is to predict the *rating* for each user-movie pair held out in the test set and compare it with the rating the user actually gave.

RMSE measures the average magnitude of our prediction errors. The formula is:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

* Where $y_i$ is the actual rating and $\hat{y}_i$ is the predicted rating.
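* A lower RMSE means our predictions are closer to the actual ratings. For the MovieLens 100K dataset, an RMSE between **0.90 and 1.00** is considered a strong baseline. A value far above that range usually points to a preprocessing problem, such as treating unrated movies as ratings of 0.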
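
As a quick worked instance of the formula, the sketch below computes RMSE by hand for four invented rating pairs and cross-checks it against `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted ratings for four test pairs.
actual = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.5, 3.0, 4.0, 3.0])

# Errors are 0.5, 0, 1, -1, so the squared errors average to 0.5625.
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(actual, predicted))

print(f"RMSE by hand:      {rmse:.4f}")
print(f"RMSE via sklearn:  {rmse_sklearn:.4f}")
```

Because the errors are squared before averaging, the two errors of size 1 dominate the result, which is why RMSE penalises a few large misses more than many small ones.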

In [13]:
from sklearn.metrics import mean_squared_error

def evaluate_svd_rmse(preds_df, test_data):
    actual_ratings = []
    predicted_ratings = []

    # Iterate through the test set
    for index, row in test_data.iterrows():
        user_id = int(row['userId'])
        movie_id = int(row['movieId'])
        actual_rating = row['rating']

        # We only predict if the user and movie exist in our training predictions
        if user_id in preds_df.index and movie_id in preds_df.columns:
            predicted_rating = preds_df.loc[user_id, movie_id]

            # Clip predictions to the 0.5-5 star range just in case SVD over/under shoots
            if predicted_rating > 5:
                predicted_rating = 5.0
            elif predicted_rating < 0.5:
                predicted_rating = 0.5

            actual_ratings.append(actual_rating)
            predicted_ratings.append(predicted_rating)

    # Calculate Mean Squared Error, then take the square root
    mse = mean_squared_error(actual_ratings, predicted_ratings)
    rmse = np.sqrt(mse)

    return rmse

# Calculate and print the RMSE
svd_rmse = evaluate_svd_rmse(preds_df, test_data)

print("======================================")
print(f"SVD Model RMSE: {svd_rmse:.4f}")
print("======================================")

SVD Model RMSE: 2.0327
