## Overview

This project explores personalized movie recommendations using a MovieLens dataset of roughly 100,000 ratings. The files loaded below (`links.csv`, `movies.csv`, `ratings.csv`, `tags.csv`) and the 610 user IDs in the outputs correspond to the MovieLens "latest small" release rather than the classic 100K release (943 users, 1,682 movies). The goal is to build a system that suggests movies based on user preferences, using user-based and item-based collaborative filtering and matrix factorization.

## Feature Landscape

**User-Item Interactions**
- Explicit ratings from users
- Sparse matrix representation of preferences
  
**Item Metadata**
- Movie genres (e.g., Action, Drama, Thriller)
- Movie titles and identifiers (IMDb, TMDb)
  
**Similarity Modeling**
- Cosine similarity between users and between items
- Neighborhood-based filtering
  
**Latent Feature Modeling**
- Matrix factorization using SVD
- Dimensionality reduction to uncover hidden tastes


## What Does This Analysis Aim to Do?

- **Recommend movies:** Suggest top-rated, unseen movies tailored to each user
- **Model similarity:** Identify movies similar to a given title using item-based collaborative filtering
- **Predict ratings:** Use matrix factorization (SVD) to estimate how a user might rate unrated movies
- **Evaluate performance:** Apply metrics like Precision@K to assess recommendation quality
- **Compare methods:** Analyze strengths of item-based filtering vs. SVD-based predictions

## Bonus Exploration

- Implemented genre-based filtering to refine recommendations
- Aggregated ratings to reduce noise and highlight consensus
- Visualized similarity matrices and latent features for interpretability

In [1]:
## Dataset file paths (local CSV files)
url1 = r"links.csv"
url2 = r"movies.csv"
url3 = r"ratings.csv"
url4 = r"tags.csv"

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
## Loading the datasets
df1 = pd.read_csv(url1)
df2 = pd.read_csv(url2)
df3 = pd.read_csv(url3)
df4 = pd.read_csv(url4)

In [4]:
df1.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [5]:
df2.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
df3.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [7]:
df4.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [8]:
## Merging the data
df_temp1 = pd.merge(df1, df2, on = 'movieId', how = 'left')
df_temp2 = pd.merge(df3, df4, on = ['movieId', 'userId', 'timestamp'], how = 'left')
df = pd.merge(df_temp1, df_temp2, on = 'movieId', how = 'left')

df

Unnamed: 0,movieId,imdbId,tmdbId,title,genres,userId,rating,timestamp,tag
0,1,114709,862.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,9.649827e+08,
1,1,114709,862.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,8.474350e+08,
2,1,114709,862.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,1.106636e+09,
3,1,114709,862.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,1.510578e+09,
4,1,114709,862.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,1.305696e+09,
...,...,...,...,...,...,...,...,...,...
100849,193581,5476944,432131.0,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184.0,4.0,1.537109e+09,
100850,193583,5914996,445030.0,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184.0,3.5,1.537110e+09,
100851,193585,6397426,479308.0,Flint (2017),Drama,184.0,3.5,1.537110e+09,
100852,193587,8391976,483455.0,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184.0,3.5,1.537110e+09,


In [9]:
## Build the User-Item Matrix
user_item_matrix = df.pivot_table(index='userId', columns='movieId', values='rating')

In [10]:
user_item_matrix

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1.0,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2.0,,,,,,,,,,,...,,,,,,,,,,
3.0,,,,,,,,,,,...,,,,,,,,,,
4.0,,,,,,,,,,,...,,,,,,,,,,
5.0,4.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606.0,2.5,,,,,,2.5,,,,...,,,,,,,,,,
607.0,4.0,,,,,,,,,,...,,,,,,,,,,
608.0,2.5,2.0,2.0,,,,,,,4.0,...,,,,,,,,,,
609.0,3.0,,,,,,,,,4.0,...,,,,,,,,,,
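
Most cells in this matrix are `NaN`, since each user rates only a small share of the catalogue. A minimal sketch of how that sparsity could be quantified, using a small made-up table (the `toy_ratings` frame and its values are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy ratings, standing in for the ratings table loaded above.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 20, 30, 40],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

# Same pivot as in the notebook: one row per user, one column per movie.
toy_matrix = toy_ratings.pivot_table(index="userId", columns="movieId", values="rating")

n_cells = toy_matrix.shape[0] * toy_matrix.shape[1]   # 3 users x 4 movies
n_rated = int(toy_matrix.notna().sum().sum())         # number of observed ratings
sparsity = 1 - n_rated / n_cells                      # share of empty cells

print(f"{n_rated} of {n_cells} cells filled, sparsity = {sparsity:.2f}")
```

The same three lines could be pointed at `user_item_matrix` to report how much of the real matrix is empty.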


In [11]:
from sklearn.metrics.pairwise import cosine_similarity

## Fill the missing values
user_item_filled = user_item_matrix.fillna(0)

# Compute similarity matrix
user_similarity = pd.DataFrame(cosine_similarity(user_item_filled), index = user_item_matrix.index, columns = user_item_matrix.index)

In [12]:
user_similarity

userId,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,...,601.0,602.0,603.0,604.0,605.0,606.0,607.0,608.0,609.0,610.0
1.0,1.000000,0.027283,0.059720,0.194395,0.129080,0.128152,0.158744,0.136968,0.064263,0.016875,...,0.080554,0.164455,0.221486,0.070669,0.153625,0.164191,0.269389,0.291097,0.093572,0.145321
2.0,0.027283,1.000000,0.000000,0.003726,0.016614,0.025333,0.027585,0.027257,0.000000,0.067445,...,0.202671,0.016866,0.011997,0.000000,0.000000,0.028429,0.012948,0.046211,0.027565,0.102427
3.0,0.059720,0.000000,1.000000,0.002251,0.005020,0.003936,0.000000,0.004941,0.000000,0.000000,...,0.005048,0.004892,0.024992,0.000000,0.010694,0.012993,0.019247,0.021128,0.000000,0.032119
4.0,0.194395,0.003726,0.002251,1.000000,0.128659,0.088491,0.115120,0.062969,0.011361,0.031163,...,0.085938,0.128273,0.307973,0.052985,0.084584,0.200395,0.131746,0.149858,0.032198,0.107683
5.0,0.129080,0.016614,0.005020,0.128659,1.000000,0.300349,0.108342,0.429075,0.000000,0.030611,...,0.068048,0.418747,0.110148,0.258773,0.148758,0.106435,0.152866,0.135535,0.261232,0.060792
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606.0,0.164191,0.028429,0.012993,0.200395,0.106435,0.102123,0.200035,0.099388,0.075898,0.088963,...,0.178084,0.116534,0.300669,0.066032,0.148141,1.000000,0.153063,0.262558,0.069622,0.201104
607.0,0.269389,0.012948,0.019247,0.131746,0.152866,0.162182,0.186114,0.185142,0.011844,0.010451,...,0.092525,0.199910,0.203540,0.137834,0.118780,0.153063,1.000000,0.283081,0.149190,0.139114
608.0,0.291097,0.046211,0.021128,0.149858,0.135535,0.178809,0.323541,0.187233,0.100435,0.077424,...,0.158355,0.197514,0.232771,0.155306,0.178142,0.262558,0.283081,1.000000,0.121993,0.322055
609.0,0.093572,0.027565,0.000000,0.032198,0.261232,0.214234,0.090840,0.423993,0.000000,0.021766,...,0.035653,0.335231,0.061941,0.236601,0.097610,0.069622,0.149190,0.121993,1.000000,0.053225
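
Cosine similarity on a zero-filled matrix treats "not rated" as a rating of 0, so two users score as similar mainly when they rated the same movies. A small sketch with made-up vectors (the arrays and the `cosine` helper are illustrative assumptions) computes the measure directly from its definition:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical zero-filled rating rows over five movies (0 = not rated).
user_a = np.array([5.0, 4.0, 0.0, 0.0, 0.0])
user_b = np.array([5.0, 4.0, 0.0, 0.0, 0.0])   # same movies, same ratings
user_c = np.array([0.0, 0.0, 5.0, 4.0, 0.0])   # no movies in common with user_a
user_d = np.array([1.0, 5.0, 0.0, 0.0, 0.0])   # same movies, different ratings

print(cosine(user_a, user_b))   # identical rows
print(cosine(user_a, user_c))   # no overlap
print(cosine(user_a, user_d))   # overlap with disagreement
```

Users with no movies in common get a similarity of exactly 0, which matches the zeros visible in the `user_similarity` table.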


In [13]:
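## Recommended Movies
def rec_movies(u_id, u_item_matrix, u_similarity):
    ## Similarity of every other user to u_id, normalized to sum to 1
    sim_scores = u_similarity.loc[u_id].drop(u_id)
    sim_scores = sim_scores / sim_scores.sum()

    ## Ratings of the other users, weighted by similarity
    ## (use the matrix passed in, not the global user_item_matrix)
    sim_user_rate = u_item_matrix.loc[sim_scores.index]
    weighted_ratings = sim_user_rate.mul(sim_scores, axis = 0)

    ## sum() skips NaN (unrated) entries, so movies rated by few users score low;
    ## the weights already sum to 1, so no further division is needed
    rec_scores = weighted_ratings.sum(axis = 0)

    ## Filter out movies the user has already rated
    seen_movies = u_item_matrix.loc[u_id].dropna().index
    recommendations = rec_scores.drop(seen_movies)
    return recommendations.sort_values(ascending = False)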

In [69]:
## Test using userId = 200
user_id = 200
recommendations = rec_movies(user_id, user_item_matrix, user_similarity)

## Display the prediction in DataFrame
rec_ids = recommendations.index.tolist()
rec_df = pd.DataFrame({
    'movieId': rec_ids,
    'predicted_rating': recommendations.values
})
unique_titles = df[['movieId', 'title']].drop_duplicates(subset='movieId')
rec_df = rec_df.merge(unique_titles, on='movieId', how='inner')

## Top 10 Movies
rec_df.head(10)

Unnamed: 0,movieId,predicted_rating,title
0,593,2.241383,"Silence of the Lambs, The (1991)"
1,589,1.878633,Terminator 2: Judgment Day (1991)
2,527,1.823776,Schindler's List (1993)
3,50,1.800264,"Usual Suspects, The (1995)"
4,2028,1.765395,Saving Private Ryan (1998)
5,858,1.7316,"Godfather, The (1972)"
6,2762,1.652478,"Sixth Sense, The (1999)"
7,364,1.535623,"Lion King, The (1994)"
8,608,1.516728,Fargo (1996)
9,457,1.481389,"Fugitive, The (1993)"
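
The `predicted_rating` values above stay below about 2.3 because neighbours who did not rate a movie add nothing to its weighted sum while their similarity still counts toward the total weight. One common alternative, sketched here on toy data, divides by the similarity of only those neighbours who rated the movie. The names `toy_matrix`, `toy_sim`, and `predict_normalized` are hypothetical and the numbers are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical 3-user x 3-movie matrix; NaN marks an unrated movie.
toy_matrix = pd.DataFrame(
    [[5.0, np.nan, np.nan],
     [4.0, 2.0, np.nan],
     [np.nan, 4.0, 5.0]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)

# Hypothetical similarities between user 1 and the other two users.
toy_sim = pd.Series({2: 0.8, 3: 0.2})

def predict_normalized(u_id, matrix, sim_to_others):
    """Similarity-weighted mean over the neighbours who actually rated each movie."""
    neighbours = matrix.loc[sim_to_others.index]
    weighted_sum = neighbours.mul(sim_to_others, axis=0).sum(axis=0)  # NaN is skipped
    # Total similarity of the neighbours that have a rating for each movie.
    rated = neighbours.notna().astype(float)
    weight_sum = rated.mul(sim_to_others, axis=0).sum(axis=0)
    scores = (weighted_sum / weight_sum).dropna()
    # Keep only movies the target user has not rated.
    unseen = matrix.loc[u_id].isna().reindex(scores.index)
    return scores[unseen].sort_values(ascending=False)

print(predict_normalized(1, toy_matrix, toy_sim))
```

In this toy case movie 30 gets the top score from a single weakly similar neighbour, so a minimum number of raters per movie is often required alongside this normalization.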


## Evaluate with Precision@K

In [51]:
def precision_k(rec, actual, k):
    rec_k = rec[:k]
    relevant = set(rec_k).intersection(set(actual))
    return len(relevant) / k

## Test using userId = 200
actual_movies = df[(df['userId'] == 200) & (df['rating'] >= 4)]['movieId'].tolist()
recommended_movies = recommendations.index.tolist()
    
## Precision over the top 10 recommendations
precision = precision_k(recommended_movies, actual_movies, k = 10)
print(f'Precision@10: {precision:.2f}')

Precision@10: 0.40
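
As written, `rec_movies` removes every movie the user has already rated, so a list built from the full matrix should not overlap with `actual_movies`. Precision@K is therefore usually measured against ratings that were held out before the matrix was built. A minimal sketch of that protocol on hypothetical toy data (the split, the mean-rating stand-in recommender, and all names below are illustrative assumptions, not the notebook's method):

```python
import pandas as pd

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that appear in the relevant set."""
    top_k = list(recommended)[:k]
    return len(set(top_k) & set(relevant)) / k

# Hypothetical ratings for one target user (userId 1) and two other users.
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 40, 10, 30, 50, 30, 40, 50],
    "rating":  [5.0, 4.0, 4.5, 2.0, 4.0, 5.0, 3.0, 4.5, 4.0, 1.0],
})

# Hold out the target user's ratings for movies 30 and 40 as the test set.
is_test = (ratings["userId"] == 1) & (ratings["movieId"].isin([30, 40]))
train, test = ratings[~is_test], ratings[is_test]

# Stand-in recommender: rank movies unseen in train by mean training rating.
seen = set(train.loc[train["userId"] == 1, "movieId"])
mean_rating = train.groupby("movieId")["rating"].mean()
recommended = mean_rating.drop(index=list(seen)).sort_values(ascending=False).index.tolist()

# Relevant items: held-out movies the user rated 4 or higher.
relevant = test.loc[test["rating"] >= 4, "movieId"].tolist()

print(recommended, relevant, precision_at_k(recommended, relevant, k=2))
```

With a split like this, the recommender never sees the held-out ratings, so a hit in the top K reflects a genuine prediction.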


## Item-Based Collaborative Filtering

In [54]:
item_similarity = cosine_similarity(user_item_matrix.T.fillna(0))
item_similarity_df = pd.DataFrame(item_similarity, index = user_item_matrix.columns, columns = user_item_matrix.columns)
item_similarity_df

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,1.000000,0.410562,0.296917,0.035573,0.308762,0.376316,0.277491,0.131629,0.232586,0.395573,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.410562,1.000000,0.282438,0.106415,0.287795,0.297009,0.228576,0.172498,0.044835,0.417693,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.296917,0.282438,1.000000,0.092406,0.417802,0.284257,0.402831,0.313434,0.304840,0.242954,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.035573,0.106415,0.092406,1.000000,0.188376,0.089685,0.275035,0.158022,0.000000,0.095598,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.308762,0.287795,0.417802,0.188376,1.000000,0.298969,0.474002,0.283523,0.335058,0.218061,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
193583,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
193585,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
193587,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0


In [60]:
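## Similar Movies
def rec_similar_items(item_id, top_n = 10):
    ## Drop the movie itself explicitly: with tied scores it is not guaranteed to sort first
    similar_items = item_similarity_df[item_id].drop(item_id).sort_values(ascending=False).head(top_n)
    unique_movies = df[['movieId', 'title']].drop_duplicates(subset='movieId')
    ## Note: rows come back in movieId order, not in order of similarity
    return unique_movies[unique_movies['movieId'].isin(similar_items.index)]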

## Test using movieId = 50
rec_similar_items(50)

Unnamed: 0,movieId,title
2107,47,Seven (a.k.a. Se7en) (1995)
7860,296,Pulp Fiction (1994)
8652,318,"Shawshank Redemption, The (1994)"
10019,356,Forrest Gump (1994)
16228,593,"Silence of the Lambs, The (1991)"
16952,608,Fargo (1996)
19831,858,"Godfather, The (1972)"
23148,1089,Reservoir Dogs (1992)
26099,1213,Goodfellas (1990)
49837,2959,Fight Club (1999)
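
The block of 1.0 values at the bottom of `item_similarity_df` appears to come from movies rated by a single user, for which cosine similarity is trivially 1. A sketch of one way to guard against that and to return neighbours in similarity order, on a made-up matrix (`toy`, `similar_items`, and the `min_ratings` threshold are hypothetical, and the similarity is computed with plain NumPy rather than scikit-learn):

```python
import numpy as np
import pandas as pd

# Hypothetical zero-filled user-item matrix (rows: users, columns: movieId).
toy = pd.DataFrame(
    [[5, 4, 0, 0],
     [4, 5, 1, 0],
     [0, 3, 4, 0],
     [0, 0, 0, 3]],
    columns=pd.Index([10, 20, 30, 40], name="movieId"),
    dtype=float,
)

# Item-item cosine similarity on the column vectors.
cols = toy.to_numpy().T
norms = np.linalg.norm(cols, axis=1, keepdims=True)
sim = pd.DataFrame((cols @ cols.T) / (norms @ norms.T),
                   index=toy.columns, columns=toy.columns)

def similar_items(item_id, top_n=2, min_ratings=2):
    """Top-n neighbours in descending similarity, ignoring rarely rated movies."""
    n_ratings = (toy > 0).sum(axis=0)
    candidates = n_ratings[n_ratings >= min_ratings].index.drop(item_id, errors="ignore")
    return sim.loc[item_id, candidates].sort_values(ascending=False).head(top_n)

print(similar_items(10))
```

Returning the similarity scores as a Series keeps the ranking, which is lost when the result is filtered with `isin` and comes back in `movieId` order.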


## Matrix Factorization (SVD)

In [65]:
from scipy.sparse.linalg import svds

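## Missing ratings are filled with 0, so the factorization treats "unrated" as a rating of 0
R = user_item_matrix.fillna(0).values
## Truncated SVD keeping the 20 largest singular values
U, sigma, Vt = svds(R, k=20)
sigma = np.diag(sigma)

## Prediction: low-rank reconstruction of the rating matrix
predicted_ratings = U @ sigma @ Vt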
predicted_df = pd.DataFrame(predicted_ratings, index=user_item_matrix.index, columns=user_item_matrix.columns)
predicted_df

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1.0,2.290336,1.460203,1.033507,-0.061334,-0.002275,1.243261,0.029650,0.056161,0.036220,1.442856,...,-0.008584,-0.007358,-0.009810,-0.009810,-0.008584,-0.009810,-0.008584,-0.008584,-0.008584,-0.038606
2.0,0.038570,0.015272,0.016968,0.002944,0.019201,-0.005821,-0.025436,0.000918,0.010531,-0.117149,...,0.010662,0.009139,0.012186,0.012186,0.010662,0.012186,0.010662,0.010662,0.010662,0.015610
3.0,-0.015220,0.049067,0.047202,-0.004936,-0.035349,0.052758,-0.012911,0.010422,-0.002532,-0.014094,...,0.000029,0.000025,0.000033,0.000033,0.000029,0.000033,0.000029,0.000029,0.000029,-0.002412
4.0,2.238621,0.060011,0.039384,0.066455,0.221806,0.487591,0.318594,-0.057422,0.016371,0.234273,...,0.002029,0.001739,0.002319,0.002319,0.002029,0.002319,0.002029,0.002029,0.002029,-0.007359
5.0,1.358363,0.970071,0.340939,0.121053,0.479936,0.628346,0.504583,0.136293,0.040721,1.122003,...,0.000348,0.000299,0.000398,0.000398,0.000348,0.000398,0.000348,0.000348,0.000348,0.001611
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606.0,-0.617336,0.556016,-0.374855,0.162583,-0.155438,-1.403045,2.364098,-0.205127,-0.444244,0.380738,...,-0.046865,-0.040170,-0.053560,-0.053560,-0.046865,-0.053560,-0.046865,-0.046865,-0.046865,-0.077927
607.0,2.056401,1.216670,0.593186,-0.006625,-0.020369,1.678307,0.261799,0.060570,0.025766,1.289120,...,-0.012653,-0.010845,-0.014460,-0.014460,-0.012653,-0.014460,-0.012653,-0.012653,-0.012653,-0.030033
608.0,2.369716,1.838958,1.577564,-0.131902,0.362084,3.628608,0.248347,0.278704,0.125466,3.895638,...,-0.043875,-0.037607,-0.050143,-0.050143,-0.043875,-0.050143,-0.043875,-0.043875,-0.043875,0.005026
609.0,0.809741,0.651456,0.297184,0.081167,0.334388,0.577311,0.362697,0.091491,0.067186,0.940384,...,0.000254,0.000217,0.000290,0.000290,0.000254,0.000290,0.000254,0.000254,0.000254,0.001664
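
Because missing ratings were filled with 0 before factorizing, the reconstruction is pulled toward 0 for unrated cells, which is consistent with the many near-zero values above. A common variant subtracts each user's mean rating first and adds it back afterwards. The sketch below illustrates this on a made-up matrix; it swaps the notebook's `svds` call for `np.linalg.svd` to keep the example dependency-light, and `R`, `k`, and the values are assumptions:

```python
import numpy as np

# Hypothetical ratings matrix; NaN marks an unrated movie.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 1.0, 1.0],
    [1.0, 1.0, 5.0, np.nan],
    [np.nan, 1.0, 4.0, 5.0],
])

# Centre each user's observed ratings on that user's mean. Unrated cells become 0,
# which now means "average for this user" rather than "rated zero".
user_mean = np.nanmean(R, axis=1, keepdims=True)
R_centered = np.where(np.isnan(R), 0.0, R - user_mean)

# Full SVD, then keep the k largest singular values
# (np.linalg.svd returns them in descending order).
k = 2
U, s, Vt = np.linalg.svd(R_centered, full_matrices=False)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + user_mean

print(np.round(R_hat, 2))
```

Predictions for unrated cells then start from the user's own average instead of from 0, so they land on the same scale as the observed ratings.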


In [68]:
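## Recommended Movies
def rec_svd(u_id, top_n = 10):
    ## Predicted ratings for this user from the low-rank reconstruction
    user_ratings = predicted_df.loc[u_id]
    ## Exclude movies the user has already rated, then keep the top_n scores
    seen_movies = user_item_matrix.loc[u_id].dropna().index
    recommendations = user_ratings.drop(seen_movies).sort_values(ascending = False).head(top_n)
    unique_movies = df[['movieId', 'title']].drop_duplicates(subset='movieId')
    ## Note: rows come back in movieId order, not in order of predicted rating
    return unique_movies[unique_movies['movieId'].isin(recommendations.index)]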

## Test using userId = 200
rec_svd(200)

Unnamed: 0,movieId,title
10522,364,"Lion King, The (1994)"
24008,1136,Monty Python and the Holy Grail (1975)
28313,1265,Groundhog Day (1993)
46858,2706,American Pie (1999)
47124,2716,Ghostbusters (a.k.a. Ghost Busters) (1984)
47683,2762,"Sixth Sense, The (1999)"
51712,3114,Toy Story 2 (1999)
58981,3996,"Crouching Tiger, Hidden Dragon (Wo hu cang lon..."
65955,5218,Ice Age (2002)
83525,48516,"Departed, The (2006)"
