# Movie Recommendation System
## Content-Based Filtering and Collaborative Filtering

This project builds a movie recommendation system using the MovieLens dataset.  
It includes two techniques:

1. Content-Based Filtering  
2. Collaborative Filtering  

The goal is to recommend movies to users based on similarity in content or user behavior.


## Importing Required Libraries


In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


## Loading the Dataset


In [2]:
movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")

print("Movies shape:", movies.shape)
print("Ratings shape:", ratings.shape)


Movies shape: (9742, 3)
Ratings shape: (100836, 4)


## Data Preprocessing

Keep only required columns and clean the genres column by replacing '|' with spaces.
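
To make the cleaning step concrete, here is a minimal sketch on a hypothetical two-row Series (the values are made up for illustration, not read from `movies.csv`). It assumes the same `fillna('')` plus literal replace used in this notebook; `regex=False` matters because `|` would otherwise be read as a regular-expression alternation rather than a literal character.

```python
import pandas as pd

# Hypothetical genre strings; None stands in for a missing value
raw_genres = pd.Series(["Adventure|Comedy", None])

# Fill missing values, then replace the literal pipe with a space
clean_genres = raw_genres.fillna("").str.replace("|", " ", regex=False)

print(clean_genres.tolist())
```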


In [3]:
movies = movies[['movieId','title','genres']].drop_duplicates().reset_index(drop=True)
movies['genres'] = movies['genres'].fillna('').str.replace('|',' ', regex=False)
movies.head()


| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure Animation Children Comedy Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure Children Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy Drama Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


## Content-Based Filtering

We convert the `genres` text into TF-IDF vectors and compute cosine similarity between movies.


In [4]:
tfidf = TfidfVectorizer(stop_words='english', max_features=2000)
tfidf_matrix = tfidf.fit_transform(movies['genres'])
content_similarity = cosine_similarity(tfidf_matrix, tfidf_matrix)
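# Map title -> row position. Series.drop_duplicates() would compare the values
# (row positions), not the titles, so de-duplicate on the index and keep the first match.
title_to_idx = pd.Series(movies.index, index=movies['title'])
title_to_idx = title_to_idx[~title_to_idx.index.duplicated(keep='first')]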


### Recommendation Function for Content-Based Filtering


In [5]:
def recommend_content(title, top_n=10):
    """
    Return top_n movies similar to the given title using genres-based TF-IDF.
    """
    if title not in title_to_idx:
        return pd.DataFrame()  # title not found
    idx = title_to_idx[title]
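    sim_scores = list(enumerate(content_similarity[idx]))
    # Drop the query movie explicitly: with tied scores it is not guaranteed to sort first
    sim_scores = [(i, s) for i, s in sim_scores if i != idx]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:top_n]
    indices = [i for i, _ in sim_scores]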
    return movies.iloc[indices][['movieId','title','genres']]


### Test Content-Based Recommender


In [6]:
# Example: change title if not present in your dataset
display(recommend_content("Toy Story (1995)", top_n=8))


| | movieId | title | genres |
|---|---|---|---|
| 1706 | 2294 | Antz (1998) | Adventure Animation Children Comedy Fantasy |
| 2355 | 3114 | Toy Story 2 (1999) | Adventure Animation Children Comedy Fantasy |
| 2809 | 3754 | Adventures of Rocky and Bullwinkle, The (2000) | Adventure Animation Children Comedy Fantasy |
| 3000 | 4016 | Emperor's New Groove, The (2000) | Adventure Animation Children Comedy Fantasy |
| 3568 | 4886 | Monsters, Inc. (2001) | Adventure Animation Children Comedy Fantasy |
| 6194 | 45074 | Wild, The (2006) | Adventure Animation Children Comedy Fantasy |
| 6486 | 53121 | Shrek the Third (2007) | Adventure Animation Children Comedy Fantasy |
| 6948 | 65577 | Tale of Despereaux, The (2008) | Adventure Animation Children Comedy Fantasy |


## Collaborative Filtering

Build a user–item matrix from ratings, compute user–user cosine similarity, then recommend movies liked by similar users.
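
The pipeline can be sketched on a hypothetical three-user ratings table (all values invented). Cosine similarity is computed by hand here, as the dot product of L2-normalised rows, to show what `cosine_similarity` is doing; note the assumption that unrated cells are filled with 0, which treats "not rated" the same as a rating of zero.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for three users and three movies
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 5.0, 4.0, 5.0, 3.0],
})

# Rows are users, columns are movies, missing ratings become 0
toy_matrix = toy_ratings.pivot_table(index="userId", columns="movieId", values="rating").fillna(0)

# Cosine similarity: dot product of L2-normalised rows
values = toy_matrix.to_numpy()
unit_rows = values / np.linalg.norm(values, axis=1, keepdims=True)
toy_user_sim = pd.DataFrame(unit_rows @ unit_rows.T,
                            index=toy_matrix.index, columns=toy_matrix.index)
print(toy_user_sim.round(2))
```

Users 1 and 2 rated the same movies identically, so their similarity is 1; user 3 shares no rated movie with them, so the similarity is 0.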


In [7]:
# Create user-item rating matrix (rows: userId, columns: movieId)
user_item = ratings.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)

# Align columns with movies['movieId']; movies nobody rated are added as all-zero columns
user_item = user_item.reindex(columns=movies['movieId'], fill_value=0)


### Calculating User Similarity


In [8]:
user_similarity = pd.DataFrame(
    cosine_similarity(user_item),
    index=user_item.index,
    columns=user_item.index
)


### Recommendation Function for Collaborative Filtering


In [10]:
def collaborative_scores_for_user(user_id):
    """
    Compute normalized weighted score for each movie user hasn't rated.
    Return dict {movieId: score}
    """
    if user_id not in user_similarity.index:
        return {}
    sims = user_similarity.loc[user_id]
    rated = set(ratings[ratings.userId == user_id]['movieId'].unique())
    scores = {}
    sim_sums = {}
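    for other_user, sim in sims.items():
        # Skip the target user and any non-positive similarities
        if other_user == user_id or sim <= 0:
            continue
        other_ratings = ratings[ratings.userId == other_user]
        # itertuples keeps movieId as an integer and is faster than iterrows
        for row in other_ratings.itertuples(index=False):
            mid = row.movieId
            if mid in rated:
                continue
            scores[mid] = scores.get(mid, 0.0) + sim * row.rating
            sim_sums[mid] = sim_sums.get(mid, 0.0) + sim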
    # normalize
    for mid in list(scores.keys()):
        if sim_sums.get(mid, 0) > 0:
            scores[mid] = scores[mid] / sim_sums[mid]
    return scores

def recommend_cf(user_id, top_n=10):
    scores = collaborative_scores_for_user(user_id)
    if not scores:
        # fallback: popular movies by rating count
        pop = ratings.groupby('movieId').size().sort_values(ascending=False).index.tolist()
        known = set(movies['movieId'])  # build the lookup set once, not per element
        top = [m for m in pop if m in known][:top_n]
        return movies[movies['movieId'].isin(top)][['movieId','title','genres']]
    top = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_n]
    mids = [m for m,_ in top]
    # Note: isin() returns rows in catalog order, so this is the top_n set, not a ranked list
    return movies[movies['movieId'].isin(mids)][['movieId','title','genres']]
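
The score computed above is a similarity-weighted average of other users' ratings. As a minimal sketch, the same arithmetic can be written with NumPy on a hypothetical 3x3 rating matrix; the matrix `R` and the similarity vector `sim_to_target` are invented for illustration and assume that 0 means "not rated".

```python
import numpy as np

# Hypothetical data: rows are users, columns are movies, 0 means "not rated"
R = np.array([
    [5.0, 0.0, 0.0],   # target user (row 0)
    [4.0, 2.0, 0.0],
    [5.0, 4.0, 3.0],
])
sim_to_target = np.array([1.0, 0.5, 0.25])  # assumed similarities to user 0

target = 0
weights = np.where(sim_to_target > 0, sim_to_target, 0.0)
weights[target] = 0.0                      # ignore the target user's own row

rated_mask = (R > 0).astype(float)
weighted_sum = weights @ R                 # sum of sim * rating, per movie
weight_total = weights @ rated_mask        # sum of sim over users who rated the movie

# Weighted average where at least one similar user rated the movie
scores = np.divide(weighted_sum, weight_total,
                   out=np.zeros_like(weighted_sum), where=weight_total > 0)
scores[R[target] > 0] = np.nan             # movies the target already rated get no score
print(scores)
```

In this toy case the third movie outscores the second even though only one weakly similar user rated it. Because the average is normalised by the similarity sum, the number of raters does not affect the score; a minimum-rater threshold is one possible way to dampen that effect.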


### Test Collaborative Filtering


In [14]:
display(recommend_cf(1, top_n=10))


| | movieId | title | genres |
|---|---|---|---|
| 870 | 1151 | Lesson Faust (1994) | Animation Comedy Drama Fantasy |
| 1228 | 1631 | Assignment, The (1997) | Action Thriller |
| 1540 | 2075 | Mephisto (1981) | Drama War |
| 2880 | 3851 | I'm the One That I Want (2000) | Comedy |
| 4045 | 5746 | Galaxy of Terror (Quest) (1981) | Action Horror Mystery Sci-Fi |
| 4390 | 6442 | Belle époque (1992) | Comedy Romance |
| 4595 | 6835 | Alien Contamination (1980) | Action Horror Sci-Fi |
| 5906 | 33649 | Saving Face (2004) | Comedy Drama Romance |
| 8828 | 131724 | The Jinx: The Life and Deaths of Robert Durst ... | Documentary |
| 8946 | 136556 | Kung Fu Panda: Secrets of the Masters (2011) | Animation Children |


## Small Evaluation Examples (approximate)

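As a sanity check, we compute a rough hit-or-miss score for a single user: 1.0 if the user's last rated movie (in file order) appears in the top K recommendations, otherwise 0.0. Strictly speaking this is a hit rate for one held item, not a true Precision@K.
Note that the test movie is not held out of `ratings`, and `recommend_cf` never recommends movies the user has already rated, so whenever the CF scores are non-empty this check returns 0.0 by construction. For rigorous results, use proper train/test splits and cross-validation.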


In [13]:
def precision_at_k_approx(recommend_fn, user_id, k=10):
    user_r = ratings[ratings.userId == user_id]
    if len(user_r) < 2:
        return None
    # "Test" item: the user's last rating in file order (it is NOT removed from `ratings`)
    test_item = user_r.tail(1)['movieId'].iloc[0]
    recs = recommend_fn(user_id, top_n=k)
    rec_ids = recs['movieId'].tolist()
    return 1.0 if test_item in rec_ids else 0.0

print("Precision@10 (CF) for user 1 (approx):", precision_at_k_approx(recommend_cf, 1, 10))


Precision@10 (CF) for user 1 (approx): 0.0
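
For comparison, a leave-one-out version of this check might look like the sketch below. It holds out the user's last rating before recommending, so a hit is actually possible. The names `hit_at_k_holdout` and `popular_unseen` are hypothetical helpers, and the popularity recommender is only a stand-in; the toy ratings are invented.

```python
import pandas as pd

def hit_at_k_holdout(all_ratings, user_id, recommend_from, k=10):
    """Hold out the user's last rating, recommend from the rest, return 1.0 on a hit."""
    user_rows = all_ratings[all_ratings["userId"] == user_id]
    if len(user_rows) < 2:
        return None
    held_out = user_rows.index[-1]                 # last rating in file order
    test_item = all_ratings.loc[held_out, "movieId"]
    train = all_ratings.drop(index=held_out)       # recommender never sees the test rating
    rec_ids = recommend_from(train, user_id, k)
    return 1.0 if test_item in rec_ids else 0.0

def popular_unseen(train, user_id, k):
    """Stand-in recommender: most-rated movies the user has not rated in train."""
    seen = set(train.loc[train["userId"] == user_id, "movieId"])
    counts = train.groupby("movieId").size().sort_values(ascending=False)
    return [m for m in counts.index if m not in seen][:k]

toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 20, 30],
    "rating":  [4.0, 5.0, 3.0, 4.0, 5.0, 2.0],
})
result = hit_at_k_holdout(toy, 1, popular_unseen, k=1)
print(result)
```

To apply the same idea to the collaborative recommender in this notebook, the user-item matrix and `user_similarity` would need to be rebuilt from the training split inside `recommend_from`, which is not shown here.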


## Conclusion

We implemented two recommenders:
- Content-Based Filtering (TF-IDF on genres + cosine similarity)
- Collaborative Filtering (user–user CF using cosine similarity)

Both methods rely only on pandas, NumPy, and scikit-learn, and are intended for demonstration (for example, as an internship or portfolio project) rather than production use.


## Project Notes / Resume Bullet

Resume bullet:
Built a Movie Recommendation System using MovieLens: content-based filtering (TF-IDF + cosine similarity on genres) and collaborative filtering (user–user CF using cosine similarity). Implemented in Python and demonstrated results in a Jupyter Notebook.

Notes:
- Place `movies.csv` and `ratings.csv` in the same folder as this notebook.
- For larger datasets or production use, consider matrix factorization (for example, SVD) or approximate nearest-neighbor search for better performance.
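
As a rough pointer for the matrix factorization note, the sketch below builds a rank-`k` approximation of a hypothetical user-item matrix with NumPy's SVD. The matrix and the choice `k = 2` are invented for illustration. Plain SVD treats the zeros as real ratings, which is a simplification; dedicated recommender libraries typically fit only the observed entries.

```python
import numpy as np

# Hypothetical 4-user x 3-movie matrix; 0 means "not rated"
R = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 4.0],
])

k = 2  # number of latent factors (arbitrary choice for this toy case)
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep the k largest singular values to get a low-rank approximation of R
R_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]
print(R_hat.round(2))
```

Entries of `R_hat` at positions where `R` is 0 can be read as predicted affinities, subject to the caveat above.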
