# Scenario 3: Two-Stage Recommendation Pipeline (Retrieval + Ranking)

This notebook implements a modern **Two-Stage Recommendation Pipeline** that combines semantic search with popularity based ranking.

## Concept
In large-scale systems, it's impossible to rank all items for every user. Instead, we use a two stage process:
1. **Retrieval (Candidate Generation)**: Quickly narrow down millions of items to a few hundred candidates using efficient methods like semantic search or collaborative filtering.
2. **Ranking (Reranking)**: Use a more complex model (or business logic) to rank the candidates and provide the final top recommendations.

### Implementation Details:
- **Fusion of Metadata**: We combine movie titles, genres, and user-generated tags into a single "document" for each movie.
- **Semantic Retrieval**: We use a pre trained Sentence-Transformer model to create embeddings and perform similarity search.
- **Popularity Ranking**: We re-rank candidates using the weighted popularity score from Scenario 2.

In [2]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the data
movies = pd.read_csv('data/movies.csv')
ratings = pd.read_csv('data/ratings.csv')
tags = pd.read_csv('data/tags.csv')

print(f"Loaded {len(movies)} movies, {len(ratings)} ratings, and {len(tags)} tags.")

Loaded 9742 movies, 100836 ratings, and 3683 tags.


## 1. Data Fusion (Structured + Unstructured)
We merge movie metadata with user tags to create a rich semantic description.

In [3]:
# Combine tags for each movie
movie_tags = tags.groupby('movieId')['tag'].apply(lambda x: ' '.join(x.astype(str))).reset_index()

# Merge with movies
movie_content = movies.merge(movie_tags, on='movieId', how='left')
movie_content['tag'] = movie_content['tag'].fillna('')

# Create a combined feature string
movie_content['combined_features'] = movie_content['title'] + " " + movie_content['genres'].replace('|', ' ') + " " + movie_content['tag']

print("Sample combined features:")
print(movie_content['combined_features'].iloc[0])

Sample combined features:
Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy pixar pixar fun


## 2. Stage 1: Retrieval (Semantic Search)
We use `all-MiniLM-L6-v2` to embed our movie descriptions.

In [4]:
# Initialize the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for all movies
movie_embeddings = model.encode(movie_content['combined_features'].tolist(), show_progress_bar=True)

def retrieve_candidates(query, n_candidates=50):
    query_embedding = model.encode([query])
    similarities = cosine_similarity(query_embedding, movie_embeddings).flatten()
    
    candidate_indices = similarities.argsort()[-n_candidates:][::-1]
    candidates = movie_content.iloc[candidate_indices].copy()
    candidates['retrieval_score'] = similarities[candidate_indices]
    return candidates

Batches: 100%|██████████| 305/305 [00:21<00:00, 14.09it/s]


## 3. Stage 2: Ranking (Popularity-Based)
We incorporate the popularity logic from Scenario 2 to rank our semantic candidates.

In [5]:
# Pre-calculate popularity scores
movie_stats = ratings.groupby('movieId').agg({'rating': ['mean', 'count']})
movie_stats.columns = ['avg_rating', 'rating_count']
C = movie_stats['avg_rating'].mean()
m = movie_stats['rating_count'].quantile(0.9)

def weighted_rating(v, R, m=m, C=C):
    return (v/(v+m) * R) + (m/(m+v) * C)

movie_stats['popularity_score'] = movie_stats.apply(lambda x: weighted_rating(x['rating_count'], x['avg_rating']), axis=1)

def rank_candidates(candidates, n_final=10):
    # Merge with popularity scores
    ranked = candidates.merge(movie_stats[['popularity_score']], on='movieId', how='left')
    ranked['popularity_score'] = ranked['popularity_score'].fillna(C)
    
    # Final Score = 0.7 * Retrieval Similarity + 0.3 * Normalized Popularity
    ranked['final_score'] = ranked['retrieval_score'] * 0.7 + (ranked['popularity_score'] / 5.0) * 0.3
    
    return ranked.sort_values('final_score', ascending=False).head(n_final)

## 4. Testing the Pipeline
Let's try a query like "Space travel and futuristic wars".

In [6]:
query = "Space travel and futuristic wars"
print(f"Query: {query}\n")

candidates = retrieve_candidates(query)
final_recommendations = rank_candidates(candidates)

print("Top 10 Recommendations:")
final_recommendations[['title', 'genres', 'final_score']]

Query: Space travel and futuristic wars

Top 10 Recommendations:


Unnamed: 0,title,genres,final_score
1,Interstellar (2014),Sci-Fi|IMAX,0.592011
0,War of the Worlds (2005),Action|Sci-Fi,0.574664
4,Battlestar Galactica (2003),Drama|Sci-Fi|War,0.558079
5,"Right Stuff, The (1983)",Drama,0.557627
28,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi,0.554978
2,The Space Between Us (2016),Adventure|Sci-Fi,0.55443
32,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,0.55407
10,Star Trek (2009),Action|Adventure|Sci-Fi|IMAX,0.551554
43,"Matrix, The (1999)",Action|Sci-Fi|Thriller,0.543027
6,Space Buddies (2009),Adventure|Children|Fantasy|Sci-Fi,0.542655
