Building a Content-Based Movie & Web Series Recommender: A Journey Through Data


🎬 Project Overview

My goal was to create an intelligent recommendation system that suggests similar movies and web series based on content characteristics. Unlike collaborative filtering, which relies on user behavior, this content-based approach analyzes the attributes of each title (genres, plot, cast, and more) to find semantically similar content. The challenge? Unifying messy movie datasets with custom web series data into a cohesive recommendation engine.

📊 Data Acquisition & Initial Observations

I started with four distinct datasets:

1. Movie Metadata (movies_metadata.csv)
2. Movie Credits (credits.csv)
3. Movie Keywords (keywords.csv)
4. Web Series (custom web_series.csv)

![My Plot](image.svg)

Data Pre-processing: The Heavy Lifting

Key Utility Functions Created

import ast
import re

# Parse a stringified list of dictionaries into a clean, lowercase string
def parse_names(text):
    names = [item['name'] for item in ast.literal_eval(text)]
    return ' '.join(names).lower()

# Extract the director's name from the stringified crew list
def extract_director(crew_data):
    for member in ast.literal_eval(crew_data):
        if member['job'] == 'Director':
            return member['name'].lower()
    return ''

# Universal text cleaner: strips punctuation and underscores, collapses whitespace
def clean_text_simple(text):
    # \w matches "_", so remove it explicitly ("web_series" -> "webseries")
    text = re.sub(r'[^\w\s]|_', '', str(text)).lower().strip()
    return re.sub(r'\s+', ' ', text)

Processing Pipeline

For Movies:

1. Merged the three datasets using id as the key
2. Parsed genres, cast, and keywords
3. Extracted directors from crew
4. Added placeholder columns:
    - platform: empty string
    - seasons: 0
    - content_type: "movie"
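
Here is a minimal sketch of the movie steps above, assuming pandas. The tiny in-memory frames stand in for the three CSV files, `merge_movie_tables` is a hypothetical helper name, and coercing malformed ids with `errors="coerce"` is my assumption about the cleanup, not something the pipeline above spells out.

```python
import pandas as pd

def merge_movie_tables(metadata, credits, keywords):
    """Join the three movie tables on `id` and add the placeholder columns."""
    frames = []
    for frame in (metadata, credits, keywords):
        frame = frame.copy()
        # Assumption: ids may arrive as strings or contain junk, so coerce them.
        frame["id"] = pd.to_numeric(frame["id"], errors="coerce")
        frame = frame.dropna(subset=["id"]).astype({"id": "int64"})
        frames.append(frame.drop_duplicates(subset="id"))

    movies = frames[0].merge(frames[1], on="id").merge(frames[2], on="id")
    # Placeholder columns so movies share a schema with web series.
    movies["platform"] = ""
    movies["seasons"] = 0
    movies["content_type"] = "movie"
    return movies

# Toy stand-ins for movies_metadata.csv, credits.csv and keywords.csv.
metadata = pd.DataFrame({"id": ["1", "2", "bad"], "title": ["A", "B", "C"]})
credits = pd.DataFrame({"id": [1, 2], "cast": ["[]", "[]"]})
keywords = pd.DataFrame({"id": [1, 2], "keywords": ["[]", "[]"]})
movies = merge_movie_tables(metadata, credits, keywords)
```

The default inner join keeps only titles present in all three tables, which is the simplest choice; a left join from the metadata table would keep movies that lack credits or keywords.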

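For Web Series (Critical Fix!):

The text cleaner strips underscores, so the content type label had to change to match the cleaned format: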

- content_type = "web_series"  
+ content_type = "webseries"  # To match cleaned text format

Combined Dataset:

1. Concatenated movies and series
2. Scaled popularity scores to a 0-100 range
3. Removed duplicates
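
A rough sketch of those three steps, assuming pandas. Min-max scaling and the duplicate key (title plus content type) are my assumptions here, since the steps above only name the 0-100 target range; `combine_titles` is a hypothetical helper.

```python
import pandas as pd

def combine_titles(movies, series):
    """Stack movies and web series, rescale popularity to 0-100, dedupe."""
    combined = pd.concat([movies, series], ignore_index=True)

    # Min-max scaling; a constant column maps to 0 to avoid dividing by zero.
    pop = combined["popularity"].astype(float)
    spread = pop.max() - pop.min()
    combined["popularity"] = 0.0 if spread == 0 else (pop - pop.min()) / spread * 100

    # Assumption: a duplicate is the same title with the same content type.
    deduped = combined.drop_duplicates(subset=["title", "content_type"])
    return deduped.reset_index(drop=True)

# Toy data: one repeated movie and one web series.
movies = pd.DataFrame({
    "title": ["A", "B", "A"],
    "content_type": ["movie", "movie", "movie"],
    "popularity": [10, 30, 10],
})
series = pd.DataFrame({
    "title": ["S"], "content_type": ["webseries"], "popularity": [50],
})
catalog = combine_titles(movies, series)
```

Scaling before deduplication means the range is computed over every row, duplicates included; since exact duplicates share a popularity value, the order does not change the result here.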
Final schema:

Feature Engineering for Semantic Understanding

Created a "content signature" for each title by concatenating:

[title] + 
[genres] + 
[keywords] + 
[top cast members] + 
[director] + 
[overview] + 
"released in {year}" + 
[content_type] + 
[platform]
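
As a sketch, the concatenation above could look like this in plain Python. `build_signature` is a hypothetical name, each title is assumed to be a dict-like row, and `max_cast=3` is my guess at what "top cast members" means.

```python
def build_signature(row, max_cast=3):
    """Concatenate a title's fields into one text 'content signature'."""
    parts = [
        row.get("title", ""),
        row.get("genres", ""),
        row.get("keywords", ""),
        " ".join(row.get("cast", [])[:max_cast]),
        row.get("director", ""),
        row.get("overview", ""),
        f"released in {row['year']}" if row.get("year") else "",
        row.get("content_type", ""),
        row.get("platform", ""),
    ]
    # Drop empty fields (movies have no platform) and normalize spacing.
    return " ".join(" ".join(str(p).split()) for p in parts if p).lower()
```

Skipping empty fields keeps movie signatures from ending in stray whitespace where the platform would go.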

Example Signature:
"inception action thriller dream heist leonardo dicaprio joseph gordon levitt christopher nolan a thief who steals corporate secrets... released in 2010 movie"

Embedding Generation

I used SentenceTransformers to convert the text signatures into numerical vectors:

![My Plot](image2.svg)
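
For readers who want to try this step, here is a minimal sketch using the `sentence-transformers` package. The model name `all-MiniLM-L6-v2` and the batch size are assumptions for illustration (this post does not say which model was used), and `embed_signatures` is a hypothetical helper.

```python
import numpy as np

def embed_signatures(signatures, model_name="all-MiniLM-L6-v2", batch_size=64):
    """Encode text signatures into L2-normalized float32 vectors."""
    # Imported lazily so this module loads even where the library is absent.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # downloads weights on first use
    vectors = model.encode(
        list(signatures),
        batch_size=batch_size,
        show_progress_bar=False,
        normalize_embeddings=True,  # unit length, so dot product = cosine
    )
    return np.asarray(vectors, dtype=np.float32)
```

Calling `embed_signatures(df["signature"].tolist())` would return one row per title, assuming the signatures live in a column with that name.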

🎯 Recommendation Engine Logic


Core Algorithm Workflow:

![My Plot](image3.svg)
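
The diagram carries the details. As a rough sketch of the retrieval step only, assuming cosine similarity over the embeddings (an assumption on my part for this illustration) and a hypothetical `top_k_similar` helper:

```python
import numpy as np

def top_k_similar(embeddings, query_index, k=5):
    """Return (index, cosine similarity) pairs for the k nearest titles."""
    vectors = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors

    scores = unit @ unit[query_index]
    scores[query_index] = -np.inf  # never recommend the query itself
    best = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in best]
```

The returned similarities would then serve as the base `score` that the boosts adjust.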

Smart Boosting Strategies:

- Genre/Keyword Overlap Boost: score *= 1 + (common_features_count / total_features)
- Content Type Match: score *= 1.2 if same type (movie↔movie or series↔series)
- Popularity/Rating Lift: score *= 1 + (popularity_score / 1000)
- Season Boost (Series Only): score *= 1 + (0.1 * seasons) for series-to-series recs
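
Put together, the four boosts could be applied like this. It is a sketch under assumptions: `boosted_score` is a hypothetical helper, titles are plain dicts, and I read `total_features` as the number of distinct genres and keywords across both titles.

```python
def boosted_score(base, query, candidate):
    """Apply the four multiplicative boosts to a base similarity score."""
    score = base

    # Genre/keyword overlap: shared features over all distinct features.
    q_feats = set(query["genres"]) | set(query["keywords"])
    c_feats = set(candidate["genres"]) | set(candidate["keywords"])
    total = len(q_feats | c_feats)
    if total:
        score *= 1 + len(q_feats & c_feats) / total

    same_type = query["content_type"] == candidate["content_type"]
    if same_type:
        score *= 1.2

    # Popularity is on the 0-100 scale, so this adds at most 10%.
    score *= 1 + candidate["popularity"] / 1000

    # Season boost only for series-to-series recommendations.
    if same_type and candidate["content_type"] == "webseries":
        score *= 1 + 0.1 * candidate["seasons"]
    return score
```

Because the boosts multiply, they compound: a long-running series that matches on type and shares most features can far outscore a closer semantic match, which is one reason the weights need tuning.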


Critical Filter Fix:

# Before (buggy):
filtered = df[df['content_type'] == filter_type]

# After (fixed):
cleaned_filter = clean_text_simple(filter_type)  # "web_series" → "webseries"
filtered = df[df['content_type'] == cleaned_filter]

💡 Key Learnings & Reflections

1. Data Consistency is Paramount
    That web_series vs webseries inconsistency caused hours of debugging! Lesson: always standardize categorical values before cleaning.
2. Debugging is Detective Work
    My debugging toolkit:
    - Strategic print statements before and after filters
    - Sample output checks at each processing stage
    - Edge case testing (empty inputs, obscure titles)
3. Resource Constraints Matter
    Embedding optimization (float32 → float16) halved the memory used by the embeddings, which was critical for deployment.
4. Iterative Tuning is Essential
    I spent 3 cycles adjusting boost weights based on:
    - Relevance of recommendations
    - Diversity of suggestions
    - Handling of edge cases
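
The float16 saving from point 3 is easy to check with NumPy. This sketch uses random stand-in vectors, and the 1,000 x 384 shape is an assumed size, not the real catalog.

```python
import numpy as np

# Stand-in for real embeddings: 1,000 titles x 384 dimensions (assumed size).
rng = np.random.default_rng(0)
embeddings32 = rng.standard_normal((1000, 384)).astype(np.float32)
embeddings16 = embeddings32.astype(np.float16)

# Fraction of memory saved by storing half-precision values.
saved = 1 - embeddings16.nbytes / embeddings32.nbytes

def cosine(a, b):
    """Cosine similarity, computed in float32 for numerical stability."""
    a, b = a.astype(np.float32), b.astype(np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Precision check: how far one pairwise similarity moves after the cast.
drift = abs(
    cosine(embeddings32[0], embeddings32[1])
    - cosine(embeddings16[0], embeddings16[1])
)
```

Storing in float16 but computing similarities in float32 keeps the memory win while limiting rounding error in the dot products.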


Building this system felt like conducting an orchestra - each section (data, embeddings, logic) had to be precisely tuned to create harmonious recommendations. The real magic happened when semantic understanding met strategic boosting!

🎭 Recommendation Results for "Avatar" (2009 Movie)

Running "Avatar" through the content-based recommender returned these top 5 most similar titles:

| Rank | Title                    | Type  | Similarity Score | Key Overlapping Features                            |
|------|--------------------------|-------|------------------|-----------------------------------------------------|
| 1    | Avatar: The Way of Water | Movie | ⭐⭐⭐⭐⭐ (4.92)     | Same universe, visual effects, environmental themes |
| 2    | Dune (2021)              | Movie | ⭐⭐⭐⭐ (4.35)      | Epic world-building, alien planets, colonization    |
| 3    | Guardians of the Galaxy  | Movie | ⭐⭐⭐⭐ (4.18)      | Colorful aliens, adventure, groundbreaking CGI      |
| 4    | The Jungle Book (2016)   | Movie | ⭐⭐⭐⭐ (4.05)      | Nature immersion, human-wilderness connection       |
| 5    | Alita: Battle Angel      | Movie | ⭐⭐⭐ (3.88)       | Futuristic societies, motion-capture heroes         |

📊 Why These Recommendations?

![My Plot](image4.svg)