# 🎬 Movie Recommendation System (Content-Based)
This notebook implements a **content-based movie recommender system** using metadata such as genres, keywords, cast, and crew. We apply **TF-IDF vectorization** and **cosine similarity** to find movies similar to a given input movie.

### Steps:
1. Load dataset
2. Clean and preprocess metadata (genres, keywords, cast, crew, overview)
3. Create combined text features
4. Convert text into vectors using **TF-IDF**
5. Compute **cosine similarity** between all movies
6. Build a recommendation function that suggests the top 5 movies most similar to a given title


## Step 1: Import libraries

I import:
- pandas for data handling,
- ast to safely parse JSON-like text fields,
- TfidfVectorizer to convert text to numeric vectors (TF-IDF),
- cosine_similarity to compute similarity between movie vectors,
- difflib.get_close_matches to support fuzzy matching of titles.

These tools let us clean the data, create meaningful text features, and compare movies mathematically.


In [36]:

import pandas as pd
import ast
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from difflib import get_close_matches


## Step 2: Load dataset and inspect

Load the two CSV files (movie metadata and credits), merge them, and take a quick look at the shape and top rows. This confirms the file paths and shows which columns exist.
If column names differ from what we expect (for example 'id' vs 'movie_id'), fix them here.
Only a handful of these columns are used for recommendations; the rest stay in the frame and are ignored.

## Step 3: Quick checks (columns & missing values)

I print column names and check for missing values or duplicates. This step prevents later errors (like KeyError) and helps decide which columns to use.
If important fields are missing, I either fill them or remove those rows.
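
A minimal sketch of these checks, run on a small made-up frame so that it stands alone. The name `toy` and its rows are illustrative assumptions, not the TMDB data:

```python
import pandas as pd

# Toy stand-in for the movies table (illustrative values only).
toy = pd.DataFrame({
    "title": ["Alpha", "Beta", "Beta", "Gamma"],
    "overview": ["A heist.", None, "A sequel.", "A road trip."],
    "genres": ["[]", "[]", "[]", "[]"],
})

# 1. Column names: confirm the fields we plan to use are present.
expected = {"title", "overview", "genres"}
missing_cols = expected - set(toy.columns)

# 2. Missing values per column.
na_counts = toy.isna().sum()

# 3. Titles that occur more than once (these matter for title-based lookups).
dup_titles = toy.loc[toy["title"].duplicated(keep=False), "title"].unique().tolist()

print("Missing columns:", missing_cols)
print(na_counts)
print("Duplicated titles:", dup_titles)

# One way to handle a missing text field: fill it with an empty string.
toy["overview"] = toy["overview"].fillna("")
```

Filling with an empty string suits free-text fields such as the overview; whether to fill or drop rows for other fields is a judgment call that depends on how many rows are affected.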

In [37]:

movies = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")

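# Merge on the shared 'title' column. Titles are not guaranteed to be unique,
# so repeated titles can add rows; the files also share an id ('id' / 'movie_id').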
movies = movies.merge(credits, on='title')

print("Dataset shape:", movies.shape)
movies.head(2)


Dataset shape: (4809, 23)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
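
Because titles are not guaranteed to be unique, a merge on `title` can pair one movie with another movie's credits and add rows. A possible alternative, sketched here on made-up frames, is to join on the identifiers instead (`id` and `movie_id`, both visible in the column listing above). The frame names and values are assumptions for illustration only:

```python
import pandas as pd

movies_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Alpha", "Beta", "Beta"],   # two different movies share a title
})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Alpha", "Beta", "Beta"],
    "cast": ["[]", "[]", "[]"],
})

# Joining on title pairs every "Beta" with every other "Beta": 1 + 2*2 = 5 rows.
by_title = movies_toy.merge(credits_toy, on="title")

# Joining on the identifier keeps exactly one row per movie.
by_id = movies_toy.merge(
    credits_toy.drop(columns="title"),   # avoid title_x / title_y columns
    left_on="id",
    right_on="movie_id",
    validate="one_to_one",               # raises if either key is not unique
)

print(len(by_title), len(by_id))
```

Switching the real notebook to an id-based merge would change the row count and the outputs shown here, so treat this as an option to evaluate rather than a drop-in replacement.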


## Step 4: Select features to use

I use 'genres', 'keywords', 'overview', 'cast' (top actors) and 'crew' (director) because:
- genres & keywords describe the movie's type and themes,
- overview captures summary/plot and useful descriptive words,
- cast and director often shape the movie's style.

Using these fields gives a strong content signal without needing user ratings.

## Step 6: Parse JSON-like fields into readable tokens

The genres/keywords/cast/crew columns are stored as strings that look like lists/dicts.
I use ast.literal_eval to parse those strings safely and extract the 'name' fields.
I also remove spaces inside tokens (e.g., "Science Fiction" -> "ScienceFiction") so multi-word phrases count as a single feature.
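
As a small illustration of the parsing step, the sketch below runs `ast.literal_eval` on a shortened sample string shaped like the raw `genres` column. The sample value is made up for the example:

```python
import ast

# A shortened sample in the same shape as the raw 'genres' column.
raw = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'

parsed = ast.literal_eval(raw)                        # a list of dicts
tokens = [d["name"].replace(" ", "") for d in parsed]
joined = " ".join(tokens)
print(joined)

# Malformed input raises instead of returning a partial result,
# which is why the notebook wraps the call in a helper.
try:
    ast.literal_eval("[{oops")
    failed = False
except (ValueError, SyntaxError):
    failed = True
print("malformed input rejected:", failed)
```

Without the space removal, a tokenizer would split "Science Fiction" into "science" and "fiction", and "science" could then match unrelated keywords.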



In [38]:

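def safe_eval(x):
    """Parse a list-like string; return an empty list if parsing fails."""
    try:
        return ast.literal_eval(x)
    except (ValueError, SyntaxError, TypeError):
        # Covers malformed strings and non-string values such as NaN.
        return []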

def convert_to_names(obj_str, top_n=None):
    """Return each item's 'name' as a space-free token, joined by spaces.

    If top_n is given, only the first top_n items are used.
    """
    items = safe_eval(obj_str)
    names = []
    for i in items[:top_n]:
        if isinstance(i, dict) and 'name' in i:
            names.append(i['name'].replace(" ", ""))
        elif isinstance(i, str):
            names.append(i.replace(" ", ""))
    return " ".join(names)

def convert_cast(obj_str, top_n=3):
    """Same as convert_to_names, limited to the first top_n cast members."""
    return convert_to_names(obj_str, top_n=top_n)

def fetch_director(obj_str):
    items = safe_eval(obj_str)
    for i in items:
        if isinstance(i, dict) and i.get('job') == 'Director':
            return i.get('name', "").replace(" ", "")
    return ""

movies['genres'] = movies['genres'].apply(convert_to_names)
movies['keywords'] = movies['keywords'].apply(convert_to_names)
movies['cast'] = movies['cast'].apply(lambda x: convert_cast(x, top_n=3))
movies['crew'] = movies['crew'].apply(fetch_director)
movies['overview'] = movies['overview'].apply(lambda x: x if isinstance(x, str) else "")

movies.head(2)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,Action Adventure Fantasy ScienceFiction,http://www.avatarmovie.com/,19995,cultureclash future spacewar spacecolony socie...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,SamWorthington ZoeSaldana SigourneyWeaver,JamesCameron
1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drugabuse exoticisland eastindiatradingc...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,JohnnyDepp OrlandoBloom KeiraKnightley,GoreVerbinski


## Step 7: Keep top 3 cast members and the director

I keep only the top 3 cast names and the director because:
- leading actors and the director strongly influence the movie's feel and target audience,
- keeping a small number avoids noisy or irrelevant minor names.

This keeps features focused and improves similarity quality.

## Step 8: Combine features into a single text field (with weighting)

I merge genres, keywords, cast, crew, and overview into one combined string per movie.
I intentionally repeat genres three times and keywords twice (a simple form of weighting) so that these structured attributes count for more than the words of a long overview paragraph.
The combined text acts like a short 'document' describing each movie for the vectorizer.
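
To see what the repetition does, here is a small sketch that counts tokens in one combined string. The sample field values are made up for illustration:

```python
from collections import Counter

# Made-up field values for a single movie.
genres, keywords, overview = "Action Thriller", "heist", "A crew plans one last heist."

# Same pattern as the notebook: genres x3, keywords x2, overview once.
combined = (genres + " ") * 3 + (keywords + " ") * 2 + overview
counts = Counter(combined.lower().replace(".", "").split())

print(combined)
print(counts["action"], counts["heist"], counts["crew"])
```

Repetition raises the raw term frequency of the repeated tokens before TF-IDF weighting and normalization are applied, so the multipliers are a rough dial rather than exact weights.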


In [39]:

movies['combined'] = (
    (movies['genres'] + " ") * 3 +
    (movies['keywords'] + " ") * 2 +
    movies['cast'] + " " +
    movies['crew'] + " " +
    movies['overview']
)
movies.head(2)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew,combined
0,237000000,Action Adventure Fantasy ScienceFiction,http://www.avatarmovie.com/,19995,cultureclash future spacewar spacecolony socie...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,SamWorthington ZoeSaldana SigourneyWeaver,JamesCameron,Action Adventure Fantasy ScienceFiction Action...
1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drugabuse exoticisland eastindiatradingc...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,JohnnyDepp OrlandoBloom KeiraKnightley,GoreVerbinski,Adventure Fantasy Action Adventure Fantasy Act...


## Step 9: Convert text to vectors using TF-IDF

TF-IDF creates numeric vectors that:
- downweight words that appear in many movies, and
- upweight words that are distinctive for a particular movie.

This works better than raw counts because it keeps generic words like "action" or "movie" from dominating the similarity.
I also include bigrams alongside single words (ngram_range=(1,2)), drop English stop words, and cap the vocabulary at 5,000 features (max_features=5000) to keep the matrix small.
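
The downweighting can be checked on a toy corpus. This sketch uses `TfidfVectorizer` with default settings (not the notebook's parameters) on three made-up documents and compares the learned IDF of a word that appears everywhere with one that appears once:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus: "action" is in every document, "heist" in only one.
docs = [
    "action hero saves city",
    "action spy chases villain",
    "action heist crew robs casino",
]

vec = TfidfVectorizer()              # defaults: unigrams, smoothed IDF
X = vec.fit_transform(docs)

# Map each term to its inverse document frequency.
idf = {term: float(vec.idf_[col]) for term, col in vec.vocabulary_.items()}

print("idf(action) =", round(idf["action"], 3))
print("idf(heist)  =", round(idf["heist"], 3))
print("matrix shape:", X.shape)      # (documents, distinct terms)
```

The term found in every document receives the lowest possible IDF, while the rare term gets a larger one, which is the effect the notebook relies on.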

## Step 10: Compute cosine similarity

Cosine similarity is the cosine of the angle between two TF-IDF vectors, so it compares the direction of the vectors rather than their length.
A higher cosine score means the two movies share more of their heavily weighted words/features.
We compute the full movie-by-movie similarity matrix once so we can quickly look up the nearest neighbors (most similar movies) for any title.
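
As a worked example, the sketch below computes the score by hand (dot product divided by the product of the vector lengths) and compares it with scikit-learn's `cosine_similarity`. The vectors and the helper name `cosine` are made up for illustration:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])   # same direction as a, twice as long
c = np.array([[0.0, 0.0, 3.0]])   # shares no non-zero dimension with a

def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    u, v = u.ravel(), v.ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

manual_ab = cosine(a, b)
sklearn_ab = float(cosine_similarity(a, b)[0, 0])
manual_ac = cosine(a, c)

print(manual_ab, sklearn_ab, manual_ac)
```

Vectors that point the same way score close to 1 regardless of length, and vectors with no shared non-zero dimension score 0. Since TF-IDF values are non-negative, scores between movies fall between 0 and 1.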


In [40]:

tfidf = TfidfVectorizer(stop_words='english', max_features=5000, ngram_range=(1,2))
tfidf_matrix = tfidf.fit_transform(movies['combined'])

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)


## Step 11: Recommendation function (how it works)

The function:
1. Finds the best-matching title (exact or fuzzy).
2. Gets the movie's index and the cosine similarity scores for that index.
3. Sorts other movies by similarity score (highest first).
4. Skips the same movie and returns the top-N similar titles (optionally with similarity scores).

Edge cases handled:
- If the title is not found (no exact match and no close fuzzy match), a friendly message is returned.
- If too few movies exist, it returns as many as available.
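
The fuzzy-matching step can be tried on its own. This sketch calls `difflib.get_close_matches` with the same `n=1, cutoff=0.6` settings on a short made-up title list:

```python
from difflib import get_close_matches

titles = ["inception", "interstellar", "avatar"]   # lowercased toy titles

# A misspelled query still finds its closest title.
hit = get_close_matches("inceptoin", titles, n=1, cutoff=0.6)

# A query with nothing in common returns an empty list.
miss = get_close_matches("zzzz", titles, n=1, cutoff=0.6)

print(hit, miss)
```

The comparison is case-sensitive, so lowercasing both the query and the candidate titles before matching avoids misses caused only by capitalization.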


In [41]:

movies = movies.reset_index(drop=True)
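# Map lowercased title -> row position; keep the first row when a title repeats.
title_to_index = pd.Series(movies.index, index=movies['title'].str.lower())
title_to_index = title_to_index[~title_to_index.index.duplicated(keep='first')]

def find_best_title(query):
    """Return the stored title matching query exactly or closely, else None."""
    query_low = query.lower()
    if query_low in title_to_index:
        return movies.iloc[title_to_index[query_low]]['title']
    # Fuzzy match on lowercased titles so letter case does not affect the score.
    candidates = title_to_index.index.tolist()
    matches = get_close_matches(query_low, candidates, n=1, cutoff=0.6)
    return movies.iloc[title_to_index[matches[0]]]['title'] if matches else None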

def recommend(title, top=5, show_score=True):
    best = find_best_title(title)
    if best is None:
        return ["Movie not found."]
    
    idx = title_to_index[best.lower()]
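    sim_scores = list(enumerate(cosine_sim[idx]))
    # Drop the query movie by position instead of assuming it sorts first.
    sim_scores = [pair for pair in sim_scores if pair[0] != idx]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:top]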
    
    results = []
    for i, score in sim_scores:
        if show_score:
            results.append((movies.iloc[i]['title'], round(float(score), 3)))
        else:
            results.append(movies.iloc[i]['title'])
    return results
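
The ranking logic can also be sketched with NumPy on a tiny made-up similarity matrix. The helper name `top_n` and the matrix values are assumptions for illustration, not part of the notebook:

```python
import numpy as np

# Made-up 3x3 similarity matrix (symmetric, ones on the diagonal).
sim = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
])

def top_n(sim, idx, n):
    """Return up to n row positions most similar to idx, excluding idx."""
    order = np.argsort(sim[idx])[::-1]            # highest score first
    return [int(j) for j in order if j != idx][:n]

print(top_n(sim, 0, 2))    # neighbors of item 0
print(top_n(sim, 0, 10))   # asking for more than exist returns what is available
```

Filtering out the query position explicitly, rather than slicing off the first element, stays correct even when another row ties with the query at the top score.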


## Step 13: Demo examples to run

Try these examples for a clean demo:
- recommend("Inception", top=5)
- recommend("The Dark Knight Rises", top=5)

These titles have rich metadata and therefore produce more meaningful neighbors.
The similarity scores are shown to highlight how close each recommended movie is to the query.


In [42]:

print("Recommendations for 'Inception':")
print(recommend("Inception", top=5))

print("\nRecommendations for 'The Dark Knight Rises':")
print(recommend("The Dark Knight Rises", top=5))


Recommendations for 'Inception':
[('Minority Report', 0.362), ('2001: A Space Odyssey', 0.311), ('Super 8', 0.31), ('The Thirteenth Floor', 0.306), ('A Sound of Thunder', 0.299)]

Recommendations for 'The Dark Knight Rises':
[('The Dark Knight', 0.534), ('Batman Begins', 0.439), ('Batman Returns', 0.397), ('Batman & Robin', 0.306), ('Batman Forever', 0.289)]
