<a href="https://colab.research.google.com/github/Tanu-N-Prabhu/Python/blob/master/Machine%20Learning/05_projects/Movie%20Recommendation%20System/tmdb_movie_recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TMDB Movie Recommendation System

This notebook builds a **content-based recommendation system** using the [TMDB 5000 Movie Dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata).  
We'll recommend movies based on **overview**, **genres**, **cast**, **crew**, and **keywords** using text similarity.

---

## You Will Learn
- Clean and combine content features
- Vectorize the combined text with CountVectorizer (TF-IDF is a common alternative that this notebook does not use)
- Measure cosine similarity between movie vectors
- Recommend similar movies for any title in the dataset

---


## Step 1: Upload and Load TMDB Datasets

In [2]:
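from google.colab import files  # provides files.upload() for picking local files in Colab
import pandas as pd

# Both CSVs must already be in /content; if they are not, call files.upload() first
movies = pd.read_csv('/content/tmdb_5000_movies.csv')
credits = pd.read_csv('/content/tmdb_5000_credits.csv')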

movies.shape, credits.shape

((4803, 20), (4803, 4))

## Step 2: Merge Movies and Credits

In [3]:
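# Join movie metadata to credits: movies.id matches credits.movie_id
movies = movies.merge(credits, left_on='id', right_on='movie_id')
# Both tables carry a 'title' column, so the merge suffixes them; keep the movies-side title_x
movies = movies[['id', 'title_x', 'overview', 'genres', 'keywords', 'cast', 'crew']]
movies = movies.rename(columns={'title_x': 'title'})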
movies.head(2)

| | id | title | overview | genres | keywords | cast | crew |
|---|---|---|---|---|---|---|---|
| 0 | 19995 | Avatar | In the 22nd century, a paraplegic Marine is di... | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | [{"id": 1463, "name": "culture clash"}, {"id":... | [{"cast_id": 242, "character": "Jake Sully", "... | [{"credit_id": "52fe48009251416c750aca23", "de... |
| 1 | 285 | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | [{"id": 270, "name": "ocean"}, {"id": 726, "na... | [{"cast_id": 4, "character": "Captain Jack Spa... | [{"credit_id": "52fe4232c3a36847f800b579", "de... |
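
The `title_x` column above comes from pandas' default suffixing when both tables share a column name. A minimal sketch on toy data shows this; the frames `movies_toy` and `credits_toy` and their values are made up for illustration, and the `validate` argument is an optional safeguard that the notebook itself does not use:

```python
import pandas as pd

# Toy stand-ins for the two TMDB tables (values are invented for illustration).
movies_toy = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
credits_toy = pd.DataFrame(
    {"movie_id": [1, 2, 3], "title": ["A", "B", "C"], "cast": ["[]", "[]", "[]"]}
)

merged = movies_toy.merge(
    credits_toy,
    left_on="id",
    right_on="movie_id",
    how="inner",
    validate="one_to_one",  # raises pandas.errors.MergeError if a key repeats
)

# 'title' exists on both sides, so pandas appends _x (left) and _y (right).
print(list(merged.columns))  # ['id', 'title_x', 'movie_id', 'title_y', 'cast']
```

With `validate="one_to_one"`, a duplicated `id` or `movie_id` would fail loudly instead of silently multiplying rows.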


## Step 3: Preprocess Text Columns

In [4]:
import ast

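# Errors raised by malformed strings, missing values (NaN), or unexpected record shapes
PARSE_ERRORS = (ValueError, SyntaxError, TypeError, KeyError)

# Helper to extract names from JSON-like strings
def extract_names(text):
    try:
        return [obj['name'] for obj in ast.literal_eval(text)]
    except PARSE_ERRORS:
        return []

# For cast, keep only the first 3 actors listed
def extract_top_cast(text):
    try:
        return [obj['name'] for obj in ast.literal_eval(text)[:3]]
    except PARSE_ERRORS:
        return []

# From crew, keep only the first person whose job is 'Director'
def extract_director(text):
    try:
        for obj in ast.literal_eval(text):
            if obj['job'] == 'Director':
                return [obj['name']]
        return []
    except PARSE_ERRORS:
        return []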

movies['genres'] = movies['genres'].apply(extract_names)
movies['keywords'] = movies['keywords'].apply(extract_names)
movies['cast'] = movies['cast'].apply(extract_top_cast)
movies['crew'] = movies['crew'].apply(extract_director)

# Fill missing overviews
movies['overview'] = movies['overview'].fillna('')
movies['overview'] = movies['overview'].apply(lambda x: x.split())

# Combine into tags
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']
movies['tags'] = movies['tags'].apply(lambda x: " ".join(x).lower())
movies[['title', 'tags']].head()

| | title | tags |
|---|---|---|
| 0 | Avatar | in the 22nd century, a paraplegic marine is di... |
| 1 | Pirates of the Caribbean: At World's End | captain barbossa, long believed to be dead, ha... |
| 2 | Spectre | a cryptic message from bond’s past sends him o... |
| 3 | The Dark Knight Rises | following the death of district attorney harve... |
| 4 | John Carter | john carter is a war-weary, former military ca... |
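
To make the parsing step concrete, here is a small sketch on a hand-written string shaped like the `genres` column. The sample value `raw` is hypothetical, and the space-collapsing variant at the end is an optional tweak that this notebook does not apply:

```python
import ast

# Hypothetical sample in the same shape as the TMDB 'genres' column.
raw = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'

# Parse the string into a list of dicts and keep only the names.
names = [obj["name"] for obj in ast.literal_eval(raw)]
print(names)  # ['Action', 'Science Fiction']

# Joining with spaces and lowercasing, as the tags step does,
# splits a multi-word name into separate tokens.
tags = " ".join(names).lower()
print(tags.split())  # ['action', 'science', 'fiction']

# Optional variant (not in the notebook): collapse spaces so each
# name stays a single token.
collapsed = [name.replace(" ", "") for name in names]
print(" ".join(collapsed).lower().split())  # ['action', 'sciencefiction']
```

The same effect applies to cast and director names: without collapsing, two different actors who share a first name also share a token.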


## Step 4: Vectorization and Similarity

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

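# Bag-of-words counts over the 5,000 most frequent terms, ignoring English stop words
cv = CountVectorizer(max_features=5000, stop_words='english')
vectors = cv.fit_transform(movies['tags']).toarray()

# Pairwise cosine similarity between all movies: a square matrix with 1.0 on the diagonal
similarity = cosine_similarity(vectors)
similarity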

array([[1.        , 0.06885304, 0.04948717, ..., 0.03142697, 0.05410018,
        0.        ],
       [0.06885304, 1.        , 0.04259177, ..., 0.04057204, 0.        ,
        0.        ],
       [0.04948717, 0.04259177, 1.        , ..., 0.01944039, 0.08924215,
        0.        ],
       ...,
       [0.03142697, 0.04057204, 0.01944039, ..., 1.        , 0.06375767,
        0.03276488],
       [0.05410018, 0.        , 0.08924215, ..., 0.06375767, 1.        ,
        0.03760222],
       [0.        , 0.        , 0.        , ..., 0.03276488, 0.03760222,
        1.        ]])
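
Each entry of this matrix is the cosine of the angle between two count vectors, $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. The following pure-Python sketch computes it by hand on three made-up tag strings. It is a simplified illustration, not a re-implementation of scikit-learn: the helper `cosine` is hypothetical and skips stop-word removal, the vocabulary cap, and CountVectorizer's tokenization rules.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between whitespace-tokenized word-count vectors."""
    counts_a, counts_b = Counter(a.split()), Counter(b.split())
    # Dot product only needs the words both documents share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

# Made-up tag strings for illustration.
doc1 = "space marine alien planet"
doc2 = "space alien invasion"
doc3 = "romantic comedy wedding"

print(round(cosine(doc1, doc2), 3))  # 0.577 (two shared words)
print(cosine(doc1, doc3))            # 0.0 (no shared words)
print(cosine(doc1, doc1))            # 1.0 (identical documents)
```

This mirrors what the matrix shows: 1.0 on the diagonal, and 0 wherever two movies share no vocabulary terms.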

## Step 5: Recommendation Function

In [10]:
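# Titles share the DataFrame's 0..n-1 index, so a title's label is also its row in `similarity`
movie_titles = movies['title']

def recommend(title):
    if title not in movie_titles.values:
        return "Movie not found in dataset."

    # If a title appears more than once, the first match is used
    idx = movie_titles[movie_titles == title].index[0]
    distances = list(enumerate(similarity[idx]))
    # Sort by similarity, highest first; position 0 is normally the movie itself, so skip it
    sorted_movies = sorted(distances, key=lambda x: x[1], reverse=True)[1:6]
    for i, _score in sorted_movies:
        print(movies.iloc[i]['title'])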
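
The ranking logic can also be written so that it returns results and excludes the query movie by index rather than by position. The sketch below is one possible variant on a toy similarity matrix; the helper `top_k`, the titles, and the scores are all invented for illustration:

```python
def top_k(similarity_row, self_idx, k=5):
    """Return indices of the k most similar items, excluding self_idx."""
    candidates = (i for i in range(len(similarity_row)) if i != self_idx)
    ranked = sorted(candidates, key=lambda i: similarity_row[i], reverse=True)
    return ranked[:k]

# Toy data: four titles and a symmetric similarity matrix (made-up scores).
titles = ["A", "B", "C", "D"]
sim = [
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
]

idx = titles.index("A")
print([titles[i] for i in top_k(sim[idx], idx, k=2)])  # ['C', 'B']
```

Filtering on the index avoids relying on the query movie sorting first, which may not hold if another movie has identical tags. Returning a list also makes the result easier to reuse than printed output.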

## Step 6: Try Recommending

In [17]:
recommend('Toy Story 3')

Toy Story 2
Toy Story
Small Soldiers
The 40 Year Old Virgin
Ted


---

## Summary

- Merged movie metadata with credits
- Extracted key features (overview, genres, keywords, top 3 cast members, director)
- Built a content-based recommender using CountVectorizer and cosine similarity
- Recommended top 5 similar movies to a given title

---

## Dataset Used
- [TMDB 5000 Movies Dataset on Kaggle](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata)

Before running Step 1, upload both files to `/content/`:
- `tmdb_5000_movies.csv`
- `tmdb_5000_credits.csv`

---
