# Content-Based Movie Recommendation System

## Project Overview
This notebook builds a content-based recommendation system using the TMDB 5000 Movie Dataset.
The system recommends movies whose content features (overview, genres, keywords, top-billed cast, and director) are most similar to those of a given movie.

## Dataset
- TMDB 5000 Movie Dataset from Kaggle
- Two CSV files: tmdb_5000_movies.csv and tmdb_5000_credits.csv

## 1. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import ast
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

## 2. Load the Dataset

**Note:** Download the TMDB 5000 Movie Dataset from Kaggle:
- https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata
- Place `tmdb_5000_movies.csv` and `tmdb_5000_credits.csv` in the same folder as this notebook

In [None]:
# Load datasets
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

print(f"Movies shape: {movies.shape}")
print(f"Credits shape: {credits.shape}")

In [None]:
# Display first few rows
print("Movies Dataset:")
display(movies.head())

print("\nCredits Dataset:")
display(credits.head())

In [None]:
# Check columns
print("Movies columns:", movies.columns.tolist())
print("\nCredits columns:", credits.columns.tolist())

## 3. Data Preprocessing

In [None]:
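# Merge datasets on the TMDB id rather than the title: titles are not unique,
# so a title-based join pairs a movie with the credits of every same-titled movie
movies = movies.merge(credits.drop(columns='title'), left_on='id', right_on='movie_id')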
print(f"Merged dataset shape: {movies.shape}")
movies.head(2)
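
Because movie titles are not guaranteed to be unique, the choice of join key affects how many rows the merge produces. The sketch below uses two tiny made-up frames (the ids, titles, and values are illustrative, not taken from the TMDB files) to compare a join on the title with a join on an id column.

```python
import pandas as pd

# Hypothetical toy data: two different movies share the title "The Host"
movies_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["The Host", "The Host", "Up"],
    "overview": ["monster film", "alien romance", "balloon house"],
})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["The Host", "The Host", "Up"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Joining on title pairs each "The Host" row with both credits rows (2 x 2 = 4 rows)
by_title = movies_toy.merge(credits_toy, on="title")

# Joining on the id keeps exactly one row per movie
by_id = movies_toy.merge(
    credits_toy.drop(columns="title"), left_on="id", right_on="movie_id"
)

print(len(by_title), len(by_id))
```

In the title-based result, two of the "The Host" rows combine one movie's overview with the other movie's cast, which is the kind of mismatch an id-based join avoids.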

In [None]:
# Select relevant columns
movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]
print(f"Selected columns: {movies.columns.tolist()}")
movies.head()

In [None]:
# Check for missing values
print("Missing values:")
print(movies.isnull().sum())

In [None]:
# Drop rows with a missing value in any of the selected columns
movies = movies.dropna()
print(f"Dataset shape after dropping nulls: {movies.shape}")

In [None]:
# Check for duplicates
print(f"Duplicate rows: {movies.duplicated().sum()}")
movies = movies.drop_duplicates()
print(f"Shape after removing duplicates: {movies.shape}")

## 4. Feature Engineering - Extract Information from JSON Columns

In [None]:
# Extract the 'name' field from a stringified list of dictionaries (genres/keywords)
def convert(obj):
    try:
        return [item['name'] for item in ast.literal_eval(obj)]
    except (ValueError, SyntaxError, TypeError, KeyError):
        # Malformed or missing entries yield no names
        return []

# Test the function
print("Sample genres before:", movies['genres'].iloc[0])
print("Sample genres after:", convert(movies['genres'].iloc[0]))

In [None]:
# Apply to genres and keywords
movies['genres'] = movies['genres'].apply(convert)
movies['keywords'] = movies['keywords'].apply(convert)

print("Genres converted successfully")
movies[['title', 'genres', 'keywords']].head()
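
The genres and keywords cells hold each list as a string, so the string has to be parsed before the names can be read. A minimal sketch on a hand-written sample value (the ids and names below are illustrative):

```python
import ast

# Illustrative cell value shaped like the genres column: a string, not a list
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Science Fiction"}]'

parsed = ast.literal_eval(raw)         # string -> list of dicts
names = [d["name"] for d in parsed]    # keep only the 'name' field

print(type(raw).__name__, "->", type(parsed).__name__)
print(names)
```

If the stored strings are valid JSON, `json.loads` should parse them equally well; `ast.literal_eval` is kept here to match the notebook.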

In [None]:
# Extract the names of the first 3 listed cast members
def convert_cast(obj):
    try:
        return [item['name'] for item in ast.literal_eval(obj)[:3]]
    except (ValueError, SyntaxError, TypeError, KeyError):
        # Malformed or missing entries yield no names
        return []

movies['cast'] = movies['cast'].apply(convert_cast)
print("Cast converted successfully")
movies[['title', 'cast']].head()

In [None]:
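# Extract the names of crew members whose job is 'Director'
def fetch_director(obj):
    try:
        return [item['name'] for item in ast.literal_eval(obj) if item['job'] == 'Director']
    except (ValueError, SyntaxError, TypeError, KeyError):
        # Malformed or missing entries yield no names
        return []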

movies['crew'] = movies['crew'].apply(fetch_director)
movies.rename(columns={'crew': 'director'}, inplace=True)

print("Director extracted successfully")
movies[['title', 'director']].head()

In [None]:
# Convert overview to list of words
movies['overview'] = movies['overview'].apply(lambda x: x.split() if isinstance(x, str) else [])

print("Overview converted to list")
movies[['title', 'overview']].head()

## 5. Data Cleaning - Remove Spaces from Names

In [None]:
# Remove spaces from multi-word names to treat them as single tokens
def remove_space(L):
    if isinstance(L, list):
        return [i.replace(" ", "") for i in L]
    return L

movies['cast'] = movies['cast'].apply(remove_space)
movies['director'] = movies['director'].apply(remove_space)
movies['genres'] = movies['genres'].apply(remove_space)
movies['keywords'] = movies['keywords'].apply(remove_space)

print("Spaces removed from names")
movies.head()
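
To see why joining multi-word names matters, the sketch below fits `CountVectorizer` on two made-up tag strings with and without spaces. The names are only examples; the point is that a shared first name becomes a shared token unless the full name is fused.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up tag strings: different people who share a first name
with_spaces = ["sam worthington", "sam mendes"]
without_spaces = ["samworthington", "sammendes"]

# vocabulary_ maps each learned token to a column index; sorted() lists the tokens
vocab_spaced = sorted(CountVectorizer().fit(with_spaces).vocabulary_)
vocab_joined = sorted(CountVectorizer().fit(without_spaces).vocabulary_)

print(vocab_spaced)   # 'sam' appears as a token shared by both people
print(vocab_joined)   # each full name is its own token
```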

## 6. Create Tags Column - Combine All Features

In [None]:
# Combine all features into a single 'tags' column
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['director']

print("Tags column created")
movies[['title', 'tags']].head()
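
Adding two pandas columns whose cells are Python lists concatenates the lists row by row, which is what makes the one-line combination work. A small sketch with made-up values:

```python
import pandas as pd

# Made-up frame whose cells are lists, like the columns after feature extraction
df = pd.DataFrame({
    "genres": [["Action"], ["Drama"]],
    "cast": [["A", "B"], []],
})

# '+' on object columns applies list concatenation element-wise
tags = df["genres"] + df["cast"]
print(tags.tolist())
```

This only works when every cell is a list; a missing value in any of the columns would propagate or raise, which is one reason to handle nulls before this step.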

In [None]:
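# Create a new dataframe with only the necessary columns; reset the index so that
# index labels match row positions in the vectors and the similarity matrix
new_df = movies[['movie_id', 'title', 'tags']].reset_index(drop=True)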

print(f"New dataframe shape: {new_df.shape}")
new_df.head()

In [None]:
# Convert tags list to string
new_df['tags'] = new_df['tags'].apply(lambda x: " ".join(x) if isinstance(x, list) else "")

print("Tags converted to string")
print("\nSample tags:")
print(new_df['tags'].iloc[0][:500])

In [None]:
# Convert to lowercase
new_df['tags'] = new_df['tags'].apply(lambda x: x.lower())

print("Tags converted to lowercase")
print("\nSample tags after lowercase:")
print(new_df['tags'].iloc[0][:500])

## 7. Vectorization - Convert Text to Vectors

In [None]:
# Build bag-of-words count vectors with CountVectorizer (no stemming yet; see Section 8)

cv = CountVectorizer(max_features=5000, stop_words='english')
vectors = cv.fit_transform(new_df['tags']).toarray()

print(f"Vectors shape: {vectors.shape}")
print(f"Number of movies: {vectors.shape[0]}")
print(f"Number of features: {vectors.shape[1]}")

In [None]:
# Check feature names
print("Sample feature names:")
print(cv.get_feature_names_out()[:50])

## 8. Apply Stemming for Better Results (Optional but Recommended)

In [None]:
# Install nltk if not already installed
# !pip install nltk

from nltk.stem.porter import PorterStemmer

# PorterStemmer is purely rule-based, so no NLTK data download is required;
# 'punkt' is a tokenizer resource, and this notebook splits text with str.split()
print("PorterStemmer imported")

In [None]:
# Apply stemming
ps = PorterStemmer()

def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)

new_df['tags'] = new_df['tags'].apply(stem)

print("Stemming applied")
print("\nSample tags after stemming:")
print(new_df['tags'].iloc[0][:500])
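
Stemming maps inflected forms of a word to a common stem so that they count as the same feature. A minimal sketch with a few hand-picked words:

```python
from nltk.stem.porter import PorterStemmer

ps_demo = PorterStemmer()

# Inflected forms that should collapse to a single stem
words = ["love", "loved", "loving", "loves"]
stems = [ps_demo.stem(w) for w in words]

print(stems)
```

Stems are not always dictionary words, so the tags may look truncated after this step; that is expected and does not affect the similarity computation.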

In [None]:
# Re-vectorize after stemming
cv = CountVectorizer(max_features=5000, stop_words='english')
vectors = cv.fit_transform(new_df['tags']).toarray()

print(f"Vectors shape after stemming: {vectors.shape}")

## 9. Calculate Cosine Similarity

In [None]:
# Calculate cosine similarity between all movie vectors
similarity = cosine_similarity(vectors)

print(f"Similarity matrix shape: {similarity.shape}")
print("\nSample similarity scores for first movie:")
print(similarity[0][:10])
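
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. A worked sketch in NumPy with made-up count vectors:

```python
import numpy as np

a = np.array([1, 0, 1, 1])
b = np.array([0, 1, 1, 1])
c = np.array([2, 0, 2, 2])   # same direction as a, with doubled counts

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosine(a, b)   # 2 / (sqrt(3) * sqrt(3)) = 2/3
sim_ac = cosine(a, c)   # 6 / (sqrt(3) * sqrt(12)) = 1

print(round(sim_ab, 4), round(sim_ac, 4))
```

Because `a` and `c` point in the same direction, their similarity is 1 even though the raw counts differ, which is why longer tag strings are not automatically favored.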

## 10. Build Recommendation Function

In [None]:
def recommend(movie):
    """
    Recommend top 5 similar movies based on content similarity
    
    Parameters:
    movie (str): Title of the movie
    
    Returns:
    list: List of 5 recommended movie titles
    """
    try:
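        # Get the row position of the movie (not its index label), since
        # similarity and new_df.iloc are both addressed by position
        movie_index = np.flatnonzero((new_df['title'] == movie).to_numpy())[0]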
        
        # Get similarity scores for this movie with all other movies
        distances = similarity[movie_index]
        
        # Sort by similarity (highest first), skip the top entry (the movie itself), and keep the next 5
        movies_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:6]
        
        # Get movie titles
        recommended_movies = []
        for i in movies_list:
            recommended_movies.append(new_df.iloc[i[0]].title)
        
        return recommended_movies
    
    except IndexError:
        return f"Movie '{movie}' not found in the database. Please check the title and try again."
    except Exception as e:
        return f"Error: {str(e)}"
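
The similarity matrix is addressed by row position, whereas a DataFrame's `.index` holds labels; the two diverge once rows have been dropped without resetting the index. A small sketch with made-up data shows the difference:

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing overview
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "overview": ["x", None, "y", "z"],
})
df = df.dropna()   # drops the row labelled 1; remaining labels are 0, 2, 3

label = df[df["title"] == "D"].index[0]                         # index label
position = np.flatnonzero((df["title"] == "D").to_numpy())[0]   # row position

print(label, position)
# A matrix built from this 3-row frame has rows 0..2, so only the position is valid
```

Either looking up the position as above or calling `reset_index(drop=True)` before building the vectors keeps the two in step.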

## 11. Test the Recommendation System

In [None]:
# Test with some popular movies
print("Recommendations for 'Avatar':")
print(recommend('Avatar'))

print("\nRecommendations for 'The Dark Knight':")
print(recommend('The Dark Knight'))

print("\nRecommendations for 'Inception':")
print(recommend('Inception'))

In [None]:
# Check available movie titles
print(f"Total movies in database: {len(new_df)}")
print("\nSample movie titles:")
print(new_df['title'].head(20).tolist())

## 12. Save Models and Data for Deployment

In [None]:
# Save the movie dataframe
with open('movies.pkl', 'wb') as f:
    pickle.dump(new_df, f)
print("Movies dataframe saved as 'movies.pkl'")

In [None]:
# Save the similarity matrix
with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)
print("Similarity matrix saved as 'similarity.pkl'")

In [None]:
# Save the vectorizer (optional)
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(cv, f)
print("Vectorizer saved as 'vectorizer.pkl'")

In [None]:
# Verify saved files
import os

files = ['movies.pkl', 'similarity.pkl', 'vectorizer.pkl']
for file in files:
    if os.path.exists(file):
        size = os.path.getsize(file) / (1024 * 1024)  # Size in MB
        print(f"✓ {file} - {size:.2f} MB")
    else:
        print(f"✗ {file} - Not found")
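
A deployment script would read these files back with `pickle.load`. The sketch below shows the round trip on a stand-in array written to a temporary directory; the file name is hypothetical and the real similarity matrix is not needed to run it.

```python
import os
import pickle
import tempfile

import numpy as np

matrix = np.eye(3)   # stand-in for the similarity matrix

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "similarity_demo.pkl")   # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(matrix, f)       # write
    with open(path, "rb") as f:
        restored = pickle.load(f)    # read back

print(np.array_equal(matrix, restored))
```

Pickle files can execute code when loaded, so only unpickle files that come from a trusted source.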

## 13. Enhanced Recommendation Function with Details

In [None]:
def recommend_with_details(movie):
    """
    Recommend movies with similarity scores
    """
    try:
        # Row position (not index label) of the movie, matching the similarity matrix
        movie_index = np.flatnonzero((new_df['title'] == movie).to_numpy())[0]
        distances = similarity[movie_index]
        movies_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:6]
        
        print(f"Top 5 recommendations for '{movie}':\n")
        for idx, (i, score) in enumerate(movies_list, 1):
            print(f"{idx}. {new_df.iloc[i].title} (Similarity: {score:.4f})")
        
    except IndexError:
        print(f"Movie '{movie}' not found in the database.")
    except Exception as e:
        print(f"Error: {str(e)}")

# Test
recommend_with_details('Avatar')
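
Skipping the first sorted entry assumes the queried movie always ranks first. One possible alternative, sketched here on a made-up score vector, is to rank with `np.argsort` and exclude the query by its position explicitly:

```python
import numpy as np

# Made-up similarity scores of one movie (position 0) against five movies
scores = np.array([1.0, 0.2, 0.7, 0.0, 0.5])
query = 0
k = 3

# Positions ordered from highest to lowest score; 'stable' keeps ties in original order
order = np.argsort(-scores, kind="stable")

# Drop the query itself, then keep the k best remaining positions
top = [int(i) for i in order if i != query][:k]

print(top)
print([float(scores[i]) for i in top])
```

This is a sketch of the ranking step only; mapping the positions back to titles would use `new_df.iloc` as in the functions above.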

## Summary

### What We Built:
1. ✅ Loaded and merged TMDB movie and credits datasets
2. ✅ Extracted features: overview, genres, keywords, cast, and director
3. ✅ Created a combined 'tags' column with all features
4. ✅ Applied text preprocessing: lowercase, stemming, and space removal
5. ✅ Vectorized text using CountVectorizer (5000 features)
6. ✅ Calculated cosine similarity matrix
7. ✅ Built recommend() function returning top 5 similar movies
8. ✅ Saved the dataframe, similarity matrix, and vectorizer using pickle for deployment

### Files Created:
- `movies.pkl` - Movie dataframe with titles and tags
- `similarity.pkl` - Cosine similarity matrix
- `vectorizer.pkl` - Fitted CountVectorizer

### Next Steps:
- Build Streamlit app for interactive UI
- Add movie posters using TMDB API
- Deploy to cloud platform