# 🎬 Movie Recommendation System

A content-based movie recommendation system using TF-IDF vectorization, K-Means clustering, and cosine similarity.

**Features:**
- Content-Based Filtering using genres, cast, director, and keywords
- Hybrid Recommendations combining content similarity with popularity
- K-Means Clustering for efficient similarity search
- Fuzzy Search for handling typos

---

## 1. Setup and Imports

Import required libraries for data processing, visualization, and machine learning.

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import ast
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from difflib import get_close_matches

pd.set_option('display.max_columns', None)

Libraries imported!


## 2. Data Loading

Load the TMDB 5000 dataset consisting of two files:
- `tmdb_5000_movies.csv` - Movie metadata (genres, keywords, overview, etc.)
- `tmdb_5000_credits.csv` - Cast and crew information

In [348]:
movies_df = pd.read_csv('../data/tmdb_5000_movies.csv')
credits_df = pd.read_csv('../data/tmdb_5000_credits.csv')

print(f'Movies dataset: {movies_df.shape}')
print(f'Credits dataset: {credits_df.shape}')

Movies dataset: (4803, 20)
Credits dataset: (4803, 4)


### Preview the Movies Dataset

In [371]:
movies_df.head(5)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


### Preview the Credits Dataset

In [372]:
credits_df.head(5)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


## 3. Data Preprocessing & Feature Engineering

### Key Steps:
1. Merge movies and credits datasets
2. Parse JSON-formatted columns (genres, keywords, cast, crew)
3. Extract director from crew
4. Handle missing values
5. Create movie profile for similarity matching

### 3.1 Merge Datasets

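Merge the two datasets by matching `id` in the movies table with `movie_id` in the credits table. The credits `title` column is dropped first so the merged frame does not carry two title columns.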

In [351]:
df = movies_df.merge(credits_df.drop(columns=['title']), left_on='id', right_on='movie_id')
print(f'Merged dataset: {df.shape}')

Merged dataset: (4803, 23)


### 3.2 Define Helper Functions

Create functions to parse JSON-formatted strings in the dataset.

In [None]:
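def parse_json(text, key='name', limit=None):
    """Parse a JSON-formatted list of dicts and join its `key` values into one string.

    Spaces inside each value are removed so that multi-word names
    (e.g. 'Science Fiction') become single tokens.
    """
    try:
        items = ast.literal_eval(text)
        if limit:
            items = items[:limit]
        return ' '.join(item[key].replace(' ', '') for item in items)
    except Exception:
        # Missing or malformed cell: fall back to an empty string.
        return ''

def get_director(text):
    """Return the first crew member whose job is 'Director', with spaces removed."""
    try:
        for member in ast.literal_eval(text):
            if member.get('job') == 'Director':
                return member.get('name', '').replace(' ', '')
        return ''
    except Exception:
        # Missing or malformed cell: no director available.
        return ''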


Helper functions defined
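
To make the helper's behavior concrete, here is a minimal standalone sketch. It is not the notebook's function: `names_from_json` is a hypothetical name, it uses `json.loads` rather than `ast.literal_eval`, and the sample string is only modeled on the `genres` column.

```python
import json

def names_from_json(text, key="name", limit=None):
    """Return the space-joined values of `key`, with inner spaces removed."""
    try:
        items = json.loads(text)
    except (TypeError, ValueError):
        return ""  # missing (None/NaN) or malformed cell
    if limit:
        items = items[:limit]
    return " ".join(item[key].replace(" ", "") for item in items)

# Sample string modeled on the `genres` column (an assumption, not real data).
sample = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'

print(names_from_json(sample))           # Action ScienceFiction
print(names_from_json(sample, limit=1))  # Action
print(repr(names_from_json(None)))       # ''
```

Collapsing "Science Fiction" into `ScienceFiction` keeps each name as a single token, so a later vectorizer does not treat "Science" and "Fiction" as unrelated words.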


### 3.3 Parse JSON Columns

Extract genres, the first 5 keywords, the first 5 cast members, and the director from the JSON columns.

In [None]:
df['genres_clean'] = df['genres'].apply(parse_json)
df['keywords_clean'] = df['keywords'].apply(lambda x: parse_json(x, limit=5))
df['cast_clean'] = df['cast'].apply(lambda x: parse_json(x, limit=5))
df['director'] = df['crew'].apply(get_director)

df[['title', 'genres_clean', 'cast_clean', 'director']].head(3)

JSON columns parsed!


Unnamed: 0,title,genres_clean,cast_clean,director
0,Avatar,Action Adventure Fantasy ScienceFiction,SamWorthington ZoeSaldana SigourneyWeaver Step...,JamesCameron
1,Pirates of the Caribbean: At World's End,Adventure Fantasy Action,JohnnyDepp OrlandoBloom KeiraKnightley Stellan...,GoreVerbinski
2,Spectre,Action Adventure Crime,DanielCraig ChristophWaltz LéaSeydoux RalphFie...,SamMendes


### 3.4 Feature Engineering

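Extract the release year (set to 0 when the date is missing or cannot be parsed), compute profit as revenue minus budget, and bucket each movie into one of four profit categories.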

In [None]:
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
df['release_year'] = df['release_date'].dt.year.fillna(0).astype(int)
df['profit'] = df['revenue'] - df['budget']

def profit_category(p):
    if p > 100_000_000:
        return 'Blockbuster'
    elif p > 0:
        return 'Profitable'
    elif p == 0:
        return 'Break-even'
    else:
        return 'Loss'

df['profit_category'] = df['profit'].apply(profit_category)

Feature engineering complete!


### 3.5 Create Movie Profile

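Combine genres, keywords, cast, and director into a single text profile for content-based filtering. Movies whose profile is empty are dropped, which reduces the dataset from 4,803 to 4,790 rows. Missing `overview` values are filled with empty strings, although `overview` is not part of the profile.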

In [355]:
df['overview'] = df['overview'].fillna('')
df['profile'] = df['genres_clean'] + ' ' + df['keywords_clean'] + ' ' + df['cast_clean'] + ' ' + df['director']

df = df[df['profile'].str.strip() != ''].reset_index(drop=True)

print(f'Final dataset: {len(df)} movies')
print(f'\nSample profile for "{df.loc[0, "title"]}":')
print(df.loc[0, 'profile'][:100] + '...')

Final dataset: 4790 movies

Sample profile for "Avatar":
Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society SamWorthing...


## 4. Exploratory Data Analysis (EDA)

Visualize key patterns and distributions in the movie dataset.

### 4.1 Movies Released Per Year

In [356]:
yearly = df[df['release_year'] > 1900].groupby('release_year').size().reset_index(name='count')

fig = px.line(yearly, x='release_year', y='count',
              title='Movie Releases Per Year',
              labels={'release_year': 'Year', 'count': 'Number of Movies'},
              template='plotly_white')
fig.update_traces(line=dict(color='#636EFA', width=2))
fig.show()

### 4.2 Top 15 Genres

In [357]:
genres = df['genres_clean'].str.split().explode().value_counts().head(15)

fig = px.bar(x=genres.index, y=genres.values,
             title='Top 15 Movie Genres',
             labels={'x': 'Genre', 'y': 'Count'},
             template='plotly_white',
             color=genres.values,
             color_continuous_scale='Viridis')
fig.update_layout(showlegend=False)
fig.show()

### 4.3 Top 10 Highest Rated Movies

Movies with at least 1,000 votes.

In [358]:
top_rated = df[df['vote_count'] >= 1000].nlargest(10, 'vote_average')

fig = px.bar(top_rated, x='vote_average', y='title',
             orientation='h',
             title='Top 10 Highest Rated Movies (min 1,000 votes)',
             labels={'vote_average': 'Rating', 'title': 'Movie'},
             template='plotly_white',
             color='vote_average',
             color_continuous_scale='RdYlGn')
fig.update_layout(yaxis=dict(autorange='reversed'), showlegend=False)
fig.show()

### 4.4 Budget vs Revenue

Scatter plot showing the relationship between budget and revenue.

In [359]:
df_budget = df[(df['budget'] > 0) & (df['revenue'] > 0)]

fig = px.scatter(df_budget, x='budget', y='revenue',
                 color='profit_category',
                 hover_data=['title'],
                 title='Budget vs Revenue',
                 template='plotly_white',
                 opacity=0.7)
fig.add_trace(go.Scatter(x=[0, df_budget['budget'].max()],
                          y=[0, df_budget['budget'].max()],
                          mode='lines', name='Break-even',
                          line=dict(dash='dash', color='gray')))
fig.show()

## 5. Model Building

Build the content-based filtering model using TF-IDF vectors and K-Means clustering.

### 5.1 TF-IDF Vectorization

Convert movie profiles into numerical vectors using Term Frequency-Inverse Document Frequency.

In [360]:
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix = tfidf.fit_transform(df['profile'])

print(f'TF-IDF Matrix Shape: {tfidf_matrix.shape}')
print(f'   - {tfidf_matrix.shape[0]:,} movies')
print(f'   - {tfidf_matrix.shape[1]:,} features')

TF-IDF Matrix Shape: (4790, 5000)
   - 4,790 movies
   - 5,000 features
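
As a rough illustration of what the vectorizer computes, the sketch below builds TF-IDF weights by hand for three made-up profiles. It assumes scikit-learn's default settings (smoothed IDF and L2 normalization) and uses plain whitespace splitting. The profiles and the `tfidf_vector` helper are hypothetical.

```python
import math
from collections import Counter

# Three made-up profiles in the same style as the `profile` column.
docs = [
    "Action Adventure JamesCameron",
    "Action Thriller JamesCameron",
    "Drama Romance",
]
tokenized = [d.lower().split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many profiles each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} dict for one profile."""
    counts = Counter(tokens)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t, c in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in sorted(vec.items())})
```

Terms that occur in fewer profiles, such as `adventure` here, receive a larger weight than terms shared across profiles, such as `action`. The real `TfidfVectorizer` differs in tokenization: its default token pattern also splits on punctuation and drops one-character tokens.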


### 5.2 K-Means Clustering

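Group similar movies into 25 clusters. At recommendation time, similarity is computed only against movies in the query's cluster, which keeps the search small but means titles in other clusters are never returned.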

In [370]:
kmeans = KMeans(n_clusters=25, random_state=42, n_init=10)
df['cluster'] = kmeans.fit_predict(tfidf_matrix)

cluster_dist = df['cluster'].value_counts().sort_index()
print(f'Cluster sizes: min={cluster_dist.min()}, max={cluster_dist.max()}, avg={cluster_dist.mean():.0f}')

Cluster sizes: min=33, max=761, avg=192
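
For intuition about what the clustering step does, here is a minimal sketch of the basic assign-and-update loop (Lloyd's algorithm) on a handful of 2-D points. It is a simplification: the notebook's `KMeans` runs on 5,000-dimensional TF-IDF vectors and picks its own starting centroids over several restarts, whereas the `kmeans` function, the points, and the starting centroids below are all made up for illustration.

```python
import math

def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd's algorithm with caller-supplied starting centroids."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return labels, centroids

points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (1.0, 1.0)])
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)
```

Nothing in this procedure forces clusters to be of similar size, which is consistent with the uneven sizes reported above (33 to 761 movies).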


### 5.3 Cluster Distribution

In [362]:
fig = px.bar(x=cluster_dist.index, y=cluster_dist.values,
             title='Movies Per Cluster',
             labels={'x': 'Cluster ID', 'y': 'Number of Movies'},
             template='plotly_white',
             color=cluster_dist.values,
             color_continuous_scale='Turbo')
fig.update_layout(showlegend=False)
fig.show()

### 5.4 PCA Visualization

2D visualization of movie clusters using Principal Component Analysis.

In [363]:
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(tfidf_matrix.toarray())

df['pca_x'], df['pca_y'] = coords[:, 0], coords[:, 1]

fig = px.scatter(df, x='pca_x', y='pca_y', color='cluster',
                 hover_data=['title', 'genres_clean'],
                 title='Movie Clusters (PCA Visualization)',
                 template='plotly_white',
                 opacity=0.6)
fig.show()

## 6. Recommendation Engine

Build the `MovieRecommender` class with:
- Fuzzy movie search
- Content-based recommendations using cosine similarity
- Hybrid recommendations combining content and popularity
- Genre filtering option

In [None]:
class MovieRecommender:
    """Content-based movie recommendation system."""
    
    def __init__(self, df, tfidf_matrix):
        self.df = df
        self.tfidf_matrix = tfidf_matrix
        self.titles = df['title'].str.lower().tolist()
    
    def find_movie(self, title):
        """Find movie using fuzzy matching."""
        t = title.lower().strip()
        if t in self.titles:
            idx = self.titles.index(t)
            return self.df.iloc[idx]['title'], idx
        matches = get_close_matches(t, self.titles, n=1, cutoff=0.6)
        if matches:
            idx = self.titles.index(matches[0])
            return self.df.iloc[idx]['title'], idx
        return None, None
    
    def recommend(self, title, top_n=10, genre=None, hybrid=False):
        """Get movie recommendations."""
        found, idx = self.find_movie(title)
        if not found:
            suggestions = get_close_matches(title.lower(), self.titles, n=5, cutoff=0.3)
            if suggestions:
                return f"Movie not found. Did you mean: {suggestions}?"
            return "Movie not found."
        
        print(f"🎬 Recommendations for: '{found}'")
        
        cluster = self.df.loc[idx, 'cluster']
        cluster_df = self.df[self.df['cluster'] == cluster].copy()
        
        sim = cosine_similarity(self.tfidf_matrix[idx], self.tfidf_matrix[cluster_df.index]).flatten()
        cluster_df['similarity'] = sim
        
        if hybrid:
            pop = cluster_df['vote_count'] * cluster_df['vote_average']
            cluster_df['hybrid'] = 0.7 * cluster_df['similarity'] + 0.3 * (pop / pop.max())
            sort_col = 'hybrid'
        else:
            sort_col = 'similarity'
        
        if genre:
            cluster_df = cluster_df[cluster_df['genres_clean'].str.contains(genre, case=False, na=False)]
        
        result = cluster_df[cluster_df.index != idx].nlargest(top_n, sort_col)
        cols = ['title', 'genres_clean', 'director', 'release_year', 'vote_average', 'similarity']
        if hybrid:
            cols.append('hybrid')
        return result[cols].round(3)

recommender = MovieRecommender(df, tfidf_matrix)

MovieRecommender initialized!
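
The fuzzy lookup in `find_movie` relies on `difflib.get_close_matches`, which ranks candidates by `SequenceMatcher` ratio and discards those below `cutoff`. A small sketch with a hypothetical three-title list shows how the 0.6 cutoff behaves:

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical lowercase title list; the notebook uses every title in the dataset.
titles = ["avatar", "avalon", "the avengers"]

# Ratio of each candidate against the misspelled query.
for candidate in titles:
    ratio = SequenceMatcher(None, candidate, "avtar").ratio()
    print(f"{candidate!r}: {ratio:.3f}")

print(get_close_matches("avtar", titles, n=1, cutoff=0.6))  # ['avatar']
print(get_close_matches("zzzz", titles, n=1, cutoff=0.6))   # []
```

Only `'avatar'` clears the cutoff for the query `'avtar'`. One design consequence of the class above: because it resolves a title with `list.index`, two movies sharing the same lowercase title would always resolve to the first one in the DataFrame.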


## 7. Demonstration

Test the recommendation system with various queries.

### 7.1 Basic Recommendation

In [365]:
recommender.recommend('Avatar', top_n=10)

🎬 Recommendations for: 'Avatar'


Unnamed: 0,title,genres_clean,director,release_year,vote_average,similarity
4398,The Helix... Loaded,Action Comedy ScienceFiction,,2005,4.8,0.163
4721,Echo Dr.,Thriller Action Drama ScienceFiction,PatrickRyanSims,2013,5.0,0.148
279,Terminator 2: Judgment Day,Action Thriller ScienceFiction,JamesCameron,1991,7.7,0.148
56,Star Trek Beyond,Action Adventure ScienceFiction,JustinLin,2016,6.6,0.144
1395,Resident Evil,Horror Action ScienceFiction,PaulW.S.Anderson,2002,6.4,0.14
657,Resident Evil: Retribution,Action Horror ScienceFiction,PaulW.S.Anderson,2012,5.6,0.139
322,The Fifth Element,Adventure Fantasy Action Thriller ScienceFiction,LucBesson,1997,7.3,0.138
47,Star Trek Into Darkness,Action Adventure ScienceFiction,J.J.Abrams,2013,7.4,0.135
4641,The Sticky Fingers of Time,ScienceFiction Drama,HilaryBrougher,1997,4.8,0.133
466,The Time Machine,ScienceFiction Adventure Action,SimonWells,2002,5.8,0.133
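
The `similarity` column is the cosine similarity between TF-IDF vectors. The sketch below computes it by hand for sparse vectors stored as dicts. The weights are invented for illustration and are not taken from the notebook's matrix, and `cosine` is a hypothetical helper rather than scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty profile is similar to nothing
    return dot / (norm_u * norm_v)

# Invented weights, for illustration only.
movie_a = {"action": 0.3, "sciencefiction": 0.4, "jamescameron": 0.6, "spacewar": 0.6}
movie_b = {"action": 0.3, "sciencefiction": 0.4, "jamescameron": 0.6, "cyborg": 0.6}
movie_c = {"romance": 0.7, "drama": 0.7}

print(round(cosine(movie_a, movie_b), 3))  # three shared terms
print(cosine(movie_a, movie_c))            # no shared terms -> 0.0
```

Only shared terms contribute to the dot product, so two profiles that overlap in a few genre tokens but differ in cast and keywords score low. That is one plausible reading of why the top scores in the table above sit around 0.13 to 0.16.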


### 7.2 Hybrid Recommendation

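The hybrid score is 0.7 × cosine similarity + 0.3 × normalized popularity, where popularity is `vote_count * vote_average` divided by the largest such value in the cluster.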

In [366]:
recommender.recommend('The Dark Knight', top_n=10, hybrid=True)

🎬 Recommendations for: 'The Dark Knight'


Unnamed: 0,title,genres_clean,director,release_year,vote_average,similarity,hybrid
3,The Dark Knight Rises,Action Crime Drama Thriller,ChristopherNolan,2012,7.6,0.534,0.585
119,Batman Begins,Action Crime Drama,ChristopherNolan,2005,7.5,0.535,0.543
1881,The Shawshank Redemption,Drama Crime,FrankDarabont,1994,8.5,0.022,0.228
3337,The Godfather,Drama Crime,FrancisFordCoppola,1972,8.4,0.022,0.167
4323,The Raid,Action Thriller Crime,GarethEvans,2011,7.3,0.191,0.157
519,Now You See Me,Thriller Crime,LouisLeterrier,2013,7.3,0.028,0.142
210,Batman & Robin,Action Crime Fantasy,JoelSchumacher,1997,4.2,0.175,0.141
739,London Has Fallen,Action Crime Thriller,BabakNajafi,2016,5.8,0.154,0.137
3257,American Psycho,Thriller Drama Crime,MaryHarron,2000,7.3,0.127,0.135
3714,Exiled,Action Crime Thriller,JohnnieTo,2006,7.0,0.191,0.135
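
To see how the 70/30 weighting trades similarity against popularity, here is a small worked sketch. The candidate rows and the `hybrid_scores` helper are hypothetical. The formula mirrors the one in `recommend`, with an added guard for the case where every popularity value is zero.

```python
def hybrid_scores(candidates, w_sim=0.7, w_pop=0.3):
    """Rank candidates by w_sim * similarity + w_pop * (popularity / max popularity)."""
    pops = [c["vote_count"] * c["vote_average"] for c in candidates]
    max_pop = max(pops)
    scored = []
    for c, pop in zip(candidates, pops):
        pop_norm = pop / max_pop if max_pop > 0 else 0.0  # guard against all-zero votes
        score = w_sim * c["similarity"] + w_pop * pop_norm
        scored.append((c["title"], round(score, 3)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up candidates, not rows from the dataset.
candidates = [
    {"title": "Close match, few votes", "similarity": 0.50, "vote_count": 100, "vote_average": 6.0},
    {"title": "Weak match, many votes", "similarity": 0.05, "vote_count": 8000, "vote_average": 8.5},
    {"title": "Moderate match", "similarity": 0.20, "vote_count": 2000, "vote_average": 7.0},
]

for title, score in hybrid_scores(candidates):
    print(f"{score:.3f}  {title}")
```

A heavily voted title with almost no content overlap can outrank a moderately similar one. The table above shows the same effect: The Shawshank Redemption has a similarity of 0.022 yet ranks third with a hybrid score of 0.228.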


### 7.3 Genre Filter

Filter recommendations by genre.

In [367]:
recommender.recommend('Avatar', genre='Action')

🎬 Recommendations for: 'Avatar'


Unnamed: 0,title,genres_clean,director,release_year,vote_average,similarity
4398,The Helix... Loaded,Action Comedy ScienceFiction,,2005,4.8,0.163
4721,Echo Dr.,Thriller Action Drama ScienceFiction,PatrickRyanSims,2013,5.0,0.148
279,Terminator 2: Judgment Day,Action Thriller ScienceFiction,JamesCameron,1991,7.7,0.148
56,Star Trek Beyond,Action Adventure ScienceFiction,JustinLin,2016,6.6,0.144
1395,Resident Evil,Horror Action ScienceFiction,PaulW.S.Anderson,2002,6.4,0.14
657,Resident Evil: Retribution,Action Horror ScienceFiction,PaulW.S.Anderson,2012,5.6,0.139
322,The Fifth Element,Adventure Fantasy Action Thriller ScienceFiction,LucBesson,1997,7.3,0.138
47,Star Trek Into Darkness,Action Adventure ScienceFiction,J.J.Abrams,2013,7.4,0.135
466,The Time Machine,ScienceFiction Adventure Action,SimonWells,2002,5.8,0.133
2444,Damnation Alley,Action Adventure ScienceFiction,JackSmight,1977,5.0,0.129
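
The genre filter uses `str.contains`, which matches substrings (and treats the pattern as a regular expression by default). A stricter alternative is to compare whole genre tokens. The sketch below shows the difference. `has_genre` is a hypothetical helper, not part of the notebook.

```python
def has_genre(genres_clean, genre):
    """True if `genre` equals one of the whitespace-separated genre tokens."""
    wanted = genre.replace(" ", "").lower()
    return wanted in genres_clean.lower().split()

row = "Thriller Drama ScienceFiction"

print("fiction" in row.lower())           # True: substring match, as with str.contains
print(has_genre(row, "Fiction"))          # False: no genre token is exactly 'fiction'
print(has_genre(row, "Science Fiction"))  # True: spaces are removed before comparing
```

For a query such as `genre='Action'` both approaches agree. They differ only when the filter text is a fragment of a longer genre name.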


### 7.4 Fuzzy Search

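A misspelled title such as 'Avtar' is resolved to the closest known title with `get_close_matches` (cutoff 0.6) before recommendations are computed.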

In [368]:
recommender.recommend('Avtar', top_n=5)

🎬 Recommendations for: 'Avatar'


Unnamed: 0,title,genres_clean,director,release_year,vote_average,similarity
4398,The Helix... Loaded,Action Comedy ScienceFiction,,2005,4.8,0.163
4721,Echo Dr.,Thriller Action Drama ScienceFiction,PatrickRyanSims,2013,5.0,0.148
279,Terminator 2: Judgment Day,Action Thriller ScienceFiction,JamesCameron,1991,7.7,0.148
56,Star Trek Beyond,Action Adventure ScienceFiction,JustinLin,2016,6.6,0.144
1395,Resident Evil,Horror Action ScienceFiction,PaulW.S.Anderson,2002,6.4,0.14


## 8. Summary & Conclusions

### System Built
- Content-based movie recommendation using TF-IDF + K-Means + Cosine Similarity
- Hybrid recommendations combining content and popularity
- Fuzzy title search and optional genre filtering

### Dataset
- TMDB 5000 Movies dataset
- Features used: genres, keywords, cast, director

### Future Improvements
- Add collaborative filtering using user ratings
- Implement deep learning embeddings
- Build web interface

### Final Statistics

In [369]:
print('FINAL STATISTICS')
print('-' * 50)
print(f'Total movies: {len(df):,}')
print(f'Number of clusters: {df["cluster"].nunique()}')
print(f'TF-IDF features: {tfidf_matrix.shape[1]:,}')
print(f'Avg movies per cluster: {len(df) // df["cluster"].nunique()}')
print('\nRecommendation system ready!')

FINAL STATISTICS
--------------------------------------------------
Total movies: 4,790
Number of clusters: 25
TF-IDF features: 5,000
Avg movies per cluster: 191

Recommendation system ready!
