# Movie Recommendation System - Exploratory Data Analysis (EDA)

This notebook analyzes the TMDB 5000 Movie Dataset to understand the distribution of genres, keywords, and other features used in the recommendation engine. The analysis helps validate the feature choices behind content-based filtering.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import ast
import sys
import os
import kagglehub
from collections import Counter

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Add the project root (parent directory) to the import path so src can be imported
sys.path.append(os.path.abspath(os.path.join('..')))
%matplotlib inline

In [None]:
# Load raw data
path = kagglehub.dataset_download("tmdb/tmdb-movie-metadata")
movies = pd.read_csv(f"{path}/tmdb_5000_movies.csv")
credits = pd.read_csv(f"{path}/tmdb_5000_credits.csv")

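# Merge on the movie ID rather than the title: titles are not unique,
# so a title join can duplicate rows for films that share a name
movies = movies.merge(credits.drop(columns='title'), left_on='id', right_on='movie_id')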

movies.head(2)
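
The choice of merge key matters here: joining on `title` can duplicate rows when two different films share a title, whereas joining on the numeric ID cannot. The sketch below uses made-up toy rows (not the TMDB data) to show the effect. The `id` and `movie_id` column names mirror the two CSV files but should be checked against your copy of the dataset.

```python
import pandas as pd

# Toy frames: two distinct films share the title "Remake".
toy_movies = pd.DataFrame({"id": [1, 2, 3],
                           "title": ["Remake", "Remake", "Solo"]})
toy_credits = pd.DataFrame({"movie_id": [1, 2, 3],
                            "title": ["Remake", "Remake", "Solo"],
                            "cast": ["[]", "[]", "[]"]})

# Joining on title pairs every "Remake" with every other "Remake".
by_title = toy_movies.merge(toy_credits, on="title")

# Joining on the ID keeps one row per film; validate raises if keys repeat.
by_id = toy_movies.merge(toy_credits.drop(columns="title"),
                         left_on="id", right_on="movie_id",
                         validate="one_to_one")

print(len(toy_movies), len(by_title), len(by_id))  # 3 5 3
```

Comparing `len(movies)` before and after the merge is a quick way to catch this kind of row inflation on the real data.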

## 1. Genre Distribution
Counting the most common genres shows which kinds of movies dominate the dataset. A movie can carry several genres, so the counts sum to more than the number of movies.

In [None]:
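def parse_json_col(x):
    """Return the 'name' field of every object in a JSON-like list string."""
    try:
        return [i['name'] for i in ast.literal_eval(x)]
    except (ValueError, SyntaxError, TypeError, KeyError):
        # Missing or malformed cells yield an empty list
        return []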

# Parse Genres
movies['parsed_genres'] = movies['genres'].apply(parse_json_col)
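# Flatten with a comprehension; sum(lists, []) copies the list at every step
all_genres = [genre for genres in movies['parsed_genres'] for genre in genres]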

genre_counts = pd.Series(all_genres).value_counts().head(20)

plt.figure(figsize=(12, 6))
sns.barplot(x=genre_counts.values, y=genre_counts.index, palette='viridis')
plt.title('Top 20 Movie Genres')
plt.xlabel('Count')
plt.show()
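
Because genres are multi-label, raw counts are easier to read as the share of movies that carry each genre. The following is a minimal sketch on invented rows. `parse_names` is a hypothetical stand-in for `parse_json_col`, and the genre strings only imitate the dataset's format.

```python
import ast
import pandas as pd

# Toy column imitating the JSON-like genre strings, including bad cells.
toy = pd.DataFrame({"genres": [
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[{"id": 28, "name": "Action"}]',
    '[]',
    None,
]})

def parse_names(cell):
    """Hypothetical helper: list of 'name' values, or [] for bad cells."""
    try:
        return [item["name"] for item in ast.literal_eval(cell)]
    except (ValueError, SyntaxError, TypeError, KeyError):
        return []

parsed = toy["genres"].apply(parse_names)

# explode() gives one row per (movie, genre); empty lists become NaN,
# which value_counts() drops by default.
counts = parsed.explode().value_counts()

# Share of movies carrying each genre (shares need not sum to 1).
share = counts / len(toy)
print(share)
```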

## 2. Vote Average Distribution
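Checking the distribution of ratings. A roughly symmetric, bell-shaped histogram suggests ratings are spread evenly around a typical value, while strong skew or an isolated spike (for example at 0, where titles with no votes may land) points to a bias worth handling before relying on the ratings.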

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(movies['vote_average'], bins=30, kde=True, color='blue')
plt.title('Distribution of Vote Averages')
plt.xlabel('Vote Average')
plt.show()
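
The shape seen in the histogram can also be put into a number. This sketch uses made-up ratings and assumes unrated titles are stored as 0, which should be verified on the real data (for example by checking `vote_count`). It compares the sample skewness with and without those zero entries.

```python
import pandas as pd

# Invented ratings: one unrated title stored as 0, the rest between 5.8 and 7.4.
ratings = pd.Series([0.0, 5.8, 6.1, 6.3, 6.5, 6.9, 7.2, 7.4])

# Series.skew() returns the sample skewness (negative means a long left tail).
skew_all = ratings.skew()

# Drop the zero placeholder and measure again.
rated_only = ratings[ratings > 0]
skew_rated = rated_only.skew()

print(f"skew with zeros:    {skew_all:.2f}")
print(f"skew without zeros: {skew_rated:.2f}")
```

A large gap between the two values suggests that the skew comes mostly from placeholder zeros rather than from how audiences actually rate films.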

## 3. Correlation Heatmap
Analyzing the pairwise Pearson correlations between numerical features: **Budget**, **Revenue**, **Popularity**, **Vote Average**, **Vote Count**, and **Runtime**.

In [None]:
cols = ['budget', 'revenue', 'popularity', 'vote_average', 'vote_count', 'runtime']
corr_matrix = movies[cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Feature Correlation Matrix')
plt.show()
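
Pearson correlation is sensitive to extreme values, and monetary columns such as budget and revenue tend to contain a few very large ones. As a hedged cross-check, the sketch below compares Pearson with a rank-based (Spearman) correlation on invented numbers. Spearman is computed here as the Pearson correlation of the ranks.

```python
import pandas as pd

# Invented values: five small films and one blockbuster-sized outlier.
toy = pd.DataFrame({
    "budget":  [1, 2, 3, 4, 5, 300],
    "revenue": [2, 1, 4, 3, 6, 900],
})

# Pearson is dominated by the single outlier pair (300, 900).
pearson = toy["budget"].corr(toy["revenue"])

# Spearman: Pearson correlation of the ranks, so only ordering matters.
spearman = toy["budget"].rank().corr(toy["revenue"].rank())

print(f"pearson:  {pearson:.3f}")
print(f"spearman: {spearman:.3f}")
```

On the real data, the same comparison indicates how much a strong cell in the heatmap depends on a handful of large titles.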

## 4. Top 20 Most Frequent Actors
Since our recommendation system uses the `cast` column as a feature, it's important to know which actors appear most frequently.

In [None]:
movies['parsed_cast'] = movies['cast'].apply(parse_json_col)
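# Flatten with a comprehension; sum(lists, []) is quadratic on long cast lists
all_cast = [name for cast in movies['parsed_cast'] for name in cast]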

cast_counts = pd.Series(all_cast).value_counts().head(20)

plt.figure(figsize=(12, 8))
sns.barplot(x=cast_counts.values, y=cast_counts.index, palette='magma')
plt.title('Top 20 Most Frequent Actors')
plt.xlabel('Number of Movies')
plt.show()
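
The counts above include every credited cast member, so actors with many small roles can rank highly. One possible variation, shown here only as a sketch, counts just the top-billed names per film. `top_billed`, the cutoff `n=2`, and the toy rows are all hypothetical, and the sketch assumes each cast entry has an `order` field giving its billing position.

```python
import ast
from collections import Counter

# Toy cast strings imitating the dataset's format (invented names).
toy_cast = [
    '[{"name": "A", "order": 0}, {"name": "B", "order": 1}, {"name": "C", "order": 2}]',
    '[{"name": "C", "order": 0}, {"name": "A", "order": 1}, {"name": "D", "order": 2}]',
]

def top_billed(cell, n=2):
    """Hypothetical helper: names of the n entries with the lowest 'order'."""
    try:
        entries = sorted(ast.literal_eval(cell), key=lambda e: e["order"])
    except (ValueError, SyntaxError, TypeError, KeyError):
        return []
    return [e["name"] for e in entries[:n]]

# Counts over every credited name versus only the top-billed ones.
full_counts = Counter(e["name"] for cell in toy_cast for e in ast.literal_eval(cell))
lead_counts = Counter(name for cell in toy_cast for name in top_billed(cell))

print(full_counts.most_common())
print(lead_counts.most_common())
```

Whether to restrict the cast this way depends on how the recommendation engine weights actors, which this notebook does not specify.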

## 5. Top Keywords Analysis
Keywords are crucial for content matching. Let's see the most common themes/keywords in the dataset.

In [None]:
movies['parsed_keywords'] = movies['keywords'].apply(parse_json_col)
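# Flatten with a comprehension; sum(lists, []) copies the list at every step
all_keywords = [kw for keywords in movies['parsed_keywords'] for kw in keywords]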

keyword_counts = pd.Series(all_keywords).value_counts().head(20)

plt.figure(figsize=(12, 8))
sns.barplot(x=keyword_counts.values, y=keyword_counts.index, palette='rocket')
plt.title('Top 20 Most Common Keywords')
plt.xlabel('Frequency')
plt.show()
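
The top 20 only covers the head of the distribution. A keyword that appears in a single movie cannot link that movie to any other, so the size of the long tail is worth checking too. The sketch below measures it on invented keyword lists; on the real data, the same lines could be applied to `movies['parsed_keywords']`.

```python
import pandas as pd

# Invented keyword lists, one per movie.
toy_keywords = pd.Series([
    ["based on novel", "revenge"],
    ["based on novel", "heist"],
    ["based on novel", "revenge", "time loop"],
])

# Frequency of each distinct keyword across all movies.
counts = toy_keywords.explode().value_counts()

# Keywords used exactly once contribute nothing to movie-to-movie matching.
singletons = int((counts == 1).sum())
singleton_share = singletons / len(counts)

print(f"{singletons} of {len(counts)} keywords appear once "
      f"({singleton_share:.0%})")
```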