# Movie Data Analysis with PySpark

This Jupyter Notebook uses PySpark to analyze metadata for a fixed list of movies fetched from the TMDB API.

It relies on helper functions defined in `functions.py` to fetch and clean the data, rank movies by key performance indicators (KPIs), run keyword searches, compare franchise and standalone movies, summarize franchises and directors, and generate visualizations.

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession


# Initialize Spark session
spark = SparkSession.builder \
    .appName("MovieDataAnalysis") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

In [None]:
from functions import (
    build_schema,                  # Define the Spark schema for the movie data
    fetch_movie_data,              # Fetch movie metadata from the TMDB API
    clean_movie_data,              # Clean and transform the raw DataFrame
    kpi_ranking,                   # Rank movies by a chosen metric
    advanced_search,               # Filter movies by genre, cast, or director keywords
    franchise_vs_standalone,       # Compare franchise and standalone movies
    analyze_franchise,             # Aggregate performance by franchise
    analyze_directors,             # Aggregate performance by director
    plot_revenue_vs_budget,        # Scatter plot of revenue vs. budget
    plot_roi_by_genre,             # Bar plot of average ROI by genre
    plot_popularity_vs_rating,     # Scatter plot of popularity vs. rating
    plot_yearly_box_office,        # Line plot of total revenue by release year
    plot_franchise_vs_standalone,  # Bar plot comparing franchise and standalone means
)

## Fetching Raw Movie Data

- **Define movie IDs:** Creates a list of specific TMDB movie IDs to analyze.

- **Build schema:** Uses `build_schema()` to define the structure of the DataFrame.

- **Fetch data:** Uses `fetch_movie_data()` to collect metadata for each movie from the TMDB API.

- **Preview data:** Displays the schema and first few records of the fetched raw data.
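
The fetch step can be pictured roughly as follows. This is a minimal sketch, not the code in `functions.py`: `build_movie_url` and `fetch_all` are hypothetical helper names, the API key is a placeholder, and the HTTP call is injected as a function so the example runs offline. It assumes the TMDB v3 movie details endpoint and shows how an ID that the API does not resolve (the leading `0` in the ID list looks like such a case) can be skipped instead of aborting the whole run.

```python
from urllib.parse import urlencode

TMDB_BASE = "https://api.themoviedb.org/3/movie"  # TMDB v3 movie details endpoint


def build_movie_url(movie_id, api_key):
    """Build the details URL for one movie, requesting credits in the same call."""
    query = urlencode({"api_key": api_key, "append_to_response": "credits"})
    return f"{TMDB_BASE}/{movie_id}?{query}"


def fetch_all(movie_ids, get_json, api_key="YOUR_API_KEY"):
    """Fetch each ID with `get_json(url)`; collect IDs for which it returns None."""
    records, skipped = [], []
    for movie_id in movie_ids:
        payload = get_json(build_movie_url(movie_id, api_key))
        if payload is None:
            skipped.append(movie_id)
        else:
            records.append(payload)
    return records, skipped


# Offline stand-in for an HTTP client (hypothetical): only ID 597 resolves here.
def fake_get_json(url):
    return {"id": 597} if "/movie/597?" in url else None


records, skipped = fetch_all([0, 597], fake_get_json)
print(records)
print(skipped)
```

Injecting `get_json` keeps the URL-building and skip logic testable without network access; a real client would replace `fake_get_json` with a function that performs the request and returns `None` on a failed lookup.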

In [None]:
# Define movie IDs
movie_ids = [0, 299534, 19995, 140607, 299536, 597, 135397,
             420818, 24428, 168259, 99861, 284054, 12445,
             181808, 330457, 351286, 109445, 321612, 260513]

# Build schema
schema = build_schema()

# Fetch movie data
raw_data = fetch_movie_data(movie_ids, schema)

# Display schema
print("Raw Data Schema:")
raw_data.printSchema()


# Display first few rows
print("\nSample Raw Data:")
raw_data.show(5, truncate=False)

## Cleaning the Movie Dataset

- **Clean raw data:** Applies `clean_movie_data()` to standardize column names, process JSON-like fields, and convert data types.

- **Handle zeros:** Replaces zero values in budget, revenue, and runtime with NULL.

- **Calculate metrics:** Adds financial metrics like profit and ROI (return on investment).

- **Standardize fields:** Unifies genre and production country formats for consistency.

- **Preview cleaned data:** Displays the schema and first few records of the cleaned dataset.
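
The zero-handling and metric steps above can be illustrated on a single record. This is a plain-Python sketch rather than the PySpark code in `functions.py`; `zero_to_none` and `add_financials` are hypothetical names, and the formulas (profit as revenue minus budget, ROI as revenue divided by budget, money expressed in millions) are assumptions about what `clean_movie_data()` computes.

```python
def zero_to_none(value):
    """Treat 0 (or a missing value) as unknown."""
    return None if not value else value


def add_financials(movie):
    """Return a copy with budget/revenue in millions plus profit and ROI.

    Assumption: profit = revenue - budget, ROI = revenue / budget.
    """
    budget = zero_to_none(movie.get("budget"))
    revenue = zero_to_none(movie.get("revenue"))
    cleaned = dict(movie)
    cleaned["budget_millions"] = budget / 1e6 if budget is not None else None
    cleaned["revenue_millions"] = revenue / 1e6 if revenue is not None else None
    # Both metrics need both inputs; otherwise they stay unknown.
    both_known = budget is not None and revenue is not None
    cleaned["profit"] = (revenue - budget) / 1e6 if both_known else None
    cleaned["roi"] = revenue / budget if both_known else None
    return cleaned


example = add_financials({"title": "A", "budget": 50_000_000, "revenue": 200_000_000})
unknown = add_financials({"title": "B", "budget": 0, "revenue": 10_000_000})
print(example["profit"], example["roi"])
print(unknown["profit"], unknown["roi"])
```

Converting zeros to NULL before computing ROI matters because a zero budget would otherwise cause a division by zero or a meaningless ratio.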

In [None]:
# Clean the raw data
cleaned_data = clean_movie_data(raw_data)

# Cache the cleaned DataFrame for reuse; Spark materializes the cache on the first action
cleaned_data.cache()

# Display schema
print("Cleaned Data Schema:")
cleaned_data.printSchema()

# Display first few rows
print("\nSample Cleaned Data:")
cleaned_data.show(5, truncate=False)

## Key Performance Indicators (KPIs)

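- **Rank movies:** Uses `kpi_ranking()` to list the top or bottom 5 movies by revenue, budget, profit, ROI, vote count, rating, and popularity.

- **Filter and rank:** For ROI and rating, first keeps only movies that meet a minimum threshold (a budget of at least 10 million, or at least 10 votes) so that tiny budgets or vote counts do not distort the ranking.

- **Preview a genre search:** Ends with an `advanced_search()` query for Action movies sorted by revenue.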
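
The ranking logic can be sketched in plain Python. `rank_movies` is a hypothetical stand-in that mirrors the arguments passed to `kpi_ranking()` in this notebook; the `>=` threshold follows the printed labels (for example "Budget >= 10M"), and dropping rows with a missing metric is an assumption.

```python
def rank_movies(movies, metric, n=5, top=True, filter_col=None, filter_val=None):
    """Return the n best (or worst) movies by `metric`, ignoring missing values."""
    rows = [m for m in movies if m.get(metric) is not None]
    if filter_col is not None:
        # Keep only rows whose filter column meets the minimum threshold.
        rows = [m for m in rows
                if m.get(filter_col) is not None and m[filter_col] >= filter_val]
    return sorted(rows, key=lambda m: m[metric], reverse=top)[:n]


movies = [
    {"title": "A", "roi": 12.0, "budget_millions": 2},    # tiny budget, huge ROI
    {"title": "B", "roi": 4.0, "budget_millions": 150},
    {"title": "C", "roi": 2.5, "budget_millions": 60},
    {"title": "D", "roi": None, "budget_millions": 80},   # ROI unknown
]
best = rank_movies(movies, "roi", n=2, top=True,
                   filter_col="budget_millions", filter_val=10)
print([m["title"] for m in best])
```

Without the budget filter, movie "A" would lead the ROI ranking purely because its budget is very small.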

In [None]:
print("\nTop 5 Movies by Revenue:")
kpi_ranking(cleaned_data, 'revenue_millions', n=5, top=True).select(
    'title', 'revenue_millions', 'budget_millions'
).show(truncate=False)

print("\nTop 5 Movies by Budget:")
kpi_ranking(cleaned_data, 'budget_millions', n=5, top=True).select(
    'title', 'budget_millions', 'revenue_millions'
).show(truncate=False)

print("\nTop 5 Movies by Profit:")
kpi_ranking(cleaned_data, 'profit', n=5, top=True).select(
    'title', 'profit', 'revenue_millions', 'budget_millions'
).show(truncate=False)

print("\nBottom 5 Movies by Profit:")
kpi_ranking(cleaned_data, 'profit', n=5, top=False).select(
    'title', 'profit', 'revenue_millions', 'budget_millions'
).show(truncate=False)

print("\nTop 5 Movies by ROI (Budget >= 10M):")
kpi_ranking(cleaned_data, 'roi', n=5, top=True, filter_col='budget_millions', filter_val=10).select(
    'title', 'roi', 'revenue_millions', 'budget_millions'
).show(truncate=False)

print("\nBottom 5 Movies by ROI (Budget >= 10M):")
kpi_ranking(cleaned_data, 'roi', n=5, top=False, filter_col='budget_millions', filter_val=10).select(
    'title', 'roi', 'revenue_millions', 'budget_millions'
).show(truncate=False)

print("\nTop 5 Most Voted Movies:")
kpi_ranking(cleaned_data, 'vote_count', n=5, top=True).select(
    'title', 'vote_count', 'vote_average'
).show(truncate=False)

print("\nTop 5 Highest Rated Movies (Vote Count >= 10):")
kpi_ranking(cleaned_data, 'vote_average', n=5, top=True, filter_col='vote_count', filter_val=10).select(
    'title', 'vote_average', 'vote_count'
).show(truncate=False)

print("\nBottom 5 Lowest Rated Movies (Vote Count >= 10):")
kpi_ranking(cleaned_data, 'vote_average', n=5, top=False, filter_col='vote_count', filter_val=10).select(
    'title', 'vote_average', 'vote_count'
).show(truncate=False)

print("\nTop 5 Most Popular Movies:")
kpi_ranking(cleaned_data, 'popularity', n=5, top=True).select(
    'title', 'popularity', 'vote_average'
).show(truncate=False)

print("\nMovies with Action Genre:")
advanced_search(cleaned_data, genre_keywords='Action', sort_by='revenue_millions').select(
    'title', 'genre_names', 'revenue_millions'
).show(truncate=False)

## Advanced Search

- **Filter movies:** Uses `advanced_search()` to find movies based on genre, cast, or director keywords.

- **Sort results:** Sorts the filtered results by a specified metric (e.g., revenue).
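
One way to picture the keyword matching is the plain-Python sketch below. `matches_all` and `search_movies` are hypothetical names, and the semantics are assumptions: each `|`-separated keyword must appear in the field (so `'Science Fiction|Action'` means both genres), and matching is case-insensitive. Note that if `advanced_search()` instead passes the keyword string to a regular expression, `|` would act as alternation and match either genre.

```python
def matches_all(text, keywords):
    """True if every '|'-separated keyword occurs in `text` (case-insensitive)."""
    if keywords is None:
        return True          # no constraint on this field
    if text is None:
        return False         # constraint given but the field is missing
    haystack = text.lower()
    return all(k.strip().lower() in haystack for k in keywords.split("|"))


def search_movies(movies, genre_keywords=None, cast_keywords=None,
                  director_keywords=None, sort_by=None, ascending=True):
    """Filter on genre, cast, and director keywords, then optionally sort."""
    hits = [m for m in movies
            if matches_all(m.get("genre_names"), genre_keywords)
            and matches_all(m.get("cast_names"), cast_keywords)
            and matches_all(m.get("director"), director_keywords)]
    if sort_by is not None:
        # Assumption: rows with a missing sort key are dropped.
        hits = [m for m in hits if m.get(sort_by) is not None]
        hits.sort(key=lambda m: m[sort_by], reverse=not ascending)
    return hits


movies = [
    {"title": "X", "genre_names": "Action|Science Fiction",
     "cast_names": "Actor One|Actor Two", "vote_average": 7.1},
    {"title": "Y", "genre_names": "Action|Drama",
     "cast_names": "Actor One", "vote_average": 8.0},
    {"title": "Z", "genre_names": "Science Fiction|Action|Adventure",
     "cast_names": "Actor One|Actor Three", "vote_average": 7.9},
]
hits = search_movies(movies, genre_keywords="Science Fiction|Action",
                     cast_keywords="Actor One", sort_by="vote_average",
                     ascending=False)
print([m["title"] for m in hits])
```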

In [None]:
print("\nBest-Rated Science Fiction Action Movies Starring Bruce Willis:")
advanced_search(
    cleaned_data,
    genre_keywords='Science Fiction|Action',
    cast_keywords='Bruce Willis',
    sort_by='vote_average',
    ascending=False
).select(
    'title', 'genre_names', 'vote_average', 'cast_names'
).show(truncate=False)

print("\nMovies Starring Uma Thurman Directed by Quentin Tarantino:")
advanced_search(
    cleaned_data,
    cast_keywords='Uma Thurman',
    director_keywords='Quentin Tarantino',
    sort_by='runtime',
    ascending=True
).select(
    'title', 'cast_names', 'director', 'runtime'
).show(truncate=False)

## Franchise vs. Standalone Comparison

- **Compare groups:** Uses `franchise_vs_standalone()` to compute mean revenue, ROI, budget, popularity, and rating for franchise and standalone movies.

- **Display results:** Shows a comparison table with aggregated metrics.
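
The comparison amounts to splitting the movies into two groups and averaging each metric per group. The sketch below shows this in plain Python; `compare_groups` is a hypothetical helper, and the column name `collection_name` (non-empty for franchise movies) is an assumption about the cleaned schema.

```python
from statistics import mean


def compare_groups(movies, metrics):
    """Mean of each metric for franchise vs. standalone movies, skipping missing values."""
    groups = {"Franchise": [], "Standalone": []}
    for m in movies:
        key = "Franchise" if m.get("collection_name") else "Standalone"
        groups[key].append(m)
    summary = {}
    for name, rows in groups.items():
        summary[name] = {"movie_count": len(rows)}
        for metric in metrics:
            values = [r[metric] for r in rows if r.get(metric) is not None]
            summary[name][f"mean_{metric}"] = mean(values) if values else None
    return summary


movies = [
    {"title": "A", "collection_name": "Saga", "revenue_millions": 1000.0, "roi": 5.0},
    {"title": "B", "collection_name": "Saga", "revenue_millions": 800.0, "roi": None},
    {"title": "C", "collection_name": None, "revenue_millions": 300.0, "roi": 2.0},
]
summary = compare_groups(movies, ["revenue_millions", "roi"])
print(summary["Franchise"])
print(summary["Standalone"])
```

Skipping missing values per metric means a movie with an unknown ROI still contributes to the revenue average for its group.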

In [None]:
# Compare franchise vs standalone movies
print("Franchise vs Standalone Comparison:")
franchise_vs_standalone(cleaned_data).show(truncate=False)

## Franchise Analysis

- **Analyze franchises:** Uses `analyze_franchise()` to aggregate movie counts, budgets, revenues, ratings, and ROI by franchise.

- **Sort results:** Sorts franchises by total revenue to identify top performers.
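
A grouped aggregation of this kind can be sketched in plain Python as follows. `summarize_by` is a hypothetical helper, the grouping column `collection_name` is an assumed name, and only a subset of the metrics listed above is computed to keep the example short.

```python
from collections import defaultdict


def summarize_by(movies, key, sort_by="total_revenue_millions"):
    """Aggregate per-group counts, revenue totals and mean rating; largest `sort_by` first."""
    groups = defaultdict(list)
    for m in movies:
        if m.get(key):                      # skip movies without a group value
            groups[m[key]].append(m)
    rows = []
    for name, members in groups.items():
        revenues = [m["revenue_millions"] for m in members
                    if m.get("revenue_millions") is not None]
        ratings = [m["vote_average"] for m in members
                   if m.get("vote_average") is not None]
        rows.append({
            key: name,
            "movie_count": len(members),
            "total_revenue_millions": sum(revenues),
            "mean_rating": sum(ratings) / len(ratings) if ratings else None,
        })
    return sorted(rows, key=lambda r: r[sort_by], reverse=True)


movies = [
    {"collection_name": "Saga One", "revenue_millions": 1000.0, "vote_average": 8.0},
    {"collection_name": "Saga One", "revenue_millions": 500.0, "vote_average": 7.0},
    {"collection_name": "Saga Two", "revenue_millions": 2000.0, "vote_average": 6.0},
    {"collection_name": None, "revenue_millions": 900.0, "vote_average": 9.0},
]
table = summarize_by(movies, "collection_name")
for row in table:
    print(row)
```

The same pattern applies to any grouping column, for example a director column, by changing `key`.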

In [None]:
# Top franchises by total revenue
print("Top Franchises by Total Revenue:")
analyze_franchise(cleaned_data, sort_by='total_revenue_millions').show(5, truncate=False)

## Director Analysis

- **Analyze directors:** Uses `analyze_directors()` to aggregate movie counts, revenues, ratings, and ROI for directors of franchise movies.

- **Sort results:** Sorts directors by total revenue to highlight top performers.

In [None]:
# Top directors by total revenue
print("Top Directors by Total Revenue:")
analyze_directors(cleaned_data, sort_by='total_revenue_millions').show(5, truncate=False)

## Visualizations

- **Generate plots:** Converts Spark DataFrames to pandas DataFrames and plots them with Matplotlib.

- **Revenue vs. Budget:** Scatter plot of revenue vs. budget.

- **ROI by Genre:** Bar plot of average ROI by genre.

- **Popularity vs. Rating:** Scatter plot of popularity vs. rating.

- **Yearly Box Office:** Line plot of total revenue by release year.

- **Franchise vs. Standalone:** Bar plot comparing mean metrics.
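
The ROI-by-genre plot needs one extra preparation step, because a movie can belong to several genres. The sketch below shows that aggregation in plain Python, without the plotting itself; `mean_roi_by_genre` is a hypothetical helper, and it assumes `genre_names` holds a `|`-separated string, which may differ from the actual cleaned schema.

```python
from collections import defaultdict


def mean_roi_by_genre(movies):
    """Average ROI per genre; a movie counts once for each genre it lists."""
    rois = defaultdict(list)
    for m in movies:
        if m.get("roi") is None or not m.get("genre_names"):
            continue                         # skip rows that cannot contribute
        for genre in m["genre_names"].split("|"):
            rois[genre.strip()].append(m["roi"])
    # Sort by genre name so the bars appear in a stable order.
    return {g: sum(v) / len(v) for g, v in sorted(rois.items())}


movies = [
    {"title": "A", "genre_names": "Action|Adventure", "roi": 4.0},
    {"title": "B", "genre_names": "Action", "roi": 2.0},
    {"title": "C", "genre_names": "Drama", "roi": None},
]
by_genre = mean_roi_by_genre(movies)
print(by_genre)
```

The resulting mapping of genre to mean ROI is the kind of input a bar plot would consume, with genres on one axis and the averages on the other.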

In [None]:
print("\nGenerating visualizations...")
plot_revenue_vs_budget(cleaned_data)
print("Saved revenue_vs_budget.png")

plot_roi_by_genre(cleaned_data)
print("Saved roi_by_genre.png")

plot_popularity_vs_rating(cleaned_data)
print("Saved popularity_vs_rating.png")

plot_yearly_box_office(cleaned_data)
print("Saved yearly_box_office.png")

plot_franchise_vs_standalone(cleaned_data)
print("Saved franchise_vs_standalone.png")

## Cleanup

- **Stop Spark session:** Releases resources by stopping the SparkSession.

In [None]:
# Stop the Spark session
spark.stop()
print("Analysis complete!")