# Movie Data Analysis with PySpark

This Jupyter notebook analyzes movie data fetched from the TMDB API using PySpark. Helper functions defined in `functions.py` fetch and clean the data, calculate KPIs, run advanced filtering, compare franchise and standalone movies, analyze franchises and directors, and generate visualizations.

In [1]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

# Initialize Spark session
spark = (
    SparkSession.builder
    .appName("MovieDataAnalysis")
    .config("spark.driver.memory", "4g")  # 4 GB of driver memory
    .getOrCreate()
)

25/04/23 21:48:57 WARN Utils: Your hostname, Hakeems-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 172.20.10.2 instead (on interface en0)
25/04/23 21:48:57 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/04/23 21:48:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
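
The warnings above are generally harmless for a local session, and the log itself names two ways to reduce them: setting `SPARK_LOCAL_IP` and calling `setLogLevel`. The following is a minimal sketch, not part of `functions.py`; the helper names `configure_spark_env` and `quiet_spark_logs` and the default address `127.0.0.1` are assumptions made for illustration. The environment variable only takes effect if it is set before the Spark session (and its JVM) is created.

```python
import os


def configure_spark_env(local_ip: str = "127.0.0.1") -> dict:
    """Set SPARK_LOCAL_IP unless it is already defined.

    Hypothetical helper: call it before SparkSession.builder...getOrCreate(),
    because the variable is read when the JVM starts.
    """
    os.environ.setdefault("SPARK_LOCAL_IP", local_ip)
    return {"SPARK_LOCAL_IP": os.environ["SPARK_LOCAL_IP"]}


def quiet_spark_logs(spark, level: str = "ERROR") -> None:
    """Raise the log threshold on an existing SparkSession.

    Hypothetical helper wrapping SparkContext.setLogLevel.
    """
    spark.sparkContext.setLogLevel(level)


# Apply the environment setting; nothing here requires Spark to be running.
applied = configure_spark_env()
print(applied)
```

Once the session exists, `quiet_spark_logs(spark)` would hide `WARN` messages for the rest of the notebook; messages printed during JVM startup, before the level is changed, still appear.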


In [None]:
from functions import (
    build_schema,                   # Define the schema for movie data
    fetch_movie_data,               # Fetch movie data from the TMDB API
    clean_movie_data,               # Clean and transform the DataFrame
    kpi_ranking,                    # Calculate key performance indicators
    advanced_search,                # Perform advanced movie filtering
    franchise_vs_standalone,        # Compare franchise and standalone movies
    analyze_franchise,              # Analyze franchise performance
    analyze_directors,              # Evaluate director performance
    plot_revenue_vs_budget,         # Plot revenue against budget
    plot_roi_by_genre,              # Plot ROI by genre
    plot_popularity_vs_rating,      # Plot popularity against rating
    plot_yearly_box_office,         # Plot yearly box office
    plot_franchise_vs_standalone,   # Plot franchise vs. standalone comparison
)