# Movie Data Analysis with PySpark

This Jupyter notebook analyzes movie data fetched from the TMDB API using PySpark. It relies on functions defined in `functions.py` to fetch and clean the data, calculate KPIs, run advanced filters, compare franchise and standalone movies, analyze franchises and directors, and generate visualizations.

In [1]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("MovieDataAnalysis") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

25/04/23 21:48:57 WARN Utils: Your hostname, Hakeems-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 172.20.10.2 instead (on interface en0)
25/04/23 21:48:57 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/04/23 21:48:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
from functions import (
    build_schema,                   # Define the schema for the raw movie data
    fetch_movie_data,               # Fetch movie data from the TMDB API
    clean_movie_data,               # Clean and transform the DataFrame
    kpi_ranking,                    # Calculate key performance indicators
    advanced_search,                # Perform advanced movie filtering
    franchise_vs_standalone,        # Compare franchise and standalone movies
    analyze_franchise,              # Analyze franchise performance
    analyze_directors,              # Evaluate director performance
    plot_revenue_vs_budget,         # Plot revenue against budget
    plot_roi_by_genre,              # Plot ROI by genre
    plot_popularity_vs_rating,      # Plot popularity against rating
    plot_yearly_box_office,         # Plot box office totals by year
    plot_franchise_vs_standalone    # Plot the franchise vs. standalone comparison
)

## Fetching Raw Movie Data

- **Define movie IDs:** Creates a list of specific TMDB movie IDs to analyze.

- **Build schema:** Uses `build_schema()` to define the structure of the DataFrame.

- **Fetch data:** Uses `fetch_movie_data()` to collect metadata for each movie from the TMDB API.

- **Preview data:** Displays the schema and first few records of the fetched raw data.
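
The body of `fetch_movie_data()` lives in `functions.py` and is not shown in this notebook. As a rough sketch of what the fetch step might look like, the snippet below assumes the TMDB v3 movie-details endpoint, an API key passed as a query parameter, and credits requested in the same call. The helper names (`build_movie_url`, `fetch_one`, `fetch_many`) are hypothetical and are not the functions used here. IDs that fail to resolve are skipped rather than aborting the whole run.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL for TMDB v3 movie details; verify against the TMDB docs.
TMDB_MOVIE_URL = "https://api.themoviedb.org/3/movie/{movie_id}"


def build_movie_url(movie_id, api_key):
    """Build the details URL for one movie, requesting credits in the same call."""
    query = urllib.parse.urlencode(
        {"api_key": api_key, "append_to_response": "credits"}
    )
    return TMDB_MOVIE_URL.format(movie_id=movie_id) + "?" + query


def fetch_one(movie_id, api_key, timeout=10):
    """Fetch one movie as a dict, or return None if the request or parsing fails."""
    url = build_movie_url(movie_id, api_key)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # OSError covers HTTP errors, network failures, and timeouts;
        # ValueError covers malformed JSON.
        return None


def fetch_many(movie_ids, fetch=fetch_one, **kwargs):
    """Fetch every ID in order, keeping only the successful responses."""
    records = []
    for movie_id in movie_ids:
        record = fetch(movie_id, **kwargs)
        if record is not None:
            records.append(record)
    return records
```

Under these assumptions, `fetch_many(movie_ids, api_key="...")` would return a list of dictionaries, which could then be turned into a DataFrame together with the schema. Passing the fetcher as a parameter keeps the loop testable without network access.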

In [4]:
# Define movie IDs
movie_ids = [0, 299534, 19995, 140607, 299536, 597, 135397,
             420818, 24428, 168259, 99861, 284054, 12445,
             181808, 330457, 351286, 109445, 321612, 260513]

# Build schema
schema = build_schema()

# Fetch movie data
raw_data = fetch_movie_data(movie_ids, schema)

# Display schema
print("Raw Data Schema:")
raw_data.printSchema()

# Display first few rows
print("\nSample Raw Data:")
raw_data.show(5, truncate=False)



Raw Data Schema:
root
 |-- id: integer (nullable = false)
 |-- title: string (nullable = true)
 |-- tagline: string (nullable = true)
 |-- release_date: string (nullable = true)
 |-- original_language: string (nullable = true)
 |-- budget: long (nullable = true)
 |-- revenue: long (nullable = true)
 |-- vote_count: integer (nullable = true)
 |-- vote_average: double (nullable = true)
 |-- popularity: double (nullable = true)
 |-- runtime: integer (nullable = true)
 |-- overview: string (nullable = true)
 |-- poster_path: string (nullable = true)
 |-- belongs_to_collection: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- genres: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: integer (nullable = true)
 |    |    |-- name: string (nullable = true)
 |-- production_companies: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: integer (nullable = true)
 |   

                                                                                


## Cleaning the Movie Dataset

- **Clean raw data:** Applies `clean_movie_data()` to standardize column names, process JSON-like fields, and convert data types.

- **Handle zeros:** Replaces zero values in budget, revenue, and runtime with NULL.

- **Calculate metrics:** Adds financial metrics like profit and ROI (return on investment).

- **Standardize fields:** Unifies genre and production country formats for consistency.

- **Preview cleaned data:** Displays the schema and first few records of the cleaned dataset.
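
`clean_movie_data()` is defined in `functions.py`, so its exact logic is not visible here. The plain-Python sketch below illustrates the per-record rules described above under a few stated assumptions: ROI is taken as revenue divided by budget, amounts are converted to millions, and genre names are joined with `|`. The function names and the `profit_millions` and `roi` keys are hypothetical. The real function operates on a Spark DataFrame and may differ.

```python
def zero_to_none(value):
    """Treat 0 or a missing value as unknown."""
    return value if value else None


def clean_record(raw):
    """Apply the cleaning rules to one raw movie dict (illustrative only)."""
    budget = zero_to_none(raw.get("budget"))
    revenue = zero_to_none(raw.get("revenue"))
    runtime = zero_to_none(raw.get("runtime"))

    # Convert to millions, leaving unknown values as None.
    budget_m = budget / 1_000_000 if budget is not None else None
    revenue_m = revenue / 1_000_000 if revenue is not None else None

    # Financial metrics are only defined when both inputs are known.
    if budget_m is not None and revenue_m is not None:
        profit_m = revenue_m - budget_m
        roi = revenue_m / budget_m  # assumed definition: revenue / budget
    else:
        profit_m = None
        roi = None

    # Flatten the list of genre structs into one delimited string.
    genres = raw.get("genres") or []
    genre_names = "|".join(g["name"] for g in genres) or None

    return {
        "id": raw.get("id"),
        "budget_millions": budget_m,
        "revenue_millions": revenue_m,
        "profit_millions": profit_m,
        "roi": roi,
        "runtime": runtime,
        "genre_names": genre_names,
    }
```

Replacing zeros with `None` before computing the metrics avoids division by zero and keeps placeholder values from skewing averages. In PySpark the same rule would typically be expressed with column expressions rather than a Python loop.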

In [6]:
# Clean the raw data
cleaned_data = clean_movie_data(raw_data)

# Cache the cleaned DataFrame for performance
cleaned_data.cache()

# Display schema
print("Cleaned Data Schema:")
cleaned_data.printSchema()

# Display first few rows
print("\nSample Cleaned Data:")
cleaned_data.show(5, truncate=False)

Cleaned Data Schema:
root
 |-- id: integer (nullable = false)
 |-- title: string (nullable = true)
 |-- tagline: string (nullable = true)
 |-- release_date: date (nullable = true)
 |-- genre_names: string (nullable = true)
 |-- collection_name: string (nullable = true)
 |-- original_language: string (nullable = true)
 |-- budget_millions: double (nullable = true)
 |-- revenue_millions: double (nullable = true)
 |-- production_companies_str: string (nullable = true)
 |-- production_countries_str: string (nullable = true)
 |-- vote_count: integer (nullable = true)
 |-- vote_average: double (nullable = true)
 |-- popularity: double (nullable = true)
 |-- runtime: integer (nullable = true)
 |-- overview: string (nullable = true)
 |-- spoken_languages_str: string (nullable = true)
 |-- poster_path: string (nullable = true)
 |-- cast_names: string (nullable = true)
 |-- cast_size: integer (nullable = false)
 |-- director: string (nullable = true)
 |-- crew_size: integer (nullable = false)
 |

25/04/23 21:58:40 WARN CacheManager: Asked to cache already cached data.
