# IMDb Data Merging Notebook

This notebook merges all IMDb dataset TSV files into a single, consolidated Parquet file. The key objectives are:

1.  **Focus on US Movies:** Keep only titles whose `titleType` is 'movie' in `title.basics` and that have a 'US' region entry in `title.akas`. This subset is the part of the full dataset we want to investigate.
2.  **Flatten `akas`:** Use the `title.akas` table (filtered for US) as the base, ensuring columns like `title`, `region`, `language` are directly available at the top level.
3.  **Nest Related Data:** Aggregate information about `cast` (principals + names), `directors`, `writers`, and `episodes` into list structures associated with each movie's `tconst`.
4.  **Efficiency with LazyFrames:** Utilize Polars' LazyFrame API (`scan_csv`, `filter`, `join`, `group_by`, `agg`) to build an optimized query plan *before* loading all the data into memory. This is crucial for handling large datasets.
5.  **Output Parquet:** Save the final merged table in the efficient Parquet format (`imdb_us_movies_merged.parquet`).

## Setup and Configuration

Import necessary libraries, define URLs for the data files, set up common Polars scan settings, and configure Polars display options.


In [None]:
import polars as pl
import sys
import os
from pprint import pprint # For a clean preview
from dotenv import load_dotenv

# --- Configuration to see the full nested lists ---
pl.Config.set_fmt_str_lengths(1000)
pl.Config.set_tbl_rows(20)
pl.Config.set_tbl_cols(50)

print(f"Python version: {sys.version}")
print(f"Polars version: {pl.__version__}")

# URLs for IMDb datasets
load_dotenv("config/.env")
URL_NAME_BASICS = os.getenv("URL_NAME_BASICS")
URL_TITLE_AKAS = os.getenv("URL_TITLE_AKAS")
URL_TITLE_BASICS = os.getenv("URL_TITLE_BASICS")
URL_TITLE_CREW = os.getenv("URL_TITLE_CREW")
URL_TITLE_EPISODE = os.getenv("URL_TITLE_EPISODE")
URL_TITLE_PRINC = os.getenv("URL_TITLE_PRINC")
URL_TITLE_RATINGS = os.getenv("URL_TITLE_RATINGS")

# --- Define common scan settings for TSV files ---
scan_settings = {
    'separator': '\t',       # IMDb uses tab separators
    'null_values': '\\N',    # IMDb uses '\N' to represent null
    'low_memory': True,     # Use low memory mode for potentially large files
    'quote_char': None,     # Disable quote parsing as TSV generally doesn't use quotes like CSV
}

# --- Define the final output file name ---
output_file = 'imdb_us_movies_merged.parquet'

Python version: 3.10.16 (main, Dec 11 2024, 10:22:29) [Clang 14.0.6 ]
Polars version: 1.34.0


## Step 1: Identify Target US Movie IDs

**What:** Create a small, unique list of `tconst` identifiers representing titles that are both `titleType == "movie"` (from `title.basics`) and have a `region == "US"` entry (from `title.akas`).
**Why:** This list acts as our primary filter. By identifying these target IDs *first* and collecting them into memory, we can perform efficient `inner join` operations later, significantly reducing the amount of data processed from other large tables.
**How:**
1.  Scan `title.basics`, filter for movies, select `tconst`.
2.  Scan `title.akas`, filter for US region, select `titleId` (renamed to `tconst`).
3.  Inner join these two LazyFrames on `tconst`.
4.  Select `tconst` and get unique values.
5.  Rename `tconst` to `tconst_filter_key` to avoid naming conflicts during subsequent joins.
6.  **`.collect()`:** Execute the lazy query to bring the small list of unique IDs into memory as a Polars Series.
7.  Convert the Series back into a LazyFrame for joining.

In [2]:
print("Step 1: Finding all 'US Movies'...")
# Find all movie tconsts
lf_movies = pl.scan_csv(URL_TITLE_BASICS, **scan_settings).filter(
    pl.col("titleType") == "movie"
).select("tconst")

# Find all tconsts associated with the US region
lf_akas_us = pl.scan_csv(URL_TITLE_AKAS, **scan_settings).filter(
    pl.col("region") == "US"
).select(["titleId", "region"]).rename({"titleId": "tconst"})

# Inner join to get only tconsts that are movies AND have a US entry
lf_us_movie_tconsts = lf_movies.join(
    lf_akas_us, on="tconst", how="inner"
).select("tconst").unique()

# Rename the key to prevent column name collisions later
lf_us_movie_tconsts_renamed = lf_us_movie_tconsts.select(
    pl.col("tconst").alias("tconst_filter_key")
)

# --- Collect this small list into memory --- 
# This is the crucial optimization step
print("Collecting the list of US movie IDs...")
try:
    us_movie_tconsts_series = lf_us_movie_tconsts_renamed.collect().to_series()
    print(f"Found {len(us_movie_tconsts_series)} US movies to process.")
    # Create a LazyFrame from the collected series for efficient joining
    lf_us_movie_tconsts = pl.LazyFrame({"tconst_filter_key": us_movie_tconsts_series})
except Exception as e:
    print(f"Error collecting US movie IDs: {e}")
    # Stop execution if we can't get the filter keys
    raise

Step 1: Finding all 'US Movies'...
Collecting the list of US movie IDs...
Found 338374 US movies to process.


## Step 2: Scan Remaining Data Files Lazily

**What:** Create LazyFrame representations for all the other IMDb tables (`basics`, `ratings`, `crew`, `principals`, `akas`, `episode`, `names`).
**Why:** Using `scan_csv` instead of `read_csv` allows Polars to build the entire processing pipeline without loading these potentially huge files into memory immediately. Operations will only be executed when necessary (e.g., during `.collect()` or `.sink_parquet()`).
**How:** Call `pl.scan_csv()` for each URL, applying the common `scan_settings`.

In [3]:
print("Step 2: Scanning all other tables...")
lf_basics   = pl.scan_csv(URL_TITLE_BASICS, **scan_settings)
lf_ratings  = pl.scan_csv(URL_TITLE_RATINGS, **scan_settings)
lf_crew     = pl.scan_csv(URL_TITLE_CREW, **scan_settings)
lf_princip  = pl.scan_csv(URL_TITLE_PRINC, **scan_settings)
lf_akas     = pl.scan_csv(URL_TITLE_AKAS, **scan_settings).rename({'titleId':'tconst'}) # Rename key early
lf_episode  = pl.scan_csv(URL_TITLE_EPISODE, **scan_settings)
lf_names    = pl.scan_csv(URL_NAME_BASICS, **scan_settings)

Step 2: Scanning all other tables...


## Step 3: Build Nested Cast Data

**What:** Create a LazyFrame (`lf_cast_nested`) where each row corresponds to a `tconst` and contains a nested list (`cast`) of structs, with each struct holding details about a principal cast member (category, job, characters, name, birth/death year, profession).
**Why:** Aggregates detailed cast information into a structured list for each movie, keeping the main DataFrame concise.
**How:**
1.  **Filter `lf_princip` Early:** Inner join `lf_princip` with our collected `lf_us_movie_tconsts` *before* joining with `lf_names`. This drastically reduces the size of the data being processed.
2.  Join the filtered principals with `lf_names` (on `nconst`) to add name details.
3.  `group_by('tconst')`.
4.  `agg()`: Use `pl.struct(pl.all().exclude(...))` to gather all columns (except the group key `tconst`, the join key `nconst`, and the filter key) into a struct for each cast member, then aggregate these structs into a list named `cast`.

In [4]:
print("Step 3: Building lazy graph for cast...")
# Filter principals to only include those associated with our target US movies
lf_princip_filtered = lf_princip.join(
    lf_us_movie_tconsts, 
    left_on="tconst",
    right_on="tconst_filter_key",
    how="inner"
)

# Join filtered principals with names, then group and aggregate into nested structs
lf_cast_nested = lf_princip_filtered.join(
    lf_names.select(['nconst', 'primaryName', 'birthYear', 'deathYear', 'primaryProfession']), 
    on='nconst', 
    how='left' # Keep all principals, even if name details are missing
).group_by('tconst').agg(
    # Create a list of structs, excluding keys used for joining/grouping
    pl.struct(pl.all().exclude(['tconst', 'nconst', 'tconst_filter_key'])).alias('cast')
)

Step 3: Building lazy graph for cast...


## Step 4: Build Nested Crew Data (Directors & Writers)

**What:** Create LazyFrames (`lf_directors_nested`, `lf_writers_nested`) containing nested lists of directors and writers, respectively, including their name details.
**Why:** Similar to cast, this aggregates crew information efficiently.
**How:**
1.  **Filter `lf_crew` Early:** Inner join `lf_crew` with `lf_us_movie_tconsts`.
2.  **Define Helper Function `get_nested_crew`:** This function encapsulates the logic for processing comma-separated `nconst` strings:
    * Select `tconst` and the target column (`directors` or `writers`).
    * Split the comma-separated string into a list.
    * `explode` the list to create one row per crew member.
    * Rename the exploded column to `nconst`.
    * Join with `lf_names` to get details.
    * `group_by('tconst')` and `agg()` into a list of structs.
3.  Call the helper function for both `directors` and `writers`.

In [5]:
print("Step 4: Building lazy graph for crew...")
# Filter crew table to only include rows for our target US movies
lf_crew_filtered = lf_crew.join(
    lf_us_movie_tconsts, 
    left_on="tconst",
    right_on="tconst_filter_key",
    how="inner"
)

# Helper function to process comma-separated crew columns (directors, writers)
def get_nested_crew(lf_filtered_crew, lf_names_table, column_name):
    return lf_filtered_crew.select(['tconst', column_name])\
        .drop_nulls(column_name)\
        .with_columns(
            pl.col(column_name).str.split(',') # Split string into list
        )\
        .explode(column_name)\
        .rename({column_name: 'nconst'})\
        .join(
            lf_names_table.select(['nconst', 'primaryName', 'birthYear', 'deathYear']),
            on='nconst',
            how='left'
        )\
        .group_by('tconst').agg(
            pl.struct(pl.all().exclude('tconst')).alias(column_name) # Aggregate into list of structs
        )

# Apply the function to create nested directors and writers
lf_directors_nested = get_nested_crew(lf_crew_filtered, lf_names, 'directors')
lf_writers_nested = get_nested_crew(lf_crew_filtered, lf_names, 'writers')

Step 4: Building lazy graph for crew...


## Step 5: Build Nested Episode Data

**What:** Create `lf_episode_nested`, which holds a nested list of episode details for each parent title. Episodes normally belong to series, so with a movie-only filter this frame is expected to be empty or nearly empty. It is kept for edge cases where a movie is listed as the parent of episodes, and to prepare for future expansion.
**Why:** Aggregates episode information.
**How:**
1.  **Filter `lf_episode` Early:** Inner join `lf_episode` with `lf_us_movie_tconsts` on `parentTconst`.
2.  `group_by('parentTconst')`.
3.  `agg()` into a list of structs named `episodes`.
4.  Rename `parentTconst` back to `tconst` for the final join.

In [6]:
print("Step 5: Building lazy graph for episodes...")
# Filter episodes to keep only those whose parent is one of our target US movies
# Note: This might result in an empty or near-empty frame if target titles are truly only movies
lf_episode_filtered = lf_episode.join(
    lf_us_movie_tconsts, 
    left_on="parentTconst", 
    right_on="tconst_filter_key",
    how="inner"
)

# Group by parent and aggregate episode details
lf_episode_nested = lf_episode_filtered.group_by('parentTconst').agg(
    pl.struct(
        pl.all().exclude(['parentTconst', 'tconst_filter_key']) 
    ).alias('episodes')
).rename({'parentTconst': 'tconst'}) # Rename key for the final join

Step 5: Building lazy graph for episodes...


## Step 6: Build Final Query - Combining All LazyFrames

**What:** Construct the final LazyFrame (`lf_final`) by joining the base table (`lf_base_flat`) with all the supplementary and nested LazyFrames.
**Why:** This defines the complete data structure before execution.
**How:**
1.  **Define Base Table:** Filter `lf_akas` for `region == "US"` and inner join with `lf_us_movie_tconsts`. This creates our base table containing only US aka entries for the target movies, with `akas` columns flattened.
2.  **Left Join Others:** Sequentially join `lf_basics`, `lf_ratings`, `lf_cast_nested`, `lf_directors_nested`, `lf_writers_nested`, and `lf_episode_nested` onto `lf_base_flat` with `.join(..., how='left')`, using `tconst` as the key. Left joins keep every US aka entry even when corresponding data (such as ratings or episodes) is missing.

In [7]:
print("Step 6: Combining all lazy graphs...")

# --- Start with 'akas' as the base table ---
# Filter akas for US region first
lf_akas_us_filtered = lf_akas.filter(
    pl.col("region") == "US"
)

# Inner join the US akas with our collected list of US movie IDs
# This ensures our base table ONLY contains US AKAs for titles confirmed to be movies
lf_base_flat = lf_akas_us_filtered.join(
    lf_us_movie_tconsts, 
    left_on="tconst",
    right_on="tconst_filter_key",
    how="inner"
)

# --- Join all other tables onto the flat 'akas' base --- 
# Use left joins to keep all rows from lf_base_flat
lf_final = lf_base_flat.join(
    lf_basics, on='tconst', how='left' # Add 'basics' data (primaryTitle, etc.)
).join(
    lf_ratings, on='tconst', how='left'
).join(
    lf_cast_nested, on='tconst', how='left'
).join(
    lf_directors_nested, on='tconst', how='left'
).join(
    lf_writers_nested, on='tconst', how='left'
).join(
    lf_episode_nested, on='tconst', how='left'
)

# Display the query plan (optional, but useful for understanding optimization)
print("\n--- Optimized Query Plan: ---")
print(lf_final.explain())

Step 6: Combining all lazy graphs...

--- Optimized Query Plan: ---
LEFT JOIN:
LEFT PLAN ON: [col("tconst")]
  LEFT JOIN:
  LEFT PLAN ON: [col("tconst")]
    LEFT JOIN:
    LEFT PLAN ON: [col("tconst")]
      LEFT JOIN:
      LEFT PLAN ON: [col("tconst")]
        LEFT JOIN:
        LEFT PLAN ON: [col("tconst")]
          LEFT JOIN:
          LEFT PLAN ON: [col("tconst")]
            INNER JOIN:
            LEFT PLAN ON: [col("tconst")]
              SELECT [col("titleId").alias("tconst"), col("ordering"), col("title"), col("region"), col("language"), col("types"), col("attributes"), col("isOriginalTitle")]
                Csv SCAN [https://workspace4824871889.blob.core.windows.net/azureml-blobstore-84f516da-0fe5-4f33-8f3c-f18ec8e2b4f7/UI/2025-10-22_104546_UTC/title.akas.tsv.gz]
                PROJECT 8/8 COLUMNS
                SELECTION: [(col("region")) == ("US")]
            RIGHT PLAN ON: [col("tconst_filter_key")]
              DF ["tconst_filter_key"]; PROJECT */1 COLUMNS
      

## Step 7: Execute Query and Save to Parquet

**What:** Trigger the execution of the entire lazy query plan and save the resulting DataFrame to a Parquet file.
**Why:** Parquet is a columnar storage format that is highly efficient for analytics, offering good compression and fast read/write speeds.
**How:** Call `.sink_parquet()` on the final LazyFrame (`lf_final`). This method streams the result directly to the file, which is more memory-efficient than `.collect()` followed by `.write_parquet()` for large results. `compression='zstd'` provides a good balance of speed and compression ratio.

In [8]:
print(f"\nStep 7: Executing query and saving to {output_file}...")
try:
    lf_final.sink_parquet(
        output_file,
        compression='zstd' # Use Zstandard compression
    )
    print("Done with Polars!")
    print(f"File '{output_file}' is saved.")
except Exception as e:
    print(f"Error saving Parquet file: {e}")
    raise


Step 7: Executing query and saving to imdb_us_movies_merged.parquet...
Done with Polars!
File 'imdb_us_movies_merged.parquet' is saved.


## Step 8: Preview Output File

**What:** Load the first few rows of the newly created Parquet file and display them along with the schema.
**Why:** To verify that the merge process completed successfully and the data structure is as expected (e.g., `akas` columns are flat, other data is nested).
**How:** Use `pl.scan_parquet()` to lazily scan the output file, `.limit(3)` to take only the first three rows, and `.collect()` to bring them into memory for display using `pprint` for better readability. Print the `.schema`.

In [9]:
print(f"\n--- Previewing first 3 rows of new file: {output_file} ---")
try:
    # Scan the output file, limit to 3 rows, collect
    df_preview = pl.scan_parquet(output_file).limit(3).collect()
    
    # Convert to dictionaries for pretty printing
    preview_dicts = df_preview.to_dicts()
    pprint(preview_dicts)
    
    print("\n--- New Schema (akas columns should be top-level) ---")
    # Display the schema of the resulting dataframe
    print(df_preview.schema)

except Exception as e:
    print(f"An error occurred during preview: {e}")


--- Previewing first 3 rows of new file: imdb_us_movies_merged.parquet ---
[{'attributes': None,
  'averageRating': 5.7,
  'cast': [{'birthYear': None,
            'category': 'cinematographer',
            'characters': None,
            'deathYear': None,
            'job': None,
            'ordering': 3,
            'primaryName': 'Ingo Kratisch',
            'primaryProfession': 'cinematographer,director,writer'},
           {'birthYear': None,
            'category': 'director',
            'characters': None,
            'deathYear': None,
            'job': None,
            'ordering': 1,
            'primaryName': 'Daniel Eisenberg',
            'primaryProfession': 'editor,director,producer'},
           {'birthYear': None,
            'category': 'producer',
            'characters': None,
            'deathYear': None,
            'job': 'producer',
            'ordering': 2,
            'primaryName': 'Daniel Eisenberg',
            'primaryProfession': 'editor,director,