# IMDB Movie Data Cleaning Notebook

This notebook performs a comprehensive cleaning process on the `imdb_us_movies_merged.parquet` dataset. The goal is to prepare the data for potential machine learning tasks by addressing missing values, inconsistencies, and potential errors, while keeping the data largely human-readable.

**Approach:**
* **Polars for Efficiency:** We use the Polars library for fast, memory-efficient data manipulation, especially for handling large datasets and complex nested structures.
* **Step-by-Step Cleaning:** We apply cleaning steps sequentially, starting with broad column removals and moving towards finer-grained data sanitization.
* **Data-Driven Decisions:** Many decisions (like dropping columns or mapping values) are based on initial data exploration (EDA) and insights gained during the process (e.g., identifying single-value columns, understanding the structure of `primaryProfession`).
* **Sklearn Pipeline:** A final `sklearn` pipeline handles placeholder imputation for remaining top-level missing values, ensuring reproducibility for model training.

In [1]:
import polars as pl
import numpy as np
import datetime
import os
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn import set_config
from functools import reduce
import operator
from dotenv import load_dotenv

# Set sklearn to output Polars DataFrames
try:
    set_config(transform_output="polars")
except Exception as e:
    print(f"Could not set sklearn config: {e}. Proceeding without it.")

# --- Configuration ---
pl.Config.set_fmt_str_lengths(1000)
pl.Config.set_tbl_rows(20)
pl.Config.set_tbl_cols(50)


# Define file path
load_dotenv("config/.env")
DATA_FILE = os.getenv("URL_MERGED_DATA")

## Load Data


In [2]:
try:
    df = pl.read_parquet(DATA_FILE)
except Exception as e:
    print(f"Error loading file '{DATA_FILE}': {e}")
    print("Exiting. Please check the file path.")
    raise e 

print("--- Original DataFrame Head ---")
print(df.head())
print("\n--- Original Schema ---")
print(df.schema)

--- Original DataFrame Head ---
shape: (5, 22)
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ tco ┆ ord ┆ tit ┆ reg ┆ lan ┆ typ ┆ att ┆ isO ┆ tit ┆ pri ┆ ori ┆ isA ┆ sta ┆ end ┆ run ┆ gen ┆ ave ┆ num ┆ cas ┆ dir ┆ wri ┆ epi │
│ nst ┆ eri ┆ le  ┆ ion ┆ gua ┆ es  ┆ rib ┆ rig ┆ leT ┆ mar ┆ gin ┆ dul ┆ rtY ┆ Yea ┆ tim ┆ res ┆ rag ┆ Vot ┆ t   ┆ ect ┆ ter ┆ sod │
│ --- ┆ ng  ┆ --- ┆ --- ┆ ge  ┆ --- ┆ ute ┆ ina ┆ ype ┆ yTi ┆ alT ┆ t   ┆ ear ┆ r   ┆ eMi ┆ --- ┆ eRa ┆ es  ┆ --- ┆ ors ┆ s   ┆ es  │
│ str ┆ --- ┆ str ┆ str ┆ --- ┆ str ┆ s   ┆ lTi ┆ --- ┆ tle ┆ itl ┆ --- ┆ --- ┆ --- ┆ nut ┆ str ┆ tin ┆ --- ┆ lis ┆ --- ┆ --- ┆ --- │
│     ┆ i64 ┆     ┆     ┆ str ┆     ┆ --- ┆ tle ┆ str ┆ --- ┆ e   ┆ i64 ┆ i64 ┆ str ┆ es  ┆     ┆ g   ┆ i64 ┆ t[s ┆ lis ┆ lis ┆ lis │
│     ┆     ┆     ┆     ┆     ┆     ┆ str ┆ --- ┆     ┆ str ┆ --- ┆     ┆     ┆     ┆ --- ┆     ┆ --- ┆     ┆ tru ┆ t[s ┆ t[s ┆ t[s │
│     ┆     ┆  

## Step 1: Drop Unnecessary Top-Level Columns

**What:** Remove identifier columns (`tconst`, `ordering`) and redundant title columns (`primaryTitle`, `originalTitle`).
**Why:** 
* Identifiers like `tconst` and `ordering` provide no generalizable information for modeling.
* Prior inspection showed that we only need one title: `title`, the one used in the US, which is the region we are interested in. `primaryTitle` and `originalTitle` are therefore redundant.
**How:** Using `df.drop()`.

In [3]:
print("\n--- Step 1: Dropping top-level ID and title columns ---")
cols_to_drop_step1 = ['tconst', 'ordering', 'primaryTitle', 'originalTitle'] 
existing_cols_to_drop = [col for col in cols_to_drop_step1 if col in df.columns]

if existing_cols_to_drop:
    print(f"Dropping existing top-level columns: {existing_cols_to_drop}")
    df = df.drop(existing_cols_to_drop)
else:
    print("No specified columns ('tconst', 'ordering', 'primaryTitle', 'originalTitle') found to drop.")


--- Step 1: Dropping top-level ID and title columns ---
Dropping existing top-level columns: ['tconst', 'ordering', 'primaryTitle', 'originalTitle']


## Step 2: Check for and Drop Duplicate Rows

**What:** Identify and remove rows that are exact duplicates after dropping initial identifiers.
**Why:** Duplicate rows can skew analysis and model training. It's standard practice to remove them.
**How:** Using `df.is_duplicated()` to find and `df.unique()` to remove.

In [4]:
print("\n--- Step 2: Checking for duplicate rows ---")
duplicates_df = df.filter(df.is_duplicated())

if duplicates_df.height > 0:
    print(f"Found {duplicates_df.height} duplicate rows. Showing first 5:")
    # Display relevant columns to understand the duplicates
    cols_to_show_duplicates = [col for col in ['title', 'startYear', 'runtimeMinutes', 'genres'] if col in duplicates_df.columns]
    print(duplicates_df.head(5).select(cols_to_show_duplicates))
    print("Dropping duplicate rows...")
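    # maintain_order=True keeps the original row order (not guaranteed by default)
    df = df.unique(keep='first', maintain_order=True)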
else:
    print("No duplicate rows found.")


--- Step 2: Checking for duplicate rows ---
Found 233 duplicate rows. Showing first 5:
shape: (5, 4)
┌─────────────────────────────────┬───────────┬────────────────┬─────────────┐
│ title                           ┆ startYear ┆ runtimeMinutes ┆ genres      │
│ ---                             ┆ ---       ┆ ---            ┆ ---         │
│ str                             ┆ i64       ┆ i64            ┆ str         │
╞═════════════════════════════════╪═══════════╪════════════════╪═════════════╡
│ Anatomy of an Athlete           ┆ 2018      ┆ null           ┆ Documentary │
│ Weaponized: Hip-Hop Under Siege ┆ null      ┆ null           ┆ Documentary │
│ Murphy                          ┆ null      ┆ null           ┆ Thriller    │
│ The Wind Collector              ┆ null      ┆ null           ┆ Drama       │
│ The Projectionist               ┆ null      ┆ null           ┆ Drama       │
└─────────────────────────────────┴───────────┴────────────────┴─────────────┘
Dropping duplicate rows...


## Step 3: Drop Top-Level Single-Value Columns

**What:** Remove columns where all rows contain the same single value (or are all null).
**Why:** Such columns have zero variance and provide no predictive information for a model.
**How:** Calculate `n_unique()` for each column and drop those where the count is <= 1.

In [5]:
print("\n--- Step 3: Dropping top-level single-value columns ---")
cardinality = df.select(pl.all().n_unique())
cols_to_drop_step3 = [
    col.name for col in cardinality.select(pl.all())
    if col[0] is not None and col[0] <= 1
]
if cols_to_drop_step3:
    print(f"Dropping single-value columns: {cols_to_drop_step3}")
    df = df.drop(cols_to_drop_step3)
else:
    print("No single-value columns to drop.")


--- Step 3: Dropping top-level single-value columns ---
Dropping single-value columns: ['region', 'isOriginalTitle', 'titleType', 'endYear', 'episodes']


## Step 4: Drop Top-Level High-Null Columns

**What:** Remove columns where more than 90% of the values are null.
**Why:** These columns contain very little information and attempts to impute them often introduce more noise than signal.
**How:** Calculate `null_count()` as a percentage of total rows and drop columns exceeding the 90% threshold.

In [6]:
print("\n--- Step 4: Dropping >90% null top-level columns ---")
null_percentages = df.null_count() / len(df)
cols_to_drop_step4 = [
    col.name for col in null_percentages.select(pl.all())
    if col[0] is not None and col[0] > 0.90
]
if cols_to_drop_step4:
    print(f"Dropping high-null columns: {cols_to_drop_step4}")
    df = df.drop(cols_to_drop_step4)
else:
    print("No columns found with >90% null values.")


--- Step 4: Dropping >90% null top-level columns ---
Dropping high-null columns: ['language', 'attributes']


## Step 5: Inspect Nested Null Percentages

**What:** Calculate and display the percentage of null values for specific fields *within* the list structures (`cast`, `directors`, `writers`, `episodes`).
**Why:** To understand the data quality inside the nested lists *before* applying imputation. This helps verify if our imputation strategy (using "missing"/-1) is reasonable given the amount of missing data.
**How:** Define a helper function `analyze_nested_nulls` that uses `explode`, `unnest`, calculates the total number of nested items, and then computes the null percentage for the specified fields.

In [7]:
print("\n--- Step 5: Inspecting nulls inside list structures ---")

def analyze_nested_nulls(df, list_col, fields_to_check):
    """Helper function to explode, unnest, and calculate null percentages."""
    if list_col not in df.columns:
        print(f"--- '{list_col}' not found. Skipping analysis. ---")
        return
    print(f"--- '{list_col}' nested null percentages ---")
    try:
        # Use LazyFrame for potentially large intermediate result during explode/unnest
        df_unnested_lazy = df.lazy().explode(list_col).unnest(list_col)
        
        # Calculate total count efficiently
        total_count = df_unnested_lazy.select(pl.len()).collect().item()  # pl.len() replaces the deprecated pl.count()
        
        if total_count == 0:
            print("No data found after exploding.")
            return
            
        print(f"Total entries: {total_count}")
        
        # Check available columns after unnesting
        unnested_cols = df_unnested_lazy.collect_schema().names()  # resolve the lazy schema explicitly
        fields_exist = [f for f in fields_to_check if f in unnested_cols]
        
        if not fields_exist:
            print(f"None of the specified fields {fields_to_check} exist in '{list_col}'. Available: {unnested_cols}")
            return
            
        # Calculate nulls and percentages for existing fields
        null_stats_lazy = df_unnested_lazy.select([
            (pl.col(field).null_count() / total_count * 100).alias(f"{field}_null_pct")
            for field in fields_exist
        ])
        
        # Collect the final stats
        null_stats = null_stats_lazy.collect()
        print(null_stats)
        
    except Exception as e:
        print(f"Error analyzing '{list_col}': {e}")

# Run the analysis for each list column
analyze_nested_nulls(df, 'cast', ['job', 'birthYear', 'primaryProfession', 'primaryName'])
analyze_nested_nulls(df, 'directors', ['primaryName', 'birthYear', 'deathYear'])
analyze_nested_nulls(df, 'writers', ['primaryName', 'birthYear'])
analyze_nested_nulls(df, 'episodes', ['seasonNumber', 'episodeNumber'])


--- Step 5: Inspecting nulls inside list structures ---
--- 'cast' nested null percentages ---




Total entries: 5310319
shape: (1, 4)
┌──────────────┬────────────────────┬────────────────────────────┬──────────────────────┐
│ job_null_pct ┆ birthYear_null_pct ┆ primaryProfession_null_pct ┆ primaryName_null_pct │
│ ---          ┆ ---                ┆ ---                        ┆ ---                  │
│ f64          ┆ f64                ┆ f64                        ┆ f64                  │
╞══════════════╪════════════════════╪════════════════════════════╪══════════════════════╡
│ 77.678629    ┆ 54.289093          ┆ 2.153694                   ┆ 0.119145             │
└──────────────┴────────────────────┴────────────────────────────┴──────────────────────┘
--- 'directors' nested null percentages ---
Total entries: 437660




shape: (1, 3)
┌──────────────────────┬────────────────────┬────────────────────┐
│ primaryName_null_pct ┆ birthYear_null_pct ┆ deathYear_null_pct │
│ ---                  ┆ ---                ┆ ---                │
│ f64                  ┆ f64                ┆ f64                │
╞══════════════════════╪════════════════════╪════════════════════╡
│ 9.573413             ┆ 57.729288          ┆ 79.481333          │
└──────────────────────┴────────────────────┴────────────────────┘
--- 'writers' nested null percentages ---
Total entries: 675913
shape: (1, 2)
┌──────────────────────┬────────────────────┐
│ primaryName_null_pct ┆ birthYear_null_pct │
│ ---                  ┆ ---                │
│ f64                  ┆ f64                │
╞══════════════════════╪════════════════════╡
│ 8.495176             ┆ 56.815744          │
└──────────────────────┴────────────────────┘
--- 'episodes' not found. Skipping analysis. ---


## Step 6: Standardize Null Lists to Empty Lists

**What:** Replace any `null` values in list-type columns with empty lists (`[]`).
**Why:** Ensures consistency. An empty list (`[]`) and a `null` list conceptually mean "no items", but `null` can cause errors in list processing functions. Using `[]` is safer and maintains the list data type.
**How:** Identify list columns and use `fill_null([])`.

In [8]:
print("\n--- Step 6: Standardizing null lists to empty lists [] ---")
list_cols = [col for col in df.columns if df[col].dtype == pl.List]
expressions = []
for col_name in list_cols:
    if col_name in df.columns:
        expressions.append(pl.col(col_name).fill_null([]))
if expressions:
    df = df.with_columns(expressions)
    print(f"Standardized nulls for: {list_cols}")


--- Step 6: Standardizing null lists to empty lists [] ---
Standardized nulls for: ['cast', 'directors', 'writers']


## Step 7: Sanitize & Standardize Top-Level Columns

**What:** Apply various cleaning operations to top-level columns:
* **Text:** Trim whitespace, convert to lowercase (or uppercase for regions).
* **Genres:** Split, trim, lowercase, sort alphabetically, and rejoin to handle order differences (e.g., `crime,drama` vs `drama,crime`).
* **Years:** Convert `endYear` to integer, fill nulls with -1, and sanitize `startYear` and `endYear` by setting impossible dates (e.g., future years, `endYear < startYear`) to -1.
* **Runtime:** Fill nulls with -1 and sanitize by setting implausible runtimes (e.g., <= 1 min) to -1.
**Why:** To ensure consistency in text formatting, handle data type issues, and correct obvious data entry errors or illogical values.
**How:** Using Polars string functions (`.str`), type conversions (`.str.to_integer`), `.map_elements()` for custom logic, and conditional logic (`pl.when`).

In [9]:
print("\n--- Step 7: Sanitizing top-level text, numerics, and data types ---")
current_year = datetime.datetime.now().year + 3  # Latest plausible year: current year plus a 3-year buffer
standardize_exprs = []

if 'genres' in df.columns:
    standardize_exprs.append(
        pl.col('genres').fill_null("missing").str.strip_chars().str.to_lowercase()
          .map_elements(lambda s: ",".join(sorted([part.strip() for part in s.split(',')])), return_dtype=pl.String)
          .alias('genres')
    )
if 'titleType' in df.columns:
    standardize_exprs.append(
        pl.col('titleType').fill_null("missing").str.strip_chars().str.to_lowercase().alias('titleType')
    )
if 'language' in df.columns:
     standardize_exprs.append(
        pl.col('language').fill_null("missing").str.strip_chars().str.to_lowercase().alias('language')
    )
if 'types' in df.columns:
     standardize_exprs.append(
        pl.col('types').fill_null("missing").str.strip_chars().str.to_lowercase().alias('types')
    )
if 'region' in df.columns:
    standardize_exprs.append(
        pl.col('region').fill_null("missing").str.strip_chars().str.to_uppercase().alias('region')
    )
if 'endYear' in df.columns:
    standardize_exprs.append(
        pl.col('endYear').str.to_integer(strict=False).fill_null(-1).alias('endYear')
    )
if 'startYear' in df.columns:
    start_year = pl.col('startYear').fill_null(-1)
    standardize_exprs.append(
        # Years beyond the buffer or before 1870 become -1 (the -1 placeholder itself is kept)
        pl.when((start_year > current_year) | ((start_year < 1870) & (start_year != -1)))
          .then(-1)
          .otherwise(start_year)
          .alias('startYear')
    )
if 'runtimeMinutes' in df.columns:
    runtime = pl.col('runtimeMinutes').fill_null(-1)
    standardize_exprs.append(
        # Runtimes of <= 1 minute or > 30000 minutes are treated as invalid
        pl.when((runtime <= 1) | (runtime > 30000))
          .then(-1)
          .otherwise(runtime)
          .alias('runtimeMinutes')
    )

if standardize_exprs:
    df = df.with_columns(standardize_exprs)
    print("Top-level text, numeric, and type sanitization complete.")

# Fix illogical date ranges (endYear < startYear)
if 'endYear' in df.columns and 'startYear' in df.columns:
    df = df.with_columns(
        pl.when(
            (pl.col('endYear') < pl.col('startYear')) &
            (pl.col('endYear') != -1) &
            (pl.col('startYear') != -1)
        )
        .then(-1)
        .otherwise(pl.col('endYear'))
        .alias('endYear')
    )
    print("Fixed illogical date ranges.")


--- Step 7: Sanitizing top-level text, numerics, and data types ---
Top-level text, numeric, and type sanitization complete.


## Step 8: Show Potential Outliers

**What:**
* Identify and display rows that are potential outliers based on defined thresholds for runtime, vote count, and start year.

**Why:**
* Allows for further manual inspection of potentially erroneous or unusual data points without altering them at this stage.

**How:** Using filtering based on combined conditions.

In [10]:
print("\n--- Step 8: Showing outliers ---")

# Show Potential Outliers
print("\n--- Showing potential outlier rows ---")
MAX_RUNTIME_MINS = 600
MAX_VOTES = 5_000_000
MAX_YEAR = datetime.datetime.now().year + 5

outlier_conditions = []
if 'runtimeMinutes' in df.columns:
    outlier_conditions.append(
        (pl.col('runtimeMinutes') > MAX_RUNTIME_MINS) & (pl.col('runtimeMinutes') != -1)
    )
if 'numVotes' in df.columns:
     outlier_conditions.append(
        (pl.col('numVotes') > MAX_VOTES) & (pl.col('numVotes') != -1)
     )
if 'startYear' in df.columns:
     outlier_conditions.append(
        (pl.col('startYear') > MAX_YEAR) & (pl.col('startYear') != -1)
     )

if outlier_conditions:
    combined_condition = reduce(operator.or_, outlier_conditions)
    outlier_rows = df.filter(combined_condition)

    if outlier_rows.height > 0:
        print(f"Found {outlier_rows.height} rows flagged as potential outliers. Showing first 10:")
        cols_to_show = [col for col in ['startYear', 'runtimeMinutes', 'numVotes'] if col in df.columns]
        if 'titleType' in df.columns: cols_to_show = ['titleType'] + cols_to_show # Add context
        print(outlier_rows.select(cols_to_show).head(10))
    else:
        print("No rows flagged as outliers.")
else:
    print("No outlier conditions could be checked (columns missing).")


--- Step 8: Showing outliers ---

--- Showing potential outlier rows ---
Found 52 rows flagged as potential outliers. Showing first 10:
shape: (10, 3)
┌───────────┬────────────────┬──────────┐
│ startYear ┆ runtimeMinutes ┆ numVotes │
│ ---       ┆ ---            ┆ ---      │
│ i64       ┆ i64            ┆ i64      │
╞═══════════╪════════════════╪══════════╡
│ 2020      ┆ 960            ┆ null     │
│ 2023      ┆ 608            ┆ null     │
│ 2015      ┆ 1234           ┆ 166      │
│ 2015      ┆ 1151           ┆ null     │
│ 1987      ┆ 5220           ┆ 440      │
│ 2008      ┆ 840            ┆ 64       │
│ 2017      ┆ 623            ┆ null     │
│ 2019      ┆ 1260           ┆ 67       │
│ 1967      ┆ 1500           ┆ 111      │
│ 1971      ┆ 776            ┆ 1662     │
└───────────┴────────────────┴──────────┘


## Step 9: Comprehensive Nested List Cleaning

**What:** Apply detailed cleaning *inside* the list structures (`cast`, `directors`, `writers`, `episodes`):
* **Drop Nested IDs:** Implicitly drop `ordering`, `nconst`, `tconst` by rebuilding structs without them.
* **Impute Nulls:** Fill nulls with "missing" (string) or -1 (numeric).
* **Standardize Text:** Trim whitespace, lowercase relevant fields.
* **Harmonize Categories:** Map `cast.category` ('actress'->'actor'). Map `cast.job` using the `map_job` function. Map `cast.primaryProfession` using the `clean_professions` function (handles comma-separated values).
**Why:** This is the core cleaning step for the complex nested data: it enforces consistent values, fixes errors, and prepares the data for potential flattening or feature engineering later.
**How:** Primarily using `list.eval()` combined with `pl.struct()` to rebuild the nested structs. Helper Python functions (`map_job`, `clean_professions`) are applied using `.map_elements()`, alongside string methods (`.str.strip_chars`, `.str.to_lowercase`, `.str.replace_all` with regex patterns) and `.replace()` for category mapping.

In [11]:
print("\n--- Step 9: Advanced cleaning of all nested lists ---")
# --- Profession Map ---
profession_map = {
    "actor": "actor", "actress": "actor", "director": "director", "writer": "writer",
    "producer": "producer", "composer": "composer", "cinematographer": "cinematographer",
    "editor": "editor", "casting_director": "casting_director", "casting_department": "casting_director",
    "production_designer": "production_designer", "art_director": "art_department",
    "set_decorator": "art_department", "art_department": "art_department",
    "costume_designer": "costume_designer", "costume_department": "costume_designer",
    "make_up_department": "make_up_department", "sound_department": "sound_crew",
    "music_department": "sound_crew", "camera_department": "camera_crew",
    "editorial_department": "editorial_crew", "animation_department": "vfx_animation_crew",
    "visual_effects": "vfx_animation_crew", "special_effects": "vfx_animation_crew",
    "assistant_director": "production_crew", "production_manager": "production_crew",
    "production_department": "production_crew", "location_management": "production_crew",
    "transportation_department": "production_crew", "script_department": "production_crew",
    "stunts": "stunts", "soundtrack": "soundtrack", "archive_footage": "archive_footage",
    "miscellaneous": "other", "talent_agent": "other_business", "manager": "other_business",
    "publicist": "other_business", "legal": "other_business", "executive": "other_business",
}
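def clean_professions(prof_string, mapping):
    """Map a comma-separated profession string to sorted, de-duplicated harmonized categories."""
    if prof_string is None or prof_string == "missing": return "missing"
    professions = prof_string.split(',')
    # Unmapped professions fall back to "other"; the set removes duplicates created by the mapping
    mapped_professions = {mapping.get(prof.strip(), "other") for prof in professions}
    return ",".join(sorted(mapped_professions))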

# --- Job Map ---
def map_job(job_str):
    if job_str is None or job_str == "missing" or job_str.strip() == "": return "missing"
    job_str = job_str.lower()
    if "screenplay" in job_str or "screen play" in job_str: return "screenplay"
    if "story" in job_str: return "story"
    if "writer" in job_str or "written by" in job_str or "scenario" in job_str: return "writer"
    if "adaptation" in job_str or "dialogue" in job_str or "script" in job_str: return "writer"
    if "director of photography" in job_str or "cinematographer" in job_str: return "cinematographer/dp"
    # Specific director-like roles must be tested before the generic "director" match
    if "casting director" in job_str or "casting_director" in job_str: return "casting_director"
    if "director" in job_str: return "director"
    if "executive producer" in job_str: return "executive_producer"
    if "line producer" in job_str: return "line_producer"
    if "producer" in job_str: return "producer"
    if "composer" in job_str: return "composer"
    if "editor" in job_str: return "editor"  # also matches "film editor"
    if "production designer" in job_str or "production_designer" in job_str: return "production_designer"
    if "novel" in job_str or "book" in job_str or "manga" in job_str: return "source_material (novel/book)"
    if "play" in job_str: return "source_material (play)"
    if "characters" in job_str or "created by" in job_str or "creator" in job_str: return "creator"
    if "based on" in job_str: return "source_material (based on)"
    if "titles" in job_str: return "titles"
    if "idea" in job_str: return "idea"
    return "other"

cleaning_expressions = []

if 'cast' in df.columns:
    print("\nCleaning 'cast' (with metadata extraction)...")
    cleaning_expressions.append(
        pl.col('cast').list.eval(
            pl.struct([
                pl.element().struct.field('category').fill_null("missing").str.strip_chars().str.to_lowercase().replace({"actress": "actor"}).alias('category'),
                pl.element().struct.field('job').fill_null("missing").str.strip_chars()
                    .map_elements(map_job, return_dtype=pl.String)
                    .alias('job'),
                pl.element().struct.field('characters').fill_null("missing").str.strip_chars()
                    .str.replace_all(r"\(voice\)", "").str.replace_all(r"\(uncredited\)", "").str.replace_all(r"\(archive footage\)", "")
                    .str.strip_chars()
                    .alias('characters'),
                pl.element().struct.field('primaryName').fill_null("missing").str.strip_chars()
                    .alias('primaryName'),
                pl.element().struct.field('primaryProfession').fill_null("missing").str.strip_chars().str.to_lowercase()
                    .map_elements(lambda s: clean_professions(s, profession_map), return_dtype=pl.String)
                    .alias('primaryProfession'),
                pl.element().struct.field('birthYear').fill_null(-1).alias('birthYear'),
                pl.element().struct.field('deathYear').fill_null(-1).alias('deathYear'),
             ])
        ).alias('cast')
    )

if 'directors' in df.columns:
    print("Cleaning 'directors' (with name standardization)...")
    cleaning_expressions.append(
        pl.col('directors').list.eval(
             pl.struct([
                 pl.element().struct.field('primaryName').fill_null("missing").str.strip_chars()
                     .alias('primaryName'),
                 pl.element().struct.field('birthYear').fill_null(-1).alias('birthYear'),
                 pl.element().struct.field('deathYear').fill_null(-1).alias('deathYear'),
             ])
        ).alias('directors')
    )

if 'writers' in df.columns:
    print("Cleaning 'writers' (with name standardization)...")
    cleaning_expressions.append(
        pl.col('writers').list.eval(
            pl.struct([
                 pl.element().struct.field('primaryName').fill_null("missing").str.strip_chars()
                     .alias('primaryName'),
                 pl.element().struct.field('birthYear').fill_null(-1).alias('birthYear'),
                 pl.element().struct.field('deathYear').fill_null(-1).alias('deathYear'),
             ])
        ).alias('writers')
    )

if 'episodes' in df.columns:
    print("Cleaning 'episodes'...")
    cleaning_expressions.append(
        pl.col('episodes').list.eval(
           pl.struct([
               pl.element().struct.field('seasonNumber').fill_null(-1).alias('seasonNumber'),
               pl.element().struct.field('episodeNumber').fill_null(-1).alias('episodeNumber')
           ])
        ).alias('episodes')
    )
else:
    print("Skipping 'episodes' cleaning (column may have been dropped).")

if cleaning_expressions:
    df = df.with_columns(cleaning_expressions)
    print("\nNested list cleaning complete.")


--- Step 9: Advanced cleaning of all nested lists ---

Cleaning 'cast' (with metadata extraction)...
Cleaning 'directors' (with name standardization)...
Cleaning 'writers' (with name standardization)...
Skipping 'episodes' cleaning (column may have been dropped).

Nested list cleaning complete.


## Step 10: Build and Apply Sklearn Pipeline for Imputation

**What:** Create and apply an `sklearn` pipeline to perform final imputation on any remaining missing values in top-level **numeric** and **categorical** columns.
**Why:** This uses standard placeholder values (-1 for numeric, "missing" for categorical) as a final catch-all. Putting the step in an `sklearn` pipeline makes it reproducible: with `strategy='constant'` nothing is learned from the data during `.fit()`, so `.transform()` applies exactly the same placeholders to training data and to new data.
**How:** 
* Identify numeric and categorical columns dynamically.
* Use `SimpleImputer(strategy='constant')` within a `ColumnTransformer` to apply the correct placeholder to the correct column types.
* Set `remainder='passthrough'` to keep columns not explicitly handled (like list columns or boolean flags).
* Apply the pipeline using `.fit_transform()` (in a real scenario, you'd `.fit()` on train and `.transform()` on train/test/new data).

In [12]:
print("\n--- Step 10: Building imputation pipeline for top-level columns ---")
numeric_features = [
    col for col, dtype in df.schema.items()
    if dtype in [pl.Int64, pl.Float64, pl.Int32, pl.Float32]
]
categorical_features = [
    col for col, dtype in df.schema.items()
    if dtype in [pl.String, pl.Categorical]
]
print(f"Numeric features for pipeline: {numeric_features}")
print(f"Categorical features for pipeline: {categorical_features}")

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=-1)),
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough' # Keep columns not listed above (e.g., the list columns)
)

cleaning_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)
])


--- Step 10: Building imputation pipeline for top-level columns ---
Numeric features for pipeline: ['isAdult', 'startYear', 'runtimeMinutes', 'averageRating', 'numVotes']
Categorical features for pipeline: ['title', 'types', 'genres']
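
To illustrate the fit/transform split with constant imputation, here is a minimal sketch on small NumPy arrays (invented values, outside the notebook's Polars pipeline). Fitting on one array and transforming another applies the same placeholder, and the fitted `statistics_` attribute simply holds the fill value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative toy arrays standing in for train data and unseen data
X_train = np.array([[2001.0, np.nan], [np.nan, 95.0]])
X_new = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="constant", fill_value=-1)
imputer.fit(X_train)

filled_train = imputer.transform(X_train)
filled_new = imputer.transform(X_new)

print(filled_train)
print(filled_new)
print(imputer.statistics_)
```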


In [13]:
print("\nFinal sklearn pipeline created:")
cleaning_pipeline


Final sklearn pipeline created:


In [14]:
# --- Apply the pipeline ---
try:
    df_cleaned = cleaning_pipeline.fit_transform(df)
except Exception as e:
    print(f"\nCould not apply pipeline: {e}")
    df_cleaned = df # Keep the pre-pipeline state for inspection

print("\n--- Final Cleaned Schema ---")
print(df_cleaned.schema)

print("\n--- Final Data Preview ---")
print(df_cleaned.head())


--- Final Cleaned Schema ---
Schema([('num__isAdult', Float64), ('num__startYear', Float64), ('num__runtimeMinutes', Float64), ('num__averageRating', Float64), ('num__numVotes', Float64), ('cat__title', String), ('cat__types', String), ('cat__genres', String), ('remainder__cast', List(Struct({'category': String, 'job': String, 'characters': String, 'primaryName': String, 'primaryProfession': String, 'birthYear': Int64, 'deathYear': Int64}))), ('remainder__directors', List(Struct({'primaryName': String, 'birthYear': Int64, 'deathYear': Int64}))), ('remainder__writers', List(Struct({'primaryName': String, 'birthYear': Int64, 'deathYear': Int64})))])

--- Final Data Preview ---
shape: (5, 11)
┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│ num__i ┆ num__s ┆ num__r ┆ num__a ┆ num__n ┆ cat__t ┆ cat__t ┆ cat__g ┆ remain ┆ remain ┆ remain │
│ sAdult ┆ tartYe ┆ untime ┆ verage ┆ umVote ┆ itle   ┆ ypes   ┆ enres  ┆ der__c ┆ der__d ┆ der__w

In [15]:
df_cleaned.write_parquet("imdb_us_movies_cleaned.parquet")