# Movie Data Imputation and Preprocessing

**Last Updated**: November 2024  
**Data Source**: DuckDB Database (`data/db/movies.duckdb`)

## Overview
This notebook performs comprehensive data imputation and preprocessing for movie revenue prediction modeling. 

## Modern Data Architecture
This project now uses a **modern, database-centric approach**:
- **DuckDB** as the primary data source (replacing CSV files)
- **Parquet** format for processed data (faster, smaller than CSV)
- **Dedicated utilities** in `src/data/query_utils.py` for data access
- **Structured directories**: `raw/`, `processed/`, `artifacts/`, `db/`

See `DATA_GUIDE.md` for complete documentation.

## Workflow
1. Load data from DuckDB (joined tables: movies, tmdb_movies, omdb_movies)
2. Explore missing data patterns
3. Engineer features (awards, dates, multi-label categories)
4. Impute missing values using ML algorithms
5. Create train/test splits
6. Save processed data and artifacts

In [None]:
import os
import re
import pandas as pd
import matplotlib.pyplot as plt

from verstack import NaNImputer  # used by impute_data() further down
from functools import partial

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

from sklearn.model_selection import train_test_split

In [None]:
# Set the working directory to the project root to ensure consistency across environments for project-specific imports
from pyprojroot import here
os.chdir(here())

# Project specific imports
from ayne.core.config import settings
from ayne.utils.query_utils import (
    load_full_dataset,
    get_movies_with_financials,
    get_db_client
)
from ayne.utils.io import (
    save_processed_data,
    save_artifacts
)

## Available Data Query Functions

The `src/data/query_utils.py` module provides convenient functions for working with the DuckDB database:

**Loading Data:**
- `load_full_dataset()` - Load all movies with joined TMDB/OMDB/Numbers data
- `get_movies_with_financials()` - Filter movies with budget/revenue data
- `get_movies_by_year_range()` - Get movies from specific years
- `execute_custom_query()` - Run custom SQL queries

**Saving Data:**
- `save_processed_data()` - Save to `data/processed/` (for analysis-ready data)
- `save_artifacts()` - Save to `data/artifacts/` (for model outputs)

**Database:**
- `get_db_client()` - Get DuckDB connection for custom queries
- `get_table_info()` - View table schema

The data-loading functions return pandas DataFrames and handle connection management automatically.

## Data Loading from DuckDB

The project now uses DuckDB as the primary data source instead of CSV files. This provides:
- **Faster queries**: Columnar storage optimized for analytics
- **Structured data**: SQL queries for flexible data retrieval
- **Single source of truth**: All data consolidated in one database
- **Better data types**: No type inference issues from CSV

The database contains these tables:
- `movies` - Core movie information
- `tmdb_movies` - TMDB API data (details, ratings, genres)
- `omdb_movies` - OMDB API data (cast, crew, reviews)
- `numbers_movies` - Financial data from The Numbers

The `load_full_dataset()` function automatically joins all tables.
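
To picture what such a join produces, here is a minimal pandas sketch on toy frames. It does not show the implementation of `load_full_dataset()`. The join key `movie_id` and all values are hypothetical, and the real function presumably runs an equivalent SQL join inside DuckDB.

```python
import pandas as pd

# Toy stand-ins for the database tables; `movie_id` is a hypothetical join key.
movies = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["A", "B", "C"]})
tmdb_movies = pd.DataFrame({"movie_id": [1, 2], "tmdb_vote_count": [120, 45]})
numbers_movies = pd.DataFrame({"movie_id": [1], "budget": [1_000_000]})

# Left joins keep every row of `movies`; sources without a match yield NaN.
joined = (
    movies
    .merge(tmdb_movies, on="movie_id", how="left")
    .merge(numbers_movies, on="movie_id", how="left")
)
print(joined)
```

Left joins of this kind are one reason the combined dataset can contain many missing values: a movie that is absent from one source still appears, with blanks in that source's columns.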

In [None]:
# Load data from DuckDB
# The DuckDB database contains all movie data from TMDB, OMDB, and The Numbers
# This replaces the old approach of loading from CSV files

try:
    print("Loading movie data from DuckDB...")
    print(f"Database location: {settings.duckdb_path}")  # type: ignore
    
    # Load complete dataset with all joined tables
    data = load_full_dataset(include_nulls=True)
    
    print(f"✅ Successfully loaded {len(data)} movies")
    print(f"Columns: {len(data.columns)}")
    
except Exception as e:
    print(f"❌ Error loading data: {e}")
    raise

In [None]:
data.info()

In [None]:
# Count missing values
data.isnull().sum()

## Exploration of missing data

I got to know this data during the collection process, but there are a few aspects I'd like to look into in more detail. Unfortunately, there is quite a bit of missing data, especially in the `budget` and `revenue` columns, which I intend to use as my main prediction targets when training models. This is not a huge surprise, since companies tend to be quite guarded about the financial data of their projects. I will be looking into other sources of financial data in the future.

Let's have a look at the total amount of missing data. Hopefully the blanks in the two columns intersect.

In [None]:
# Assuming 'data' is your DataFrame
missing_budget_revenue_count = data[data['budget'].isna() & data['revenue'].isna()].shape[0]
missing_budget_count = data['budget'].isna().sum() - missing_budget_revenue_count
missing_revenue_count = data['revenue'].isna().sum() - missing_budget_revenue_count

# Calculate the total number of rows with missing values in either 'budget' or 'revenue'
total_missing_count = missing_budget_count + missing_revenue_count + missing_budget_revenue_count

print(f"- Number of rows where both budget and revenue are missing: {missing_budget_revenue_count}")
print(f"- Number of additional rows where budget is missing: {missing_budget_count}")
print(f"- Number of additional rows where revenue is missing: {missing_revenue_count}")
print(f"Total number of rows with missing values in either budget or revenue: {total_missing_count}")

There is indeed an unfortunate number of blanks. Most of them intersect between the two columns, but about 30-35% of rows are still missing at least one of the two values, so that is something I'll have to work around.
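
The same overlap can also be read off a small cross-tabulation of the two missingness flags. The sketch below uses made-up toy values rather than the real dataset, so the numbers it prints say nothing about the actual data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real `budget` and `revenue` columns.
toy = pd.DataFrame({
    "budget":  [1e6, np.nan, np.nan, 5e6, np.nan],
    "revenue": [3e6, np.nan, 2e6, np.nan, np.nan],
})

# Rows: is budget missing? Columns: is revenue missing?
overlap = pd.crosstab(
    toy["budget"].isna().rename("budget_missing"),
    toy["revenue"].isna().rename("revenue_missing"),
)
print(overlap)

# Share of rows where at least one of the two values is missing.
either_missing = (toy["budget"].isna() | toy["revenue"].isna()).mean()
print(f"Share of rows missing budget or revenue: {either_missing:.0%}")
```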

Let's have a look at the amount of rows missing financial data throughout time. I have a sneaking suspicion that it will be more common to not report financial data in more recent years.

In [None]:
# Assuming 'data' is your DataFrame
plot_data = data.copy()
plot_data['release_date'] = pd.to_datetime(plot_data['release_date'])

# Create a new column to indicate if both budget and revenue are missing
plot_data['missing_budget_revenue'] = plot_data['budget'].isna() & plot_data['revenue'].isna()

# Group by release year and calculate the proportion of missing values
plot_data['release_year'] = plot_data['release_date'].dt.year
missing_by_year = plot_data.groupby('release_year')['missing_budget_revenue'].mean()

# Plot the proportion of missing values over time
plt.figure(figsize=(10, 6))
plt.plot(missing_by_year.index, missing_by_year.values, marker='o', linestyle='-')
plt.xlabel('Release Year')
plt.ylabel('Proportion of Missing Budget and Revenue')
plt.title('Proportion of Missing Budget and Revenue Over Time')
plt.grid(True)
plt.show()

I find it quite interesting that movie financial data has not been reported as frequently in recent years, especially around the COVID-19 pandemic. The movie industry has grown significantly, and so have the costs of producing films. Nowadays, blockbuster movies are often financed through external companies, making financial details a more sensitive topic. During the pandemic, the movie industry contracted, and the quality of movies declined. Consequently, movies have not been performing well financially of late, and companies like Disney may choose not to report this information to protect their stock value.

In [None]:
# data = data.dropna(subset=['imdb_votes'])
# data['imdb_votes'] = data['imdb_votes'].str.replace(',', '').astype(int)

# Limit to the top 20 movies
top_movies = data[data['revenue'].isnull()].sort_values(by='tmdb_vote_count', ascending=False).head(20)

# Create a pivot table
pivot_table = top_movies.pivot_table(index='title', values=['revenue', 'budget', 'tmdb_vote_count', 'release_date'], aggfunc='first')

# Sort the pivot table by tmdb_vote_count
sorted_pivot_table = pivot_table.sort_values(by='tmdb_vote_count', ascending=False)

# Print the result
print(sorted_pivot_table)

## Multi-label Categorical Features

There are a number of multi-label categorical features that need to be looked into.

In [None]:
def count_unique_values_for_feature(df: pd.DataFrame, feature: str, delimiter: str = ",") -> int:
    """
    Splits the specified feature column by the delimiter and returns the number of unique values.

    Args:
        df (pd.DataFrame): The DataFrame containing the data.
        feature (str): The name of the column to process.
        delimiter (str): The delimiter used to separate multiple values in the column.

    Returns:
        int: The number of unique values.
    """
    return len(df[feature].dropna().str.split(rf"{delimiter}\s*").explode().unique())

# List of features you want to analyze:
features = [
    "genre_names",
    "production_company_name",
    "production_country_name",
    "spoken_languages",
    "director",
    "writer",
    "actors",
]

# Create a dictionary with the counts for each feature:
unique_counts = {feature: count_unique_values_for_feature(data, feature) for feature in features}

# Display the results:
for feature, count in unique_counts.items():
    print(f"{feature}: {count} unique values")

Some of these features contain thousands of unique values, making it impractical to encode them directly. A good approach is to analyze the distribution of observations per category and retain only the most frequent categories. The less frequent categories can be grouped into a single **"Other"** category. This method helps to reduce the dimensionality of the data while preserving the most significant information.

Let's take a look at the distributions per category:

In [None]:
def print_top_categories(df: pd.DataFrame, column: str, top_n: int, delimiter: str = ",", others_label: str = "Others") -> None:
    """
    Prints the top_n unique values from a multi-label column and the total count of values 
    that fall outside the top_n (which would be grouped as 'Others').

    Args:
        df (pd.DataFrame): The DataFrame containing your data.
        column (str): The name of the multi-label column.
        top_n (int): The number of top categories to display.
        delimiter (str): The delimiter separating multiple values (default is a comma).
        others_label (str): The label used for less frequent values.
    """
    # Split the column into individual values and count frequencies
    exploded = df[column].dropna().str.split(rf"{delimiter}\s*").explode().str.strip()
    counts = exploded.value_counts()
    
    # Get the top N categories and the sum for the rest
    top_categories = counts.head(top_n)
    others_count = counts[counts.index.difference(top_categories.index)].sum()
    
    print("--------------------------------------------------||")
    print(f"Top {top_n} unique values for '{column}':")
    print(top_categories)
    print(f"Total count of all other values (will be grouped as '{others_label}'): {others_count}")
    print("--------------------------------------------------||\n")


top_values = {
    "genre_names": 20,
    "production_company_name": 20,
    "production_country_name": 10,
    "spoken_languages": 10,
    "director": 20,
    "writer": 20,
    "actors": 20
}

for feature, top_n in top_values.items():
    print_top_categories(data, feature, top_n)

- `genre_names`: Will be kept as is since the number of unique categories is perfectly manageable.
- `production_company_name`: Has a relatively even distribution, making it more useful for specific company investigations rather than model training. Therefore, it will be discarded for model training.
- `production_country_name`: Most values are concentrated in the US and the UK, making it a great candidate for consolidating less frequent categories into an "Other" category.
- `spoken_languages`: Another candidate for consolidation, potentially grouping into the top 5 most common languages.
- `director`: Similar to `production_company_name`, it has a wide distribution and will be discarded for model training.
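
For a column such as `genre_names` that is kept as is, one possible encoding is one 0/1 column per label. The sketch below is an illustration on toy values, not a step of this notebook, and it assumes labels are separated by a comma with optional whitespace.

```python
import pandas as pd

# Toy multi-label column; the values are made up for illustration.
genres = pd.Series(["Action, Comedy", "Drama", None, "Comedy, Drama"], name="genre_names")

# Normalise the separator, then create one 0/1 column per genre.
multi_hot = (
    genres.fillna("")
    .str.replace(r",\s*", ",", regex=True)
    .str.get_dummies(sep=",")
)
print(multi_hot)
```

Rows with a missing value end up with zeros in every genre column, which keeps them distinguishable from rows that carry at least one label.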

## Preprocessing Pipeline

For this project, I would like to make extensive use of the pipeline functionality in scikit-learn to re-familiarize myself with the tool. I will also use `FunctionTransformer` so that custom steps can be part of the final pipeline.

### NaNImputer from verstack

The `NaNImputer` from the `verstack` library is a tool designed to handle missing values in a DataFrame. It provides various strategies for imputing missing values, including simple statistical methods and more advanced techniques. It automates the entire process and makes decisions on its own about the best approach for each column.

Due to the nature of the data, each observation in columns like `budget`, `revenue`, and the various critic scores is very individual, so filling blanks with a single **median** or **mean** value is not appropriate. Therefore, I want to make use of machine learning algorithms. `NaNImputer` predicts each column's missing values from the other columns with a machine learning model, similar in spirit to scikit-learn's `IterativeImputer`, which makes it a more robust option.
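
To illustrate the difference between a single fill value and a model-based estimate, here is a minimal sketch using scikit-learn's `IterativeImputer` as a stand-in. It is not `NaNImputer`, and the toy data (including the 2.5x relationship between the columns) is invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Toy data: revenue depends on budget, plus noise.
rng = np.random.default_rng(0)
budget = rng.uniform(1, 100, size=50)
toy = pd.DataFrame({
    "budget": budget,
    "revenue": 2.5 * budget + rng.normal(0, 5, size=50),
})
toy.loc[::7, "revenue"] = np.nan  # knock out every 7th revenue value

# Each missing revenue is estimated from the row's budget.
imputer = IterativeImputer(random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print("A median fill would use one value everywhere:", round(toy["revenue"].median(), 1))
print(imputed.loc[::7].round(1))
```

The model-based estimates vary with each row's budget, whereas a median fill would assign the same revenue to a small production and a blockbuster alike.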

In [None]:
# Define a function that imputes missing values with verstack's NaNImputer.
def impute_data(df: pd.DataFrame, columns_to_exclude: list = None) -> pd.DataFrame:
    if columns_to_exclude:
        # drop() already returns a new DataFrame, so the input stays untouched
        df = df.drop(columns=columns_to_exclude)
    else:
        df = df.copy()
    imputer = NaNImputer()
    df = imputer.impute(df)
    return df

imputation_transformer = FunctionTransformer(impute_data, validate=False)

#### Data type conversion transformers

Some features need to be reformatted; the functions below deal with that.

In [None]:
# Some columns need to get converted to numeric
def convert_to_numeric(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in df.columns:
        # Convert to string, remove commas, then convert to numeric
        df[col] = pd.to_numeric(df[col].astype(str).str.replace(',', ''), errors='coerce')
    return df

to_numeric = FunctionTransformer(convert_to_numeric, validate=False)

#### Missing indicator features
I want to set up binary features for the columns where I have a large number of blanks to be able to investigate the imputed data later on.

In [None]:
# Define a function to add missing indicators for certain columns.
def add_missing_indicators(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in df.columns:
        df[col + "_missing"] = df[col].isnull().astype(int)
    return df

missing_indicator_transformer = FunctionTransformer(add_missing_indicators, validate=False)

# iter_cols = ['metascore', 'rotten_tomatoes_rating', 'meta_critic_rating', 'budget', 'revenue']

### Feature Engineering

#### Award features
The `awards` column can be used to extract various columns for BAFTAs, Oscars and total awards and nominations.
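
As a quick illustration of the idea, the sketch below parses a single sample string. The sample follows the format I assume OMDB uses for its awards text, and the patterns are a simplified illustration rather than the full extraction logic.

```python
import re

# Assumed OMDB-style awards text (illustrative sample).
sample = "Won 3 Oscars. 56 wins & 100 nominations total"

oscar_wins = re.search(r"Won\s+(\d+)\s+Oscars?", sample)
total_wins = re.search(r"(\d+)\s+wins?", sample)
total_noms = re.search(r"(\d+)\s+nominations?", sample)

# Fall back to 0 whenever a pattern is absent from the text.
parsed = {
    "oscar_wins": int(oscar_wins.group(1)) if oscar_wins else 0,
    "total_wins": int(total_wins.group(1)) if total_wins else 0,
    "total_noms": int(total_noms.group(1)) if total_noms else 0,
}
print(parsed)  # {'oscar_wins': 3, 'total_wins': 56, 'total_noms': 100}
```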

In [None]:
def extract_awards_info(awards_str):
    """
    Extracts numerical awards information from a text string.

    Parameters
    ----------
    awards_str : str
        The awards description string.

    Returns
    -------
    pd.Series
        A Series with the following index:
        ["total_wins", "total_noms", "oscar_wins", "oscar_noms", "bafta_wins", "bafta_noms"]
    """
    # Handle missing or "N/A" values.
    if pd.isna(awards_str) or awards_str.strip() in ["N/A", ""]:
        return pd.Series([0, 0, 0, 0, 0, 0],
                         index=["total_wins", "total_noms", "oscar_wins", "oscar_noms", "bafta_wins", "bafta_noms"])
    
    # Extract overall totals.
    # Look for a pattern like "56 wins" (we use negative lookahead to avoid picking up Oscar wins)
    total_wins_match = re.search(r'(\d+)\s+wins?(?!.*Oscars)', awards_str, flags=re.IGNORECASE)
    # Accept both "1 nomination" and "28 nominations".
    total_noms_match = re.search(r'(\d+)\s+nominations?', awards_str, flags=re.IGNORECASE)
    total_wins = int(total_wins_match.group(1)) if total_wins_match else 0
    total_noms = int(total_noms_match.group(1)) if total_noms_match else 0

    # Oscar-specific extraction:
    oscar_noms_match = re.search(r'Nominated for\s+(\d+)\s+Oscars?', awards_str, flags=re.IGNORECASE)
    oscar_noms = int(oscar_noms_match.group(1)) if oscar_noms_match else 0
    # Look for something like "Won 3 Oscars". Matching "Oscars. 56 wins" instead
    # would return the overall win total that follows the Oscar sentence.
    oscar_wins_match = re.search(r'Won\s+(\d+)\s+Oscars?', awards_str, flags=re.IGNORECASE)
    oscar_wins = int(oscar_wins_match.group(1)) if oscar_wins_match else 0

    # BAFTA-specific extraction:
    # For nominations, sometimes the text might run together (e.g. "BAFTA Award28 nominations total")
    bafta_noms_match = re.search(r'Nominated for\s+(\d+)\s*BAFTA', awards_str, flags=re.IGNORECASE)
    bafta_noms = int(bafta_noms_match.group(1)) if bafta_noms_match else 0
    # For wins, look for something like "Won 1 BAFTA Award" for the same reason as above.
    bafta_wins_match = re.search(r'Won\s+(\d+)\s*BAFTA', awards_str, flags=re.IGNORECASE)
    bafta_wins = int(bafta_wins_match.group(1)) if bafta_wins_match else 0

    return pd.Series([total_wins, total_noms, oscar_wins, oscar_noms, bafta_wins, bafta_noms],
                     index=["total_wins", "total_noms", "oscar_wins", "oscar_noms", "bafta_wins", "bafta_noms"])


def transform_awards(X):
    """
    Expects X to be a DataFrame with a single column (e.g., 'awards').
    Applies extract_awards_info row-wise and returns a DataFrame.
    """
    # Apply the function to the first (and only) column
    return X.iloc[:, 0].apply(extract_awards_info)

# Wrap the function in a FunctionTransformer
awards_transformer = FunctionTransformer(transform_awards, validate=False)

#### Multi-label categorical features adjustment

The `FunctionTransformer` below groups a given multi-label feature into its top N categories plus an "Others" category.

In [None]:
def transform_top_categories(X, column, top_n, delimiter=",", others_label="Others"):
    """
    Transforms a multi-label column by keeping only the top_n categories (based on frequency)
    and replacing all other categories with a generic label.
    
    Parameters:
        X (pd.DataFrame): Input DataFrame.
        column (str): The name of the multi-label column to process.
        top_n (int): Number of top categories to keep.
        delimiter (str): Delimiter separating the values.
        others_label (str): Label to assign to categories not among the top_n.
    
    Returns:
        pd.DataFrame: A DataFrame with one column (the processed column).
    """
    X = X.copy()
    # Split the column values, explode, and count frequencies.
    exploded = X[column].dropna().str.split(rf"{delimiter}\s*").explode().str.strip()
    counts = exploded.value_counts()
    top_categories = counts.head(top_n).index.tolist()
    
    def map_categories(cell):
        if pd.isna(cell):
            return cell
        # Split and strip each value.
        cats = [cat.strip() for cat in cell.split(delimiter)]
        # Replace values not in top_categories with others_label.
        new_cats = [cat if cat in top_categories else others_label for cat in cats]
        # Remove duplicates while preserving order.
        seen = set()
        new_cats = [x for x in new_cats if x not in seen and not seen.add(x)]
        return delimiter.join(new_cats)
    
    X[column] = X[column].apply(map_categories)
    # Return a DataFrame with just the transformed column.
    return X[[column]]

# Now, to create a FunctionTransformer for, say, the 'production_country_name' column with top_n=5:
transformer_prod_country = FunctionTransformer(
    func=partial(transform_top_categories, column="production_country_name", top_n=5, delimiter=",", others_label="Others"),
    validate=False
)

# Similarly, for 'spoken_languages' column with top_n=5:
transformer_spoken_lang = FunctionTransformer(
    func=partial(transform_top_categories, column="spoken_languages", top_n=5, delimiter=",", others_label="Others"),
    validate=False
)

#### Release date rework

`release_date` can be split into separate integer columns to be used during training later, and binary features like `is_weekend` and `is_holiday_season` can also be derived.

In [None]:
def add_date_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['release_date'] = pd.to_datetime(df['release_date'])
    df['release_year'] = df['release_date'].dt.year
    df['release_month'] = df['release_date'].dt.month
    df['release_day'] = df['release_date'].dt.day
    # weekday: Monday=0 ... Sunday=6, so >= 4 also counts Friday releases as "weekend"
    df['is_weekend'] = (df['release_date'].dt.weekday >= 4).astype(int)
    # Summer (June, July) and end-of-year (November, December) release windows
    df['is_holiday_season'] = df['release_month'].isin([6, 7, 11, 12]).astype(int)
    # Age relative to a hardcoded reference year
    df['movie_age'] = 2025 - df['release_year']
    return df

# Wrap the function as a transformer
date_features_transformer = FunctionTransformer(add_date_features, validate=False)

#### Return on Investment
Return on investment (ROI) is a great option for looking into the budget-revenue relationship.
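
One thing to watch with a ratio like this is a budget of zero, which makes the division return infinity. The sketch below shows one possible guard on toy values. It assumes that a budget of 0 really means "unknown", which may or may not hold for this dataset.

```python
import numpy as np
import pandas as pd

# Toy values; the second row has a budget of 0.
toy = pd.DataFrame({"budget": [100.0, 0.0, 50.0], "revenue": [250.0, 10.0, 25.0]})

# Assumption: a budget of 0 means "unknown", so treat it as missing.
safe_budget = toy["budget"].replace(0, np.nan)
toy["roi"] = (toy["revenue"] - safe_budget) / safe_budget
print(toy)
```

With this guard, the first row gets an ROI of 1.5, the zero-budget row gets NaN instead of infinity, and the third row gets -0.5.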

In [None]:
def calculate_roi(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['roi'] = (df['revenue'] - df['budget']) / df['budget']
    return df

# Wrap the function as a transformer
roi_transformer = FunctionTransformer(calculate_roi, validate=False)

#### `actors`, `director` and `writer`

I think these features are quite valuable, but they need to be reworked. IMDB lists the people in each of these fields in order of importance and billing, so we can extract the most relevant people in each category.

In [None]:
def extract_actors(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['first_billing_actor'] = df['actors'].apply(lambda x: x.split(',')[0].strip() if pd.notnull(x) and len(x.split(',')) > 0 else None)
    df['second_billing_actor'] = df['actors'].apply(lambda x: x.split(',')[1].strip() if pd.notnull(x) and len(x.split(',')) > 1 else None)
    df['third_billing_actor'] = df['actors'].apply(lambda x: x.split(',')[2].strip() if pd.notnull(x) and len(x.split(',')) > 2 else None)
    df['forth_billing_actor'] = df['actors'].apply(lambda x: x.split(',')[3].strip() if pd.notnull(x) and len(x.split(',')) > 3 else None)
    # df['fifth_billing_actor'] = df['actors'].apply(lambda x: x.split(',')[4].strip() if pd.notnull(x) and len(x.split(',')) > 4 else None)
    return df.drop(columns=['actors'])

def extract_directors(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['main_director'] = df['director'].apply(lambda x: x.split(',')[0].strip() if pd.notnull(x) and len(x.split(',')) > 0 else None)
    df['secondary_director'] = df['director'].apply(lambda x: x.split(',')[1].strip() if pd.notnull(x) and len(x.split(',')) > 1 else None)
    return df.drop(columns=['director'])

def extract_writers(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['main_writer'] = df['writer'].apply(lambda x: x.split(',')[0].strip() if pd.notnull(x) and len(x.split(',')) > 0 else None)
    df['secondary_writer'] = df['writer'].apply(lambda x: x.split(',')[1].strip() if pd.notnull(x) and len(x.split(',')) > 1 else None)
    return df.drop(columns=['writer'])

# Create FunctionTransformers
actors_transformer = FunctionTransformer(extract_actors)
directors_transformer = FunctionTransformer(extract_directors)
writers_transformer = FunctionTransformer(extract_writers)

#### Dropper
Setting up a dropper transformer that can be used in a pipeline.

In [None]:
# Define a function transformer to drop unwanted columns.
def drop_unwanted_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    return df.drop(columns=columns, errors='ignore')

columns_to_drop = ['production_company_name', 'director', 'writer', 'actors', 'title', 'release_date']

dropper = FunctionTransformer(drop_unwanted_columns, kw_args={'columns': columns_to_drop})

### Main ColumnTransformer
This `ColumnTransformer` applies the included steps side by side. Each step works on its own set of columns, so there is no overlap and no chance of the steps interfering with each other. A `Pipeline`, by contrast, applies its steps in order, which I will make use of later on.

In [None]:
main_transformer = ColumnTransformer(
    transformers=[
        ('missing_indicator', missing_indicator_transformer, ['metascore', 'rotten_tomatoes_rating', 'meta_critic_rating', 'budget', 'revenue']),
        ('awards', awards_transformer, ['awards']),
        ('date_feature_engineering', date_features_transformer, ['release_date']),
        ('top_n_prod_country', transformer_prod_country, ['production_country_name']),
        ('top_n_spoken_lang', transformer_spoken_lang, ['spoken_languages']),
        ('actors', actors_transformer, ['actors']),
        ('directors', directors_transformer, ['director']),
        ('writers', writers_transformer, ['writer']),
        ('to_numeric', to_numeric, ['imdb_rating', 'imdb_votes'])
    ],
    remainder='passthrough', 
    verbose_feature_names_out=False
)

# Set output to pandas dataframe
main_transformer.set_output(transform='pandas')

# Apply the preprocessor to the data
# clean_data = main_transformer.fit_transform(data)
# clean_data.head()

In [None]:
clean_data = main_transformer.fit_transform(data)
clean_data.info()

### Pipeline

#### Full data pipeline

The main transformer is applied first so that all feature transformations and engineering steps run on actual data; only then is the missing data imputed. I like this order because it makes sure the generated features are all based on real data.

I would like to experiment a bit with various inputs to the imputation to study the effects on model performance and prediction. Let's start by running the imputation on all available data simultaneously.

In [None]:
# Setting up pipeline
imputation_pipeline = Pipeline(steps=[
    # ('main_transformer', main_transformer),
    ('impute_data', imputation_transformer),
    # ('roi_feature_engineering', roi_transformer),
    # ('dropper', dropper)
])

# Set output to pandas dataframe
imputation_pipeline.set_output(transform='pandas')

In [None]:
imputed_data_full = imputation_pipeline.fit_transform(clean_data)

In [None]:
imputed_data_full.info()

In [None]:
# Save processed dataset using modern utilities
# Parquet format: faster, smaller, preserves types
save_processed_data(imputed_data_full, "imputed_data_full", format="parquet")
print("✅ Saved to data/processed/imputed_data_full.parquet")

#### No Revenue Pipeline
Let's exclude `revenue`, the target of my predictive modelling, to make sure the model cannot gain any information from relationships created by the imputation.

In [None]:
imputed_data_no_revenue = imputation_pipeline.fit_transform(clean_data.drop(columns=['revenue', 'revenue_missing']))

In [None]:
imputed_data_no_revenue['revenue'] = clean_data['revenue']
imputed_data_no_revenue.info()

In [None]:
# Save processed dataset
save_processed_data(imputed_data_no_revenue, "imputed_data_no_revenue", format="parquet")
print("✅ Saved to data/processed/imputed_data_no_revenue.parquet")

### Split data Imputation

In [None]:
# clean_data = clean_data[clean_data['revenue'] >= 50000000]

In [None]:
clean_data = clean_data.dropna(subset=['revenue'])

target = 'revenue'
X = clean_data.drop(columns=[target, 'revenue_missing'], axis=1)
y = clean_data[target]


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

In [None]:
# Fit on the training split only so the test rows cannot leak into the fit
split_pipeline = imputation_pipeline.fit(X_train, y_train)
X_train_imputed = split_pipeline.transform(X_train)
X_test_imputed = split_pipeline.transform(X_test)

In [None]:
X_train_imputed.info()

In [None]:
X_test_imputed.info()

In [None]:
# Save train/test splits as artifacts
# These are model-specific outputs, so they go in artifacts/
save_artifacts(X_train_imputed, "X_train_imputed", format="parquet")
save_artifacts(X_test_imputed, "X_test_imputed", format="parquet")
save_artifacts(y_train, "y_train", format="parquet")
save_artifacts(y_test, "y_test", format="parquet")

print("✅ Saved training artifacts to data/artifacts/")
print(f"   - X_train: {len(X_train_imputed)} rows × {len(X_train_imputed.columns)} columns")
print(f"   - X_test: {len(X_test_imputed)} rows × {len(X_test_imputed.columns)} columns")
print(f"   - y_train: {len(y_train)} rows")
print(f"   - y_test: {len(y_test)} rows")