# ODI Complaints EDA

This notebook covers our initial exploration and first cleaning steps for the complaint dataset. Separate notebooks will hold other work, such as the different models.

The `src/` folder holds the finalized workflow for each step, so a main file can run them all in order and keep everything clean and reproducible. The files there aren't set up for exploring or displaying outputs.

## Quick Setup

1. Run the cells from top to bottom; the first few cells set up the paths.
2. The notebook loads the combined processed parquet file by default.
3. A DataFrame named `df` is created from the combined dataset, with commented-out options for loading the separate datasets.

If the combined parquet file is missing, run the pipeline first:
- Windows: `./scripts/run_pipeline_windows.ps1`
- macOS/Linux: `./scripts/run_pipeline_mac_linux.sh`

In [None]:
# Imports
from pathlib import Path

import pandas as pd
from IPython.display import display

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 220)
pd.set_option("display.max_colwidth", 120)

Load the combined processed dataset

In [None]:
# This notebook assumes the project setup matches the repo defaults

PROJECT_ROOT = Path.cwd()
if not (PROJECT_ROOT / "data" / "processed").exists():
    PROJECT_ROOT = PROJECT_ROOT.parent

PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"
COMBINED_PATH = PROCESSED_DIR / "odi_complaints_combined.parquet"

if not COMBINED_PATH.exists():
    raise FileNotFoundError(
        f"Could not find {COMBINED_PATH}. Run the pipeline first to create processed outputs."
    )

df = pd.read_parquet(COMBINED_PATH)
df = df.drop(columns=["source_zip", "source_file"], errors="ignore")

# Uncomment the following lines to load the individual year-range datasets and compare them separately.
# OLD_PATH = PROCESSED_DIR / "COMPLAINTS_RECEIVED_2020-2024_processed.parquet"
# NEW_PATH = PROCESSED_DIR / "COMPLAINTS_RECEIVED_2025-2026_processed.parquet"
# df_2020_2024 = pd.read_parquet(OLD_PATH)
# df_2025_2026 = pd.read_parquet(NEW_PATH)

print("Loaded:", COMBINED_PATH.name)
print("Shape:", df.shape)

## Initial Exploration

In [None]:
print("Column names (first 25)")
display(pd.Series(df.columns, name="column").head(25).to_frame())

print("Data types (first 25)")
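# Name the index "column" so the output has correctly labeled "column" and "dtype" columns
display(df.dtypes.rename("dtype").rename_axis("column").reset_index().head(25))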

print("First 5 rows")
display(df.head(5))

Get a quick count and percentage of null values by column

In [None]:
# Null summary (overall)
null_summary = (
    df.isna()
      .sum()
      .sort_values(ascending=False)
      .rename("null_count")
      .to_frame()
)
null_summary["null_pct"] = (null_summary["null_count"] / len(df) * 100).round(2)

display(null_summary)
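
The same count-and-percentage summary is computed again later for the vehicle-only subset, so it may be convenient to wrap it in a small helper. This is a minimal sketch, not part of the `src/` workflow; the name `summarize_nulls` and the toy frame are made up for illustration.

```python
import pandas as pd


def summarize_nulls(frame: pd.DataFrame) -> pd.DataFrame:
    """Return null count and null percentage per column, most-null first."""
    summary = (
        frame.isna()
        .sum()
        .sort_values(ascending=False)
        .rename("null_count")
        .to_frame()
    )
    # Guard against division by zero on an empty frame.
    n_rows = max(len(frame), 1)
    summary["null_pct"] = (summary["null_count"] / n_rows * 100).round(2)
    return summary


# Toy frame (made-up values) to show the output shape.
example = pd.DataFrame({"a": [1, None, None, 4], "b": ["x", "y", None, "z"]})
example_summary = summarize_nulls(example)
print(example_summary)
```

With a helper like this, the overall and per-subset summaries would be `summarize_nulls(df)` and `summarize_nulls(df_vehicle)`.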

There are a lot of null values, but part of the reason is that many columns only apply to a specific product type. For example, if 25% of the rows are Type A and five fields are specific to Type A, those five fields will have a null percentage of at least 75% even when every Type A row fills them in. The following analysis checks the null percentages within each product type.
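
As a quick sanity check of that arithmetic, here is a toy frame where one row in four is Type A and a single field is filled in only for Type A. The column name `a_only_field` and all values are made up for illustration; they are not taken from the complaint data.

```python
import pandas as pd

# Toy data: 1 of 4 rows is product type "A", and `a_only_field`
# is only ever populated for type "A" rows.
toy = pd.DataFrame(
    {
        "prod_type": ["A", "B", "B", "B"],
        "a_only_field": ["x", None, None, None],
    }
)

# Overall null percentage for the type-specific field.
overall_null_pct = toy["a_only_field"].isna().mean() * 100

# Null percentage of the same field within each product type.
null_pct_by_type = toy["a_only_field"].isna().groupby(toy["prod_type"]).mean() * 100

print("Overall null %:", overall_null_pct)
print(null_pct_by_type)
```

The field looks 75% empty overall, yet it is fully populated for the one product type it applies to, which is why the per-type breakdown is more informative than the overall figure.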

In [None]:
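# Show the distribution of rows by product type.
# Missing product types are labeled "<NA>" so they form their own group;
# note that this means `prod_type` itself no longer reports any nulls below.
df["prod_type"] = df["prod_type"].astype("string").fillna("<NA>")

prod_type_counts = (
    df["prod_type"].value_counts(dropna=False)
    .rename_axis("prod_type")  # keeps the column name stable across pandas versions
    .reset_index(name="row_count")
)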
prod_type_counts["row_pct"] = (prod_type_counts["row_count"] / len(df) * 100).round(2)

print("Rows by product type")
display(prod_type_counts)


# Analyze null percentages by product type for columns with >5% nulls
null_by_prod = pd.DataFrame({"overall_null_pct": (df.isna().mean() * 100).round(2)})

for prod_value in prod_type_counts["prod_type"].tolist():
    mask = df["prod_type"] == prod_value
    null_by_prod[f"null_pct_{prod_value}"] = (df.loc[mask].isna().mean() * 100).round(2)
    null_by_prod[f"non_null_pct_{prod_value}"] = (df.loc[mask].notna().mean() * 100).round(2)

null_by_prod = null_by_prod.sort_values("overall_null_pct", ascending=False)
null_by_prod = null_by_prod[null_by_prod["overall_null_pct"] > 5]

print("Null by product type (more than 5% overall null)")
display(null_by_prod)


# Identify columns that are potentially sparse
print("Potentially sparse columns (overall null >= 80% and at least one product type with non-null >= 20%)")
sparse_candidates = null_by_prod[
    (null_by_prod["overall_null_pct"] >= 80)
    & (null_by_prod.filter(regex=r"^non_null_pct_").max(axis=1) >= 20)
]
display(sparse_candidates)
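
To see which product type each candidate column most likely belongs to, one option is to take the product type with the highest non-null percentage per column. The sketch below runs on a toy stand-in for `sparse_candidates`; the column names `col_a` and `col_b`, the product types, and the percentages are all made up for illustration.

```python
import pandas as pd

# Toy stand-in for `sparse_candidates` (values are made up).
toy_candidates = pd.DataFrame(
    {
        "overall_null_pct": [95.0, 88.0],
        "non_null_pct_A": [90.0, 1.0],
        "non_null_pct_B": [0.5, 30.0],
    },
    index=["col_a", "col_b"],
)

# For each column, pick the product type with the highest non-null percentage.
non_null_cols = toy_candidates.filter(regex=r"^non_null_pct_")
most_populated = (
    non_null_cols.idxmax(axis=1)
    .str.replace("non_null_pct_", "", regex=False)
    .rename("most_populated_prod_type")
)

print(most_populated)
```

The same lines should work on the real `sparse_candidates` frame, assuming its per-type columns keep the `non_null_pct_` prefix used above.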


## Scratchpad 

In [None]:
# Vehicle-only view (likely the main analysis cohort)
df_vehicle = df[df["prod_type"] == "V"].copy()

print("Vehicle-only rows:", len(df_vehicle))

vehicle_null_summary = (
    df_vehicle.isna()
    .sum()
    .sort_values(ascending=False)
    .rename("null_count")
    .to_frame()
)
vehicle_null_summary["null_pct"] = (vehicle_null_summary["null_count"] / len(df_vehicle) * 100).round(2)

display(vehicle_null_summary.head(20))


# display(df_vehicle[["maketxt", "modeltxt", "compdesc"]].head(20))
# display(df_vehicle.groupby("compdesc").size().sort_values(ascending=False).head(30))
