# Experiment: ODI Complaints EDA

Objective:
- Load processed ODI complaint outputs into pandas DataFrames for manual exploration in VS Code notebooks
- Make it easy for teammates to inspect columns, values, and data quality without command-line usage


## Notebook workflow (beginner-friendly)

Run cells from top to bottom.

This notebook will:
1. Find the repository root
2. Load Parquet files from `data/processed/` (falls back to CSV only when no Parquet files are present)
3. Create a default DataFrame named `df` (prefers the combined complaints dataset)
4. Show quick previews so you can continue exploring manually


In [2]:
# Imports and display setup
import re
from pathlib import Path

import pandas as pd

try:
    from IPython.display import display
except ImportError:
    display = print

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 220)
pd.set_option("display.max_colwidth", 120)


In [None]:
# Find the repository root (works whether the notebook runs from repo root or notebooks/)
def find_repo_root(start=None):
    start_path = Path.cwd() if start is None else Path(start)
    for candidate in [start_path, *start_path.parents]:
        if (candidate / ".git").exists():
            return candidate
    raise FileNotFoundError("Could not find repo root (.git folder) from current working directory")

REPO_ROOT = find_repo_root()
DATA_DIR = REPO_ROOT / "data"
PROCESSED_DIR = DATA_DIR / "processed"
OUTPUTS_DIR = DATA_DIR / "outputs"

print("REPO_ROOT:", REPO_ROOT)
print("PROCESSED_DIR:", PROCESSED_DIR)

# Discover processed datasets (prefer parquet, fallback to csv)
parquet_paths = sorted(PROCESSED_DIR.glob("*.parquet"))
csv_paths = sorted(PROCESSED_DIR.glob("*.csv")) if not parquet_paths else []
processed_paths = parquet_paths or csv_paths

if not processed_paths:
    raise FileNotFoundError(
        "No processed parquet/csv files found in data/processed. Run the pipeline first."
    )

print("Found processed files:")
for path in processed_paths:
    print(" -", path.name)

REPO_ROOT: c:\Users\davis\Documents\VCCode Repos\NHTSA-ODI-Complaint-Analytics
PROCESSED_DIR: c:\Users\davis\Documents\VCCode Repos\NHTSA-ODI-Complaint-Analytics\data\processed


In [5]:
# Load processed files into a dictionary of pandas DataFrames
# Keys are normalized dataset names for easy access (example: 'odi_complaints_combined')
def dataset_key(path):
    text = path.stem.strip().lower()
    text = re.sub(r"[^a-z0-9]+", "_", text)
    return text.strip("_") or "dataset"

TABLES = {}
inventory_rows = []

for path in processed_paths:
    key = dataset_key(path)
    if path.suffix.lower() == ".parquet":
        data = pd.read_parquet(path)
    else:
        data = pd.read_csv(path)

    TABLES[key] = data
    inventory_rows.append(
        {
            "dataset_name": key,
            "file_name": path.name,
            "rows": len(data),
            "columns": len(data.columns),
            "format": path.suffix.lower().lstrip(".")
        }
    )

TABLE_INVENTORY = pd.DataFrame(inventory_rows).sort_values("dataset_name").reset_index(drop=True)
display(TABLE_INVENTORY)


Unnamed: 0,dataset_name,file_name,rows,columns,format
0,complaints_received_2020_2024_complaints_received_2020_2024_processed,COMPLAINTS_RECEIVED_2020-2024_COMPLAINTS_RECEIVED_2020-2024_processed.parquet,418789,51,parquet
1,complaints_received_2025_2026_complaints_received_2025_2026_processed,COMPLAINTS_RECEIVED_2025-2026_COMPLAINTS_RECEIVED_2025-2026_processed.parquet,126442,51,parquet
2,odi_complaints_combined,odi_complaints_combined.parquet,545231,51,parquet


In [6]:
# Pick a default DataFrame for manual exploration
PREFERRED_PRIMARY = "odi_complaints_combined"

if PREFERRED_PRIMARY in TABLES:
    PRIMARY_NAME = PREFERRED_PRIMARY
else:
    combined_candidates = [name for name in TABLES if "combined" in name]
    PRIMARY_NAME = sorted(combined_candidates)[0] if combined_candidates else sorted(TABLES.keys())[0]

df = TABLES[PRIMARY_NAME]

print("Primary dataset:", PRIMARY_NAME)
print("Shape:", df.shape)
print("Columns:", len(df.columns))
display(pd.DataFrame({"column": df.columns}).head(20))


Primary dataset: odi_complaints_combined
Shape: (545231, 51)
Columns: 51


Unnamed: 0,column
0,cmplid
1,odino
2,mfr_name
3,maketxt
4,modeltxt
5,yeartxt
6,crash
7,faildate
8,fire
9,injured


In [None]:
# Quick preview
# You can rerun this cell anytime after changing df filters/transforms
print("dtypes preview")
display(df.dtypes.rename_axis("column").reset_index(name="dtype").head(20))

print("sample rows")
display(df.head(5))


dtypes preview


Unnamed: 0,column,dtype
0,cmplid,string
1,odino,string
2,mfr_name,string
3,maketxt,string
4,modeltxt,string
5,yeartxt,string
6,crash,string
7,faildate,datetime64[us]
8,fire,string
9,injured,string


sample rows


Unnamed: 0,cmplid,odino,mfr_name,maketxt,modeltxt,yeartxt,crash,faildate,fire,injured,deaths,compdesc,city,state,vin,datea,ldate,miles,occurences,cdescr,cmpl_type,police_rpt_yn,purch_dt,orig_owner_yn,anti_brakes_yn,cruise_cont_yn,num_cyls,drive_train,fuel_sys,fuel_type,trans_type,veh_speed,dot,tire_size,loc_of_tire,tire_fail_type,orig_equip_yn,manuf_dt,seat_type,restraint_type,dealer_name,dealer_tel,dealer_city,dealer_state,dealer_zip,prod_type,repaired_yn,medical_attn,vehicles_towed_yn,source_zip,source_file
0,1633421,11292384,Honda (American Honda Motor Co.),HONDA,ACCORD,2018,N,2019-12-21,N,0,0,SERVICE BRAKES,PHILADELPHIA,PA,1HGCV2F38JA,2020-01-01,2020-01-01,4,,"DRIVING AT THE HIGHWAY, CAR SUDDENLY SLOW DOWN FROM 70MPH TO 40-50MPH,THERE WERE ANY CAR IN FRONT OF ME! I HAVE NOTH...",IVOQ,N,,N,N,N,,,,,,68,,,,,,,,,,,,,,V,,N,N,COMPLAINTS_RECEIVED_2020-2024.zip,COMPLAINTS_RECEIVED_2020-2024.txt
1,1633422,11292384,Honda (American Honda Motor Co.),HONDA,ACCORD,2018,N,2019-12-21,N,0,0,ELECTRICAL SYSTEM,PHILADELPHIA,PA,1HGCV2F38JA,2020-01-01,2020-01-01,4,,"DRIVING AT THE HIGHWAY, CAR SUDDENLY SLOW DOWN FROM 70MPH TO 40-50MPH,THERE WERE ANY CAR IN FRONT OF ME! I HAVE NOTH...",IVOQ,N,,N,N,N,,,,,,68,,,,,,,,,,,,,,V,,N,N,COMPLAINTS_RECEIVED_2020-2024.zip,COMPLAINTS_RECEIVED_2020-2024.txt
2,1633423,11292384,Honda (American Honda Motor Co.),HONDA,ACCORD,2018,N,2019-12-21,N,0,0,ENGINE,PHILADELPHIA,PA,1HGCV2F38JA,2020-01-01,2020-01-01,4,,"DRIVING AT THE HIGHWAY, CAR SUDDENLY SLOW DOWN FROM 70MPH TO 40-50MPH,THERE WERE ANY CAR IN FRONT OF ME! I HAVE NOTH...",IVOQ,N,,N,N,N,,,,,,68,,,,,,,,,,,,,,V,,N,N,COMPLAINTS_RECEIVED_2020-2024.zip,COMPLAINTS_RECEIVED_2020-2024.txt
3,1633424,11292385,Ford Motor Company,FORD,EXPLORER,2020,N,2019-12-26,N,0,0,ELECTRICAL SYSTEM,MEHERRIN,VA,1FM5K8GC8LG,2020-01-01,2020-01-01,5300,,DEEP SLEEP MODE ACTIVATES AFTER 2 DAYS. MOST RECENT EXPERIENCE INVOLVED HAVING TO JUMP START THE CAR AFTER SETTING ...,IVOQ,N,,N,N,N,,,,,,0,,,,,,,,,,,,,,V,,N,N,COMPLAINTS_RECEIVED_2020-2024.zip,COMPLAINTS_RECEIVED_2020-2024.txt
4,1633425,11292386,"General Motors, LLC",CHEVROLET,VOLT,2017,N,2019-07-12,N,0,0,SERVICE BRAKES,SAN ANTONIO,TX,1G1RB6S52HU,2020-01-01,2020-01-01,15000,,"WHILE DRIVING ON CITY STREETS AND HIGHWAYS, THE ADAPTIVE CRUISE CONTROL WILL NOT ENGAGE 25% OF THE TIME AND WILL DIS...",IVOQ,N,,N,N,N,,,,,,70,,,,,,,,,,,,,,V,,N,N,COMPLAINTS_RECEIVED_2020-2024.zip,COMPLAINTS_RECEIVED_2020-2024.txt
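
The dtypes preview above shows count-like fields such as `yeartxt` and `injured` stored as strings. A minimal sketch of parsing them before doing numeric work, using a small invented frame in place of `df`. The helper name `to_numeric_columns` and the toy values are illustrative assumptions, not part of the pipeline:

```python
import pandas as pd

# Toy stand-in for df; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "yeartxt": ["2018", "2020", "unknown", None],
        "injured": ["0", "2", "", "1"],
    }
)


def to_numeric_columns(frame, columns):
    """Return a copy with the given columns parsed as nullable integers.

    Values that cannot be parsed (empty strings, free text) become <NA>.
    """
    out = frame.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors="coerce").astype("Int64")
    return out


converted = to_numeric_columns(toy, ["yeartxt", "injured"])
print(converted.dtypes)
print(converted)
```

On the real data, something like `df = to_numeric_columns(df, ["yeartxt", "injured", "deaths"])` would follow the same pattern. It is worth checking how many values turn into `<NA>` afterwards, since each one is a value that failed to parse.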


## EDA Sandbox

Start here for hands-on exploration. A few one-liners to try:
- `df.columns.tolist()`
- `df[["maketxt", "modeltxt", "compdesc"]].head(20)`
- `df["compdesc"].value_counts(dropna=False).head(20)`
- `df.groupby("maketxt").size().sort_values(ascending=False).head(20)`
- `df[df["crash"] == "Y"].shape`


In [12]:
# Example scratch queries (edit or replace)

# Null summary (top 20 columns by null count)
null_summary = (
    df.isna()
      .sum()
      .sort_values(ascending=False)
      .rename("null_count")
      .to_frame()
)
null_summary["null_pct"] = (null_summary["null_count"] / len(df) * 100).round(2)
display(null_summary.head(20))

# Uncomment examples below as needed
# display(df["compdesc"].value_counts(dropna=False).head(30))
# display(df[["maketxt", "modeltxt", "yeartxt", "compdesc"]].head(20))
# display(df.groupby(["maketxt", "compdesc"]).size().sort_values(ascending=False).head(30))


Unnamed: 0,null_count,null_pct
fuel_sys,544940,99.95
tire_size,544882,99.94
tire_fail_type,544483,99.86
manuf_dt,544323,99.83
restraint_type,544283,99.83
seat_type,544188,99.81
trans_type,543621,99.7
orig_equip_yn,543089,99.61
dot,542505,99.5
purch_dt,542416,99.48


In [13]:
# Missingness by product type
# Many ODI complaint columns are product-type-specific (vehicle/tire/equipment/child restraint),
# so overall null counts can be misleading without stratifying by prod_type

if "prod_type" not in df.columns:
    raise KeyError("Column 'prod_type' not found in df")

prod_series = df["prod_type"].astype("string").fillna("<NA>")
prod_counts = (
    prod_series.value_counts(dropna=False)
    .rename_axis("prod_type")
    .reset_index(name="row_count")
)
prod_counts["row_pct"] = (prod_counts["row_count"] / len(df) * 100).round(2)

print("Rows by product type")
display(prod_counts)

missingness_by_prod = pd.DataFrame({
    "overall_null_pct": (df.isna().mean() * 100).round(2)
})

for prod_value in prod_counts["prod_type"].tolist():
    mask = prod_series == prod_value
    label = str(prod_value).lower().replace(" ", "_")
    missingness_by_prod[f"null_pct_{label}"] = (df.loc[mask].isna().mean() * 100).round(2)
    missingness_by_prod[f"non_null_pct_{label}"] = (df.loc[mask].notna().mean() * 100).round(2)

missingness_by_prod = missingness_by_prod.sort_values(
    by="overall_null_pct",
    ascending=False
)

print("Missingness by product type (all columns)")
display(missingness_by_prod)

print("Columns with high overall nulls but much better coverage in at least one product type (structural sparsity clues)")
structural_sparse_candidates = missingness_by_prod[
    (missingness_by_prod["overall_null_pct"] >= 80)
    & (
        missingness_by_prod.filter(regex=r"^non_null_pct_").max(axis=1) >= 20
    )
]
display(structural_sparse_candidates)


Rows by product type


Unnamed: 0,prod_type,row_count,row_pct
0,V,537936,98.66
1,T,3820,0.7
2,E,2142,0.39
3,C,1326,0.24
4,,7,0.0


Missingness by product type (all columns)


Unnamed: 0,overall_null_pct,null_pct_v,non_null_pct_v,null_pct_t,non_null_pct_t,null_pct_e,non_null_pct_e,null_pct_c,non_null_pct_c,null_pct_<na>,non_null_pct_<na>
fuel_sys,99.95,99.95,0.05,100.0,0.0,100.0,0.0,100.0,0.0,100.0,0.0
tire_size,99.94,100.0,0.0,90.86,9.14,100.0,0.0,100.0,0.0,100.0,0.0
tire_fail_type,99.86,100.0,0.0,80.42,19.58,100.0,0.0,100.0,0.0,100.0,0.0
manuf_dt,99.83,100.0,0.0,100.0,0.0,100.0,0.0,31.52,68.48,100.0,0.0
restraint_type,99.83,100.0,0.0,100.0,0.0,100.0,0.0,28.51,71.49,100.0,0.0
seat_type,99.81,100.0,0.0,100.0,0.0,100.0,0.0,21.34,78.66,100.0,0.0
trans_type,99.7,99.7,0.3,100.0,0.0,100.0,0.0,100.0,0.0,100.0,0.0
orig_equip_yn,99.61,100.0,0.0,100.0,0.0,0.0,100.0,100.0,0.0,100.0,0.0
dot,99.5,100.0,0.0,28.64,71.36,100.0,0.0,100.0,0.0,100.0,0.0
purch_dt,99.48,99.7,0.3,100.0,0.0,44.63,55.37,97.81,2.19,100.0,0.0


Columns with high overall nulls but much better coverage in at least one product type (structural sparsity clues)


Unnamed: 0,overall_null_pct,null_pct_v,non_null_pct_v,null_pct_t,non_null_pct_t,null_pct_e,non_null_pct_e,null_pct_c,non_null_pct_c,null_pct_<na>,non_null_pct_<na>
manuf_dt,99.83,100.0,0.0,100.0,0.0,100.0,0.0,31.52,68.48,100.0,0.0
restraint_type,99.83,100.0,0.0,100.0,0.0,100.0,0.0,28.51,71.49,100.0,0.0
seat_type,99.81,100.0,0.0,100.0,0.0,100.0,0.0,21.34,78.66,100.0,0.0
orig_equip_yn,99.61,100.0,0.0,100.0,0.0,0.0,100.0,100.0,0.0,100.0,0.0
dot,99.5,100.0,0.0,28.64,71.36,100.0,0.0,100.0,0.0,100.0,0.0
purch_dt,99.48,99.7,0.3,100.0,0.0,44.63,55.37,97.81,2.19,100.0,0.0
loc_of_tire,99.42,100.0,0.0,17.91,82.09,100.0,0.0,100.0,0.0,100.0,0.0
repaired_yn,99.3,100.0,0.0,0.0,100.0,100.0,0.0,100.0,0.0,100.0,0.0
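
Once a column looks structurally sparse, it is usually easier to assess it within the product type it belongs to. A minimal sketch under that assumption, using an invented frame; the helper `null_pct_for_product` and the toy values are hypothetical and not part of the notebook:

```python
import pandas as pd


def null_pct_for_product(frame, prod_value, columns=None):
    """Null percentage per column, restricted to rows of one prod_type."""
    subset = frame.loc[frame["prod_type"] == prod_value]
    if columns is not None:
        subset = subset[list(columns)]
    # Lowest null percentage (best coverage) first.
    return (subset.isna().mean() * 100).round(2).sort_values()


# Toy stand-in for df; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "prod_type": ["V", "V", "V", "T", "T"],
        "dot": [None, None, None, "ABC123", None],
        "loc_of_tire": [None, None, None, "FRONT LEFT", "REAR RIGHT"],
    }
)

tire_nulls = null_pct_for_product(toy, "T", ["dot", "loc_of_tire"])
print(tire_nulls)
```

Against the real data, a call such as `null_pct_for_product(df, "T", ["dot", "tire_size", "loc_of_tire"])` would show tire-field coverage among tire complaints only, which is the denominator that matters for those columns.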


## Notes

- `TABLES` is a dict that maps each normalized dataset name to its loaded DataFrame
- `df` is the default DataFrame for exploration
- If you rerun the pipeline, rerun the loading cells in this notebook to refresh the data
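
If rerunning several cells after each pipeline run becomes tedious, the loading logic could be wrapped in one function. This is a sketch only: `load_tables` is a hypothetical helper that mirrors the loading cells (Parquet preferred, CSV as fallback, same key normalization), and the demo writes a throwaway CSV to a temporary folder rather than touching `data/processed/`:

```python
import re
import tempfile
from pathlib import Path

import pandas as pd


def load_tables(processed_dir):
    """Read every parquet file in processed_dir (or every csv if none) into a dict."""
    processed_dir = Path(processed_dir)
    paths = sorted(processed_dir.glob("*.parquet")) or sorted(processed_dir.glob("*.csv"))
    tables = {}
    for path in paths:
        # Same normalization as dataset_key: lowercase, non-alphanumerics to "_".
        key = re.sub(r"[^a-z0-9]+", "_", path.stem.strip().lower()).strip("_") or "dataset"
        reader = pd.read_parquet if path.suffix.lower() == ".parquet" else pd.read_csv
        tables[key] = reader(path)
    return tables


# Demo on a temporary folder with one invented CSV file.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"odino": [1, 2]}).to_csv(Path(tmp) / "Demo-File.csv", index=False)
    demo_tables = load_tables(tmp)

print(list(demo_tables))
```

In the notebook this would be used as `TABLES = load_tables(PROCESSED_DIR)` followed by `df = TABLES["odino_complaints_combined"]` or whichever key you prefer.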
