# Notebook 1 – Data Cleaning, Feature Engineering, & Entity Resolution
**Project:** Judicial Vacancy → Nomination/Confirmation Pipeline

*Initial draft generated via ChatGPT model o3 on 2025-07-12T02:40:38.399372Z*

In [1]:

import sys
from pathlib import Path

import pandas as pd
from loguru import logger
from rapidfuzz import fuzz, process

# Add the project root to the path so we can import our modules
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from nomination_predictor.config import INTERIM_DATA_DIR, RAW_DATA_DIR
from nomination_predictor.congress_api_utils import (
    enrich_congress_nominees_dataframe,
)

# Setup logging
logger.remove()  # Remove default handler
logger.add(sys.stderr, format="<green>{time:YYYY-MM-DD HH:mm:ss}</green> | <level>{level}</level> | <cyan>{function}</cyan> - <level>{message}</level>", level="INFO")

2025-07-12 16:03:33.706 | INFO     | nomination_predictor.config:<module>:103 - Project root: /home/wsl2ubuntuuser/nomination_predictor
2025-07-12 16:03:33.708 | INFO     | nomination_predictor.config:<module>:127 - Configuration loaded


ImportError: cannot import name 'enrich_congress_nominees_dataframe' from partially initialized module 'nomination_predictor.congress_api_utils' (most likely due to a circular import) (/home/wsl2ubuntuuser/nomination_predictor/nomination_predictor/congress_api_utils.py)

## Make a mypy-friendly, typed container for all loaded DataFrames

In [None]:
from dataclasses import dataclass
from typing import Dict, Iterator, Tuple

import pandas as pd

from nomination_predictor.features import load_and_prepare_dataframes

dfs_dict: Dict[str, pd.DataFrame] = load_and_prepare_dataframes(RAW_DATA_DIR)

In [None]:
@dataclass
class Frames:
    fjc_judges:                         pd.DataFrame
    fjc_federal_judicial_service:       pd.DataFrame
    fjc_demographics:                   pd.DataFrame
    fjc_education:                      pd.DataFrame
    fjc_other_federal_judicial_service: pd.DataFrame
    fjc_other_nominations_recess:       pd.DataFrame
    seat_timeline:                      pd.DataFrame
    cong_nominees:                      pd.DataFrame
    cong_nominations:                   pd.DataFrame

    # Allow:  for name, df in dfs:
    def __iter__(self) -> Iterator[Tuple[str, pd.DataFrame]]:
        return iter(self.__dict__.items())

In [None]:
# Instantiate from the dict returned by the loader.
# mypy cannot check the string keys of an unpacked Dict[str, pd.DataFrame];
# a missing or unexpected key raises TypeError here at runtime instead.
dfs = Frames(**dfs_dict)
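
Because the key check only happens at runtime, it can help to compare the loader's keys against the dataclass fields before unpacking, so the error names every mismatch at once. The following is a minimal sketch of that idea; `DemoFrames` and `check_keys` are hypothetical names used for illustration and are not part of the project.

```python
from dataclasses import dataclass, fields


@dataclass
class DemoFrames:
    # Hypothetical two-field stand-in for the notebook's Frames class
    fjc_judges: object
    cong_nominees: object


def check_keys(cls, data: dict) -> None:
    """Raise KeyError listing every missing or unexpected key."""
    expected = {f.name for f in fields(cls)}
    missing = sorted(expected - set(data))
    extra = sorted(set(data) - expected)
    if missing or extra:
        raise KeyError(f"missing={missing}, unexpected={extra}")


loaded = {"fjc_judges": [], "cong_nominees": []}
check_keys(DemoFrames, loaded)   # passes silently when keys match the fields
demo = DemoFrames(**loaded)
```

Calling `check_keys(Frames, dfs_dict)` just before `Frames(**dfs_dict)` would follow the same pattern.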

In [None]:
# Loop over every frame once (bulk profiling, etc.)
for name, df in dfs:
    print(f"{name:<35} → {df.shape}")

# Crystal-clear single access with IDE autocompletion
dfs.cong_nominees.head()

2025-07-12 14:29:31 | INFO | load_and_prepare_dataframes - Loaded 4022 judges, 4720 service records, 745 congress nominees, 766 nominations


Loaded:
  4022 judges
  4720 federal judicial service records
  4022 demographics
  8040 education
  611 other federal judicial service
  828 other nominations recess
  4720 seat timeline
  745 congress nominees
  766 nominations


In [None]:
# `dfs` is a Frames dataclass, so frames are reached by attribute, not by key
print("=== Column Names ===")
print("\nFJC Judges:", dfs.fjc_judges.columns.tolist())
print("\nFJC Federal Judicial Service:", dfs.fjc_federal_judicial_service.columns.tolist())
print("\nCongress Nominees:", dfs.cong_nominees.columns.tolist())
print("\nCongress Nominations:", dfs.cong_nominations.columns.tolist())

=== Column Names ===

FJC Judges: ['jid', 'last_name', 'first_name', 'middle_name', 'suffix', 'birth_month', 'birth_day', 'birth_year', 'birth_city', 'birth_state', 'death_month', 'death_day', 'death_year', 'death_city', 'death_state', 'gender', 'race_or_ethnicity', 'court_type_(1)', 'court_name_(1)', 'appointment_title_(1)', 'appointing_president_(1)', 'party_of_appointing_president_(1)', 'reappointing_president_(1)', 'party_of_reappointing_president_(1)', 'aba_rating_(1)', 'seat_id_(1)', 'statute_authorizing_new_seat_(1)', 'recess_appointment_date_(1)', 'nomination_date_(1)', 'committee_referral_date_(1)', 'hearing_date_(1)', 'judiciary_committee_action_(1)', 'committee_action_date_(1)', 'senate_vote_type_(1)', 'ayes/nays_(1)', 'confirmation_date_(1)', 'commission_date_(1)', 'service_as_chief_judge,_begin_(1)', 'service_as_chief_judge,_end_(1)', '2nd_service_as_chief_judge,_begin_(1)', '2nd_service_as_chief_judge,_end_(1)', 'senior_status_date_(1)', 'termination_(1)', 'termination_da

In [None]:
# Show basic info about the dataframes
print("=== Basic Info ===")
print("\nFJC Judges shape:", dfs.fjc_judges.shape)
print("FJC Federal Judicial Service shape:", dfs.fjc_federal_judicial_service.shape)
print("Congress Nominees shape:", dfs.cong_nominees.shape)
print("Congress Nominations shape:", dfs.cong_nominations.shape)

# Show first few rows of key dataframes
print("\nFirst few FJC Judges:")
display(dfs.fjc_judges.head())

print("\nFirst few Congress Nominees:")
display(dfs.cong_nominees.head())

=== Basic Info ===

FJC Judges shape: (4022, 202)
FJC Federal Judicial Service shape: (4720, 30)
Congress Nominees shape: (745, 24)
Congress Nominations shape: (766, 27)

First few FJC Judges:


Unnamed: 0,jid,last_name,first_name,middle_name,suffix,birth_month,birth_day,birth_year,birth_city,birth_state,...,school_(4),degree_(4),degree_year_(4),school_(5),degree_(5),degree_year_(5),professional_career,other_nominations/recess_appointments,name_full,full_name_clean
0,13761857,Abelson,Adam,Ben,,,,1982,Cleveland,OH,...,,,,,,,"Law clerk, Hon. Catherine C. Blake, U.S. Distr...",,Adam Ben Abelson,ADAM BEN ABELSON
1,3419,Abrams,Ronnie,,,,,1968,New York,NY,...,,,,,,,"Law clerk, Hon. Thomas P. Griesa, U.S. Distric...",,Ronnie Abrams,RONNIE ABRAMS
2,1,Abruzzo,Matthew,T.,,4.0,30.0,1889,Brooklyn,NY,...,,,,,,,"Private practice, Brooklyn, New York, 1910-1936",,Matthew T. Abruzzo,MATTHEW T ABRUZZO
3,13651551,Abudu,Nancy,Gbana,,,,1974,Alexandria,VA,...,,,,,,,"Private practice, New York City, 1999-2001; Ex...",Nominated to U.S. Court of Appeals for the Ele...,Nancy Gbana Abudu,NANCY GBANA ABUDU
4,2,Acheson,Marcus,Wilson,,6.0,7.0,1828,Washington,PA,...,,,,,,,"Private practice, Pittsburgh, Pennsylvania, 18...",,Marcus Wilson Acheson,MARCUS WILSON ACHESON



First few Congress Nominees:


Unnamed: 0,firstname,lastname,middlename,ordinal,state,congress,number,nominee_url,citation,nominee_id,...,full_name,full_name_clean,first,middle,last,organization,court_from_description,nomination_description,nomination_date,court_clean
0,James,Lake,Graham,1,DC,118,2012,https://api.congress.gov/v3/nomination/118/201...,PN2012,118-2012-1,...,James Graham Lake,JAMES GRAHAM LAKE,James,Graham,Lake,The Judiciary,Superior Court of the District of Columbia,"James Graham Lake, of the District of Columbia...",2024-07-31,SUPERIOR COURT OF THE DISTRICT OF COLUMBIA
1,Nicholas,Miranda,George,1,DC,118,2013,https://api.congress.gov/v3/nomination/118/201...,PN2013,118-2013-1,...,Nicholas George Miranda,NICHOLAS GEORGE MIRANDA,Nicholas,George,Miranda,The Judiciary,Southern District of New York,"Valerie E. Caproni, of the District of Columbi...",2012-11-14,SOUTHERN DISTRICT OF NEW YORK
2,Lisa,Wang,W.,1,DC,118,814,https://api.congress.gov/v3/nomination/118/814...,PN814,118-814-1,...,Lisa W. Wang,LISA W WANG,Lisa,W.,Wang,The Judiciary,United States Court of International Trade,"Lisa W. Wang, of the District of Columbia, to ...",2023-07-11,COURT OF INTERNATIONAL TRADE
3,Brandon,Long,S.,1,LA,118,771,https://api.congress.gov/v3/nomination/118/771...,PN771,118-771-1,...,Brandon S. Long,BRANDON S LONG,Brandon,S.,Long,The Judiciary,Eastern District of Louisiana,"Brandon S. Long, of Louisiana, to be United St...",2023-06-08,EASTERN DISTRICT OF LOUISIANA
4,Jerry,Edwards,,1,LA,118,769,https://api.congress.gov/v3/nomination/118/769...,PN769,118-769-1,...,Jerry Edwards Jr.,JERRY EDWARDS JR,Jerry,,Edwards,The Judiciary,Western District of Louisiana,"Jerry Edwards, Jr., of Louisiana, to be United...",2023-06-08,WESTERN DISTRICT OF LOUISIANA


### Drop non-judge roles from the nominations and nominees lists

In [None]:
# Filter out non-judicial nominations using the function from features.py
from nomination_predictor.features import filter_non_judicial_nominations

# Define non-judicial titles to filter out
non_judicial_titles = [
    "Attorney", "Board", "Commission", "Director", "Marshal",
    "Assistant", "Representative", "Secretary of", "Member of"
]

# Apply the filter; the bare names hold the filtered frames used from here on
cong_nominations, cong_nominees = filter_non_judicial_nominations(
    dfs.cong_nominations,
    dfs.cong_nominees,
    non_judicial_titles=non_judicial_titles
)

2025-07-12 14:29:31 | INFO | filter_non_judicial_nominations - Found 256 unique citations with non-judicial titles
2025-07-12 14:29:31 | INFO | filter_non_judicial_nominations - Removed 338/766 non-judicial nominations and 338/745 corresponding nominee records
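
The body of `filter_non_judicial_nominations` is not shown in this notebook. As a rough illustration of title-based filtering, the sketch below flags nominations whose description contains any listed title and then drops the nominee rows that share those citations. The column names (`citation`, `description`) and all rows are toy assumptions, not the project's data or implementation.

```python
import re

import pandas as pd

titles = ["Attorney", "Marshal", "Member of"]
# Escape each title so it is matched literally, then OR them together
pattern = "|".join(re.escape(t) for t in titles)

# Toy data with hypothetical column names
nominations = pd.DataFrame({
    "citation": ["PN1", "PN2", "PN3"],
    "description": [
        "Jane Doe, of Ohio, to be United States District Judge",
        "John Roe, of Texas, to be United States Marshal",
        "Ann Poe, of Utah, to be United States Attorney",
    ],
})
nominees = pd.DataFrame({
    "citation": ["PN1", "PN2", "PN3"],
    "lastname": ["Doe", "Roe", "Poe"],
})

is_non_judicial = nominations["description"].str.contains(pattern, case=False, na=False)
drop_citations = set(nominations.loc[is_non_judicial, "citation"])

judicial_nominations = nominations[~is_non_judicial]
judicial_nominees = nominees[~nominees["citation"].isin(drop_citations)]
```

Plain substring matching is blunt: a short title such as "Board" or "Director" can also appear in a judicial description, so spot-checking the removed rows is worthwhile.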


In [None]:
print("=== Missing Values ===")
print("\nFJC Judges:")
print(dfs.fjc_judges.isnull().sum())

print("\nFJC Federal Judicial Service:")
print(dfs.fjc_federal_judicial_service.isnull().sum())

print("\nCongress Nominees:")
print(cong_nominees.isnull().sum())

=== Missing Values ===

FJC Judges:
jid                                         0
last_name                                   0
first_name                                  0
middle_name                                35
suffix                                    407
                                         ... 
degree_year_(5)                          4017
professional_career                         4
other_nominations/recess_appointments    3307
name_full                                   0
full_name_clean                             0
Length: 202, dtype: int64

FJC Federal Judicial Service:
nid                                     0
sequence                                0
judge_name                              0
court_type                              0
court_name                              0
appointment_title                       0
appointing_president                    0
party_of_appointing_president          39
reappointing_president               4710
party_of_reappointing_p

In [None]:
# For the dataframes that have unique IDs, set them as the index to optimize lookups/joins.
# NOTE: `all_dataframes` (name -> DataFrame) and `uniqueness_results` are not defined in the
# cells shown here; they are assumed to come from an earlier uniqueness-check cell.
for name, df in list(all_dataframes.items()):
    if name in uniqueness_results and uniqueness_results[name].get('is_unique', True):
        if 'nid' in df.columns:
            logger.info(f"Setting 'nid' as index for {name} (unique ID confirmed)")
            all_dataframes[name] = df.set_index('nid', verify_integrity=True)
        elif 'citation' in df.columns:
            logger.info(f"Setting 'citation' as index for {name} (unique ID confirmed)")
            all_dataframes[name] = df.set_index('citation', verify_integrity=True)
        else:
            logger.warning(f"No unique ID found for {name} to set as its index, so left it alone. Have fun data cleaning 🙃")
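
The cell above relies on a `uniqueness_results` mapping whose construction is not shown. One way such a mapping could be built is sketched below; `check_id_uniqueness` and the shape of its result are assumptions made for illustration, not the project's actual helper.

```python
import pandas as pd


def check_id_uniqueness(frames, id_columns=("nid", "citation")):
    """For each frame, report whether its first available ID column is unique."""
    results = {}
    for name, df in frames.items():
        id_col = next((c for c in id_columns if c in df.columns), None)
        if id_col is None:
            results[name] = {"id_column": None, "is_unique": False, "n_duplicates": 0}
            continue
        ids = df[id_col]
        results[name] = {
            "id_column": id_col,
            # Missing IDs also disqualify the column as an index
            "is_unique": bool(ids.notna().all() and ids.is_unique),
            "n_duplicates": int(ids.duplicated().sum()),
        }
    return results


toy_frames = {
    "service": pd.DataFrame({"nid": [1, 2, 2]}),
    "nominees": pd.DataFrame({"citation": ["PN1", "PN2"]}),
}
uniqueness = check_id_uniqueness(toy_frames)
```

Checking uniqueness first keeps `set_index(..., verify_integrity=True)` from raising on frames where one ID legitimately spans several rows.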

In [None]:
# Bare-name aliases for frames held in the `Frames` container
fjc_judges = dfs.fjc_judges
fjc_service = dfs.fjc_federal_judicial_service

# --- Clean Congress nominees ------------------------------------------------
cong_nominees["full_name_clean"] = cong_nominees["full_name"].apply(clean_name)
cong_nominees[["first","middle","last"]] = cong_nominees["full_name_clean"].apply(
    lambda n: pd.Series(split_name(n)))

cong_nominees["court_clean"] = cong_nominees["organization"].apply(normalised_court)
cong_nominees["nomination_date"] = pd.to_datetime(cong_nominees["nomination_date"])

# --- Clean FJC judges -------------------------------------------------------
fjc_judges["full_name_clean"] = fjc_judges["name_full"].apply(clean_name)
fjc_judges[["first","middle","last"]] = fjc_judges["full_name_clean"].apply(
    lambda n: pd.Series(split_name(n)))

# We'll need a mapping from nid to service records for date & court validation
fjc_service["court_clean"] = fjc_service["court_name"].apply(normalised_court)
fjc_service["nomination_date"] = pd.to_datetime(fjc_service["nomination_date"], errors="coerce")
fjc_service["commission_date"] = pd.to_datetime(fjc_service["commission_date"], errors="coerce")
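
The helpers `clean_name` and `split_name` are used above, but their definitions are not part of this notebook. The sketch below shows one plausible behaviour that agrees with the `full_name_clean`, `first`, `middle`, and `last` values displayed earlier (for example `Matthew T. Abruzzo` becoming `MATTHEW T ABRUZZO`). The `_sketch` functions and the suffix list are assumptions, not the project's code.

```python
import re
from typing import Tuple

# Assumed list of generational suffixes to ignore when splitting
SUFFIXES = {"JR", "SR", "II", "III", "IV"}


def clean_name_sketch(name: str) -> str:
    """Drop periods and commas, collapse whitespace, upper-case."""
    no_punct = re.sub(r"[.,]", "", name)
    return re.sub(r"\s+", " ", no_punct).strip().upper()


def split_name_sketch(clean: str) -> Tuple[str, str, str]:
    """Split a cleaned name into (first, middle, last), ignoring a suffix."""
    parts = clean.split()
    if parts and parts[-1] in SUFFIXES:
        parts = parts[:-1]
    if not parts:
        return "", "", ""
    if len(parts) == 1:
        return "", "", parts[0]   # a lone token is treated as the last name
    return parts[0], " ".join(parts[1:-1]), parts[-1]


cleaned = clean_name_sketch("Jerry Edwards, Jr.")
first, middle, last = split_name_sketch(cleaned)
```

Keeping the suffix out of `last` matters for the next step, since blocking compares last names exactly.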

In [None]:

# Block by last name exact match
blocks = {}
for lname, group in fjc_judges.groupby("last"):
    blocks[lname] = group

def candidate_fjc_rows(row):
    return blocks.get(row["last"], pd.DataFrame())
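
To make the effect of blocking concrete, the toy example below builds the same kind of last-name index and shows how it shrinks the candidate set for one nominee. The rows are invented for illustration only.

```python
import pandas as pd

# Toy judge table; names are illustrative only
judges = pd.DataFrame({
    "last": ["LEE", "LEE", "WANG", "LONG"],
    "first": ["EUMI", "JOHN", "LISA", "BRANDON"],
})

# One sub-frame per distinct last name
blocks = {last: group for last, group in judges.groupby("last")}

# An empty slice keeps the original columns, unlike a bare pd.DataFrame()
empty = judges.iloc[0:0]

candidates = blocks.get("LEE", empty)
no_candidates = blocks.get("SMITH", empty)

comparisons_without_blocking = len(judges)
comparisons_with_blocking = len(candidates)
```

The trade-off is recall: exact last-name blocking cannot pair records whose last names differ through spelling variants, hyphenation, or a name change, so those nominees end up with no candidates at all.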

In [None]:

def best_match(row):
    """Return (nid, score) of the best-scoring FJC candidate for one nominee row."""
    candidates = candidate_fjc_rows(row)
    if candidates.empty:
        return pd.NA, 0.0
    # Combined score: 60% name similarity + 30% date proximity + 10% court similarity
    best_score = 0.0
    best_nid = pd.NA
    nominee_court = row["court_clean"]
    for _, cand in candidates.iterrows():
        name_score = fuzz.token_set_ratio(row["full_name_clean"], cand["full_name_clean"])
        # Service records for this candidate supply nomination dates and courts
        entries = fjc_service[fjc_service["nid"] == cand["nid"]]
        date_score = 0
        court_score = 0
        if not entries.empty:
            # Smallest absolute difference in days; min() skips missing dates.
            # The score goes negative once the closest date is over 100 days away.
            diffs = (entries["nomination_date"] - row["nomination_date"]).abs().dt.days
            date_score = 100 - diffs.min() if diffs.notna().any() else 0
            # Court similarity: substring hit, else best fuzzy partial ratio.
            # NaN is truthy but not a string, so guard both sides explicitly.
            courts = entries["court_clean"].dropna().tolist()
            if isinstance(nominee_court, str) and nominee_court and courts:
                if any(nominee_court in c for c in courts):
                    court_score = 100
                else:
                    court_score = max(fuzz.partial_ratio(nominee_court, c) for c in courts)
        total = 0.6 * name_score + 0.3 * date_score + 0.1 * court_score
        if total > best_score:
            best_score, best_nid = total, cand["nid"]
    return best_nid, round(best_score, 1)
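
A short worked example of the weighting may help when reading the match scores further down. The helper below reproduces only the arithmetic of `best_match` (weights 0.6, 0.3, 0.1 and `date_score = 100 - days`); `combined_score` is a hypothetical name introduced for this illustration.

```python
def combined_score(name_score: float, date_diff_days: float, court_score: float) -> float:
    """Weighted total as computed in best_match, for a single candidate."""
    date_score = 100 - date_diff_days
    return round(0.6 * name_score + 0.3 * date_score + 0.1 * court_score, 1)


# Perfect name and court, nomination dates 10 days apart: 60 + 27 + 10
close_dates = combined_score(100, 10, 100)

# Same name and court, but dates 400 days apart: 60 - 90 + 10
distant_dates = combined_score(100, 400, 100)
```

Because `best_match` starts from `best_score = 0.0` and only accepts a strictly higher total, a candidate whose total is negative is never selected, and the function then returns `(pd.NA, 0.0)` even though a name-identical candidate existed.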

In [None]:
# Import the new filter_confirmed_nominees function
from nomination_predictor.features import (analyze_match_failures,
                                           filter_confirmed_nominees,
                                           load_and_prepare_dataframes)

# Load and prepare all dataframes
dfs = load_and_prepare_dataframes(RAW_DATA_DIR)
cong_nominees = dfs["cong_nominees"]  # This now has all the derived fields
fjc_judges = dfs["fjc_judges"]
fjc_service = dfs["fjc_federal_judicial_service"]  # key name as declared in the Frames container
cong_nominations = dfs["cong_nominations"]

# OPTIMIZATION: Filter to only confirmed nominees before matching
# This saves processing time by only matching nominees who were confirmed
# .copy() so the new match columns are written to an independent frame, not a slice
confirmed_nominees = filter_confirmed_nominees(cong_nominees, cong_nominations).copy()
print(f"Focusing on {len(confirmed_nominees)} confirmed nominees out of {len(cong_nominees)} total nominees")

# Only apply best_match to confirmed nominees
confirmed_nominees[["match_nid", "match_score"]] = confirmed_nominees.apply(
    best_match, axis=1, result_type="expand")

# Merge back with original dataframe to preserve all records
# Non-confirmed nominees will have NaN for match fields
cong_nominees = cong_nominees.merge(
    confirmed_nominees[["citation", "match_nid", "match_score"]], 
    on="citation", 
    how="left"
)
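
A left merge keeps every nominee, but it also multiplies rows if a `citation` appears more than once on the right-hand side. The sketch below uses toy data to show that effect and how pandas' `validate` argument can surface it early; whether duplicates actually occur in this dataset is not established here.

```python
import pandas as pd
from pandas.errors import MergeError

all_nominees = pd.DataFrame({"citation": ["PN1", "PN2", "PN3"]})

# Toy right-hand side in which PN1 appears twice
matched = pd.DataFrame({
    "citation": ["PN1", "PN1"],
    "match_score": [91.0, 64.5],
})

# Without validation the result silently grows from 3 rows to 4
naive = all_nominees.merge(matched, on="citation", how="left")

# With validation, duplicate keys on the right raise MergeError
try:
    all_nominees.merge(matched, on="citation", how="left", validate="many_to_one")
    duplicates_detected = False
except MergeError:
    duplicates_detected = True
```

Passing `validate="many_to_one"` to the merge above would apply the same check to the real frames.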

In [None]:

THRESHOLD = 80
matches = cong_nominees[cong_nominees["match_score"] >= THRESHOLD].copy()
print(f"Matched {len(matches)}/{len(cong_nominees)} nominees with score ≥ {THRESHOLD}")
matches.to_csv(INTERIM_DATA_DIR / "congress_fjc_nominee_matches.csv", index=False)

# Save the cleaned interim datasets for downstream notebooks
cong_nominees.to_csv(INTERIM_DATA_DIR / "congress_nominees_cleaned.csv", index=False)
fjc_judges.to_csv(INTERIM_DATA_DIR / "fjc_judges_cleaned.csv", index=False)
fjc_service.to_csv(INTERIM_DATA_DIR / "fjc_service_cleaned.csv", index=False)

Matched 140/207 nominees with score ≥ 80


In [None]:
from nomination_predictor.features import analyze_match_failures

THRESHOLD = 80
matches = cong_nominees[cong_nominees["match_score"] >= THRESHOLD].copy()
print(f"Matched {len(matches)}/{len(cong_nominees)} nominees with score ≥ {THRESHOLD}")

# Analyze unmatched records to understand why they didn't match
unmatched_df, reason_summary, examples = analyze_match_failures(cong_nominees, THRESHOLD)

# Display summary of failure reasons
print("\nFailure Reason Summary:")
display(reason_summary)

# Display a few examples of each failure type
print("\nExample records for each failure type:")
for reason, example_df in examples.items():
    print(f"\n{reason}:")
    display(example_df)

# Save both matched and unmatched datasets for further analysis
matches.to_csv(INTERIM_DATA_DIR / "congress_fjc_nominee_matches.csv", index=False)
unmatched_df.to_csv(INTERIM_DATA_DIR / "congress_fjc_nominee_unmatched.csv", index=False)

# Save the cleaned interim datasets for downstream notebooks
cong_nominees.to_csv(INTERIM_DATA_DIR / "congress_nominees_cleaned.csv", index=False)
fjc_judges.to_csv(INTERIM_DATA_DIR / "fjc_judges_cleaned.csv", index=False)
fjc_service.to_csv(INTERIM_DATA_DIR / "fjc_service_cleaned.csv", index=False)

Matched 140/207 nominees with score ≥ 80

Failure Reason Summary:


Unnamed: 0,Failure Reason,Count
0,No potential match candidates found,48
1,Very low similarity - likely different person,3
2,Marginal match (score 76.5) - check name and c...,3
3,Marginal match (score 66.6) - check name and c...,2
4,Marginal match (score 78.5) - check name and c...,2
5,Marginal match (score 51.7) - check name and c...,1
6,Marginal match (score 61.8) - check name and c...,1
7,Marginal match (score 67.4) - check name and c...,1
8,Marginal match (score 77.3) - check name and c...,1
9,Marginal match (score 78.4) - check name and c...,1



Example records for each failure type:

No potential match candidates found:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
0,James Graham Lake,THE JUDICIARY,0.0,No potential match candidates found
1,Nicholas George Miranda,THE JUDICIARY,0.0,No potential match candidates found
5,Philip S. Hadji,THE JUDICIARY,0.0,No potential match candidates found



Very low similarity - likely different person:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
2,Lisa W. Wang,THE JUDICIARY,32.9,Very low similarity - likely different person
20,Joshua Paul Kolar,THE JUDICIARY,45.7,Very low similarity - likely different person
21,Eumi K. Lee,THE JUDICIARY,40.4,Very low similarity - likely different person



Marginal match (score 76.5) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
13,David Seymour Leibowitz,THE JUDICIARY,76.5,Marginal match (score 76.5) - check name and c...
24,Jacqueline Becerra,THE JUDICIARY,76.5,Marginal match (score 76.5) - check name and c...
26,Melissa Damian,THE JUDICIARY,76.5,Marginal match (score 76.5) - check name and c...



Marginal match (score 66.6) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
23,Edward Sunyol Kiel,THE JUDICIARY,66.6,Marginal match (score 66.6) - check name and c...
25,Sarah French Russell,THE JUDICIARY,66.6,Marginal match (score 66.6) - check name and c...



Marginal match (score 78.5) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
34,Gretchen S. Lund,THE JUDICIARY,78.5,Marginal match (score 78.5) - check name and c...
36,Nicole G. Berner,THE JUDICIARY,78.5,Marginal match (score 78.5) - check name and c...



Marginal match (score 51.7) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
6,Joseph Albert Laroski Jr.,THE JUDICIARY,51.7,Marginal match (score 51.7) - check name and c...



Marginal match (score 61.8) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
16,Mustafa Taher Kasubhai,THE JUDICIARY,61.8,Marginal match (score 61.8) - check name and c...



Marginal match (score 67.4) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
14,Seth Robert Aframe,THE JUDICIARY,67.4,Marginal match (score 67.4) - check name and c...



Marginal match (score 77.3) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
31,Amy M. Baggio,THE JUDICIARY,77.3,Marginal match (score 77.3) - check name and c...



Marginal match (score 78.4) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
35,Kirk Edward Sherriff,THE JUDICIARY,78.4,Marginal match (score 78.4) - check name and c...



Marginal match (score 75.7) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
37,Julie Simone Sneed,THE JUDICIARY,75.7,Marginal match (score 75.7) - check name and c...



Marginal match (score 55.1) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
65,Carmen G. Iguina Gonzalez,THE JUDICIARY,55.1,Marginal match (score 55.1) - check name and c...



Marginal match (score 58.0) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
153,Charles J. Willoughby Jr.,THE JUDICIARY,58.0,Marginal match (score 58.0) - check name and c...



Marginal match (score 52.4) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
160,Charles J. Willoughby Jr.,THE JUDICIARY,52.4,Marginal match (score 52.4) - check name and c...
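
Since each marginal score is embedded in its failure reason, the summary splits one category into many single-row groups. A small post-processing step can strip the score before counting. The sketch below assumes the reason strings follow the pattern visible in the output above; `reason_category` is a hypothetical helper.

```python
import re
from collections import Counter

# Reason strings in the format shown in the summary output
reasons = [
    "No potential match candidates found",
    "Marginal match (score 76.5) - check name and court",
    "Marginal match (score 66.6) - check name and court",
    "Very low similarity - likely different person",
]


def reason_category(reason: str) -> str:
    """Remove the embedded '(score NN.N)' so equal categories group together."""
    return re.sub(r"\s*\(score [\d.]+\)", "", reason)


category_counts = Counter(reason_category(r) for r in reasons)
```

With a DataFrame, the same function could be applied to the `failure_reason` column before calling `value_counts()`.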


### Build predecessor lookup table

In [None]:
# Create the predecessor lookup table
# `seat_timeline_df` is never defined in this notebook; use the loaded seat timeline frame
predecessor_lookup = get_predecessor_info(dfs["seat_timeline"])
print(f"Created predecessor lookup: {len(predecessor_lookup)} records")

# Preview the predecessor lookup
print(predecessor_lookup.head())
all_dataframes['predecessor_lookup'] = predecessor_lookup
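
The implementation of `get_predecessor_info` is not shown. As a rough sketch of the underlying idea, a predecessor can be derived by ordering each seat's occupants by date and shifting by one within the seat. The column names (`seat_id`, `nid`, `commission_date`) and rows below are assumptions for illustration, not the actual schema of the seat timeline.

```python
import pandas as pd

# Toy seat timeline with hypothetical columns
timeline = pd.DataFrame({
    "seat_id": ["A", "A", "A", "B"],
    "nid": [10, 11, 12, 20],
    "commission_date": pd.to_datetime(
        ["1990-01-01", "2005-06-01", "2020-03-01", "2011-09-15"]
    ),
})

# Order occupants within each seat, then look one row back inside the seat
ordered = timeline.sort_values(["seat_id", "commission_date"]).copy()
ordered["predecessor_nid"] = ordered.groupby("seat_id")["nid"].shift(1)
```

The first occupant of each seat has no predecessor, so `predecessor_nid` is missing for those rows.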