# Notebook 1 – Data Cleaning, Feature Engineering, & Entity Resolution
**Project:** Judicial Vacancy → Nomination/Confirmation Pipeline

*Initial draft generated via ChatGPT model o3 on 2025-07-12T02:40:38.399372Z*

In [None]:

import sys
from pathlib import Path

import pandas as pd
from loguru import logger
from rapidfuzz import fuzz, process

# Add the project root to the path so we can import our modules
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from nomination_predictor.config import INTERIM_DATA_DIR, RAW_DATA_DIR

# Setup logging
logger.remove()  # Remove default handler
logger.add(sys.stderr, format="<green>{time:YYYY-MM-DD HH:mm:ss}</green> | <level>{level}</level> | <cyan>{function}</cyan> - <level>{message}</level>", level="INFO")

2025-07-12 18:06:25.502 | INFO     | nomination_predictor.config:<module>:103 - Project root: /home/wsl2ubuntuuser/nomination_predictor
2025-07-12 18:06:25.504 | INFO     | nomination_predictor.config:<module>:127 - Configuration loaded


5

## Load dataframes from Raw data folder

In [None]:
from dataclasses import dataclass
from typing import Dict, Iterator, Tuple

import pandas as pd

from nomination_predictor.features import load_and_prepare_dataframes

dfs_dict: Dict[str, pd.DataFrame] = load_and_prepare_dataframes(RAW_DATA_DIR)

2025-07-12 18:06:26 | INFO | load_and_prepare_dataframes - Loaded 4022 judges, 4720 service records, 80 congress nominees, 80 nominations


#### Make a MyPy-friendly, typed container for all data-frames loaded

In [None]:
@dataclass
class Frames:
    # Order matters for when a later iterator left-joins FJC dataframes; items listed first take precedence in conflicts
    fjc_judges:                         pd.DataFrame 
    fjc_federal_judicial_service:       pd.DataFrame
    fjc_demographics:                   pd.DataFrame
    fjc_education:                      pd.DataFrame
    fjc_other_federal_judicial_service: pd.DataFrame
    fjc_other_nominations_recess:       pd.DataFrame
    seat_timeline:                      pd.DataFrame
    cong_nominees:                      pd.DataFrame
    cong_nominations:                   pd.DataFrame

    # Allows later notebook cells to iterate across these for bulk operations using syntax such as `for name, df in dfs:`
    def __iter__(self) -> Iterator[Tuple[str, pd.DataFrame]]:
        return iter(self.__dict__.items())

In [None]:
# Instantiate from the dict coming back from the loader; a missing or unexpected key raises TypeError here
dfs = Frames(**dfs_dict)
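
As a minimal sketch of what happens when the loader's dict and the dataclass fields disagree, the example below uses a hypothetical two-field `MiniFrames`, with strings standing in for DataFrames. Unpacking with `**` raises `TypeError` at runtime on a missing or unexpected key. A static checker cannot see the keys of a plain `Dict[str, pd.DataFrame]`, so this is a runtime check rather than a MyPy one.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass
class MiniFrames:  # hypothetical stand-in for Frames; str replaces pd.DataFrame
    fjc_judges: str
    cong_nominees: str

    def __iter__(self) -> Iterator[Tuple[str, str]]:
        return iter(self.__dict__.items())


def try_build(payload: dict) -> str:
    """Return 'ok', or the TypeError message raised while unpacking payload."""
    try:
        MiniFrames(**payload)
        return "ok"
    except TypeError as exc:
        return f"TypeError: {exc}"


complete = try_build({"fjc_judges": "a", "cong_nominees": "b"})
missing = try_build({"fjc_judges": "a"})
extra = try_build({"fjc_judges": "a", "cong_nominees": "b", "typo_key": "c"})

# Iteration yields (name, value) pairs in field-declaration order.
field_order = [name for name, _ in MiniFrames(fjc_judges="a", cong_nominees="b")]
```

If stricter static checking is wanted, one option is to have the loader return the dataclass directly instead of a dict.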

In [None]:
for name, df in dfs:
    print(f"{name:<35} → {df.shape}")
    print(df.head())

fjc_judges                          → (4022, 201)
        nid       jid last_name first_name middle_name suffix  birth_month  \
0  13761857  13761857   Abelson       Adam         Ben    NaN          NaN   
1   1393931      3419    Abrams     Ronnie                             NaN   
2   1376976         1   Abruzzo    Matthew          T.                 4.0   
3  13651551  13651551     Abudu      Nancy       Gbana    NaN          NaN   
4   1376981         2   Acheson     Marcus      Wilson                 6.0   

   birth_day birth_year  birth_city  ... degree_(3)  degree_year_(3)  \
0        NaN       1982   Cleveland  ...        NaN              NaN   
1        NaN       1968    New York  ...        NaN              NaN   
2       30.0       1889    Brooklyn  ...        NaN              NaN   
3        NaN       1974  Alexandria  ...        NaN              NaN   
4        7.0       1828  Washington  ...        NaN              NaN   

   school_(4)  degree_(4) degree_year_(4) scho

## Handling nominees' education and job history

Before we combine FJC data, we have to decide whether and how to handle judges' education, job history, age, ABA rating, etc. Besides the judges table itself, the only FJC table with one row per "nid" is "demographics," whose values do not change over time.
The simplest way to handle the tables where "nid" is not unique would be to left-merge on "nid" and keep only the most recently dated row.  In most cases this would likely keep the most prestigious degree or job.

However, it is entirely likely a judge's education or job history has changed substantially since their first nomination, and affected their qualifications for each later nomination.

All of this indicates to me that it's worth considering the judge's position, education, etc., not as of the most recent records available, but instead _as of when they were nominated._

That means we can't do a simple left-join of all of our FJC data.  Instead, we have to fuzzy-match on a combination of names, court locations, and vacancy dates to find which "nid" corresponds to each "citation" in the Congress data; that match is our bridge between the FJC judges and Congress's nominee data. We then use the "received date" for that citation as a cutoff when looking up education and job records by "nid", so that no education or job record dated after the cutoff is mistakenly linked to the citation.

Thankfully, each education record includes the school, degree, and degree_year for every degree a judge holds (associate, bachelor's, master's, LL.B., J.D., etc.), so we can look that up.  The education dataframe also comes with a "sequence" number for each education record, which is an easier-to-use indicator of chronological order than degree_year when looking up a judge by "nid".
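
A minimal sketch of the as-of lookup, assuming the education table has `nid`, `sequence`, `degree`, and `degree_year` columns and that the matching step has already produced a `nid` and a received date for the citation. The sample rows and the helper name `education_as_of` are made up for illustration.

```python
import pandas as pd

# Made-up rows; column names follow the education table described above.
education = pd.DataFrame(
    {
        "nid": [1, 1, 1, 2],
        "sequence": [1, 2, 3, 1],
        "degree": ["B.A.", "J.D.", "LL.M.", "LL.B."],
        "degree_year": [1990, 1993, 2012, 1950],
    }
)


def education_as_of(edu: pd.DataFrame, nid: int, received_date: str) -> pd.DataFrame:
    """Return one judge's education rows dated on or before the received date's year."""
    cutoff_year = pd.Timestamp(received_date).year
    mask = (edu["nid"] == nid) & (edu["degree_year"] <= cutoff_year)
    return edu[mask].sort_values("sequence")


# A nomination received in 2005 should not see the 2012 LL.M.
known_in_2005 = education_as_of(education, nid=1, received_date="2005-03-14")
degrees_2005 = known_in_2005["degree"].tolist()
latest_2005 = known_in_2005.iloc[-1]["degree"]
```

Because `degree_year` only has year granularity, a degree earned later in the same calendar year as the nomination would still pass this filter; whether to use `<=` or `<` is a judgment call.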

Job history is more challenging to deal with because nearly every row in that dataframe lists it as a unique string, but the data is available.  On early attempts, it may be simplest to ignore it. A next step is to feature-engineer basic booleans for whether a judge did or didn't have experience in positions identifiable by common phrases such as "Private practice", "Attorney general", "Navy", or "Army". Eventually a parser can use the year ranges listed there as a rough indicator of how much experience each professional role contributed.
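
The boolean flags and the year-range parser could start out roughly like this. The sketch assumes job-history entries are strings such as `"Private practice, Cleveland, Ohio, 1975-1980"`; the sample strings, the keyword list, and the helper names are illustrative and not taken from the FJC data.

```python
import re
from typing import Dict, List

# Illustrative phrases; extend after inspecting the real job-history text.
KEYWORDS = ["private practice", "attorney general", "navy", "army"]

# Matches "1985-1992" or "1985–1992"; single years and open-ended ranges are ignored here.
YEAR_SPAN = re.compile(r"\b(\d{4})\s*[-–]\s*(\d{4})\b")


def experience_flags(entries: List[str]) -> Dict[str, bool]:
    """One boolean per keyword: does any entry mention it (case-insensitive substring)?"""
    text = " | ".join(entries).lower()
    return {f"has_{kw.replace(' ', '_')}": kw in text for kw in KEYWORDS}


def years_of_experience(entry: str) -> int:
    """Sum the lengths of all start-end year spans found in one entry."""
    return sum(int(end) - int(start) for start, end in YEAR_SPAN.findall(entry))


sample = [
    "U.S. Navy lieutenant, 1968-1971",
    "Private practice, Cleveland, Ohio, 1975-1980, 1985-1992",
]
flags = experience_flags(sample)
practice_years = years_of_experience(sample[1])  # (1980-1975) + (1992-1985)
```

Substring matching is deliberately crude; it would need word-boundary checks before being trusted on short keywords.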

### Combining FJC data

In [None]:
# Left-joins all dataframes whose names start with "fjc", joining them on their columns named "nid"
# Warns if any shared column names contain non-identical data

from loguru import logger

from nomination_predictor.features import left_join_fjc_dataframes

# Execute the function with our dataframes
try:
    fjc_combined = left_join_fjc_dataframes(dfs)
    
    if fjc_combined is not None:
        logger.info(f"Successfully created combined FJC dataframe with {len(fjc_combined)} rows and {len(fjc_combined.columns)} columns")
        # Display the first few rows of the result (a bare expression inside a block is not rendered)
        display(fjc_combined.head())
    else:
        logger.error("Failed to create combined FJC dataframe")
except Exception as e:
    logger.error(f"Error joining FJC dataframes: {str(e)}")
    raise

2025-07-12 18:06:26 | INFO | left_join_fjc_dataframes - Starting join with fjc_judges (4022 rows)
2025-07-12 18:06:26 | INFO | left_join_fjc_dataframes - Joined fjc_federal_judicial_service - merged dataframe now has 4720 rows, 230 columns
2025-07-12 18:06:26 | INFO | left_join_fjc_dataframes - Joined fjc_demographics - merged dataframe now has 4720 rows, 230 columns
2025-07-12 18:06:26 | INFO | left_join_fjc_dataframes - Joined fjc_education - merged dataframe now has 9400 rows, 233 columns
2025-07-12 18:06:26 | INFO | left_join_fjc_dataframes - Joined fjc_other_federal_judicial_service - merged dataframe now has 9573 rows, 261 columns
2025-07-12 18:06:27 | INFO | left_join_fjc_dataframes - Joined fjc_other_nominations_recess - merged dataframe now has 9892 rows, 261 columns
2025

## Normalize column names for DataFrames

In [None]:
print("=== Column Names Before ===")

for name, df in dfs:
    print(f"{name:<35} → {df.columns.tolist()}")

=== Column Names Before ===
fjc_judges                          → ['nid', 'jid', 'last_name', 'first_name', 'middle_name', 'suffix', 'birth_month', 'birth_day', 'birth_year', 'birth_city', 'birth_state', 'death_month', 'death_day', 'death_year', 'death_city', 'death_state', 'gender', 'race_or_ethnicity', 'court_type_(1)', 'court_name_(1)', 'appointment_title_(1)', 'appointing_president_(1)', 'party_of_appointing_president_(1)', 'reappointing_president_(1)', 'party_of_reappointing_president_(1)', 'aba_rating_(1)', 'seat_id_(1)', 'statute_authorizing_new_seat_(1)', 'recess_appointment_date_(1)', 'nomination_date_(1)', 'committee_referral_date_(1)', 'hearing_date_(1)', 'judiciary_committee_action_(1)', 'committee_action_date_(1)', 'senate_vote_type_(1)', 'ayes/nays_(1)', 'confirmation_date_(1)', 'commission_date_(1)', 'service_as_chief_judge,_begin_(1)', 'service_as_chief_judge,_end_(1)', '2nd_service_as_chief_judge,_begin_(1)', '2nd_service_as_chief_judge,_end_(1)', 'senior_status_date

In [None]:
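# Call features.py's normalize_dataframe_columns function on every dataframe
from nomination_predictor.features import normalize_dataframe_columns

# Rebinding the loop variable would leave `dfs` untouched, so write each result back to the container
for name, df in list(dfs):
    setattr(dfs, name, normalize_dataframe_columns(df))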

In [None]:
print("=== Column Names After ===")

for name, df in dfs:
    print(f"{name:<35} → {df.columns.tolist()}")

=== Column Names After ===
fjc_judges                          → ['nid', 'jid', 'last_name', 'first_name', 'middle_name', 'suffix', 'birth_month', 'birth_day', 'birth_year', 'birth_city', 'birth_state', 'death_month', 'death_day', 'death_year', 'death_city', 'death_state', 'gender', 'race_or_ethnicity', 'court_type_(1)', 'court_name_(1)', 'appointment_title_(1)', 'appointing_president_(1)', 'party_of_appointing_president_(1)', 'reappointing_president_(1)', 'party_of_reappointing_president_(1)', 'aba_rating_(1)', 'seat_id_(1)', 'statute_authorizing_new_seat_(1)', 'recess_appointment_date_(1)', 'nomination_date_(1)', 'committee_referral_date_(1)', 'hearing_date_(1)', 'judiciary_committee_action_(1)', 'committee_action_date_(1)', 'senate_vote_type_(1)', 'ayes/nays_(1)', 'confirmation_date_(1)', 'commission_date_(1)', 'service_as_chief_judge,_begin_(1)', 'service_as_chief_judge,_end_(1)', '2nd_service_as_chief_judge,_begin_(1)', '2nd_service_as_chief_judge,_end_(1)', 'senior_status_date_

## Drop non-judge roles from nominations & nominees list

In [None]:
# Filter out non-judicial nominations using the function from features.py
from nomination_predictor.features import filter_non_judicial_nominations

# Define non-judicial titles to filter out
non_judicial_titles = [
    "Attorney", "Board", "Commission", "Director", "Marshal",
    "Assistant", "Representative", "Secretary of", "Member of"
]

# Apply the filter
dfs.cong_nominations, dfs.cong_nominees = filter_non_judicial_nominations(
    dfs.cong_nominations,
    dfs.cong_nominees,
    non_judicial_titles=non_judicial_titles
)

2025-07-12 18:06:27 | INFO | filter_non_judicial_nominations - Found 27 unique citations with non-judicial titles
2025-07-12 18:06:27 | INFO | filter_non_judicial_nominations - Removed 27/80 non-judicial nominations and 27/80 corresponding nominee records


### Supplementing dataframes with additional columns

In [None]:
# Enrich the nominees dataframe with name fields and court information from nominations
from nomination_predictor.features import (enrich_congress_nominees_dataframe,
                                           enrich_fjc_judges)

dfs.cong_nominees = enrich_congress_nominees_dataframe(dfs.cong_nominees, dfs.cong_nominations)

# Enrich the FJC judges dataframe with full name fields
fjc_combined = enrich_fjc_judges(fjc_combined)

KeyError: 'firstname'

In [None]:
print("=== Missing Values ===")
print("\nFJC Judges:")
print(fjc_combined.isnull().sum())

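print("\nCongress Nominees:")
# `cong_nominees` is not bound as a bare name at this point, so read it from the Frames container
print(dfs.cong_nominees.isnull().sum())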

=== Missing Values ===

FJC Judges:
nid               0
jid               0
last_name         0
first_name        0
middle_name      92
               ... 
unnamed:_26    9892
unnamed:_27    9892
unnamed:_28    9892
unnamed:_29    9892
unnamed:_30    9892
Length: 263, dtype: int64

Congress Nominees:


NameError: name 'cong_nominees' is not defined

In [None]:
# For the dataframes that have unique IDs, set them as the index to optimize lookups/joins.
# Assumes `all_dataframes` (name -> DataFrame) and `uniqueness_results` were built in an earlier cell.
for name, df in list(all_dataframes.items()):
    if name in uniqueness_results and uniqueness_results[name].get('is_unique', True):
        if 'nid' in df.columns:
            logger.info(f"Setting 'nid' as index for {name} (unique ID confirmed)")
            all_dataframes[name] = df.set_index('nid', verify_integrity=True)
        elif 'citation' in df.columns:
            logger.info(f"Setting 'citation' as index for {name} (unique ID confirmed)")
            all_dataframes[name] = df.set_index('citation', verify_integrity=True)
        else:
            logger.warning(f"No unique ID found for {name} to set as its index, so left it alone.  Have fun data cleaning 🙃")

In [None]:
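# NOTE: this cell expects `cong_nominees`, `fjc_judges`, and `fjc_service` as bare names, plus the
# helpers `clean_name`, `split_name`, and `normalised_court`; all must already be in scope.
# --- Clean Congress nominees ------------------------------------------------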
cong_nominees["full_name_clean"] = cong_nominees["full_name"].apply(clean_name)
cong_nominees[["first","middle","last"]] = cong_nominees["full_name_clean"].apply(
    lambda n: pd.Series(split_name(n)))

cong_nominees["court_clean"] = cong_nominees["organization"].apply(normalised_court)
cong_nominees["nomination_date"] = pd.to_datetime(cong_nominees["nomination_date"])

# --- Clean FJC judges -------------------------------------------------------
fjc_judges["full_name_clean"] = fjc_judges["name_full"].apply(clean_name)
fjc_judges[["first","middle","last"]] = fjc_judges["full_name_clean"].apply(
    lambda n: pd.Series(split_name(n)))

# We'll need a mapping from nid to service records for date & court validation
fjc_service["court_clean"] = fjc_service["court_name"].apply(normalised_court)
fjc_service["nomination_date"] = pd.to_datetime(fjc_service["nomination_date"], errors="coerce")
fjc_service["commission_date"] = pd.to_datetime(fjc_service["commission_date"], errors="coerce")

NameError: name 'fjc_service' is not defined

In [None]:

# Block by last name exact match
blocks = {}
for lname, group in fjc_judges.groupby("last"):
    blocks[lname] = group

def candidate_fjc_rows(row):
    return blocks.get(row["last"], pd.DataFrame())
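
Exact-match blocking only yields candidates when both sides produce the identical key, so differences in case, accents, or punctuation silently empty the candidate set. Below is a minimal sketch of a normalised blocking key, using plain dicts in place of DataFrames so it runs on its own. `block_key`, `candidates`, and the sample records are hypothetical and not part of `nomination_predictor`.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, List


def block_key(last_name: str) -> str:
    """Casefold, strip accents and non-letters so near-identical surnames share a block."""
    decomposed = unicodedata.normalize("NFKD", last_name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.casefold() if ch.isalpha())


# Hypothetical judge records standing in for rows of fjc_judges.
judges = [
    {"nid": 1, "last": "O'Connor"},
    {"nid": 2, "last": "González"},
    {"nid": 3, "last": "Gonzalez"},
]

blocks: Dict[str, List[dict]] = defaultdict(list)
for judge in judges:
    blocks[block_key(judge["last"])].append(judge)


def candidates(last_name: str) -> List[dict]:
    """Return every judge whose normalised surname matches; empty list if none."""
    return blocks.get(block_key(last_name), [])


oconnor_nids = [j["nid"] for j in candidates("OCONNOR")]
gonzalez_nids = [j["nid"] for j in candidates("gonzalez")]
nobody = candidates("Smith")
```

Looser keys widen each block, which costs more pairwise comparisons, so the normalisation should stay as strict as the data allows.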

In [None]:

def best_match(row):
    """Return (nid, score) of the best-scoring FJC candidate for one nominee row."""
    candidates = candidate_fjc_rows(row)
    if candidates.empty:
        return pd.NA, 0.0
    # Compute combined score: name similarity + court similarity + date proximity
    best_score = 0.0
    best_nid = pd.NA
    for _, cand in candidates.iterrows():
        name_score = fuzz.token_set_ratio(row["full_name_clean"], cand["full_name_clean"])
        # Use service records to find any matching nomination date
        entries = fjc_service[fjc_service["nid"] == cand["nid"]]
        date_score = 0
        court_score = 0
        if not entries.empty:
            # Smallest absolute diff in days; clamp so distant dates score 0 rather than negative
            diffs = (entries["nomination_date"] - row["nomination_date"]).abs().dt.days
            date_score = max(0, 100 - diffs.min()) if diffs.notna().any() else 0
            # Any court string overlap; skip missing court names on either side
            courts = entries["court_clean"].dropna().tolist()
            if isinstance(row["court_clean"], str) and row["court_clean"] and courts:
                if any(row["court_clean"] in c for c in courts):
                    court_score = 100
                else:
                    court_score = max(fuzz.partial_ratio(row["court_clean"], c) for c in courts)
        total = 0.6*name_score + 0.3*date_score + 0.1*court_score
        if total > best_score:
            best_score, best_nid = total, cand["nid"]
    return best_nid, round(best_score,1)
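
To make the weighting concrete, here is a worked sketch of the combined score on its own. The sketch clamps the date component to the 0-100 range, a choice made here so that a distant nomination date cannot push the total below zero. The function name `combined_score` and the sample inputs are illustrative.

```python
def combined_score(name_score: float, days_apart: float, court_score: float) -> float:
    """Weighted total on a 0-100 scale: 60% name, 30% date proximity, 10% court."""
    # One point is lost per day of separation; clamp so the component stays in 0..100.
    date_score = max(0.0, 100.0 - days_apart)
    return round(0.6 * name_score + 0.3 * date_score + 0.1 * court_score, 1)


# Same person, nomination dates 10 days apart, same court: 60 + 27 + 10.
close = combined_score(name_score=100, days_apart=10, court_score=100)

# Perfect name and court, but dates more than 100 days apart: 60 + 0 + 10.
far = combined_score(name_score=100, days_apart=400, court_score=100)
```

Name and court alone contribute at most 70 points, so with a cutoff of 80 a candidate needs some date agreement to count as a match.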

In [None]:
# Import the new filter_confirmed_nominees function
from nomination_predictor.features import (analyze_match_failures,
                                           filter_confirmed_nominees,
                                           load_and_prepare_dataframes)

# Load and prepare all dataframes
dfs = load_and_prepare_dataframes(RAW_DATA_DIR)
cong_nominees = dfs["cong_nominees"]  # This now has all the derived fields
fjc_judges = dfs["fjc_judges"]
fjc_service = dfs["fjc_service"]
cong_nominations = dfs["cong_nominations"]

# OPTIMIZATION: Filter to only confirmed nominees before matching
# This saves processing time by only matching nominees who were confirmed
confirmed_nominees = filter_confirmed_nominees(cong_nominees, cong_nominations)
print(f"Focusing on {len(confirmed_nominees)} confirmed nominees out of {len(cong_nominees)} total nominees")

# Only apply best_match to confirmed nominees
confirmed_nominees[["match_nid", "match_score"]] = confirmed_nominees.apply(
    best_match, axis=1, result_type="expand")

# Merge back with original dataframe to preserve all records
# Non-confirmed nominees will have NaN for match fields
cong_nominees = cong_nominees.merge(
    confirmed_nominees[["citation", "match_nid", "match_score"]], 
    on="citation", 
    how="left"
)

In [None]:

THRESHOLD = 80
matches = cong_nominees[cong_nominees["match_score"] >= THRESHOLD].copy()
print(f"Matched {len(matches)}/{len(cong_nominees)} nominees with score ≥ {THRESHOLD}")
matches.to_csv(INTERIM_DATA_DIR / "congress_fjc_nominee_matches.csv", index=False)

# Save the cleaned interim datasets for downstream notebooks
cong_nominees.to_csv(INTERIM_DATA_DIR / "congress_nominees_cleaned.csv", index=False)
fjc_judges.to_csv(INTERIM_DATA_DIR / "fjc_judges_cleaned.csv", index=False)
fjc_service.to_csv(INTERIM_DATA_DIR / "fjc_service_cleaned.csv", index=False)

Matched 140/207 nominees with score ≥ 80


In [None]:
from nomination_predictor.features import analyze_match_failures

THRESHOLD = 80
matches = cong_nominees[cong_nominees["match_score"] >= THRESHOLD].copy()
print(f"Matched {len(matches)}/{len(cong_nominees)} nominees with score ≥ {THRESHOLD}")

# Analyze unmatched records to understand why they didn't match
unmatched_df, reason_summary, examples = analyze_match_failures(cong_nominees, THRESHOLD)

# Display summary of failure reasons
print("\nFailure Reason Summary:")
display(reason_summary)

# Display a few examples of each failure type
print("\nExample records for each failure type:")
for reason, example_df in examples.items():
    print(f"\n{reason}:")
    display(example_df)

# Save both matched and unmatched datasets for further analysis
matches.to_csv(INTERIM_DATA_DIR / "congress_fjc_nominee_matches.csv", index=False)
unmatched_df.to_csv(INTERIM_DATA_DIR / "congress_fjc_nominee_unmatched.csv", index=False)

# Save the cleaned interim datasets for downstream notebooks
cong_nominees.to_csv(INTERIM_DATA_DIR / "congress_nominees_cleaned.csv", index=False)
fjc_judges.to_csv(INTERIM_DATA_DIR / "fjc_judges_cleaned.csv", index=False)
fjc_service.to_csv(INTERIM_DATA_DIR / "fjc_service_cleaned.csv", index=False)

Matched 140/207 nominees with score ≥ 80

Failure Reason Summary:


Unnamed: 0,Failure Reason,Count
0,No potential match candidates found,48
1,Very low similarity - likely different person,3
2,Marginal match (score 76.5) - check name and c...,3
3,Marginal match (score 66.6) - check name and c...,2
4,Marginal match (score 78.5) - check name and c...,2
5,Marginal match (score 51.7) - check name and c...,1
6,Marginal match (score 61.8) - check name and c...,1
7,Marginal match (score 67.4) - check name and c...,1
8,Marginal match (score 77.3) - check name and c...,1
9,Marginal match (score 78.4) - check name and c...,1



Example records for each failure type:

No potential match candidates found:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
0,James Graham Lake,THE JUDICIARY,0.0,No potential match candidates found
1,Nicholas George Miranda,THE JUDICIARY,0.0,No potential match candidates found
5,Philip S. Hadji,THE JUDICIARY,0.0,No potential match candidates found



Very low similarity - likely different person:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
2,Lisa W. Wang,THE JUDICIARY,32.9,Very low similarity - likely different person
20,Joshua Paul Kolar,THE JUDICIARY,45.7,Very low similarity - likely different person
21,Eumi K. Lee,THE JUDICIARY,40.4,Very low similarity - likely different person



Marginal match (score 76.5) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
13,David Seymour Leibowitz,THE JUDICIARY,76.5,Marginal match (score 76.5) - check name and c...
24,Jacqueline Becerra,THE JUDICIARY,76.5,Marginal match (score 76.5) - check name and c...
26,Melissa Damian,THE JUDICIARY,76.5,Marginal match (score 76.5) - check name and c...



Marginal match (score 66.6) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
23,Edward Sunyol Kiel,THE JUDICIARY,66.6,Marginal match (score 66.6) - check name and c...
25,Sarah French Russell,THE JUDICIARY,66.6,Marginal match (score 66.6) - check name and c...



Marginal match (score 78.5) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
34,Gretchen S. Lund,THE JUDICIARY,78.5,Marginal match (score 78.5) - check name and c...
36,Nicole G. Berner,THE JUDICIARY,78.5,Marginal match (score 78.5) - check name and c...



Marginal match (score 51.7) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
6,Joseph Albert Laroski Jr.,THE JUDICIARY,51.7,Marginal match (score 51.7) - check name and c...



Marginal match (score 61.8) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
16,Mustafa Taher Kasubhai,THE JUDICIARY,61.8,Marginal match (score 61.8) - check name and c...



Marginal match (score 67.4) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
14,Seth Robert Aframe,THE JUDICIARY,67.4,Marginal match (score 67.4) - check name and c...



Marginal match (score 77.3) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
31,Amy M. Baggio,THE JUDICIARY,77.3,Marginal match (score 77.3) - check name and c...



Marginal match (score 78.4) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
35,Kirk Edward Sherriff,THE JUDICIARY,78.4,Marginal match (score 78.4) - check name and c...



Marginal match (score 75.7) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
37,Julie Simone Sneed,THE JUDICIARY,75.7,Marginal match (score 75.7) - check name and c...



Marginal match (score 55.1) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
65,Carmen G. Iguina Gonzalez,THE JUDICIARY,55.1,Marginal match (score 55.1) - check name and c...



Marginal match (score 58.0) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
153,Charles J. Willoughby Jr.,THE JUDICIARY,58.0,Marginal match (score 58.0) - check name and c...



Marginal match (score 52.4) - check name and court:


Unnamed: 0,full_name,court_clean,match_score,failure_reason
160,Charles J. Willoughby Jr.,THE JUDICIARY,52.4,Marginal match (score 52.4) - check name and c...


### Build predecessor lookup table

In [None]:
# Create the predecessor lookup table
predecessor_lookup = get_predecessor_info(seat_timeline_df)
print(f"Created predecessor lookup: {len(predecessor_lookup)} records")

# Preview the predecessor lookup
print(predecessor_lookup.head())
all_dataframes['predecessor_lookup'] = predecessor_lookup