# Notebook 1 – Data Cleaning, Feature Engineering, & Entity Resolution
**Project:** Judicial Vacancy → Nomination/Confirmation Pipeline

*Initial draft generated via ChatGPT model o3 on 2025-07-12T02:40:38.399372Z*

In [1]:

import sys
from pathlib import Path

import pandas as pd
from loguru import logger
from rapidfuzz import fuzz, process

# Add the project root to the path so we can import our modules
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))


# Setup logging
logger.remove()  # Remove default handler
logger.add(sys.stderr, format="<green>{time:YYYY-MM-DD HH:mm:ss}</green> | <level>{level}</level> | <cyan>{function}</cyan> - <level>{message}</level>", level="INFO")

1

## Load dataframes from Raw data folder

Start by loading the simpler CSV files, which contain no JSON

In [2]:
from nomination_predictor.config import INTERIM_DATA_DIR, RAW_DATA_DIR

# load FJC dataframes (and derived seat timeline)
fjc_judges = pd.read_csv(RAW_DATA_DIR / "judges.csv")
fjc_federal_judicial_service = pd.read_csv(RAW_DATA_DIR / "federal_judicial_service.csv")
fjc_demographics = pd.read_csv(RAW_DATA_DIR / "demographics.csv")
fjc_education = pd.read_csv(RAW_DATA_DIR / "education.csv")
fjc_other_federal_judicial_service = pd.read_csv(
    RAW_DATA_DIR / "other_federal_judicial_service.csv"
)
fjc_other_nominations_recess = pd.read_csv(RAW_DATA_DIR / "other_nominations_recess.csv")
seat_timeline = pd.read_csv(RAW_DATA_DIR / "seat_timeline.csv")

2025-07-13 23:22:56.840 | INFO     | nomination_predictor.config:<module>:103 - Project root: /home/wsl2ubuntuuser/nomination_predictor
2025-07-13 23:22:56.841 | INFO     | nomination_predictor.config:<module>:134 - Configuration loaded


Flatten JSON-containing congress DataFrames into separate DataFrames

In [3]:
from nomination_predictor.features import flatten_json_dataframe

# Load Congress API dataframes
cong_nominations_raw = pd.read_csv(RAW_DATA_DIR / "nominations.csv")
cong_nominees_raw = pd.read_csv(RAW_DATA_DIR / "nominees.csv")

cong_nominations = flatten_json_dataframe(
    df=cong_nominations_raw,
    json_col="nomination",  # column containing the JSON data
    max_list_index=10,      # maximum number of list items to extract
    separator="_"           # separator for nested keys
)

cong_nominees = flatten_json_dataframe(
    df=cong_nominees_raw,
    json_col="nominee",
    max_list_index=5
)

2025-07-13 23:22:57.145 | INFO     | nomination_predictor.features:flatten_json_dataframe:282 - Flattening JSON data from column 'nomination' in 5746 rows
2025-07-13 23:22:59.428 | INFO     | nomination_predictor.features:flatten_json_dataframe:308 - Flattening complete. Original columns: 4, New columns: 37
2025-07-13 23:22:59.430 | INFO     | nomination_predictor.features:flatten_json_dataframe:282 - Flattening JSON data from column 'nominee' in 5671 rows
2025-07-13 23:23:33.390 | INFO     | nomination_predictor.features:flatten_json_dataframe:308 - Flattening complete. Original columns: 3, New columns: 34
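
For readers without access to `flatten_json_dataframe`, the flattening idea can be approximated as below. This is a minimal sketch, not the project's implementation: `flatten_record` is a hypothetical helper, the 0-based list index in the generated key names is an assumption, and `ast.literal_eval` is used on the assumption that cells hold Python-style single-quoted dict strings rather than strict JSON.

```python
import ast


def flatten_record(value, prefix="", separator="_", max_list_index=10):
    """Flatten nested dicts/lists into one single-level dict of scalar values."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            name = f"{prefix}{separator}{key}" if prefix else str(key)
            flat.update(flatten_record(child, name, separator, max_list_index))
    elif isinstance(value, list):
        # Keep only the first max_list_index items; 0-based numbering is an assumption.
        for i, child in enumerate(value[:max_list_index]):
            name = f"{prefix}{separator}{i}" if prefix else str(i)
            flat.update(flatten_record(child, name, separator, max_list_index))
    else:
        flat[prefix] = value
    return flat


# Made-up cell content, shaped like a stringified dict read back from a CSV.
raw_cell = "{'citation': 'PN813', 'nominees': [{'lastName': 'Doe', 'state': 'OR'}]}"
record = ast.literal_eval(raw_cell)
flat = flatten_record(record, max_list_index=5)
print(flat)
```

Each flattened dict becomes one row, so list items beyond `max_list_index` are dropped rather than turned into extra columns.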


In [4]:
# Combine all dataframes into a single dictionary for bulk operations
# Start with FJC dataframes
dfs = {
    # FJC dataframes
    "fjc_judges": fjc_judges,
    "fjc_federal_judicial_service": fjc_federal_judicial_service,
    "fjc_demographics": fjc_demographics,
    "fjc_education": fjc_education,
    "fjc_other_federal_judicial_service": fjc_other_federal_judicial_service,
    "fjc_other_nominations_recess": fjc_other_nominations_recess,
    "seat_timeline": seat_timeline,
    
    # Congress dataframes
    "cong_nominations": cong_nominations,
    "cong_nominees": cong_nominees,
}

In [5]:
# Print summary of available dataframes
print("Available dataframes:")
for name, df in dfs.items():
    print(f"- {name}: {len(df)} rows × {len(df.columns)} columns")

Available dataframes:
- fjc_judges: 4022 rows × 201 columns
- fjc_federal_judicial_service: 4720 rows × 30 columns
- fjc_demographics: 4022 rows × 18 columns
- fjc_education: 8040 rows × 6 columns
- fjc_other_federal_judicial_service: 611 rows × 31 columns
- fjc_other_nominations_recess: 828 rows × 4 columns
- seat_timeline: 4720 rows × 31 columns
- cong_nominations: 5746 rows × 37 columns
- cong_nominees: 5671 rows × 34 columns


Cong_nominee_orgs and cong_nominee_edu

JSON-containing files can be exploded and/or flattened in several different ways, and the best choice depends on the use case.  Below is the method I have settled on so far:

In [6]:
# commented out because the function this calls would emit warnings about table conditions that are by now known and tolerated

#from nomination_predictor.dataset import check_id_uniqueness
## Check each DataFrame for uniqueness of citation field
#print("Checking uniqueness of nomination/nominee identifiers...")
#for name, df in dfs.items():
#    if name.startswith("fjc_"):
#        logger.info(f"\n- Checking {name}...")
#        col="nid"
#        if col in df.columns:
#            check_id_uniqueness(df, id_field=col)
#        else:
#            logger.info(f"  Skipped: {col} column not found in {name}")
#    if name.startswith("cong_"):
#        logger.info(f"\n- Checking {name}...")
#        col="citation"
#        if col in df.columns:
#            check_id_uniqueness(df, id_field=col)
#        else:
#            logger.info(f"  Skipped: {col} column not found in {name}")

In [7]:
# commented this cell out because IMO it's too early in this notebook to be worthwhile to save these as CSVs

## Save extracted tables to interim directory
#for name, df in dfs.items():
#    if len(df) > 0:  # Only save non-empty DataFrames
#        output_path = INTERIM_DATA_DIR / f"{name}.csv"
#        df.to_csv(output_path, index=False)
#        print(f"Saved {len(df)} records to {output_path}")

#### Quick peek at all loaded dataframes

In [8]:
logger.info("Checking for general shape and first handfuls of rows")
for name, df in dfs.items():
    print(f"{name:<35} → {df.shape}")
    print(df.head())  

2025-07-13 23:23:00.936 | INFO     | __main__:<module>:1 - Checking for general shape and first handfuls of rows
fjc_judges                          → (4022, 201)
        nid       jid last_name first_name middle_name suffix  birth_month  \
0  13761857  13761857   Abelson       Adam         Ben    NaN          NaN   
1   1393931      3419    Abrams     Ronnie                             NaN   
2   1376976         1   Abruzzo    Matthew          T.                 4.0   
3  13651551  13651551     Abudu      Nancy       Gbana    NaN          NaN   
4   1376981         2   Acheson     Marcus      Wilson                 6.0   

   birth_day birth_year  birth_city  ... degree_(3)  degree_year_(3)  \
0        NaN       1982   Cleveland  ...        NaN              NaN   
1        NaN       1968    New York  ...        NaN              NaN   
2       30.0       1889    Brooklyn  ...        NaN              NaN   
3        NaN       1974  Ale

In [9]:
logger.info("Checking for null values")
    
for name, df in dfs.items():
    print(df.isnull().sum())

2025-07-13 23:23:01.012 | INFO     | __main__:<module>:1 - Checking for null values
nid                                         0
jid                                         0
last_name                                   0
first_name                                  0
middle_name                                35
                                         ... 
school_(5)                               4017
degree_(5)                               4018
degree_year_(5)                          4017
professional_career                         4
other_nominations/recess_appointments    3307
Length: 201, dtype: int64
nid                                     0
sequence                                0
judge_name                              0
court_type                              0
court_name                              0
appointment_title                       0
appointing_president                    0
party_of_appointing_president         

## Data cleaning

### Normalize column names for DataFrames

In [10]:
print("=== Column Names Before ===")

for name, df in dfs.items():
    print(f"{name:<35} → {df.columns.tolist()}")

=== Column Names Before ===
fjc_judges                          → ['nid', 'jid', 'last_name', 'first_name', 'middle_name', 'suffix', 'birth_month', 'birth_day', 'birth_year', 'birth_city', 'birth_state', 'death_month', 'death_day', 'death_year', 'death_city', 'death_state', 'gender', 'race_or_ethnicity', 'court_type_(1)', 'court_name_(1)', 'appointment_title_(1)', 'appointing_president_(1)', 'party_of_appointing_president_(1)', 'reappointing_president_(1)', 'party_of_reappointing_president_(1)', 'aba_rating_(1)', 'seat_id_(1)', 'statute_authorizing_new_seat_(1)', 'recess_appointment_date_(1)', 'nomination_date_(1)', 'committee_referral_date_(1)', 'hearing_date_(1)', 'judiciary_committee_action_(1)', 'committee_action_date_(1)', 'senate_vote_type_(1)', 'ayes/nays_(1)', 'confirmation_date_(1)', 'commission_date_(1)', 'service_as_chief_judge,_begin_(1)', 'service_as_chief_judge,_end_(1)', '2nd_service_as_chief_judge,_begin_(1)', '2nd_service_as_chief_judge,_end_(1)', 'senior_status_date_(

In [11]:
# call features.py's normalize_dataframe_columns function on all DataFrames in dfs, and strip leading and trailing whitespace in all strings
from nomination_predictor.features import normalize_dataframe_columns

for name, df in dfs.items():
    df = normalize_dataframe_columns(df)
    df = df.map(lambda x: x.strip() if isinstance(x, str) else x)
    dfs[name] = df

In [12]:
print("=== Column Names After ===")

for name, df in dfs.items():
    print(f"{name:<35} → {df.columns.tolist()}")

=== Column Names After ===
fjc_judges                          → ['nid', 'jid', 'last_name', 'first_name', 'middle_name', 'suffix', 'birth_month', 'birth_day', 'birth_year', 'birth_city', 'birth_state', 'death_month', 'death_day', 'death_year', 'death_city', 'death_state', 'gender', 'race_or_ethnicity', 'court_type_(1)', 'court_name_(1)', 'appointment_title_(1)', 'appointing_president_(1)', 'party_of_appointing_president_(1)', 'reappointing_president_(1)', 'party_of_reappointing_president_(1)', 'aba_rating_(1)', 'seat_id_(1)', 'statute_authorizing_new_seat_(1)', 'recess_appointment_date_(1)', 'nomination_date_(1)', 'committee_referral_date_(1)', 'hearing_date_(1)', 'judiciary_committee_action_(1)', 'committee_action_date_(1)', 'senate_vote_type_(1)', 'ayes/nays_(1)', 'confirmation_date_(1)', 'commission_date_(1)', 'service_as_chief_judge,_begin_(1)', 'service_as_chief_judge,_end_(1)', '2nd_service_as_chief_judge,_begin_(1)', '2nd_service_as_chief_judge,_end_(1)', 'senior_status_date_(1

Left-merge nominees table onto nominations table

In [13]:
from nomination_predictor.features import merge_nominees_onto_nominations

try:
    # Assuming cong_nominations and cong_nominees dataframes are already loaded
    cong_noms = merge_nominees_onto_nominations(dfs["cong_nominations"], dfs["cong_nominees"])
    
    # Show sample of the merged dataframe
    display(cong_noms.head())
    
    # Report on the merge results
    logger.info(f"Original nominations shape: {dfs['cong_nominations'].shape}")
    logger.info(f"Original nominees shape: {dfs['cong_nominees'].shape}")
    logger.info(f"Merged dataframe shape: {cong_noms.shape}")
    
    dfs["cong_noms"] = cong_noms
    
except (NameError, KeyError):
    logger.error("Required dataframes (cong_nominations, cong_nominees) are not defined")
except Exception as e:
    logger.error(f"Error in merge process: {e}")

2025-07-13 23:23:01.571 | INFO     | nomination_predictor.features:merge_nominees_onto_nominations:551 - Extracted 5671 URLs from nominees request column (100.0% of rows)
2025-07-13 23:23:01.572 | INFO     | nomination_predictor.features:merge_nominees_onto_nominations:575 - Nominations dataframe has 5667 non-null URLs (98.6% of rows)
2025-07-13 23:23:01.590 | INFO     | nomination_predictor.features:merge_nominees_onto_nominations:591 - Merged dataframe has 6150 rows
2025-07-13 23:23:01.590 | INFO     | nomination_predictor.features:merge_nominees_onto_nominations:592 - Successfully matched 6071 nominations with nominees (98.7%)


Unnamed: 0,request,retrieval_date,is_full_detail,actions_count,actions_url,authoritydate,citation,committees_count,committees_url,congress,...,nominees_3_ordinal,nominees_3_state,nominees_4_firstname,nominees_4_lastname,nominees_4_ordinal,nominees_4_state,nominees_3_suffix,nominees_1_middlename,nominees_3_middlename,nominees_4_middlename
0,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,6.0,https://api.congress.gov/v3/nomination/118/201...,2025-05-12,PN2013,1.0,https://api.congress.gov/v3/nomination/118/201...,118,...,,,,,,,,,,
1,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,6.0,https://api.congress.gov/v3/nomination/118/201...,2025-05-12,PN2012,1.0,https://api.congress.gov/v3/nomination/118/201...,118,...,,,,,,,,,,
2,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,11.0,https://api.congress.gov/v3/nomination/118/813...,2025-03-28,PN813,1.0,https://api.congress.gov/v3/nomination/118/813...,118,...,,,,,,,,,,
3,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,14.0,https://api.congress.gov/v3/nomination/118/903...,2025-03-28,PN903,1.0,https://api.congress.gov/v3/nomination/118/903...,118,...,,,,,,,,,,
4,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,20.0,https://api.congress.gov/v3/nomination/118/816...,2025-03-28,PN816,1.0,https://api.congress.gov/v3/nomination/118/816...,118,...,,,,,,,,,,


2025-07-13 23:23:01.603 | INFO     | __main__:<module>:11 - Original nominations shape: (5746, 37)
2025-07-13 23:23:01.603 | INFO     | __main__:<module>:12 - Original nominees shape: (5671, 34)
2025-07-13 23:23:01.603 | INFO     | __main__:<module>:13 - Merged dataframe shape: (6150, 70)
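
The merged frame has 6150 rows against 5746 nominations, so some merge keys must match more than one nominee row. A quick way to quantify that kind of fan-out is sketched below on toy data. The `url` key name and the sample values are assumptions for illustration only; the real key used inside `merge_nominees_onto_nominations` is not shown in this notebook.

```python
import pandas as pd

# Toy stand-ins for the nominations and nominees tables.
nominations = pd.DataFrame({"url": ["a", "b", "c"], "citation": ["PN1", "PN2", "PN3"]})
nominees = pd.DataFrame({"url": ["a", "a", "b"], "lastname": ["doe", "roe", "poe"]})

# indicator=True adds a _merge column telling which side each row came from.
merged = nominations.merge(nominees, on="url", how="left", indicator=True)

# Every additional nominee row per key adds one row to the left-merge result.
extra_rows = len(merged) - len(nominations)
# Nominations without any nominee are flagged "left_only".
unmatched = int((merged["_merge"] == "left_only").sum())
# Keys that fan out to more than one nominee row.
fan_out = nominees["url"].value_counts()
multi_keys = fan_out[fan_out > 1].index.tolist()

print(extra_rows, unmatched, multi_keys)
```

If the fan-out is unexpected, passing `validate="one_to_one"` to `DataFrame.merge` makes pandas raise a `MergeError` instead of silently duplicating rows.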


### Drop non-judge nominations based on position title

In [14]:
# Filter out non-judicial nominations using the function from features.py
from nomination_predictor.features import filter_non_judicial_nominations

# Define non-judicial titles to filter out
non_judicial_titles = [
    "Attorney", "Board", "Commission", "Director", "Marshal",
    "Assistant", "Representative", "Secretary of", "Member of"
]

dfs["cong_noms"] = filter_non_judicial_nominations(dfs["cong_noms"], non_judicial_titles=non_judicial_titles)
cong_noms = dfs["cong_noms"]

2025-07-13 23:23:01.636 | INFO     | nomination_predictor.features:filter_non_judicial_nominations:185 - Found 1331 unique citations with non-judicial titles
2025-07-13 23:23:01.641 | INFO     | nomination_predictor.features:filter_non_judicial_nominations:191 - Removed 4393/6150 corresponding records
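
The log suggests a two-step filter: find citations whose title contains a non-judicial keyword, then remove every record sharing one of those citations. A minimal sketch of that pattern follows. It is an approximation rather than the code in `features.py`, and the `position_title` column name and sample rows are hypothetical.

```python
import re

import pandas as pd

noms = pd.DataFrame({
    "citation": ["PN1", "PN1", "PN2", "PN3"],
    "position_title": [
        "United States Marshal",
        None,
        "United States District Judge",
        "Member of the Sentencing Commission",
    ],
})
non_judicial_titles = ["Marshal", "Member of"]

# Escape each keyword so it is matched literally, then OR the keywords together.
pattern = "|".join(re.escape(title) for title in non_judicial_titles)
hits = noms["position_title"].str.contains(pattern, case=False, na=False)

# Drop every row that shares a citation with a hit, not only the matching row.
bad_citations = set(noms.loc[hits, "citation"])
judicial = noms[~noms["citation"].isin(bad_citations)]
print(judicial["citation"].tolist())
```

Because this is substring matching, a broad keyword such as "Commission" would also remove any judicial title that happens to contain it, so the keyword list is worth spot-checking against the removed rows.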


### Convert date strings to datetime objects

In [15]:
# for any columns which contain certain keywords in their column name and contain string values, convert from string to datetime
datetime_related_keywords = ("date", "year", "month")

for name, df in dfs.items():
    for col in df.columns:
        if any(keyword in col for keyword in datetime_related_keywords) and df[col].dtype == "object":
            logger.info(f"Converting {col} to datetime for {name}")
            df[col] = pd.to_datetime(df[col], errors="coerce")

2025-07-13 23:23:01.653 | INFO     | __main__:<module>:7 - Converting birth_year to datetime for fjc_judges
2025-07-13 23:23:01.661 | INFO     | __main__:<module>:7 - Converting recess_appointment_date_(1) to datetime for fjc_judges
2025-07-13 23:23:01.664 | INFO     | __main__:<module>:7 - Converting nomination_date_(1) to datetime for fjc_judges
2025-07-13 23:23:01.668 | INFO     | __main__:<module>:7 - Converting committee_referral_date_(1) to datetime for fjc_judges
2025-07-13 23:23:01.671 | INFO     | __main__:<module>:7 - Converting hearing_date_(1) to datetime for fjc_judges
2025-07-13 23:23:01.676 | INFO     | __main__:<module>:7 - Converting committee_action_date_(1) to date
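
Because `errors="coerce"` silently turns unparseable strings into `NaT`, and because the keyword test also catches year-only columns such as `birth_year`, it can help to count how many non-null values each conversion loses. The sketch below shows one way to do that on a toy series; the sample values and the explicit `format` are assumptions made to keep the example deterministic.

```python
import pandas as pd

raw = pd.Series(["2024-01-15", "2023-11-30", "unknown", None])

# An explicit format avoids depending on pandas' format inference.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Values that were present before conversion but are NaT afterwards.
lost = raw[parsed.isna() & raw.notna()].tolist()
n_parsed = int(parsed.notna().sum())

print(f"parsed {n_parsed} values, lost {len(lost)}: {lost}")
```

Logging `lost` per column would reveal whether a column should be skipped or parsed with a different format.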

### Normalize all string values we'll later have to fuzzy-match

In [16]:
keywords_which_denote_string_columns_to_normalize = ("court", "circuit", "district", "description", "name")

for name, df in dfs.items():
    for col in df.columns:
        if any(keyword in col.casefold() for keyword in keywords_which_denote_string_columns_to_normalize) and df[col].dtype == object:
            logger.info(f"Normalizing all values within column named {col} in {name}")
            df[col] = df[col].str.casefold()

2025-07-13 23:23:01.891 | INFO     | __main__:<module>:6 - Normalizing all values within column named last_name in fjc_judges
2025-07-13 23:23:01.893 | INFO     | __main__:<module>:6 - Normalizing all values within column named first_name in fjc_judges
2025-07-13 23:23:01.895 | INFO     | __main__:<module>:6 - Normalizing all values within column named middle_name in fjc_judges
2025-07-13 23:23:01.898 | INFO     | __main__:<module>:6 - Normalizing all values within column named court_type_(1) in fjc_judges
2025-07-13 23:23:01.901 | INFO     | __main__:<module>:6 - Normalizing all values within column named court_name_(1) in fjc_judges
2025-07-13 23:23:01.904 | INFO     | __main__:<module>:6

### Count and display unique values under each column

In [17]:
# display counts of unique values in DataFrame columns:
for name, df in dfs.items():
    for col in sorted(df.columns):
        print(f"{name} - {col}: {df[col].nunique()} unique values")

fjc_judges - 2nd_service_as_chief_judge,_begin_(1): 5 unique values
fjc_judges - 2nd_service_as_chief_judge,_begin_(2): 0 unique values
fjc_judges - 2nd_service_as_chief_judge,_begin_(3): 0 unique values
fjc_judges - 2nd_service_as_chief_judge,_begin_(4): 0 unique values
fjc_judges - 2nd_service_as_chief_judge,_begin_(5): 0 unique values
fjc_judges - 2nd_service_as_chief_judge,_begin_(6): 0 unique values
fjc_judges - 2nd_service_as_chief_judge,_end_(1): 4 unique values
fjc_judges - 2nd_service_as_chief_judge,_end_(2): 0 unique values
fjc_judges - 2nd_service_as_chief_judge,_end_(3): 0 unique values
fjc_judges - 2nd_service_as_chief_judge,_end_(4): 0 unique values
fjc_judges - 2nd_service_as_chief_judge,_end_(5): 0 unique values
fjc_judges - 2nd_service_as_chief_judge,_end_(6): 0 unique values
fjc_judges - aba_rating_(1): 6 unique values
fjc_judges - aba_rating_(2): 5 unique values
fjc_judges - aba_rating_(3): 3 unique values
fjc_judges - aba_rating_(4): 1 unique values
fjc_judges - aba

### Set nid as index (for the couple of FJC dataframes designed to use 'nid' uniquely)

In [18]:
# For the dataframes that have unique nid, set them as the index to optimize lookups/joins
dfs["fjc_judges"].set_index('nid', drop=False, inplace=True, verify_integrity=True)
dfs["fjc_demographics"].set_index('nid', drop=False, inplace=True, verify_integrity=True)

## Fuzzy-matching FJC judges to Congress.gov nominees

### Preparing columns to aid matching

In [19]:
# add a "full_name_concatenated" column to the fjc_federal_judicial_service dataframe which is composed by flipping its judge_name column values 
# from "lastname, firstname middleNameOrMiddleInitial (, optional comma and suffix)" to "firstname lastname middle suffix"
from nomination_predictor.features import \
    convert_judge_name_format_from_last_comma_first_to_first_then_last

try:
    dfs["fjc_federal_judicial_service"]["full_name_concatenated"] = dfs["fjc_federal_judicial_service"]["judge_name"].apply(convert_judge_name_format_from_last_comma_first_to_first_then_last)
    
    # Show some examples to verify the conversion
    sample = dfs["fjc_federal_judicial_service"][['judge_name', 'full_name_concatenated']].head(10)
    display(sample)
    
    # Count null values to check for any conversion failures
    null_count = dfs["fjc_federal_judicial_service"]["full_name_concatenated"].isna().sum()
    empty_count = (dfs["fjc_federal_judicial_service"]["full_name_concatenated"] == '').sum()
    
    if null_count > 0 or empty_count > 0:
        print(f"Warning: Found {null_count} null values and {empty_count} empty strings in the converted names.")
        
    print(f"Successfully added 'full_name_concatenated' column to fjc_federal_judicial_service dataframe with {len(dfs['fjc_federal_judicial_service'])} entries.")
    
except Exception as e:
    logger.error(f"Error creating full_name_concatenated column: {e}")
    # If there's an error, display the first few rows of fjc_federal_judicial_service to help diagnose
    logger.info("\nSample of fjc_federal_judicial_service dataframe:")
    display(dfs["fjc_federal_judicial_service"].head(3))
    logger.info(f"Columns available: {dfs['fjc_federal_judicial_service'].columns.tolist()}")

Unnamed: 0,judge_name,full_name_concatenated
0,"abelson, adam ben",adam ben abelson
1,"abrams, ronnie",ronnie abrams
2,"abruzzo, matthew t.",matthew t. abruzzo
3,"abudu, nancy gbana",nancy gbana abudu
4,"acheson, marcus wilson",marcus wilson acheson
5,"acheson, marcus wilson",marcus wilson acheson
6,"acheson, marcus wilson",marcus wilson acheson
7,"acker, william marsh, jr.",william marsh acker jr.
8,"ackerman, harold arnold",harold arnold ackerman
9,"ackerman, james waldo",james waldo ackerman


Successfully added 'full_name_concatenated' column to fjc_federal_judicial_service dataframe with 4720 entries.


In [20]:
# add "full_name_from_description" and "location_of_origin_from_description" columns to the dfs["cong_noms"] dataframe by regex-capturing segments of each row's "description" string:
# the name is everything before the first appearance of the phrase ", of " or ", of the "
# the location is the next segment of the same "description" string,
# i.e. everything between the above-seen phrase ", of " or ", of the " and the phrase ", to be "
# examples: 
# melissa damian, of florida, to be ...  gets captured into those new columns as "melissa damian" and "florida"
# nicole g. berner, of maryland, to be united... gets captured into those new columns as "nicole g. berner" and "maryland"
# kirk edward sherriff, of california, to be united... gets captured into those new columns as "kirk edward sherriff" and "california"
# sherri malloy beatty-arthur, of the district of columbia, for... gets captured into those new columns as "sherri malloy beatty-arthur" and "district of columbia"

# Extract full_name_from_description and location_of_origin_from_description from description field
from nomination_predictor.features import extract_name_and_location_columns

# Apply the extraction function to cong_noms dataframe
if 'cong_noms' in dfs:
    dfs['cong_noms'] = extract_name_and_location_columns(dfs['cong_noms'])
    
    # Display sample results to verify extraction
    sample_cols = ['description', 'full_name_from_description', 'location_of_origin_from_description']
    display(dfs['cong_noms'][sample_cols].head(10))
    
    # Report extraction statistics
    total_rows = len(dfs['cong_noms'])
    name_filled = dfs['cong_noms']['full_name_from_description'].notna().sum()
    location_filled = dfs['cong_noms']['location_of_origin_from_description'].notna().sum()
    
    print(f"Extracted names for {name_filled}/{total_rows} records ({name_filled/total_rows:.1%})")
    print(f"Extracted locations for {location_filled}/{total_rows} records ({location_filled/total_rows:.1%})")
else:
    print("Error: 'cong_noms' dataframe not found in dfs dictionary.")

2025-07-13 23:23:02.474 | INFO     | nomination_predictor.features:extract_name_and_location_columns:720 - Extracted 1757/1757 (100.0%) names and 1757/1757 (100.0%) locations


Unnamed: 0,description,full_name_from_description,location_of_origin_from_description
0,"nicholas george miranda, of the district of co...",nicholas george miranda,district of columbia
1,"james graham lake, of the district of columbia...",james graham lake,district of columbia
16,"mustafa taher kasubhai, of oregon, to be unite...",mustafa taher kasubhai,oregon
17,"mustafa taher kasubhai, of oregon, to be unite...",mustafa taher kasubhai,oregon
21,"jacqueline becerra, of florida, to be united s...",jacqueline becerra,florida
22,"mustafa taher kasubhai, of oregon, to be unite...",mustafa taher kasubhai,oregon
23,"mustafa taher kasubhai, of oregon, to be unite...",mustafa taher kasubhai,oregon
24,"jacquelyn d. austin, of south carolina, to be ...",jacquelyn d. austin,south carolina
31,"melissa damian, of florida, to be united state...",melissa damian,florida
36,"nicole g. berner, of maryland, to be united st...",nicole g. berner,maryland


Extracted names for 1757/1757 records (100.0%)
Extracted locations for 1757/1757 records (100.0%)
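
The capture rules described in the comments above can be expressed as a single regular expression. The sketch below is one possible reading of those rules, not the code behind `extract_name_and_location_columns`; `DESCRIPTION_RE` and `split_description` are hypothetical names, and treating ", for " as an alternative terminator is an assumption based on the "beatty-arthur" example.

```python
import re

# name: lazy match up to the first ", of "; location: up to ", to be " or ", for ".
DESCRIPTION_RE = re.compile(
    r"^(?P<name>.+?), of (?:the )?(?P<location>.+?), (?:to be|for) "
)


def split_description(description):
    """Return (name, location), or (None, None) when the pattern does not apply."""
    match = DESCRIPTION_RE.match(description)
    if match is None:
        return None, None
    return match.group("name"), match.group("location")


print(split_description("melissa damian, of florida, to be united states district judge"))
print(split_description("sherri malloy beatty-arthur, of the district of columbia, for a term"))
```

The lazy `.+?` on the name stops at the first ", of ", so a location such as "district of columbia" is not split on its inner " of ".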


In [21]:
# Add a "last_name_from_full_name" column to dfs["cong_noms"] with only the last name

from nomination_predictor.features import extract_last_name

# Add the last_name_from_full_name column
dfs['cong_noms']['last_name_from_full_name'] = dfs['cong_noms']['full_name_from_description'].apply(extract_last_name)

# Display sample results to verify extraction
sample_cols = ['full_name_from_description', 'last_name_from_full_name']
display(dfs['cong_noms'][sample_cols].head(10))

# Count non-null values
last_name_count = dfs['cong_noms']['last_name_from_full_name'].notna().sum()
print(f"Extracted last names for {last_name_count}/{len(dfs['cong_noms'])} records ({last_name_count/len(dfs['cong_noms']):.1%})")

Unnamed: 0,full_name_from_description,last_name_from_full_name
0,nicholas george miranda,miranda
1,james graham lake,lake
16,mustafa taher kasubhai,kasubhai
17,mustafa taher kasubhai,kasubhai
21,jacqueline becerra,becerra
22,mustafa taher kasubhai,kasubhai
23,mustafa taher kasubhai,kasubhai
24,jacquelyn d. austin,austin
31,melissa damian,damian
36,nicole g. berner,berner


Extracted last names for 1757/1757 records (100.0%)


In [22]:
# add a column "last_name" derived from judge_name to the dfs["fjc_federal_judicial_service"] dataframe because fuzzy-matcher function will look for that as the name of a blocking column later
from nomination_predictor.features import extract_last_name

if 'fjc_federal_judicial_service' in dfs:
    dfs['fjc_federal_judicial_service']['last_name'] = dfs['fjc_federal_judicial_service']['judge_name'].apply(extract_last_name)
    display(dfs['fjc_federal_judicial_service'][['judge_name', 'last_name']].head(10))
else:
    print("Error: 'fjc_federal_judicial_service' dataframe not found in dfs dictionary.")

Unnamed: 0,judge_name,last_name
0,"abelson, adam ben",abelson
1,"abrams, ronnie",abrams
2,"abruzzo, matthew t.",abruzzo
3,"abudu, nancy gbana",abudu
4,"acheson, marcus wilson",acheson
5,"acheson, marcus wilson",acheson
6,"acheson, marcus wilson",acheson
7,"acker, william marsh, jr.",acker
8,"ackerman, harold arnold",ackerman
9,"ackerman, james waldo",ackerman
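
`extract_last_name` is applied above to both "first middle last" strings and "last, first middle" strings. A minimal sketch of a function with that behaviour is shown below. It is an assumption about how such a helper could work, not the implementation in `features.py`, and `last_name_of` and `SUFFIXES` are hypothetical names.

```python
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def last_name_of(name):
    """Return the last name from 'last, first ...' or 'first ... last [suffix]'."""
    name = name.strip()
    if "," in name:
        # "last, first middle[, suffix]": everything before the first comma.
        return name.split(",", 1)[0].strip()
    tokens = name.split()
    # "first middle last [suffix]": drop trailing suffixes, then take the final token.
    while len(tokens) > 1 and tokens[-1].rstrip(".").casefold() in SUFFIXES:
        tokens.pop()
    return tokens[-1] if tokens else ""


print(last_name_of("acker, william marsh, jr."))
print(last_name_of("nicholas george miranda"))
print(last_name_of("william marsh acker jr."))
```

One limitation of this sketch: a first-then-last name written with a comma before the suffix ("john doe, jr.") would be misread as the comma-separated format.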


In [23]:
# perform date matching of fjc's nomination_date vs. congress' received_date +/- some threshold, e.g. 45 days
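
The date comparison described in this cell could look roughly like the sketch below. It assumes both dates are already parsed, and `within_window` is a hypothetical helper; the 45-day default simply mirrors the example threshold in the comment.

```python
from datetime import date, timedelta


def within_window(fjc_nomination_date, congress_received_date, max_days=45):
    """True when both dates exist and differ by at most max_days in either direction."""
    if fjc_nomination_date is None or congress_received_date is None:
        return False
    return abs(fjc_nomination_date - congress_received_date) <= timedelta(days=max_days)


close = within_window(date(2024, 1, 10), date(2024, 2, 20))  # 41 days apart
far = within_window(date(2024, 1, 10), date(2024, 3, 1))     # 51 days apart
print(close, far)
```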

In [24]:
# add a court_name_from_description column to the dfs["cong_noms"] dataframe which regex-captures the next segment of the same dataframe row's "description" string, 
# i.e. anything between the phrase ", to be " (the one used as the end-signifier of the "location" above) and the first appearance of ", vice"
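
A possible starting point for that capture is sketched below. `COURT_RE` and `court_from_description` are hypothetical names, the sample descriptions are made up, and falling back to the end of the string when no ", vice" clause exists is an assumption.

```python
import re

# Capture everything after ", to be " up to ", vice" or, failing that, the end of the string.
COURT_RE = re.compile(r", to be (?P<court>.+?)(?:, vice\b|$)")


def court_from_description(description):
    """Return the court/position segment, or None when ', to be ' is absent."""
    match = COURT_RE.search(description)
    return match.group("court").rstrip(".") if match else None


with_vice = court_from_description(
    "jane doe, of ohio, to be united states district judge "
    "for the northern district of ohio, vice john roe, retired."
)
without_vice = court_from_description(
    "jane doe, of ohio, to be united states circuit judge for the sixth circuit."
)
print(with_vice)
print(without_vice)
```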

#### Blocking-based fuzzy matching

In [25]:
# Import the entity matching module
from nomination_predictor.entity_matching import (
    generate_matching_summary, update_dataframe_with_matches)

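# Set up the matching parameters
# NOTE: this 0-100 value is shadowed later, when MATCH_THRESHOLD is re-imported from
# nomination_predictor.entity_matching (the fuzzy-matching log below reports threshold 0.7)
MATCH_THRESHOLD = 80  # Minimum score to consider a high-confidence match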

In [26]:
# utilize the capabilities provided by fuzzy_matching.py to identify which rows in dfs["fjc_federal_judicial_service"] correspond most closely with rows in dfs["cong_noms"].
# for any rows where a high-confidence-enough unambiguous match is found, copy the 'nid' from that row in dfs["fjc_federal_judicial_service"] to the corresponding row in dfs["cong_noms"]
# for any dfs["fjc_federal_judicial_service"] rows where only one possible row in dfs["cong_noms"] appears to be a possible match, but the match confidence is lower than our cutoff threshold, present those to the user.
# for any dfs["fjc_federal_judicial_service"] rows where matches appear ambiguous about which of multiple rows in dfs["cong_noms"] it could correlate to, present those to the user.
# for any dfs["fjc_federal_judicial_service"] rows where no matches are found, separately present those to the user.
# for any dfs["cong_noms"] rows where no matches are found, separately present those to the user.
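
The plan in this cell can be sketched as blocking on last name and then sorting each Congress record into one of four outcomes. The code below is an illustration under stated assumptions, not the contents of `fuzzy_matching.py`: `score` uses the standard library's `difflib` as a stand-in for a rapidfuzz scorer, `match_with_blocking` is a hypothetical name, and the input tuples are toy data.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def score(a, b):
    """Similarity on a 0-100 scale (difflib stand-in for a rapidfuzz scorer)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def match_with_blocking(congress, fjc, threshold=80):
    """congress: (row_id, full_name, last_name) tuples; fjc: (nid, full_name, last_name) tuples."""
    # Blocking: only compare records that share a last name.
    blocks = defaultdict(list)
    for nid, full_name, last_name in fjc:
        blocks[last_name].append((nid, full_name))

    results = {}
    for row_id, full_name, last_name in congress:
        candidates = blocks.get(last_name, [])
        scored = sorted(((score(full_name, n), nid) for nid, n in candidates), reverse=True)
        # Several service rows can belong to one judge, so compare distinct nids.
        strong = sorted({nid for s, nid in scored if s >= threshold})
        if not scored:
            results[row_id] = ("unmatched", None)
        elif len(strong) == 1:
            results[row_id] = ("matched", strong[0])
        elif len(strong) > 1:
            results[row_id] = ("ambiguous", strong)
        else:
            results[row_id] = ("low_confidence", scored[0][1])
    return results


fjc = [
    (1, "adam ben abelson", "abelson"),
    (4, "marcus wilson acheson", "acheson"),
    (4, "marcus wilson acheson", "acheson"),
    (20, "john smith", "smith"),
    (21, "john smith", "smith"),
]
congress = [
    ("c1", "adam b. abelson", "abelson"),
    ("c2", "jane roe", "roe"),
    ("c3", "m. acheson", "acheson"),
    ("c4", "john smith", "smith"),
]
results = match_with_blocking(congress, fjc)
for row_id, outcome in results.items():
    print(row_id, outcome)
```

Only the "matched" outcome would have its `nid` copied automatically; the other three outcomes are the ones the cell proposes presenting to the user.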

In [27]:
from nameparser import HumanName

# next analysis to try: name matching only.  
# Use nameparser.  
# Take dfs['cong_noms']['full_name_from_description'] column and dfs['fjc_federal_judicial_service']['judge_name'] column.
# compare last name only, using exact string matching, after a plain casefold() and whitespace-stripping.
# if exact match, show me both, and show me the fjc dataframe's 'nid' for that person.
# if multiple matches, show a logger.info() listing the multiple matches.
# then for anyone with multiple matches, compare first name (obtained via nameparser) only.
# if exact match, show me both, and show me the fjc dataframe's 'nid' for that person.
# if multiple matches, show a logger.info() listing the multiple matches.
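
The exact-match plan in the comments above can be prototyped without any project code. In the sketch below, simple string splitting stands in for `nameparser`, `exact_name_matches` is a hypothetical name, and the sample rows are toy data; treat it as an illustration of the two-stage comparison only.

```python
from collections import defaultdict


def normalize(text):
    """Casefold and collapse whitespace."""
    return " ".join(text.casefold().split())


def exact_name_matches(congress_names, fjc_rows):
    """congress_names: 'first [middle] last' strings; fjc_rows: (nid, 'last, first [middle]') pairs."""
    by_last = defaultdict(set)
    for nid, judge_name in fjc_rows:
        last, _, rest = normalize(judge_name).partition(",")
        rest_tokens = rest.split()
        first = rest_tokens[0] if rest_tokens else ""
        by_last[last.strip()].add((nid, first))

    matches = {}
    for name in congress_names:
        tokens = normalize(name).split()
        if not tokens:
            continue
        first, last = tokens[0], tokens[-1]
        candidates = by_last.get(last, set())
        if len(candidates) > 1:
            # Several judges share the last name: narrow by exact first name.
            candidates = {c for c in candidates if c[1] == first}
        matches[name] = sorted(nid for nid, _ in candidates)
    return matches


fjc_rows = [
    (1, "Abelson, Adam Ben"),
    (8, "Ackerman, Harold Arnold"),
    (9, "Ackerman, James Waldo"),
]
matches = exact_name_matches(["Harold A. Ackerman", "adam b. abelson", "jane roe"], fjc_rows)
print(matches)
```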

In [None]:
from nomination_predictor.name_matching import perform_exact_name_matching

results = perform_exact_name_matching(
    congress_df=dfs["cong_noms"],
    fjc_df=dfs["fjc_federal_judicial_service"],
    congress_name_col="full_name_from_description",
    fjc_name_col="judge_name"
)

# Show results
results.head()

KeyError: 'fjc_df'

In [None]:
from nomination_predictor.entity_matching import (COURT_WEIGHT, DATE_WEIGHT,
                                                  MATCH_THRESHOLD, NAME_WEIGHT,
                                                  perform_fuzzy_matching)

# Prepare the DataFrames for matching
logger.info("Preparing DataFrames for fuzzy matching")

# Add last name extraction for Congress data if needed
if "last_name_from_full_name" not in dfs["cong_noms"].columns and "full_name_from_description" in dfs["cong_noms"].columns:
    from nomination_predictor.features import extract_last_name
    logger.info("Extracting last names from full names")
    dfs["cong_noms"]["last_name_from_full_name"] = dfs["cong_noms"]["full_name_from_description"].apply(extract_last_name)

# Run the fuzzy matching pipeline
logger.info(f"Running fuzzy matching with threshold {MATCH_THRESHOLD}")
match_results = perform_fuzzy_matching(
    dfs["cong_noms"],
    dfs["fjc_federal_judicial_service"],
    threshold=MATCH_THRESHOLD,
    name_weight=NAME_WEIGHT,
    court_weight=COURT_WEIGHT,
    date_weight=DATE_WEIGHT
)

# Show a sample of the match results
logger.info("Sample of match results:")
match_results.head()

2025-07-13 22:53:32.886 | INFO | __main__:<module>:6 - Preparing DataFrames for fuzzy matching
2025-07-13 22:53:32.887 | INFO | __main__:<module>:15 - Running fuzzy matching with threshold 0.7
2025-07-13 22:53:32.887 | INFO | nomination_predictor.entity_matching:perform_fuzzy_matching:49 - Running fuzzy matching with threshold 0.7
2025-07-13 22:53:32.888 | INFO | nomination_predictor.fuzzy_matching:find_matches_with_blocking:173 - Starting fuzzy matching with 1757 Congress records and 4720 FJC records


Matching records: 100%|██████████| 1757/1757 [00:06<00:00, 264.43it/s]

2025-07-13 22:53:39.613 | INFO | nomination_predictor.fuzzy_matching:find_matches_with_blocking:278 - Matching complete. Found 1060 matches out of 1757 records
2025-07-13 22:53:39.618 | INFO | __main__:<module>:26 - Sample of match results:





Unnamed: 0,request,retrieval_date,is_full_detail,actions_count,actions_url,authoritydate,citation,committees_count,committees_url,congress,...,nominees_4_state,nominees_3_suffix,nominees_1_middlename,nominees_3_middlename,nominees_4_middlename,full_name_from_description,location_of_origin_from_description,last_name_from_full_name,match_score,nid
0,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,6.0,https://api.congress.gov/v3/nomination/118/201...,2025-05-12,PN2013,1.0,https://api.congress.gov/v3/nomination/118/201...,118,...,,,,,,nicholas george miranda,district of columbia,miranda,0.0,
1,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,6.0,https://api.congress.gov/v3/nomination/118/201...,2025-05-12,PN2012,1.0,https://api.congress.gov/v3/nomination/118/201...,118,...,,,,,,james graham lake,district of columbia,lake,39.52,1383571.0
2,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,11.0,https://api.congress.gov/v3/nomination/118/102...,2025-03-28,PN1024,1.0,https://api.congress.gov/v3/nomination/118/102...,118,...,,,,,,mustafa taher kasubhai,oregon,kasubhai,97.91,13761892.0
3,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,11.0,https://api.congress.gov/v3/nomination/118/102...,2025-03-28,PN1024,1.0,https://api.congress.gov/v3/nomination/118/102...,118,...,,,,,,mustafa taher kasubhai,oregon,kasubhai,97.91,13761892.0
4,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,3.0,https://api.congress.gov/v3/nomination/118/113...,2025-03-28,PN1131,1.0,https://api.congress.gov/v3/nomination/118/113...,118,...,,,,,,jacqueline becerra,florida,becerra,97.93,13761516.0
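
To make the idea of a weighted name/court/date score concrete, here is a minimal sketch. It is not the implementation of `perform_fuzzy_matching`, whose internals are not shown in this notebook. It uses the standard library's `difflib.SequenceMatcher` in place of `rapidfuzz`, a hypothetical 365-day window for date proximity, and default weights of 0.6 / 0.3 / 0.1 borrowed from the legacy `best_match` cell further down; the actual values of `NAME_WEIGHT`, `COURT_WEIGHT`, and `DATE_WEIGHT` may differ.

```python
from datetime import date
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in for a rapidfuzz ratio)."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def date_proximity(a, b, window_days=365):
    """1.0 for identical dates, decaying linearly to 0.0 at window_days apart."""
    gap = abs((a - b).days)
    return max(0.0, 1.0 - gap / window_days)


def combined_score(name_sim, court_sim, date_sim,
                   name_weight=0.6, court_weight=0.1, date_weight=0.3):
    """Weighted average of the three similarities, each expected in [0, 1]."""
    total_weight = name_weight + court_weight + date_weight
    weighted = name_weight * name_sim + court_weight * court_sim + date_weight * date_sim
    return weighted / total_weight


# Illustrative call with made-up court strings and dates.
score = combined_score(
    similarity("amy coney barrett", "Amy Coney Barrett"),
    similarity("u.s. court of appeals", "u.s. court of appeals for the seventh circuit"),
    date_proximity(date(2017, 5, 8), date(2017, 5, 8)),
)
```

Keeping every component on the same 0 to 1 scale means the combined score and the threshold it is compared against share one unit.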


In [None]:
# Generate comprehensive matching summary
logger.info("Analyzing match results")
match_summary = generate_matching_summary(
    match_results,
    dfs["fjc_federal_judicial_service"],
    threshold=MATCH_THRESHOLD
)

# Print matching statistics
logger.info("=== Matching Statistics ===")
for key, value in match_summary["stats"].items():
    logger.info(f"{key}: {value}")

2025-07-13 22:53:39.674 | INFO | __main__:<module>:2 - Analyzing match results
2025-07-13 22:53:39.714 | INFO | nomination_predictor.entity_matching:generate_matching_summary:358 - Excluded 844 already matched FJC records from ambiguity check
2025-07-13 22:53:39.715 | INFO | nomination_predictor.entity_matching:generate_matching_summary:363 - Processing 697 unmatched records instead of 1757 total records
2025-07-13 22:53:39.715 | INFO | nomination_predictor.entity_matching:generate_matching_summary:364 - Using 3788 relevant FJC records instead of 4720 total records
2025-07-13 22:53:39.856 | INFO | __main__:<module>:10 - === Matching Statistics ===
2025-07-13 22:53:39.858 | INFO | __mai

In [None]:
# Display high confidence matches (sample)
print("\n=== High Confidence Matches ===")
print(f"Found {len(match_summary['high_confidence'])} high confidence matches (showing first 10)")
display(match_summary["high_confidence_sample"])


=== High Confidence Matches ===
Found 1060 high confidence matches (showing first 10)


Unnamed: 0,request,retrieval_date,is_full_detail,actions_count,actions_url,authoritydate,citation,committees_count,committees_url,congress,...,nominees_4_state,nominees_3_suffix,nominees_1_middlename,nominees_3_middlename,nominees_4_middlename,full_name_from_description,location_of_origin_from_description,last_name_from_full_name,match_score,nid
1682,"{'congress': '97', 'contentType': 'application...",2025-07-12,True,8.0,https://api.congress.gov/v3/nomination/97/586/...,1981-08-03,PN586,1.0,https://api.congress.gov/v3/nomination/97/586/...,97,...,,,,,,sandra day o'connor,arizona,o'connor,100.0,1385891.0
102,"{'congress': '117', 'contentType': 'applicatio...",2025-07-12,True,19.0,https://api.congress.gov/v3/nomination/117/178...,2024-02-20,PN1783,1.0,https://api.congress.gov/v3/nomination/117/178...,117,...,,,,,,ketanji brown jackson,district of columbia,jackson,100.0,1394151.0
250,"{'congress': '116', 'contentType': 'applicatio...",2025-07-12,True,14.0,https://api.congress.gov/v3/nomination/116/225...,2024-09-06,PN2252,1.0,https://api.congress.gov/v3/nomination/116/225...,116,...,,,,,,amy coney barrett,indiana,barrett,100.0,3979311.0
327,"{'congress': '116', 'contentType': 'applicatio...",2025-07-12,True,6.0,https://api.congress.gov/v3/nomination/116/214...,NaT,PN214,1.0,https://api.congress.gov/v3/nomination/116/214...,116,...,,,,,,m. miller baker,louisiana,baker,100.0,6841726.0
825,"{'congress': '109', 'contentType': 'applicatio...",2025-07-12,True,8.0,https://api.congress.gov/v3/nomination/109/106...,NaT,PN1060,1.0,https://api.congress.gov/v3/nomination/109/106...,109,...,,,,,,leo maury gordon,new jersey,gordon,100.0,1392861.0
903,"{'congress': '108', 'contentType': 'applicatio...",2025-07-12,True,6.0,https://api.congress.gov/v3/nomination/108/33/...,NaT,PN33,1.0,https://api.congress.gov/v3/nomination/108/33/...,108,...,,,,,,timothy c. stanceu,virginia,stanceu,100.0,1392921.0
1378,"{'congress': '102', 'contentType': 'applicatio...",2025-07-12,True,6.0,https://api.congress.gov/v3/nomination/102/314...,NaT,PN314,1.0,https://api.congress.gov/v3/nomination/102/314...,102,...,,,,,,timothy k. lewis,pennsylvania,lewis,99.94,1383891.0
1358,"{'congress': '102', 'contentType': 'applicatio...",2025-07-12,True,6.0,https://api.congress.gov/v3/nomination/102/522...,NaT,PN522,1.0,https://api.congress.gov/v3/nomination/102/522...,102,...,,,,,,stewart r. dalzell,pennsylvania,dalzell,99.94,1379736.0
161,"{'congress': '117', 'contentType': 'applicatio...",2025-07-12,True,11.0,https://api.congress.gov/v3/nomination/117/236...,2022-12-06,PN2365,1.0,https://api.congress.gov/v3/nomination/117/236...,117,...,,,,,,kelley brisbon hodge,pennsylvania,hodge,99.94,12911481.0
1183,"{'congress': '104', 'contentType': 'applicatio...",2025-07-12,True,6.0,https://api.congress.gov/v3/nomination/104/244...,NaT,PN244,1.0,https://api.congress.gov/v3/nomination/104/244...,104,...,,,,,,joseph robert goodwin,west virginia,goodwin,99.94,1381361.0


In [None]:
# Display medium confidence matches for user review
print("\n=== Medium Confidence Matches ===")
print(f"Found {len(match_summary['medium_confidence'])} medium confidence matches (showing first 10)")
if not match_summary["medium_confidence_sample"].empty:
    display(match_summary["medium_confidence_sample"])
else:
    print("No medium confidence matches found")

# Display ambiguous matches (multiple possible matches with similar scores)
print("\n=== Ambiguous Matches ===")
print(f"Found {len(match_summary['ambiguous_matches'])} ambiguous matches (showing first 10)")
if not match_summary["ambiguous_matches_sample"].empty:
    display(match_summary["ambiguous_matches_sample"])
else:
    print("No ambiguous matches found")


=== Medium Confidence Matches ===
Found 0 medium confidence matches (showing first 10)
No medium confidence matches found

=== Ambiguous Matches ===
Found 0 ambiguous matches (showing first 10)
No ambiguous matches found
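
The outputs above show a logged threshold of 0.7 next to `match_score` values such as 97.91 and 39.52, and zero medium-confidence matches. One possible reading, which is an assumption and not confirmed by the code shown here, is that the scores live on a 0-100 scale while the threshold is on a 0-1 scale, so almost any non-zero score clears it. The sketch below shows how scores could be normalized before banding; `confidence_band` and its `high` / `medium` cutoffs are hypothetical.

```python
def confidence_band(score, scale_max=100.0, high=0.9, medium=0.7):
    """Classify a raw score after rescaling it to the unit interval.

    scale_max is the assumed maximum of the raw score (100 here);
    the high/medium cutoffs are illustrative, not project settings.
    """
    unit = score / scale_max
    if unit >= high:
        return "high"
    if unit >= medium:
        return "medium"
    return "low"


# Comparing a raw 0-100 score directly against 0.7 passes almost everything:
raw_passes = 39.52 >= 0.7                    # True
normalized_band = confidence_band(39.52)     # "low"
```

If this reading is right, a score of 39.52 would be reported as high confidence by a raw comparison but as low confidence once both sides use the same scale.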


In [None]:
# Display unmatched Congress records
print("\n=== Unmatched Congress Records ===")
print(f"Found {len(match_summary['cong_unmatched'])} unmatched Congress records (showing first 10)")
if not match_summary["cong_unmatched_sample"].empty:
    display(match_summary["cong_unmatched_sample"])
else:
    print("No unmatched Congress records")

# Display unmatched FJC records
print("\n=== Unmatched FJC Records ===")
print(f"Found {len(match_summary['fjc_unmatched'])} unmatched FJC records (showing first 10)")
if not match_summary["fjc_unmatched_sample"].empty:
    display(match_summary["fjc_unmatched_sample"])
else:
    print("No unmatched FJC records")


=== Unmatched Congress Records ===
Found 697 unmatched Congress records (showing first 10)


Unnamed: 0,request,retrieval_date,is_full_detail,actions_count,actions_url,authoritydate,citation,committees_count,committees_url,congress,...,nominees_4_state,nominees_3_suffix,nominees_1_middlename,nominees_3_middlename,nominees_4_middlename,full_name_from_description,location_of_origin_from_description,last_name_from_full_name,match_score,nid
0,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,6.0,https://api.congress.gov/v3/nomination/118/201...,2025-05-12,PN2013,1.0,https://api.congress.gov/v3/nomination/118/201...,118,...,,,,,,nicholas george miranda,district of columbia,miranda,0.0,
11,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,6.0,https://api.congress.gov/v3/nomination/118/115...,2025-01-03,PN1152,1.0,https://api.congress.gov/v3/nomination/118/115...,118,...,,,,,,sherri malloy beatty-arthur,district of columbia,beatty-arthur,0.0,
13,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,5.0,https://api.congress.gov/v3/nomination/118/125...,2025-01-03,PN1251,1.0,https://api.congress.gov/v3/nomination/118/125...,118,...,,,,,,adeel abdullah mangi,new jersey,mangi,0.0,
14,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,5.0,https://api.congress.gov/v3/nomination/118/135...,2025-01-03,PN1352,1.0,https://api.congress.gov/v3/nomination/118/135...,118,...,,,,,,kenechukwu onyemaechi okocha,district of columbia,okocha,0.0,
15,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,2.0,https://api.congress.gov/v3/nomination/118/140...,2025-01-03,PN1407,1.0,https://api.congress.gov/v3/nomination/118/140...,118,...,,,,,,rebecca suzanne kanter,california,kanter,0.0,
17,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,6.0,https://api.congress.gov/v3/nomination/118/157...,2025-01-03,PN1572,1.0,https://api.congress.gov/v3/nomination/118/157...,118,...,,,,,,john cuong truong,district of columbia,truong,0.0,
18,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,2.0,https://api.congress.gov/v3/nomination/118/157...,2025-01-03,PN1575,1.0,https://api.congress.gov/v3/nomination/118/157...,118,...,,,,,,detra shaw-wilder,florida,shaw-wilder,0.0,
19,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,6.0,https://api.congress.gov/v3/nomination/118/162...,2025-01-03,PN1627,1.0,https://api.congress.gov/v3/nomination/118/162...,118,...,,,,,,joseph russell palmore,district of columbia,palmore,0.0,
20,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,6.0,https://api.congress.gov/v3/nomination/118/162...,2025-01-03,PN1627,1.0,https://api.congress.gov/v3/nomination/118/162...,118,...,,,,,,joseph russell palmore,district of columbia,palmore,0.0,
21,"{'congress': '118', 'contentType': 'applicatio...",2025-07-12,True,4.0,https://api.congress.gov/v3/nomination/118/165...,2025-01-03,PN1653,1.0,https://api.congress.gov/v3/nomination/118/165...,118,...,,,,,,sarah netburn,new york,netburn,0.0,



=== Unmatched FJC Records ===
Found 3788 unmatched FJC records (showing first 10)


Unnamed: 0,nid,sequence,judge_name,court_type,court_name,appointment_title,appointing_president,party_of_appointing_president,reappointing_president,party_of_reappointing_president,...,commission_date,"service_as_chief_judge,_begin","service_as_chief_judge,_end","2nd_service_as_chief_judge,_begin","2nd_service_as_chief_judge,_end",senior_status_date,termination,termination_date,full_name_concatenated,last_name
2,1376976,1,"abruzzo, matthew t.",u.s. district court,u.s. district court for the eastern district o...,Judge,Franklin D. Roosevelt,Democratic,,,...,1936-02-15,,,,,1966-02-15,Death,1971-05-28,matthew t. abruzzo,abruzzo
4,1376981,1,"acheson, marcus wilson",u.s. district court,u.s. district court for the western district o...,Judge,Rutherford B. Hayes,Republican,,,...,1880-01-14,,,,,NaT,Appointment to Another Judicial Position,1891-02-09,marcus wilson acheson,acheson
5,1376981,2,"acheson, marcus wilson",u.s. circuit court (1869-1911),u.s. circuit courts for the third circuit,Judge,Benjamin Harrison,Republican,,,...,1891-02-03,,,,,NaT,Death,1906-06-21,marcus wilson acheson,acheson
6,1376981,3,"acheson, marcus wilson",u.s. court of appeals,u.s. court of appeals for the third circuit,Judge,None (assignment),None (assignment),,,...,1891-06-16,,,,,NaT,Death,1906-06-21,marcus wilson acheson,acheson
7,1376986,1,"acker, william marsh, jr.",u.s. district court,u.s. district court for the northern district ...,Judge,Ronald Reagan,Republican,,,...,1982-08-18,,,,,1996-05-31,Death,2018-06-21,william marsh acker jr.,acker
8,1376991,1,"ackerman, harold arnold",u.s. district court,u.s. district court for the district of new je...,Judge,Jimmy Carter,Democratic,,,...,1979-11-02,,,,,1994-02-15,Death,2009-12-02,harold arnold ackerman,ackerman
9,1376996,1,"ackerman, james waldo",u.s. district court,u.s. district court for the southern district ...,Judge,Gerald Ford,Republican,,,...,1976-07-02,,,,,NaT,Reassignment,1979-03-31,james waldo ackerman,ackerman
10,1376996,2,"ackerman, james waldo",u.s. district court,u.s. district court for the central district o...,Judge,None (reassignment),None (reassignment),,,...,1979-03-31,1982.0,1984.0,,,NaT,Death,1984-11-23,james waldo ackerman,ackerman
11,1377001,1,"acosta, raymond l.",u.s. district court,u.s. district court for the district of puerto...,Judge,Ronald Reagan,Republican,,,...,1982-09-30,,,,,1994-06-01,Death,2014-12-23,raymond l. acosta,acosta
12,1377006,1,"adair, j[ackson] leroy",u.s. district court,u.s. district court for the southern district ...,Judge,Franklin D. Roosevelt,Democratic,,,...,1937-04-27,,,,,NaT,Death,1956-01-19,j[ackson] leroy adair,adair


In [None]:
# Update Congress nominations DataFrame with FJC NIDs from high-confidence matches
logger.info("Updating Congress nominations DataFrame with matched FJC NIDs")
dfs["cong_noms"] = update_dataframe_with_matches(
    dfs["cong_noms"], 
    match_results,
    threshold=MATCH_THRESHOLD,
    column_name="fjc_nid"
)

we_can_haz_nid = dfs["cong_noms"]

# Show updated Congress nominations DataFrame with new fjc_nid column
print("\n=== Updated Congress Nominations DataFrame ===")
print("Added fjc_nid column with matched FJC NIDs")
print(f"Matched {dfs['cong_noms']['fjc_nid'].notna().sum()} out of {len(dfs['cong_noms'])} records")
display(dfs["cong_noms"][["citation", "full_name", "court_name", "receiveddate", "fjc_nid"]].head(10))

2025-07-13 22:53:40.061 | INFO | __main__:<module>:2 - Updating Congress nominations DataFrame with matched FJC NIDs

=== Updated Congress Nominations DataFrame ===
Added fjc_nid column with matched FJC NIDs
Matched 1349 out of 1757 records


KeyError: "['full_name', 'court_name'] not in index"
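
The `KeyError` above says that `full_name` and `court_name` are not columns of `dfs["cong_noms"]` (the earlier outputs show `full_name_from_description` instead). A small defensive helper, sketched below with the hypothetical name `select_existing`, separates the requested columns into those present and those missing so the preview can still be displayed.

```python
def select_existing(wanted, available):
    """Split wanted column names into (present, missing), preserving order."""
    available_set = set(available)
    present = [col for col in wanted if col in available_set]
    missing = [col for col in wanted if col not in available_set]
    return present, missing


# Column names taken from the cell and outputs above.
wanted = ["citation", "full_name", "court_name", "receiveddate", "fjc_nid"]
available = ["citation", "receiveddate", "full_name_from_description", "fjc_nid"]
present, missing = select_existing(wanted, available)
```

In the notebook this could be used as `present, missing = select_existing(wanted, dfs["cong_noms"].columns)`, followed by `display(dfs["cong_noms"][present].head(10))` and a `logger.warning` listing `missing`.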

#### cells below this line come from legacy implementation

In [None]:
# Block by last name exact match
blocks = {}
for lname, group in fjc_judges.groupby("last"):
    blocks[lname] = group

def candidate_fjc_rows(row):
    return blocks.get(row["last"], pd.DataFrame())

In [None]:
# commented out because the cells above are now being used for cleaning instead,
# and we now create columns only when their use is clear (not just because it's easy), which is less error-prone

## --- Clean Congress nominees ------------------------------------------------
#cong_nominees["full_name_clean"] = cong_nominees["full_name"].apply(clean_name)
#cong_nominees[["first","middle","last"]] = cong_nominees["full_name_clean"].apply(
#    lambda n: pd.Series(split_name(n)))
#
#cong_nominees["court_clean"] = cong_nominees["organization"].apply(normalised_court)
#cong_nominees["nomination_date"] = pd.to_datetime(cong_nominees["nomination_date"])
#
## --- Clean FJC judges -------------------------------------------------------
#fjc_judges["full_name_clean"] = fjc_judges["name_full"].apply(clean_name)
#fjc_judges[["first","middle","last"]] = fjc_judges["full_name_clean"].apply(
#    lambda n: pd.Series(split_name(n)))
#
## We'll need a mapping from nid to service records for date & court validation
#fjc_service["court_clean"] = fjc_service["court_name"].apply(normalised_court)
#fjc_service["nomination_date"] = pd.to_datetime(fjc_service["nomination_date"], errors="coerce")
#fjc_service["commission_date"] = pd.to_datetime(fjc_service["commission_date"], errors="coerce")

In [None]:

def best_match(row):
    """Return (nid, score) of the best-scoring FJC candidate for one Congress nominee row."""
    candidates = candidate_fjc_rows(row)
    if candidates.empty:
        return pd.NA, 0.0
    # Combined score: name similarity + date proximity + court similarity (each 0-100)
    best_score = 0.0
    best_nid = pd.NA
    for _, cand in candidates.iterrows():
        name_score = fuzz.token_set_ratio(row["full_name_clean"], cand["full_name_clean"])
        # Use service records to find any matching nomination date
        entries = fjc_service[fjc_service["nid"] == cand["nid"]]
        date_score = 0
        court_score = 0
        if not entries.empty:
            # Smallest absolute diff in days; clamp at 0 so distant dates don't go negative
            diffs = (entries["nomination_date"] - row["nomination_date"]).abs().dt.days
            date_score = max(0, 100 - diffs.min()) if diffs.notna().any() else 0
            # Any court string overlap; skip missing values on either side
            courts = entries["court_clean"].dropna()
            if pd.notna(row["court_clean"]) and row["court_clean"] and not courts.empty:
                if any(row["court_clean"] in c for c in courts):
                    court_score = 100
                else:
                    court_score = max(fuzz.partial_ratio(row["court_clean"], c) for c in courts)
        total = 0.6 * name_score + 0.3 * date_score + 0.1 * court_score
        if total > best_score:
            best_score, best_nid = total, cand["nid"]
    return best_nid, round(best_score, 1)

In [None]:
# Import the new filter_confirmed_nominees function
from nomination_predictor.features import (analyze_match_failures,
                                           filter_confirmed_nominees,
                                           load_simpler_dataframes)

# Load and prepare all dataframes
dfs = load_simpler_dataframes(RAW_DATA_DIR)
cong_nominees = dfs["cong_nominees"]  # This now has all the derived fields
fjc_judges = dfs["fjc_judges"]
fjc_service = dfs["fjc_service"]
cong_nominations_raw = dfs["cong_nominations"]

# OPTIMIZATION: Filter to only confirmed nominees before matching
# This saves processing time by only matching nominees who were confirmed
confirmed_nominees = filter_confirmed_nominees(cong_nominees, cong_nominations_raw)
print(f"Focusing on {len(confirmed_nominees)} confirmed nominees out of {len(cong_nominees)} total nominees")

# Only apply best_match to confirmed nominees
confirmed_nominees[["match_nid", "match_score"]] = confirmed_nominees.apply(
    best_match, axis=1, result_type="expand")

# Merge back with original dataframe to preserve all records
# Non-confirmed nominees will have NaN for match fields
cong_nominees = cong_nominees.merge(
    confirmed_nominees[["citation", "match_nid", "match_score"]], 
    on="citation", 
    how="left"
)

In [None]:

THRESHOLD = 80
matches = cong_nominees[cong_nominees["match_score"] >= THRESHOLD].copy()
print(f"Matched {len(matches)}/{len(cong_nominees)} nominees with score ≥ {THRESHOLD}")
matches.to_csv(INTERIM_DATA_DIR / "congress_fjc_nominee_matches.csv", index=False)

In [None]:
## FIXME: decide whether to save as separate vs. overwrite in interim folder
## Save the cleaned interim datasets for downstream notebooks
#cong_nominees.to_csv(INTERIM_DATA_DIR / "congress_nominees_cleaned.csv", index=False)
#fjc_judges.to_csv(INTERIM_DATA_DIR / "fjc_judges_cleaned.csv", index=False)
#fjc_service.to_csv(INTERIM_DATA_DIR / "fjc_service_cleaned.csv", index=False)

In [None]:
from nomination_predictor.features import analyze_match_failures

THRESHOLD = 80
matches = cong_nominees[cong_nominees["match_score"] >= THRESHOLD].copy()
print(f"Matched {len(matches)}/{len(cong_nominees)} nominees with score ≥ {THRESHOLD}")

# Analyze unmatched records to understand why they didn't match
unmatched_df, reason_summary, examples = analyze_match_failures(cong_nominees, THRESHOLD)

# Display summary of failure reasons
print("\nFailure Reason Summary:")
display(reason_summary)

# Display a few examples of each failure type
print("\nExample records for each failure type:")
for reason, example_df in examples.items():
    print(f"\n{reason}:")
    display(example_df)

# Save both matched and unmatched datasets for further analysis
matches.to_csv(INTERIM_DATA_DIR / "congress_fjc_nominee_matches.csv", index=False)
unmatched_df.to_csv(INTERIM_DATA_DIR / "congress_fjc_nominee_unmatched.csv", index=False)

## Combining FJC data

### Handling nominees' education and job history

Before we combine the FJC data, we have to decide whether and how to handle judges' education, job history, age, ABA rating, etc., because the only other table in the FJC data with exactly one row per nid is "demographics," whose values do not change over time.
The simplest way to handle the tables where nid is not unique would be to left-merge on "nid" and keep only the most recently dated row. In most cases this would likely keep the most prestigious degree or job.

However, a judge's education or job history may well have changed substantially since their first nomination, affecting their qualifications for each later nomination.

All of this suggests to me that it's worth considering the judge's position, education, etc., not as of the most recent records available, but instead _as of when they were nominated._

That means we can't do a simple left-join of all of our FJC data. Instead, we have to fuzzy-match (using a combination of names, court locations, and vacancy dates) to find which "nid" corresponds to each "citation" in the Congress data, as our bridge between the FJC judge data and Congress's nominee data. We then use the "received date" for that citation as a cutoff when looking up education and job records by "nid", so that we avoid mistakenly linking to a citation any education or employment records dated after that cutoff.

Thankfully we do have the school, degree, and degree_year in each education record (bachelor's, master's, associate degrees, LL.B., J.D., etc.), so we can look that up. The education dataframe even comes with a "sequence" number for each education record, which is an easier-to-use indicator of chronological order than degree_year when looking up a judge by "nid".
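
As a rough illustration of the "as of nomination" lookup, the sketch below filters education records for one judge to those earned no later than the year of a cutoff date, ordered by `sequence`. The field names `nid`, `sequence`, `school`, `degree`, and `degree_year` follow the description above; the function name `education_as_of` and the sample rows are made up for illustration. Because only the degree year is available, the comparison is by year and is therefore approximate.

```python
from datetime import date


def education_as_of(education_rows, nid, cutoff):
    """Return one judge's education records with degree_year <= cutoff.year.

    education_rows: iterable of dicts with keys nid, sequence, school,
    degree, degree_year (degree_year may be None when unknown).
    """
    rows = [
        row for row in education_rows
        if row["nid"] == nid
        and row["degree_year"] is not None
        and row["degree_year"] <= cutoff.year
    ]
    return sorted(rows, key=lambda row: row["sequence"])


# Made-up sample data for a hypothetical judge with nid 1.
sample = [
    {"nid": 1, "sequence": 2, "school": "School B", "degree": "J.D.", "degree_year": 1990},
    {"nid": 1, "sequence": 1, "school": "School A", "degree": "B.A.", "degree_year": 1987},
    {"nid": 1, "sequence": 3, "school": "School C", "degree": "LL.M.", "degree_year": 2005},
    {"nid": 2, "sequence": 1, "school": "School D", "degree": "B.S.", "degree_year": 1980},
]
known_at_nomination = education_as_of(sample, nid=1, cutoff=date(1998, 6, 1))
```

Here the LL.M. earned in 2005 is excluded for a nomination received in 1998, which is the leakage the cutoff is meant to prevent.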

Job history is more challenging to deal with because essentially every row in that dataframe is a unique entry, but we do have the data available. On early attempts, it may be simplest to ignore it; a next step would be to feature-engineer basic booleans for whether they did or didn't have experience in positions identifiable by common phrases such as "Private practice", "Attorney general", "Navy", or "Army"; eventually a parser could look for the year ranges listed there as a rough indicator of the amount of experience gained in each professional role.
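
The two job-history ideas above (keyword booleans and year ranges) might look like the following sketch. The phrases come from the paragraph above; the flag names, the helper names, and the example string are hypothetical, and the regular expression assumes year ranges are written like `1952-1982`, which may not hold for every record.

```python
import re

# Phrase -> feature-name mapping; the phrases come from the text above.
KEYWORDS = {
    "has_private_practice": "private practice",
    "has_attorney_general": "attorney general",
    "has_navy": "navy",
    "has_army": "army",
}

# Assumed format for year ranges: four digits, a hyphen, four digits.
YEAR_SPAN = re.compile(r"\b(\d{4})\s*-\s*(\d{4})\b")


def experience_flags(text):
    """Boolean feature per keyword, using case-insensitive substring search."""
    lowered = text.casefold()
    return {flag: phrase in lowered for flag, phrase in KEYWORDS.items()}


def years_of_experience(text):
    """Sum of (end - start) over every year range found in the text."""
    return sum(max(0, int(end) - int(start)) for start, end in YEAR_SPAN.findall(text))


# Made-up example string in the assumed format.
example = "U.S. Army, 1943-1946; Private practice, Birmingham, Alabama, 1952-1982"
flags = experience_flags(example)
total_years = years_of_experience(example)
```

Overlapping roles would be double-counted by this simple sum, so it is only a rough indicator, as the text suggests.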

### Build predecessor lookup table

In [None]:
# Create the predecessor lookup table
predecessor_lookup = get_predecessor_info(seat_timeline_df)
print(f"Created predecessor lookup: {len(predecessor_lookup)} records")

# Preview the predecessor lookup
print(predecessor_lookup.head())
all_dataframes['predecessor_lookup'] = predecessor_lookup
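
The internals of `get_predecessor_info` are not shown in this notebook. As a sketch of what a predecessor lookup could involve, the code below orders the occupants of each seat by start date and records who held the seat immediately before each judge. The field names `seat_id`, `nid`, and `start`, the function name, and the sample rows are all hypothetical and may not match the real `seat_timeline` columns.

```python
from collections import defaultdict


def build_predecessor_lookup(seat_rows):
    """Map (seat_id, nid) -> nid of the previous occupant of that seat, or None.

    seat_rows: iterable of dicts with hypothetical keys seat_id, nid, and
    start (an ISO 'YYYY-MM-DD' string, so string order equals date order).
    """
    by_seat = defaultdict(list)
    for row in seat_rows:
        by_seat[row["seat_id"]].append(row)

    lookup = {}
    for seat_id, rows in by_seat.items():
        previous = None
        for row in sorted(rows, key=lambda r: r["start"]):
            lookup[(seat_id, row["nid"])] = previous["nid"] if previous else None
            previous = row
    return lookup


# Made-up seat timeline with two seats.
timeline = [
    {"seat_id": "A", "nid": 30, "start": "2001-05-01"},
    {"seat_id": "A", "nid": 10, "start": "1976-07-02"},
    {"seat_id": "A", "nid": 20, "start": "1985-03-15"},
    {"seat_id": "B", "nid": 40, "start": "1990-01-01"},
]
predecessors = build_predecessor_lookup(timeline)
```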