# Will the bill make it through capitol hill? (Bills Text Sample Data)

## Critical issues in this notebook: 

**Problem 1: Label leakage**

The text after cleaning still includes post-approval metadata that directly reveals the outcome.

Leakage sources look like:

* “Public Law ###”

* “Approved January …”

* “considered and passed …”

These must be taken care accordingly.

**Problem 2: Validation instability**

We tune thresholds on 6 positive samples (validation set). That is nowhere near enough for:

* Threshold optimization

* Reliable F1 calibration

* Metrics fluctuate wildly.

In [6]:
import os
import csv
import pandas as pd
import numpy as np
import re # Import re for regular expressions
from bs4 import BeautifulSoup
# from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.preprocessing import OneHotEncoder
from scipy.sparse import hstack
from scipy.sparse import csr_matrix


# 1a. Load the Data

In [8]:
# Set the working directory
p = r"C:/Users/saram/Desktop/Erdos_Institute/project-2025/"

In [9]:
bills_all = pd.read_csv(p + r"bill_id_law_text.csv", dtype=str)

print(bills_all.shape)
bills_all

(13812, 4)


Unnamed: 0,id,date,law,full_text
0,118.HR.9124,2024-07-24 00:00:00,True,<html><body><pre>\n[118th Congress Public Law ...
1,118.HR.8667,2024-06-07 00:00:00,True,<html><body><pre>\n[118th Congress Public Law ...
2,119.HJRES.9,2025-01-03 00:00:00,False,<html><body><pre>\n[Congressional Bills 119th ...
3,119.HJRES.8,2025-01-03 00:00:00,False,<html><body><pre>\n[Congressional Bills 119th ...
4,119.HJRES.2,2025-01-03 00:00:00,False,<html><body><pre>\n[Congressional Bills 119th ...
...,...,...,...,...
13807,113.HJRES.54,2013-07-24 00:00:00,False,<html><body><pre>\n[Congressional Bills 113th ...
13808,113.HJRES.53,2013-07-24 00:00:00,False,<html><body><pre>\n[Congressional Bills 113th ...
13809,113.HJRES.52,2013-07-24 00:00:00,False,<html><body><pre>\n[Congressional Bills 113th ...
13810,113.S.1504,2013-09-12 00:00:00,False,<html><body><pre>\n[Congressional Bills 113th ...


In [10]:
bills_all.iloc[0]['full_text'][:10000]  # Display the first 10000 characters of the law text

"<html><body><pre>\n[118th Congress Public Law 259]\n[From the U.S. Government Publishing Office]\n\n\n\n[[Page 138 STAT. 2973]]\n\nPublic Law 118-259\n118th Congress\n\n                                 An Act\n\n\n \n To name the Department of Veterans Affairs community-based outpatient \n       clinic in Auburn, California, as the ``Louis A. Conter VA \n            Clinic''. &lt;&lt;NOTE: Jan. 4, 2025 -  [H.R. 9124]&gt;&gt; \n\n    Be it enacted by the Senate and House of Representatives of the \nUnited States of America in Congress assembled,\nSECTION 1. FINDINGS.\n\n    Congress finds the following:\n            (1) Louis ``Lou'' Anthony Conter was born on September 13, \n        1921, in Ojibwa, Wisconsin.\n            (2) Lt. Commander Lou Conter, the last remaining survivor of \n        the attack on the USS Arizona at Pearl Harbor, was an American \n        hero.\n            (3) On that fearful day, Petty Officer Conter helped \n        evacuate shipmates who were blinded, wou

In [11]:
bills_all["law"].value_counts()  # Check the distribution of the target variable

law
False    13767
True        45
Name: count, dtype: int64

In [12]:
bills_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13812 entries, 0 to 13811
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         13812 non-null  object
 1   date       13812 non-null  object
 2   law        13812 non-null  object
 3   full_text  13812 non-null  object
dtypes: object(4)
memory usage: 431.8+ KB


# 1b. Sample Data

In [27]:
# original counts
counts = bills_all["law"].value_counts()

total = counts.sum()
law_ratio = counts["True"] / total   # ~0.0033

n_total = 1000

# target sizes preserving true imbalance
n_law = max(1, int(law_ratio * n_total))
n_nolaw = n_total - n_law

# split classes
law = bills_all[bills_all["law"] == "True"]
nolaw = bills_all[bills_all["law"] == "False"]

# sample
law_s = law.sample(n=n_law, random_state=42, replace=False)
nolaw_s = nolaw.sample(n=n_nolaw, random_state=42, replace=False)

# merge + shuffle
sample = (
    pd.concat([law_s, nolaw_s])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(sample.shape)

(1000, 4)


In [28]:
print(sample["law"].value_counts())

law
False    997
True       3
Name: count, dtype: int64


In [29]:
print(sample["law"].value_counts(normalize=True))

law
False    0.997
True     0.003
Name: proportion, dtype: float64


In [30]:
sample.to_csv(p + r"bill_id_law_text_sample_5000.csv", index=False)

# 2. Data engineering

Already have `id, date, law, full_text`. Add structured stuff from id and date.

## 2.1 Parse id into structured columns

id examples:

`118.HR.9124`

`113.S.1504`

`119.HJRES.9`

I want:

* congress (int)

* chamber (H or S)

* bill_type (HR, HJRES, S, etc.)

* bill_number (int)

In [52]:
def parse_id(bill_id):
    # Example patterns: 118.HR.9124  or  119.HJRES.9
    m = re.match(r"(\d+)\.([A-Z]+)\.(\d+)", bill_id)
    if not m:
        return pd.Series([None, None, None, None], 
                         index=["congress", "bill_type", "bill_number", "chamber"])
    congress = int(m.group(1))
    bill_type = m.group(2)
    bill_number = int(m.group(3))
    
    # crude chamber mapping
    if bill_type.startswith("H"):
        chamber = "House"
    elif bill_type.startswith("S"):
        chamber = "Senate"
    else:
        chamber = None
    
    return pd.Series([congress, bill_type, bill_number, chamber],
                     index=["congress", "bill_type", "bill_number", "chamber"])

parsed = bills["id"].apply(parse_id)
bills = pd.concat([bills, parsed], axis=1)
# bills = bills.loc[:, ~bills.columns.duplicated()]
bills

Unnamed: 0,id,date,law,full_text,congress,bill_type,bill_number,chamber
0,118.HR.9124,2024-07-24 00:00:00,True,<html><body><pre>\n[118th Congress Public Law ...,118,HR,9124,House
1,118.HR.8667,2024-06-07 00:00:00,True,<html><body><pre>\n[118th Congress Public Law ...,118,HR,8667,House
2,119.HJRES.9,2025-01-03 00:00:00,False,<html><body><pre>\n[Congressional Bills 119th ...,119,HJRES,9,House
3,119.HJRES.8,2025-01-03 00:00:00,False,<html><body><pre>\n[Congressional Bills 119th ...,119,HJRES,8,House
4,119.HJRES.2,2025-01-03 00:00:00,False,<html><body><pre>\n[Congressional Bills 119th ...,119,HJRES,2,House
...,...,...,...,...,...,...,...,...
13807,113.HJRES.54,2013-07-24 00:00:00,False,<html><body><pre>\n[Congressional Bills 113th ...,113,HJRES,54,House
13808,113.HJRES.53,2013-07-24 00:00:00,False,<html><body><pre>\n[Congressional Bills 113th ...,113,HJRES,53,House
13809,113.HJRES.52,2013-07-24 00:00:00,False,<html><body><pre>\n[Congressional Bills 113th ...,113,HJRES,52,House
13810,113.S.1504,2013-09-12 00:00:00,False,<html><body><pre>\n[Congressional Bills 113th ...,113,S,1504,Senate


##  2.2 Explicit label column

In [53]:
print(bills["law"].dtype)
bills["law"]

object


0         True
1         True
2        False
3        False
4        False
         ...  
13807    False
13808    False
13809    False
13810    False
13811    False
Name: law, Length: 13812, dtype: object

In [54]:
bills["label"] = bills["law"].astype(str).map({"True": 1, "False": 0}).astype(int) # True -> 1, False -> 0
bills["label"]

0        1
1        1
2        0
3        0
4        0
        ..
13807    0
13808    0
13809    0
13810    0
13811    0
Name: label, Length: 13812, dtype: int64

In [55]:
bills["label"].value_counts()

label
0    13767
1       45
Name: count, dtype: int64

## 2.3 Time-based features

In [56]:
bills["date"] = pd.to_datetime(bills["date"])
bills["year"] = bills["date"].dt.year
bills["month"] = bills["date"].dt.month
bills["day_of_year"] = bills["date"].dt.dayofyear

# 3. Text Preprocessing and Cleaning

In [57]:
def clean_bill_text(raw_html):
    # 1. Remove HTML tags
    text = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ")

    # 2. Remove escaped entities (like &lt;&lt;, &gt;&gt;)
    text = re.sub(r"&[a-z]+;", " ", text)

    # 3. Remove bracketed junk (page numbers, notes, etc.)
    text = re.sub(r"\[\[.*?\]\]", " ", text)
    text = re.sub(r"\[.*?\]", " ", text)

    # 4. Collapse multiple whitespaces/newlines
    text = re.sub(r"\s+", " ", text)

    # 5. Trim
    return text.strip()

bills["clean_text"] = bills["full_text"].apply(clean_bill_text)
bills.head()

Unnamed: 0,id,date,law,full_text,congress,bill_type,bill_number,chamber,label,year,month,day_of_year,clean_text
0,118.HR.9124,2024-07-24,True,<html><body><pre>\n[118th Congress Public Law ...,118,HR,9124,House,1,2024,7,206,Public Law 118-259 118th Congress An Act To na...
1,118.HR.8667,2024-06-07,True,<html><body><pre>\n[118th Congress Public Law ...,118,HR,8667,House,1,2024,6,159,Public Law 118-251 118th Congress An Act To re...
2,119.HJRES.9,2025-01-03,False,<html><body><pre>\n[Congressional Bills 119th ...,119,HJRES,9,House,0,2025,1,3,<DOC> 119th CONGRESS 1st Session H. J. RES. 9 ...
3,119.HJRES.8,2025-01-03,False,<html><body><pre>\n[Congressional Bills 119th ...,119,HJRES,8,House,0,2025,1,3,<DOC> 119th CONGRESS 1st Session H. J. RES. 8 ...
4,119.HJRES.2,2025-01-03,False,<html><body><pre>\n[Congressional Bills 119th ...,119,HJRES,2,House,0,2025,1,3,<DOC> 119th CONGRESS 1st Session H. J. RES. 2 ...


In [58]:
bills.iloc[0]['clean_text'][:10000]  # Display the first 10000 characters of the law text

"Public Law 118-259 118th Congress An Act To name the Department of Veterans Affairs community-based outpatient clinic in Auburn, California, as the ``Louis A. Conter VA Clinic''. <<NOTE: Jan. 4, 2025 - >> Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled, SECTION 1. FINDINGS. Congress finds the following: (1) Louis ``Lou'' Anthony Conter was born on September 13, 1921, in Ojibwa, Wisconsin. (2) Lt. Commander Lou Conter, the last remaining survivor of the attack on the USS Arizona at Pearl Harbor, was an American hero. (3) On that fearful day, Petty Officer Conter helped evacuate shipmates who were blinded, wounded, or burned, even restraining some of his fellow shipmates from jumping overboard into the burning sea. (4) In the days after the attack, he helped with recovering bodies and putting out fires. Lou Conter's heroic actions saved the lives of many of his shipmates on December 7, 1941. (5) Following Pearl Harbor, Conte

In [59]:
# 1. HTML → raw text

def strip_html(raw_html: str) -> str:
    if not isinstance(raw_html, str):
        return ""
    text = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ")
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# 2. Domain-specific cleanup

def normalize_bill_text(text: str) -> str:
    if not isinstance(text, str):
        return ""

    # Remove NOTE blocks like <<NOTE: ... >>
    text = re.sub(r"<<.*?>>", " ", text)

    # Remove HTML-ish debris like <all>
    text = re.sub(r"<all>", " ", text, flags=re.IGNORECASE)

    # Remove page/stat markers: [[Page 138 STAT. 2973]], [118th Congress Public Law ...], etc.
    text = re.sub(r"\[\[.*?\]\]", " ", text)
    text = re.sub(r"\[[^\]]*STAT\.[^\]]*\]", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"\[From the U\.S\. Government Publishing Office\]", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"\[118th Congress Public Law.*?\]", " ", text, flags=re.IGNORECASE)

    # Kill LEGISLATIVE HISTORY and everything after (procedural, not semantic)
    text = re.sub(r"LEGISLATIVE HISTORY[\s\S]*$", " ", text, flags=re.IGNORECASE)

    # Normalize repeated backticks/quotes
    text = text.replace("``", '"').replace("''", '"')

    # Mild boilerplate simplification:
    # compress "Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled,"
    text = re.sub(
        r"Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled,?",
        "Be it enacted,",
        text,
        flags=re.IGNORECASE
    )

    # Remove insane whitespace / leftover punctuation clutter
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\s+([,.;:])", r"\1", text)

    return text.strip()

# 3. Apply to all bills

bills["clean_text"] = (
    bills["full_text"]
    .astype(str)
    .apply(strip_html)
    .apply(normalize_bill_text)
)


In [60]:
bills.head()

Unnamed: 0,id,date,law,full_text,congress,bill_type,bill_number,chamber,label,year,month,day_of_year,clean_text
0,118.HR.9124,2024-07-24,True,<html><body><pre>\n[118th Congress Public Law ...,118,HR,9124,House,1,2024,7,206,Public Law 118-259 118th Congress An Act To na...
1,118.HR.8667,2024-06-07,True,<html><body><pre>\n[118th Congress Public Law ...,118,HR,8667,House,1,2024,6,159,Public Law 118-251 118th Congress An Act To re...
2,119.HJRES.9,2025-01-03,False,<html><body><pre>\n[Congressional Bills 119th ...,119,HJRES,9,House,0,2025,1,3,[Congressional Bills 119th Congress] [H.J. Res...
3,119.HJRES.8,2025-01-03,False,<html><body><pre>\n[Congressional Bills 119th ...,119,HJRES,8,House,0,2025,1,3,[Congressional Bills 119th Congress] [H.J. Res...
4,119.HJRES.2,2025-01-03,False,<html><body><pre>\n[Congressional Bills 119th ...,119,HJRES,2,House,0,2025,1,3,[Congressional Bills 119th Congress] [H.J. Res...


In [61]:
bills.iloc[0]['clean_text'][:10000]  # Display the first 10000 characters of the law text

'Public Law 118-259 118th Congress An Act To name the Department of Veterans Affairs community-based outpatient clinic in Auburn, California, as the "Louis A. Conter VA Clinic". Be it enacted, SECTION 1. FINDINGS. Congress finds the following: (1) Louis "Lou" Anthony Conter was born on September 13, 1921, in Ojibwa, Wisconsin. (2) Lt. Commander Lou Conter, the last remaining survivor of the attack on the USS Arizona at Pearl Harbor, was an American hero. (3) On that fearful day, Petty Officer Conter helped evacuate shipmates who were blinded, wounded, or burned, even restraining some of his fellow shipmates from jumping overboard into the burning sea. (4) In the days after the attack, he helped with recovering bodies and putting out fires. Lou Conter\'s heroic actions saved the lives of many of his shipmates on December 7, 1941. (5) Following Pearl Harbor, Conter continued serving during WWII in New Guinea and in Europe as an enlisted naval aviation pilot assigned to VP-11, a "Black Ca

In [62]:
# Contractions (harmless to keep; rare in bills)
contractions = {
    "can't": "cannot", "won't": "will not", "i'm": "i am", "it's": "it is",
    "don't": "do not", "didn't": "did not", "doesn't": "does not",
    "i've": "i have", "i'd": "i would", "i'll": "i will", "you're": "you are",
    "we're": "we are", "they're": "they are", "isn't": "is not", "aren't": "are not",
    "wasn't": "was not", "weren't": "were not", "hasn't": "has not", "haven't": "have not",
    "hadn't": "had not", "shouldn't": "should not", "wouldn't": "would not",
    "couldn't": "could not", "mustn't": "must not", "mightn't": "might not",
    "shan't": "shall not", "let's": "let us", "that's": "that is", "who's": "who is",
    "what's": "what is", "here's": "here is", "there's": "there is", "how's": "how is"
}
contractions_re = re.compile("(%s)" % "|".join(map(re.escape, contractions.keys())), flags=re.IGNORECASE)

# Boilerplate / low-signal phrases specific to bills
boilerplate_phrases = [
    # enactment & stock openings
    "be it enacted by the senate and house of representatives of the united states of america in congress assembled",
    "be it enacted by the senate and house of representatives in congress assembled",
    "be it enacted,",
    "a bill to",
    "an act to",
    "an act",
    # headings that rarely add model-usable semantics
    "section 1. short title",
    "short title",
    "table of contents",
]

# Tail sections that are mostly procedural
tail_triggers = [
    "legislative history",
    "calendar no.",
    "attest:",
]

def expand_contractions(text: str) -> str:
    def repl(m):
        original = m.group(0)
        key = original.lower()
        out = contractions.get(key)
        if not out:
            return original
        # Preserve capitalization if contraction starts sentence / proper
        return out.capitalize() if original[0].isupper() else out
    return contractions_re.sub(repl, text)

def strip_html(raw_html: str) -> str:
    if not isinstance(raw_html, str):
        return ""
    text = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ")
    return text

def clean_bill_text(text: str) -> str:
    if not isinstance(text, str):
        return ""

    # 1. Contractions (mostly no-ops here, but safe)
    text = expand_contractions(text)

    # 2. Normalize quotes
    text = text.replace("``", '"').replace("''", '"')

    # 3. Remove page/stat markers and generic brackets junk
    text = re.sub(r"\[\[.*?\]\]", " ", text)
    text = re.sub(r"\[From the U\.S\. Government Publishing Office\]", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"\[\s*\d+th Congress Public Law[^\]]*\]", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"\[\s*Congressional Bills[^\]]*\]", " ", text, flags=re.IGNORECASE)

    # 4. Remove NOTE-style artifacts like <<NOTE: ... >>
    text = re.sub(r"<<.*?>>", " ", text)

    # 5. Drop procedural tail (LEGISLATIVE HISTORY and friends)
    pattern_tail = r"(" + "|".join(tail_triggers) + r")[\s\S]*$"
    text = re.sub(pattern_tail, " ", text, flags=re.IGNORECASE)

    # 6. Remove low-signal boilerplate phrases
    for phrase in boilerplate_phrases:
        text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)

    # 7. Remove repeated words like "sec sec" (rare but safe)
    text = re.sub(r"\b(\w+)( \1\b)+", r"\1", text, flags=re.IGNORECASE)

    # 8. Keep alphanumeric + basic punctuation; drop HTML/markup leftovers
    text = re.sub(r"[^A-Za-z0-9\s\.\,\;\:\?\!\-]", " ", text)

    # 9. Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()

    return text

# Apply full pipeline to your dataset
bills["clean_text"] = (
    bills["full_text"]
    .astype(str)
    .apply(strip_html)
    .apply(clean_bill_text)
)

In [63]:
bills.iloc[0]['clean_text'][:10000]  # Display the first 10000 characters of the law text

'Public Law 118-259 118th Congress To name the Department of Veterans Affairs community-based outpatient clinic in Auburn, California, as the Louis A. Conter VA Clinic . Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled, SECTION 1. FINDINGS. Congress finds the following: 1 Louis Lou Anthony Conter was born on September 13, 1921, in Ojibwa, Wisconsin. 2 Lt. Commander Lou Conter, the last remaining survivor of the attack on the USS Arizona at Pearl Harbor, was an American hero. 3 On that fearful day, Petty Officer Conter helped evacuate shipmates who were blinded, wounded, or burned, even restraining some of his fellow shipmates from jumping overboard into the burning sea. 4 In the days after the attack, he helped with recovering bodies and putting out fires. Lou Conter s heroic actions saved the lives of many of his shipmates on December 7, 1941. 5 Following Pearl Harbor, Conter continued serving during WWII in New Guinea and 

If you use sentence-transformers (MPNet, all-mpnet-base-v2, etc.) or Hugging Face tokenizers, no need to lowercase. Those tokenizers handle case internally and were trained on cased text.

## 3.1 Extract a shorter “front” text

Full bills can be > 10k tokens; some models choke. Use a heuristic “front slice”. 
* Classic ML (tf-idf) can use full raw_text (within memory).
* Transformers may use snippet_text or chunked raw_text.

In [19]:
# def front_snippet(text, max_chars=5000):
#     text = text[:max_chars]
#     return text

# bills["snippet_text"] = bills["clean_text"].apply(front_snippet)

## 3.2 Document length features

We can extract quantitative properties of the cleaned text. These tell us distribution of bill lengths. It matters because:

* Extremely long bills

* Very short “joint resolutions”

* Text length correlates with probability of passage in some congresses

In [65]:
bills["len_chars"] = bills["clean_text"].str.len()
bills["len_words"] = bills["clean_text"].str.split().apply(len)
bills["len_sentences"] = bills["clean_text"].str.count(r"\.")

## 3.3 Section headers extraction

Many bills follow patterns:

“Be it enacted…”

“Section 1.”

“Short Title”

“Findings”

“Definitions”

Sometimes “real” legislation has more structural sections than symbolic resolutions.

In [66]:
bills["section_count"] = bills["clean_text"].str.count(r"SEC\.|Section")

# 4. EDA: class balance, per congress

Need to know how bad the imbalance is and how it varies over time.

Insights we can expect:

* Pass rate maybe around 5–20% depending on what exactly these `True` values represent.

* Bills of type `HJRES` vs `HR` vs `S` may have different probabilities.

* Extreme variation in length.

* Maybe treat bill type & congress as features or even build type-specific models later.


In [67]:
bills.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13812 entries, 0 to 13811
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   id             13812 non-null  object        
 1   date           13812 non-null  datetime64[ns]
 2   law            13812 non-null  object        
 3   full_text      13812 non-null  object        
 4   congress       13812 non-null  int64         
 5   bill_type      13812 non-null  object        
 6   bill_number    13812 non-null  int64         
 7   chamber        13812 non-null  object        
 8   label          13812 non-null  int64         
 9   year           13812 non-null  int32         
 10  month          13812 non-null  int32         
 11  day_of_year    13812 non-null  int32         
 12  clean_text     13812 non-null  object        
 13  len_chars      13812 non-null  int64         
 14  len_words      13812 non-null  int64         
 15  len_sentences  1381

In [68]:
print(f"How many values for each label: \n", bills["label"].value_counts(normalize=True))

How many values for each label: 
 label
0    0.996742
1    0.003258
Name: proportion, dtype: float64


The 113th Congress ran from 2013 to 2014, while the 119th Congress is currently in session from 2025 to 2026. The years between these two Congresses are 2015-2016 (114th), 2017-2018 (115th), 2019-2020 (116th), and 2021-2024 (117th and 118th). Each numbered Congress serves a two-year term, starting with the 113th Congress in 2013. 

113th Congress: 2013–2014

114th Congress: 2015–2016

115th Congress: 2017–2018

116th Congress: 2019–2020

117th Congress: 2021–2022

118th Congress: 2023–2024

119th Congress: 2025–2026 

In [69]:
print(f"Pass Rate by Congress:\n", bills.groupby("congress")["label"].mean())      

Pass Rate by Congress:
 congress
113    0.028037
114    0.010687
115    0.016923
116    0.000418
117    0.003599
118    0.001244
119    0.000000
Name: label, dtype: float64


In [70]:
print(f"Pass Rate by Type:\n", bills.groupby("bill_type")["label"].mean())

Pass Rate by Type:
 bill_type
HJRES    0.006192
HR       0.001428
S        0.008101
SJRES    0.000000
Name: label, dtype: float64


In [71]:
print(f"Documents Length Distribution:\n", bills["clean_text"].str.len().describe()) 

Documents Length Distribution:
 count    1.381200e+04
mean     7.208960e+03
std      2.793674e+04
min      3.300000e+01
25%      1.778750e+03
50%      3.301000e+03
75%      6.809000e+03
max      1.785918e+06
Name: clean_text, dtype: float64


# 5. Train / validation / test splitting 

## 5.1 Time-based split

We have congresses from 113 to 119 (2013–2025). We can use them like this:

Train: 113–116 (older)

Validation: 117

Test: 118–119 (latest)

This way model learns on history and is evaluated on near-future plus truly future congress.

**NOTE:** We’re using a time-based split, not a random split. This means we never stratify, because stratification shuffles across time and leaks future information backward.

In [81]:
# Check how many passing bills per time period
def summarize(mask, name):
    subset = bills[mask]
    print(f"--- {name} ---")
    print("Rows:", len(subset))
    print(subset["label"].value_counts())
    print()

mask_train = bills["congress"] <= 116
mask_val   = bills["congress"] == 117
mask_test  = bills["congress"] >= 118

summarize(mask_train, "TRAIN (<=116)")
summarize(mask_val,   "VAL (==117)")
summarize(mask_test,  "TEST (>=118)")


--- TRAIN (<=116) ---
Rows: 6516
label
0    6484
1      32
Name: count, dtype: int64

--- VAL (==117) ---
Rows: 1667
label
0    1661
1       6
Name: count, dtype: int64

--- TEST (>=118) ---
Rows: 5629
label
0    5622
1       7
Name: count, dtype: int64



In [73]:
# Train, validation and test sets
train_mask = bills["congress"] <= 116
val_mask   = bills["congress"] == 117
test_mask  = bills["congress"] >= 118
 
train_bills = bills[train_mask]
val_bills   = bills[val_mask]
test_bills  = bills[test_mask]

In [74]:
print("Train rows:", train_mask.sum())
print("Val rows:  ", val_mask.sum())
print("Test rows: ", test_mask.sum())

Train rows: 6516
Val rows:   1667
Test rows:  5629


In [43]:
# Targets
y_train = train_bills["label"].values
y_val   = val_bills["label"].values
y_test  = test_bills["label"].values

# 6. TF–IDF

## 6.1 Build a tf-idf vectorizer

In [30]:
# The vectorizer converts the long text documents into tf-idf features.

tfidf = TfidfVectorizer(
    ngram_range=(1, 2),     # unigrams + bigrams; captures basic structure
    min_df=5,               # drop tokens that appear in fewer than 5 bills
    max_df=0.9,             # drop tokens that appear in over 90% of bills
    max_features=200000      # cap vocabulary size for memory stability
)

# Fit on training full text only. This builds the vocabulary.
X_train_tfidf = tfidf.fit_transform(train_bills["clean_text"])

# Apply the same transformation to validation and test sets.
X_val_tfidf  = tfidf.transform(val_bills["clean_text"])
X_test_tfidf = tfidf.transform(test_bills["clean_text"])

In [None]:
# Baseline predictor
# Mean pass rate in the TRAIN set
base_rate = y_train.mean()

# Predict constant probability for every sample
p_train_base = np.full_like(y_train, fill_value=base_rate, dtype=float)
p_val_base   = np.full_like(y_val,   fill_value=base_rate, dtype=float)
p_test_base  = np.full_like(y_test,  fill_value=base_rate, dtype=float)

# Convert probabilities into predicted labels using 0.5 threshold
y_train_pred = (p_train_base >= 0.5).astype(int)
y_val_pred   = (p_val_base   >= 0.5).astype(int)
y_test_pred  = (p_test_base  >= 0.5).astype(int)

# METRICS

print("Base VAL  ROC-AUC:",  roc_auc_score(y_val,  p_val_base))
print("Base TEST ROC-AUC:",  roc_auc_score(y_test, p_test_base))

print("Base TEST PR-AUC:",   average_precision_score(y_test, p_test_base))

print("Base TEST Precision:", precision_score(y_test, y_test_pred, zero_division=0))
print("Base TEST Recall:",    recall_score(y_test, y_test_pred, zero_division=0))
print("Base TEST F1:",        f1_score(y_test, y_test_pred, zero_division=0))

Base VAL  ROC-AUC: 0.5
Base TEST ROC-AUC: 0.5
Base TEST PR-AUC: 0.0012435601350151003
Base TEST Precision: 0.0
Base TEST Recall: 0.0
Base TEST F1: 0.0


Baseline = “always predict the historical pass rate.” This is a dumb benchmark. AUC will be around 0.5; PR-AUC will be roughly the positive rate.

## 6.2 Add simple numeric features

We can later augment tf-idf with numeric columns like bill length or congress. For now, we're keeping it text-only for the first baseline.

## 6.3 Probability threshold tuning

Each classifier outputs a probability p = P(law=True). With an imbalanced class, the default threshold of 0.5 is useless.

**Note:**

The model gives a probability `p` that a bill will pass. We convert that probability into a final prediction (0 or 1) using a threshold. Usually we would use 0.5, but that makes no sense in an imbalanced problem.

The positive class (“bill passes”) is rare. So `p` is usually small, even when the model believes the bill is likely to pass relative to the dataset.

We fix this by choosing a threshold that maximizes a metric you care about. We do that on the validation set (never test).

Mechanics:

* Compute precision and recall at every possible cutoff.

* Compute F1 score for each cutoff:

* F1 = 2 * precision * recall / (precision + recall)

Find which threshold gives the best F1 (or best recall, or best precision, depending on goals).

Then you evaluate on the test set using that threshold: `y_test_pred = (p_test >= best_thresh).astype(int)`

In [104]:
from sklearn.metrics import precision_recall_curve, precision_score, recall_score, f1_score

def pick_best_threshold_f1(y_true, p_scores):
    """
    Given true labels and predicted probabilities on the VALIDATION set,
    compute precision/recall at all thresholds and return the threshold
    that maximizes F1.
    """
    precision, recall, thresholds = precision_recall_curve(y_true, p_scores)

    # last precision/recall point corresponds to threshold=+inf, ignore it
    precision = precision[:-1]
    recall = recall[:-1]
    thresholds = thresholds

    f1_scores = 2 * precision * recall / (precision + recall + 1e-8)
    best_idx = f1_scores.argmax()
    best_thresh = thresholds[best_idx]

    print("Best threshold (by F1):", best_thresh)
    print("VAL precision at best thresh:", precision[best_idx])
    print("VAL recall at best thresh:   ", recall[best_idx])
    print("VAL F1 at best thresh:       ", f1_scores[best_idx])

    return best_thresh


# 7. Baseline models on TF-IDF

## 7.1 Logistic regression + TF-IDF

In [None]:
# Linear classifier for high-dimensional sparse TF-IDF.
# Balanced weights counter the strong imbalance (few bills pass).
logreg = LogisticRegression(
    max_iter=2000,
    n_jobs=-1,
    class_weight="balanced"
)

# Train only on past congresses.
logreg.fit(X_train_tfidf, y_train)

# Probability of passage for validation and test sets.
p_val_lr  = logreg.predict_proba(X_val_tfidf)[:, 1]
p_test_lr = logreg.predict_proba(X_test_tfidf)[:, 1]

# Pick threshold on VALIDATION set
best_thresh_lr = pick_best_threshold_f1(y_val, p_val_lr)

# Turn probabilities into class predictions using best threshold.
y_val_pred_lr  = (p_val_lr  >= best_thresh_lr).astype(int)
y_test_pred_lr = (p_test_lr >= best_thresh_lr).astype(int)

# METRICS 

print("LogReg TEST precision:", precision_score(y_test, y_test_pred_lr))
print("LogReg TEST recall:   ", recall_score(y_test,  y_test_pred_lr))
print("LogReg TEST F1:       ", f1_score(y_test,     y_test_pred_lr))

# Ranking metrics (threshold-free); AUC and PR-AUC ignore the threshold
print("LogReg VAL AUC:",      roc_auc_score(y_val,  p_val_lr))
print("LogReg TEST AUC:",     roc_auc_score(y_test, p_test_lr))
print("LogReg TEST PR-AUC:",  average_precision_score(y_test, p_test_lr))

Best threshold (by F1): 0.29299534786519704
VAL precision at best thresh: 0.5714285714285714
VAL recall at best thresh:    0.6666666666666666
VAL F1 at best thresh:        0.6153846104142012
LogReg TEST precision: 0.14705882352941177
LogReg TEST recall:    0.7142857142857143
LogReg TEST F1:        0.24390243902439024
LogReg VAL AUC: 0.9969897652016858
LogReg TEST AUC: 0.9763937592112619
LogReg TEST PR-AUC: 0.4167401435086253


TEST ROC-AUC: 0.976

TEST PR-AUC: 0.417

Precision: 0.15

Recall: 0.71

F1: 0.24

Meaning:

Correctly catch ~70% of the true passing bills. But every 1 correct “pass” prediction comes with about 6 false alarms.

Ranking quality is strong; classification isn't good.

## 7.2 Linear SVM + tf-idf

In [107]:
# Linear SVM with class balancing.
svm = LinearSVC(class_weight="balanced")

# SVM does not output probabilities.
# CalibratedClassifierCV wraps it and learns a sigmoid to produce calibrated probabilities.
svm_cal = CalibratedClassifierCV(svm, method="sigmoid", cv=5)

# Fit on training full-text TF-IDF.
svm_cal.fit(X_train_tfidf, y_train)

# Probability outputs for validation and test.
p_val_svm  = svm_cal.predict_proba(X_val_tfidf)[:, 1]
p_test_svm = svm_cal.predict_proba(X_test_tfidf)[:, 1]

# Pick threshold on VALIDATION set.
best_thresh_svm = pick_best_threshold_f1(y_val, p_val_svm)

# Turn probabilities into class predictions using best threshold.
y_val_pred_svm  = (p_val_svm  >= best_thresh_svm).astype(int)
y_test_pred_svm = (p_test_svm >= best_thresh_svm).astype(int)

# METRICS 

# Thresholded point metrics on TEST at best_thresh_svm.
print("SVM TEST precision:", precision_score(y_test, y_test_pred_svm, zero_division=0))
print("SVM TEST recall:   ", recall_score(y_test,  y_test_pred_svm, zero_division=0))
print("SVM TEST F1:       ", f1_score(y_test,     y_test_pred_svm, zero_division=0))

# Ranking metrics (threshold-free); AUC and PR-AUC ignore the threshold.
print("SVM VAL AUC:",      roc_auc_score(y_val,  p_val_svm))
print("SVM TEST AUC:",     roc_auc_score(y_test, p_test_svm))
print("SVM TEST PR-AUC:",  average_precision_score(y_test, p_test_svm))

Best threshold (by F1): 0.0395114882929851
VAL precision at best thresh: 0.5454545454545454
VAL recall at best thresh:    1.0
VAL F1 at best thresh:        0.7058823483737025
SVM TEST precision: 0.14705882352941177
SVM TEST recall:    0.7142857142857143
SVM TEST F1:        0.24390243902439024
SVM VAL AUC: 0.9984948826008428
SVM TEST AUC: 0.9859226508105909
SVM TEST PR-AUC: 0.5393499331478957


Metrics almost identical to LR.

PR-AUC improves to 0.54.

Meaning: 

Same trade-off: high recall, awful precision. Still far from reliable yes/no predictions.

## 7.3 XGBoost + TF-IDF

In [108]:
# 1) Class imbalance: compute scale_pos_weight = N_neg / N_pos from TRAIN only.
n_pos = np.sum(y_train == 1)
n_neg = np.sum(y_train == 0)
scale_pos = n_neg / n_pos

# XGBoost on TF-IDF features
xgb = XGBClassifier(
    objective="binary:logistic", # outputs probability bill passes
    eval_metric="logloss",
    n_estimators=500,
    learning_rate=0.01, # low learning_rate, shallow max_depth → more stable, less overfit
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=scale_pos, # makes rare "pass" class matter
    n_jobs=-1,
    random_state=42
)

# Fit on past congresses only.
xgb.fit(X_train_tfidf, y_train)

# Predict probabilities of passage for VAL and TEST.
p_val_xgb  = xgb.predict_proba(X_val_tfidf)[:, 1]
p_test_xgb = xgb.predict_proba(X_test_tfidf)[:, 1]

# Pick threshold on VALIDATION set 
best_thresh_xgb = pick_best_threshold_f1(y_val, p_val_xgb)

# Convert probabilities into hard labels at best_thresh_xgb.
y_val_pred_xgb  = (p_val_xgb  >= best_thresh_xgb).astype(int)
y_test_pred_xgb = (p_test_xgb >= best_thresh_xgb).astype(int)

# METRICS 
print("XGB TEST precision:", precision_score(y_test, y_test_pred_xgb, zero_division=0))
print("XGB TEST recall:   ", recall_score(y_test,  y_test_pred_xgb, zero_division=0))
print("XGB TEST F1:       ", f1_score(y_test,     y_test_pred_xgb, zero_division=0))

print("XGB VAL ROC-AUC:",      roc_auc_score(y_val,  p_val_xgb))
print("XGB TEST ROC-AUC:",     roc_auc_score(y_test, p_test_xgb))
print("XGB TEST PR-AUC:",      average_precision_score(y_test, p_test_xgb))

Best threshold (by F1): 0.9944706
VAL precision at best thresh: 1.0
VAL recall at best thresh:    1.0
VAL F1 at best thresh:        0.999999995
XGB TEST precision: 1.0
XGB TEST recall:    0.8571428571428571
XGB TEST F1:        0.9230769230769231
XGB VAL ROC-AUC: 1.0
XGB TEST ROC-AUC: 1.0
XGB TEST PR-AUC: 0.9999999999999998


**Raw TF-IDF + XGB:**

VAL & TEST ROC-AUC: 1.00

TEST PR-AUC: ~1.00

Precision: 1.00

Recall: 0.86

F1: 0.92

**TF-IDF + Metadata + XGB:** Same near-perfect numbers.

This is statistically implausible and almost certainly indicates data leakage or target contamination. Real-world legislative prediction does not produce perfect classification from pure language alone.

**Why this happened:**

Passed bills contain:

“Public Law ###”

“Approved January …”

“considered and passed House / Senate”

Those phrases directly encode the label. The cleaned text contains post-vote signals. So the model isn’t predicting success from bill content. It’s simply detecting the bill already became law.

**The issues is** that our model is not a model of prediction, but a model of document state recognition. Completely invalid for forecasting.

# 8. Add simple structured features to TF-IDF

TF-IDF encodes text; numeric + one-hot features encode structural context (congress, chamber, bill type, length, etc.). Models now see both language and structural signals.

Create numeric matrix of these:

`congress`

`bill_type`

`chamber`

`year, month, etc.`

and combine with tf-idf.

In [None]:
#  Build metadata matrices 

# Numeric metadata columns already exist in `bills`
num_cols = ["congress", "year", "month", "len_words", "section_count"]

X_train_num = train_bills[num_cols].to_numpy()
X_val_num   = val_bills[num_cols].to_numpy()
X_test_num  = test_bills[num_cols].to_numpy()

# Categorical metadata
cat_cols = ["bill_type", "chamber"]

enc = OneHotEncoder(handle_unknown="ignore") # sparse=True default

X_train_cat = enc.fit_transform(train_bills[cat_cols])
X_val_cat   = enc.transform(val_bills[cat_cols])
X_test_cat  = enc.transform(test_bills[cat_cols])

# Combine TF-IDF + numeric + categorical
X_train_meta = hstack([X_train_tfidf, X_train_num, X_train_cat])
X_val_meta   = hstack([X_val_tfidf,   X_val_num,   X_val_cat])
X_test_meta  = hstack([X_test_tfidf,  X_test_num,  X_test_cat])

## 8.1 Logistic Regression + metadata

In [109]:
logreg_meta = LogisticRegression(
    max_iter=2000,
    n_jobs=-1,
    class_weight="balanced"  # imbalanced data
)

# Fit on combined features from past congresses
logreg_meta.fit(X_train_meta, y_train)

# Predict probabilities bill passes on val/test
p_val_lr_meta  = logreg_meta.predict_proba(X_val_meta)[:, 1]
p_test_lr_meta = logreg_meta.predict_proba(X_test_meta)[:, 1]

# Pick threshold on VALIDATION set (model-specific).
best_thresh_lr_meta = pick_best_threshold_f1(y_val, p_val_lr_meta)

# Convert probabilities into class predictions using best threshold.
y_val_pred_meta  = (p_val_lr_meta  >= best_thresh_lr_meta).astype(int)
y_test_pred_meta = (p_test_lr_meta >= best_thresh_lr_meta).astype(int)

# METRICS
# Thresholded metrics on TEST
print("LogReg+Meta TEST precision:", precision_score(y_test, y_test_pred_meta, zero_division=0))
print("LogReg+Meta TEST recall:   ", recall_score(y_test,  y_test_pred_meta, zero_division=0))
print("LogReg+Meta TEST F1:       ", f1_score(y_test,     y_test_pred_meta, zero_division=0))

# Ranking metrics that ignore threshold
print("LogReg+Meta VAL AUC:",      roc_auc_score(y_val,  p_val_lr_meta))
print("LogReg+Meta TEST AUC:",     roc_auc_score(y_test, p_test_lr_meta))
print("LogReg+Meta TEST PR-AUC:",  average_precision_score(y_test, p_test_lr_meta))

Best threshold (by F1): 0.6456334378706323
VAL precision at best thresh: 1.0
VAL recall at best thresh:    0.3333333333333333
VAL F1 at best thresh:        0.4999999962500001
LogReg+Meta TEST precision: 0.0
LogReg+Meta TEST recall:    0.0
LogReg+Meta TEST F1:        0.0
LogReg+Meta VAL AUC: 0.9960866947621915
LogReg+Meta TEST AUC: 0.9880571225288408
LogReg+Meta TEST PR-AUC: 0.36126111886760975


## 8.2 Calibrated Linear SVM + metadata

In [110]:
svm = LinearSVC(class_weight="balanced")  # margin-based classifier
svm_cal_meta = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
# Calibration turns SVM scores into probabilities.

# Fit on combined features
svm_cal_meta.fit(X_train_meta, y_train)

# Predict calibrated probabilities
p_val_svm_meta  = svm_cal_meta.predict_proba(X_val_meta)[:, 1]
p_test_svm_meta = svm_cal_meta.predict_proba(X_test_meta)[:, 1]

# Choose the best threshold on VALIDATION set using F1.
best_thresh_svm_meta = pick_best_threshold_f1(y_val, p_val_svm_meta)

# Turn probabilities into class labels using the tuned threshold.
y_val_pred_meta  = (p_val_svm_meta  >= best_thresh_svm_meta).astype(int)
y_test_pred_meta = (p_test_svm_meta >= best_thresh_svm_meta).astype(int)

# METRICS 

# Thresholded metrics on TEST (actual classification behavior).
print("SVM+Meta TEST precision:", precision_score(y_test, y_test_pred_meta, zero_division=0))
print("SVM+Meta TEST recall:   ", recall_score(y_test,  y_test_pred_meta, zero_division=0))
print("SVM+Meta TEST F1:       ", f1_score(y_test,     y_test_pred_meta, zero_division=0))

# Ranking metrics (threshold-free).
print("SVM+Meta VAL ROC-AUC:",   roc_auc_score(y_val,  p_val_svm_meta))
print("SVM+Meta TEST ROC-AUC:",  roc_auc_score(y_test, p_test_svm_meta))
print("SVM+Meta TEST PR-AUC:",   average_precision_score(y_test, p_test_svm_meta))



Best threshold (by F1): 0.0057071349063341804
VAL precision at best thresh: 0.03333333333333333
VAL recall at best thresh:    0.5
VAL F1 at best thresh:        0.062499998828125014
SVM+Meta TEST precision: 0.03125
SVM+Meta TEST recall:    0.14285714285714285
SVM+Meta TEST F1:        0.05128205128205128
SVM+Meta VAL ROC-AUC: 0.8321292394140076
SVM+Meta TEST ROC-AUC: 0.8065253849672206
SVM+Meta TEST PR-AUC: 0.0165619836556158


## 8.3 XGBoost + metadata (with scale_pos_weight)

In [111]:
# Class imbalance: compute scale_pos_weight = N_neg / N_pos on TRAIN
n_pos = np.sum(y_train == 1)
n_neg = np.sum(y_train == 0)
scale_pos = n_neg / n_pos

xgb_meta = XGBClassifier(
    objective="binary:logistic",
    eval_metric="logloss",
    n_estimators=500,
    learning_rate=0.01,  # low LR, more stable
    max_depth=4,         # shallow trees on high-dim sparse
    subsample=0.8,       # row subsampling
    colsample_bytree=0.8, # feature subsampling
    scale_pos_weight=scale_pos,  # corrects extreme imbalance
    n_jobs=-1,
    random_state=42
)

# Fit on combined features
xgb_meta.fit(X_train_meta, y_train)

# Predict probabilities of passage
p_val_xgb_meta  = xgb_meta.predict_proba(X_val_meta)[:, 1]
p_test_xgb_meta = xgb_meta.predict_proba(X_test_meta)[:, 1]

# Pick best threshold on VALIDATION using F1
best_thresh_xgb_meta = pick_best_threshold_f1(y_val, p_val_xgb_meta)

# Turn probabilities into class labels using tuned threshold
y_val_pred_xgb_meta  = (p_val_xgb_meta  >= best_thresh_xgb_meta).astype(int)
y_test_pred_xgb_meta = (p_test_xgb_meta >= best_thresh_xgb_meta).astype(int)

# METRICS 
# Thresholded metrics on TEST: how the classifier behaves in practice
print("XGB+Meta TEST Precision:", precision_score(y_test, y_test_pred_xgb_meta, zero_division=0))
print("XGB+Meta TEST Recall:",    recall_score(y_test,  y_test_pred_xgb_meta, zero_division=0))
print("XGB+Meta TEST F1:",        f1_score(y_test,     y_test_pred_xgb_meta, zero_division=0))

# Threshold-free ranking metrics (use probabilities)
print("XGB+Meta VAL ROC-AUC:",   roc_auc_score(y_val,  p_val_xgb_meta))
print("XGB+Meta TEST ROC-AUC:",  roc_auc_score(y_test, p_test_xgb_meta))
print("XGB+Meta TEST PR-AUC:",   average_precision_score(y_test, p_test_xgb_meta))

Best threshold (by F1): 0.9947003
VAL precision at best thresh: 1.0
VAL recall at best thresh:    1.0
VAL F1 at best thresh:        0.999999995
XGB+Meta TEST Precision: 1.0
XGB+Meta TEST Recall: 0.8571428571428571
XGB+Meta TEST F1: 0.9230769230769231
XGB+Meta VAL ROC-AUC: 1.0
XGB+Meta TEST ROC-AUC: 1.0
XGB+Meta TEST PR-AUC: 1.0


**Metadata hurt the linear models**

Logistic + metadata:

* High AUC

* Zero precision / recall at tuned threshold

Reason: Metadata (year, congress, length) creates extreme separation artifacts with so few positives.

Threshold optimization on 6 validation positives overfits instantly.

The validation set is too small for stable threshold tuning, so optimization becomes noise-fitting.

# 9. Transformers

## 9.1 DistilBERT Embeddings

In [82]:
# DistilBERT sentence-transformer model
distil_model = SentenceTransformer("distilbert-base-nli-mean-tokens")

# Encode full cleaned bill text into dense vectors (one per bill)
distil_emb = distil_model.encode(
    bills["clean_text"].tolist(),
    batch_size=16,
    show_progress_bar=True
)
distil_emb = np.array(distil_emb)

Batches:   0%|          | 0/864 [00:00<?, ?it/s]

In [84]:
# Split embeddings into train/val/test using same masks
X_train_distil = distil_emb[train_mask.values]
X_val_distil   = distil_emb[val_mask.values]
X_test_distil  = distil_emb[test_mask.values]

## 9.2 Logistic Regression on DistilBERT Embeddings 

In [112]:
logreg_distil = LogisticRegression(
    max_iter=2000,
    n_jobs=-1,
    class_weight="balanced"  
)

# Fit on past congresses' embedding vectors
logreg_distil.fit(X_train_distil, y_train)

# Predict probability bill will pass (class 1) on val/test
p_val_lr_distil  = logreg_distil.predict_proba(X_val_distil)[:, 1]
p_test_lr_distil = logreg_distil.predict_proba(X_test_distil)[:, 1]

# Pick best threshold on VALIDATION using F1
best_thresh_distil = pick_best_threshold_f1(y_val, p_val_lr_distil)

# Turn probabilities into hard labels using tuned threshold
y_val_pred_distil  = (p_val_lr_distil  >= best_thresh_distil).astype(int)
y_test_pred_distil = (p_test_lr_distil >= best_thresh_distil).astype(int)

# Thresholded metrics: how the classifier behaves in practice on TEST
print("DistilBERT+LogReg TEST Precision:",
      precision_score(y_test, y_test_pred_distil, zero_division=0))
print("DistilBERT+LogReg TEST Recall:",
      recall_score(y_test, y_test_pred_distil, zero_division=0))
print("DistilBERT+LogReg TEST F1:",
      f1_score(y_test, y_test_pred_distil, zero_division=0))

# Threshold-free ranking metrics (probability-based)
print("DistilBERT+LogReg VAL ROC-AUC:",
      roc_auc_score(y_val, p_val_lr_distil))
print("DistilBERT+LogReg TEST ROC-AUC:",
      roc_auc_score(y_test, p_test_lr_distil))
print("DistilBERT+LogReg TEST PR-AUC:",
      average_precision_score(y_test, p_test_lr_distil))

Best threshold (by F1): 0.9955002922986003
VAL precision at best thresh: 1.0
VAL recall at best thresh:    0.8333333333333334
VAL F1 at best thresh:        0.9090909041322315
DistilBERT+LogReg TEST Precision: 1.0
DistilBERT+LogReg TEST Recall: 0.5714285714285714
DistilBERT+LogReg TEST F1: 0.7272727272727273
DistilBERT+LogReg VAL ROC-AUC: 0.9997993176801123
DistilBERT+LogReg TEST ROC-AUC: 0.9999237688672054
DistilBERT+LogReg TEST PR-AUC: 0.947845804988662


## 9.2 MPNet Embeddings
Use pretrained transformer models to turn each bill’s text into a numerical vector.
MPNet (sentence-transformers/all-mpnet-base-v2) tends to outperform BERT on semantic similarity and downstream classification.

In [None]:
model = SentenceTransformer('all-mpnet-base-v2')

# Encode bill text
embeddings = model.encode(bills["clean_text"].tolist(), 
                          batch_size=16, 
                          show_progress_bar=True)
bills_emb = np.array(embeddings)

In [98]:
# p = r"C:/Users/saram/Desktop/Erdos_Institute/project-2025/"

# # Save
# np.save(p + "bill_embeddings.npy", bills_emb)

In [87]:
# Load later
bills_emb = np.load(p + "bill_embeddings.npy")

In [88]:
bills_emb.shape  # Should be (num_bills, embedding_dim)

(13812, 768)

In [89]:
bills_emb

array([[ 0.0077152 ,  0.08115873, -0.00354219, ..., -0.00809604,
         0.0123047 , -0.01125902],
       [ 0.00645859,  0.01635592,  0.0276007 , ..., -0.02757508,
        -0.03304846, -0.02868639],
       [ 0.02797876,  0.03059493,  0.04683481, ..., -0.01529225,
        -0.04337279, -0.01323536],
       ...,
       [ 0.03796244, -0.04223369,  0.03787776, ...,  0.0033854 ,
        -0.00608084, -0.00959797],
       [-0.06522495,  0.03699569,  0.00618374, ..., -0.02298611,
        -0.0113249 ,  0.01463324],
       [ 0.02661327, -0.00600304,  0.0545648 , ...,  0.03010296,
        -0.03346274,  0.01599839]], shape=(13812, 768), dtype=float32)

In [90]:
# Split into train/val/test
X_train_mpnet = bills_emb[train_mask.values]
X_val_mpnet   = bills_emb[val_mask.values]
X_test_mpnet  = bills_emb[test_mask.values]

## 9.4 XGBoost on MPNet Embeddings


In [113]:
# Class imbalance: scale_pos_weight = N_neg / N_pos on TRAIN
n_pos = np.sum(y_train == 1)
n_neg = np.sum(y_train == 0)
scale_pos = n_neg / n_pos

xgb_mpnet = XGBClassifier(
    objective="binary:logistic",
    eval_metric="logloss",
    n_estimators=500,     # enough trees with low LR
    learning_rate=0.01,   # low learning rate for stability
    max_depth=4,          # shallow trees to avoid overfitting
    subsample=0.8,        # row subsampling
    colsample_bytree=0.8, # feature subsampling
    scale_pos_weight=scale_pos,  # fix extreme imbalance
    n_jobs=-1,
    random_state=42
)

# Fit boosted trees on MPNet embedding features
xgb_mpnet.fit(X_train_mpnet, y_train)

# Predict probability bill will pass
p_val_xgb_mpnet  = xgb_mpnet.predict_proba(X_val_mpnet)[:, 1]
p_test_xgb_mpnet = xgb_mpnet.predict_proba(X_test_mpnet)[:, 1]

# Pick best threshold on VALIDATION set using F1
best_thresh_xgb_mpnet = pick_best_threshold_f1(y_val, p_val_xgb_mpnet)

# Hard predictions using tuned threshold
y_val_pred_xgb_mpnet  = (p_val_xgb_mpnet  >= best_thresh_xgb_mpnet).astype(int)
y_test_pred_xgb_mpnet = (p_test_xgb_mpnet >= best_thresh_xgb_mpnet).astype(int)

# METRICS

# Thresholded point metrics on TEST
print("MPNet+XGB TEST Precision:",
      precision_score(y_test, y_test_pred_xgb_mpnet, zero_division=0))
print("MPNet+XGB TEST Recall:",
      recall_score(y_test,  y_test_pred_xgb_mpnet, zero_division=0))
print("MPNet+XGB TEST F1:",
      f1_score(y_test,     y_test_pred_xgb_mpnet, zero_division=0))

# Threshold-free ranking metrics
print("MPNet+XGB VAL ROC-AUC:",
      roc_auc_score(y_val, p_val_xgb_mpnet))
print("MPNet+XGB TEST ROC-AUC:",
      roc_auc_score(y_test, p_test_xgb_mpnet))
print("MPNet+XGB TEST PR-AUC:",
      average_precision_score(y_test, p_test_xgb_mpnet))

Best threshold (by F1): 0.457284
VAL precision at best thresh: 0.8333333333333334
VAL recall at best thresh:    0.8333333333333334
VAL F1 at best thresh:        0.8333333283333335
MPNet+XGB TEST Precision: 0.8333333333333334
MPNet+XGB TEST Recall: 0.7142857142857143
MPNet+XGB TEST F1: 0.7692307692307693
MPNet+XGB VAL ROC-AUC: 0.9904675898053381
MPNet+XGB TEST ROC-AUC: 0.9986278396096966
MPNet+XGB TEST PR-AUC: 0.7358843537414966


# 10. Results

**DistilBERT + Logistic Regression:**

TEST PR-AUC: 0.95

Precision: 1.00

Recall: 0.57

F1: 0.73

**MPNet + XGBoost:**

TEST PR-AUC: 0.74

Precision: 0.83

Recall: 0.71

F1: 0.77

**Interpretation:**
Transformers perform extremely well but less perfectly than TF-IDF+XGB, which supports the leakage hypothesis.

Transformers embed broader semantic content which dilutes the exact leakage tokens.

TF-IDF + trees is acting like a keyword detector over “Public Law”, “Approved”, etc.

Overall, what each metric tells us

* ROC-AUC:	Almost meaningless once it's >0.95 with heavy imbalance.
* PR-AUC:	The only trustworthy ranking metric.
* Precision:	How many flags you can trust. Currently bad except for leakage models.
* Recall:	Ability to find most winners. Linear methods and MPNet are decent here.
* F1:	Overall classification usefulness. Transformer > linear methods.

Best performing models (Ordered by true predictive quality)

MPNet + XGBoost:
Best balanced precision/recall without blatant leakage.

DistilBERT + LogReg:
Higher precision, lower recall.

Linear SVM / LogReg on TF-IDF:
Acceptable baseline; too many false positives.

XGB with raw TF-IDF:
Invalid due to leakage.