# NOTEBOOK 02: Will the Bill Make It Through Capitol Hill?
Sections 5–9: Text-Only Modeling & Evaluation

## 1. Imports

In [None]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

from sklearn.metrics import (
    roc_auc_score,
    average_precision_score
)

from sentence_transformers import SentenceTransformer

from xgboost import XGBClassifier

## 2. Load Clean Dataset

In [8]:
PATH = r"C:/Users/saram/Desktop/Erdos_Institute/project-2025/"
DATA = "bills_clean_phase1.csv"

bills = pd.read_csv(PATH + DATA)
print("Total rows:", bills.shape)

Total rows: (13812, 10)


## 3. Train / Validation / Test Split

In [None]:
train_mask = bills["congress"].isin([113,114])
val_mask   = bills["congress"].isin([115,116])
test_mask  = bills["congress"] >= 117

train = bills[train_mask]
val   = bills[val_mask]
test  = bills[test_mask]

In [12]:
# Check the distribution
def summarize(mask, name):
    s = bills.loc[mask, "label"]
    print(name)
    print("Rows:", len(s))
    print("Positives:", s.sum())

summarize(train_mask, "TRAIN")
summarize(val_mask, "VAL")
summarize(test_mask, "TEST")

TRAIN
Rows: 1083
Positives: 19
VAL
Rows: 5433
Positives: 13
TEST
Rows: 7296
Positives: 13
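
The counts above imply very different positive rates per split, which matters later when predicted probabilities are compared with the test base rate. A minimal sketch of that arithmetic, using only the numbers printed above (the variable names are illustrative, not part of the notebook):

```python
# Counts copied from the summarize() output above: (rows, positives).
splits = {
    "TRAIN": (1083, 19),
    "VAL": (5433, 13),
    "TEST": (7296, 13),
}

rates = {}
for name, (rows, positives) in splits.items():
    rates[name] = positives / rows
    print(f"{name}: {positives}/{rows} = {rates[name]:.4%}")

# How much richer in positives is TRAIN than TEST?
train_vs_test = rates["TRAIN"] / rates["TEST"]
print(f"TRAIN rate / TEST rate = {train_vs_test:.1f}x")
```

The training split is roughly ten times richer in positives than the test split, so a model fit on TRAIN can be expected to predict somewhat high on TEST even before any class weighting.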


## 4. Targets

In [None]:
y_train = train["label"].values
y_val   = val["label"].values
y_test  = test["label"].values

# SECTION 5: TF-IDF BASELINES

## 5. TF-IDF Vectorization

In [14]:
tfidf = TfidfVectorizer(
    ngram_range=(1,2),
    min_df=5,
    max_df=0.9,
    max_features=150_000,
    token_pattern=r"\b[a-zA-Z]{3,}\b"
)

X_train = tfidf.fit_transform(train["clean_text"])
X_val   = tfidf.transform(val["clean_text"])
X_test  = tfidf.transform(test["clean_text"])
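
To see what the `token_pattern` above keeps and drops, the same regular expression can be applied directly with `re`. This is only an illustration: the sample sentence is made up, and the `.lower()` call mimics `TfidfVectorizer`'s default lowercasing.

```python
import re

# Same pattern as the token_pattern argument above.
TOKEN_PATTERN = re.compile(r"\b[a-zA-Z]{3,}\b")

# Hypothetical bill text, for illustration only.
sample = "H.R. 1234: To amend the Act of 1996, sec. 2(a), and for other purposes."

# TfidfVectorizer lowercases by default before tokenizing, so mimic that here.
tokens = TOKEN_PATTERN.findall(sample.lower())
print(tokens)
# ['amend', 'the', 'act', 'sec', 'and', 'for', 'other', 'purposes']
```

Numbers, bill identifiers, and words shorter than three letters are all discarded, so any signal carried by section numbers or years does not reach the TF-IDF models.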

# SECTION 6: LINEAR MODELS

## 6. Logistic Regression


In [15]:
lr = LogisticRegression(
    max_iter=2500,
    n_jobs=-1,
    class_weight="balanced"
)

lr.fit(X_train, y_train)

p_test_lr = lr.predict_proba(X_test)[:,1]

In [16]:
# Evaluate (ranking metrics only)
print("TFIDF + Logistic")
print("ROC-AUC:", roc_auc_score(y_test, p_test_lr))
print("PR-AUC: ", average_precision_score(y_test, p_test_lr))


TFIDF + Logistic
ROC-AUC: 0.9707643722472776
PR-AUC:  0.5279654808619086


## 7. SVM (Calibrated)

In [None]:
svm = LinearSVC(class_weight="balanced")

svm_cal = CalibratedClassifierCV(svm, method="sigmoid", cv=5)

svm_cal.fit(X_train, y_train)

p_test_svm = svm_cal.predict_proba(X_test)[:,1]

In [None]:
# Evaluate
print("TFIDF + SVM")
print("ROC-AUC:", roc_auc_score(y_test, p_test_svm))
print("PR-AUC: ", average_precision_score(y_test, p_test_svm))

TFIDF + SVM
ROC-AUC: 0.9744082637121221
PR-AUC:  0.6330011259115903


# SECTION 7: TRANSFORMER EMBEDDINGS

## 8. Load MPNet & Encode All Bills

In [20]:
model = SentenceTransformer("all-mpnet-base-v2")

embeddings = model.encode(
    bills["clean_text"].astype(str).tolist(),
    batch_size=16,
    show_progress_bar=True
)

E = np.array(embeddings)

X_train_e = E[train_mask.values]
X_val_e   = E[val_mask.values]
X_test_e  = E[test_mask.values]



## 9. XGBoost on MPNet embeddings

In [21]:
scale_pos = np.sum(y_train==0) / np.sum(y_train==1)

xgb = XGBClassifier(
    objective="binary:logistic",
    n_estimators=400,
    learning_rate=0.02,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=scale_pos,
    eval_metric="logloss",
    n_jobs=-1,
    random_state=42
)

xgb.fit(X_train_e, y_train)

p_test_xgb = xgb.predict_proba(X_test_e)[:,1]


In [22]:
# Evaluate
print("MPNet + XGB")
print("ROC-AUC:", roc_auc_score(y_test, p_test_xgb))
print("PR-AUC: ", average_precision_score(y_test, p_test_xgb))


MPNet + XGB
ROC-AUC: 0.9890049535799913
PR-AUC:  0.6570914589302368


# SECTION 8: MODEL LEADERBOARD

In [24]:
results = []

results.append({
    "MODEL":"TFIDF_LogReg",
    "ROC_AUC": roc_auc_score(y_test, p_test_lr),
    "PR_AUC":  average_precision_score(y_test, p_test_lr)
})

results.append({
    "MODEL":"TFIDF_SVM",
    "ROC_AUC": roc_auc_score(y_test, p_test_svm),
    "PR_AUC":  average_precision_score(y_test, p_test_svm)
})

results.append({
    "MODEL":"MPNet_XGB",
    "ROC_AUC": roc_auc_score(y_test, p_test_xgb),
    "PR_AUC":  average_precision_score(y_test, p_test_xgb)
})

leaderboard = pd.DataFrame(results)
leaderboard.sort_values("PR_AUC", ascending=False)


| | MODEL | ROC_AUC | PR_AUC |
|---|---|---|---|
| 2 | MPNet_XGB | 0.989005 | 0.657091 |
| 1 | TFIDF_SVM | 0.974408 | 0.633001 |
| 0 | TFIDF_LogReg | 0.970764 | 0.527965 |


In [26]:
results

[{'MODEL': 'TFIDF_LogReg',
  'ROC_AUC': np.float64(0.9707643722472776),
  'PR_AUC': np.float64(0.5279654808619086)},
 {'MODEL': 'TFIDF_SVM',
  'ROC_AUC': np.float64(0.9744082637121221),
  'PR_AUC': np.float64(0.6330011259115903)},
 {'MODEL': 'MPNet_XGB',
  'ROC_AUC': np.float64(0.9890049535799913),
  'PR_AUC': np.float64(0.6570914589302368)}]

# SECTION 9: PROBABILITY SANITY CHECK

In [25]:
# Coarse calibration check: compare each model's mean predicted probability with the true pass rate
base_rate = y_test.mean()
print("True pass rate:", base_rate)

for name, scores in [
    ("TFIDF_LR",p_test_lr),
    ("TFIDF_SVM",p_test_svm),
    ("MPNet_XGB",p_test_xgb)
]:
    print(name, "mean predicted prob:", scores.mean())


True pass rate: 0.001781798245614035
TFIDF_LR mean predicted prob: 0.12028842375212986
TFIDF_SVM mean predicted prob: 0.013985945471744575
MPNet_XGB mean predicted prob: 0.005888514


**Interpretation**

PR-AUC:

* Base rate = 0.0018 (~0.18%): `base_rate = y_test.mean()`, i.e. the number of passed bills divided by the number of bills in the test set (13 / 7296).

* A random ranking has an expected PR-AUC ≈ base rate = 0.0018

This means:

* LogReg: 0.53 → ~300× better than random

* SVM: 0.63 → ~350× better than random

* MPNet+XGB: 0.66 → ~370× better than random
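
The multipliers above are simply PR-AUC divided by the base rate. A short sketch of that arithmetic, with the scores copied from the leaderboard (the dictionary names are illustrative):

```python
base_rate = 13 / 7296  # positives / rows in the test split

# PR-AUC values copied from the leaderboard above.
pr_auc = {
    "TFIDF_LogReg": 0.527965,
    "TFIDF_SVM": 0.633001,
    "MPNet_XGB": 0.657091,
}

# Lift = how many times better than the expected PR-AUC of a random ranking.
lift = {name: score / base_rate for name, score in pr_auc.items()}
for name, value in lift.items():
    print(f"{name}: {value:.0f}x the random baseline")
```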

At a base rate this low, a PR-AUC above 0.50 is a strong result, though with only 13 positives in the test set the estimate is noisy.

Thus:

* Our models consistently rank passing bills near the top of the list.

* However, they cannot yet make reliable yes/no calls.

**Why ROC-AUC is high**

ROC-AUC is the probability that the model scores a randomly chosen positive higher than a randomly chosen negative.

ROC-AUC ≈ 0.99 means that in ~99% of positive–negative bill pairs, the positive bill has the higher score. That suggests strong ranking ability.

However, ROC-AUC can look optimistic under heavy imbalance, because the false positive rate is measured against a very large pool of negatives. For example, a 1% false positive rate on the 7283 test negatives is about 73 false alarms, against only 13 true passes. ROC-AUC alone does not show that cost.

ROC-AUC is a useful sanity check; PR-AUC better reflects practical usefulness here.
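
The pairwise reading of ROC-AUC can be checked on a tiny example. The function below is a brute-force sketch for illustration (it is not how `roc_auc_score` is implemented), and the toy labels and scores are made up:

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, the usual ROC-AUC convention.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration: 2 positives, 4 negatives.
toy_labels = [1, 0, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]

auc = pairwise_auc(toy_labels, toy_scores)
print(auc)  # 0.875 (7 of the 8 positive-negative pairs are ordered correctly)
```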

**PR-AUC: Why 0.657 is huge here**

Our dataset:

* Positives: 13
* Negatives: 7283

Random guessing would give: PR-AUC ≈ 13 / 7296 ≈ 0.0018

MPNet + XGB: PR-AUC ≈ 0.657

This is an enormous lift, roughly 370× over random.
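
To make the PR-AUC number more concrete: average precision is the mean of the precision values measured at the rank of each true positive. The sketch below computes it by hand on made-up data. It assumes there are no tied scores, and it is an illustration rather than the `average_precision_score` implementation.

```python
def average_precision(labels, scores):
    """Mean of precision@k, taken at the rank k of each positive (no tied scores assumed)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions)


# Toy data, made up for illustration: positives land at ranks 1 and 3.
toy_labels = [1, 0, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]

ap = average_precision(toy_labels, toy_scores)
print(round(ap, 4))  # 0.8333, i.e. (1/1 + 2/3) / 2
```

Each positive that sits below a negative pulls the score down, which is why a value of 0.657 with only 13 positives among 7296 bills requires most of them to rank very high.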

What this means: The model puts most real successes near the top of its ranking, so checking the highest-scored bills lets you find winners quickly instead of hunting randomly.

The model does not “predict yes or no.” It ranks bills from “most likely to pass” to “least likely to pass.”

Because our PR-AUC is high (≈0.66) in a dataset where passes are extremely rare, two things follow:

1. Most true passing bills are near the top of the ranking.

   When bills are sorted by model score, the real winners tend to appear near the top rather than scattered randomly through the list.

2. Screening the top-N works.

   Taking only the top of the ranking, for example the top 50 or top 100 bills, should capture many of the actual passing bills without sifting through thousands of failures.
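
Top-N screening can be measured directly as precision and recall within the N highest-scored bills. The helper below is a hypothetical sketch (`top_n_screen` is not defined elsewhere in this notebook), shown on made-up data:

```python
def top_n_screen(labels, scores, n):
    """Return (precision, recall) when only the n highest-scored items are reviewed."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:n]
    found = sum(labels[i] for i in top)   # true passes inside the top n
    total_pos = sum(labels)               # true passes overall
    return found / n, found / total_pos


# Toy ranking, made up for illustration: 3 true passes among 10 bills.
toy_labels = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
toy_scores = [0.10, 0.95, 0.40, 0.05, 0.80, 0.30, 0.20, 0.15, 0.35, 0.60]

precision, recall = top_n_screen(toy_labels, toy_scores, n=3)
print(precision, recall)  # both 2/3: two of the three passes are in the top 3
```

On the real data, the same function could be called with `y_test.tolist()`, `p_test_xgb.tolist()`, and `n=50` or `n=100` to check how many of the 13 passing bills the top of the ranking actually captures.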

The model is good at triage:

* It can point you to the few bills worth paying attention to.

* It cannot confidently say “this bill will pass.”

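**Interpretation of “mean predicted probability”**

* True pass rate: 0.00178
* TFIDF_LR mean predicted prob: 0.1203
* TFIDF_SVM mean predicted prob: 0.0140
* MPNet_XGB mean predicted prob: 0.0059

What should it be?

For a well-calibrated model, the mean predicted probability should be close to the true rate. This is a necessary condition, not a full calibration test.

All three models' mean predicted probabilities are inflated relative to the true rate.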
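
The size of the inflation is the ratio of mean predicted probability to the true pass rate. A minimal sketch using the values printed by the sanity check (the dictionary names are illustrative):

```python
# Values copied from the sanity-check output above.
base_rate = 0.001781798245614035
mean_pred = {
    "TFIDF_LR": 0.12028842375212986,
    "TFIDF_SVM": 0.013985945471744575,
    "MPNet_XGB": 0.005888514,
}

# Inflation factor: 1.0 would mean the average prediction matches the base rate.
inflation = {name: p / base_rate for name, p in mean_pred.items()}
for name, factor in inflation.items():
    print(f"{name}: {factor:.1f}x the true pass rate")
```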

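**TF-IDF + LogReg**

* Bad probability calibration

* The model is overconfident: its mean predicted probability is about 68× the true pass rate.

* It assigns probabilities far too high for a 0.18% event. Some inflation is expected, since `class_weight="balanced"` upweights the rare class and the training split has a higher pass rate (19 / 1083 ≈ 1.75%) than the test split.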
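
One likely contributor to this inflation is the class weighting itself. Weighting positives by a factor `w` multiplies the fitted odds by roughly `w`, so a common correction divides the odds by `w` again. The sketch below illustrates that idea only: it is not applied anywhere in this notebook, `undo_class_weight` is a hypothetical helper, and it ignores regularization and the difference between train and test base rates.

```python
def undo_class_weight(p_weighted, w):
    """Map a probability from a positively-reweighted model back to the unweighted scale.

    Dividing the odds by w gives p = p_w / (p_w + w * (1 - p_w)).
    """
    return p_weighted / (p_weighted + w * (1.0 - p_weighted))


# Weight ratio implied by the training counts: negatives / positives.
w = (1083 - 19) / 19  # 56.0

for p in (0.05, 0.50, 0.95):
    print(f"{p:.2f} -> {undo_class_weight(p, w):.4f}")
```

Under these assumptions a weighted score of 0.50 maps back to 1/57 ≈ 0.0175, which is exactly the training pass rate. That is one more reason a `p > 0.5` threshold on the raw scores is not meaningful here.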

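**SVM + calibration**

* Better, but still inflated (about 8× the true pass rate)

**MPNet + XGB**

* Best-calibrated of the three by this measure

* Its mean predicted probability is about 3.3× the true pass rate (0.0059 vs. 0.0018), the smallest inflation of the three and acceptable under our constraints.

This is a good sign: the same model has both the best ranking metrics and the least miscalibration.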

**What these models are good for**

* Rank bills by passage likelihood extremely well.

* Identify top candidates worth human focus.

* Serve as a decision-support filter.

**What they cannot do**

* Serve as a fully automated yes/no classifier.

* Provide literal probabilities like “this bill has a 60% chance.”

* Be used for naive threshold decisions like p > 0.5.


**Best Model:** MPNet + XGBoost

Reasons:

* Highest PR-AUC

* Highest ROC-AUC

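* Best probability calibration (smallest gap between mean predicted probability and true pass rate)

* Transformer embeddings beat bag-of-words features on this test set, although the classifiers on top also differ, so the comparison is not purely about features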