# Part 0: Data Preparation

This notebook shows how the midterm dataset was created. **You do not need to run this notebook yourself** — the data files in `data/` are already provided. This notebook is here so you can see exactly where the data came from and what it contains before you start working with it.

## Source

The data comes from [`pem207/maine-bills`](https://huggingface.co/datasets/pem207/maine-bills) on Hugging Face — a dataset of bills introduced in the Maine Legislature. We use the **132nd session**, which contains bills from 2025–2026.

Each bill has:
- A short **title** (e.g., *An Act to Reduce Cost and Increase Access to Dental Care*)
- The full **text** of the bill
- A **committee** assignment (the committee the bill was referred to)

In the Maine Legislature, each proposed bill is "referred" to a committee that specializes in the bill's topic. For example, a bill about housing policy would likely be referred to the "Housing and Economic Development" committee. **In this assignment, we first predict whether a bill was referred to the "Housing and Economic Development" committee from its title and text (binary classification), then predict the specific committee assignment as a multiclass classification problem.**

(After referral, the committee votes on a recommendation for the bill, and the full Legislature then votes on the bill itself. This assignment does not cover those later stages; it focuses only on the first step.)

## What We Do

1. Load the raw dataset from Hugging Face
2. Filter to bills that have a known committee assignment
3. Compute **sentence embeddings** for the title and full text using [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
4. Save four files to `data/`:
   - `X.json` — features (LD number + embeddings)
   - `y.json` — binary label (was the bill sent to Housing and Economic Development?)
   - `y_multi.json` — multiclass label (which committee?)
   - `raw.json` — human-readable version (title, text, committee name)

In [1]:
import os

from datasets import load_dataset
from sentence_transformers import SentenceTransformer



## Step 1: Load the Raw Dataset

We load the 132nd session from Hugging Face and convert it to a pandas DataFrame.

In [2]:
ds = load_dataset("pem207/maine-bills", "132", split="train")
df = ds.to_pandas()
df[["session", "ld_number", "document_type", "amendment_code", "committee", "title", "text"]].head()


Generating train split:   0%|          | 0/3510 [00:00<?, ? examples/s]

|   | session | ld_number | document_type | amendment_code | committee | title | text |
|---|---------|-----------|---------------|----------------|-----------|-------|------|
| 0 | 132 | 1 | bill | CA_A_S9 |  | An Act to Increase Storm | COMMITTEE AMENDMENT" A "to S.P. 29, L.D. 1, uA... |
| 1 | 132 | 3 | bill | CA_A_S180 |  | An Act to Adopt Eastern | COMMITTEE AMENDMENT "A H to S.P. 12, L.D. 3, "... |
| 2 | 132 | 1 | bill |  | Appropriations and Financial Affairs | An Act to Increase Storm Preparedness for Main... | Legislative Document\nNo. 1\nS.P. 29\nIn Senat... |
| 3 | 132 | 2 | bill |  |  | An Act to Allow Military Vehicles Purchased fo... | Legislative Document\nNo. 2\nS.P. 11\nIn Senat... |
| 4 | 132 | 3 | bill |  |  | An Act to Adopt Eastern Daylight Time Year-rou... | Legislative Document\nNo. 3\nS.P. 12\nIn Senat... |


## Step 2: Filter to Bills with Known Committee Assignments

Not every row is useful for our task:
- Rows with `amendment_code` set are **committee amendments**, not original bills — we drop them
- Rows where `committee` is `NaN` have no label — we drop them too

After filtering we have **1,307 bills**, each with a committee assignment.

In [3]:
df = df[df["committee"].notna()]
df = df[df["amendment_code"].isna()]
df.shape

(1307, 15)
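
To make the two filters concrete, here is a minimal sketch on a toy frame. The rows are made up (they only mimic the `amendment_code` and `committee` columns of the real dataset), and the combined boolean mask is just one way to express the same two conditions.

```python
import pandas as pd

# Toy rows (hypothetical values) mimicking the two columns used for filtering.
toy = pd.DataFrame(
    {
        "ld_number": [1, 1, 2],
        "amendment_code": ["CA_A_S9", None, None],
        "committee": [None, "Appropriations and Financial Affairs", None],
    }
)

# Keep rows that have a committee label AND are not committee amendments.
keep = toy["committee"].notna() & toy["amendment_code"].isna()
filtered = toy[keep]

print(filtered["ld_number"].tolist())
```

Only the middle row survives: the first is an amendment and the third has no committee label.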

## Step 3: Explore the Committee Distribution

There are 20 distinct committee names among the filtered bills. Bills are unevenly distributed: Judiciary receives the most (164 bills), while several names have fewer than 20.

In [4]:
df.groupby("committee")["ld_number"].count().sort_values(ascending=False)

committee
Judiciary                                            164
Health and Human Services                            149
Taxation                                             105
Health Coverage, Insurance and Financial Services    103
Housing and Economic Development                     101
Education and Cultural Affairs                        95
Criminal Justice and Public Safety                    92
State and Local Government                            79
Transportation                                        72
Veterans and Legal Affairs                            67
Energy, Utilities and Technology                      62
Labor                                                 54
Environment and Natural Resources                     51
Agriculture, Conservation and Forestry                43
Inland Fisheries and Wildlife                         33
Marine Resources                                      18
Appropriations and Financial Affairs                  16
Criminal Justice                                       1
Health                                                 1
Re                                                     1
Name: ld_number, dtype: int64

Three committee names (`Criminal Justice`, `Health`, `Re`) each appear only once. They are likely OCR artifacts or edge cases. We drop any committee with only one sample, because a single bill cannot appear in both the train and test sets, which would cause problems for stratified splitting and evaluation.
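
The following sketch illustrates why a one-sample class is a problem for stratified splitting. `stratified_split` is a hypothetical helper written for this illustration, not part of the notebook or of any library; scikit-learn's stratified splitters reject such classes in a similar way.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Hypothetical helper: split row indices so every class is in both train and test."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, test = [], []
    for label, indices in by_class.items():
        if len(indices) < 2:
            # A single sample cannot be placed in both train and test.
            raise ValueError(f"class {label!r} has only {len(indices)} sample(s)")
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        n_test = min(n_test, len(indices) - 1)  # leave at least one for training
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Made-up labels: two healthy classes plus one singleton.
labels = ["Judiciary"] * 5 + ["Taxation"] * 5 + ["Re"]
try:
    stratified_split(labels)
except ValueError as err:
    print(f"Cannot stratify: {err}")

# Without the singleton class the split succeeds.
train_idx, test_idx = stratified_split(labels[:-1])
print(len(train_idx), len(test_idx))
```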

## Step 4: Drop Singleton Classes

In [5]:
committee_counts = df.groupby("committee")["ld_number"].count()
singleton_committees = committee_counts[committee_counts <= 1].index.tolist()
print(f"Dropping {len(singleton_committees)} singleton committees: {singleton_committees}")
df = df[~df["committee"].isin(singleton_committees)]
print(f"Remaining bills: {df.shape[0]}")
df.groupby("committee")["ld_number"].count().sort_values(ascending=False)

Dropping 3 singleton committees: ['Criminal Justice', 'Health', 'Re']
Remaining bills: 1304


committee
Judiciary                                            164
Health and Human Services                            149
Taxation                                             105
Health Coverage, Insurance and Financial Services    103
Housing and Economic Development                     101
Education and Cultural Affairs                        95
Criminal Justice and Public Safety                    92
State and Local Government                            79
Transportation                                        72
Veterans and Legal Affairs                            67
Energy, Utilities and Technology                      62
Labor                                                 54
Environment and Natural Resources                     51
Agriculture, Conservation and Forestry                43
Inland Fisheries and Wildlife                         33
Marine Resources                                      18
Appropriations and Financial Affairs                  16
Name: ld_number, dtype: int64

Next, we compute sentence embeddings for every title and full text with `all-MiniLM-L6-v2`. Note that this model truncates long inputs (by default to 256 word pieces, according to its model card), so the text embedding mostly reflects the beginning of each bill.

In [6]:
model = SentenceTransformer("all-MiniLM-L6-v2")

df["title_embedding"] = model.encode(df["title"].tolist(), show_progress_bar=True).tolist()
df["text_embedding"] = model.encode(df["text"].tolist(), show_progress_bar=True).tolist()

import numpy as np
sample = np.array(df["title_embedding"].iloc[0])
print(f"Embedding dimension: {len(sample)}")
print(f"First title embedding (first 8 values): {sample[:8].round(5)}")


Embedding dimension: 384
First title embedding (first 8 values): [ 0.00848  0.02715  0.10227  0.07883 -0.03171 -0.01738 -0.02548 -0.09729]
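
Embeddings like the one printed above are usually compared by cosine similarity rather than read value by value. The sketch below uses made-up 4-dimensional vectors as stand-ins for the real 384-dimensional ones; the bill names and numbers are purely illustrative.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up 4-dimensional stand-ins for real 384-dimensional embeddings.
housing_bill = [0.9, 0.1, 0.0, 0.2]
zoning_bill = [0.8, 0.2, 0.1, 0.1]
fishing_bill = [0.0, 0.1, 0.9, 0.3]

sim_related = cosine_similarity(housing_bill, zoning_bill)
sim_unrelated = cosine_similarity(housing_bill, fishing_bill)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

Vectors pointing in similar directions score close to 1, which is the sense in which texts on related topics end up "near" each other.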


## Step 5: Build Labels

We construct two label columns:

| Column | Type | Description |
|--------|------|-------------|
| `committee_bool` | `int` (0 or 1) | Was this bill sent to **Housing and Economic Development**? |
| `committee_id` | `int` (0–16) | Which committee? (integer-encoded) |

In [7]:
df["committee_id"] = df["committee"].astype("category").cat.codes
df["committee_bool"] = df["committee"].str.contains("Housing and Economic Development").astype(int)

df[["ld_number", "committee", "committee_id", "committee_bool"]].head()

|     | ld_number | committee | committee_id | committee_bool |
|-----|-----------|-----------|--------------|----------------|
| 2   | 1 | Appropriations and Financial Affairs | 1 | 0 |
| 162 | 91 | Health Coverage, Insurance and Financial Services | 6 | 0 |
| 164 | 92 | Environment and Natural Resources | 5 | 0 |
| 167 | 93 | Health and Human Services | 7 | 0 |
| 169 | 94 | Health and Human Services | 7 | 0 |
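
The integer codes come from pandas' categorical encoding, which numbers the distinct string values in sorted order. A small sketch on made-up values shows the encoding and one way to recover an id-to-name lookup (the `id_to_name` dictionary is an illustrative name, not something the notebook saves).

```python
import pandas as pd

committees = pd.Series(["Taxation", "Judiciary", "Labor", "Judiciary"])
as_category = committees.astype("category")

# Codes follow the sorted order of the category names.
codes = as_category.cat.codes.tolist()
print(codes)  # [2, 0, 1, 0]

# Lookup from integer id back to the committee name.
id_to_name = dict(enumerate(as_category.cat.categories))
print(id_to_name)  # {0: 'Judiciary', 1: 'Labor', 2: 'Taxation'}
```

This is why `Appropriations and Financial Affairs` has id 1 in the real data: `Agriculture, Conservation and Forestry` sorts first and takes id 0.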


## Step 6: Save the Data Files

We split the data into four files:

| File | Contents |
|------|----------|
| `data/X.json` | `ld_number`, `title_embedding`, `text_embedding` |
| `data/y.json` | `committee_bool` (binary label) |
| `data/y_multi.json` | `committee_id` (multiclass label, 0–16) |
| `data/raw.json` | `ld_number`, `title`, `text`, `committee` (human-readable) |

In [8]:
X = df[["ld_number", "title_embedding", "text_embedding"]]
y = df[["committee_bool"]]
y_multi = df[["committee_id"]]
raw = df[["ld_number", "title", "text", "committee"]]

print(f"X: {X.shape}")
print(f"y: {y.shape}")
print(f"y_multi: {y_multi.shape}")
print(f"raw: {raw.shape}")

# os.makedirs("data", exist_ok=True)
# X.to_json("data/X.json", orient="records")
# y.to_json("data/y.json", orient="records")
# y_multi.to_json("data/y_multi.json", orient="records")
# raw.to_json("data/raw.json", orient="records")
# print("Saved to data/")

X: (1304, 3)
y: (1304, 1)
y_multi: (1304, 1)
raw: (1304, 4)


## What You'll Work With

In the following notebooks you will load `data/X.json` and the appropriate label file. Here's a quick look at what they contain.

**Features (`X.json`)**: Each row is one bill. The embeddings are 384-dimensional vectors — each value represents a learned semantic feature of the text. You don't need to interpret individual dimensions; the geometry of the embedding space is what matters.

In [9]:
import pandas as pd

X_loaded = pd.read_json("data/X.json")
summary = X_loaded.head(3)[["ld_number"]].copy()
summary["title_embedding (dim)"] = X_loaded["title_embedding"].head(3).apply(len)
summary["text_embedding (dim)"] = X_loaded["text_embedding"].head(3).apply(len)
summary

|   | ld_number | title_embedding (dim) | text_embedding (dim) |
|---|-----------|-----------------------|----------------------|
| 0 | 1 | 384 | 384 |
| 1 | 91 | 384 | 384 |
| 2 | 92 | 384 | 384 |
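
Since each embedding cell holds a list, most models will need the columns stacked into a 2-D array first. The sketch below does this on a toy stand-in for `X.json` with 3-dimensional vectors; concatenating the title and text embeddings is only one possible choice, shown here as an assumption rather than a requirement of the assignment.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X.json with 3-dimensional vectors instead of 384.
toy_X = pd.DataFrame(
    {
        "ld_number": [1, 91, 92],
        "title_embedding": [[0.1, 0.2, 0.3], [0.0, 0.5, 0.1], [0.4, 0.4, 0.2]],
        "text_embedding": [[0.3, 0.1, 0.0], [0.2, 0.2, 0.6], [0.1, 0.0, 0.9]],
    }
)

# Stack each list-valued column into a 2-D array: one row per bill.
title_matrix = np.array(toy_X["title_embedding"].tolist())
text_matrix = np.array(toy_X["text_embedding"].tolist())

# One option (an assumption): use both embeddings side by side.
features = np.hstack([title_matrix, text_matrix])
print(title_matrix.shape, features.shape)
```

With the real data the same steps would give arrays with 384 columns per embedding, or 768 columns after concatenation.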


**Labels**: Use `raw.json` alongside your label file to see what the model is actually predicting.

In [10]:
raw_loaded = pd.read_json("data/raw.json")
y_loaded = pd.read_json("data/y.json")
y_multi_loaded = pd.read_json("data/y_multi.json")

preview = raw_loaded[["ld_number", "title", "committee"]].head()
preview["committee_bool"] = y_loaded["committee_bool"].head().values
preview["committee_id"] = y_multi_loaded["committee_id"].head().values
preview

|   | ld_number | title | committee | committee_bool | committee_id |
|---|-----------|-------|-----------|----------------|--------------|
| 0 | 1 | An Act to Increase Storm Preparedness for Main... | Appropriations and Financial Affairs | 0 | 1 |
| 1 | 91 | An Act to Authorize Employees of the Maine Ass... | Health Coverage, Insurance and Financial Services | 0 | 6 |
| 2 | 92 | An Act Regarding the Management of the Waste C... | Environment and Natural Resources | 0 | 5 |
| 3 | 93 | An Act to Reduce Cost and Increase Access to D... | Health and Human Services | 0 | 7 |
| 4 | 94 | An Act to Eliminate Miscarriage Reporting Requ... | Health and Human Services | 0 | 7 |
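
The binary label is heavily imbalanced, which is worth keeping in mind when choosing a metric. The arithmetic below uses the counts reported in the committee distribution earlier in this notebook (101 Housing and Economic Development bills out of 1,304); the variable names are illustrative.

```python
# Counts taken from the committee distribution shown earlier.
n_bills = 1304
n_housing = 101

positive_rate = n_housing / n_bills
# Accuracy of a trivial model that always predicts 0 (not Housing).
majority_baseline_accuracy = (n_bills - n_housing) / n_bills

print(f"Positive rate: {positive_rate:.1%}")
print(f"Accuracy of always predicting 0: {majority_baseline_accuracy:.1%}")
```

A classifier that never predicts the positive class would already score about 92% accuracy, so accuracy on its own says little about how well Housing bills are identified.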
