# Public Comment Period Segmentation via GPT-3.5-Turbo-Instruct

## High Level Overview

This notebook demonstrates how to use Large Language Models to segment out the "public comment" and "public hearing" portions of municipal meeting transcripts. There may be multiple of these periods within a single meeting.

We prompt an LLM to record the "first_sentence_text" and the "last_sentence_text" of each period. We also specify how the model should format its result (a structured JSON object). You can view the whole prompt in the [./prompts/v0-public-comment-period-seg.jinja](./prompts/v0-public-comment-period-seg.jinja) file.
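To make the expected output shape concrete, here is a minimal sketch of a response and how it could be read using only the standard library. The field names `periods`, `first_sentence_text`, and `last_sentence_text` come from the models defined later in this notebook. The example response text, the `Period` dataclass, and the `parse_periods` helper are hypothetical and only for illustration. The notebook itself parses with LangChain's `PydanticOutputParser`.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Period:
    # Either field may be null when no public comment period was found
    first_sentence_text: Optional[str]
    last_sentence_text: Optional[str]


# Hypothetical model response, invented for illustration only
EXAMPLE_RESPONSE = """
{
  "periods": [
    {
      "first_sentence_text": "Let's move on to public comment.",
      "last_sentence_text": "That concludes public comment."
    }
  ]
}
"""


def parse_periods(raw: str) -> List[Period]:
    """Read the JSON object and return one Period per entry in `periods`."""
    payload = json.loads(raw)
    return [
        Period(
            first_sentence_text=item.get("first_sentence_text"),
            last_sentence_text=item.get("last_sentence_text"),
        )
        for item in payload["periods"]
    ]


periods = parse_periods(EXAMPLE_RESPONSE)
print(periods[0].first_sentence_text)
```

A meeting with several comment periods or public hearings would simply have several entries in the `periods` list.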

## Dataset

I manually annotated 12 meeting transcripts from the Seattle CDP dataset, marking the first and last sentence of each public comment period. If I recall correctly, there were 3 meetings that had multiple public comment periods (or public hearings).

We convert the annotated dataset into a DataFrame with the full text of the transcript, the metadata of the event, and the "masses" of the segmentation. The masses tell the system how long each portion of the meeting is, in characters. For example, if the masses are `[100, 200, 100]`, the portions would be an unclassified string 100 characters long, then a string 200 characters long that we assume is the annotated public comment period, and finally another string 100 characters long that closes out the meeting.

In the case of multiple comment periods, the masses might look like: `[100, 200, 100, 200, 100]` (not-comment, comment, not-comment, comment, not-comment).
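As a rough illustration of how masses map back onto the text, the sketch below slices a toy string into labeled portions. The `split_by_masses` helper and the toy text are hypothetical. It assumes the alternating pattern described above, where every second portion (index 1, 3, ...) is a comment period.

```python
from itertools import accumulate
from typing import List, Tuple


def split_by_masses(text: str, masses: List[int]) -> List[Tuple[str, str]]:
    """Slice `text` into consecutive portions whose lengths are given by `masses`."""
    if sum(masses) != len(text):
        raise ValueError("masses must sum to the length of the text")

    # Cumulative sums give the end offset of each portion
    ends = list(accumulate(masses))
    starts = [0] + ends[:-1]

    segments = []
    for i, (start, end) in enumerate(zip(starts, ends)):
        # Assumption: portions alternate not-comment, comment, not-comment, ...
        label = "comment" if i % 2 == 1 else "not-comment"
        segments.append((label, text[start:end]))
    return segments


# Toy transcript: 100 + 200 + 100 characters
toy_text = "a" * 100 + "b" * 200 + "c" * 100
for label, segment in split_by_masses(toy_text, [100, 200, 100]):
    print(label, len(segment))
```

A transcript with no comment period has a single mass equal to the full text length, which this sketch labels as one "not-comment" portion.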

## LLMs

I have tested with a few LLMs: OpenAI's gpt-3.5-turbo, gpt-3.5-turbo-instruct-16k, gpt-3.5-turbo-16k, and gpt-4; Anthropic's claude-2-100k; and [an open-source fine-tuned version of Meta's llama-2](https://huggingface.co/Yukang/Llama-2-7b-longlora-100k-ft).

The latest commit of this notebook uses Anthropic's Claude 2 for demonstration because it is the simplest model to use. The others do not have context windows large enough to fit a whole transcript, so the transcript needs to be partitioned. There are methods for handling that which showed promising results, but for the sake of demonstration I will stick with Claude 2.
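This notebook does not show the partitioning methods that were tried. As one generic possibility, and not necessarily what was tested, the sketch below splits a list of sentences into overlapping windows so that each window can be sent to a smaller-context model separately. The `window_sentences` helper and its `window_size` and `overlap` parameters are hypothetical.

```python
from typing import List


def window_sentences(
    sentences: List[str], window_size: int, overlap: int
) -> List[List[str]]:
    """Split sentences into windows of `window_size`, sharing `overlap` sentences."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be in [0, window_size)")

    step = window_size - overlap
    windows = []
    for start in range(0, len(sentences), step):
        windows.append(sentences[start:start + window_size])
        # Stop once a window reaches the end of the transcript
        if start + window_size >= len(sentences):
            break
    return windows


toy_sentences = [f"Sentence {i}." for i in range(10)]
for window in window_sentences(toy_sentences, window_size=4, overlap=1):
    print(window[0], "...", window[-1])
```

The overlap is there so that a boundary sentence falling at the edge of one window still appears with some context in the next. Merging the per-window predictions back into one segmentation is a separate problem that this sketch does not address.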

## Evaluation

We use `segeval` to evaluate the performance of the system. In general, we want to compare the relative distances between the reference public comment segmentation boundaries and the predicted ones. The closer they are, the better. However, because of how much administrative "cruft" is in a transcript / meeting, there needs to be some flexibility: the clerk or the chair of the meeting may talk about opening the public comment period multiple times during the administrative portion of transitioning to public comment. For example, they might say "Let's move on to public comment" and then, after describing how public comment will work (time allotment, rules, etc.), finally say "We are now open to public comment, our first commenter is Eva." It isn't clear which one is "better" for segmentation and the system may choose either, so we have to allow for some tolerance. Otherwise, `segeval` works great.
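To show the idea of tolerance in isolation, here is a simplified sketch that converts masses to boundary offsets and checks whether each predicted boundary falls within a fixed distance of its reference boundary. This is not how `segeval` computes boundary similarity. The helpers `masses_to_boundaries` and `boundaries_within_tolerance` are hypothetical, and the toy masses are invented for illustration.

```python
from itertools import accumulate
from typing import List


def masses_to_boundaries(masses: List[int]) -> List[int]:
    """Return the character offsets where one portion ends and the next begins."""
    # The final cumulative sum is the end of the text, not an internal boundary
    return list(accumulate(masses))[:-1]


def boundaries_within_tolerance(
    true_masses: List[int], predicted_masses: List[int], tolerance: int
) -> bool:
    """Check that every predicted boundary is within `tolerance` characters."""
    true_bounds = masses_to_boundaries(true_masses)
    pred_bounds = masses_to_boundaries(predicted_masses)
    if len(true_bounds) != len(pred_bounds):
        return False
    return all(abs(t - p) <= tolerance for t, p in zip(true_bounds, pred_bounds))


true_masses = [100, 200, 100]
predicted_masses = [110, 180, 110]
text_length = sum(true_masses)

# Allow each boundary to be off by 5% of the text length (20 characters here)
print(boundaries_within_tolerance(true_masses, predicted_masses, int(text_length * 0.05)))
```

In this toy case both boundaries are off by 10 characters, so the prediction passes with a 20-character tolerance and would fail with a 5-character one.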

## Dataset Setup

In [1]:
import pandas as pd

# Load annotated dataset
df = pd.read_json("trial-datasets/seattle-public-comment-period-seg-v0.jsonl", lines=True)

# Prep dataset for eval
prepped_eval_rows = []
for _, row in df.iterrows():
    # Get text
    text = row["text"]
    
    # Get meta
    meta = row["meta"]

    # Construct masses
    masses = []
    prev_index = 0
    if isinstance(row["spans"], list):
        for span in row["spans"]:
            # Choose what index to get based off label
            if span["label"] == "FIRST-SENTENCE":
                # Get start index
                mass_calc_index = span["start"]
            else:
                # Get end index
                mass_calc_index = span["end"]

            # Add masses to list
            masses.append(mass_calc_index - prev_index)

            # Update prev index
            prev_index = mass_calc_index
        
        # Add final mass
        masses.append(len(text) - prev_index)
    else:
        # Add mass for full text
        masses.append(len(text))

    # Add to list
    prepped_eval_rows.append({
        "text": row["text"],
        "meta": row["meta"],
        "true_masses": masses,
    })

# Convert to dataframe
prepped_eval_df = pd.DataFrame(prepped_eval_rows)
prepped_eval_df = prepped_eval_df.sample(3)
prepped_eval_df

| | text | meta | true_masses |
|---|---|---|---|
| 4 | Thank you. Have a great day. Good morning, eve... | {'event_id': 'c511fea02999', 'session_id': '93... | [447, 3749, 72862] |
| 11 | The May 11th, 2022 meeting of the Seattle City... | {'event_id': 'b5e3673a68ff', 'session_id': 'c9... | [3133, 3043, 59419] |
| 10 | Good morning, everybody. The March 3, 2021 mee... | {'event_id': '749503d88894', 'session_id': 'f6... | [1671, 10938, 1491, 1725, 144754] |


## Prompt Setup

In [2]:
import json

import backoff
from dotenv import load_dotenv
from langchain.chat_models import ChatAnthropic, ChatOpenAI
from langchain.chat_models.base import BaseChatModel
from langchain.llms import HuggingFaceEndpoint
from langchain.output_parsers import PydanticOutputParser
from langchain import PromptTemplate
from langchain.schema import HumanMessage
from pydantic import BaseModel, Field
import spacy

###############################################################################

load_dotenv()
llm = ChatAnthropic(model="claude-2.0", temperature=0, max_tokens_to_sample=4096)
# llm = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0, max_tokens=4096)
# llm = HuggingFaceEndpoint(
#     endpoint_url="https://boxjj56zj0zbbjue.us-east-1.aws.endpoints.huggingface.cloud",
#     task="text2text-generation",
# )

nlp = spacy.load("en_core_web_trf")

###############################################################################

class PublicCommentPeriod(BaseModel):
    first_sentence_text: str | None = Field(
        description="the text of the sentence which introduces the public comment period, or null if no public comment period was found",
    )
    last_sentence_text: str | None = Field(
        description="the text of the sentence which concludes the public comment period, or null if no public comment period was found",
    )

class MultiPublicCommentPeriod(BaseModel):
    periods: list[PublicCommentPeriod] = Field(
        description="the list of public comment periods (sometimes also called public hearings)",
    )

PUBLIC_COMMENT_PERIOD_SEG_PARSER = PydanticOutputParser(pydantic_object=MultiPublicCommentPeriod)

###############################################################################

PUBLIC_COMMENT_PERIOD_SEG_PROMPT = PromptTemplate.from_file(
    "prompts/v0-public-comment-period-seg.jinja",
    input_variables=["transcript"],
    partial_variables={
        "format_instructions": PUBLIC_COMMENT_PERIOD_SEG_PARSER.get_format_instructions(),
    },
    template_format="jinja2",
)

@backoff.on_exception(backoff.expo, json.JSONDecodeError, max_tries=3)
def _process_transcript(text: str) -> list[int]:
    # Convert text to sentences
    sentences = list(nlp(text).sents)

    # Convert to prompt ready string
    transcript_str = "\n\n".join([sent.text for sent in sentences[:300]])

    # Fill the prompt
    input_ = PUBLIC_COMMENT_PERIOD_SEG_PROMPT.format_prompt(transcript=transcript_str)

    # Generate
    if isinstance(llm, BaseChatModel):
        # Generate
        output = llm([HumanMessage(content=input_.to_string())]).content
    else:
        # Generate
        output = llm(input_.to_string())

    # Parse output
    try:
        pc_periods = PUBLIC_COMMENT_PERIOD_SEG_PARSER.parse(output)

    except Exception as e:
        # Show the raw model output to help debug the parse failure
        print(output)
        raise Exception("Failed to parse output") from e

    # Process all periods found
    prev_index = 0
    predicted_masses = []
    for pc_period in pc_periods.periods:
        # Process masses
        if (
            pc_period.first_sentence_text is not None
            and pc_period.last_sentence_text is not None
        ):
            first_sentence_index = text.find(pc_period.first_sentence_text)
            predicted_masses.append(first_sentence_index - prev_index)
            prev_index = first_sentence_index

            last_sentence_index = text.find(pc_period.last_sentence_text)
            predicted_masses.append(last_sentence_index - prev_index)
            prev_index = last_sentence_index

    # Add final mass (or full text as mass)
    if len(predicted_masses) == 0:
        predicted_masses.append(len(text))
    else:
        predicted_masses.append(len(text) - prev_index)

    return predicted_masses
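One thing to note about the function above: `str.find` returns `-1` when the model's sentence is not found verbatim in the transcript, and the last sentence is located by where it starts, while the reference masses use the end of the annotated last sentence. The sketch below is one possible way to handle both cases. The `locate_period` helper is hypothetical and is not used in this notebook. Swapping it in would change the predicted masses, so the scores reported below would not be reproduced exactly.

```python
from typing import Optional, Tuple


def locate_period(
    text: str,
    first_sentence: str,
    last_sentence: str,
    search_from: int = 0,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of a period, or None if not found."""
    # Search forward from the previous boundary so repeated phrases
    # earlier in the transcript are not matched again
    start = text.find(first_sentence, search_from)
    if start == -1:
        return None

    # The last sentence must come at or after the first sentence
    last_start = text.find(last_sentence, start)
    if last_start == -1:
        return None

    # End the period after the last sentence, not at its first character
    return start, last_start + len(last_sentence)


toy_text = "Intro. Open comment. Hello. Close comment. Bye."
span = locate_period(toy_text, "Open comment.", "Close comment.")
if span is not None:
    print(toy_text[span[0]:span[1]])
```

Returning `None` lets the caller decide whether to skip the period or retry the prompt, instead of silently producing a negative mass.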



## Outputs

In [3]:
import segeval
from tqdm import tqdm

results = []
for _, row in tqdm(prepped_eval_df.iterrows(), total=len(prepped_eval_df)):
    # Get masses
    predicted_masses = _process_transcript(row["text"])

    # Get similarity
    #
    # The parameter `n_t` allows for greater "translational" edits
    # rather than "addition" or "deletion" edits.
    # This is computed by looking at the relative distances
    # between "boundary marks", i.e.
    # a reference segmentation looks like:     0000000100000001000000010001000
    # and a predicted segmentation looks like: 0000010000000000010000001001000
    # Depending on how far away a boundary is, it can be considered
    # an addition, a deletion, or a translation.
    # Additions and deletions are penalized more heavily than translations.
    # In our case, there is a lot of fuzziness around what the "true" first
    # and last sentences are because there is a lot of administrative talk,
    # so to allow for this fuzziness we boost the "n (allowed) T(ranslation)".
    #
    # Ultimately this says that we allow a translation to span up to ~5% of
    # the transcript length (the `0.05` factor below).
    sim = segeval.boundary_similarity(row["true_masses"], predicted_masses, n_t=int(len(row["text"]) * 0.05))

    # Get confusion matrix
    matrix = segeval.boundary_confusion_matrix(row["true_masses"], predicted_masses, n_t=int(len(row["text"]) * 0.05))

    # Get precision, recall, and f1
    precision = segeval.precision(matrix)
    recall = segeval.recall(matrix)
    f1 = segeval.fmeasure(matrix)

    # Add to results
    results.append({
        "text": row["text"],
        "meta": row["meta"],
        "true_masses": row["true_masses"],
        "predicted_masses": predicted_masses,
        "similarity": sim,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    })

# Convert to dataframe
results_df = pd.DataFrame(results)

print("Mean Similarity:", results_df["similarity"].mean())
print("Mean Precision:", results_df["precision"].mean())
print("Mean Recall:", results_df["recall"].mean())
print("Mean F1:", results_df["f1"].mean())
print()
results_df

100%|██████████| 3/3 [33:50<00:00, 676.78s/it]

Mean Similarity: 0.8734748287940417
Mean Precision: 1.0
Mean Recall: 1.0
Mean F1: 1.0






| | text | meta | true_masses | predicted_masses | similarity | precision | recall | f1 |
|---|---|---|---|---|---|---|---|---|
| 0 | Thank you. Have a great day. Good morning, eve... | {'event_id': 'c511fea02999', 'session_id': '93... | [447, 3749, 72862] | [447, 3583, 73028] | 0.9784527518172376 | 1 | 1 | 1 |
| 1 | The May 11th, 2022 meeting of the Seattle City... | {'event_id': 'b5e3673a68ff', 'session_id': 'c9... | [3133, 3043, 59419] | [982, 5101, 59512] | 0.6578225068618481 | 1 | 1 | 1 |
| 2 | Good morning, everybody. The March 3, 2021 mee... | {'event_id': '749503d88894', 'session_id': 'f6... | [1671, 10938, 1491, 1725, 144754] | [1671, 10878, 1218, 1942, 144870] | 0.9841492277030393 | 1 | 1 | 1 |


In [4]:
results_df[["true_masses", "predicted_masses", "similarity", "f1"]]

| | true_masses | predicted_masses | similarity | f1 |
|---|---|---|---|---|
| 0 | [447, 3749, 72862] | [447, 3583, 73028] | 0.9784527518172376 | 1 |
| 1 | [3133, 3043, 59419] | [982, 5101, 59512] | 0.6578225068618481 | 1 |
| 2 | [1671, 10938, 1491, 1725, 144754] | [1671, 10878, 1218, 1942, 144870] | 0.9841492277030393 | 1 |
