# Political topic unification
---
Testing the merging of political topics into high-level, concise topics.

## Setup

### Import libraries

In [None]:
import numpy as np
from pydantic import BaseModel, Field
from openai import OpenAI

In [None]:
from polids.config import settings
from polids.utils.text_similarity import compute_text_similarity_scores
from polids.topic_unification.openai import OpenAITopicUnifier

### Set parameters

In [None]:
system_prompt = """# Role
You are a political data analyst specialized in categorizing policy areas into standardized, concise supercategories for downstream analysis.

# Objective
Process a list of political topics with frequency counts to:
1. Generate concise, high-level policy categories (unified topics).
2. Map each input topic to its corresponding unified category.
3. Output the results as structured JSON following a defined schema.

# Input Format
A list where each line contains a topic and its frequency, separated by |. The list is pre-sorted in descending order by frequency.
```markdown
- Topic String 1 | Frequency 1
- Topic String 2 | Frequency 2
...
- Topic String N | Frequency N
```

# Task
1. **Parse Input:** Extract the topic string and frequency integer from each line of the input markdown list. The topic string alone will be used as the key in the output mapping.
2. **Cluster Semantically:** Group input topics that share core concepts, address related issues, or belong to a clear broader policy domain. Use the descending frequency order as a strong signal for identifying core themes, but prioritize semantic coherence.
3. **Define Unified Topics:** For each cluster, create a single, concise name for the unified topic.
4. **Map Inputs:** Assign each input topic string to exactly one unified topic name.
5. **Generate Output:** Format the results according to the specified JSON schema.

# Guidelines
- **Conciseness:** Unified topic names must be 1-5 words. (e.g., "Economy", "Healthcare", "Climate Action").
- **Mutual Exclusivity:** Unified topics must be distinct and non-overlapping in scope. Each input topic must map to only one unified topic.
- **Comprehensiveness:** All input topics must be mapped to a unified topic.
- **Political Relevance:** Use standard political science or common policy terminology for unified topic names.
- **Balance:** Aim for a reasonable number of unified topics (typically 5-12) – avoid over-granularity or excessive consolidation.

# Example
## Input
- Environmental Protection | 20
- Economic Growth Initiatives | 18
- Immigration reform | 17
- Climate Change Policies | 16
- Healthcare for All | 15
- Job creation programs | 14
- Strengthen military | 13
- Universal Healthcare Coverage | 12
- Improve public schools | 11
- Sustainable Economic Development | 10
- Border security funding | 10
- Renewable energy subsidies | 9
- Lower prescription drug costs | 8
- Reduce corporate tax rate | 7
- Affordable higher education | 6
- Combat Global Warming | 5
- Foreign aid reform | 4
## Output
{
  "unified_topics": [
    "Climate Action",
    "Economy",
    "Immigration & Border Security",
    "Healthcare",
    "Education",
    "Defense & Foreign Policy"
  ],
  "topic_mapping": {
    "Environmental Protection": "Climate Action",
    "Economic Growth Initiatives": "Economy",
    "Immigration reform": "Immigration & Border Security",
    "Climate Change Policies": "Climate Action",
    "Healthcare for All": "Healthcare",
    "Job creation programs": "Economy",
    "Strengthen military": "Defense & Foreign Policy",
    "Universal Healthcare Coverage": "Healthcare",
    "Improve public schools": "Education",
    "Sustainable Economic Development": "Economy",
    "Border security funding": "Immigration & Border Security",
    "Renewable energy subsidies": "Climate Action",
    "Lower prescription drug costs": "Healthcare",
    "Reduce corporate tax rate": "Economy",
    "Affordable higher education": "Education",
    "Combat Global Warming": "Climate Action",
    "Foreign aid reform": "Defense & Foreign Policy"
  }
}"""

In [None]:
max_retries = 5

In [None]:
def map_topic(input_topic: str, topic_mapping: dict[str, list[str]]) -> str:
    """
    Maps the input topic to its corresponding unified topic using the provided mapping.
    If the input topic is not found in the mapping, it returns the most similar topic,
    based on character-based and semantic similarity.

    Args:
        input_topic (str): The topic to be mapped.
        topic_mapping (dict[str, list[str]]): A dictionary mapping output topics to input topics.

    Returns:
        str: The mapped topic.
    """
    mapped_output_topics = []
    for output_topic, input_topics in topic_mapping.items():
        for input_topic_ in input_topics:
            if input_topic == input_topic_:
                # Add the exact match to the list of mapped output topics
                # (note that we don't have a guarantee that there is only one exact match)
                mapped_output_topics.append(output_topic)
    if len(mapped_output_topics) == 1:
        return mapped_output_topics[0]
    elif len(mapped_output_topics) > 1:
        # Compare the mapped output topics to find the most similar one
        output_topics_to_compare = mapped_output_topics
    else:
        # Compare all output topics to find the most similar one
        output_topics_to_compare = list(topic_mapping.keys())
    output_topic_similarity_scores = {
        output_topic: np.mean(compute_text_similarity_scores(input_topic, output_topic))
        for output_topic in output_topics_to_compare
    }
    # Return the output topic with the highest similarity score
    return max(
        output_topic_similarity_scores,
        # Compare each dictionary key based on their average similarity score
        # and return the one with the highest score
        key=output_topic_similarity_scores.get,  # type: ignore
    )
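
The fallback logic in `map_topic` can be tried in isolation with a small sketch. Note the assumptions: `char_similarity` and `map_topic_sketch` are hypothetical names, and `difflib.SequenceMatcher` stands in for `compute_text_similarity_scores`, so the scores are purely character-based and will differ from the ones the real helper produces.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-based similarity in [0, 1] (assumed stand-in for the polids helper)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_topic_sketch(input_topic: str, topic_mapping: dict[str, list[str]]) -> str:
    """Use the exact match if there is exactly one; otherwise pick the most similar name."""
    exact = [
        unified
        for unified, originals in topic_mapping.items()
        if input_topic in originals
    ]
    if len(exact) == 1:
        return exact[0]
    # Several exact matches: choose among them. No match: choose among all unified topics.
    candidates = exact or list(topic_mapping)
    return max(candidates, key=lambda unified: char_similarity(input_topic, unified))


example_mapping = {
    "Healthcare": ["Healthcare for All", "Lower prescription drug costs"],
    "Education": ["Improve public schools"],
}
# Exact match: the topic is listed under "Education"
print(map_topic_sketch("Improve public schools", example_mapping))
# No exact match: falls back to the most similar unified topic name
print(map_topic_sketch("Universal Healthcare Coverage", example_mapping))
```

Purely character-based scores can favor names that merely share a suffix, which is presumably why the original function averages character-based and semantic scores.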

## Load topics to merge
We're going to use manually defined topics, so as to avoid dependencies on previous steps of the pipeline.

In [None]:
topics_to_unify = {
    # Social Issues / Civil Rights
    "Protecting LGBTQ+ rights": 18,
    "Ensuring voting access for all": 25,
    "Criminal justice reform initiatives": 22,
    "Addressing systemic racism": 15,
    "Defending Second Amendment freedoms": 28,
    "Common-sense gun safety laws": 26,
    "Protecting freedom of religion": 12,
    "Police funding and accountability": 19,
    "Reproductive healthcare access": 23,
    # Technology & Infrastructure
    "Expanding rural broadband internet": 16,
    "Investing in national infrastructure (roads, bridges)": 30,
    "Cybersecurity for critical systems": 14,
    "Regulating big tech companies": 11,
    "Modernizing the power grid": 17,
    "Funding for scientific research (NIH, NSF)": 9,
    # Government Reform & Ethics
    "Campaign finance reform": 20,
    "Addressing political corruption": 13,
    "Term limits for elected officials": 8,
    "Protecting whistleblowers": 6,
    "Strengthening ethics regulations": 10,
    # Economy & Labor (Different Focus)
    "Raising the federal minimum wage": 24,
    "Supporting small business recovery": 21,
    "Protecting workers' right to organize": 18,
    "Addressing income inequality": 20,
    "Investing in workforce development programs": 15,
    # Agriculture & Environment (Different Focus)
    "Supporting sustainable farming practices": 7,
    "Ensuring food safety and security": 11,
    "Protecting national parks and public lands": 19,
    "Water resource management": 14,
    # Other
    "Addressing the opioid crisis": 22,
    "Improving veterans' healthcare and benefits": 27,
    "Affordable housing and homelessness": 16,
    "Disaster preparedness and relief funding": 10,
}

# Sort the topics by frequency in descending order
sorted_topics = sorted(topics_to_unify.items(), key=lambda x: x[1], reverse=True)
# Format the sorted topics into the required input format
input_markdown = "\n".join([f"- {topic} | {freq}" for topic, freq in sorted_topics])
print(input_markdown)
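
As a sanity check on the input format, the formatted list can be parsed back into a dictionary. This is only a sketch: `format_topics` and `parse_topics` are hypothetical helpers, not part of `polids`, and the parser assumes one `- topic | frequency` entry per line.

```python
def format_topics(topic_frequencies: dict[str, int]) -> str:
    """Render topics as '- topic | frequency' lines, most frequent first."""
    ordered = sorted(topic_frequencies.items(), key=lambda item: item[1], reverse=True)
    return "\n".join(f"- {topic} | {freq}" for topic, freq in ordered)


def parse_topics(markdown: str) -> dict[str, int]:
    """Parse '- topic | frequency' lines back into a dictionary."""
    parsed = {}
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("- ") or "|" not in line:
            continue  # Skip blank or unrelated lines
        # Split on the last '|' so a topic that itself contains '|' stays intact
        topic, _, freq = line[2:].rpartition("|")
        parsed[topic.strip()] = int(freq)
    return parsed


sample = {"Water resource management": 14, "Campaign finance reform": 20}
formatted = format_topics(sample)
print(formatted)
print(parse_topics(formatted) == sample)
```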

## Define the output schema

In [None]:
class UnifiedTopic(BaseModel):
    """
    Represents a high-level policy category with its constituent original topics.
    Each unified topic should contain at least two original topics.
    """

    name: str = Field(
        description="Concise name for the unified policy category (1-5 words)",
    )

    original_topics: list[str] = Field(
        description="List of original input topics mapped to this unified category",
    )


class UnifiedTopicsOutput(BaseModel):
    """
    Schema for political topic unification output with nested topic structure.
    Ensures mutual exclusivity through explicit grouping in separate unified topics.
    """

    unified_topics: list[UnifiedTopic] = Field(
        description="List of high-level policy categories with their constituent original topics. "
        "Each unified topic must contain at least two original topics, and all original "
        "topics from input must be included across the list. Categories must be mutually exclusive.",
    )

In [None]:
def get_topic_mapping_from_unified_topics(
    unified_topics: UnifiedTopicsOutput,
) -> dict[str, list[str]]:
    """
    Converts the unified topics output into a mapping of unified topic names to their original topics.

    Args:
        unified_topics (UnifiedTopicsOutput): The unified topics output.

    Returns:
        dict[str, list[str]]: A dictionary mapping each unified topic name to its original topics.
    """
    return {ut.name: ut.original_topics for ut in unified_topics.unified_topics}
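
The schema only asks for mutual exclusivity; nothing validates it. A minimal sketch of a check is shown here, assuming the mapping has the `dict[str, list[str]]` shape returned above. `invert_topic_mapping` is a hypothetical helper that flattens the mapping and reports original topics listed under more than one unified topic.

```python
from collections import defaultdict


def invert_topic_mapping(
    topic_mapping: dict[str, list[str]],
) -> tuple[dict[str, str], dict[str, list[str]]]:
    """
    Returns (flat, conflicts):
    - flat maps each original topic to the first unified topic that lists it
    - conflicts holds original topics listed under more than one unified topic
    """
    owners = defaultdict(list)
    for unified, originals in topic_mapping.items():
        for original in originals:
            if unified not in owners[original]:
                owners[original].append(unified)
    flat = {original: unified_list[0] for original, unified_list in owners.items()}
    conflicts = {
        original: unified_list
        for original, unified_list in owners.items()
        if len(unified_list) > 1
    }
    return flat, conflicts


flat, conflicts = invert_topic_mapping(
    {
        "Economy": ["Job creation programs", "Reduce corporate tax rate"],
        "Labor": ["Job creation programs"],
    }
)
print(flat)
print(conflicts)
```

An empty `conflicts` dictionary means the output respected mutual exclusivity.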

## Test different LLMs

In [None]:
# Save the outputs of each method in a dictionary (key = method name)
topics_unification_results = {}

### Initialize the LLM client

In [None]:
client = OpenAI(api_key=settings.openai_api_key)

### GPT 4.1 nano

In [None]:
llm_name = "gpt-4.1-nano-2025-04-14"
completion = client.beta.chat.completions.parse(
    model=llm_name,
    messages=[
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": input_markdown,
        },
    ],
    response_format=UnifiedTopicsOutput,  # Specify the schema for the structured output
    temperature=0,  # Low temperature should lead to less hallucination
    seed=42,  # Fix the seed for reproducibility
)
topics_unification_results[llm_name] = completion.choices[0].message.parsed
assert isinstance(topics_unification_results[llm_name], UnifiedTopicsOutput), (
    "Output does not match the expected schema."
)
topics_unification_results[llm_name]

In [None]:
unified_topic_names = [
    topic.name for topic in topics_unification_results[llm_name].unified_topics
]
print(
    f"{llm_name} generated {len(unified_topic_names)} unified topics: {', '.join(unified_topic_names)}"
)
topic_word_lengths = [
    len(topic.name.split())
    for topic in topics_unification_results[llm_name].unified_topics
]
print(
    f"{llm_name} generated unified topic names averaging {sum(topic_word_lengths) / len(topic_word_lengths):.2f} words each"
)
mapped_topics = [
    original_topic
    for unified_topic in topics_unification_results[llm_name].unified_topics
    for original_topic in unified_topic.original_topics
]
unmapped_topics = [
    topic for topic in topics_to_unify.keys() if topic not in mapped_topics
]
print(
    f"{llm_name} left {len(unmapped_topics)} unmapped topics: {', '.join(unmapped_topics)}"
)
hallucinated_topics = [
    topic for topic in mapped_topics if topic not in topics_to_unify.keys()
]
print(
    f"{llm_name} hallucinated {len(hallucinated_topics)} topics: {', '.join(hallucinated_topics)}"
)

In [None]:
topic_mapping = get_topic_mapping_from_unified_topics(
    unified_topics=topics_unification_results[llm_name]
)
mapped_topics = {
    input_topic: map_topic(input_topic, topic_mapping)
    for input_topic in topics_to_unify.keys()
}
mapped_topics

GPT 4.1 nano can sometimes hallucinate topics. It also often leaves input topics unmapped.
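
One way to handle the two failure modes noted above is to validate each output and retry, which would also put the so far unused `max_retries` parameter to work. The following is a sketch under stated assumptions: `check_coverage` and `unify_with_retries` are hypothetical helpers, `generate` stands for any zero-argument callable that returns a unified-topic mapping, and nothing here describes how `OpenAITopicUnifier` actually handles retries.

```python
from typing import Callable


def check_coverage(
    input_topics: set[str], topic_mapping: dict[str, list[str]]
) -> tuple[set[str], set[str]]:
    """Return (unmapped, hallucinated) topic sets for one model output."""
    mapped = {topic for originals in topic_mapping.values() for topic in originals}
    return input_topics - mapped, mapped - input_topics


def unify_with_retries(
    generate: Callable[[], dict[str, list[str]]],
    input_topics: set[str],
    max_retries: int = 5,
) -> dict[str, list[str]]:
    """Call generate() until its output covers exactly the input topics.

    Returns the last output if no attempt passes the check.
    """
    topic_mapping: dict[str, list[str]] = {}
    for _ in range(max_retries):
        topic_mapping = generate()
        unmapped, hallucinated = check_coverage(input_topics, topic_mapping)
        if not unmapped and not hallucinated:
            break
    return topic_mapping


# Simulated model outputs: the first attempt leaves "Schools" unmapped
attempts = iter(
    [
        {"Economy": ["Jobs"]},
        {"Economy": ["Jobs"], "Education": ["Schools"]},
    ]
)
result = unify_with_retries(lambda: next(attempts), {"Jobs", "Schools"})
print(result)
```

Retrying only helps if the output varies between calls; with `temperature=0` and a fixed `seed`, as in the cells above, repeated calls are likely to return very similar results, so a real retry loop would probably need to change the seed or the prompt.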

### GPT 4.1 mini

In [None]:
llm_name = "gpt-4.1-mini-2025-04-14"
completion = client.beta.chat.completions.parse(
    model=llm_name,
    messages=[
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": input_markdown,
        },
    ],
    response_format=UnifiedTopicsOutput,  # Specify the schema for the structured output
    temperature=0,  # Low temperature should lead to less hallucination
    seed=42,  # Fix the seed for reproducibility
)
topics_unification_results[llm_name] = completion.choices[0].message.parsed
assert isinstance(topics_unification_results[llm_name], UnifiedTopicsOutput), (
    "Output does not match the expected schema."
)
topics_unification_results[llm_name]

In [None]:
unified_topic_names = [
    topic.name for topic in topics_unification_results[llm_name].unified_topics
]
print(
    f"{llm_name} generated {len(unified_topic_names)} unified topics: {', '.join(unified_topic_names)}"
)
topic_word_lengths = [
    len(topic.name.split())
    for topic in topics_unification_results[llm_name].unified_topics
]
print(
    f"{llm_name} generated unified topic names averaging {sum(topic_word_lengths) / len(topic_word_lengths):.2f} words each"
)
mapped_topics = [
    original_topic
    for unified_topic in topics_unification_results[llm_name].unified_topics
    for original_topic in unified_topic.original_topics
]
unmapped_topics = [
    topic for topic in topics_to_unify.keys() if topic not in mapped_topics
]
print(
    f"{llm_name} left {len(unmapped_topics)} unmapped topics: {', '.join(unmapped_topics)}"
)
hallucinated_topics = [
    topic for topic in mapped_topics if topic not in topics_to_unify.keys()
]
print(
    f"{llm_name} hallucinated {len(hallucinated_topics)} topics: {', '.join(hallucinated_topics)}"
)

In [None]:
topic_mapping = get_topic_mapping_from_unified_topics(
    unified_topics=topics_unification_results[llm_name]
)
mapped_topics = {
    input_topic: map_topic(input_topic, topic_mapping)
    for input_topic in topics_to_unify.keys()
}
mapped_topics

GPT 4.1 mini can leave some input topics unmapped.

### GPT 4.1

In [None]:
llm_name = "gpt-4.1-2025-04-14"
completion = client.beta.chat.completions.parse(
    model=llm_name,
    messages=[
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": input_markdown,
        },
    ],
    response_format=UnifiedTopicsOutput,  # Specify the schema for the structured output
    temperature=0,  # Low temperature should lead to less hallucination
    seed=42,  # Fix the seed for reproducibility
)
topics_unification_results[llm_name] = completion.choices[0].message.parsed
assert isinstance(topics_unification_results[llm_name], UnifiedTopicsOutput), (
    "Output does not match the expected schema."
)
topics_unification_results[llm_name]

In [None]:
unified_topic_names = [
    topic.name for topic in topics_unification_results[llm_name].unified_topics
]
print(
    f"{llm_name} generated {len(unified_topic_names)} unified topics: {', '.join(unified_topic_names)}"
)
topic_word_lengths = [
    len(topic.name.split())
    for topic in topics_unification_results[llm_name].unified_topics
]
print(
    f"{llm_name} generated unified topic names averaging {sum(topic_word_lengths) / len(topic_word_lengths):.2f} words each"
)
mapped_topics = [
    original_topic
    for unified_topic in topics_unification_results[llm_name].unified_topics
    for original_topic in unified_topic.original_topics
]
unmapped_topics = [
    topic for topic in topics_to_unify.keys() if topic not in mapped_topics
]
print(
    f"{llm_name} left {len(unmapped_topics)} unmapped topics: {', '.join(unmapped_topics)}"
)
hallucinated_topics = [
    topic for topic in mapped_topics if topic not in topics_to_unify.keys()
]
print(
    f"{llm_name} hallucinated {len(hallucinated_topics)} topics: {', '.join(hallucinated_topics)}"
)

In [None]:
topic_mapping = get_topic_mapping_from_unified_topics(
    unified_topics=topics_unification_results[llm_name]
)
mapped_topics = {
    input_topic: map_topic(input_topic, topic_mapping)
    for input_topic in topics_to_unify.keys()
}
mapped_topics

GPT 4.1 seems pretty reliable in following the instructions: there is no sign of hallucinated or unmapped topics. It also consistently generates a shorter list of unified topics, each more concisely worded than those from the smaller LLMs.

### Implemented solution

In [None]:
topic_unifier = OpenAITopicUnifier()
topics_unification_results_polids = topic_unifier.process(
    input_topic_frequencies=topics_to_unify
)
topics_unification_results_polids

In [None]:
mapped_topics = {
    input_topic: topic_unifier.map_input_topic_to_unified_topic(input_topic)
    for input_topic in topics_to_unify.keys()
}
mapped_topics