# Using LLMs to classify if social media posts are political or not (Part II).

I'm working on a project that involves gathering social media posts from [Bluesky](https://bsky.app/) and analyzing them. Part of that project requires knowing which posts are about political or social topics and, if so, which political side they support. Current ML classifiers don't work that well out of the box, so I'm building our own classification scheme using LLMs, and I've found that LLMs work quite well for this task. I've used Llama3-8b and Llama3-70b via [Groq](https://groq.com/) so far, but I'm also open to experimenting with other open-source models (I have the on-prem infrastructure to host our own models, which is much cheaper at scale).

Previously, I tried naive text classification and then added context to the classifications. Now that I've shown that individual prompts work, I'd like to refine this approach. We already know that it works well enough for a quick pilot, but (1) it runs very slowly because it runs serially, and (2) we want to improve the context that we provide to the LLM even more.

Specifically, there are a few new things that I'd like to try:
- Can we change the prompt format so that the post information is given as a JSON?
- Can we implement batching? Implementing batching would have two main speed-ups:
    - Reduce the number of requests that are sent, since we can group things together.
    - Reduce input token count (since we don't need to add the question and background preamble portions of the prompt over and over again).
- Can we improve the context? Can we add context about current events?
- How does our model perform with other LLMs (e.g., Mixtral)?
- Can we experiment with optimizing the prompt (e.g., with [dspy](https://github.com/stanfordnlp/dspy))?
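
To make the batching speed-ups above concrete, here is a rough back-of-the-envelope sketch. It assumes each request pays for the question and background preamble once, plus the tokens of every post in that request. The function name and all of the numbers are hypothetical, and it ignores output tokens and any per-post separators.

```python
import math


def estimate_input_tokens(
    n_posts: int,
    preamble_tokens: int,
    tokens_per_post: int,
    batch_size: int = 1,
) -> dict:
    """Estimate request count and input tokens for a given batch size.

    Each request includes the preamble once, plus the tokens of every post
    in that request.
    """
    n_requests = math.ceil(n_posts / batch_size)
    total_tokens = n_requests * preamble_tokens + n_posts * tokens_per_post
    return {"requests": n_requests, "input_tokens": total_tokens}


# Hypothetical numbers, for illustration only.
serial = estimate_input_tokens(1000, preamble_tokens=400, tokens_per_post=100)
batched = estimate_input_tokens(
    1000, preamble_tokens=400, tokens_per_post=100, batch_size=10
)
print(serial)   # one post per request
print(batched)  # ten posts per request
```

Under these made-up numbers, the preamble dominates the serial cost, so grouping posts cuts both the request count and the input tokens substantially. Real savings depend on the actual preamble and post lengths.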

### Can we change the prompt format so that the post information is given as a JSON?

Currently, when we have posts with context, it looks something like this:

```plaintext
Pretend that you are a classifier that predicts whether a post has sociopolitical content or not. Sociopolitical refers \
to whether a given post is related to politics (government, elections, politicians, activism, etc.) or \
social issues (major issues that affect a large group of people, such as the economy, inequality, \
racism, education, immigration, human rights, the environment, etc.). We refer to any content \
that is classified as being either of these two categories as "sociopolitical"; otherwise they are not sociopolitical. \
Please classify the following text denoted in <text> as "sociopolitical" or "not sociopolitical". 

Then, if the post is sociopolitical, classify the text based on the political lean of the opinion or argument \
it presents. Your options are "democrat", "republican", or 'unclear'. \
You are analyzing text that has been pre-identified as 'political' in nature. \
If the text is not sociopolitical, return "unclear".

Think through your response step by step.

Return in a JSON format in the following way:
{
    "sociopolitical": <two values, 'sociopolitical' or 'not sociopolitical'>,
    "political_ideology": <three values, 'democrat', 'republican', 'unclear'>,
    "reason_sociopolitical": <optional, a 1 sentence reason for why the text is sociopolitical. If none, return an empty string, "">,
    "reason_political_ideology": <optional, a 1 sentence reason for why the text has the given political ideology. If none, return an empty string, "">
}

All of the fields in the JSON must be present for the response to be valid, and the answer must be returned in JSON format.

<text>
Curious what it looks like when people are aggressively not moving.

Here is the post text that needs to be classified:
'''
<text>
Curious what it looks like when people are aggressively not moving.
'''
Here is some context on the post that needs classification:
'''
<Thread that the post is a part of>

The post is a reply to another post with the following details:
'''
[text]: A high ranking NYPD official characterized NYU faculty linking hands and refusing to move—a classic tactic of *peaceful* protest—as aggression toward its officers and he got it uncritically published by the local Fox station and every outlet that aggregated from them. wapo.st/3UwN0Vf 🎁
'''
'''
Again, the text of the post that needs to be classified is:
'''
<text>
Curious what it looks like when people are aggressively not moving.
```

The formatting is not very clean and relies on adding arbitrary string chunks to our prompt. We could improve the formatting of the post information by instead structuring our code like the following:

```python
{
	"text": "<text to classify>",
	"context": {
		"content_referenced_in_post": "<Information about content referenced in the post, if any>",
		"urls_in_post": "<information about URLs in the post, if any>",
		"post_thread": "<Information about the thread that the post is a part of, if any>",
		"post_tags_and_labels": "<Information about any tags or labels in a post, if any>",
		"current_events_context": "<Any information about current events that might help the model, if any>",
		"post_author_context": "<Any information about the author that might help the model, if any>"
	}
}
```

I currently have [this](https://github.com/METResearchGroup/bluesky-research/blob/main/ml_tooling/llm/prompt_helper.py) function that does the bulk of the work in setting up the context of the prompt:

````python
def generate_context_string(
    post: dict,
    context_details_list: list[tuple],
    justify_result: bool = False,
    only_json_format: bool = False
) -> str:
    """Given a list of (context_type, context_details) tuples, generate
    the context string for the prompt."""
    post_text = post["text"]
    full_context = ""
    for context_type, context_details in context_details_list:
        full_context += f"<{context_type}>\n {context_details}\n"
    if full_context:
        full_context = f"""
The classification of a post might depend on contextual information. \
For example, the text in a post might comment on an image or on a retweeted post. \
Attend to the context where appropriate. \
Here is some context on the post that needs classification: \
```
{full_context}
```
Again, the text of the post that needs to be classified is:
```
<text>
{post_text}
```
"""  # noqa
    if justify_result:
        full_context += "\nAfter giving your label, start a new line and then justify your answer in 1 sentence."  # noqa
    else:
        full_context += "\nJustifications are not necessary."
    if only_json_format:
        full_context += "\nReturn ONLY the JSON. I will parse the string result in JSON format."
    return full_context
````

This code relies on `context_details_list`, a list of tuples where the first value describes the type of context and the second value is the context string for that type. Each string is produced by a dedicated function, and types with no context are skipped:

```python
post_context_and_funcs = [
    ("Content referenced or linked to in the post", embedded_content_context),
    ("URLs", post_linked_urls),
    ("Thread that the post is a part of", define_thread_context),
    ("Tags and labels in the post", post_tags_labels),
    ("Context about current events", additional_current_events_context),
    ("Context about the post author", post_author_context)
]


def generate_context_details_list(post: dict) -> list[tuple]:
    context_details_list = []
    for context_name, context_func in post_context_and_funcs:
        context = context_func(post)
        if context:
            context_details_list.append((context_name, context))
    return context_details_list
```

We can refactor this logic so that each context type gets a JSON-friendly key. That lets us build a dictionary where the keys are the context types and the values are the details that we want for each type of context.

```python
post_context_and_funcs = [
    ("content_referenced_in_post", embedded_content_context),
    ("urls_in_post", post_linked_urls),
    ("post_thread", define_thread_context),
    ("post_tags_labels", post_tags_labels),
    ("current_events_context", additional_current_events_context),
    ("post_author_context", post_author_context)
]
```

Now I just have to change each function so that it returns a dictionary with the values for that type of context, rather than a string. For example, I can change `post_author_context` to return a dictionary:

```python
def post_author_context(post: dict) -> dict:
    """Returns contextual information about the post's author.

    For now, we just return if the post author is a news org, but we can add to
    this later on.
    """
    author = post["author"]
    return {
        "post_author_is_reputable_news_org": author in bsky_did_to_news_org_name
    }
```

To enforce the schemas, I also created Pydantic models. This'll help us do some type checking, more conveniently update our expected schemas, and make the expected result much more apparent.

```python
from pydantic import BaseModel, Field, validator
from typing import Optional, Union


class ImagesContextModel(BaseModel):
    """Pydantic model for the images context.

    Since we don't have OCR, we can't extract text from images, but we can
    extract the alt texts of the images in the post.
    """
    image_alt_texts: Optional[str] = Field(
        default=None,
        description="The alt texts of the images in the post."
    )


class RecordContextModel(BaseModel):
    """Pydantic model for the record context, which are records that are
    referenced in a post. Records are just another name for posts, such as if
    a post links to another Bluesky post."""
    text: Optional[str] = Field(
        default=None,
        description="The text of the post."
    )
    embed_image_alt_text: Optional[ImagesContextModel] = Field(
        default=None,
        description="The alt text of the embedded image in the post."
    )


class RecordWithMediaContextModel(BaseModel):
    """Pydantic model for the record with media context.

    This is a record that has media, such as an image or video.
    """
    images_context: Optional[ImagesContextModel] = Field(
        default=None,
        description="The images context of the post."
    )
    embedded_record_context: Optional[RecordContextModel] = Field(
        default=None,
        description="The record context of the embedded post."
    )
```

That way, we can now create functions for adding context that look like this:

```python
def post_linked_urls(post: dict) -> PostLinkedUrlsContextModel:
    """Context if the post refers to any URLs in the text."""
    url_in_text_context: ContextUrlInTextModel = context_url_in_text(post)
    embed_url_context: ContextEmbedUrlModel = context_embed_url(post)

    return PostLinkedUrlsContextModel(
        url_in_text_context=url_in_text_context,
        embed_url_context=embed_url_context
    )
```

Each function takes care of a particular type of context. We can then create our overall context by chaining these functions together to look like this:

```python
post_context_and_funcs = [
    ("content_referenced_in_post", embedded_content_context),
    ("urls_in_post", post_linked_urls),
    ("post_thread", define_thread_context),
    ("post_tags_labels", post_tags_labels),
    #("current_events_context", additional_current_events_context),
    ("post_author_context", post_author_context)
]


def generate_context_details_list(post: dict) -> list[tuple]:
    """Generates a list of tuples of (context_type, context_details) for a
    post."""
    context_details_list = []
    for context_name, context_func in post_context_and_funcs:
        context = context_func(post)
        if context:
            context_details_list.append((context_name, context))
    return context_details_list


def generate_post_and_context_json(post: dict) -> dict:
    """Creates a JSON object with the post and its context.

    The JSON object has the following format:
    {
        "text": "The text of the post",
        "context": {
            "context_type": "context_details"
        }
    }
    """
    context_details_list: list[tuple] = generate_context_details_list(post)
    context_dict = {
        # convert each Pydantic model to dict
        context_type: context_details.dict()
        for (context_type, context_details) in context_details_list
    }
    return {
        "text": post["text"],
        "context": context_dict
    }
```

This creates a JSON object that looks like this:
```python
{
    "text": "The text of the post",
    "context": {
        "context_type": "context_details"
    }
}
```
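
One detail to watch when embedding this object in a prompt: formatting a Python dict directly produces Python's repr (single quotes, `None`, `False`), which is not strict JSON. Below is a minimal sketch of serializing it with the standard library's `json.dumps` instead. `render_post_json` and the sample dict are hypothetical, and I haven't tested whether strict JSON changes how the model responds.

```python
import json


def render_post_json(post_and_context: dict) -> str:
    """Serialize the post/context dict as strict, indented JSON."""
    # ensure_ascii=False keeps emoji and other non-ASCII post text readable.
    return json.dumps(post_and_context, indent=2, ensure_ascii=False)


# Hypothetical sample, shaped like the return value of
# generate_post_and_context_json.
sample = {
    "text": "The text of the post",
    "context": {
        "post_author_context": {"post_author_is_reputable_news_org": False},
        "post_thread": {"thread_parent_post": None},
    },
}
rendered = render_post_json(sample)
print(rendered)
```

The rendered string uses double quotes, `false`, and `null`, so it can be parsed back with `json.loads` if we ever need to log and replay prompts.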

If we take a post that we looked at last time:

![Example Bluesky post](assets/images/sample_post_no_context_2.png "Example Bluesky post")

We can see what the new prompt to the LLM will look like:

```plaintext
Pretend that you are a classifier that predicts whether a post has sociopolitical content or not. Sociopolitical refers to whether a given post is related to politics (government, elections,
politicians, activism, etc.) or social issues (major issues that affect a large group of people, such as the economy, inequality, racism, education, immigration, human rights, the environment, etc.).
We refer to any content that is classified as being either of these two categories as "sociopolitical"; otherwise they are not sociopolitical. Please classify the following text denoted in <text> as
"sociopolitical" or "not sociopolitical".

Then, if the post is sociopolitical, classify the text based on the political lean of the opinion or argument it presents. Your options are "democrat", "republican", or 'unclear'. You are analyzing
text that has been pre-identified as 'political' in nature. If the text is not sociopolitical, return "unclear".

Think through your response step by step.

Return in a JSON format in the following way:
{
    "sociopolitical": <two values, 'sociopolitical' or 'not sociopolitical'>,
    "political_ideology": <three values, 'democrat', 'republican', 'unclear'>,
    "reason_sociopolitical": <optional, a 1 sentence reason for why the text is sociopolitical. If none, return an empty string, "">,
    "reason_political_ideology": <optional, a 1 sentence reason for why the text has the given political ideology. If none, return an empty string, "">
}

All of the fields in the JSON must be present for the response to be valid, and the answer must be returned in JSON format.


Here is the post text that needs to be classified:
'''
<text>
Time for Shafiq to go. Ironic her cowing to unappeasable pols to keep her job actually led to her losing it. And deserved too.
'''


The following JSON object contains the post and its context:
'''
    {'context': {'content_referenced_in_post': {'embedded_content_type': None, 'embedded_record_with_media_context': None, 'has_embedded_content': False},
             'post_author_context': {'post_author_is_reputable_news_org': False},
             'post_tags_labels': {'post_labels': '', 'post_tags': ''},
             'post_thread': {'thread_parent_post': {'embedded_image_alt_text': None,
                                                    'text': "Faculty walkout at Columbia. I think it's safe to say President Shafiq & other leaders have lost their confidence."},
                             'thread_root_post': {'embedded_image_alt_text': None,
                                                  'text': "Faculty walkout at Columbia. I think it's safe to say President Shafiq & other leaders have lost their confidence."}},
             'urls_in_post': {'embed_url_context': {'is_trustworthy_news_article': False, 'url': None}, 'url_in_text_context': {'has_trustworthy_news_links': False}}},
 'text': 'Time for Shafiq to go. Ironic her cowing to unappeasable pols to keep her job actually led to her losing it. And deserved too.'}
'''
```

It doesn't do anything to change our result:

```python

```

But it does make the prompt cleaner and more structured.

### Batching

We want to be able to do the following:
- Classify posts in bulk
- Be able to error-check what the model gives us:
    - If we're classifying posts in bulk, make sure that we get as many JSONs as we expect.
    - Make sure that each JSON is properly formatted (has all the correct fields and compiles to JSON).
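
The error checks above could look something like the sketch below. It assumes the model is asked to return a JSON array with one object per post, in the same order as the posts in the prompt. That is an assumption about a batched prompt format that isn't settled yet. The field names and allowed values come from the current prompt, and `validate_batch_response` is a hypothetical helper.

```python
import json

REQUIRED_FIELDS = {
    "sociopolitical",
    "political_ideology",
    "reason_sociopolitical",
    "reason_political_ideology",
}
VALID_SOCIOPOLITICAL = {"sociopolitical", "not sociopolitical"}
VALID_IDEOLOGY = {"democrat", "republican", "unclear"}


def validate_batch_response(raw_response: str, expected_count: int) -> list[dict]:
    """Parse a batched LLM response and check its shape.

    Raises ValueError on any problem so that the caller can retry the batch
    or fall back to classifying the posts one at a time.
    """
    try:
        results = json.loads(raw_response)
    except json.JSONDecodeError as e:
        raise ValueError(f"Response is not valid JSON: {e}") from e
    if not isinstance(results, list):
        raise ValueError("Expected a JSON array of classifications.")
    if len(results) != expected_count:
        raise ValueError(
            f"Expected {expected_count} results, got {len(results)}."
        )
    for i, result in enumerate(results):
        if not isinstance(result, dict):
            raise ValueError(f"Result {i} is not a JSON object.")
        missing = REQUIRED_FIELDS - set(result)
        if missing:
            raise ValueError(f"Result {i} is missing fields: {sorted(missing)}")
        if result["sociopolitical"] not in VALID_SOCIOPOLITICAL:
            raise ValueError(f"Result {i} has an invalid 'sociopolitical' value.")
        if result["political_ideology"] not in VALID_IDEOLOGY:
            raise ValueError(f"Result {i} has an invalid 'political_ideology' value.")
    return results
```

Raising on the first problem keeps the sketch short. A more forgiving version might collect the indices of the bad results and re-run only those posts.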

### Improving context

#### Proposing new ways to add context
- [way 1]

I also considered a few other alternatives and ruled them out for various reasons:
- [alternative 1]

### Batching plus improving the context

#### Calculating and estimating batching requirements


