# Labeling posts with Llama3-8b via Groq

We want to figure out what model would be good for our inference task. We've been experimenting with Llama3-8b via Groq and it seems to perform pretty well for our zero-shot task! Let's see how this performs on our pilot data.

We'll run it on the 361 post links in our pilot data, using the same prompts as in our demo Streamlit app.

We'll use the LiteLLM [Groq](https://litellm.vercel.app/docs/providers/groq) connection to connect to Groq.


In [155]:
import json
import os

import pandas as pd

from ml_tooling.llm.inference import run_query
from ml_tooling.llm.prompt_helper import generate_complete_prompt_for_post_link

In [2]:
MODEL_NAME = "Llama3-8b (via Groq)"
current_wd = os.getcwd()
PILOT_DATA_FP = "../manuscript_pilot/representative_diversification_feed.csv"

In [3]:
pilot_data = pd.read_csv(PILOT_DATA_FP)

In [4]:
bluesky_post_links: list[str] = pilot_data["link"].tolist()

In [413]:
def create_prompts_for_each_link(
    links: list[str], task_name: str
) -> tuple[list[str], dict]:
    """Creates prompts for each link."""
    res = []
    links_to_prompt_map = {}
    for link in links:
        try:
            prompt = generate_complete_prompt_for_post_link(link, task_name)
            res.append(prompt)
            links_to_prompt_map[link] = prompt
        except Exception as e:
            print(f"Error with link {link}: {e}")
            res.append(None)
            continue
    return (res, links_to_prompt_map)

We'll classify civic and political lean in one go. I didn't find any difference between doing it this way vs chaining them (doing civic first and then doing political).

In [None]:
# note: this sends many requests to Bsky to hydrate post and author info, so
# the logs were removed from the output. Takes ~4 minutes to run.
civic_and_political_lean_prompts = create_prompts_for_each_link(
    bluesky_post_links, "both"
)

In [416]:
all_prompts, links_to_prompt_map = civic_and_political_lean_prompts

In [423]:
sum_none = sum([prompt is None for prompt in all_prompts])

In [424]:
sum_none

7

In [417]:
print(len(links_to_prompt_map))
print(len(bluesky_post_links))

354
361


In [418]:
with open("links_to_prompts_map.json", "w") as f:
    json.dump(links_to_prompt_map, f)

Now we can export these prompts as a .jsonl

In [9]:
posts_to_classify: list[dict] = [
    {
        "link": link,
        "prompt": prompt,
        "task_name": "Civic and Political Lean"
    }
    # use only the links whose prompt was generated successfully
    for (link, prompt) in links_to_prompt_map.items()
]

In [11]:
jsonl_filename = "posts_to_classify.jsonl"
jsonl_fp = os.path.join(current_wd, jsonl_filename)

with open(jsonl_fp, "w") as f:
    for post in posts_to_classify:
        f.write(f"{str(post)}\n")

Now let's load the .jsonl to verify that it works

In [12]:
with open(jsonl_fp, 'r') as f:
    lines = f.readlines()

In [14]:
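import ast

# each line is a Python dict literal written with str(), so parse it as a
# literal instead of executing it with eval
jsons_to_classify: list[dict] = [ast.literal_eval(line) for line in lines]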
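
Writing `str(post)` and reading it back as a Python literal works, but the file is then not strictly JSON Lines. Below is a minimal sketch of the same round trip using the standard `json` module. The helper names `write_jsonl` and `read_jsonl` and the demo record are hypothetical, not part of `ml_tooling` or the pilot data.

```python
import json
import os
import tempfile


def write_jsonl(records: list[dict], fp: str) -> None:
    """Write one JSON object per line."""
    with open(fp, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(fp: str) -> list[dict]:
    """Read a JSON Lines file back into a list of dicts."""
    with open(fp, "r") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative record; the link and prompt are placeholders.
demo_records = [
    {
        "link": "https://example.com/post/1",
        "prompt": 'Classify the text "hello"\n',
        "task_name": "Civic and Political Lean",
    }
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_fp = os.path.join(tmp_dir, "demo.jsonl")
    write_jsonl(demo_records, demo_fp)
    round_tripped = read_jsonl(demo_fp)
```

Because `json.dumps` escapes newlines inside strings, each record stays on a single line even though the prompts are multi-line.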

Now that we know that this works, let's start classifying. Let's first start by doing it on the first prompt.

In [16]:
example_post = jsons_to_classify[0]
example_prompt = example_post["prompt"]

In [19]:
example_result = run_query(
    prompt=example_prompt, model_name=MODEL_NAME
)

20:30:42 - LiteLLM:INFO: utils.py:1112 -

POST Request Sent from LiteLLM:
curl -X POST \
https://api.groq.com/openai/v1/ \
-d '{'model': 'llama3-8b-8192', 'messages': [{'role': 'user', 'content': '\n\nPretend that you are a classifier that predicts whether a post has civic content or not. Civic refers to whether a given post is related to politics (government, elections, politicians, activism, etc.) or social issues (major issues that affect a large group of people, such as the economy, inequality, racism, education, immigration, human rights, the environment, etc.). We refer to any content that is classified as being either of these two categories as “civic”; otherwise they are not civic. Please classify the following text denoted in <text> as "civic" or "not civic". \n\nThen, if the post is civic, classify the text based on the political lean of the opinion or argument it presents. Your options are \'left-leaning\', \'moderate\', \'right-leaning\', or \'unclear\'. You 

In [20]:
example_result

'Here is the classification result in JSON format:\n\n{\n    "civic": "civic",\n    "political_ideology": "left-leaning",\n    "reason_civic": "The post references a specific policy proposal by Bernie Sanders, a left-leaning politician, and discusses the benefits of a four-day workweek.",\n    "reason_political_ideology": "The post\'s language and tone, as well as its reference to a left-leaning politician\'s proposal, suggest a left-leaning political ideology."\n}'

We want nicer printing. Ideally, the model also returns only the JSON.
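
One way to tolerate the extra preamble, independent of any prompt changes, is to parse only the outermost brace-delimited span of the response. This is a sketch under that assumption; `extract_json_object` is a hypothetical helper, and the sample response is abbreviated from the output above with an illustrative trailing note.

```python
import json


def extract_json_object(response: str) -> dict:
    """Parse the span from the first '{' to the last '}' as JSON."""
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("No JSON object found in response")
    return json.loads(response[start : end + 1])


chatty = (
    "Here is the classification result in JSON format:\n\n"
    '{\n    "civic": "civic",\n    "political_ideology": "left-leaning"\n}\n\n'
    "Note that I classified the post as civic."
)
parsed = extract_json_object(chatty)
```

This still fails if the JSON itself is malformed, or if the surrounding text contains braces of its own.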

In [21]:
print(example_result)

Here is the classification result in JSON format:

{
    "civic": "civic",
    "political_ideology": "left-leaning",
    "reason_civic": "The post references a specific policy proposal by Bernie Sanders, a left-leaning politician, and discusses the benefits of a four-day workweek.",
    "reason_political_ideology": "The post's language and tone, as well as its reference to a left-leaning politician's proposal, suggest a left-leaning political ideology."
}


Let's try the next result

In [22]:
second_prompt = jsons_to_classify[1]["prompt"]

In [23]:
second_result = run_query(
    prompt=second_prompt, model_name=MODEL_NAME
)

20:32:13 - LiteLLM:INFO: utils.py:1112 -

POST Request Sent from LiteLLM:
curl -X POST \
https://api.groq.com/openai/v1/ \
-d '{'model': 'llama3-8b-8192', 'messages': [{'role': 'user', 'content': '\n\nPretend that you are a classifier that predicts whether a post has civic content or not. Civic refers to whether a given post is related to politics (government, elections, politicians, activism, etc.) or social issues (major issues that affect a large group of people, such as the economy, inequality, racism, education, immigration, human rights, the environment, etc.). We refer to any content that is classified as being either of these two categories as “civic”; otherwise they are not civic. Please classify the following text denoted in <text> as "civic" or "not civic". \n\nThen, if the post is civic, classify the text based on the political lean of the opinion or argument it presents. Your options are \'left-leaning\', \'moderate\', \'right-leaning\', or \'unclear\'. You 

In [24]:
second_result

'Here is the classification result in JSON format:\n\n{\n    "civic": "civic",\n    "political_ideology": "right-leaning",\n    "reason_civic": "The post discusses social issues such as fear of kidnapping/CPS and loss of third spaces for teens, which are civic topics.",\n    "reason_political_ideology": "The post\'s emphasis on individual responsibility and parental protectionism is characteristic of right-leaning ideology."\n}\n\nNote that I classified the post as "civic" because it discusses social issues that affect a large group of people, such as fear of kidnapping/CPS and loss of third spaces for teens. I classified the post as "right-leaning" because it emphasizes individual responsibility and parental protectionism, which are characteristic of right-leaning ideology.'

Let's try to prompt the model to return only JSON format

In [27]:
second_prompt = (
    second_prompt + "\nReturn ONLY the JSON. I will parse the string result in JSON format."
)

In [28]:
second_result = run_query(
    prompt=second_prompt, model_name=MODEL_NAME
)

20:33:35 - LiteLLM:INFO: utils.py:1112 -

POST Request Sent from LiteLLM:
curl -X POST \
https://api.groq.com/openai/v1/ \
-d '{'model': 'llama3-8b-8192', 'messages': [{'role': 'user', 'content': '\n\nPretend that you are a classifier that predicts whether a post has civic content or not. Civic refers to whether a given post is related to politics (government, elections, politicians, activism, etc.) or social issues (major issues that affect a large group of people, such as the economy, inequality, racism, education, immigration, human rights, the environment, etc.). We refer to any content that is classified as being either of these two categories as “civic”; otherwise they are not civic. Please classify the following text denoted in <text> as "civic" or "not civic". \n\nThen, if the post is civic, classify the text based on the political lean of the opinion or argument it presents. Your options are \'left-leaning\', \'moderate\', \'right-leaning\', or \'unclear\'. You 

In [29]:
print(second_result)

{
    "civic": "civic",
    "political_ideology": "right-leaning",
    "reason_civic": "The post discusses social issues affecting teenagers, such as the decline of third spaces and the impact of fear on their social behavior.",
    "reason_political_ideology": "The post's emphasis on individual responsibility and the need for parental supervision suggests a right-leaning perspective."
}


In [31]:
# parse the model output as JSON instead of executing it with eval
second_result_dict: dict = json.loads(second_result)

In [32]:
second_result_dict

{'civic': 'civic',
 'political_ideology': 'right-leaning',
 'reason_civic': 'The post discusses social issues affecting teenagers, such as the decline of third spaces and the impact of fear on their social behavior.',
 'reason_political_ideology': "The post's emphasis on individual responsibility and the need for parental supervision suggests a right-leaning perspective."}

Let's go back and append this instruction to all the prompts

In [33]:
for post in jsons_to_classify:
    post["prompt"] = post["prompt"] + "\nReturn ONLY the JSON. I will parse the string result in JSON format."

In [35]:
print(jsons_to_classify[0]["prompt"])



Pretend that you are a classifier that predicts whether a post has civic content or not. Civic refers to whether a given post is related to politics (government, elections, politicians, activism, etc.) or social issues (major issues that affect a large group of people, such as the economy, inequality, racism, education, immigration, human rights, the environment, etc.). We refer to any content that is classified as being either of these two categories as “civic”; otherwise they are not civic. Please classify the following text denoted in <text> as "civic" or "not civic". 

Then, if the post is civic, classify the text based on the political lean of the opinion or argument it presents. Your options are 'left-leaning', 'moderate', 'right-leaning', or 'unclear'. You are analyzing text that has been pre-identified as 'political' in nature. If the text is not civic, return "unclear".

Think through your response step by step.

Return in a JSON format in the following way:
{
    "civic": <

Now let's export these

In [36]:
jsonl_filename = "posts_to_classify.jsonl"
jsonl_fp = os.path.join(current_wd, jsonl_filename)

with open(jsonl_fp, "w") as f:
    for post in jsons_to_classify:
        f.write(f"{str(post)}\n")

Now let's rerun our inference

In [37]:
example_post = jsons_to_classify[100]
example_prompt = example_post["prompt"]
example_result = run_query(
    prompt=example_prompt, model_name=MODEL_NAME
)

20:38:03 - LiteLLM:INFO: utils.py:1112 -

POST Request Sent from LiteLLM:
curl -X POST \
https://api.groq.com/openai/v1/ \
-d '{'model': 'llama3-8b-8192', 'messages': [{'role': 'user', 'content': '\n\nPretend that you are a classifier that predicts whether a post has civic content or not. Civic refers to whether a given post is related to politics (government, elections, politicians, activism, etc.) or social issues (major issues that affect a large group of people, such as the economy, inequality, racism, education, immigration, human rights, the environment, etc.). We refer to any content that is classified as being either of these two categories as “civic”; otherwise they are not civic. Please classify the following text denoted in <text> as "civic" or "not civic". \n\nThen, if the post is civic, classify the text based on the political lean of the opinion or argument it presents. Your options are \'left-leaning\', \'moderate\', \'right-leaning\', or \'unclear\'. You 

In [39]:
# parse the model output as JSON instead of executing it with eval
example_res: dict = json.loads(example_result)

In [40]:
example_res

{'civic': 'civic',
 'political_ideology': 'left-leaning',
 'reason_civic': "The post references a news article about Elon Musk's charity and tax benefits, indicating a civic concern about government policies and philanthropy.",
 'reason_political_ideology': "The post's tone and language, such as 'it would be cool if laws existed', suggests a left-leaning perspective critical of wealth inequality and corporate influence."}

Great! Let's run this at scale now. Let's also save our intermediate results to see how they do.

In [44]:
inference_results: list[str] = []
total_num_posts = len(jsons_to_classify)

In [None]:
# note: this interacts with the Groq API directly, so this will cost money.
# rate limit is 30 requests/min (https://console.groq.com/settings/limits)
# in total, this took ~16 minutes.
for idx, post in enumerate(jsons_to_classify):
    if idx % 50 == 0:
        print(f"Processing post {idx + 1} of {total_num_posts}")
    try:
        post_result = run_query(
            prompt=post["prompt"], model_name=MODEL_NAME
        )
        inference_results.append(post_result)
    except Exception as e:
        print(f"Error with post {idx + 1}: {e}")
        break
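
The loop above keeps results only in memory and does not pace its requests. Here is a sketch of a variant that appends each result to a checkpoint file and waits between calls. The function name `classify_with_checkpoints`, the checkpoint path, and the 2-second spacing (derived from the 30 requests/min limit) are assumptions. `query_fn` stands in for a call such as `lambda p: run_query(prompt=p, model_name=MODEL_NAME)`.

```python
import json
import time
from typing import Callable, Optional


def classify_with_checkpoints(
    posts: list[dict],
    query_fn: Callable[[str], str],
    checkpoint_fp: str,
    min_interval_s: float = 2.0,
) -> list[dict]:
    """Query each prompt, pacing requests and appending each result to disk."""
    results: list[dict] = []
    with open(checkpoint_fp, "a") as f:
        for idx, post in enumerate(posts):
            started = time.monotonic()
            result: Optional[str]
            try:
                result = query_fn(post["prompt"])
            except Exception as e:
                print(f"Error with post {idx + 1}: {e}")
                result = None  # keep going so results stay aligned with posts
            record = {"link": post["link"], "result": result}
            results.append(record)
            f.write(json.dumps(record) + "\n")
            f.flush()
            # 30 requests/min means at least 2 seconds between requests.
            elapsed = time.monotonic() - started
            if elapsed < min_interval_s:
                time.sleep(min_interval_s - elapsed)
    return results
```

Recording `None` on failure, rather than breaking out of the loop, keeps the results the same length as the input, so a later `zip` of posts and results cannot silently misalign.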

In [404]:
len(jsons_to_classify)

354

In [405]:
len(inference_results)

354

Let's see how it did

In [47]:
# parse the model output as JSON instead of executing it with eval
res = json.loads(inference_results[0])

In [48]:
res

{'civic': 'civic',
 'political_ideology': 'left-leaning',
 'reason_civic': 'The post references a specific policy proposal by a politician (Bernie Sanders) and discusses its potential benefits.',
 'reason_political_ideology': "The post's tone and language, as well as its reference to a left-leaning politician's proposal, suggest a left-leaning political ideology."}

In [49]:
classified_results: list[dict] = [
    {
        **post, "result": result
    }
    for (post, result) in zip(jsons_to_classify, inference_results)
]

In [428]:
len(jsons_to_classify)

354

In [427]:
len(inference_results)

354

In [426]:
len(classified_results)

354

In [425]:
classified_results

[{'link': 'https://bsky.app/profile/jbouie.bsky.social/post/3knqbtrdzrz2n',
  'prompt': '\n\nPretend that you are a classifier that predicts whether a post has civic content or not. Civic refers to whether a given post is related to politics (government, elections, politicians, activism, etc.) or social issues (major issues that affect a large group of people, such as the economy, inequality, racism, education, immigration, human rights, the environment, etc.). We refer to any content that is classified as being either of these two categories as “civic”; otherwise they are not civic. Please classify the following text denoted in <text> as "civic" or "not civic". \n\nThen, if the post is civic, classify the text based on the political lean of the opinion or argument it presents. Your options are \'left-leaning\', \'moderate\', \'right-leaning\', or \'unclear\'. You are analyzing text that has been pre-identified as \'political\' in nature. If the text is not civic, return "unclear".\n\n

Let's hydrate the results from the classification

In [395]:
hydrated_classified_results: list[dict] = []

In [396]:
len(classified_results)

354

In [397]:
for idx, post in enumerate(classified_results):
    hydrated_res = {**post}
    try:
        post_dict = json.loads(post["result"])
        hydrated_res["valid_json_response"] = True
        hydrated_res["hydrated_result"] = post_dict
        hydrated_res["civic_label"] = post_dict["civic"]
        hydrated_res["political_label"] = post_dict["political_ideology"]
        hydrated_res["reason_civic_label"] = post_dict.get("reason_civic", None)
        hydrated_res["reason_political_label"] = post_dict.get("reason_political_ideology", None)
    except Exception as e:
        # sometimes the LLM returns an invalid JSON response
        print(f"Error with post {post['link']} at index {idx}: {e}")
        hydrated_res["valid_json_response"] = False
        hydrated_res["hydrated_result"] = None
        hydrated_res["civic_label"] = None
        hydrated_res["political_label"] = None
        hydrated_res["reason_civic_label"] = None
        hydrated_res["reason_political_label"] = None
    finally:
        hydrated_classified_results.append(hydrated_res)

Error with post https://bsky.app/profile/blueheronfarm.bsky.social/post/3knmci6lfht2e at index 31: Expecting ',' delimiter: line 5 column 197 (char 371)
Error with post https://bsky.app/profile/brendelbored.bsky.social/post/3knobbmewi22h at index 59: Expecting ',' delimiter: line 5 column 182 (char 413)
Error with post https://bsky.app/profile/kjhealy.bsky.social/post/3knl7425rb22z at index 93: Expecting ',' delimiter: line 5 column 202 (char 423)
Error with post https://bsky.app/profile/jbouie.bsky.social/post/3knstcm42kk2h at index 128: Expecting ',' delimiter: line 5 column 170 (char 420)
Error with post https://bsky.app/profile/kevinmkruse.bsky.social/post/3knonbpmxos2e at index 164: Expecting ',' delimiter: line 5 column 122 (char 306)
Error with post https://bsky.app/profile/rem.postes.club/post/3knt6owr3ft2e at index 228: Expecting ',' delimiter: line 5 column 196 (char 411)
Error with post https://bsky.app/profile/its.cassie.baby/post/3kntjyscy5n2v at index 269: Expecting ',' d
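
The `Expecting ',' delimiter` errors land partway through a line. That is what `json.loads` typically reports when a string value contains an unescaped double quote, though this is an assumption since the failing responses aren't shown here. A sketch of a fallback that recovers just the two labels with regular expressions follows; `extract_labels_fallback` and the sample response are hypothetical.

```python
import json
import re

LABEL_PATTERNS = {
    "civic": re.compile(r'"civic"\s*:\s*"([^"]*)"'),
    "political_ideology": re.compile(r'"political_ideology"\s*:\s*"([^"]*)"'),
}


def extract_labels_fallback(response: str) -> dict:
    """Try json.loads first; on failure, pull the two label fields out by regex."""
    try:
        parsed = json.loads(response)
        return {key: parsed.get(key) for key in LABEL_PATTERNS}
    except json.JSONDecodeError:
        labels = {}
        for key, pattern in LABEL_PATTERNS.items():
            match = pattern.search(response)
            labels[key] = match.group(1) if match else None
        return labels


# Illustrative malformed response: the reason contains unescaped double quotes.
broken = (
    '{\n    "civic": "civic",\n    "political_ideology": "unclear",\n'
    '    "reason_civic": "The post says "vote" twice."\n}'
)
recovered = extract_labels_fallback(broken)
```

The free-text reasons are dropped in the fallback path, since they are the fields most likely to carry the stray quotes.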

In [409]:
classified_results[52]

{'link': 'https://bsky.app/profile/theradr.bsky.social/post/3knk7aq4twk24',
 'prompt': '\n\nPretend that you are a classifier that predicts whether a post has civic content or not. Civic refers to whether a given post is related to politics (government, elections, politicians, activism, etc.) or social issues (major issues that affect a large group of people, such as the economy, inequality, racism, education, immigration, human rights, the environment, etc.). We refer to any content that is classified as being either of these two categories as “civic”; otherwise they are not civic. Please classify the following text denoted in <text> as "civic" or "not civic". \n\nThen, if the post is civic, classify the text based on the political lean of the opinion or argument it presents. Your options are \'left-leaning\', \'moderate\', \'right-leaning\', or \'unclear\'. You are analyzing text that has been pre-identified as \'political\' in nature. If the text is not civic, return "unclear".\n\nT

In [408]:
hydrated_classified_results[52]

{'link': 'https://bsky.app/profile/theradr.bsky.social/post/3knk7aq4twk24',
 'prompt': '\n\nPretend that you are a classifier that predicts whether a post has civic content or not. Civic refers to whether a given post is related to politics (government, elections, politicians, activism, etc.) or social issues (major issues that affect a large group of people, such as the economy, inequality, racism, education, immigration, human rights, the environment, etc.). We refer to any content that is classified as being either of these two categories as “civic”; otherwise they are not civic. Please classify the following text denoted in <text> as "civic" or "not civic". \n\nThen, if the post is civic, classify the text based on the political lean of the opinion or argument it presents. Your options are \'left-leaning\', \'moderate\', \'right-leaning\', or \'unclear\'. You are analyzing text that has been pre-identified as \'political\' in nature. If the text is not civic, return "unclear".\n\nT

In [410]:
print(hydrated_classified_results[52]["prompt"])



Pretend that you are a classifier that predicts whether a post has civic content or not. Civic refers to whether a given post is related to politics (government, elections, politicians, activism, etc.) or social issues (major issues that affect a large group of people, such as the economy, inequality, racism, education, immigration, human rights, the environment, etc.). We refer to any content that is classified as being either of these two categories as “civic”; otherwise they are not civic. Please classify the following text denoted in <text> as "civic" or "not civic". 

Then, if the post is civic, classify the text based on the political lean of the opinion or argument it presents. Your options are 'left-leaning', 'moderate', 'right-leaning', or 'unclear'. You are analyzing text that has been pre-identified as 'political' in nature. If the text is not civic, return "unclear".

Think through your response step by step.

Return in a JSON format in the following way:
{
    "civic": <

In [399]:
len(hydrated_classified_results)

354

Now let's export our results.

In [401]:
llama3_8b_classified_posts_filename = "classified_posts_llama3_8b.jsonl"
llama3_8b_classified_posts_fp = os.path.join(
    current_wd, llama3_8b_classified_posts_filename
)

with open(llama3_8b_classified_posts_fp, "w") as f:
    for post in hydrated_classified_results:
        # write real JSON so each line of the .jsonl parses with json.loads
        f.write(json.dumps(post) + "\n")

In [402]:
llama3_8b_classified_posts_df: pd.DataFrame = pd.DataFrame(
    hydrated_classified_results
)

In [403]:
llama3_8b_classified_posts_df.to_csv("classified_posts_llama3_8b.csv")

In [386]:
len(hydrated_classified_results)

708

Now let's see how many valid JSONs there were

In [80]:
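# the DataFrame built above, under a shorter name
classified_posts_df: pd.DataFrame = llama3_8b_classified_posts_df
num_valid_jsons: int = sum(classified_posts_df["valid_json_response"])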

In [81]:
print(f"Number of valid JSON responses: {num_valid_jsons}")
print(f"Total number of posts: {len(classified_posts_df)}")
print(f"Proportion of valid JSON responses: {num_valid_jsons / len(classified_posts_df)}")

Number of valid JSON responses: 335
Total number of posts: 354
Proportion of valid JSON responses: 0.9463276836158192


Now let's compare results to the GPT labels

In [77]:
pilot_data_cols = ["link", "civic", "political_ideology"]

In [78]:
joined_df: pd.DataFrame = pd.merge(
    pilot_data[pilot_data_cols],
    classified_posts_df, left_on="link", right_on="link"
)

In [79]:
joined_df.head()

,link,civic,political_ideology,prompt,task_name,result,hydrated_result,civic_label,political_label,reason_civic_label,reason_political_label,valid_json_response
0,https://bsky.app/profile/jbouie.bsky.social/po...,True,left-leaning,\n\nPretend that you are a classifier that pre...,Civic and Political Lean,"{\n ""civic"": ""civic"",\n ""political_ideol...","{'civic': 'civic', 'political_ideology': 'left...",civic,left-leaning,The post references a specific policy proposal...,"The post's tone and language, as well as its r...",True
1,https://bsky.app/profile/lethalityjane.bsky.so...,True,right-leaning,\n\nPretend that you are a classifier that pre...,Civic and Political Lean,"{\n ""civic"": ""civic"",\n ""political_ideol...","{'civic': 'civic', 'political_ideology': 'righ...",civic,right-leaning,The post discusses social issues affecting tee...,The post's emphasis on individual responsibili...,True
2,https://bsky.app/profile/esqueer.bsky.social/p...,True,left-leaning,\n\nPretend that you are a classifier that pre...,Civic and Political Lean,"{\n ""civic"": ""civic"",\n ""political_ideol...","{'civic': 'civic', 'political_ideology': 'left...",civic,left-leaning,"The post discusses censorship and neo-nazism, ...",The post's criticism of neo-nazism and its ass...,True
3,https://bsky.app/profile/stuflemingnz.bsky.soc...,False,,\n\nPretend that you are a classifier that pre...,Civic and Political Lean,"{\n ""civic"": ""not civic"",\n ""political_i...","{'civic': 'not civic', 'political_ideology': '...",not civic,unclear,,,True
4,https://bsky.app/profile/sararoseg.bsky.social...,True,left-leaning,\n\nPretend that you are a classifier that pre...,Civic and Political Lean,"{\n ""civic"": ""civic"",\n ""political_ideol...","{'civic': 'civic', 'political_ideology': 'left...",civic,left-leaning,The post discusses historical events related t...,The post presents a critical view of the Nazi ...,True


Let's only get the ones that we got valid results from 

In [85]:
filtered_results_df: pd.DataFrame = joined_df[joined_df["valid_json_response"]]

Now, given these, let's compare the labels from GPT4 to the labels from Llama3-8b

In [87]:
# from GPT4
print(filtered_results_df["civic"].value_counts())
print(filtered_results_df["political_ideology"].value_counts())

civic
False    173
True     162
Name: count, dtype: int64
political_ideology
left-leaning     129
unclear           16
right-leaning     15
 left-leaning      1
moderate           1
Name: count, dtype: int64


In [86]:
# from Llama3-8b
print(filtered_results_df["civic_label"].value_counts())
print(filtered_results_df["political_label"].value_counts())

civic_label
civic        189
not civic    146
Name: count, dtype: int64
political_label
unclear          175
left-leaning     123
right-leaning     34
moderate           3
Name: count, dtype: int64


In [202]:
# only those with civic_label = "civic"
civic_llama3_8b_df = filtered_results_df[filtered_results_df["civic_label"] == "civic"]

In [203]:
civic_llama3_8b_df["political_label"].value_counts()

political_label
left-leaning     123
right-leaning     34
unclear           29
moderate           3
Name: count, dtype: int64

Let's fix some of the labels from the pilot data. There is a " left-leaning" value that should be "left-leaning"

In [91]:
# in the "political_ideology" column, there is a " left-leaning" value that needs
# to be replaced with "left-leaning". Work on an explicit copy so that pandas
# does not warn about assigning to a slice of joined_df.
filtered_results_df = filtered_results_df.copy()
filtered_results_df["political_ideology"] = filtered_results_df["political_ideology"].replace(" left-leaning", "left-leaning")


In [92]:
print(filtered_results_df["political_ideology"].value_counts())

political_ideology
left-leaning     130
unclear           16
right-leaning     15
moderate           1
Name: count, dtype: int64


OK, now let's compare civic-ness between both models using a confusion matrix

In [93]:
# create a cross-tab of the "civic" and "civic_label" columns. The "civic"
# column holds the GPT-4 label and "civic_label" holds the Llama3-8b label.
# "civic" is a boolean, so convert "civic_label" to a boolean for this first
# pass (not in place, only in the crosstab).
civic_crosstab: pd.DataFrame = pd.crosstab(
    filtered_results_df["civic"],
    filtered_results_df["civic_label"].apply(lambda x: x == "civic")
)

In [97]:
gpt_4_civic_results = filtered_results_df["civic"]
gpt_4_civic_results = gpt_4_civic_results.replace(True, "civic")
gpt_4_civic_results = gpt_4_civic_results.replace(False, "not civic")
llama_3_8b_civic_results = filtered_results_df["civic_label"]

In [109]:
# compare gpt4 to llama results. Name each axis based on the model that it
# came from ("GPT-4" or "Llama3-8b")
civic_crosstab: pd.DataFrame = pd.crosstab(
    gpt_4_civic_results, llama_3_8b_civic_results,
    colnames=["Llama3-8b (columns)"], rownames=["GPT-4 (rows)"],
    margins=True
)


In [110]:
# columns are Llama3-8b, rows are GPT-4.
# there are a total of 189 civic posts from Llama (162 from GPT4)
civic_crosstab

GPT-4 (rows) \ Llama3-8b (columns),civic,not civic,All
civic,106,56,162
not civic,83,90,173
All,189,146,335
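
To make explicit what the cross-tab counts, here is a small standard-library sketch of the same tabulation: each cell is the number of posts with a given (GPT-4 label, Llama3-8b label) pair. The five labels below are illustrative only, not rows from the pilot data.

```python
from collections import Counter

# Illustrative labels only; these are not rows from the pilot data.
gpt4_labels = ["civic", "civic", "not civic", "not civic", "civic"]
llama_labels = ["civic", "not civic", "civic", "not civic", "civic"]

# Count how often each (row label, column label) pair occurs.
pair_counts = Counter(zip(gpt4_labels, llama_labels))

categories = ["civic", "not civic"]
# Rows are GPT-4 labels, columns are Llama3-8b labels.
table = [[pair_counts[(row, col)] for col in categories] for row in categories]

for row_label, row in zip(categories, table):
    print(row_label, row)
```

The diagonal cells are the posts on which the two models agree.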


Now let's calculate the cross-tab as a proportion

In [107]:
civic_crosstab_props: pd.DataFrame = pd.crosstab(
    gpt_4_civic_results, llama_3_8b_civic_results,
    colnames=["Llama3-8b (columns)"], rownames=["GPT-4 (rows)"],
    margins=True, normalize="all"
)

In [108]:
civic_crosstab_props

GPT-4 \ Llama3-8b,civic,not civic,All
civic,0.316418,0.167164,0.483582
not civic,0.247761,0.268657,0.516418
All,0.564179,0.435821,1.0


Now let's calculate precision, recall, and F1 score based on the confusion matrix

In [111]:
from sklearn.metrics import precision_recall_fscore_support

In [115]:
y_true = gpt_4_civic_results.tolist()
y_pred = llama_3_8b_civic_results.tolist()

In [117]:
y_true[0:5]

['civic', 'civic', 'civic', 'not civic', 'civic']

In [118]:
civic_metrics = precision_recall_fscore_support(
    y_true=y_true, y_pred=y_pred, average="binary", pos_label="civic"
)

In [120]:
civic_precision, civic_recall, civic_fbeta_score, civic_support = civic_metrics

In [123]:
print(f"Precision: {civic_precision}\tRecall: {civic_recall}\tF-1 score: {civic_fbeta_score}\tSupport: {civic_support}")

Precision: 0.5608465608465608	Recall: 0.654320987654321	F-1 score: 0.603988603988604	Support: None
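
As a check, the same numbers can be recomputed by hand from the cross-tab counts. This treats the GPT-4 label as ground truth and "civic" as the positive class, as in the `precision_recall_fscore_support` call above.

```python
# Counts read off the GPT-4 (rows) vs. Llama3-8b (columns) cross-tab.
true_positives = 106   # GPT-4 civic, Llama3-8b civic
false_negatives = 56   # GPT-4 civic, Llama3-8b not civic
false_positives = 83   # GPT-4 not civic, Llama3-8b civic

precision = true_positives / (true_positives + false_positives)  # 106 / 189
recall = true_positives / (true_positives + false_negatives)     # 106 / 162
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.4f}\tRecall: {recall:.4f}\tF-1 score: {f1:.4f}")
```

Precision is low mainly because Llama3-8b labels 83 posts as civic that GPT-4 labels as not civic.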


Let's now do the same comparison for political ideology. For the political ideology, we only want posts that are civic. Let's take the posts that both GPT4 and Llama3-8b agreed are civic.

In [126]:
# both "civic_label" == "civic" and "civic" == True
civic_posts: pd.DataFrame = filtered_results_df[
    (filtered_results_df["civic_label"] == "civic")
    & (filtered_results_df["civic"] == True)
]

In [127]:
civic_posts = civic_posts[["political_ideology", "political_label"]].rename(
    columns={"political_ideology": "GPT-4 political ideology", "political_label": "Llama3-8b political ideology"}
)

In [128]:
civic_posts.head()

,GPT-4 political ideology,Llama3-8b political ideology
0,left-leaning,left-leaning
1,right-leaning,right-leaning
2,left-leaning,left-leaning
4,left-leaning,left-leaning
5,unclear,left-leaning


In [130]:
print(civic_posts["GPT-4 political ideology"].value_counts())
print(civic_posts["Llama3-8b political ideology"].value_counts())

GPT-4 political ideology
left-leaning     83
right-leaning    11
unclear          11
moderate          1
Name: count, dtype: int64
Llama3-8b political ideology
left-leaning     70
right-leaning    20
unclear          14
moderate          2
Name: count, dtype: int64


In [133]:
political_ideology_confusion_matrix = pd.crosstab(
    civic_posts["GPT-4 political ideology"], civic_posts["Llama3-8b political ideology"],
    rownames=["GPT-4 political ideology (rows)"], colnames=["Llama3-8b political ideology (columns)"],
    margins=True
)

In [134]:
political_ideology_confusion_matrix

GPT-4 political ideology (rows) \ Llama3-8b political ideology (columns),left-leaning,moderate,right-leaning,unclear,All
left-leaning,62,0,12,9,83
moderate,0,0,1,0,1
right-leaning,2,2,5,2,11
unclear,6,0,2,3,11
All,70,2,20,14,106


In [135]:
political_ideology_confusion_matrix_props = pd.crosstab(
    civic_posts["GPT-4 political ideology"], civic_posts["Llama3-8b political ideology"],
    rownames=["GPT-4 political ideology (rows)"], colnames=["Llama3-8b political ideology (columns)"],
    margins=True, normalize="all"
)

In [136]:
political_ideology_confusion_matrix_props

GPT-4 political ideology (rows) \ Llama3-8b political ideology (columns),left-leaning,moderate,right-leaning,unclear,All
left-leaning,0.584906,0.0,0.113208,0.084906,0.783019
moderate,0.0,0.0,0.009434,0.0,0.009434
right-leaning,0.018868,0.018868,0.04717,0.018868,0.103774
unclear,0.056604,0.0,0.018868,0.028302,0.103774
All,0.660377,0.018868,0.188679,0.132075,1.0


In [137]:
political_ideology_metrics = precision_recall_fscore_support(
    y_true=civic_posts["GPT-4 political ideology"].tolist(),
    y_pred=civic_posts["Llama3-8b political ideology"].tolist(),
    average="weighted"
)

In [138]:
# precision_recall_fscore_support returns (precision, recall, fbeta, support),
# in that order.
(
    political_ideology_precision,
    political_ideology_recall,
    political_ideology_fbeta_score,
    political_ideology_support
) = political_ideology_metrics

In [139]:
print(f"Precision: {political_ideology_precision}\tRecall: {political_ideology_recall}\tF-1 score: {political_ideology_fbeta_score}\tSupport: {political_ideology_support}")

Precision: 0.7417115902964959	Recall: 0.660377358490566	F-1 score: 0.6929845372922957	Support: None
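
As a sanity check, the weighted metrics can be re-derived by hand from the confusion matrix counts above. The sketch below uses only those printed counts: it computes per-class precision, recall, and F1, then averages them weighted by each class's support in the GPT-4 labels, which is what `average="weighted"` does. One useful consequence is that weighted recall reduces to overall accuracy (the diagonal sum over the total, 70/106). The helper `safe_div` is a name I made up for this illustration.

```python
# Counts copied from the confusion matrix above
# (rows: GPT-4 labels, columns: Llama3-8b labels).
labels = ["left-leaning", "moderate", "right-leaning", "unclear"]
counts = [
    [62, 0, 12, 9],  # GPT-4: left-leaning
    [0, 0, 1, 0],    # GPT-4: moderate
    [2, 2, 5, 2],    # GPT-4: right-leaning
    [6, 0, 2, 3],    # GPT-4: unclear
]

total = sum(sum(row) for row in counts)
row_totals = [sum(row) for row in counts]        # support per true class
col_totals = [sum(col) for col in zip(*counts)]  # predictions per class


def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


weighted_precision = weighted_recall = weighted_f1 = 0.0
for i, label in enumerate(labels):
    true_positives = counts[i][i]
    precision = safe_div(true_positives, col_totals[i])
    recall = safe_div(true_positives, row_totals[i])
    f1 = safe_div(2 * precision * recall, precision + recall)
    weight = row_totals[i] / total  # share of true labels in this class
    weighted_precision += weight * precision
    weighted_recall += weight * recall
    weighted_f1 += weight * f1
    print(f"{label:>13}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

print(f"weighted precision={weighted_precision:.4f}")
print(f"weighted recall={weighted_recall:.4f}")
print(f"weighted F1={weighted_f1:.4f}")
```

This gives a weighted precision of about 0.742, a weighted recall of about 0.660, and a weighted F1 of about 0.693. The per-class numbers also show that almost all of the score comes from the left-leaning class; the other three classes have very little support.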


## Comparing against Llama3-70b labels

Now that I think about it, comparing the Llama3-8b results to the GPT-4 labels isn't a fair comparison: the GPT-4 labels were produced without any context, since I just copied and pasted the post text into ChatGPT myself. A better comparison is against a larger model that takes in the same prompt (i.e., Llama3-70b), as well as against hand-labeled samples. I'll hand-label the samples while the Llama3-70b labeling runs.

In [143]:
print(jsons_to_classify[0]["prompt"])



Pretend that you are a classifier that predicts whether a post has civic content or not. Civic refers to whether a given post is related to politics (government, elections, politicians, activism, etc.) or social issues (major issues that affect a large group of people, such as the economy, inequality, racism, education, immigration, human rights, the environment, etc.). We refer to any content that is classified as being either of these two categories as “civic”; otherwise they are not civic. Please classify the following text denoted in <text> as "civic" or "not civic". 

Then, if the post is civic, classify the text based on the political lean of the opinion or argument it presents. Your options are 'left-leaning', 'moderate', 'right-leaning', or 'unclear'. You are analyzing text that has been pre-identified as 'political' in nature. If the text is not civic, return "unclear".

Think through your response step by step.

Return in a JSON format in the following way:
{
    "civic": <

We need to reload our helper module so that it picks up the new model (Llama3-70b).

In [157]:
from importlib import reload
from ml_tooling.llm import inference
reload(inference)


<module 'ml_tooling.llm.inference' from '/Users/mark/Documents/work/bluesky-research/ml_tooling/llm/inference.py'>

In [158]:
inference.BACKEND_OPTIONS

{'Gemini': {'model': 'gemini/gemini-pro',
  'kwargs': {'temperature': 0.0,
   'safety_settings': [{'category': 'HARM_CATEGORY_HARASSMENT',
     'threshold': 'BLOCK_NONE'},
    {'category': 'HARM_CATEGORY_HATE_SPEECH', 'threshold': 'BLOCK_NONE'},
    {'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT', 'threshold': 'BLOCK_NONE'},
    {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT',
     'threshold': 'BLOCK_NONE'}]}},
 'Llama3-8B (via HuggingFace) (NOT SUPPORTED YET)': {'model': 'huggingface/unsloth/llama-3-8b',
  'kwargs': {'api_base': 'https://api-inference.huggingface.co/models/unsloth/llama-3-8b'}},
 'Mixtral 8x22B (via HuggingFace)': {'model': 'huggingface/mistralai/Mixtral-8x22B-v0.1',
  'kwargs': {'api_base': 'https://api-inference.huggingface.co/models/mistralai/Mixtral-8x22B-v0.1'}},
 'Llama3-8b (via Groq)': {'model': 'groq/llama3-8b-8192',
  'kwargs': {'temperature': 0.0, 'response_format': {'type': 'json_object'}}},
 'Llama3-70b (via Groq)': {'model': 'groq/llama3-70b-8192',
  'kwar

In [159]:
llama3_70b_labels: list[str] = []
LARGE_MODEL_NAME = "Llama3-70b (via Groq)"

In [160]:
example_query_res = run_query(
    prompt=jsons_to_classify[0]["prompt"], model_name=LARGE_MODEL_NAME
)

22:55:01 - LiteLLM:INFO: utils.py:1112 -

POST Request Sent from LiteLLM:
curl -X POST \
https://api.groq.com/openai/v1/ \
-d '{'model': 'llama3-70b-8192', 'messages': [{'role': 'user', 'content': '\n\nPretend that you are a classifier that predicts whether a post has civic content or not. Civic refers to whether a given post is related to politics (government, elections, politicians, activism, etc.) or social issues (major issues that affect a large group of people, such as the economy, inequality, racism, education, immigration, human rights, the environment, etc.). We refer to any content that is classified as being either of these two categories as “civic”; otherwise they are not civic. Please classify the following text denoted in <text> as "civic" or "not civic". \n\nThen, if the post is civic, classify the text based on the political lean of the opinion or argument it presents. Your options are \'left-leaning\', \'moderate\', \'right-leaning\', or \'unclear\'. You

In [162]:
import json

# parse the response as JSON rather than eval()-ing model output
json.loads(example_query_res)

{'civic': 'civic',
 'political_ideology': 'left-leaning',
 'reason_civic': 'The post discusses a policy proposal related to work hours, which is a social issue.',
 'reason_political_ideology': 'The post supports a proposal by Bernie Sanders, a left-leaning politician, and promotes a progressive idea.'}

OK, this looks great! Let's let this run for a bit.

In [None]:
# run queries against Llama3-70b model. This will take a while since we're
# making a lot of requests to Groq. Takes ~30 minutes.
for idx, post in enumerate(jsons_to_classify):
    if idx % 50 == 0:
        print(f"Processing post {idx + 1} of {total_num_posts}")
    prompt = post["prompt"]
    try:
        result = run_query(prompt=prompt, model_name=LARGE_MODEL_NAME)
        llama3_70b_labels.append(result)
    except Exception as e:
        # for the ones that failed, this happens because we set a requirement
        # that it must be valid JSON. Some of the responses are not JSON, so
        # Groq throws an error and tells us that the result wasn't JSON.
        print(f"Error with post {post['link']}: {e}")
        llama3_70b_labels.append("")
        continue

Let's check our results.

In [164]:
len(llama3_70b_labels)

354

In [167]:
large_model_results: list[dict] = [
    {
        **post, "result": result
    }
    for (post, result) in zip(jsons_to_classify, llama3_70b_labels)
]

In [190]:
hydrated_large_model_results: list[dict] = []

In [191]:
import json

In [192]:
for idx, post in enumerate(large_model_results):
    hydrated_res = {**post}
    try:
        post_dict = json.loads(post["result"])
        hydrated_res["valid_json_response"] = True
        hydrated_res["hydrated_result"] = post_dict
        hydrated_res["civic_label"] = post_dict["civic"]
        hydrated_res["political_label"] = post_dict["political_ideology"]
        hydrated_res["reason_civic_label"] = post_dict.get("reason_civic", None)
        hydrated_res["reason_political_label"] = post_dict.get("reason_political_ideology", None)
    except Exception as e:
        # sometimes the LLM returns an invalid JSON response
        print(f"Error with post {post['link']} at index {idx}: {e}")
        hydrated_res["valid_json_response"] = False
        hydrated_res["hydrated_result"] = None
        hydrated_res["civic_label"] = None
        hydrated_res["political_label"] = None
        hydrated_res["reason_civic_label"] = None
        hydrated_res["reason_political_label"] = None
    finally:
        hydrated_large_model_results.append(hydrated_res)

Error with post https://bsky.app/profile/jbouie.bsky.social/post/3knovwzjb4c2j at index 7: Expecting value: line 1 column 1 (char 0)
Error with post https://bsky.app/profile/luckytran.bsky.social/post/3knglwgkla326 at index 28: Expecting value: line 1 column 1 (char 0)
Error with post https://bsky.app/profile/mommunism.bsky.social/post/3knkvugzd2k2n at index 71: Expecting value: line 1 column 1 (char 0)
Error with post https://bsky.app/profile/stevevladeck.bsky.social/post/3kntbnvmx4c2j at index 145: Expecting value: line 1 column 1 (char 0)
Error with post https://bsky.app/profile/kenwhite.bsky.social/post/3knol6vhq7k2t at index 163: Expecting value: line 1 column 1 (char 0)
Error with post https://bsky.app/profile/blakeprof.bsky.social/post/3kntcx3hqx323 at index 198: Expecting value: line 1 column 1 (char 0)
Error with post https://bsky.app/profile/juliusgoat.bsky.social/post/3knsvuoelgl2q at index 257: Expecting value: line 1 column 1 (char 0)
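
Valid JSON isn't quite the same as a usable label: the model could return well-formed JSON with a label outside the options given in the prompt, and the loop above would accept it. A stricter parser might look like the sketch below. The function name and the constants are hypothetical; the allowed label sets are taken from the prompt text.

```python
import json
from typing import Optional

# Allowed values, as listed in the classification prompt.
CIVIC_LABELS = {"civic", "not civic"}
POLITICAL_LABELS = {"left-leaning", "moderate", "right-leaning", "unclear"}


def parse_label_response(raw: str) -> Optional[dict]:
    """Return the parsed labels, or None if the response is unusable."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(parsed, dict):
        return None
    if parsed.get("civic") not in CIVIC_LABELS:
        return None
    if parsed.get("political_ideology") not in POLITICAL_LABELS:
        return None
    return parsed


good = parse_label_response(
    '{"civic": "civic", "political_ideology": "left-leaning"}'
)
empty = parse_label_response("")  # what gets stored for a failed request
off_vocab = parse_label_response('{"civic": "yes", "political_ideology": "left"}')
```

Returning `None` for every failure mode keeps the calling code simple: a single check decides whether `valid_json_response` should be `True` or `False`.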


In [195]:
# the ones that failed are the ones that didn't have a valid JSON format.
large_model_results[7]

{'link': 'https://bsky.app/profile/jbouie.bsky.social/post/3knovwzjb4c2j',
 'prompt': '\n\nPretend that you are a classifier that predicts whether a post has civic content or not. Civic refers to whether a given post is related to politics (government, elections, politicians, activism, etc.) or social issues (major issues that affect a large group of people, such as the economy, inequality, racism, education, immigration, human rights, the environment, etc.). We refer to any content that is classified as being either of these two categories as “civic”; otherwise they are not civic. Please classify the following text denoted in <text> as "civic" or "not civic". \n\nThen, if the post is civic, classify the text based on the political lean of the opinion or argument it presents. Your options are \'left-leaning\', \'moderate\', \'right-leaning\', or \'unclear\'. You are analyzing text that has been pre-identified as \'political\' in nature. If the text is not civic, return "unclear".\n\nTh

Let's spot-check these against the results of the smaller model:

In [193]:
len(hydrated_large_model_results)

354

In [176]:
print(large_model_results[0]["result"])
print(classified_results[0]["result"])

{
"civic": "civic",
"political_ideology": "left-leaning",
"reason_civic": "The post discusses a policy proposal related to work hours, which is a social issue.",
"reason_political_ideology": "The post supports a proposal by Bernie Sanders, a left-leaning politician, and promotes a progressive idea."
}
{
    "civic": "civic",
    "political_ideology": "left-leaning",
    "reason_civic": "The post references a specific policy proposal by a politician (Bernie Sanders) and discusses its potential benefits.",
    "reason_political_ideology": "The post's tone and language, as well as its reference to a left-leaning politician's proposal, suggest a left-leaning political ideology."
}


Now let's save these results. We'll evaluate them once all the hand-labeled samples are done.

In [194]:
large_model_results_filename = "classified_posts_llama3_70b.jsonl"
large_model_results_fp = os.path.join(current_wd, large_model_results_filename)

with open(large_model_results_fp, "w") as f:
    for post in hydrated_large_model_results:
        # json.dumps writes valid JSON; str() would write a Python dict repr
        f.write(json.dumps(post) + "\n")

In [196]:
large_model_df: pd.DataFrame = pd.DataFrame(hydrated_large_model_results)

num_valid_jsons: int = sum(large_model_df["valid_json_response"])

print(f"Number of valid JSON responses: {num_valid_jsons}")
print(f"Total number of posts: {len(large_model_df)}")
print(f"Proportion of valid JSON responses: {num_valid_jsons / len(large_model_df)}")

Number of valid JSON responses: 347
Total number of posts: 354
Proportion of valid JSON responses: 0.980225988700565


In [385]:
large_model_df.to_csv("classified_posts_llama3_70b.csv")

Let's get a sense of the posts that it labels as civic.

In [198]:
filtered_large_model_df = large_model_df[large_model_df["valid_json_response"]]

In [199]:
filtered_large_model_df["civic_label"].value_counts()

civic_label
civic        200
not civic    147
Name: count, dtype: int64

In [200]:
civic_large_model_df = filtered_large_model_df[
    filtered_large_model_df["civic_label"] == "civic"
]

In [201]:
civic_large_model_df["political_label"].value_counts()

political_label
left-leaning     173
right-leaning     11
unclear            9
moderate           7
Name: count, dtype: int64

Let's compare these to the base rates for Llama3-8b and the pilot data

## Loading hand-labeled samples and comparing to LLM labels

I hand-labeled the pilot data posts as civic/not-civic and the political lean of the civic posts myself.

## Comparing all models against the hand-labeled results.

We have classifications from three LLMs:
1. GPT-4 (with no context)
2. Llama3-8b (with context)
3. Llama3-70b

Let's compare these to the hand-labeled examples.

For each, let's load their civic and their political ideology labels.

Let's only use posts that have labels from all of the models (i.e., let's filter out the posts for which a model wasn't able to return valid JSON).

GPT-4 (pilot data)

In [305]:
filtered_gpt4_df = pilot_data.copy()
# civic_gpt4_df = filtered_gpt4_df[
#     filtered_gpt4_df["civic"] == True
# ]

In [306]:
# clean up pilot data results to match others
filtered_gpt4_df["political_ideology"] = (
    filtered_gpt4_df["political_ideology"].replace(
        " left-leaning", "left-leaning"
    )
)
filtered_gpt4_df["political_ideology"] = (
    filtered_gpt4_df["political_ideology"].fillna("unclear")
)
filtered_gpt4_df["civic"] = (
    filtered_gpt4_df["civic"].replace(True, "civic")
)
filtered_gpt4_df["civic"] = (
    filtered_gpt4_df["civic"].replace(False, "not civic")
)
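
The same cleanup can be written a bit more defensively. This is a sketch on a toy frame (the values are made up): `.str.strip()` removes stray whitespace from any label rather than only the one `" left-leaning"` variant, and `.map` converts both booleans in one pass.

```python
import pandas as pd

# Toy stand-in for the pilot data columns being cleaned.
raw = pd.DataFrame({
    "civic": [True, False, True],
    "political_ideology": [" left-leaning", None, "right-leaning "],
})

cleaned = raw.copy()
# Strip whitespace from every label, then treat missing labels as "unclear".
cleaned["political_ideology"] = (
    cleaned["political_ideology"].str.strip().fillna("unclear")
)
# Convert the boolean flag into the string labels used by the Llama models.
cleaned["civic"] = cleaned["civic"].map({True: "civic", False: "not civic"})

print(cleaned)
```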

In [307]:
filtered_gpt4_df["civic"].value_counts()

civic
not civic    187
civic        174
Name: count, dtype: int64

In [308]:
filtered_gpt4_df["political_ideology"].value_counts()

political_ideology
unclear          205
left-leaning     140
right-leaning     15
moderate           1
Name: count, dtype: int64

In [309]:
gpt4_df_subset = filtered_gpt4_df[
    ["link", "civic", "political_ideology"]
]
gpt4_df_subset = gpt4_df_subset.rename(
    columns={
        "civic": "gpt4_civic_label",
        "political_ideology": "gpt4_political_label"
    }
)

In [287]:
gpt4_df_subset.columns

Index(['link', 'gpt4_civic_label', 'gpt4_political_label'], dtype='object')

Llama3-8b

In [288]:
# only those with valid_json_response
filtered_llama3_8b_df = classified_posts_df[classified_posts_df["valid_json_response"]]
# civic_llama3_8b_df = filtered_llama3_8b_df[
#     filtered_llama3_8b_df["civic_label"] == "civic"
# ]

In [289]:
filtered_llama3_8b_df["civic_label"].value_counts()

civic_label
civic        189
not civic    146
Name: count, dtype: int64

In [290]:
filtered_llama3_8b_df["political_label"].value_counts()

political_label
unclear          175
left-leaning     123
right-leaning     34
moderate           3
Name: count, dtype: int64

In [291]:
llama3_8b_df_subset = filtered_llama3_8b_df[
    ["link", "civic_label", "political_label"]
]
llama3_8b_df_subset = llama3_8b_df_subset.rename(
    columns={
        "civic_label": "llama3-8b_civic_label",
        "political_label": "llama3-8b_political_label"
    }
)

In [292]:
llama3_8b_df_subset.columns

Index(['link', 'llama3-8b_civic_label', 'llama3-8b_political_label'], dtype='object')

Llama3-70b

In [293]:
# only those with valid_json_response
filtered_llama3_70b_df = large_model_df[large_model_df["valid_json_response"]]
# civic_llama3_70b_df = filtered_llama3_70b_df[
#     filtered_llama3_70b_df["civic_label"] == "civic"
# ]

In [294]:
filtered_llama3_70b_df["civic_label"].value_counts()

civic_label
civic        200
not civic    147
Name: count, dtype: int64

In [295]:
filtered_llama3_70b_df["political_label"].value_counts()

political_label
left-leaning     173
unclear          156
right-leaning     11
moderate           7
Name: count, dtype: int64

In [296]:
llama3_70b_df_subset = filtered_llama3_70b_df[
    ["link", "civic_label", "political_label"]
]
llama3_70b_df_subset = llama3_70b_df_subset.rename(
    columns={
        "civic_label": "llama3-70b_civic_label",
        "political_label": "llama3-70b_political_label"
    }
)

In [297]:
llama3_70b_df_subset.columns

Index(['link', 'llama3-70b_civic_label', 'llama3-70b_political_label'], dtype='object')

Now let's join the labels across Llama3-8b, Llama3-70b, and GPT-4. These are going to be the posts that we analyze against the ground-truth labels.

In [310]:
# merge gpt4 and llama3-8b dfs
all_llm_labeled_posts: pd.DataFrame = pd.merge(
    gpt4_df_subset, llama3_8b_df_subset, on="link"
)

# merge the merged df and the llama3-70b df
all_llm_labeled_posts: pd.DataFrame = pd.merge(
    all_llm_labeled_posts, llama3_70b_df_subset, on="link"
)


In [311]:
all_llm_labeled_posts.shape

(328, 7)

We have 328 posts that we have labels for across all the LLMs.
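
The merges above use pandas' default inner join, so any post missing from one of the three tables (for example, because a model returned invalid JSON for it) is dropped silently. That is why we end up with 328 rows even though each table on its own is larger. To see which posts fall out, an outer join with `indicator=True` is handy. The frames below are toy data, not the real label tables.

```python
import pandas as pd

# Toy frames standing in for two of the per-model label tables.
left = pd.DataFrame({
    "link": ["post-a", "post-b", "post-c"],
    "gpt4_civic_label": ["civic", "not civic", "civic"],
})
right = pd.DataFrame({
    "link": ["post-a", "post-c", "post-d"],
    "llama3-8b_civic_label": ["civic", "civic", "not civic"],
})

# Outer join keeps every link; the _merge column says where each row came from.
audit = pd.merge(left, right, on="link", how="outer", indicator=True)
dropped = audit[audit["_merge"] != "both"]
print(dropped[["link", "_merge"]])

# validate= raises if a link appears more than once on either side,
# which would otherwise silently duplicate rows in the merged table.
inner = pd.merge(left, right, on="link", validate="one_to_one")
```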

In [312]:
all_llm_labeled_posts.head()

Unnamed: 0,link,gpt4_civic_label,gpt4_political_label,llama3-8b_civic_label,llama3-8b_political_label,llama3-70b_civic_label,llama3-70b_political_label
0,https://bsky.app/profile/jbouie.bsky.social/po...,civic,left-leaning,civic,left-leaning,civic,left-leaning
1,https://bsky.app/profile/lethalityjane.bsky.so...,civic,right-leaning,civic,right-leaning,civic,unclear
2,https://bsky.app/profile/esqueer.bsky.social/p...,civic,left-leaning,civic,left-leaning,civic,left-leaning
3,https://bsky.app/profile/stuflemingnz.bsky.soc...,not civic,unclear,not civic,unclear,not civic,unclear
4,https://bsky.app/profile/sararoseg.bsky.social...,civic,left-leaning,civic,left-leaning,civic,left-leaning


First, let's see how many of the posts all of the models agree on.

In [316]:
all_llm_labeled_posts.head()

Unnamed: 0,link,gpt4_civic_label,gpt4_political_label,llama3-8b_civic_label,llama3-8b_political_label,llama3-70b_civic_label,llama3-70b_political_label
0,https://bsky.app/profile/jbouie.bsky.social/po...,civic,left-leaning,civic,left-leaning,civic,left-leaning
1,https://bsky.app/profile/lethalityjane.bsky.so...,civic,right-leaning,civic,right-leaning,civic,unclear
2,https://bsky.app/profile/esqueer.bsky.social/p...,civic,left-leaning,civic,left-leaning,civic,left-leaning
3,https://bsky.app/profile/stuflemingnz.bsky.soc...,not civic,unclear,not civic,unclear,not civic,unclear
4,https://bsky.app/profile/sararoseg.bsky.social...,civic,left-leaning,civic,left-leaning,civic,left-leaning


In [318]:
all_llm_labeled_posts["all_llms_agree_civic"] = all_llm_labeled_posts.apply(
    lambda row: (
        row["gpt4_civic_label"]
        == row["llama3-8b_civic_label"]
        == row["llama3-70b_civic_label"]
    ),
    axis=1
)

all_llm_labeled_posts["llama_models_agree_civic"] = all_llm_labeled_posts.apply(
    lambda row: (
        row["llama3-8b_civic_label"]
        == row["llama3-70b_civic_label"]
    ),
    axis=1
)

all_llm_labeled_posts["all_llms_agree_political_ideology"] = all_llm_labeled_posts.apply(
    lambda row: (
        row["gpt4_political_label"]
        == row["llama3-8b_political_label"]
        == row["llama3-70b_political_label"]
    ),
    axis=1
)

all_llm_labeled_posts["llama_models_agree_political_ideology"] = all_llm_labeled_posts.apply(
    lambda row: (
        row["llama3-8b_political_label"]
        == row["llama3-70b_political_label"]
    ),
    axis=1
)

Now let's export these

In [319]:
all_llm_labeled_posts_dicts = all_llm_labeled_posts.to_dict(orient="records")

In [320]:
all_llm_labeled_posts_filename = "all_llm_labeled_posts.jsonl"
all_llm_results_fp = os.path.join(current_wd, all_llm_labeled_posts_filename)

with open(all_llm_results_fp, "w") as f:
    for post in all_llm_labeled_posts_dicts:
        # json.dumps writes valid JSON; str() would write a Python dict repr
        f.write(json.dumps(post) + "\n")

In [321]:
all_llm_labeled_posts.to_csv("all_llm_labeled_posts.csv")

What is the degree of agreement between the context models (Llama3-8b and Llama3-70b)?

In [323]:
all_llm_labeled_posts.head()

Unnamed: 0,link,gpt4_civic_label,gpt4_political_label,llama3-8b_civic_label,llama3-8b_political_label,llama3-70b_civic_label,llama3-70b_political_label,all_llms_agree_civic,llama_models_agree_civic,all_llms_agree_political_ideology,llama_models_agree_political_ideology
0,https://bsky.app/profile/jbouie.bsky.social/po...,civic,left-leaning,civic,left-leaning,civic,left-leaning,True,True,True,True
1,https://bsky.app/profile/lethalityjane.bsky.so...,civic,right-leaning,civic,right-leaning,civic,unclear,True,True,False,False
2,https://bsky.app/profile/esqueer.bsky.social/p...,civic,left-leaning,civic,left-leaning,civic,left-leaning,True,True,True,True
3,https://bsky.app/profile/stuflemingnz.bsky.soc...,not civic,unclear,not civic,unclear,not civic,unclear,True,True,True,True
4,https://bsky.app/profile/sararoseg.bsky.social...,civic,left-leaning,civic,left-leaning,civic,left-leaning,True,True,True,True


In [324]:
total_posts = all_llm_labeled_posts.shape[0]

In [326]:
prop_agreement_llama_civic = (
    sum(all_llm_labeled_posts["llama_models_agree_civic"])
    / total_posts
)
print(f"Total Llama agreement for civic posts: {prop_agreement_llama_civic}")

prop_agreement_llama_political_ideology = (
    sum(all_llm_labeled_posts["llama_models_agree_political_ideology"])
    / total_posts
)
print(f"Total Llama agreement for political ideology posts: {prop_agreement_llama_political_ideology}")

Total Llama agreement for civic posts: 0.899390243902439
Total Llama agreement for political ideology posts: 0.8079268292682927


What is the degree of agreement between all the LLMs?

In [327]:
prop_agreement_all_civic = (
    sum(all_llm_labeled_posts["all_llms_agree_civic"])
    / total_posts
)
print(f"Total agreement for civic posts: {prop_agreement_all_civic}")

prop_agreement_all_political_ideology = (
    sum(all_llm_labeled_posts["all_llms_agree_political_ideology"])
    / total_posts
)
print(f"Total agreement for political ideology posts: {prop_agreement_all_political_ideology}")

Total agreement for civic posts: 0.5304878048780488
Total agreement for political ideology posts: 0.4481707317073171
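
Raw agreement can look high simply because one label dominates (most civic posts here are left-leaning). Cohen's kappa corrects for the agreement expected by chance. Below is a small, dependency-free sketch of both measures for a pair of labelers; the helper names are hypothetical and the toy values aren't from the pilot data.

```python
from collections import Counter
from typing import Sequence


def raw_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of items on which the two labelers give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two labelers, corrected for chance."""
    n = len(a)
    p_observed = raw_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label independently.
    p_expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(a) | set(b)
    )
    if p_expected == 1:
        # Both labelers used a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


small_model = ["civic", "civic", "not civic", "not civic"]
large_model = ["civic", "not civic", "not civic", "not civic"]

agreement = raw_agreement(small_model, large_model)
kappa = cohen_kappa(small_model, large_model)
print(f"raw agreement={agreement}, kappa={kappa}")
```

On the toy labels the raw agreement is 0.75 while kappa is 0.5, which shows how much of the raw number can be attributed to chance.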


Now, let's compare these against the ground truth labels.

In [328]:
GROUND_TRUTH_LABELS_CSV_FP = "../manuscript_pilot/hand_labeled_pilot_posts.csv"
ground_truth_labels = pd.read_csv(GROUND_TRUTH_LABELS_CSV_FP)

Let's first do some basic analysis of the ground-truth labels

In [330]:
print(ground_truth_labels["civic_hand_label"].value_counts())

civic_hand_label
civic        193
not civic    161
Name: count, dtype: int64


In [331]:
civic_ground_truth_labels = ground_truth_labels[
    ground_truth_labels["civic_hand_label"] == "civic"
]

In [332]:
print(civic_ground_truth_labels["political_ideology_hand_label"].value_counts())

political_ideology_hand_label
left-leaning     160
unclear           20
right-leaning      7
moderate           5
Name: count, dtype: int64


Let's remove the rows from the ground-truth labels that don't have any values (these are deleted posts).

In [353]:
ground_truth_labels = ground_truth_labels[
    ~pd.isna(ground_truth_labels["civic_hand_label"])
]

Now let's join the ground-truth labels to the LLM labels

In [354]:
llm_labels_with_ground_truth_df = pd.merge(
    all_llm_labeled_posts,
    ground_truth_labels[
        ["link", "civic_hand_label", "political_ideology_hand_label"]
    ],
    on="link"
)

In [355]:
llm_labels_with_ground_truth_df.shape

(321, 13)

Let's export this data.

In [356]:
llm_labels_with_ground_truth_dicts = llm_labels_with_ground_truth_df.to_dict(orient="records")

In [357]:
all_results_filename = "all_llm_labels_with_ground_truth_labels.jsonl"
all_results_full_fp = os.path.join(current_wd, all_results_filename)

# write one valid JSON object per line; missing hand labels are written as null
llm_labels_with_ground_truth_df.to_json(
    all_results_full_fp, orient="records", lines=True
)

In [358]:
llm_labels_with_ground_truth_df.to_csv(
    "all_llm_labels_with_ground_truth_labels.csv"
)

Now let's compare each model against the ground truth labels, for both civic and for political ideology.

In [359]:
llm_labels_with_ground_truth_df.head()

Unnamed: 0,link,gpt4_civic_label,gpt4_political_label,llama3-8b_civic_label,llama3-8b_political_label,llama3-70b_civic_label,llama3-70b_political_label,all_llms_agree_civic,llama_models_agree_civic,all_llms_agree_political_ideology,llama_models_agree_political_ideology,civic_hand_label,political_ideology_hand_label
0,https://bsky.app/profile/jbouie.bsky.social/po...,civic,left-leaning,civic,left-leaning,civic,left-leaning,True,True,True,True,civic,left-leaning
1,https://bsky.app/profile/lethalityjane.bsky.so...,civic,right-leaning,civic,right-leaning,civic,unclear,True,True,False,False,civic,unclear
2,https://bsky.app/profile/esqueer.bsky.social/p...,civic,left-leaning,civic,left-leaning,civic,left-leaning,True,True,True,True,civic,left-leaning
3,https://bsky.app/profile/stuflemingnz.bsky.soc...,not civic,unclear,not civic,unclear,not civic,unclear,True,True,True,True,not civic,
4,https://bsky.app/profile/sararoseg.bsky.social...,civic,left-leaning,civic,left-leaning,civic,left-leaning,True,True,True,True,civic,left-leaning


#### GPT-4

Civic classification

In [375]:
gpt4_civic_metrics = precision_recall_fscore_support(
    y_true=llm_labels_with_ground_truth_df["civic_hand_label"].tolist(),
    y_pred=llm_labels_with_ground_truth_df["gpt4_civic_label"].tolist(),
    average="binary",
    pos_label="civic"
)

In [361]:
(
    gpt4_civic_precision,
    gpt4_civic_recall,
    gpt4_civic_fbeta_score,
    gpt4_civic_support
) = gpt4_civic_metrics


In [362]:
print(f"Precision: {gpt4_civic_precision}\tRecall: {gpt4_civic_recall}\tF-1 score: {gpt4_civic_fbeta_score}\tSupport: {gpt4_civic_support}")

Precision: 0.9415584415584416	Recall: 0.8333333333333334	F-1 score: 0.8841463414634146	Support: None


In [370]:
total_values = llm_labels_with_ground_truth_df.shape[0]

In [369]:
confusion_matrix = pd.crosstab(
    llm_labels_with_ground_truth_df["civic_hand_label"].tolist(),
    llm_labels_with_ground_truth_df["gpt4_civic_label"].tolist()
)

In [371]:
acc = (
    (confusion_matrix.values[0][0] + confusion_matrix.values[1][1]) 
    / total_values
)

In [372]:
print(f"Accuracy: {acc}")

Accuracy: 0.881619937694704
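
Indexing `confusion_matrix.values[0][0]` and `[1][1]` assumes that both axes of the crosstab contain the same labels in the same order. If a model never predicted one of the classes, the crosstab would lose a column and the diagonal would no longer line up. A hedged alternative, shown here on toy labels, is to let scikit-learn compute accuracy directly and to pin the label order explicitly:

```python
# Aliased so it doesn't clash with the notebook's `confusion_matrix` variable.
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix as sk_confusion_matrix

# Toy labels, not the pilot data.
y_true = ["civic", "civic", "not civic", "not civic", "civic"]
y_pred = ["civic", "not civic", "not civic", "civic", "civic"]
label_order = ["civic", "not civic"]

# Rows are true labels, columns are predicted labels, both in label_order.
cm = sk_confusion_matrix(y_true, y_pred, labels=label_order)
accuracy = accuracy_score(y_true, y_pred)

# Equivalent to summing the diagonal by hand, but for any number of classes.
accuracy_from_cm = cm.trace() / cm.sum()

print(cm)
print(f"Accuracy: {accuracy}")
```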


Political ideology classification

#### Llama3-8b

Civic classification

In [374]:
llama3_8b_civic_metrics = precision_recall_fscore_support(
    y_true=llm_labels_with_ground_truth_df["civic_hand_label"].tolist(),
    y_pred=llm_labels_with_ground_truth_df["llama3-8b_civic_label"].tolist(),
    average="binary",
    pos_label="civic"
)

In [376]:
(
    llama3_8b_civic_precision,
    llama3_8b_civic_recall,
    llama3_8b_civic_fbeta_score,
    llama3_8b_civic_support
) = llama3_8b_civic_metrics


In [377]:
print(f"Precision: {llama3_8b_civic_precision}\tRecall: {llama3_8b_civic_recall}\tF-1 score: {llama3_8b_civic_fbeta_score}\tSupport: {llama3_8b_civic_support}")

Precision: 0.6123595505617978	Recall: 0.6264367816091954	F-1 score: 0.6193181818181818	Support: None


In [378]:
confusion_matrix = pd.crosstab(
    llm_labels_with_ground_truth_df["civic_hand_label"].tolist(),
    llm_labels_with_ground_truth_df["llama3-8b_civic_label"].tolist()
)

In [379]:
acc = (
    (confusion_matrix.values[0][0] + confusion_matrix.values[1][1]) 
    / total_values
)

In [380]:
print(f"Accuracy: {acc}")

Accuracy: 0.5825545171339563


Political classification

#### Llama3-70b

Civic classification

In [381]:
llama3_70b_civic_metrics = precision_recall_fscore_support(
    y_true=llm_labels_with_ground_truth_df["civic_hand_label"].tolist(),
    y_pred=llm_labels_with_ground_truth_df["llama3-70b_civic_label"].tolist(),
    average="binary",
    pos_label="civic"
)

In [383]:
(
    llama3_70b_civic_precision,
    llama3_70b_civic_recall,
    llama3_70b_civic_fbeta_score,
    llama3_70b_civic_support
) = llama3_70b_civic_metrics


In [384]:
print(f"Precision: {llama3_70b_civic_precision}\tRecall: {llama3_70b_civic_recall}\tF-1 score: {llama3_70b_civic_fbeta_score}\tSupport: {llama3_70b_civic_support}")

Precision: 0.6033519553072626	Recall: 0.6206896551724138	F-1 score: 0.6118980169971672	Support: None
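
The three evaluation blocks are identical apart from the column prefix, so they could be folded into one loop that produces a single summary table. A sketch on a toy frame that reuses the column naming from above (the label values themselves are made up):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy frame with the same column layout as llm_labels_with_ground_truth_df.
toy = pd.DataFrame({
    "civic_hand_label": ["civic", "civic", "not civic", "not civic"],
    "gpt4_civic_label": ["civic", "civic", "not civic", "civic"],
    "llama3-8b_civic_label": ["civic", "not civic", "not civic", "not civic"],
    "llama3-70b_civic_label": ["civic", "civic", "not civic", "not civic"],
})

rows = []
for model in ["gpt4", "llama3-8b", "llama3-70b"]:
    y_true = toy["civic_hand_label"]
    y_pred = toy[f"{model}_civic_label"]
    # Support is None when an average is requested, so it is discarded here.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label="civic"
    )
    rows.append({
        "model": model,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy_score(y_true, y_pred),
    })

summary = pd.DataFrame(rows).set_index("model")
print(summary)
```

Having all models in one table makes the side-by-side comparison easier to read than scrolling between separate cells, and the same loop could be pointed at the political ideology columns with `average="weighted"`.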
