# Comparing LLMs for Climate Claim Classification: A Hands-On Tutorial
## Introduction

In this tutorial, we use Large Language Models (LLMs) to classify climate change contrarian claims. We compare a locally run model served through Ollama with a cloud-hosted model from TogetherAI (accessed through its OpenAI-compatible API). We evaluate the models with an agreement statistic, Gwet's AC1, and test whether their label distributions look like random guessing. This notebook is a practical guide that interleaves code with explanations to make the concepts accessible.

We'll be working with a dataset of climate change contrarian claims, each labeled with a category from a predefined codebook.  Our goal is to see how well different LLMs can replicate human-assigned labels.

This tutorial builds upon the methodologies presented in the following studies:
- Computer-assisted classification of contrarian claims about climate change: https://www.nature.com/articles/s41598-021-01714-4
- LLM-Assisted Content Analysis: Using Large Language Models to Support Deductive Coding: https://arxiv.org/abs/2306.14924



## Setup and Installation

First, we install the necessary Python packages. We'll use `ollama` to run a local LLM, `openai` to interact with the TogetherAI API (which mimics the OpenAI API), `pandas` for data manipulation, `tqdm` for progress bars, `scikit-learn` for confusion matrices and classification reports, and `scipy` and `statsmodels` for the statistical tests.

In [None]:
# Install required packages
!pip install ollama openai pandas tqdm scikit-learn scipy statsmodels -q

Next, we need to install Ollama itself. Ollama allows us to run LLMs locally, giving us more control and potentially better privacy. The following commands install Ollama on a Linux system (which Google Colab uses).

In [None]:
# Install Ollama
!sudo apt-get install -y pciutils
!curl -fsSL https://ollama.com/install.sh | sh

## Setting up the Ollama Server

To use Ollama within Google Colab, we need to run it as a background service. The following code sets the necessary environment variables (`OLLAMA_HOST` and `OLLAMA_ORIGINS`) and starts the Ollama server in a separate thread. This allows our Python code to interact with the Ollama server while it runs in the background.

In [None]:
# run ollama server on Colab
import os
import threading
import subprocess

def start_ollama():
    os.environ['OLLAMA_HOST'] = '0.0.0.0:11434'
    os.environ['OLLAMA_ORIGINS'] = '*'
    subprocess.Popen(["ollama", "serve"])

ollama_thread = threading.Thread(target=start_ollama)
ollama_thread.start()
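
Because `ollama serve` starts asynchronously, later cells can run before the server accepts connections. A minimal sketch of a readiness check, assuming the server answers plain HTTP requests on port 11434 (the port set in `OLLAMA_HOST` above); the helper name `wait_for_server` and its defaults are illustrative and not part of the Ollama client:

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url="http://127.0.0.1:11434", timeout=30.0, interval=0.5):
    """Poll `url` until it answers an HTTP request or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=2):
                return True  # server responded with a success status
        except urllib.error.HTTPError:
            return True  # an HTTP error status still means the server is up
        except (urllib.error.URLError, OSError):
            pass  # not reachable yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Usage (hypothetical): call once after starting the server thread
# if not wait_for_server():
#     raise RuntimeError("Ollama server did not start in time")
```

Polling like this is usually preferable to a fixed `time.sleep`, since start-up time can vary between Colab sessions.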

## Downloading the LLM

We'll use the `mannix/gemma2-9b-simpo` model, a variant of Google's Gemma model. This model has been fine-tuned for instruction following and question answering, making it suitable for our classification task. The `ollama pull` command downloads the model to our local environment.

In [None]:
# Download LLM
!ollama pull mannix/gemma2-9b-simpo

## Importing Libraries

Now, let's import the Python libraries we'll be using:

In [None]:
# Import packages
import pandas as pd
import json
import ollama
from tqdm import tqdm
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from statsmodels.stats.proportion import proportions_ztest
from scipy.stats import chi2_contingency
import numpy as np

In [None]:
# Register tqdm's progress bar with pandas (enables `progress_apply`)
tqdm.pandas()

## The Codebook

The foundation of our classification task is the codebook. It defines the categories we use to classify climate change contrarian claims. The `categories_codebook` string contains a one-line description of each category, prefixed by its numeric code. A well-defined codebook is *crucial* for ensuring consistency and accuracy, both for human annotators and for our LLMs.

In [None]:
# Codebook: one line per category (numeric code followed by a short description)
categories_codebook = """
Climate Change Denial Arguments Codebook:
- 1.1 Ice, permafrost, or snow cover isn't melting.
- 1.2 We're heading into global cooling or a new ice age.
- 1.3 Cold weather or snow means there's no global warming.
- 1.4 The climate hasn't warmed or changed in recent decades.
- 1.5 The oceans are cooling, or they're not warming.
- 1.6 Sea level rise is exaggerated or isn't accelerating.
- 1.7 Extreme weather isn't increasing, has always happened, or isn't linked to climate change.
- 1.8 They changed the term from 'global warming' to 'climate change' because it's not really warming.
- 2.1 Climate change is just part of natural cycles or variations.
- 2.2 Human impacts other than greenhouse gases (like aerosols or land use) are the cause.
- 2.3 There's no real evidence that CO2 or the greenhouse effect is driving climate change.
- 2.4 CO2 levels aren't rising, or the ocean's pH isn't dropping.
- 2.5 Human CO2 emissions are too small to make a difference.
- 3.1 The climate isn't very sensitive to CO2, and there are feedbacks that reduce warming.
- 3.2 Species, plants, or coral reefs aren't affected by climate change yet, or they are even benefiting.
- 3.3 CO2 is good, not a pollutant.
- 3.4 The temperature increase is only a few degrees, which isn't a big deal.
- 3.5 Climate change doesn't contribute to human conflict or threaten national security.
- 3.6 Climate change doesn't have negative effects on health.
- 4.1 Climate policies, whether mitigation or adaptation, are harmful.
- 4.2 Climate policies are ineffective or flawed.
- 4.3 The problem is too hard to solve.
- 4.4 Clean energy technologies or biofuels won't work.
- 4.5 We need energy from fossil fuels or nuclear power.
- 5.1 Climate science is uncertain, unsound, or unreliable (refers to data, methods, or models).
- 5.2 The climate movement is alarmist, wrong, political, biased, or hypocritical.
- 5.3 Climate change science or policy is a conspiracy or a deception.
- 0.0 None of the above.
"""
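
Because every codebook line starts with its numeric code, the valid labels can be extracted programmatically, which is handy later for validating model output. A small sketch on a three-line excerpt of the codebook; the names `parse_codebook` and `codebook_excerpt` are illustrative:

```python
import re

# Short excerpt for illustration; the notebook's full string has more categories
codebook_excerpt = """
- 1.1 Ice, permafrost, or snow cover isn't melting.
- 2.1 Climate change is just part of natural cycles or variations.
- 0.0 None of the above.
"""

def parse_codebook(text):
    """Map each category code (a string such as '1.1') to its description."""
    pattern = re.compile(r"^- (\d+\.\d+) (.+)$", re.MULTILINE)
    return {code: description.strip() for code, description in pattern.findall(text)}

codes = parse_codebook(codebook_excerpt)
print(sorted(codes))  # ['0.0', '1.1', '2.1']
```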

## The Classification Function (Ollama)

The `classify_claim` function is the heart of our interaction with the Ollama LLM. It takes a claim (a piece of text) as input and returns the predicted category number.

1.  **Prompt Construction:**  We create a prompt that includes the codebook and the claim to be classified.  We instruct the LLM to output *only* the category number in JSON format (e.g., `{"category": 1.1}`). This structured output is crucial for easy parsing.

2.  **Ollama API Call:** We use the `ollama.chat` function to send the prompt to the LLM. We specify the model (`mannix/gemma2-9b-simpo:latest`) and set the `format` to 'json'.  The `messages` parameter structures the interaction as a conversation, with a "system" message setting the context and a "user" message containing the prompt.

3.  **Response Parsing:** We receive the LLM's response, which should be a JSON string. We use `json.loads` to parse this string into a Python dictionary. We extract the 'category' value and convert it to a float.

4.  **Error Handling:** We include a `try-except` block to handle potential errors, such as invalid JSON responses or missing keys.  This makes our code more robust.

In [None]:
# Main function
def classify_claim(claim):
   prompt = f"""
   Given the following Climate Change Denial Arguments Codebook:
   {categories_codebook}
   Classify the following claim into one of the categories. Pick the one that fits best - if multiple, pick the most relevant one.
   Claim: {claim}
   Output only the category number as a float in JSON format, like this: {{"category": 1.1}}
   """
   response = ollama.chat(
       model='mannix/gemma2-9b-simpo:latest',
       messages=[
           {"role": "system", "content": "You are a climate change claim classification assistant. Classify the given claim according to the codebook."},
           {"role": "user", "content": prompt}
       ],
       format='json'
   )
   content = response['message']['content']
   try:
       result = json.loads(content)
       return float(result['category'])
   except (json.JSONDecodeError, KeyError, TypeError, ValueError) as e:
       # TypeError covers a null category or a JSON value that is not an object
       print(f"Error parsing LLM response: {e}")
       print(f"Full response: {content}")
       return None
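
A response can be valid JSON and still carry a code that is not in the codebook (for example `1.9`). A hedged sketch of a stricter parser that rejects such values; `parse_category` and the short `VALID_CODES` set are hypothetical stand-ins, and in the notebook the set would hold every code from the codebook:

```python
import json

# Illustrative subset; the full codebook has more categories
VALID_CODES = {0.0, 1.1, 1.2, 2.1}

def parse_category(raw, valid_codes=VALID_CODES):
    """Return the category as a float, or None if the response is unusable."""
    try:
        value = float(json.loads(raw)["category"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    # Round to one decimal so that 1.10 and 1.1 compare equal
    value = round(value, 1)
    return value if value in valid_codes else None

print(parse_category('{"category": 1.1}'))  # 1.1
print(parse_category('{"category": 9.9}'))  # None
```

Treating out-of-codebook answers as failures keeps them from showing up later as spurious extra categories in the agreement statistics.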

## Agreement Statistics: Gwet's AC1

To evaluate how well our LLMs agree with human annotators (and with each other), we use Gwet's AC1 statistic.  AC1 is a measure of inter-rater reliability, similar to Cohen's Kappa, but it's more robust when dealing with uneven marginal distributions (i.e., when some categories are much more common than others).

The `gwet_ac1` function calculates Gwet's AC1 given two lists of ratings.

1.  **Initialization:**  It determines the number of observations (`n`) and the number of unique categories (`q`) used by either rater.

2.  **Observed Agreement (Pa):**  It calculates the proportion of observations where the two raters (or the LLM and the human) agree.

3.  **Chance Agreement (Pe):**  This is the tricky part. Gwet's AC1 calculates chance agreement differently than Cohen's Kappa.  For each category, it calculates the average proportion of times that category was assigned by *either* rater.  Then, it calculates the expected agreement due to chance using these proportions.

4.  **AC1 Calculation:** Finally, it calculates AC1 using the formula: `(Pa - Pe) / (1 - Pe)`. This normalizes the observed agreement by the expected chance agreement.

In [None]:
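def gwet_ac1(ratings1, ratings2):
   """Calculate Gwet's AC1 for two equally long sequences of ratings"""
   n = len(ratings1)
   categories = sorted(set(ratings1) | set(ratings2))
   q = len(categories)  # number of distinct categories used by either rater

   # Observed agreement: share of items with identical ratings
   pa = sum(r1 == r2 for r1, r2 in zip(ratings1, ratings2)) / n

   # With a single category, chance agreement is undefined (division by q - 1)
   if q < 2:
       return pa

   # Chance agreement: average the two raters' proportions per category
   pi = [(sum(r1 == cat for r1 in ratings1) +
          sum(r2 == cat for r2 in ratings2)) / (2 * n)
         for cat in categories]
   peg = sum(p * (1 - p) for p in pi) / (q - 1)

   # Gwet's AC1: observed agreement corrected for chance agreement
   ac1 = (pa - peg) / (1 - peg)
   return ac1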
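
To see why AC1 is preferred when one category dominates, here is a small worked comparison with Cohen's kappa. It is a self-contained sketch on made-up ratings (the data and the helper names are illustrative), with both coefficients written out from their textbook definitions:

```python
def observed_agreement(r1, r2):
    """Share of items on which the two raters give the same label."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohen_kappa(r1, r2):
    """Cohen's kappa: chance agreement from the product of each rater's marginals."""
    n = len(r1)
    cats = set(r1) | set(r2)
    pa = observed_agreement(r1, r2)
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)
    return (pa - pe) / (1 - pe)

def ac1(r1, r2):
    """Gwet's AC1: chance agreement from the averaged marginals."""
    n = len(r1)
    cats = set(r1) | set(r2)
    pa = observed_agreement(r1, r2)
    pi = [(r1.count(c) + r2.count(c)) / (2 * n) for c in cats]
    pe = sum(p * (1 - p) for p in pi) / (len(cats) - 1)
    return (pa - pe) / (1 - pe)

# Ten items; category 1.1 dominates and the raters disagree on the last two items
rater_a = [1.1] * 9 + [4.1]
rater_b = [1.1] * 8 + [4.1, 1.1]

print(f"kappa = {cohen_kappa(rater_a, rater_b):.3f}, AC1 = {ac1(rater_a, rater_b):.3f}")
# kappa = -0.111, AC1 = 0.756
```

The raters agree on 8 of 10 items, yet kappa is negative because its chance-agreement term (0.82) exceeds the observed agreement; AC1's chance term is only 0.18 here.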

Gwet’s AC1 is read on the same scale as other agreement statistics such as Cohen’s kappa, and what counts as a “good” value depends on the context. A common rule of thumb, the Landis and Koch benchmark originally proposed for kappa, is:

*   0.81 to 1.00: Almost perfect agreement
*   0.61 to 0.80: Substantial agreement
*   0.41 to 0.60: Moderate agreement
*   0.21 to 0.40: Fair agreement
*   0.00 to 0.20: Slight agreement
*   Below 0.00: Poor or no agreement (worse than chance)
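
The bands above can be turned into a small lookup for labelling results. This is an illustrative helper (`interpret_agreement` is not part of the original notebook), and it assigns values that fall between two bands, such as 0.205, to the higher band:

```python
def interpret_agreement(value):
    """Map an agreement coefficient to the verbal labels listed above."""
    if value < 0.0:
        return "Poor or no agreement"
    if value <= 0.20:
        return "Slight agreement"
    if value <= 0.40:
        return "Fair agreement"
    if value <= 0.60:
        return "Moderate agreement"
    if value <= 0.80:
        return "Substantial agreement"
    return "Almost perfect agreement"

print(interpret_agreement(0.756))  # Substantial agreement
```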

## Testing for Randomness

We also want to check whether the LLM's classifications look like random guessing. If the model assigned categories uniformly at random, the agreement statistics would tell us little. The `test_randomness` function tests whether the distribution of assigned categories differs from a uniform one.

1.  **Binary Case:** If there are only two categories, it uses a z-test for proportions (`proportions_ztest`) to test if the proportion of one category is significantly different from 0.5 (what we'd expect under random guessing).

2.  **Multiple Categories:** If there are more than two categories, it uses a chi-squared goodness-of-fit test (`scipy.stats.chisquare`).  It compares the observed frequency of each category to the expected frequency under a uniform distribution (i.e., equal probability for each observed category).

The function returns the p-value. A small p-value (typically less than 0.05) indicates that the category frequencies differ from a uniform distribution, i.e. the classifications are *not* uniformly random. The test looks only at the label distribution; it does not by itself show that the labels are correct.

In [None]:
from scipy.stats import chisquare

def test_randomness(codes):
   """Test whether the assigned codes differ from a uniform distribution"""
   codes = pd.Series(codes).dropna()  # ignore claims without a parsed code
   observed = codes.value_counts()

   if len(observed) == 2:  # Binary case: is the share of one category 0.5?
       count = observed.iloc[0]
       nobs = len(codes)
       stat, pval = proportions_ztest(count, nobs, 0.5)
       return pval
   else:  # Multiple categories: chi-squared goodness-of-fit test
       # chisquare assumes equal expected frequencies when f_exp is omitted
       stat, pval = chisquare(observed)
       return pval
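
The distribution test looks only at the labels a single rater assigned. A complementary check, swapped in here purely as an illustration and not part of the original notebook, is a permutation test: shuffle the model's labels many times and count how often the shuffled labels agree with the human labels at least as well as the real ones. The function name and the toy data are hypothetical:

```python
import random

def permutation_pvalue(reference, predicted, n_permutations=1000, seed=0):
    """Share of label shufflings whose raw agreement matches or beats the observed one."""
    rng = random.Random(seed)
    reference = list(reference)
    shuffled = list(predicted)
    # Observed agreement count, computed before any shuffling
    observed = sum(r == p for r, p in zip(reference, shuffled))
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if sum(r == p for r, p in zip(reference, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an exact zero
    return (hits + 1) / (n_permutations + 1)

human = [1.1, 2.1, 5.2] * 10
perfect_model = list(human)    # toy case: reproduces every human label
constant_model = [1.1] * 30    # toy case: always answers 1.1

p_perfect = permutation_pvalue(human, perfect_model)
p_constant = permutation_pvalue(human, constant_model)
print(p_perfect < 0.01, p_constant)  # True 1.0
```

A constant answer agrees with a third of the toy labels but gets a p-value of 1.0, because every shuffle of identical labels agrees equally well.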

## Loading and Processing the Data

Now, let's load the dataset of climate change contrarian claims.  We're using a CSV file hosted on GitHub.  The `pd.read_csv` function loads the data into a Pandas DataFrame.

In [None]:
# Load the CSV file
df = pd.read_csv('https://raw.githubusercontent.com/aaubs/llm-content-analysis/main/data/contrarian_claims_reasons.csv')

We apply our Ollama classification function to each claim in the 'text' column of the DataFrame. The `progress_apply` function (from `tqdm`) provides a progress bar, which is helpful when processing a large number of claims. The results are stored in a new column called 'new_model_code'.

In [None]:
# Apply the classification function to the 'text' column with tqdm
df['new_model_code'] = df['text'].progress_apply(classify_claim)
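
`classify_claim` returns `None` when a response cannot be parsed, and a second attempt may succeed when the model samples a different completion. A sketch of a small retry wrapper; `with_retries` is a hypothetical helper and `flaky_classifier` is a stand-in for the real classification function:

```python
def with_retries(classify, attempts=3):
    """Wrap `classify` so it is called again (up to `attempts` times) when it returns None."""
    def wrapper(claim):
        for _ in range(attempts):
            result = classify(claim)
            if result is not None:
                return result
        return None  # every attempt failed
    return wrapper

# Stand-in classifier that fails on its first call, then succeeds
calls = {"count": 0}

def flaky_classifier(claim):
    calls["count"] += 1
    return None if calls["count"] == 1 else 1.3

robust = with_retries(flaky_classifier)
print(robust("It snowed today, so much for global warming."))  # 1.3
```

In the notebook this could be used as `df['text'].progress_apply(with_retries(classify_claim))`.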

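The dataset already includes the human labels (`original_code` and `replicated_code`) and the original study's model labels (`model_code`). Before calculating our metrics, we convert all code columns to float so that they compare consistently. Claims whose response could not be parsed (`None`) become `NaN`.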

In [None]:
# Convert codes to float
df['original_code'] = df['original_code'].astype(float)
df['replicated_code'] = df['replicated_code'].astype(float)
df['model_code'] = df['model_code'].astype(float)
df['new_model_code'] = df['new_model_code'].astype(float)
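
Failed classifications end up as `NaN`, and `NaN` never compares equal to anything (including itself), so each failure silently counts as a disagreement in the agreement statistics. A minimal sketch of reporting and dropping such pairs first; `drop_unparsed` is an illustrative helper that works on plain lists:

```python
import math

def drop_unparsed(reference, predicted):
    """Keep only pairs where both labels are real numbers (no None or NaN)."""
    def is_valid(x):
        return x is not None and not math.isnan(x)

    kept = [(r, p) for r, p in zip(reference, predicted) if is_valid(r) and is_valid(p)]
    ref_clean = [r for r, _ in kept]
    pred_clean = [p for _, p in kept]
    return ref_clean, pred_clean

human = [1.1, 2.1, 5.2, 0.0]
model = [1.1, float("nan"), 5.2, None]
ref, pred = drop_unparsed(human, model)
print(len(human) - len(ref), "pairs dropped")  # 2 pairs dropped
```

On the DataFrame itself, `df.dropna(subset=['new_model_code'])` has the same effect. Whether to drop failures or count them as disagreements is a design choice worth reporting alongside the results.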

## Calculating and Interpreting Results

We're now ready to calculate the agreement statistics and perform the randomness tests.  We store the results in a dictionary for easy access.

In [None]:
# Calculate metrics
results = {
   'human_human_ac1': gwet_ac1(df['original_code'], df['replicated_code']),
   'human_model_ac1': gwet_ac1(df['original_code'], df['model_code']),
   'human_newmodel_ac1': gwet_ac1(df['original_code'], df['new_model_code']),
   'model_newmodel_ac1': gwet_ac1(df['model_code'], df['new_model_code']),
   'randomness_pval_original': test_randomness(df['model_code']),
   'randomness_pval_new': test_randomness(df['new_model_code'])
}

We print the results, formatted to three decimal places.

In [None]:
# Print results
print("Agreement Metrics (Gwet's AC1):")
print(f"Human-Human: {results['human_human_ac1']:.3f}")
print(f"Human-Original Model: {results['human_model_ac1']:.3f}")
print(f"Human-New Model: {results['human_newmodel_ac1']:.3f}")
print(f"Model-Model: {results['model_newmodel_ac1']:.3f}")
print("\nRandomness Test p-values:")
print(f"Original Model: {results['randomness_pval_original']:.3f}")
print(f"New Model: {results['randomness_pval_new']:.3f}")

## Confusion Matrix and Classification Report

To get a more detailed view of the agreement between the original model and our new Ollama model, we create a confusion matrix and a classification report.

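First, we convert the float codes to strings. scikit-learn treats non-integer float labels such as `1.1` as continuous targets and rejects them in `confusion_matrix` and `classification_report`, whereas string labels are handled as classes.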

In [None]:
# Convert float codes to string labels for confusion matrix
df['model_code_str'] = df['model_code'].astype(str)
df['new_model_code_str'] = df['new_model_code'].astype(str)

Then we compute a confusion matrix to show the frequency of agreements and disagreements between the `model_code` (original study) and the `new_model_code` (new model) labels.

In [None]:
# Create confusion matrix with an explicit, sorted label order
labels = sorted(set(df['model_code_str']) | set(df['new_model_code_str']))
conf_matrix = confusion_matrix(df['model_code_str'], df['new_model_code_str'], labels=labels)

# Rows are the original model's codes, columns the new model's codes
conf_df = pd.DataFrame(
    conf_matrix,
    index=[f'True_{label}' for label in labels],
    columns=[f'Pred_{label}' for label in labels]
)

# Add row/column totals
conf_df['Total'] = conf_df.sum(axis=1)
conf_df.loc['Total'] = conf_df.sum()

print("\nConfusion Matrix:")
conf_df

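The classification report provides precision, recall, F1-score, and support for each category. Here the original model's labels serve as the reference ("true") labels, so the scores describe how closely the new model reproduces the original model on each category rather than how correct it is.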

In [None]:
# Classification report
print("\nClassification Report (New Model vs Original Model):")
print(classification_report(df['model_code_str'], df['new_model_code_str']))
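
With 28 categories the full confusion matrix is hard to scan, so it can help to list only the most frequent disagreeing pairs. A short sketch with toy labels; `top_confusions` is an illustrative helper, and in the notebook the two lists would be the `model_code_str` and `new_model_code_str` columns:

```python
from collections import Counter

def top_confusions(reference, predicted, k=3):
    """Return the k most common (reference, predicted) pairs that disagree."""
    pairs = Counter((r, p) for r, p in zip(reference, predicted) if r != p)
    return pairs.most_common(k)

reference = ["5.1", "5.1", "5.1", "5.2", "4.1", "1.7"]
predicted = ["5.1", "5.2", "5.2", "5.2", "4.2", "1.7"]
print(top_confusions(reference, predicted))
# [(('5.1', '5.2'), 2), (('4.1', '4.2'), 1)]
```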

## Using TogetherAI (OpenAI Compatible)

Now, let's compare our local Ollama model with a cloud-based model from TogetherAI. We use the `openai` library, but we configure it to use the TogetherAI API endpoint. This demonstrates how you can use the same familiar OpenAI interface to access different LLM providers.

In [None]:
from google.colab import userdata

In [None]:
from openai import OpenAI

In [None]:
# Setup OpenAI client with custom API key and base URL
TOGETHER_API_KEY = userdata.get('TOGETHER_API_KEY')

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=TOGETHER_API_KEY
)
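
`google.colab.userdata` exists only inside Colab. A hedged sketch of a loader that falls back to an environment variable when the notebook runs elsewhere; `load_api_key` is a hypothetical helper, and the default variable name simply mirrors the secret name used above:

```python
import os

def load_api_key(name="TOGETHER_API_KEY"):
    """Read the key from Colab secrets when available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        key = userdata.get(name)
        if key:
            return key
    except Exception:
        pass  # not on Colab, or the secret is not set there
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set the {name} environment variable or Colab secret")
    return key
```

Either way the key stays out of the notebook itself, which matters if the notebook is shared or committed to a repository.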

We create a new classification function, `classify_claim_openai`, that uses the TogetherAI API.  It's very similar to the Ollama function, but it uses the `client.chat.completions.create` method from the `openai` library. We specify the `meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo` model and set `temperature=0` to make the responses deterministic (or as deterministic as possible with LLMs). We again specify `response_format={"type": "json_object"}` to ensure we receive a JSON response.

In [None]:
def classify_claim_openai(claim):
   prompt = f"""Given the following Climate Change Denial Arguments Codebook:
{categories_codebook}
Classify the following claim into one of the categories. Pick the one that fits best - if multiple, pick the most relevant one.
Claim: {claim}
Output only the category number as a float in JSON format, like this: {{"category": 1.1}}"""

   response = client.chat.completions.create(
       model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
       messages=[
           {"role": "system", "content": "You are a climate change claim classification assistant. Classify the given claim according to the codebook."},
           {"role": "user", "content": prompt}
       ],
       temperature=0,
       response_format={"type": "json_object"}
   )
   content = response.choices[0].message.content
   try:
       result = json.loads(content)
       return float(result['category'])
   except (json.JSONDecodeError, KeyError, TypeError, ValueError) as e:
       # TypeError covers empty content, a null category, or a non-object JSON value
       print(f"Error parsing response: {e}")
       print(f"Full response: {content}")
       return None

We apply the `classify_claim_openai` function to the 'text' column and store the results in a new 'openai_model_code' column.

In [None]:
# Add new column for the TogetherAI model's predictions (via the OpenAI-compatible client)
df['openai_model_code'] = df['text'].progress_apply(classify_claim_openai)
df['openai_model_code'] = df['openai_model_code'].astype(float)

We recalculate the agreement metrics, now including the TogetherAI/OpenAI model. This allows us to compare the performance of all three models (original study's model, Ollama, and TogetherAI).

In [None]:
# Calculate metrics including OpenAI model
results = {
   'human_human_ac1': gwet_ac1(df['original_code'], df['replicated_code']),
   'human_model_ac1': gwet_ac1(df['original_code'], df['model_code']),
   'human_gemma_ac1': gwet_ac1(df['original_code'], df['new_model_code']),
   'human_openai_ac1': gwet_ac1(df['original_code'], df['openai_model_code']),
   'model_gemma_ac1': gwet_ac1(df['model_code'], df['new_model_code']),
   'model_openai_ac1': gwet_ac1(df['model_code'], df['openai_model_code']),
   'gemma_openai_ac1': gwet_ac1(df['new_model_code'], df['openai_model_code'])
}

print("\nAgreement Metrics (Gwet's AC1):")
for k, v in results.items():
   print(f"{k}: {v:.3f}")

Finally, we generate confusion matrices comparing all pairs of models: Original-Gemma, Original-OpenAI, and Gemma-OpenAI (where "OpenAI" denotes the Llama model served through TogetherAI's OpenAI-compatible API). The printed tables show where each pair agrees and disagrees.

In [None]:
# Confusion matrices between all model pairs
model_pairs = [
   ('model_code', 'new_model_code', 'Original-Gemma'),
   ('model_code', 'openai_model_code', 'Original-OpenAI'),
   ('new_model_code', 'openai_model_code', 'Gemma-OpenAI')
]

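for col1, col2, name in model_pairs:
   y1, y2 = df[col1].astype(str), df[col2].astype(str)
   labels = sorted(set(y1) | set(y2))
   conf = confusion_matrix(y1, y2, labels=labels)
   # Label rows (first model) and columns (second model) with the category codes
   conf_df = pd.DataFrame(conf, index=labels, columns=labels)
   print(f"\nConfusion Matrix {name}:")
   print(conf_df)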

## Conclusion

This tutorial demonstrates how to use and compare different LLMs for a text classification task. We've covered setting up Ollama for local LLM inference, using the TogetherAI API, constructing effective prompts, evaluating model performance with Gwet's AC1, performing randomness tests, and visualizing results with confusion matrices and classification reports. This provides a solid foundation for applying LLMs to your own content analysis projects. Remember to critically evaluate the results and consider the limitations of LLMs, especially their potential for bias and inconsistency.