## Classifying unstructured geographic text based on distance

This experiment uses LLMs to extract and summarize geographic information from unstructured empirical literature on migration studies, in particular climate migration (i.e., human migration related to climate change or environmental events). The core task is to automatically classify the type of human migration discussed in each paper into two qualitative categories: long-distance or short-distance human migration. A paper may also address both or neither, which is why the prompt further down uses four codes.

Note that this task differs from straightforward NER and is comparatively more comprehensive. The migration type is rarely stated explicitly in the text, so the LLM has to infer it from more implicit natural-language descriptions. According to the literature, beyond NER-like extraction, such information sometimes has to be inferred from the distance between place names, from difficulties caused by border crossing, or from synonyms and semantically similar terms such as 'internal human migration' and 'international human migration'. Naturally, this conceptualization varies with the research background. Studying it is nevertheless worthwhile: it supports data analysis, ontology engineering and information retrieval, and it can inform decision making and policy making on human migration.

### Libraries

In [118]:
from openai import OpenAI, RateLimitError, APIError, APITimeoutError, AuthenticationError, BadRequestError, NotFoundError
from google.api_core.exceptions import ResourceExhausted, RetryError, DeadlineExceeded
import google.generativeai as genai

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# from multiprocessing import Pool
import time
import os
import ast
from tqdm import tqdm
import sklearn.metrics  # import the submodule explicitly; sklearn.metrics.confusion_matrix is used below

### Layout

In [3]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_seq_items', None)

### Data

In this experiment the input is pre-processed literature, i.e., papers converted to a Python dictionary, because some of the LLMs do not accept PDF input directly. We apply the same pre-processing to every model so that all of them receive identical input (comparing performance across LLMs is not our aim here). For the PDF-to-dictionary conversion, see the separate notebook pdf2text.ipynb. We acknowledge that this may lower accuracy, since figures and tables are excluded and the extracted text can be incoherent.

In [4]:
literature = np.load('literature.npy', allow_pickle=True).item()
shot = np.load('example.npy', allow_pickle=True).item()
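
The `.item()` call is needed because NumPy stores a pickled dictionary as a zero-dimensional object array. A minimal sketch of the round trip, with a hypothetical two-paper dictionary and file name standing in for the real data (the actual keys and structure of `literature.npy` are not shown here and are assumed):

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the real corpus: paper name -> extracted text
toy_literature = {
    "paper_a": "Households moved to a neighbouring village after the flood.",
    "paper_b": "Respondents crossed an international border after the drought.",
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_literature.npy")
    np.save(path, toy_literature)              # the dict is pickled into a 0-d object array
    loaded = np.load(path, allow_pickle=True)  # still an ndarray at this point
    restored = loaded.item()                   # unwrap the single stored object

print(type(loaded).__name__, loaded.shape)  # ndarray ()
print(restored == toy_literature)           # True
```

Without `allow_pickle=True`, `np.load` refuses to read object arrays, which is why the flag appears in the cell above.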

### Pre-defined variables

Settings fixed in advance, such as the maximum number of output tokens, temperature, top-p, logprobs, etc.

In [7]:
MAX_TOKEN = 16

### General commands for getting response

We use three models, gpt-4.1-mini, gemini-2.5-flash, and deepseek-3.2-chat, each accessed through its provider's API. For each model we consider three scenarios: 0-shot, 1-shot, and 3-shot, which lets us see whether providing examples makes a difference. That examples improve LLM performance is also one of our research hypotheses.

In [44]:
def api_def(provider):
    # API keys are hidden in another file
    with open('api.txt', 'r', encoding='utf-8') as file:
        lines = [line.strip() for line in file.readlines()]

        # Different LLMs need different kinds of command
        if provider == 'chatgpt':
            client = OpenAI(api_key=str(lines[0]), base_url="https://api.openai.com/v1")
            return client
        
        elif provider == 'deepseek':
            client = OpenAI(api_key=str(lines[1]), base_url="https://api.deepseek.com")
            return client
        
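        elif provider == 'gemini':
            # genai is configured globally, so no client object is returned
            genai.configure(api_key=str(lines[4]))

        else:
            raise ValueError(f"Unknown provider: {provider}")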

In [29]:
def get_completion(prompt, provider, model, temperature):
    messages = [{"role": "system", "content": "You are an expert in Human Geography, synthesizing literature on \
    climate-induced migration. Your task is to simply classify the provided text, solely based on the evidence in it."},
               {"role": "user", "content": prompt}]

    # Different LLMs need different kinds of command
    if provider == 'gpt':
        client = api_def('chatgpt')
        response = client.responses.create(
            model=model,
            input=messages,
            temperature=temperature,
            max_output_tokens = MAX_TOKEN
        )
        return response.output_text
        
    elif provider == 'ds':
        client = api_def('deepseek')
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
            max_tokens = MAX_TOKEN
        )
        return response.choices[0].message.content

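    elif provider == 'gemini':
        api_def('gemini')
        model_gemini = genai.GenerativeModel(model)
        # Note: only the user prompt is sent here; the system message above is not passed to Gemini
        response = model_gemini.generate_content(
            prompt,
            generation_config={"temperature": temperature, "max_output_tokens": MAX_TOKEN * 5}
        )
        return response.text

    else:
        raise ValueError(f"Unknown provider: {provider}")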

In [9]:
CONTEXT = f"""
We are classifying academic papers based on the type of human migration they address. \n \n
While reviewing a paper, assign it one of the following codes based solely on the evidence presented in this specific paper: \n
'1': The paper explicitly focuses on or observes evidence for long-distance human migration. \n
'2': The paper explicitly focuses on or observes evidence for short-distance human migration. \n
'3': The paper explicitly focuses on or observes evidence for both long-distance and short-distance human migration. \n
'0': The paper does not provide any explicit information on distance of human migration. \n \n

Instructions for doing classification: \n
1.  While classifying a paper, focus only on its own content. Do not infer any missing information from e.g., past \
tasks. Only concentrate on sections that explicitly describe this exact work, such as 'abstract', 'objectives', \
'data', 'methods', 'results', and 'conclusions', etc., rather than those that primarily summarize others' \
work, like 'introduction', 'literature review' or 'related works' etc. \n
2.  Classification of migration type should be based on interpreting context. This includes not only extracting \
explicit mentions of 'long-distance human migration' or 'short-distance human migration', but also reasoning \
according to the context where for example, migration distance, origins and destinations, borders crossing etc. \
are addressed. \n \n

Return only the single character without any other thing such as explanations, opinions or meta-comments. 
"""

In [10]:
def get_response(paper: dict, example, provider, model, temperature):

    fail_count = 0
    while True:
        if example == 3:
            prompt = f"""
            <Group leader>: {CONTEXT} \n
            Now please assign a character for the following paper: \n
            \"\"\"{shot[0]}\"\"\"
            \n
            <Expert>: My answer is: 1 \n
            <Group leader>: Please assign a character for another paper: \n
            \"\"\"{shot[1]}\"\"\"
            \n
            <Expert>: My answer is: 3 \n
            <Group leader>: Please assign a character for another paper: \n
            \"\"\"{shot[2]}\"\"\"
            \n
            <Expert>: My answer is: 2 \n
            
            <Group leader>: Please assign a character for this paper like what the expert did: \n
            {paper}\n
            Keep in mind that output is only one character.
            """

        elif example == 1:
            prompt = f"""
            <Group leader>: {CONTEXT} \n
            Now please assign a character for the following paper: \n
            \"\"\"{shot[0]}\"\"\"
            \n
            <Expert>: My answer is: 1 \n
            
            <Group leader>: Please assign a character for this paper like what the expert did: \n
            {paper}\n
            Keep in mind that output is only one character.
            """

        elif example == 0:
            prompt = f"""
            <Group leader>: {CONTEXT} \n
            Please assign a character for the following paper: \n
            {paper}\n
            Keep in mind that output is only one character.
            """

        response = get_completion(prompt, provider, model, temperature)
        return response
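
Because the prompt asks for a single character, it can help to check each reply before storing it. A minimal sketch, assuming a hypothetical helper `parse_code` (not part of this notebook) that accepts only replies consisting of exactly one of the four codes and returns `None` otherwise:

```python
from typing import Optional

VALID_CODES = {"0", "1", "2", "3"}  # the four codes defined in CONTEXT

def parse_code(reply: Optional[str]) -> Optional[str]:
    """Return the code if the reply is exactly one valid character, else None."""
    if reply is None:
        return None
    # Strip whitespace and quote marks the model may wrap around the answer
    cleaned = reply.strip().strip("'\"")
    return cleaned if cleaned in VALID_CODES else None

print(parse_code(" 3\n"))             # 3
print(parse_code("'1'"))              # 1
print(parse_code("My answer is: 1"))  # None
```

Counting how often `None` comes back would show how strictly each model follows the output instruction, separately from whether the code itself is right.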

### Organization

For each model and each scenario, we consider five temperatures and repeat each setting 10 times to test the influence of the temperature parameter. The same procedure is applied to all 30 papers.

In [39]:
def response_agg(literature, example, provider, model, idx):
    results = []
    fail_count = 0
    temperatures = np.round(np.arange(0.0, 2.0, 0.4), 1)
        
    for t in tqdm(temperatures, desc=f"File {idx+1}"):        
        for run_id in range(10):
            success = False
            while not success:
                try:
                    result = {"answer": get_response(literature, example, provider, model, t)}
                    result["provider"] = provider
                    result["model"] = model
                    result["few shot"] = example
                    result["temperature"] = t
                    result["run"] = run_id
                    results.append(result)
                    # Pause between calls to stay within the tokens-per-minute limit
                    time.sleep(1)
                    success = True
                except (RateLimitError, APIError, APITimeoutError, ResourceExhausted) as e: # AuthenticationError, BadRequestError, NotFoundError
                    # Wait briefly, then retry; give up after 10 failures for this paper
                    time.sleep(1)
                    fail_count = fail_count + 1
                    if fail_count >= 10:
                        raise ValueError("Can't get response, retried too many times") from e

    df = pd.DataFrame(results)
    return df
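
The retry loop above waits a fixed one second between attempts. A common alternative is exponential backoff; the sketch below is a generic illustration with a hypothetical `call_with_retry` helper and a simulated flaky function, not this notebook's actual logic:

```python
import time

def call_with_retry(func, max_retries=5, base_delay=0.01, retry_on=(Exception,)):
    """Call func(); on a listed exception wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * 2 ** attempt)

# Simulated API call that fails twice before succeeding
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "2"

answer = call_with_retry(flaky, retry_on=(TimeoutError,))
print(answer, attempts["n"])  # 2 3
```

In a real run `base_delay` would be on the order of seconds; the small value here only keeps the demonstration fast.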

In [13]:
def batch_response_to_df(all_literature, example, provider, model):

    save_path = f"results_lit/{provider}_{example}.csv"
    os.makedirs("results_lit", exist_ok=True)

    if os.path.exists(save_path):
        temporal_df = pd.read_csv(save_path)
        processed_ids = set(temporal_df["file_id"].astype(int).tolist())
        print(f"Start from paper {len(processed_ids)+1}")
    else:
        temporal_df = pd.DataFrame()
        processed_ids = set()
        
    all_results = [temporal_df] if not temporal_df.empty else []
    
    for idx, literature in enumerate(tqdm(all_literature.values(), desc="Progress")):
        file_id = idx + 1
        if file_id in processed_ids:
            continue
        
        df = response_agg(literature, example, provider, model, idx)
        df["file_id"] = file_id
        all_results.append(df)

        temporal_df = pd.concat(all_results, ignore_index=True)
        temporal_df.astype(str).to_csv(save_path, index=False)

    final_df = pd.concat(all_results, ignore_index=True)
    return final_df

In total: 30 documents × 3 models × 3 few-shot scenarios × 5 temperatures × 10 runs. This amounts to 13,500 API calls and a final result table of 13,500 rows.
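
The count can be checked by enumerating the full experimental grid. This is only a sketch: the model names and temperature grid are copied from the cells in this notebook, while the variable names are illustrative:

```python
from itertools import product

import numpy as np

n_papers = 30
models = ["gpt-4.1-mini", "deepseek-chat", "gemini-2.5-flash"]
shots = [0, 1, 3]
temperatures = [float(t) for t in np.round(np.arange(0.0, 2.0, 0.4), 1)]
runs = range(10)

# One tuple per API call: (paper, model, few-shot scenario, temperature, run)
grid = list(product(range(n_papers), models, shots, temperatures, runs))
print(temperatures)  # [0.0, 0.4, 0.8, 1.2, 1.6]
print(len(grid))     # 13500
```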

### Results

In [14]:
gpt41_0 = batch_response_to_df(literature, 0, 'gpt', 'gpt-4.1-mini')

In [15]:
gpt41_1 = batch_response_to_df(literature, 1, 'gpt', 'gpt-4.1-mini')

In [37]:
gpt41_3 = batch_response_to_df(literature, 3, 'gpt', 'gpt-4.1-mini')

Start from paper 7


In [17]:
ds32_0 = batch_response_to_df(literature, 0, 'ds', 'deepseek-chat')

In [18]:
ds32_1 = batch_response_to_df(literature, 1, 'ds', 'deepseek-chat')

In [19]:
ds32_3 = batch_response_to_df(literature, 3, 'ds', 'deepseek-chat')

In [23]:
gemini25_0 = batch_response_to_df(literature, 0, 'gemini', 'gemini-2.5-flash')

In [24]:
gemini25_1 = batch_response_to_df(literature, 1, 'gemini', 'gemini-2.5-flash')

In [45]:
gemini25_3 = batch_response_to_df(literature, 3, 'gemini', 'gemini-2.5-flash')

Start from paper 23


### Evaluation

We first compute the accuracy of the classification results against human annotation, obtaining one accuracy value for each combination of model, few-shot scenario, and temperature.

In [119]:
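manual_class = pd.read_excel('manual.xlsx')
# Combine the two annotation columns into one code: 0 = neither, 1 = long, 2 = short, 3 = both
manual_class["class"] = (manual_class["Long distance"].astype(int) + manual_class["Short distance"].astype(int)*2).astype(str)
# Drop the first three rows and the columns that are not needed for evaluation
manual_class = manual_class.drop(index=range(0, 3), columns=["ID", "AUTHOR", "TITLE", "Long distance", "Short distance"]).reset_index(drop=True)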
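
The `class` column packs the two annotation flags into the four prompt codes. A small sketch with made-up flag values (the real `manual.xlsx` is not reproduced here, and 0/1 flags are an assumption):

```python
import pandas as pd

# Hypothetical annotations: 1 means the paper addresses that migration type
toy = pd.DataFrame({
    "Long distance":  [0, 1, 0, 1],
    "Short distance": [0, 0, 1, 1],
})

# long + 2 * short  ->  0 = neither, 1 = long, 2 = short, 3 = both
toy["class"] = (toy["Long distance"].astype(int) + toy["Short distance"].astype(int) * 2).astype(str)
print(toy["class"].tolist())  # ['0', '1', '2', '3']
```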

In [130]:
df_dict = {}

for file_name in os.listdir('results_lit/'):
    if file_name.endswith('.csv'):
        file_path = os.path.join('results_lit/', file_name)

        df = pd.read_csv(file_path)
        df_name = file_name.replace('.csv', '')
        df_dict[df_name] = df

LLM_class = pd.concat(df_dict.values(), axis=0, ignore_index=True)
# file_id is 1-based in the result files; shift it to 0-based to match manual_class's index
LLM_class['file_id'] = LLM_class['file_id'].astype(int) - 1
LLM_class['label'] = (LLM_class['file_id'].map(manual_class['class'])).astype(str).str.strip()
LLM_class['answer'] = LLM_class['answer'].astype(str).str.strip()

In [133]:
LLM_class['accuracy'] = (LLM_class['label'] == LLM_class['answer']).astype(int)
LLM_new = LLM_class.drop(columns=['provider', 'answer', 'label']).reset_index(drop=True)
LLM_summary = LLM_new.groupby(['model', 'few shot', 'temperature']).agg({'accuracy': 'mean'}).reset_index()
LLM_summary

| | model | few shot | temperature | accuracy |
|---|---|---|---|---|
| 0 | deepseek-chat | 0 | 0.0 | 0.273333 |
| 1 | deepseek-chat | 0 | 0.4 | 0.286667 |
| 2 | deepseek-chat | 0 | 0.8 | 0.286667 |
| 3 | deepseek-chat | 0 | 1.2 | 0.28 |
| 4 | deepseek-chat | 0 | 1.6 | 0.276667 |
| 5 | deepseek-chat | 1 | 0.0 | 0.233333 |
| 6 | deepseek-chat | 1 | 0.4 | 0.243333 |
| 7 | deepseek-chat | 1 | 0.8 | 0.256667 |
| 8 | deepseek-chat | 1 | 1.2 | 0.223333 |
| 9 | deepseek-chat | 1 | 1.6 | 0.246667 |


In [136]:
# Random baseline: one uniformly drawn code (0-3) for each of 30 papers x 10 runs
file_ids = np.repeat(np.arange(30), 10)
random_class = np.random.randint(0, 4, size=300)

random = pd.DataFrame({
    'file_id': file_ids,
    'answer': random_class
})

In [137]:
random['label'] = (random['file_id'].map(manual_class['class'])).astype(str).str.strip()
random['answer'] = random['answer'].astype(str).str.strip()
random['accuracy'] = (random['label'] == random['answer']).astype(int)
random = random.drop(columns=['answer']).reset_index(drop=True)
random_summary = random['accuracy'].mean()
random_summary

np.float64(0.26)
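
For reference, a uniform guess over four codes has an expected accuracy of exactly 1/4 whatever the label distribution, because each label is matched with probability 1/4; the sampled 0.26 above is one noisy realization of that. A short check, using a hypothetical label distribution rather than the real annotations:

```python
from fractions import Fraction

# Hypothetical share of papers per true code (must sum to 1)
label_share = {"0": Fraction(1, 5), "1": Fraction(1, 10), "2": Fraction(1, 2), "3": Fraction(1, 5)}
p_guess = Fraction(1, 4)  # uniform guess over the four codes

# P(correct) = sum over codes of P(label = k) * P(guess = k)
expected_accuracy = sum(share * p_guess for share in label_share.values())
print(expected_accuracy)  # 1/4
```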

In [138]:
def plot(df, model, few_shot, pos):    
    subset = df[(df['model'] == model)]
    
    plt.figure(figsize=(8,6))
    plt.title(f"{model} - Accuracy", fontsize=24, pad=15)

    few_shot_values = subset['few shot'].unique()
    colors = {'gpt-4.1-mini': '#036aff', 'deepseek-chat': '#9cafff', 'gemini-2.5-flash': '#ff7a40'}
    color = colors.get(model, '#036aff')
    line_styles = ['-', '--', '-.']

    plt.axhline(y=random_summary, color='black', linestyle=':', linewidth=2, label='random')
    for i, few_shot in enumerate(few_shot_values):
        few_shot_data = subset[subset['few shot'] == few_shot]
        line_style = line_styles[i % len(line_styles)]
        plt.plot(few_shot_data['temperature'], few_shot_data['accuracy'], marker='o', markersize=6, linewidth=2, \
                color=color, linestyle=line_style, label=f'{few_shot}-shot')
    
    plt.xlabel('Temperature', fontsize=18)
    plt.ylabel('Accuracy', fontsize=18)
    plt.grid(True, alpha=0.3)
    plt.ylim(0.2, 0.6)
    plt.xticks(np.arange(0, 2, 0.4), fontsize=14)
    plt.yticks(np.arange(0.2, 0.7, 0.1), fontsize=14)
    plt.legend(title='Scenario', title_fontsize=18, fontsize=14, loc='best', framealpha=0.9)
    os.makedirs(pos, exist_ok=True)  # make sure the output directory exists
    filepath = os.path.join(pos, f"{model}.png")
    plt.savefig(filepath, dpi=300, bbox_inches='tight')
    plt.close()

In [139]:
# plot() draws every few-shot scenario of a model in one figure, so one call per model is enough
for model in LLM_summary['model'].unique():
    plot(LLM_summary, model, None, 'plots/lit')

The LLM classification results diverge markedly from the human annotations. For DeepSeek in particular, accuracy is indistinguishable from random chance across scenarios and temperature settings. Let's take a brief look at the confusion matrices.

In [148]:
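def conf_matrix(df, model, few_shot, temperature):
    subset = df[(df['model'] == model) & (df['few shot'] == few_shot) & (df['temperature'] == temperature)]
    # Rows are human labels, columns are LLM answers; fixing the label order keeps every matrix 4x4
    codes = ['0', '1', '2', '3']
    matrix = pd.DataFrame(sklearn.metrics.confusion_matrix(subset['label'], subset['answer'], labels=codes),
                          index=codes, columns=codes)
    return matrix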

In [149]:
os.makedirs("analysis_lit", exist_ok=True)  # to_csv does not create missing directories
for model in LLM_class['model'].unique():
    for few_shot in LLM_class['few shot'].unique():
        for temperature in LLM_class['temperature'].unique():
            conf_matrix(LLM_class, model, few_shot, temperature).to_csv(f"analysis_lit/{model}_{few_shot}_{temperature}.csv", index=False)

In [150]:
sklearn.metrics.confusion_matrix(LLM_class['label'], LLM_class['answer'])

array([[ 773,  363,  815,  749],
       [ 161,  699,   24,  466],
       [2117,  833, 2053, 1297],
       [ 980,  375,  409, 1386]])
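
Overall accuracy and per-class recall can be read directly off this matrix. The sketch below assumes scikit-learn's convention (rows are true labels, columns are predictions) and that the rows are ordered as codes 0 to 3:

```python
import numpy as np

# Pooled confusion matrix copied from the output above
cm = np.array([
    [ 773,  363,  815,  749],
    [ 161,  699,   24,  466],
    [2117,  833, 2053, 1297],
    [ 980,  375,  409, 1386],
])

accuracy = np.trace(cm) / cm.sum()     # correct answers / all answers
recall = np.diag(cm) / cm.sum(axis=1)  # per true code: share answered correctly

print(cm.sum())            # 13500
print(round(accuracy, 3))  # 0.364
print(np.round(recall, 3))
```

The total of 13,500 matches the number of API calls planned for the experiment.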

In [154]:
subset = LLM_class[(LLM_class['model'] == 'gpt-4.1-mini')]
sklearn.metrics.confusion_matrix(subset['label'], subset['answer'])

array([[  65,  130,  361,  344],
       [   0,  251,   22,  177],
       [ 163,  233, 1193,  511],
       [ 108,  120,  265,  557]])