### Introduction

Driven by theoretical advances, big data, and better computer hardware, autoregressive large language models (LLMs) have shown a remarkable capacity to generate and understand complex, human-like language. Yet these models are typically trained on general corpora and often struggle with domains such as geography, which is inherently local, contextual, and interdisciplinary, shaped by diverse social, environmental, and political processes (Janowicz et al., 2025). On the other hand, proper and systematic prompting has been shown to help LLMs adapt to such tasks (OpenAI, 2022; Bhandari et al., 2023; Majic et al., 2024). Useful techniques include providing a few examples of the domain-specific task, giving the model time to reason through a chain of thought, and stating the task explicitly. In this experiment, I use several popular LLMs to extract knowledge from unstructured climate mobility literature, with two aims. The first is to apply them, as advances in NLP, to conduct downstream tasks efficiently and precisely, including literature review and the construction of a domain-specific graph foundation model for reliable, transparent, and explainable retrieval and reasoning. The second is AI benchmarking: to demonstrate how domain-specific, knowledge-rich contexts can be better understood by AI, and to test the models' geo-alignment.

I take the review paper by Borderon et al. (2019) and its coding system as ground truth to benchmark 3 different LLMs under 0-shot and 3-shot learning and at different temperatures. For each task, I run the model 20 times to compare accuracy.

### Libraries

In [74]:
from openai import OpenAI, RateLimitError, APIError, APITimeoutError, AuthenticationError, BadRequestError, NotFoundError
from google.api_core.exceptions import ResourceExhausted, RetryError, DeadlineExceeded
import google.generativeai as genai
import tiktoken

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import time
import os
import ast
from tqdm import tqdm

### Layout

In [2]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_seq_items', None)

### Data

In [3]:
literature = np.load('literature.npy', allow_pickle=True).item()
example = np.load('example.npy', allow_pickle=True).item()

### General commands for getting response

In [10]:
# Reference answer in the expected output format, used to estimate the token budget of one response
text = "{'Qualitative method': '1', 'Quantitative method': '1', 'Socio-demo-economic data': '1', 'Environmental data': '1', \
'Individuals': '0', 'Households': '1', 'Subnational groups': '1', 'National groups': '0', 'International groups': '0', \
'Urban': '0', 'Rural': '1', 'Time frame considered': '1', 'Foresight': '0', \
'Rainfall pattern / Variability': '1', 'Temperature change': '0', 'Food scarcity / Famine / Food security': '1', 'Drought / Aridity / Desertification': '1', \
'Floods': '1', 'Erosion / Soil fertility / Land degradation / Deforestation / Salinisation': '1', 'Self assessment / Perceived environment': '1', \
'Labour migration': '1', 'Marriage migration': '0', 'Refugees': '0', 'International migration': '0', 'Cross-border migration': '0', 'Internal migration': '1', \
'Rural to urban': '1', 'Rural to rural': '1', 'Circular / Seasonal': '1', 'Long distance': '1', 'Short distance': '0', 'Temporal': '1', 'Permanent': '0', \
'Age': '1', 'Gender': '0', 'Ethnicity / Religion': '0'}"
enc = tiktoken.encoding_for_model("gpt-4.1-mini")
MAX_TOKEN = len(enc.encode(text))

In [9]:
def api_def(provider):
    # API keys are hidden in another file
    with open('api.txt', 'r', encoding='utf-8') as file:
        lines = [line.strip() for line in file.readlines()]

        # Different LLMs need different kinds of command
        if provider == 'chatgpt':
            client = OpenAI(api_key=str(lines[0]), base_url="https://api.openai.com/v1")
            return client
        
        elif provider == 'deepseek':
            client = OpenAI(api_key=str(lines[1]), base_url="https://api.deepseek.com")
            return client
        
        elif provider == 'gemini':
            genai.configure(api_key=str(lines[2]))
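
The positional lookup above (`lines[0]`, `lines[1]`, `lines[2]`) depends on the order of the keys in `api.txt`. A minimal sketch of a named lookup follows, assuming the same one-key-per-line layout. `parse_api_keys` and `KEY_ORDER` are hypothetical names that do not appear in the notebook.

```python
# Assumed order of the keys in the file, mirroring lines[0..2] in api_def.
KEY_ORDER = ("chatgpt", "deepseek", "gemini")


def parse_api_keys(text):
    """Map provider names to keys, given the raw contents of the key file."""
    keys = [line.strip() for line in text.splitlines() if line.strip()]
    if len(keys) < len(KEY_ORDER):
        raise ValueError(f"expected {len(KEY_ORDER)} keys, found {len(keys)}")
    # Extra lines beyond the first three are ignored by zip().
    return dict(zip(KEY_ORDER, keys))


# Placeholder strings stand in for real keys.
demo_keys = parse_api_keys("key-for-openai\nkey-for-deepseek\nkey-for-gemini\n")
print(sorted(demo_keys))
```

With real data the text would come from something like `open('api.txt', encoding='utf-8').read()`. A lookup such as `demo_keys['deepseek']` then fails loudly with a `KeyError` on a misspelled provider instead of silently returning `None`.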

In [11]:
def get_completion(prompt, provider, model, temperature):
    messages = [{"role": "system", "content": "You are an expert in climate mobility area and are conducting a systematic literature review. \
    Your task is to read the provided text and classify it according to the given properties with binary codes. \
    Do not include explanations, personal opinions or provide unrelated meta-comments in the answer. Provide only the classification results."},
               {"role": "user", "content": prompt}]

    # Different LLMs need different kinds of command
    if provider == 'gpt':
        client = api_def('chatgpt')
        response = client.responses.create(
            model=model,
            input=messages,
            temperature=temperature,
            max_output_tokens = round(MAX_TOKEN + 3)
        )
        return response.output_text
        
    elif provider == 'ds':
        client = api_def('deepseek')
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
            max_tokens = round(MAX_TOKEN * 1.2)
        )
        return response.choices[0].message.content

    elif provider == 'gemini':
        api_def('gemini')
        model_gemini = genai.GenerativeModel(model)
        # Note: only the user prompt is sent; the system message in `messages` is not forwarded to Gemini
        response = model_gemini.generate_content(
            prompt,
            generation_config={"temperature": temperature, "max_output_tokens": round(MAX_TOKEN * 1.3)}
        )
        return response.text        
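
Each provider branch above caps the response length with a different rule based on `MAX_TOKEN`. The sketch below collects the three rules in one place so they can be compared. `output_token_cap` is a hypothetical helper, and 400 is an arbitrary stand-in for `MAX_TOKEN`, not the measured token count.

```python
def output_token_cap(provider, max_token):
    """Response-length cap per provider, copied from the branches of get_completion."""
    if provider == "gpt":
        return round(max_token + 3)      # almost no slack beyond the reference answer
    if provider == "ds":
        return round(max_token * 1.2)    # 20% slack
    if provider == "gemini":
        return round(max_token * 1.3)    # 30% slack
    raise ValueError(f"unknown provider: {provider!r}")


for name in ("gpt", "ds", "gemini"):
    print(name, output_token_cap(name, 400))
```

The comparison makes one design consequence visible: the `gpt` branch leaves only three tokens of headroom, so a response wrapped in a Markdown code fence may be cut off before the closing brace and then fail to parse.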

### Prompt and response processing -- return Python dictionary

In [12]:
CONTEXT = f"""
We are conducting a systematic review of climate mobility literature. For each paper, your task is to assign some binary codes, namely '0' or '1', \
according to the specific indicators that I will provide, to classify the paper across multiple dimensions for downstream analysis. \n

While reviewing a paper, evaluate *only the content of this paper*. Do not use information from any other paper. Do not infer missing information from \
past tasks or earlier papers. Focus primarily on the abstract, data/methods, results, and conclusions sections. \
Do not rely on introduction, state-of-the-art, related works, or motivation sections, especially when they describe other researchers’ work. \
Only use information from those sections if they explicitly describe what *this paper* does. \n

For each indicator, output one and only one character: '0' (for 'No') or '1' (for 'Yes'), according to the concrete explanation below. Do not provide explanations. \
Each indicator is independent, do not infer one indicator based on another. Please follow strictly the format I will specify later. \
When information is missing or unclear, just assign '0'.
"""

In [13]:
ZERO_SHOT_PROMPT = f"""

<Group Leader>: 
{CONTEXT} \n

The following 36 indicators will be used to categorize each paper, regarding its content: \n
a. Methods used for data analysis (refers only to analytical methods, not methods for data collection) (2 indicators): 1. 'Qualitative method' and \
2. 'Quantitative method'. Assign '0' for 'Not explicitly used' or '1' for 'Explicitly used'. A paper may use both (some researchers call it \
'mixed-method research'), one, or even neither of them. \n
b. Types of data used for analysis (only commenting on such points doesn't count) (2 indicators): 3. 'Socio-demo-economic data' ('demo' means \
'demographic') and 4. 'Environmental data'. Assign '0' for 'Not explicitly used' or '1' for 'Explicitly used'. Please do not care about how data is \
collected. A paper may include both, one, or even neither of them. \n
c. Focal demographic units of analysis (refers to the units of analysis, not to scale of research areas) (5 indicators): 5. 'Individuals', \
6. 'Households', 7. 'Subnational groups' (such as community and province), 8. 'National groups', 9. 'International groups'. Assign '0' for \
'Not explicitly concentrated' or '1' for 'Explicitly concentrated'. A paper may consider several, one or even none of them. \n
d. Research location types (refers to the places where investigation took place, not the places regarding migration) (2 indicators): 10. 'Urban', \
11. 'Rural'. Assign '0' for 'Not explicitly focused' or '1' for 'Explicitly focused'. A paper may focus on both, one, or even neither of them. \n
e. Temporal aspects (2 indicators): 12. 'Time frame considered' (such as using temporal analysis), 13. 'Foresight' (i.e., forecast, prediction, and \
future perspectives was/were addressed). Assign '0' for 'Not explicitly focused' or '1' for 'Explicitly focused'. \
A paper may consider both, one, or neither of them. \n
f. Environmental stressors as variables (only assign '1' when a stressor is explicitly used for analysis, while only mentioning without evidence \
doesn't count) (7 indicators): 14. 'Rainfall pattern / Variability', 15. 'Temperature change', 16. 'Food scarcity / Famine / Food security', \
17. 'Drought / Aridity / Desertification', 18. 'Floods', 19. 'Erosion / Soil fertility / Land degradation / Deforestation / Salinisation', \
20. 'Self assessment / Perceived environment'. Note that indicator 20 indicates that the above stressors measured via human perception were \
included in the paper, no matter if data from observation was included. Assign '0' for 'Not explicitly considered' or '1' for 'Explicitly considered'. \
For indicators 14 to 19, a paper may include several, one, or none of them. \n 
g. Types of migration discussed (only assign '1' when a type of migration is explicitly mentioned with evidence, while only discussing without any \
evidence doesn't count) (8 indicators): 21. 'Labour migration' (migration related to work), 22. 'Marriage migration', 23. 'Refugees', \
24. 'International migration', 25. 'Cross-border migration' (compared to 'International migration' which indicates those migration with longer \
distance, 'Cross-border migration' denotes those migration that people move only from one nation to an adjacent one), 26. 'Internal migration' (those \
migration within one single nation), 27. 'Rural to urban', 28. 'Rural to rural'. A paper might mention several, one, or none of them. \n
h. Mobility/migration patterns discussed (only assign '1' when a migration pattern is explicitly mentioned with evidence) (5 indicators): \
29. 'Circular / Seasonal', 30. 'Long distance', 31. 'Short distance', 32. 'Temporal' (migrants may come back), and 33. 'Permanent' (migrants will not \
come back). A paper can discuss several, one, or none of them. \n
i. Specific demographic characteristics as variables (only assign '1' when a characteristic is explicitly used for analysis, while only mentioning \
without evidence doesn't count) (3 indicators): 34. 'Age', 35. 'Gender', 36. 'Ethnicity / Religion'. A paper may use several, one, or none of them. \n

Always output the result in the following Python Dictionary structure with identical order, with each blank replaced by 0 or 1 only, and no extra text: \n
{{'Qualitative method': '_', 'Quantitative method': '_', 'Socio-demo-economic data': '_', 'Environmental data': '_', \
'Individuals': '_', 'Households': '_', 'Subnational groups': '_', 'National groups': '_', 'International groups': '_', \
'Urban': '_', 'Rural': '_', 'Time frame considered': '_', 'Foresight': '_', \
'Rainfall pattern / Variability': '_', 'Temperature change': '_', 'Food scarcity / Famine / Food security': '_', 'Drought / Aridity / Desertification': '_', \
'Floods': '_', 'Erosion / Soil fertility / Land degradation / Deforestation / Salinisation': '_', 'Self assessment / Perceived environment': '_', \
'Labour migration': '_', 'Marriage migration': '_', 'Refugees': '_', 'International migration': '_', 'Cross-border migration': '_', 'Internal migration': '_', \
'Rural to urban': '_', 'Rural to rural': '_', 'Circular / Seasonal': '_', 'Long distance': '_', 'Short distance': '_', 'Temporal': '_', 'Permanent': '_', \
'Age': '_', 'Gender': '_', 'Ethnicity / Religion': '_'}}
"""

In [14]:
ONE_SHOT_PROMPT = f"""

{ZERO_SHOT_PROMPT} \n

Now please try to code the following paper: \n
\"\"\"{example[0]}\"\"\"
\n
<Climate Mobility Expert>: Here is my result: \n
{{'Qualitative method': '1', 'Quantitative method': '1', 'Socio-demo-economic data': '1', 'Environmental data': '1', \
'Individuals': '0', 'Households': '1', 'Subnational groups': '1', 'National groups': '0', 'International groups': '0', \
'Urban': '0', 'Rural': '1', 'Time frame considered': '1', 'Foresight': '0', \
'Rainfall pattern / Variability': '1', 'Temperature change': '0', 'Food scarcity / Famine / Food security': '1', 'Drought / Aridity / Desertification': '1', \
'Floods': '1', 'Erosion / Soil fertility / Land degradation / Deforestation / Salinisation': '1', 'Self assessment / Perceived environment': '1', \
'Labour migration': '1', 'Marriage migration': '0', 'Refugees': '0', 'International migration': '0', 'Cross-border migration': '0', 'Internal migration': '1', \
'Rural to urban': '1', 'Rural to rural': '1', 'Circular / Seasonal': '1', 'Long distance': '1', 'Short distance': '0', 'Temporal': '1', 'Permanent': '0', \
'Age': '1', 'Gender': '0', 'Ethnicity / Religion': '0'}}
"""

In [15]:
THREE_SHOT_PROMPT = f"""

{ZERO_SHOT_PROMPT} \n

Now please try to code the following 3 papers: \n
\"\"\"{example}\"\"\"
\n
<Climate Mobility Expert>: OK, here is my result: \n
{{'Qualitative method': '1', 'Quantitative method': '1', 'Socio-demo-economic data': '1', 'Environmental data': '1', \
'Individuals': '0', 'Households': '1', 'Subnational groups': '1', 'National groups': '0', 'International groups': '0', \
'Urban': '0', 'Rural': '1', 'Time frame considered': '1', 'Foresight': '0', \
'Rainfall pattern / Variability': '1', 'Temperature change': '0', 'Food scarcity / Famine / Food security': '1', 'Drought / Aridity / Desertification': '1', \
'Floods': '1', 'Erosion / Soil fertility / Land degradation / Deforestation / Salinisation': '1', 'Self assessment / Perceived environment': '1', \
'Labour migration': '1', 'Marriage migration': '0', 'Refugees': '0', 'International migration': '0', 'Cross-border migration': '0', 'Internal migration': '1', \
'Rural to urban': '1', 'Rural to rural': '1', 'Circular / Seasonal': '1', 'Long distance': '1', 'Short distance': '0', 'Temporal': '1', 'Permanent': '0', \
'Age': '1', 'Gender': '0', 'Ethnicity / Religion': '0'}}, \
{{'Qualitative method': '0', 'Quantitative method': '1', 'Socio-demo-economic data': '1', 'Environmental data': '1', \
'Individuals': '0', 'Households': '1', 'Subnational groups': '0', 'National groups': '0', 'International groups': '0', \
'Urban': '1', 'Rural': '1', 'Time frame considered': '1', 'Foresight': '1', \
'Rainfall pattern / Variability': '1', 'Temperature change': '1', 'Food scarcity / Famine / Food security': '0', 'Drought / Aridity / Desertification': '0', \
'Floods': '0', 'Erosion / Soil fertility / Land degradation / Deforestation / Salinisation': '0', 'Self assessment / Perceived environment': '0', \
'Labour migration': '1', 'Marriage migration': '0', 'Refugees': '0', 'International migration': '1', 'Cross-border migration': '0', 'Internal migration': '1', \
'Rural to urban': '1', 'Rural to rural': '0', 'Circular / Seasonal': '1', 'Long distance': '1', 'Short distance': '1', 'Temporal': '1', 'Permanent': '1', \
'Age': '0', 'Gender': '0', 'Ethnicity / Religion': '0'}}, \
{{'Qualitative method': '0', 'Quantitative method': '1', 'Socio-demo-economic data': '1', 'Environmental data': '1', \
'Individuals': '1', 'Households': '1', 'Subnational groups': '0', 'National groups': '0', 'International groups': '0', \
'Urban': '0', 'Rural': '1', 'Time frame considered': '1', 'Foresight': '0', \
'Rainfall pattern / Variability': '0', 'Temperature change': '0', 'Food scarcity / Famine / Food security': '0', 'Drought / Aridity / Desertification': '0', \
'Floods': '0', 'Erosion / Soil fertility / Land degradation / Deforestation / Salinisation': '1', 'Self assessment / Perceived environment': '0', \
'Labour migration': '1', 'Marriage migration': '1', 'Refugees': '0', 'International migration': '0', 'Cross-border migration': '0', 'Internal migration': '1', \
'Rural to urban': '1', 'Rural to rural': '0', 'Circular / Seasonal': '1', 'Long distance': '0', 'Short distance': '1', 'Temporal': '1', 'Permanent': '1', \
'Age': '1', 'Gender': '1', 'Ethnicity / Religion': '0'}}
"""

In [16]:
THREE_SHOT_SEPARATE_PROMPT = f"""

{ZERO_SHOT_PROMPT} \n

Now please try to code the following paper: \n
\"\"\"{example[0]}\"\"\"
\n
<Climate Mobility Expert>: OK, here is my result: \n
{{'Qualitative method': '1', 'Quantitative method': '1', 'Socio-demo-economic data': '1', 'Environmental data': '1', \
'Individuals': '0', 'Households': '1', 'Subnational groups': '1', 'National groups': '0', 'International groups': '0', \
'Urban': '0', 'Rural': '1', 'Time frame considered': '1', 'Foresight': '0', \
'Rainfall pattern / Variability': '1', 'Temperature change': '0', 'Food scarcity / Famine / Food security': '1', 'Drought / Aridity / Desertification': '1', \
'Floods': '1', 'Erosion / Soil fertility / Land degradation / Deforestation / Salinisation': '1', 'Self assessment / Perceived environment': '1', \
'Labour migration': '1', 'Marriage migration': '0', 'Refugees': '0', 'International migration': '0', 'Cross-border migration': '0', 'Internal migration': '1', \
'Rural to urban': '1', 'Rural to rural': '1', 'Circular / Seasonal': '1', 'Long distance': '1', 'Short distance': '0', 'Temporal': '1', 'Permanent': '0', \
'Age': '1', 'Gender': '0', 'Ethnicity / Religion': '0'}}, \n
<Prof>: Please try to code another paper below: \n
\"\"\"{example[1]}\"\"\"
\n
<Climate Mobility Expert>: Here is my result: \n
{{'Qualitative method': '0', 'Quantitative method': '1', 'Socio-demo-economic data': '1', 'Environmental data': '1', \
'Individuals': '0', 'Households': '1', 'Subnational groups': '0', 'National groups': '0', 'International groups': '0', \
'Urban': '1', 'Rural': '1', 'Time frame considered': '1', 'Foresight': '1', \
'Rainfall pattern / Variability': '1', 'Temperature change': '1', 'Food scarcity / Famine / Food security': '0', 'Drought / Aridity / Desertification': '0', \
'Floods': '0', 'Erosion / Soil fertility / Land degradation / Deforestation / Salinisation': '0', 'Self assessment / Perceived environment': '0', \
'Labour migration': '1', 'Marriage migration': '0', 'Refugees': '0', 'International migration': '1', 'Cross-border migration': '0', 'Internal migration': '1', \
'Rural to urban': '1', 'Rural to rural': '0', 'Circular / Seasonal': '1', 'Long distance': '1', 'Short distance': '1', 'Temporal': '1', 'Permanent': '1', \
'Age': '0', 'Gender': '0', 'Ethnicity / Religion': '0'}}, \n
<Prof>: Good job! Now please code one more paper: \n
\"\"\"{example[2]}\"\"\"
\n
<Climate Mobility Expert>: No problem, here is my result: \n
{{'Qualitative method': '0', 'Quantitative method': '1', 'Socio-demo-economic data': '1', 'Environmental data': '1', \
'Individuals': '1', 'Households': '1', 'Subnational groups': '0', 'National groups': '0', 'International groups': '0', \
'Urban': '0', 'Rural': '1', 'Time frame considered': '1', 'Foresight': '0', \
'Rainfall pattern / Variability': '0', 'Temperature change': '0', 'Food scarcity / Famine / Food security': '0', 'Drought / Aridity / Desertification': '0', \
'Floods': '0', 'Erosion / Soil fertility / Land degradation / Deforestation / Salinisation': '1', 'Self assessment / Perceived environment': '0', \
'Labour migration': '1', 'Marriage migration': '1', 'Refugees': '0', 'International migration': '0', 'Cross-border migration': '0', 'Internal migration': '1', \
'Rural to urban': '1', 'Rural to rural': '0', 'Circular / Seasonal': '1', 'Long distance': '0', 'Short distance': '1', 'Temporal': '1', 'Permanent': '1', \
'Age': '1', 'Gender': '1', 'Ethnicity / Religion': '0'}}
"""
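
The few-shot prompts above repeat the same dictionary literals several times, which makes typos easy to introduce. A minimal sketch of assembling the separate-turn variant from (paper, label) pairs is given below. `build_few_shot`, the shortened instruction string, and the two-indicator labels are illustrative assumptions, not the notebook's actual prompt or data.

```python
def build_few_shot(instructions, shots):
    """Join instructions and (paper_text, label_dict) pairs into one prompt.

    Each shot becomes a request turn followed by an expert turn, in the
    style of THREE_SHOT_SEPARATE_PROMPT.
    """
    parts = [instructions.strip()]
    for i, (paper, labels) in enumerate(shots):
        if i == 0:
            opener = "Now please try to code the following paper:"
        else:
            opener = "<Prof>: Please try to code another paper below:"
        parts.append(f'{opener}\n"""{paper}"""')
        # repr() of a dict gives the single-quoted form used in the prompts
        parts.append(f"<Climate Mobility Expert>: Here is my result:\n{labels!r}")
    return "\n\n".join(parts)


demo = build_few_shot(
    "<Group Leader>: code each paper with '0' or '1'.",
    [("Paper A text", {"Urban": "0", "Rural": "1"}),
     ("Paper B text", {"Urban": "1", "Rural": "1"})],
)
print(demo)
```

Because the labels are real dictionaries rather than pasted text, the same objects could in principle also serve as ground truth when checking the examples.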

In [135]:
def extract_features(paper: dict, example, provider, model, temperature):

    fail_count = 0
    while True:
        # Build the prompt for the chosen few-shot setting
        # (stray quote characters left over from string concatenation have been removed)
        if example == 3:
            prompt = f"""
            Please answer in a consistent style, performing the following actions step by step: \n
            1 - Read the instructions and the examples, understand the leader's requirements and the expert's work on coding papers. \n
            2 - You will be provided with a paper. Please read and give your answer according to the leader's requirement \
            just as what the expert did. \n
            You just need to return the final result of step 2, and make it as Python dictionary format as performed below. Do not explain reasoning. \n

            Here are the instructions and examples: \n
            {THREE_SHOT_PROMPT}\n
            Here is the new paper, please code it like the examples: \n
            {paper}\n
            Keep in mind that output format is Python dictionary, with no extra text.
            """

        elif example == '3s':
            prompt = f"""
            Please answer in a consistent style, performing the following actions step by step: \n
            1 - Read the instructions and the examples, understand the leader's requirements and the expert's work on coding papers. \n
            2 - You will be provided with a paper. Please read and give your answer according to the leader's requirement \
            just as what the expert did. \n
            You just need to return the final result of step 2, and make it as Python dictionary format as performed below. Do not explain reasoning. \n

            Here are the instructions and examples: \n
            {THREE_SHOT_SEPARATE_PROMPT}\n
            Here is the new paper, please code it like the examples: \n
            {paper}\n
            Keep in mind that output format is Python dictionary, with no extra text.
            """

        elif example == 1:
            prompt = f"""
            Please answer in a consistent style, performing the following actions step by step: \n
            1 - Read the instructions and the example, understand the leader's requirements and the expert's work on coding paper. \n
            2 - You will be provided with a paper. Please read and give your answer according to the leader's requirement \
            just as what the expert did. \n
            You just need to return the final result of step 2, and make it as Python dictionary format as performed below. Do not explain reasoning. \n

            Here are the instructions and example: \n
            {ONE_SHOT_PROMPT}\n
            Here is the new paper, please code it like the expert: \n
            {paper}\n
            Keep in mind that output format is Python dictionary, with no extra text.
            """

        elif example == 0:
            prompt = f"""
            Please answer in a consistent style, performing the following actions step by step: \n
            1 - Read the instructions, understand the leader's requirements regarding coding papers. \n
            2 - You will be provided with a paper. Please read and give your answer according to the leader's requirement. \n
            You just need to return the final result of step 2, and make it as Python dictionary format as performed below. Do not explain reasoning. \n

            Here are the instructions: \n
            {ZERO_SHOT_PROMPT}\n
            Here is the paper, please code it according to the requirements: \n
            {paper}\n
            Keep in mind that output format is Python dictionary, with no extra text.
            """

        else:
            # Without this branch an unknown setting would fail later with a NameError on `prompt`
            raise ValueError(f"Unknown few-shot setting: {example!r}")

        response = get_completion(prompt, provider, model, temperature)

        # Parse response when needed
        if response.strip().startswith("```"):
            lines = response.splitlines()
            if lines and lines[0].startswith("```"):
                lines = lines[1:]
            if lines and lines[-1].startswith("```"):
                lines = lines[:-1]
            response = "\n".join(lines)
        response = response.strip()

        # Convert the response text into a Python object
        try:
            data = ast.literal_eval(response)
            return data
        except Exception as e:
            fail_count += 1
            time.sleep(1)
            if fail_count >= 10:
                raise ValueError(f"Can't parse as Python dict: {response}, retried too many times") from e
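
The fence stripping and parsing steps above can be isolated into a small function that is easy to test without calling any API. The sketch below also checks that the parsed object is a dictionary containing every expected indicator, which the notebook code does not do. `parse_response` and the two-key example are hypothetical.

```python
import ast


def parse_response(response, expected_keys):
    """Strip an optional Markdown code fence, then parse a dict literal."""
    text = response.strip()
    if text.startswith("```"):
        lines = text.splitlines()[1:]              # drop the opening fence line
        if lines and lines[-1].strip().startswith("```"):
            lines = lines[:-1]                     # drop the closing fence line
        text = "\n".join(lines).strip()
    data = ast.literal_eval(text)                  # ValueError or SyntaxError on malformed input
    if not isinstance(data, dict):
        raise ValueError(f"expected a dict, got {type(data).__name__}")
    missing = set(expected_keys) - set(data)
    if missing:
        raise ValueError(f"missing indicators: {sorted(missing)}")
    return data


fenced = "```python\n{'Urban': '0', 'Rural': '1'}\n```"
print(parse_response(fenced, ["Urban", "Rural"]))
```

Checking the keys at parse time means a response with a misspelled or dropped indicator triggers a retry, rather than producing a missing value in the results table later.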

In [21]:
def extract_to_df(literature: dict, example, provider, model):
    results = []
    temperatures = np.round(np.arange(0.0, 2.0, 0.4), 1)

    # 5 temperatures, 10 times each
    for t in tqdm(temperatures, leave=True):        
        for run_id in range(10):
            success = False
            while not success:
                try:
                    features = extract_features(literature, example, provider, model, t)
                    features["provider"] = provider
                    features["model"] = model
                    features["few shot"] = example
                    features["temperature"] = t
                    features["run"] = run_id
                    results.append(features)
                    success = True
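                except (AuthenticationError, BadRequestError, NotFoundError):
                    # Permanent errors: retrying cannot succeed. They are subclasses of APIError,
                    # so they must be handled before the retry branch below.
                    raise
                except (RateLimitError, APIError, APITimeoutError, ResourceExhausted):
                    # Transient errors: wait briefly and retry
                    time.sleep(2)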

    df = pd.DataFrame(results)
    return df
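
The retry loop above waits a fixed 2 seconds and never gives up. A common alternative is a bounded retry with exponential backoff, sketched below. `call_with_backoff` and its parameters are hypothetical, and `TimeoutError` stands in for the API exceptions so that the example runs without any client library.

```python
import time


def call_with_backoff(fn, retry_on, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call fn(); on a listed exception wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise                      # give up after the last attempt
            sleep(base_delay * 2 ** attempt)


# Demo: a function that fails twice before succeeding.
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("transient failure")
    return "ok"


waits = []
print(call_with_backoff(flaky, (TimeoutError,), sleep=waits.append), waits)
```

In the notebook, `fn` could wrap the `extract_features` call and `retry_on` could be the tuple of transient exception classes. Passing `sleep` as a parameter is only a convenience for testing without real delays.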

### Results

In [22]:
gpt4o_0 = extract_to_df(literature[0], 0, 'gpt', 'gpt-4o-mini')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [07:09<00:00, 85.87s/it]


In [23]:
gpt4o_1 = extract_to_df(literature[0], 1, 'gpt', 'gpt-4o-mini')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [07:30<00:00, 90.16s/it]


In [24]:
gpt4o_3 = extract_to_df(literature[0], 3, 'gpt', 'gpt-4o-mini')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [08:16<00:00, 99.28s/it]


In [25]:
gpt4o_3s = extract_to_df(literature[0], '3s', 'gpt', 'gpt-4o-mini')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [07:31<00:00, 90.34s/it]


In [26]:
gpt41_0 = extract_to_df(literature[0], 0, 'gpt', 'gpt-4.1-mini')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [05:19<00:00, 63.99s/it]


In [27]:
gpt41_1 = extract_to_df(literature[0], 1, 'gpt', 'gpt-4.1-mini')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [04:56<00:00, 59.25s/it]


In [28]:
gpt41_3 = extract_to_df(literature[0], 3, 'gpt', 'gpt-4.1-mini')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [05:36<00:00, 67.34s/it]


In [29]:
gpt41_3s = extract_to_df(literature[0], '3s', 'gpt', 'gpt-4.1-mini')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [05:48<00:00, 69.67s/it]


In [30]:
dsv3_0 = extract_to_df(literature[0], 0, 'ds', 'deepseek-chat')

100%|███████████████████████████████████████████████████████████████████████████████████| 5/5 [08:21<00:00, 100.32s/it]


In [31]:
dsv3_1 = extract_to_df(literature[0], 1, 'ds', 'deepseek-chat')

100%|███████████████████████████████████████████████████████████████████████████████████| 5/5 [08:37<00:00, 103.59s/it]


In [32]:
dsv3_3 = extract_to_df(literature[0], 3, 'ds', 'deepseek-chat')

100%|███████████████████████████████████████████████████████████████████████████████████| 5/5 [09:02<00:00, 108.47s/it]


In [33]:
dsv3_3s = extract_to_df(literature[0], '3s', 'ds', 'deepseek-chat')

100%|███████████████████████████████████████████████████████████████████████████████████| 5/5 [08:57<00:00, 107.55s/it]


In [34]:
gemini25_0 = extract_to_df(literature[0], 0, 'gemini', 'gemini-2.5-flash-lite')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:19<00:00, 15.99s/it]


In [35]:
gemini25_1 = extract_to_df(literature[0], 1, 'gemini', 'gemini-2.5-flash-lite')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:39<00:00, 19.85s/it]


In [36]:
gemini25_3 = extract_to_df(literature[0], 3, 'gemini', 'gemini-2.5-flash-lite')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:16<00:00, 15.23s/it]


In [37]:
gemini25_3s = extract_to_df(literature[0], '3s', 'gemini', 'gemini-2.5-flash-lite')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:30<00:00, 18.03s/it]


In [38]:
gemini20_0 = extract_to_df(literature[0], 0, 'gemini', 'gemini-2.0-flash')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:49<00:00, 21.98s/it]


In [39]:
gemini20_1 = extract_to_df(literature[0], 1, 'gemini', 'gemini-2.0-flash')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [02:25<00:00, 29.06s/it]


In [40]:
gemini20_3 = extract_to_df(literature[0], 3, 'gemini', 'gemini-2.0-flash')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [05:34<00:00, 66.90s/it]


In [41]:
gemini20_3s = extract_to_df(literature[0], '3s', 'gemini', 'gemini-2.0-flash')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [02:22<00:00, 28.51s/it]


In [42]:
df_all = pd.concat([gpt4o_0, gpt4o_1, gpt4o_3, gpt4o_3s, gpt41_0, gpt41_1, gpt41_3, gpt41_3s, dsv3_0, dsv3_1, dsv3_3, dsv3_3s, gemini25_0, gemini25_1, \
                    gemini25_3, gemini25_3s, gemini20_0, gemini20_1, gemini20_3, gemini20_3s], ignore_index=True)
df_all = df_all.astype(str)
df_all.to_csv('bi_together_text.csv', index=False)

### Evaluation

In [123]:
LLM_result = pd.read_csv('bi_together_text.csv')
manual_result = pd.read_excel('manual.xlsx')

In [124]:
manual_result = manual_result.iloc[[3]].drop(columns=['ID', 'AUTHOR', 'TITLE']).reset_index(drop=True)
manual_result

Unnamed: 0,Qualitative method,Quantitative method,Socio-demo-economic data,Environmental data,Individuals,Households,Subnational groups,National groups,International groups,Urban,Rural,Time frame considered,Foresight,Rainfall pattern / Variability,Temperature change,Food scarcity / Famine / Food security,Drought / Aridity / Desertification,Floods,Erosion / Soil fertility / Land degradation / Deforestation / Salinisation,Self assessment / Perceived environment,Labour migration,Marriage migration,Refugees,International migration,Cross-border migration,Internal migration,Rural to urban,Rural to rural,Circular / Seasonal,Long distance,Short distance,Temporal,Permanent,Age,Gender,Ethnicity / Religion
0,0,1,1,1,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0


In [125]:
value_cols = manual_result.columns # For comparisons
model_cols_all = LLM_result.columns[36:] # About models
model_cols = [c for c in model_cols_all if c != 'run'] # Except for 'run'

bool_df = (LLM_result[value_cols] == manual_result.iloc[0])
LLM_result['accuracy'] = bool_df.mean(axis=1)

df_model_acc = LLM_result[model_cols + ['accuracy']].reset_index(drop=True)
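
The accuracy above is the share of the 36 indicators on which one run agrees with the manual coding. The comparison only works if both sides have the same type: the models return strings such as `'1'`, and the notebook appears to rely on the CSV round trip to turn them into integers. A minimal sketch of the same metric with explicit normalisation follows; `indicator_accuracy` and the four-indicator data are made up for illustration.

```python
def indicator_accuracy(predicted, truth):
    """Fraction of ground-truth indicators whose predicted code matches."""
    hits = 0
    for key, true_code in truth.items():
        # Compare as stripped strings so that '1' (LLM output) equals 1 (spreadsheet value)
        if str(predicted.get(key)).strip() == str(true_code).strip():
            hits += 1
    return hits / len(truth)


truth = {"Urban": 0, "Rural": 1, "Age": 1, "Gender": 1}
predicted = {"Urban": "0", "Rural": "1", "Age": "0", "Gender": "1"}
print(indicator_accuracy(predicted, truth))  # 3 of 4 indicators match
```

An indicator missing from the prediction counts as a mismatch here, which is one possible convention, not necessarily the one intended in the notebook.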

In [178]:
def plot(df, provider, model, few_shot, pos):
    """Save a boxplot of accuracy per temperature for one provider/model/few-shot setting."""
    subset = df[(df['provider'] == provider) & (df['model'] == model) & (df['few shot'] == few_shot)]

    # '3s' marks the setting where the three examples are given in separate turns
    if few_shot != '3s':
        shot_label = f"{few_shot}-shot learning (together)"
    else:
        shot_label = '3-shot learning (separate)'

    plt.figure(figsize=(8, 6))
    plt.title(f"Accuracy of {model} under {shot_label} setting at different temperatures")

    sns.boxplot(data=subset, x='temperature', y='accuracy', width=0.5, showfliers=True,
                boxprops=dict(facecolor="#ff8936", edgecolor='black', linewidth=1),
                medianprops=dict(color='black', linewidth=1.5),
                whiskerprops=dict(color="black", linewidth=1),
                capprops=dict(color="black", linewidth=1),
                flierprops=dict(marker='o', markerfacecolor="black", markersize=4, alpha=0.5))

    plt.xlabel('Temperature')
    plt.ylabel('Accuracy')
    plt.grid(True)
    plt.ylim(0.2, 1)

    os.makedirs(pos, exist_ok=True)  # Create the output folder if it does not exist yet
    filename = f"{model}_{few_shot}_accuracy_boxplot.png"
    filepath = os.path.join(pos, filename)
    plt.savefig(filepath, dpi=300, bbox_inches='tight')
    plt.close()

In [179]:
for provider in df_model_acc['provider'].unique():
    models_for_provider = df_model_acc[df_model_acc['provider'] == provider]['model'].unique()
    for model in models_for_provider:
        for few_shot in df_model_acc['few shot'].unique():
            plot(df_model_acc, provider, model, few_shot, 'plots/text_input_together_dict')
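
Alongside the boxplots, a numeric summary per configuration can make the settings easier to compare. The following is a minimal sketch, assuming a table with the same columns as `df_model_acc`; the toy values and the helper name `summarize` are hypothetical and are not results from this experiment.

```python
import pandas as pd

# Hypothetical accuracy table with the same columns as df_model_acc
toy = pd.DataFrame({
    'provider':    ['gpt'] * 4,
    'model':       ['gpt-4o-mini'] * 4,
    'few shot':    ['0', '0', '3', '3'],
    'temperature': [0.0, 0.0, 0.0, 0.0],
    'accuracy':    [0.5, 0.7, 0.8, 0.9],  # made-up values for illustration
})

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Mean, standard deviation and number of runs per configuration."""
    keys = ['provider', 'model', 'few shot', 'temperature']
    return (df.groupby(keys)['accuracy']
              .agg(['mean', 'std', 'count'])
              .reset_index())

summary = summarize(toy)
print(summary)
```

Applied to the real table, `summarize(df_model_acc)` would give one row per provider, model, few-shot setting and temperature.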

### Prompt and response processing -- return list

In [131]:
ZERO_SHOT_PROMPT_L = f"""

<Group Leader>: 
{CONTEXT} \n

The following 36 indicators will be used to categorize each paper, regarding its content: \n
a. Methods used for data analysis (refers only to analytical methods, not methods for data collection) (2 indicators): 1. 'Qualitative method' and \
2. 'Quantitative method'. Assign '0' for 'Not explicitly used' or '1' for 'Explicitly used'. A paper may use both (some researchers call it \
'mixed-method research'), one, or even neither of them. \n
b. Types of data used for analysis (only commenting on such points doesn't count) (2 indicators): 3. 'Socio-demo-economic data' ('demo' means \
'demographic') and 4. 'Environmental data'. Assign '0' for 'Not explicitly used' or '1' for 'Explicitly used'. Please do not care about how data is \
collected. A paper may include both, one, or even neither of them. \n
c. Focal demographic units of analysis (refers to the units of analysis, not to scale of research areas) (5 indicators): 5. 'Individuals', \
6. 'Households', 7. 'Subnational groups' (such as community and province), 8. 'National groups', 9. 'International groups'. Assign '0' for \
'Not explicitly concentrated' or '1' for 'Explicitly concentrated'. A paper may consider several, one or even none of them. \n
d. Research location types (refers to the places where investigation took place, not the places regarding migration) (2 indicators): 10. 'Urban', \
11. 'Rural'. Assign '0' for 'Not explicitly focused' or '1' for 'Explicitly focused'. A paper may focus on both, one, or even neither of them. \n
e. Temporal aspects (2 indicators): 12. 'Time frame considered' (such as using temporal analysis), 13. 'Foresight' (i.e., forecast, prediction, and \
future perspectives was/were addressed). Assign '0' for 'Not explicitly focused' or '1' for 'Explicitly focused'. \
A paper may consider both, one, or neither of them. \n
f: Environmental stressors as variables (only assign '1' when a stressor is explicitly used for analysis, while only mentioning without evidence \
doesn't count) (7 indicators): 14. 'Rainfall pattern / Variability', 15. 'Temperature change', 16. 'Food scarcity / Famine / Food security', \
17. 'Drought / Aridity / Desertification', 18. 'Floods', 19. 'Erosion / Soil fertility / Land degradation / Deforestation / Salinisation', \
20. 'Self assessment / Perceived environment'. Note that indicator 20 indicates that the above stressors measured via human perception were \
included in the paper, no matter if data from observation was included. Assign '0' for 'Not explicitly considered' or '1' for 'Explicitly considered'. \
For indicators 14 to 19, a paper may include several, one, or none of them. \n 
g: Types of migration discussed (only assign '1' when a type of migration is explicitly mentioned with evidence, while only discussing without any \
evidence doesn't count) (8 indicators): 21. 'Labour migration' (migration related to work), 22. 'Marriage migration', 23. 'Refugees', \
24. 'International migration', 25. 'Cross-border migration' (compared to 'International migration' which indicates those migration with longer \
distance, 'Cross-border migration' denotes those migration that people move only from one nation to an adjacent one), 26. 'Internal migration' (those \
migration within one single nation), 27. 'Rural to urban', 28. 'Rural to rural'. A paper might mention several, one, or none of them. \n
h: Mobility/migration patterns discussed (only assign '1' when a migration pattern is explicitly mentioned with evidence)(5 indicators): \
29. 'Circular / Seasonal', 30. 'Long distance', 31. 'Short distance', 32. 'Temporal' (migrants may come back), and 33. 'Permanent' (migrants will not \
come back). A paper can discuss several, one, or none of them. \n
i: Specific demographic characteristics as variables (only assign '1' when a characteristic is explicitly used for analysis, while only mentioning \
without evidence doesn't count) (3 indicators): 34. 'Age', 35. 'Gender', 36. 'Ethnicity / Religion'. A paper may use several, one, or none of them. \n

Always output the result as **a Python list of 36 elements** like below, with each blank replaced by '0' or '1' only, with no other text: \
['_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', \
'_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_']. Keep in mind that **the i-th element of the list MUST match the i-th indicator**. \
Do not shuffle!!
"""

In [132]:
ONE_SHOT_PROMPT_L = f"""

{ZERO_SHOT_PROMPT_L} \n

Now please try to code the following paper: \n
\"\"\"{example[0]}\"\"\"
\n
<Climate Mobility Expert>: OK, here is my result: \n
['1', '1', '1', '1', '0', '1', '1', '0', '0', '0', '1', '1', '0', '1', '0', '1', '1', '1', '1', '1', '1', '0', '0', '0', '0', '1', '1', '1', \
'1', '1', '0', '1', '0', '1', '0', '0']
"""

In [133]:
THREE_SHOT_PROMPT_L = f"""

{ZERO_SHOT_PROMPT_L} \n

Now please try to code the following 3 papers: \n
\"\"\"{example}\"\"\"
\n
<Climate Mobility Expert>: OK, here is my result: \n
['1', '1', '1', '1', '0', '1', '1', '0', '0', '0', '1', '1', '0', '1', '0', '1', '1', '1', '1', '1', '1', '0', '0', '0', '0', '1', '1', '1', \
'1', '1', '0', '1', '0', '1', '0', '0'], \
['0', '1', '1', '1', '0', '1', '0', '0', '0', '1', '1', '0', '0', '1', '1', '0', '0', '0', '0', '0', '1', '0', '0', '1', '0', '1', '1', '0', \
'1', '1', '1', '1', '1', '0', '0', '0'], \
['0', '1', '1', '1', '1', '1', '0', '0', '0', '0', '1', '1', '0', '0', '0', '0', '0', '0', '1', '0', '1', '1', '0', '0', '0', '1', '1', '0', \
'1', '0', '1', '1', '1', '1', '1', '0']
"""

In [134]:
THREE_SHOT_SEPARATE_PROMPT_L = f"""

{ZERO_SHOT_PROMPT_L} \n

Now please try to code the following paper: \n
\"\"\"{example[0]}\"\"\"
\n
<Climate Mobility Expert>: OK, here is my result: \n
['1', '1', '1', '1', '0', '1', '1', '0', '0', '0', '1', '1', '0', '1', '0', '1', '1', '1', '1', '1', '1', '0', '0', '0', '0', '1', '1', '1', \
'1', '1', '0', '1', '0', '1', '0', '0'] \n
<Prof>: Please try to code another paper below: \n
\"\"\"{example[1]}\"\"\"
\n
<Climate Mobility Expert>: OK, here is my result: \n
['0', '1', '1', '1', '0', '1', '0', '0', '0', '1', '1', '0', '0', '1', '1', '0', '0', '0', '0', '0', '1', '0', '0', '1', '0', '1', '1', '0', \
'1', '1', '1', '1', '1', '0', '0', '0'] \n
<Prof>: Good job! Now please code one more paper: \n
\"\"\"{example[2]}\"\"\"
\n
<Climate Mobility Expert>: No problem, here is my result: \n
['0', '1', '1', '1', '1', '1', '0', '0', '0', '0', '1', '1', '0', '0', '0', '0', '0', '0', '1', '0', '1', '1', '0', '0', '0', '1', '1', '0', \
'1', '0', '1', '1', '1', '1', '1', '0']
"""

In [141]:
def extract_features_l(paper: dict, example, provider, model, temperature):
    """Ask the model to code one paper and return its answer as a list of 36 elements."""

    fail_count = 0
    while True:
        # Build the prompt for the requested few-shot setting (0, 1, 3 or '3s')
        if example == 3:
            prompt = f"""
            Please answer in a consistent style, performing the following actions step by step: \n
            1 - Read the instructions and the examples, understand the leader's requirements and the expert's work on coding papers. \n
            2 - You will be provided with a paper. Please read and give your answer according to the leader's requirement \
            just as what the expert did. \n
            You just need to return the final result of step 2, and make it as Python list format as performed below. Do not explain reasoning. \n

            Here are the instructions and examples: \n
            {THREE_SHOT_PROMPT_L}\n
            Here is the new paper, please code it like the examples: \n
            {paper}\n
            Keep in mind that output format is Python list, with no extra text.
            """

        elif example == '3s':
            prompt = f"""
            Please answer in a consistent style, performing the following actions step by step: \n
            1 - Read the instructions and the examples, understand the leader's requirements and the expert's work on coding papers. \n
            2 - You will be provided with a paper. Please read and give your answer according to the leader's requirement \
            just as what the expert did. \n
            You just need to return the final result of step 2, and make it as Python list format as performed below. Do not explain reasoning. \n

            Here are the instructions and examples: \n
            {THREE_SHOT_SEPARATE_PROMPT_L}\n
            Here is the new paper, please code it like the examples: \n
            {paper}\n
            Keep in mind that output format is Python list, with no extra text.
            """

        elif example == 1:
            prompt = f"""
            Please answer in a consistent style, performing the following actions step by step: \n
            1 - Read the instructions and the example, understand the leader's requirements and the expert's work on coding paper. \n
            2 - You will be provided with a paper. Please read and give your answer according to the leader's requirement \
            just as what the expert did. \n
            You just need to return the final result of step 2, and make it as Python list format as performed below. Do not explain reasoning. \n

            Here are the instructions and example: \n
            {ONE_SHOT_PROMPT_L}\n
            Here is the new paper, please code it like the expert: \n
            {paper}\n
            Keep in mind that output format is Python list, with no extra text.
            """

        elif example == 0:
            prompt = f"""
            Please answer in a consistent style, performing the following actions step by step: \n
            1 - Read the instructions, understand the leader's requirements regarding coding papers. \n
            2 - You will be provided with a paper. Please read and give your answer according to the leader's requirement. \n
            You just need to return the final result of step 2, and make it as Python list format as performed below. Do not explain reasoning. \n

            Here are the instructions: \n
            {ZERO_SHOT_PROMPT_L}\n
            Here is the paper, please code it according to the requirements: \n
            {paper}\n
            Keep in mind that output format is Python list, with no extra text.
            """

        else:
            raise ValueError(f"Unknown few-shot setting: {example!r}")

        response = get_completion(prompt, provider, model, temperature)

        # Strip a surrounding Markdown code fence if the model added one
        if response.strip().startswith("```"):
            lines = response.strip().splitlines()
            if lines and lines[0].startswith("```"):
                lines = lines[1:]
            if lines and lines[-1].startswith("```"):
                lines = lines[:-1]
            response = "\n".join(lines)
        response = response.strip()

        # Convert the text into a Python list and check it has one entry per indicator
        try:
            data = ast.literal_eval(response)
            if not isinstance(data, list) or len(data) != 36:
                raise ValueError(f"Expected a list of 36 elements, got: {data!r}")
            return data
        except Exception as e:
            fail_count += 1
            if fail_count >= 10:
                raise ValueError(f"Can't parse as Python list: {response}, retried too many times") from e
            time.sleep(1)
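
The parsing step could also check the content of the list, not only that the text is a valid Python literal. The sketch below is one possible stricter parser; the name `parse_indicator_list` and its behaviour (accepting a tuple as well as a list, returning integers) are assumptions for illustration, not the function used for the reported runs.

```python
import ast
import re

N_INDICATORS = 36  # number of indicators listed in the prompt

def parse_indicator_list(response: str, n: int = N_INDICATORS) -> list[int]:
    """Parse an LLM reply into a list of n integers, each 0 or 1.

    Raises ValueError when the reply is not a well-formed list.
    """
    text = response.strip()
    # Drop a surrounding Markdown code fence (three backticks, optional language tag)
    text = re.sub(r"^`{3}[\w-]*\s*|\s*`{3}$", "", text).strip()
    try:
        data = ast.literal_eval(text)
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"Not a Python literal: {text!r}") from exc
    if not isinstance(data, (list, tuple)) or len(data) != n:
        raise ValueError(f"Expected {n} elements, got: {data!r}")
    values = [str(v).strip() for v in data]
    if any(v not in ("0", "1") for v in values):
        raise ValueError(f"Non-binary entry in: {data!r}")
    return [int(v) for v in values]

# Usage on a short, made-up reply (n=3 instead of 36 to keep it readable)
fence = "`" * 3
reply = f"{fence}python\n['0', '1', '1']\n{fence}"
print(parse_indicator_list(reply, n=3))  # [0, 1, 1]
```

Rejecting replies with the wrong length matters because `dict(zip(feature_columns, features))` would otherwise silently drop or omit indicators.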

In [137]:
def extract_to_df_l(literature: dict, example, provider, model):
    results = []
    temperatures = np.round(np.arange(0.0, 2.0, 0.4), 1)

    feature_columns = ['Qualitative method', 'Quantitative method', 'Socio-demo-economic data', 'Environmental data', \
                       'Individuals', 'Households', 'Subnational groups', 'National groups', 'International groups', \
                       'Urban', 'Rural', 'Time frame considered', 'Foresight', \
                       'Rainfall pattern / Variability', 'Temperature change', 'Food scarcity / Famine / Food security', 'Drought / Aridity / Desertification', \
                       'Floods', 'Erosion / Soil fertility / Land degradation / Deforestation / Salinisation', 'Self assessment / Perceived environment', \
                       'Labour migration', 'Marriage migration', 'Refugees', 'International migration', 'Cross-border migration', 'Internal migration', \
                       'Rural to urban', 'Rural to rural', 'Circular / Seasonal', 'Long distance', 'Short distance', 'Temporal', 'Permanent', \
                       'Age', 'Gender', 'Ethnicity / Religion']
    
    meta_columns = ["provider", "model", "few shot", "temperature", "run"]
    
    for t in tqdm(temperatures, leave=True):       
        for run_id in range(10):
            success = False
            while not success:
                try:
                    features = extract_features_l(literature, example, provider, model, t)
                    row = dict(zip(feature_columns, features))
                    row["provider"] = provider
                    row["model"] = model
                    row["few shot"] = example
                    row["temperature"] = t
                    row["run"] = run_id
                    results.append(row)
                    success = True
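                except (AuthenticationError, BadRequestError, NotFoundError):
                    # Not transient: these are subclasses of APIError, so re-raise them before the retry clause
                    raise
                except (RateLimitError, APIError, APITimeoutError, ResourceExhausted):
                    # Transient API failure: wait briefly, then retry this run
                    time.sleep(2)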

    df = pd.DataFrame(results, columns=feature_columns+meta_columns)
    return df
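
The retry loop above waits a fixed two seconds and retries without limit. A bounded retry with exponential backoff is a common alternative. The following is a minimal sketch; `call_with_backoff`, its parameters and the `flaky` stand-in are hypothetical names introduced here, and the delays are illustrative rather than tuned to any provider's rate limits.

```python
import time

def call_with_backoff(fn, retryable=(Exception,), max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds, waiting base_delay * 2**attempt between tries.

    Re-raises the last error once max_tries attempts have failed.
    """
    for attempt in range(max_tries):
        try:
            return fn()
        except retryable:
            if attempt == max_tries - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Usage with a stand-in for the API call (hypothetical flaky function)
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"

delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky, retryable=(TimeoutError,), sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

In the notebook, `fn` could be a `lambda` wrapping `extract_features_l(...)`, with `retryable` set to the transient exception types already caught above.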

In [142]:
gpt4o_0_l = extract_to_df_l(literature[0], 0, 'gpt', 'gpt-4o-mini')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [03:22<00:00, 40.45s/it]


In [144]:
gpt4o_1_l = extract_to_df_l(literature[0], 1, 'gpt', 'gpt-4o-mini')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [03:23<00:00, 40.67s/it]


In [145]:
gpt4o_3_l = extract_to_df_l(literature[0], 3, 'gpt', 'gpt-4o-mini')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [04:16<00:00, 51.30s/it]


In [146]:
gpt4o_3s_l = extract_to_df_l(literature[0], '3s', 'gpt', 'gpt-4o-mini')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [03:26<00:00, 41.32s/it]


In [147]:
gpt41_0_l = extract_to_df_l(literature[0], 0, 'gpt', 'gpt-4.1-mini')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [02:14<00:00, 26.98s/it]


In [148]:
gpt41_1_l = extract_to_df_l(literature[0], 1, 'gpt', 'gpt-4.1-mini')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [02:29<00:00, 29.92s/it]


In [151]:
gpt41_3_l = extract_to_df_l(literature[0], 3, 'gpt', 'gpt-4.1-mini')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [03:06<00:00, 37.31s/it]


In [152]:
gpt41_3s_l = extract_to_df_l(literature[0], '3s', 'gpt', 'gpt-4.1-mini')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [02:43<00:00, 32.62s/it]


In [153]:
dsv3_0_l = extract_to_df_l(literature[0], 0, 'ds', 'deepseek-chat')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [04:06<00:00, 49.37s/it]


In [154]:
dsv3_1_l = extract_to_df_l(literature[0], 1, 'ds', 'deepseek-chat')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [04:07<00:00, 49.53s/it]


In [155]:
dsv3_3_l = extract_to_df_l(literature[0], 3, 'ds', 'deepseek-chat')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [04:38<00:00, 55.64s/it]


In [156]:
dsv3_3s_l = extract_to_df_l(literature[0], '3s', 'ds', 'deepseek-chat')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [04:38<00:00, 55.68s/it]


In [157]:
gemini20_0_l = extract_to_df_l(literature[0], 0, 'gemini', 'gemini-2.0-flash')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:03<00:00, 12.67s/it]


In [158]:
gemini20_1_l = extract_to_df_l(literature[0], 1, 'gemini', 'gemini-2.0-flash')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:27<00:00, 17.50s/it]


In [159]:
gemini20_3_l = extract_to_df_l(literature[0], 3, 'gemini', 'gemini-2.0-flash')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:41<00:00, 20.39s/it]


In [160]:
gemini20_3s_l = extract_to_df_l(literature[0], '3s', 'gemini', 'gemini-2.0-flash')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [02:39<00:00, 31.91s/it]


In [161]:
gemini25_0_l = extract_to_df_l(literature[0], 0, 'gemini', 'gemini-2.5-flash-lite')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:46<00:00,  9.25s/it]


In [162]:
gemini25_1_l = extract_to_df_l(literature[0], 1, 'gemini', 'gemini-2.5-flash-lite')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:01<00:00, 12.28s/it]


In [163]:
gemini25_3_l = extract_to_df_l(literature[0], 3, 'gemini', 'gemini-2.5-flash-lite')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:03<00:00, 12.79s/it]


In [164]:
gemini25_3s_l = extract_to_df_l(literature[0], '3s', 'gemini', 'gemini-2.5-flash-lite')

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:01<00:00, 12.31s/it]


In [165]:
df_all_l = pd.concat([gpt4o_0_l, gpt4o_1_l, gpt4o_3_l, gpt4o_3s_l, gpt41_0_l, gpt41_1_l, gpt41_3_l, gpt41_3s_l, dsv3_0_l, dsv3_1_l, dsv3_3_l, dsv3_3s_l, \
                      gemini25_0_l, gemini25_1_l, gemini25_3_l, gemini25_3s_l, gemini20_0_l, gemini20_1_l, gemini20_3_l, gemini20_3s_l], ignore_index=True)
df_all_l = df_all_l.astype(str)
df_all_l.to_csv('bi_together_text_list.csv', index=False)

### Evaluation

In [166]:
LLM_result_l = pd.read_csv('bi_together_text_list.csv')
manual_result = pd.read_excel('manual.xlsx')

In [167]:
manual_result = manual_result.iloc[[3]].drop(columns=['ID', 'AUTHOR', 'TITLE']).reset_index(drop=True)
manual_result

Unnamed: 0,Qualitative method,Quantitative method,Socio-demo-economic data,Environmental data,Individuals,Households,Subnational groups,National groups,International groups,Urban,Rural,Time frame considered,Foresight,Rainfall pattern / Variability,Temperature change,Food scarcity / Famine / Food security,Drought / Aridity / Desertification,Floods,Erosion / Soil fertility / Land degradation / Deforestation / Salinisation,Self assessment / Perceived environment,Labour migration,Marriage migration,Refugees,International migration,Cross-border migration,Internal migration,Rural to urban,Rural to rural,Circular / Seasonal,Long distance,Short distance,Temporal,Permanent,Age,Gender,Ethnicity / Religion
0,0,1,1,1,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0


In [168]:
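value_cols_l = manual_result.columns # The 36 indicator columns compared against the manual coding
model_cols_all_l = LLM_result_l.columns[len(value_cols_l):] # Metadata columns: provider, model, few shot, temperature, run
model_cols_l = [c for c in model_cols_all_l if c != 'run'] # Keep the metadata except the run index

# Row-wise accuracy: share of the 36 indicators that match the manual coding
bool_df_l = (LLM_result_l[value_cols_l] == manual_result.iloc[0])
LLM_result_l['accuracy'] = bool_df_l.mean(axis=1)

df_model_acc_l = LLM_result_l[model_cols_l + ['accuracy']].reset_index(drop=True)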

In [180]:
for provider in df_model_acc_l['provider'].unique():
    models_for_provider = df_model_acc_l[df_model_acc_l['provider'] == provider]['model'].unique()
    for model in models_for_provider:
        for few_shot in df_model_acc_l['few shot'].unique():
            plot(df_model_acc_l, provider, model, few_shot, 'plots/text_input_together_list')
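
Overall accuracy does not show which indicators the models tend to get wrong. Taking the mean of the same boolean comparison over runs instead of over indicators gives a per-indicator agreement rate. This is a sketch on made-up data; the three indicator columns and the helper name `indicator_agreement` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: three indicators, four runs
truth = pd.Series({'Urban': 0, 'Rural': 1, 'Floods': 0})
runs = pd.DataFrame({
    'Urban':  [0, 1, 1, 1],
    'Rural':  [1, 1, 1, 1],
    'Floods': [0, 0, 1, 1],
})

def indicator_agreement(pred: pd.DataFrame, truth: pd.Series) -> pd.Series:
    """Share of runs that match the reference coding, per indicator, lowest first."""
    return (pred[truth.index] == truth).mean(axis=0).sort_values()

agreement = indicator_agreement(runs, truth)
print(agreement)
```

With the notebook's own objects, the equivalent call would be `bool_df_l.mean(axis=0).sort_values()`, optionally after filtering to one model and few-shot setting.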