# LLM Translation Judge for English-to-Filipino Translations via Prompt Engineering

## Importing Dependencies

In [3]:
from google.colab import ai
import pandas as pd
import urllib.request
import numpy as np
from scipy.stats import spearmanr
import time
import random
import re

## Importing, Pre-processing, and Cleaning the dataset



### Importing the dataset

The dataset was downloaded from Google Sheets and uploaded to a public GitHub repository. From there, it is loaded into the notebook using `urllib`.

In [4]:
url = 'https://raw.githubusercontent.com/CarandangR/CSC420M-LLM-Translation-Judge-for-English-to-Filipino-Translations/main/Datasets%20-%20Training.csv'
filename = 'training_dataset.csv'

urllib.request.urlretrieve(url, filename)

('training_dataset.csv', <http.client.HTTPMessage at 0x7adde15f5ed0>)

In [5]:
df = pd.read_csv(filename)

df.head()

Unnamed: 0,English,Filipino-Correct,Filipino-Flawed,Remarks,Contributor
0,Ang gnda na mura pa,It so beautiful and it's even affordable.,beautiful and cheap,flawed translation failed to express the 'na' ...,Charibeth Cheng
1,The Philippines is an archipelago made up of o...,"Ang Pilipinas ay isang kapulaang binubuo ng 7,...",Ang Pilipinas ay isang puno na binubuo ng mahi...,,Geena Tibule/Charlyne Arajoy Carabeo
2,Philippines is the world's second-largest arch...,Ang Pilipinas ang pangalawa sa pinakamalaking ...,Ang Pilipinas ay ang pangalawang malaking isla...,,Geena Tibule/Charlyne Arajoy Carabeo
3,Filipino and English are the two official lang...,Filipino at Ingles ang dalawang opisyal na lin...,Tagalog at Ingles ang dalawa opisyal lingwahe ...,,Geena Tibule/Charlyne Arajoy Carabeo
4,Tagalog is the most widely spoken native langu...,Tagalog ang pinakamalawak at ginagamit na katu...,Tagalog ay ang pinaka malawak sinasabi katutub...,,Geena Tibule/Charlyne Arajoy Carabeo


### Displaying the info of the dataset

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 564 entries, 0 to 563
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   English           562 non-null    object
 1   Filipino-Correct  562 non-null    object
 2   Filipino-Flawed   562 non-null    object
 3   Remarks           333 non-null    object
 4   Contributor       562 non-null    object
dtypes: object(5)
memory usage: 22.2+ KB


## Dataset Cleaning

### Dropping Rows with Missing `English`, `Filipino-Correct`, or `Filipino-Flawed` Texts

In [7]:
df_cleaned = df.dropna(subset=['English', 'Filipino-Correct', 'Filipino-Flawed']).copy()

### Renaming of columns

The columns were renamed so that they are easier to handle when passed to the LLM. Each column was renamed as follows:
*   `English` was turned into `source_text`
*   `Filipino-Correct` was turned into `reference_translation`
*   `Filipino-Flawed` was turned into `translated_text`


In [8]:
df_cleaned.rename(columns={
    'English': 'source_text',
    'Filipino-Correct': 'reference_translation',
    'Filipino-Flawed': 'translated_text',
}, inplace=True)

### Removing empty and duplicate strings

In [9]:
df_cleaned = df_cleaned[
    (df_cleaned['source_text'].str.strip() != '') &
    (df_cleaned['translated_text'].str.strip() != '')
].drop_duplicates(subset=['source_text', 'translated_text'])

### Displaying Cleaned dataset

In [10]:
df_cleaned.sample(5)

Unnamed: 0,source_text,reference_translation,translated_text,Remarks,Contributor
409,The client reports intrusive thoughts related ...,May naiisip ang kliyente na paulit-ulit na tun...,May trauma ang kliyente kaya lagi siyang may i...,too vague,Darryl Ty/Rafael Yap
347,People leave strange little memories of themse...,May iniiwang kakaiba at maliliit na alaala ang...,Ang mga tao ay nag-iiwan ng kakaiba maliit na ...,"Missing “na” linker, word choice “memorya” is ...",Lia Guillermo/Lester Anthony Sityar
475,"You are Geralt of Rivia, mercenary monster sla...","Ikaw si Geralt ng Rivia, isang mersenaryong ta...","Ikaw si Geralt ng Rivia, tagapatay ng halimaw....","Repetitive (“halimaw” twice), vague",Aira Garganera / Nigel Nograles
337,"Anna feels that she is confusing ""who"" and ""wh...","Pakiramdam ni Anna, nagkakamali siya sa paggam...","Pakiramdam ni Anna ay nililito niya ang ""sino""...",unnecessary translation of words in quotations...,Enrique Lejano / Monica Manlises
118,Wasn't it enough that you disfigured me for ni...,Hindi pa ba sapat na sinira mo ang aking itsur...,Hindi ba sapat na dinisfigure mo ako ng siyam ...,,Elijah Rosario/John Kirsten Espiritu


## Declaring the Prompting template

This template was written to cover the requirements in the machine project specifications. The LLM evaluates each translation on six criteria: Accuracy, Fluency, Coherence, Cultural Appropriateness, Guideline Adherence, and Completeness.

In [11]:
prompt = """
You are a strict translation judge evaluating an English-to-Filipino translation. Your task is to Evaluate and score the translated text.

### Task Overview:
Generate an evaluation from the english source text, the translated text, and the reference translation.
English source text: {source_text}
Translated text: {translated_text}
Reference text: {reference_translation}
Note that the reference translation is the correct translation of the source text so it will get the perfect score. Keep in mind when evaluating

### Evaluation Criteria:
Each criteria below will be either 0 or 1 point each. The perfect score of the translation is 6 points and the lowest score is 0 points.
The criteria can only have a score of 1 if they meed the requirements of the criteria. otherwise it will always be 0.
1. Accuracy – Deduct if any part of the meaning differs from the reference or omits details.
2. Fluency – Deduct if grammar, style, or idiomatic expression is less natural than the reference.
3. Coherence – Deduct if the logical flow or structure differs from the reference without valid reason.
4. Cultural Appropriateness – Deduct if cultural tone, idioms, or respectful forms differ in a way that harms the message.
5. Guideline Adherence – Deduct if terminology/style differs from the reference in a way that breaks domain rules.
6. Completeness – Deduct if any information present in the reference is missing or altered.

### Sample Output
English - The actress has not given any further details about it.
Translated - Ang aktres ay hindi nagbigay ng anumang karagdagang mga detalye tungkol dito.
Accuracy: 1 point (The translated text accurately conveys the core meaning of the source text, which is that the actress has provided no new information.)
Fluency: 0 point (The use of the "Ang... ay..." construction is grammatically correct but less natural and fluent than the reference text's "Wala pang..." structure, which is more common in Filipino.)
Coherence: 0 point (The sentence structure, while logical, differs from the more integrated and seamless flow of the reference text. The reference's structure connects the ideas more smoothly.)
Cultural Appropriateness: 1 point (The tone of the translation is neutral and fact-based, which is culturally appropriate for the context. No inappropriate language is used.)
Guideline Adherence: 0 point (The translation deviates from the style of the reference text by using the formal "ay" construction and the literal "anumang" instead of the more idiomatic "Wala pang..." structure.)
Completeness: 1 point (The translated text includes all key pieces of information from the source sentence: the actress, the action of not giving, the further details, and the subject matter.)
Total Score: 3 points
"""
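The sample-output format in the template puts one criterion per line, so the per-criterion scores can be recovered from a response with a regular expression. The following is a minimal sketch, not part of the original notebook: the helper name `parse_criterion_scores` and the `sample` text are hypothetical, and it assumes the model follows the "Criterion: N point" layout shown in the prompt.

```python
import re

# Criterion names taken from the prompt's "Evaluation Criteria" list.
CRITERIA = [
    "Accuracy",
    "Fluency",
    "Coherence",
    "Cultural Appropriateness",
    "Guideline Adherence",
    "Completeness",
]

def parse_criterion_scores(response_text):
    """Return {criterion: 0 or 1} for every criterion line found (hypothetical helper)."""
    scores = {}
    for name in CRITERIA:
        # Matches lines such as "Fluency: 0 point (...)" or "Accuracy: 1 points".
        pattern = rf"^{re.escape(name)}:\s*([01])\s*points?\b"
        match = re.search(pattern, response_text, flags=re.MULTILINE)
        if match:
            scores[name] = int(match.group(1))
    return scores

# Made-up response in the same layout as the prompt's sample output.
sample = """Accuracy: 1 point (meaning preserved)
Fluency: 0 point (less natural)
Coherence: 0 point (flow differs)
Cultural Appropriateness: 1 point (neutral tone)
Guideline Adherence: 0 point (style differs)
Completeness: 1 point (nothing missing)
Total Score: 3 points"""

scores = parse_criterion_scores(sample)
print(scores)
print("Sum of criteria:", sum(scores.values()))
```

Summing the parsed values gives an independent check on the "Total Score" line the model reports.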

## Import and setup of the Large Language Model

### Import of the Large Language Model through the use of the API key

In [12]:
ai.list_models()

['google/gemini-2.0-flash',
 'google/gemini-2.0-flash-lite',
 'google/gemini-2.5-flash',
 'google/gemini-2.5-flash-lite',
 'google/gemini-2.5-pro',
 'google/gemma-3-12b',
 'google/gemma-3-1b',
 'google/gemma-3-27b',
 'google/gemma-3-4b']

### Selecting Random Rows to Test Different Types of Prompts

In [13]:
random_idx = random.choice(df_cleaned.index)
random_row = df_cleaned.loc[random_idx]

In [14]:
source_text = random_row['source_text']
translated_text = random_row['translated_text']
reference_translation = random_row.get('reference_translation')

In [15]:
formatted_prompt = prompt.format(
    source_text=source_text,
    translated_text=translated_text,
    reference_translation=reference_translation
)


In [16]:
response = ai.generate_text(formatted_prompt, model_name='google/gemini-2.5-pro')

RateLimitError: Error code: 429 - {'message': 'Insufficient quota available to perform this operation. Try again later.', 'type': 'invalid_request_error'}
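A 429 error like the one above is commonly handled by retrying with exponential backoff. The sketch below is an assumption-laden illustration rather than what the notebook did: `generate_with_retry` is a hypothetical helper, `generate` stands in for a wrapper around `ai.generate_text`, and `retry_on` would need to be narrowed to whatever exception class the client actually raises.

```python
import time

def generate_with_retry(generate, prompt, max_attempts=5, base_delay=2.0,
                        retry_on=(Exception,), sleep=time.sleep):
    """Call generate(prompt), retrying with exponential backoff on failure.

    `generate` is any callable that takes a prompt and returns text.
    `retry_on` is an assumption: narrow it to the client's rate-limit error.
    """
    for attempt in range(max_attempts):
        try:
            return generate(prompt)
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * (2 ** attempt))  # waits 2s, 4s, 8s, ...

# Demonstration with a stand-in generator that fails twice, then succeeds.
calls = {"count": 0}

def flaky_generate(prompt):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("429: quota exceeded (simulated)")
    return "Total Score: 3 points"

delays = []  # record requested delays instead of actually sleeping
result = generate_with_retry(flaky_generate, "dummy prompt", sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demonstration instant; in a real run the default `time.sleep` would be used.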

### Consistency Testing

Note: The rate-limit error above appeared during testing because of our limited quota. This stage is also where consistency was tested, by repeating the following prompt several times:

`prompt` = """
You are a strict translation judge evaluating an English-to-Filipino translation. Your task is to Evaluate and score the translated text.

### Task Overview:
Generate an evaluation from the english source text, the translated text, and the reference translation.
English source text: A race condition happens when threads access shared data unpredictably.
Translated text: Race condition ay kapag may naunang thread.
Reference text: Nangyayari ang race condition kapag hindi kontrolado ang pag-access ng mga thread sa iisang data.
Note that the reference translation is the correct translation of the source text so it will get the perfect score. Keep in mind when evaluating

### Evaluation Criteria:
Each criteria below will be either 0 or 1 point each. The perfect score of the translation is 6 points and the lowest score is 0 points.
The criteria can only have a score of 1 if they meed the requirements of the criteria. otherwise it will always be 0.
1. Accuracy – Deduct if any part of the meaning differs from the reference or omits details.
2. Fluency – Deduct if grammar, style, or idiomatic expression is less natural than the reference.
3. Coherence – Deduct if the logical flow or structure differs from the reference without valid reason.
4. Cultural Appropriateness – Deduct if cultural tone, idioms, or respectful forms differ in a way that harms the message.
5. Guideline Adherence – Deduct if terminology/style differs from the reference in a way that breaks domain rules.
6. Completeness – Deduct if any information present in the reference is missing or altered.

### Sample Output
English - The actress has not given any further details about it.
Translated - Ang aktres ay hindi nagbigay ng anumang karagdagang mga detalye tungkol dito.
Accuracy: 1 point (The translated text accurately conveys the core meaning of the source text, which is that the actress has provided no new information.)
Fluency: 0 point (The use of the "Ang... ay..." construction is grammatically correct but less natural and fluent than the reference text's "Wala pang..." structure, which is more common in Filipino.)
Coherence: 0 point (The sentence structure, while logical, differs from the more integrated and seamless flow of the reference text. The reference's structure connects the ideas more smoothly.)
Cultural Appropriateness: 1 point (The tone of the translation is neutral and fact-based, which is culturally appropriate for the context. No inappropriate language is used.)
Guideline Adherence: 0 point (The translation deviates from the style of the reference text by using the formal "ay" construction and the literal "anumang" instead of the more idiomatic "Wala pang..." structure.)
Completeness: 1 point (The translated text includes all key pieces of information from the source sentence: the actress, the action of not giving, the further details, and the subject matter.)
Total Score: 3 points
"""

Although the outputs are not shown because they were not saved to a variable, the results were 100% consistent across all 5 runs: under the Cultural Appropriateness criterion, every run was given a score of 1.
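A consistency check of this kind can be made reproducible by storing each run's score in a list. The sketch below is illustrative only: `criterion_consistency` and `fake_generate` are hypothetical names, and the stand-in generator replaces the real model call so the example runs without API quota.

```python
from collections import Counter
import re

def criterion_consistency(generate, prompt, criterion, n_runs=5):
    """Run the same prompt n_runs times; report how often the modal score appears."""
    pattern = re.compile(
        rf"^{re.escape(criterion)}:\s*([01])\s*points?\b", re.MULTILINE
    )
    scores = []
    for _ in range(n_runs):
        response = generate(prompt)
        match = pattern.search(response)
        # None marks a run whose output could not be parsed.
        scores.append(int(match.group(1)) if match else None)
    modal_score, modal_count = Counter(scores).most_common(1)[0]
    return scores, modal_score, modal_count / n_runs * 100

# Stand-in generator (an assumption) that always returns the same text.
def fake_generate(prompt):
    return (
        "Accuracy: 0 point (...)\n"
        "Cultural Appropriateness: 1 point (...)\n"
        "Total Score: 1 points"
    )

scores, modal, pct = criterion_consistency(
    fake_generate, "same prompt", "Cultural Appropriateness"
)
print(scores, modal, f"{pct:.0f}%")
```

Keeping the raw `scores` list alongside the percentage means the individual runs can be inspected later.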

In [None]:
print("Raw Response for Example 1:")
print(response)

Raw Response for Example 1:
Accuracy: 1 point (The translation perfectly captures the core meaning of the source text. "Wala nang respeto at pagmamahal" accurately conveys "no more respect or love," and "iwan mo na" is a precise and idiomatic translation of the command "leave them.")
Fluency: 1 point (The translation is highly fluent and natural in conversational Filipino. The sentence structure "Wala nang... yung tao para sa'yo" is a common and smooth way to express this idea, much more so than a literal, non-inverted structure.)
Coherence: 1 point (The logical flow of the original sentence—presenting a reason followed by a direct piece of advice—is perfectly mirrored in the translation. The two clauses are connected logically and clearly.)
Cultural Appropriateness: 1 point (The translation's tone is direct and candid, which is culturally appropriate for giving personal advice between peers in the Philippines. "Iwan mo na" is the standard and culturally understood phrase for this situ

## Testing using Validation set

### Loading Validation set

In [None]:
valurl = 'https://raw.githubusercontent.com/CarandangR/CSC420M-LLM-Translation-Judge-for-English-to-Filipino-Translations/main/Datasets%20-%20Human-Labeled%20Validation%20Set.csv'

valfilename = 'validation_dataset.csv'
urllib.request.urlretrieve(valurl, valfilename)

('validation_dataset.csv', <http.client.HTTPMessage at 0x7a6071fd1f90>)

In [None]:
val_df = pd.read_csv(valfilename)

val_df.head()

Unnamed: 0,Source Text (English),Target Text (Filipino),Final_score,Rater 1 Explanation,Rater 2 Explanation,Contributor
0,The children laughed and played under the afte...,Ang mga bata ay nagtawanan at naglaro sa ilali...,4.0,"Accurate, fluent, and natural translation. Cap...",Just slight error due to the literal translati...,Paul Ivan Enclonar/Alonzo Rimando
1,She took a break to gather her thoughts.,Nagpahinga siya para mag-isip-isip.,4.0,The translation is accurate. It was able to ca...,The translation would have been better if the ...,Paul Ivan Enclonar/Alonzo Rimando
2,The algorithm efficiently identifies patterns ...,Mabisang kinikilala ng algoritmo ang mga patte...,3.0,"The translation of ""identifies"" as ""kinikilala...",The translation would have been better if the ...,Paul Ivan Enclonar/Alonzo Rimando
3,Data normalization helps improve model perform...,Tumutulong sa pagpabuti ng model ang normalisa...,5.0,The translated text is natural and captures th...,The translation didn't literally translated th...,Paul Ivan Enclonar/Alonzo Rimando
4,alam mo ma'am masaya naman topics natin sa phi...,"You know, ma'am, we have a lot of fun philosop...",4.0,"flawed translation is close, but failed to tra...",,Charibeth Cheng


### Validation Set Cleaning

Dropping rows with missing values in the `Source Text (English)`, `Target Text (Filipino)`, or `Final_score` columns

In [None]:
val_df_cleaned = val_df.dropna(subset=['Source Text (English)', 'Target Text (Filipino)', 'Final_score']).copy()

Renaming of Columns

The columns were renamed so that they are easier to handle when passed to the LLM. Each column was renamed as follows:
*   `Source Text (English)` was turned into `source_text`
*   `Target Text (Filipino)` was turned into `target_text`
*   `Final_score` (1 = lowest, 5 = highest) was turned into `score`


In [None]:
val_df_cleaned.rename(columns={
    'Source Text (English)': 'source_text',
    'Target Text (Filipino)': 'target_text',
    'Final_score': 'score',
}, inplace=True)

Checking of duplicate strings

In [None]:
val_df_cleaned = val_df_cleaned[
    (val_df_cleaned['source_text'].str.strip() != '') &
    (val_df_cleaned['target_text'].str.strip() != '')
].drop_duplicates(subset=['source_text', 'target_text'])

Checking Cleaned dataset

In [None]:
val_df_cleaned.sample(5)

Unnamed: 0,source_text,target_text,score,Rater 1 Explanation,Rater 2 Explanation,Contributor
57,Things outside you are projections of what’s i...,Ang mga bagay sa labas mo ay salamin ng nasa l...,3.0,"Accurate and clear, though some of the poetic ...",The translation replaces a complex phrase with...,Lia Guillermo/Lester Anthony Sityar
14,This would be easier if blood came in more col...,Ito ay magiging mas madali kung ang dugo ay du...,2.0,"flawed translation takes ""came in more colors""...",,Elijah Rosario/John Kirsten Espiritu
19,Which of the following tools is LEAST likely t...,Alin sa mga sumusunod ang PINAKABIHIRANG ginag...,5.0,Perfect translation. Sounds natural as well fo...,Accurate translation that would be used in eve...,Wrong verb-tense which makes the sentence soun...
0,The children laughed and played under the afte...,Ang mga bata ay nagtawanan at naglaro sa ilali...,4.0,"Accurate, fluent, and natural translation. Cap...",Just slight error due to the literal translati...,Paul Ivan Enclonar/Alonzo Rimando
25,The Binding of Isaac: Rebirth is a randomly ge...,Ang The Binding of Isaac: Rebirth ay isang ran...,2.0,The translations are too literal; some of the ...,"semantic errors and lexical errors, phrases li...",Aira Jin Garganera / Nigel Nograles


Define prompt to use for validation

In [None]:
valprompt = """
You are a strict translation judge evaluating an English-to-Filipino translation. Your task is to Evaluate and score the translated text.

### Task Overview:
Generate an evaluation from the english source text, the translated text, and the reference translation.
English source text: {source_text}
Translated text: {target_text}

### Evaluation Criteria:
Each criteria below will be either 0 or 1 point each. The perfect score of the translation is 6 points and the lowest score is 0 points.
The criteria can only have a score of 1 if they meed the requirements of the criteria. otherwise it will always be 0.
1. Accuracy – Deduct if any part of the meaning differs from the reference or omits details.
2. Fluency – Deduct if grammar, style, or idiomatic expression is less natural than the reference.
3. Coherence – Deduct if the logical flow or structure differs from the reference without valid reason.
4. Cultural Appropriateness – Deduct if cultural tone, idioms, or respectful forms differ in a way that harms the message.
5. Guideline Adherence – Deduct if terminology/style differs from the reference in a way that breaks domain rules.
6. Completeness – Deduct if any information present in the reference is missing or altered.

### Sample Output
English - The actress has not given any further details about it.
Translated - Ang aktres ay hindi nagbigay ng anumang karagdagang mga detalye tungkol dito.
Accuracy: 1 point (The translated text accurately conveys the core meaning of the source text, which is that the actress has provided no new information.)
Fluency: 0 point (The use of the "Ang... ay..." construction is grammatically correct but less natural and fluent than the reference text's "Wala pang..." structure, which is more common in Filipino.)
Coherence: 0 point (The sentence structure, while logical, differs from the more integrated and seamless flow of the reference text. The reference's structure connects the ideas more smoothly.)
Cultural Appropriateness: 1 point (The tone of the translation is neutral and fact-based, which is culturally appropriate for the context. No inappropriate language is used.)
Guideline Adherence: 0 point (The translation deviates from the style of the reference text by using the formal "ay" construction and the literal "anumang" instead of the more idiomatic "Wala pang..." structure.)
Completeness: 1 point (The translated text includes all key pieces of information from the source sentence: the actress, the action of not giving, the further details, and the subject matter.)
Total Score: 3 points
"""

Add Columns to the dataframe for the LLM to fill in

In [None]:
val_df_cleaned['llm_evaluation'] = None
val_df_cleaned['llm_score'] = None

In [None]:
for idx, row in val_df_cleaned.iterrows():

    formatted_prompt = valprompt.format(source_text=row['source_text'], target_text=row['target_text'])
    response = ai.generate_text(formatted_prompt, model_name='google/gemini-2.5-pro')
    evaluation_text = response
    val_df_cleaned.at[idx, 'llm_evaluation'] = evaluation_text

    # Extract total score using regex
    match = re.search(r'Total Score: (\d+) points', evaluation_text)
    if match:
        val_df_cleaned.at[idx, 'llm_score'] = int(match.group(1))

In [None]:
val_df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 57 entries, 0 to 63
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   source_text          57 non-null     object 
 1   target_text          57 non-null     object 
 2   score                57 non-null     float64
 3   Rater 1 Explanation  57 non-null     object 
 4   Rater 2 Explanation  54 non-null     object 
 5   Contributor          57 non-null     object 
 6   llm_evaluation       57 non-null     object 
 7   llm_score            46 non-null     object 
dtypes: float64(1), object(7)
memory usage: 6.1+ KB


Exporting to .csv file

In [None]:
val_df_cleaned.to_csv('validated_dataset.csv', index=False)

Because the responses did not follow the same format across all prompts, the regex could not extract a score for some rows, which left empty fields in the `llm_score` column. To make up for that, we exported the .csv file, edited it in a third-party program, and exported it back for results and analysis. The filled-in scores were taken from the outputs generated by the LLM, so they match the scores the LLM gave.
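Some of the missing scores might be recoverable with a more tolerant pattern than `r'Total Score: (\d+) points'`. The sketch below is a guess at likely variants, not a record of the actual responses: `TOTAL_RE` and `extract_total_score` are hypothetical names, and the example strings are made up.

```python
import re

# Tolerates the singular "point", markdown emphasis, and extra punctuation
# between the label and the number. The variants are illustrative guesses.
TOTAL_RE = re.compile(r"Total\s+Score\W*?(\d+)", re.IGNORECASE)

def extract_total_score(evaluation_text):
    """Return the total score as an int, or None if no total line is found."""
    match = TOTAL_RE.search(evaluation_text)
    return int(match.group(1)) if match else None

examples = [
    "Total Score: 3 points",
    "Total Score: 1 point",       # singular form, missed by the strict pattern
    "**Total Score:** 4 points",  # markdown emphasis around the label
    "No total given here.",
]
for text in examples:
    print(repr(text), "->", extract_total_score(text))
```

Rows that still return `None` would need manual review, as was done here.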

## Results and Analysis

This section compares the results of the prompt-engineered LLM judge against the human-labeled validation set.

In [17]:
valurl = 'https://raw.githubusercontent.com/CarandangR/CSC420M-LLM-Translation-Judge-for-English-to-Filipino-Translations/main/prompt_validated_dataset.csv'

valfilename = 'prompt_results.csv'
urllib.request.urlretrieve(valurl, valfilename)

('prompt_results.csv', <http.client.HTTPMessage at 0x7adddfa8ca50>)

In [18]:
prompt_results = pd.read_csv(valfilename)

In [19]:
prompt_results.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57 entries, 0 to 56
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   source_text          57 non-null     object
 1   target_text          57 non-null     object
 2   score                57 non-null     int64 
 3   Rater 1 Explanation  57 non-null     object
 4   Rater 2 Explanation  54 non-null     object
 5   Contributor          57 non-null     object
 6   llm_evaluation       57 non-null     object
 7   llm_score            57 non-null     int64 
dtypes: int64(2), object(6)
memory usage: 3.7+ KB


### Normalizing results of the LLM

In [20]:
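def normalize_score(raw):
    # Collapse the LLM's 0-6 total into three levels: 1, 3, or 5.
    if raw >= 5:
        return 5  # 5 or 6 criteria met
    elif 3 <= raw <= 4:
        return 3  # 3 or 4 criteria met
    else:
        return 1  # 0 to 2 criteria met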

In [21]:
prompt_results["llm_score_normalized"] = prompt_results["llm_score"].apply(normalize_score)
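To make the normalization concrete, the sketch below applies the same thresholds to every possible raw total from 0 to 6. The function body repeats the notebook's `normalize_score` so the block runs on its own; `mapping` is an illustrative name.

```python
def normalize_score(raw):
    # Same thresholds as the notebook's normalize_score.
    if raw >= 5:
        return 5
    elif 3 <= raw <= 4:
        return 3
    else:
        return 1

# Map every possible 0-6 total onto the three-level scale.
mapping = {raw: normalize_score(raw) for raw in range(7)}
print(mapping)
# {0: 1, 1: 1, 2: 1, 3: 3, 4: 3, 5: 5, 6: 5}
```

The normalized scale only takes the values 1, 3, and 5, so human scores of 2 or 4 have no exact counterpart on it.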

### Evaluation metrics

In [22]:
spearman_corr, p_value = spearmanr(prompt_results["score"], prompt_results["llm_score_normalized"])

variation = prompt_results.groupby("source_text")["llm_score_normalized"].std(ddof=0).fillna(0)
inconsistent_ratio = (variation > 0.1).mean() * 100

explainability_ratio = (prompt_results["llm_evaluation"].str.strip() != "").mean() * 100

coverage_metric = prompt_results["Contributor"].nunique()

# Display results
print(f"Spearman’s ρ: {spearman_corr:.3f} (p={p_value:.3g})")
print(f"Explainability ratio: {explainability_ratio:.2f}%")
print(f"Coverage (unique contributors): {coverage_metric}")

Spearman’s ρ: 0.409 (p=0.0016)
Explainability ratio: 100.00%
Coverage (unique contributors): 16
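Spearman's ρ is the Pearson correlation of the two score columns after each is converted to ranks, with tied values sharing their average rank. The pure-Python sketch below shows that computation on made-up scores; `average_ranks` and `spearman_rho` are hypothetical helpers written for illustration, and the notebook itself uses `scipy.stats.spearmanr`.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation computed on the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

human = [4, 4, 3, 5, 2]  # made-up human scores on the 1-5 scale
llm = [3, 5, 3, 5, 1]    # made-up normalized LLM scores in {1, 3, 5}
print(round(spearman_rho(human, llm), 3))
```

Because the normalized LLM scores take only three values, many rows are tied, which is why the average-rank treatment matters for this dataset.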


### Checking for Samples That Have a Large Effect on the Spearman Score

In [23]:
prompt_results["score_diff"] = (prompt_results["score"] - prompt_results["llm_score_normalized"]).abs()

largest_disagreements = prompt_results.sort_values(by="score_diff", ascending=False)

big_gaps = largest_disagreements[largest_disagreements["score_diff"] >= 2]

print(big_gaps[["source_text", "target_text", "score", "llm_score_normalized", "llm_evaluation"]])


                                          source_text  \
46  One of the purposes of life is to help others ...   
24  Welcome to a new world! In Monster Hunter: Wor...   
51  Some men are born mediocre, some men achieve m...   
25  For so long, you and me been finding each othe...   
4   alam mo ma'am masaya naman topics natin sa phi...   
31  The kernel handles low-level operations in the...   
3   Data normalization helps improve model perform...   
7                  Thank you for coming to the event.   
2   The algorithm efficiently identifies patterns ...   
36  Adobo is one of the most popular dishes in the...   
32  The address bus carries memory addresses to th...   
55  I wanted to invite my youngest to go get some ...   
42  The patient was given penicillin to prevent in...   
54  Sick of tea?! That's like being sick of breath...   
50  Things outside you are projections of what’s i...   
49  Taking crazy things seriously is a serious was...   
43  To be able to start fasting

In [24]:
big_gaps.to_csv('big_gaps.csv', index=False)

### Comparative metrics to compare against the Agentic System

These additional metrics provide further points of comparison.

In [25]:
def snap_to_coarse_label(x):
    # nearest among 1,3,5
    choices = np.array([1,3,5])
    nearest = choices[np.argmin(np.abs(choices - x))]
    label_map = {1: 'poor', 3: 'good', 5: 'excellent'}
    return nearest, label_map[int(nearest)]

In [26]:
prompt_results['human_label'] = prompt_results['score'].apply(lambda x: snap_to_coarse_label(x)[1])

# llm_score_normalized already holds values in {1, 3, 5}; map them to text labels
label_map = {1: 'poor', 3: 'good', 5: 'excellent'}
prompt_results['llm_label'] = prompt_results['llm_score_normalized'].map(label_map)


In [27]:
label_agreement = (prompt_results['score'] == prompt_results['llm_score_normalized']).mean()
print(f"\nLabel agreement (human vs LLM): {label_agreement*100:.1f}% ({int((label_agreement)*len(prompt_results))} / {len(prompt_results)})")

prompt_results['abs_diff'] = (prompt_results['score'] - prompt_results['llm_score_normalized']).abs()
mean_abs_diff = prompt_results['abs_diff'].mean()
print(f"\nMean absolute difference (|human - llm|): {mean_abs_diff:.3f}")


Label agreement (human vs LLM): 31.6% (18 / 57)

Mean absolute difference (|human - llm|): 1.158
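The agreement figure above compares the raw human `score` with `llm_score_normalized`, so a human score of 2 or 4 can never count as a match. A possible alternative, sketched below on made-up scores, compares the coarse labels instead. `snap_to_coarse` and `label_agreement` are hypothetical helpers; the tie rule (2 maps to 1, 4 maps to 3) mirrors what `np.argmin` does in `snap_to_coarse_label`.

```python
LABELS = {1: "poor", 3: "good", 5: "excellent"}

def snap_to_coarse(score):
    """Map a 1-5 score to the nearest of 1, 3, 5 (ties go to the lower value)."""
    return min((1, 3, 5), key=lambda choice: abs(choice - score))

def label_agreement(human_scores, llm_normalized):
    """Fraction of rows where the coarse human label equals the LLM label."""
    matches = sum(
        LABELS[snap_to_coarse(h)] == LABELS[m]
        for h, m in zip(human_scores, llm_normalized)
    )
    return matches / len(human_scores)

human = [4, 4, 3, 5, 2]  # made-up human scores (1-5)
llm = [3, 5, 3, 5, 1]    # made-up normalized LLM scores in {1, 3, 5}

exact = sum(h == m for h, m in zip(human, llm)) / len(human)
print(f"exact-score agreement: {exact:.0%}")
print(f"coarse-label agreement: {label_agreement(human, llm):.0%}")
```

On this toy data the two definitions give different percentages, so it is worth stating which one a reported agreement figure uses.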


## Output as JSON File

In [28]:
prompt_results.to_json("prompt_results.json", orient="records", force_ascii=False, indent=2)