# LLM Translation Judge for English-to-Filipino Translations via Prompt Engineering

## Installation of Dependencies

In [1]:
from google.colab import ai
import pandas as pd
import urllib.request
import numpy as np
from scipy.stats import spearmanr
import time
import random
import re

## Importing, Pre-processing, and Cleaning the dataset



### Importing the dataset

The dataset was downloaded from Google Sheets and uploaded to a public GitHub repository. From there, it is loaded into the notebook using `urllib`.

In [2]:
url = 'https://raw.githubusercontent.com/CarandangR/CSC420M-LLM-Translation-Judge-for-English-to-Filipino-Translations/main/Datasets%20-%20Training.csv'
filename = 'training_dataset.csv'

urllib.request.urlretrieve(url, filename)

('training_dataset.csv', <http.client.HTTPMessage at 0x7bc2e7359950>)

In [3]:
df = pd.read_csv(filename)

df.head()

Unnamed: 0,English,Filipino-Correct,Filipino-Flawed,Remarks,Contributor
0,Ang gnda na mura pa,It so beautiful and it's even affordable.,beautiful and cheap,flawed translation failed to express the 'na' ...,Charibeth Cheng
1,The Philippines is an archipelago made up of o...,"Ang Pilipinas ay isang kapulaang binubuo ng 7,...",Ang Pilipinas ay isang puno na binubuo ng mahi...,,Geena Tibule/Charlyne Arajoy Carabeo
2,Philippines is the world's second-largest arch...,Ang Pilipinas ang pangalawa sa pinakamalaking ...,Ang Pilipinas ay ang pangalawang malaking isla...,,Geena Tibule/Charlyne Arajoy Carabeo
3,Filipino and English are the two official lang...,Filipino at Ingles ang dalawang opisyal na lin...,Tagalog at Ingles ang dalawa opisyal lingwahe ...,,Geena Tibule/Charlyne Arajoy Carabeo
4,Tagalog is the most widely spoken native langu...,Tagalog ang pinakamalawak at ginagamit na katu...,Tagalog ay ang pinaka malawak sinasabi katutub...,,Geena Tibule/Charlyne Arajoy Carabeo


### Displaying the info of the dataset

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 564 entries, 0 to 563
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   English           562 non-null    object
 1   Filipino-Correct  562 non-null    object
 2   Filipino-Flawed   562 non-null    object
 3   Remarks           333 non-null    object
 4   Contributor       562 non-null    object
dtypes: object(5)
memory usage: 22.2+ KB


## Dataset Cleaning

### Dropping rows with missing `English`, `Filipino-Correct`, or `Filipino-Flawed` text

In [5]:
df_cleaned = df.dropna(subset=['English', 'Filipino-Correct', 'Filipino-Flawed']).copy()

### Renaming of columns

The columns are renamed to match the placeholders used in the prompt template:
*   `English` was turned into `source_text`
*   `Filipino-Correct` was turned into `reference_translation`
*   `Filipino-Flawed` was turned into `translated_text`


In [6]:
df_cleaned.rename(columns={
    'English': 'source_text',
    'Filipino-Correct': 'reference_translation',
    'Filipino-Flawed': 'translated_text',
}, inplace=True)

### Removing empty strings and duplicate pairs

In [7]:
df_cleaned = df_cleaned[
    (df_cleaned['source_text'].str.strip() != '') &
    (df_cleaned['translated_text'].str.strip() != '')
].drop_duplicates(subset=['source_text', 'translated_text'])

### Displaying Cleaned dataset

In [8]:
df_cleaned.sample(5)

Unnamed: 0,source_text,reference_translation,translated_text,Remarks,Contributor
294,Please take a seat.,Maupo po kayo.,Kumuha kayo ng upuan.,Literal translation,Sherwynn Angeles / Gregory Tiong
113,"Then you will pay for your transgressions, Ale...","Kung gayon, pagbabayaran mo ang iyong mga pagl...","Edi babayaran mo ang iyong mga mali, Alexandra...",flawed translation does not match formality of...,Elijah Rosario/John Kirsten Espiritu
271,I am playing games on my phone.,Naglalaro ako ng games sa akin selpon.,Naglalaro ako ng ng games sa aking selphone.,incorrect translation of cellphone.,Sherwynn Angeles / Gregory Tiong
187,"In object-oriented programming, inheritance pr...","Sa object-oriented programming, pinapalaganap ...",Inheritance ay nagpaparami ng code.,,Camron Ong / Arvin Tan
436,I know you're leaving in the morning when you ...,Alam kong aalis ka sa umaga pag-ising mo.,Alam ko aalis ka sa umaga pagkagising mo.,,Ryan Carandang/ Riley Veracruz


## Declaring the Prompting template

This template reflects the requirements in the machine project specifications. The LLM scores each translation on six criteria: Accuracy, Fluency, Coherence, Cultural Appropriateness, Guideline Adherence, and Completeness.

In [9]:
prompt = """
You are a strict translation judge evaluating an English-to-Filipino translation. Your task is to evaluate and score the translated text.

### Task Overview:
Generate an evaluation from the English source text, the translated text, and the reference translation.
English source text: {source_text}
Translated text: {translated_text}
Reference text: {reference_translation}
Note that the reference translation is the correct translation of the source text, so it would receive the perfect score. Keep this in mind when evaluating.

### Evaluation Criteria:
Each criterion below is worth either 0 or 1 point. The perfect score of the translation is 6 points and the lowest score is 0 points.
A criterion can only have a score of 1 if the translation meets its requirements; otherwise it will always be 0.
1. Accuracy – Deduct if any part of the meaning differs from the reference or omits details.
2. Fluency – Deduct if grammar, style, or idiomatic expression is less natural than the reference.
3. Coherence – Deduct if the logical flow or structure differs from the reference without valid reason.
4. Cultural Appropriateness – Deduct if cultural tone, idioms, or respectful forms differ in a way that harms the message.
5. Guideline Adherence – Deduct if terminology/style differs from the reference in a way that breaks domain rules.
6. Completeness – Deduct if any information present in the reference is missing or altered.

### Sample Output
{source_text}
{translated_text}
Accuracy: 1 point (The translated text accurately conveys the meaning of the source text, with no omitted details or changed meaning.)
Fluency: 0 point (The sentence structure is not as natural as the reference, and the use of "may" and "sawsawan" makes the text less fluent.)
Coherence: 0 point (The logical flow of the sentence is not as clear as the reference, and the use of "may" and "sawsawan" disrupts the coherence.)
Cultural Appropriateness: 1 point (The use of "bagoon alamang" and "kare-kare" is culturally appropriate and respectful.)
Guideline Adherence: 0 point (The use of "may" and "sawsawan" deviates from the guideline of using more natural and idiomatic expressions.)
Completeness: 1 point (The translated text includes all the information present in the reference.)
Total Score: 3 points
"""

## Import and setup of the Large Language Model

### Listing the models available through `google.colab.ai`

In [10]:
ai.list_models()

['google/gemini-2.0-flash',
 'google/gemini-2.0-flash-lite',
 'google/gemini-2.5-flash',
 'google/gemini-2.5-flash-lite',
 'google/gemini-2.5-pro',
 'google/gemma-3-12b',
 'google/gemma-3-1b',
 'google/gemma-3-27b',
 'google/gemma-3-4b']

### Selecting a random row to test different types of prompts

In [11]:
random_idx = random.choice(df_cleaned.index)
random_row = df_cleaned.loc[random_idx]

In [12]:
source_text = random_row['source_text']
translated_text = random_row['translated_text']
reference_translation = random_row.get('reference_translation')

In [13]:
formatted_prompt = prompt.format(
    source_text=source_text,
    translated_text=translated_text,
    reference_translation=reference_translation
)
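As a small illustration of how the placeholders get filled, the sketch below applies `str.format` to a shortened, hypothetical template (`mini_prompt` is not the notebook's prompt) using the mango example that appears later in the notebook. It also shows that literal braces have to be doubled, which matters for any template that embeds JSON.

```python
# Shortened, hypothetical template; it only mirrors the placeholder names of the real prompt.
mini_prompt = """English source text: {source_text}
Translated text: {translated_text}
Reference text: {reference_translation}
Answer as JSON: {{"total": <0-6>}}"""

row = {
    "source_text": "I love eating fresh mangoes in the summer.",
    "translated_text": "Mahilig akong kumain ng bagong mangga tuwing bakasyon.",
    "reference_translation": "Mahilig akong kumain ng sariwang mangga tuwing tag-init.",
}

# Unpacking the dict fills every named placeholder; {{ and }} become literal braces.
formatted = mini_prompt.format(**row)
print(formatted)
```

A missing key raises `KeyError`, so a typo in a placeholder name surfaces before any model call is made.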

### Generating a response

In [14]:
response = ai.generate_text(formatted_prompt, model_name='google/gemini-2.5-pro')

In [15]:
print("Raw Response for Example 1:")
print(response)

Raw Response for Example 1:
The risk of developing certain cancers can be reduced by not smoking, maintaining a healthy weight, limiting alcohol intake, and eating plenty of vegetables.
Ang panganib na magkaroon ng ilang mga kanser ay maaaring mabawasan sa pamamagitan ng hindi paninigarilyo, pagpapanatili ng malusog na timbang, paglilimita sa pag-inom ng alak, at pagkain ng maraming gulay.
**Accuracy:** 0 point (The translated text uses "hindi paninigarilyo" (a state of not smoking), which differs in meaning from the reference's "pagiwas sa paninigarilyo" (the action of avoiding smoking). The reference more accurately captures the preventative action implied in the source.)
**Fluency:** 0 point (The phrase "hindi paninigarilyo" is a literal translation that is less natural and idiomatic than the reference's "pagiwas sa paninigarilyo," which is the standard and more fluent phrasing in this context.)
**Coherence:** 1 point (The overall sentence structure and logical flow are identical to
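The raw response follows the `Criterion: N point` pattern from the sample output, sometimes with Markdown bold markers around the label. A minimal sketch of turning such text into numeric scores with `re` is shown below; `parse_scores` and `CRITERIA` are hypothetical names, and the pattern assumes the model keeps to this format.

```python
import re

CRITERIA = [
    "Accuracy", "Fluency", "Coherence",
    "Cultural Appropriateness", "Guideline Adherence", "Completeness",
]

def parse_scores(response_text):
    """Return {criterion: 0 or 1} for every criterion found in the response."""
    scores = {}
    for name in CRITERIA:
        # Allow optional Markdown bold markers around the label, e.g. **Accuracy:**
        pattern = rf"\*{{0,2}}{re.escape(name)}:?\*{{0,2}}:?\s*([01])\s*points?"
        match = re.search(pattern, response_text, flags=re.IGNORECASE)
        if match:
            scores[name] = int(match.group(1))
    return scores

# Shortened stand-in for a model reply (explanations abbreviated).
sample = """**Accuracy:** 0 point (meaning differs)
**Fluency:** 0 point (less natural)
**Coherence:** 1 point (same flow)"""

scores = parse_scores(sample)
print(scores, "partial total:", sum(scores.values()))
```

Criteria missing from a truncated reply are simply absent from the result, so the caller can tell an incomplete response apart from a score of 0.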

## Testing using Validation set

### Loading Validation set

In [16]:
valurl = 'https://raw.githubusercontent.com/CarandangR/CSC420M-LLM-Translation-Judge-for-English-to-Filipino-Translations/main/Datasets%20-%20Human-Labeled%20Validation%20Set.csv'

valfilename = 'validation_dataset.csv'
urllib.request.urlretrieve(valurl, valfilename)

('validation_dataset.csv', <http.client.HTTPMessage at 0x7bc2e695a350>)

In [17]:
val_df = pd.read_csv(valfilename)

val_df.head()

Unnamed: 0,Source Text (English),Target Text (Filipino),Final_score,Rater 1 Explanation,Rater 2 Explanation,Contributor
0,The children laughed and played under the afte...,Ang mga bata ay nagtawanan at naglaro sa ilali...,4.0,"Accurate, fluent, and natural translation. Cap...",Just slight error due to the literal translati...,Paul Ivan Enclonar/Alonzo Rimando
1,She took a break to gather her thoughts.,Nagpahinga siya para mag-isip-isip.,4.0,The translation is accurate. It was able to ca...,The translation would have been better if the ...,Paul Ivan Enclonar/Alonzo Rimando
2,The algorithm efficiently identifies patterns ...,Mabisang kinikilala ng algoritmo ang mga patte...,3.0,"The translation of ""identifies"" as ""kinikilala...",The translation would have been better if the ...,Paul Ivan Enclonar/Alonzo Rimando
3,Data normalization helps improve model perform...,Tumutulong sa pagpabuti ng model ang normalisa...,5.0,The translated text is natural and captures th...,The translation didn't literally translated th...,Paul Ivan Enclonar/Alonzo Rimando
4,alam mo ma'am masaya naman topics natin sa phi...,"You know, ma'am, we have a lot of fun philosop...",4.0,"flawed translation is close, but failed to tra...",,Charibeth Cheng


### Validation Set Cleaning

Dropping rows with missing `Source Text (English)`, `Target Text (Filipino)`, or `Final_score` values

In [18]:
val_df_cleaned = val_df.dropna(subset=['Source Text (English)', 'Target Text (Filipino)', 'Final_score']).copy()

Renaming of Columns

The columns are renamed so they are easier to reference in code:
*   `Source Text (English)` was turned into `source_text`
*   `Target Text (Filipino)` was turned into `target_text`
*   `Final_score` was turned into `score`


In [19]:
val_df_cleaned.rename(columns={
    'Source Text (English)': 'source_text',
    'Target Text (Filipino)': 'target_text',
    'Final_score': 'score',
}, inplace=True)

Removing empty strings and duplicate pairs

In [20]:
val_df_cleaned = val_df_cleaned[
    (val_df_cleaned['source_text'].str.strip() != '') &
    (val_df_cleaned['target_text'].str.strip() != '')
].drop_duplicates(subset=['source_text', 'target_text'])

Checking Cleaned dataset

In [21]:
val_df_cleaned.sample(5)

Unnamed: 0,source_text,target_text,score,Rater 1 Explanation,Rater 2 Explanation,Contributor
56,Taking crazy things seriously is a serious was...,Ang pagseseryoso ng mga bagay na baliw ay isan...,5.0,"Captures tone and meaning well, using native-s...",It uses words that flow naturally and matches ...,Lia Guillermo/Lester Anthony Sityar
11,Mark accidentally let the cat out of the bag a...,Hindi sinasadyang nailabas ni Mark ang pusa mu...,3.0,"Complete, and gramatically correct. However, t...",Idiom was not translated correctly,Boris Victoria / Joel Ethan Batac
37,Jollibee is a famous fast food restaurant in t...,Ang Jollibee ay isang sikat na kainan sa Pilip...,4.0,Better to phrase it as “Ang Jollibee ay isa sa...,- accurate translation\r\n- good grammar,Joaquin de Guzman
41,The hospitability of Filipinos are known world...,Ang pagiging hospitable ng mga Pilipino ay kil...,4.0,kilala could be replaced with alam which would...,kilala could be replace with sikat,Charlyne Arajoy Carabeo/Geena Tibule
15,"This guy right here is Shadow, and he's kinda ...","Ang lalaking ito dito ay si Shadow, at siya an...",1.0,"""Ang lalaking ito dito"" should not have ""dito""","buzz-kill is not translated. also ""totoong buz...",Elijah Rosario/John Kirsten Espiritu


In [22]:
val_df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 57 entries, 0 to 63
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   source_text          57 non-null     object 
 1   target_text          57 non-null     object 
 2   score                57 non-null     float64
 3   Rater 1 Explanation  57 non-null     object 
 4   Rater 2 Explanation  54 non-null     object 
 5   Contributor          57 non-null     object 
dtypes: float64(1), object(5)
memory usage: 3.1+ KB


Define prompt to use for validation

In [23]:
prompt = """
You are a strict translation judge evaluating an English-to-Filipino translation. Your task is to evaluate and score the translated text.

### Task Overview:
Generate an evaluation from the English source text, the translated text, and the reference translation.
English source text: {source_text}
Translated text: {translated_text}

### Evaluation Criteria:
Each criterion below is worth either 0 or 1 point. The perfect score of the translation is 6 points and the lowest score is 0 points.
A criterion can only have a score of 1 if the translation meets its requirements; otherwise it will always be 0.
1. Accuracy – Deduct if any part of the meaning differs from the reference or omits details.
2. Fluency – Deduct if grammar, style, or idiomatic expression is less natural than the reference.
3. Coherence – Deduct if the logical flow or structure differs from the reference without valid reason.
4. Cultural Appropriateness – Deduct if cultural tone, idioms, or respectful forms differ in a way that harms the message.
5. Guideline Adherence – Deduct if terminology/style differs from the reference in a way that breaks domain rules.
6. Completeness – Deduct if any information present in the reference is missing or altered.

### Sample Output
The algorithm efficiently identifies patterns in large datasets.
Mabisang kinikilala ng algoritmo ang mga pattern sa malalaking dataset.
Accuracy: 1 point (The translated text accurately conveys the meaning of the source text, with no omitted details or changed meaning.)
Fluency: 0 point (The sentence structure is not as natural as the reference, and the use of "may" and "sawsawan" makes the text less fluent.)
Coherence: 0 point (The logical flow of the sentence is not as clear as the reference, and the use of "may" and "sawsawan" disrupts the coherence.)
Cultural Appropriateness: 1 point (The use of "bagoon alamang" and "kare-kare" is culturally appropriate and respectful.)
Guideline Adherence: 0 point (The use of "may" and "sawsawan" deviates from the guideline of using more natural and idiomatic expressions.)
Completeness: 1 point (The translated text includes all the information present in the reference.)
Total Score: 3 points
"""

# Agentic System

## Defining LLM Wrapper (serves as the "Brain")

In [24]:
import requests

def check_fluency_languagetool(text):
    url = "https://api.languagetool.org/v2/check"
    data = {
        'text': text,
        'language': 'tl',  # Tagalog, or use 'en' for English etc.
    }
    response = requests.post(url, data=data).json()
    # Return the number of errors found or confidence score
    return len(response.get('matches', []))


In [25]:
# LLM wrapper for Google Gemini in Colab
class LLMWrapper:
    def __init__(self, model_name='google/gemini-2.5-pro'):
        self.model_name = model_name

    def generate(self, prompt):
        # Calls the built-in ai.generate_text method available in your Colab
        response = ai.generate_text(prompt, model_name=self.model_name)
        return response
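The wrapper calls the model once and lets any exception propagate. A minimal sketch of a retry helper with exponential backoff and jitter is shown below; it relies only on `time` and `random`, which the notebook already imports. `generate_with_retry` and `flaky_generate` are hypothetical names, and whether `ai.generate_text` actually raises on transient failures is an assumption.

```python
import random
import time

def generate_with_retry(generate_fn, prompt, max_attempts=3, base_delay=1.0):
    """Call generate_fn(prompt), retrying with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return generate_fn(prompt)
        except Exception as exc:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

# Stand-in for LLMWrapper.generate that fails twice before succeeding.
calls = {"count": 0}

def flaky_generate(prompt):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return f"echo: {prompt}"

result = generate_with_retry(flaky_generate, "Score: ?", base_delay=0.0)
print(result)
```

In the notebook, the same helper could wrap `llm.generate` so that a single failed request does not discard a whole row of the batch.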


## Memory Module

In [26]:
# Memory to store step-by-step evaluations and final results
class MemoryModule:
    def __init__(self):
        self.memory = {}

    def store(self, step_name, content):
        self.memory[step_name] = content

    def retrieve(self, step_name):
        return self.memory.get(step_name, "")

    def get_all(self):
        return self.memory


## Planning and Reasoning Engine (Prompt Builder and response parser) + LanguageTool for Fluency

In [27]:
# Handles structured evaluation prompts for each criterion
import requests

def check_fluency_languagetool(text):
    url = "https://api.languagetool.org/v2/check"
    data = {
        'text': text,
        'language': 'tl',  # Change to 'en' for English, etc.
    }
    try:
        response = requests.post(url, data=data)
        response.raise_for_status()  # Raises error for bad status
        result = response.json()
        error_count = len(result.get('matches', []))
        print(f"LanguageTool found {error_count} errors.")  # Debug print
        return error_count
    except Exception as e:
        print(f"LanguageTool API error: {e}")
        return -1  # Indicate failure

class EvaluationSteps:
    def __init__(self, llm, memory):
        self.llm = llm
        self.memory = memory

    def execute_step(self, step_name, prompt):
        print(f"\n--- Agent Thought: Executing {step_name} ---")
        response = self.llm.generate(prompt)
        print(f"--- Agent Observation ({step_name}): ---\n{response}\n")
        self.memory.store(step_name, response)
        return response

    def accuracy(self, source, translated, reference):
        return self.execute_step("accuracy", f"""
Source: "{source}"
Translated: "{translated}"
Reference: "{reference}"

Task: Evaluate ACCURACY.
Deduct a point if meaning differs or omits details.
Score 1 if accurate, 0 otherwise.

Format:
Score: <0 or 1>
Explanation: <your analysis>
""")

    def fluency(self, source, translated, reference):
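        error_count = check_fluency_languagetool(translated)
        if error_count >= 0:
            print(f"LanguageTool detected {error_count} errors in the translation.")
            tool_note = f"LanguageTool detected {error_count} potential grammar/style/spelling errors in the translated text."
        else:
            # check_fluency_languagetool returns -1 when the API request fails
            tool_note = "LanguageTool was unavailable, so no automatic error count is provided."
        prompt = f"""
Source: "{source}"
Translated: "{translated}"
Reference: "{reference}"

Additional Info:
{tool_note}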

Task: Evaluate FLUENCY.
Deduct a point if grammar/style/idiomatic use is unnatural.
Score 1 if fluent, 0 otherwise.

Format:
Score: <0 or 1>
Explanation: <your analysis>
"""
        return self.execute_step("fluency", prompt)

    def coherence(self, source, translated, reference):
        return self.execute_step("coherence", f"""
Source: "{source}"
Translated: "{translated}"
Reference: "{reference}"

Task: Evaluate COHERENCE.
Deduct a point if logical flow differs without reason.
Score 1 if coherent, 0 otherwise.

Format:
Score: <0 or 1>
Explanation: <your analysis>
""")

    def cultural_appropriateness(self, source, translated, reference):
        return self.execute_step("cultural_appropriateness", f"""
Source: "{source}"
Translated: "{translated}"
Reference: "{reference}"

Task: Evaluate CULTURAL APPROPRIATENESS.
Deduct a point if tone/idioms/forms differ in a way that harms meaning.
Score 1 if appropriate, 0 otherwise.

Format:
Score: <0 or 1>
Explanation: <your analysis>
""")

    def guideline_adherence(self, source, translated, reference):
        return self.execute_step("guideline_adherence", f"""
Source: "{source}"
Translated: "{translated}"
Reference: "{reference}"

Task: Evaluate GUIDELINE ADHERENCE.
Deduct a point if terminology/style violates domain rules.
Score 1 if adheres, 0 otherwise.

Format:
Score: <0 or 1>
Explanation: <your analysis>
""")

    def completeness(self, source, translated, reference):
        return self.execute_step("completeness", f"""
Source: "{source}"
Translated: "{translated}"
Reference: "{reference}"

Task: Evaluate COMPLETENESS.
Deduct a point if information in reference is missing or altered.
Score 1 if complete, 0 otherwise.

Format:
Score: <0 or 1>
Explanation: <your analysis>
""")


## Synthesis Tool

In [28]:
class FinalSynthesisTool:
    def __init__(self, llm, memory):
        self.llm = llm
        self.memory = memory

    def synthesize(self):
        prompt = f"""
You have performed the following analyses:

- Accuracy: {self.memory.retrieve('accuracy')}
- Fluency: {self.memory.retrieve('fluency')}
- Coherence: {self.memory.retrieve('coherence')}
- Cultural Appropriateness: {self.memory.retrieve('cultural_appropriateness')}
- Guideline Adherence: {self.memory.retrieve('guideline_adherence')}
- Completeness: {self.memory.retrieve('completeness')}

Return a JSON object:
{{
  "scores": {{
    "accuracy": <0 or 1>,
    "fluency": <0 or 1>,
    "coherence": <0 or 1>,
    "cultural_appropriateness": <0 or 1>,
    "guideline_adherence": <0 or 1>,
    "completeness": <0 or 1>,
    "total": <sum>
  }},
  "explanation": "<comprehensive synthesis>"
}}
"""
        print("\n--- Agent Thought: Synthesizing Final Judgment ---")
        result_str = self.llm.generate(prompt)

        import json
        try:
            start = result_str.find("{")
            end = result_str.rfind("}") + 1
            result_json = json.loads(result_str[start:end])

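            # Normalize total score and add label.
            # Recompute the total from the per-criterion scores rather than
            # trusting the model's own arithmetic.
            scores = result_json["scores"]
            criteria = ("accuracy", "fluency", "coherence",
                        "cultural_appropriateness", "guideline_adherence", "completeness")
            total = sum(int(scores.get(name, 0)) for name in criteria)
            scores["total"] = total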
            if total >= 5:
                norm_score = 5
                label = "excellent"
            elif 3 <= total <= 4:
                norm_score = 3
                label = "good"
            else:
                norm_score = 1
                label = "poor"

            result_json["scores"]["normalized_total"] = norm_score
            result_json["scores"]["label"] = label

            return result_json

        except Exception as e:
            print(f"Error parsing JSON: {e}")
            return {"error": "Invalid JSON"}
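The parsing and normalization steps inside `synthesize` can be exercised without calling the model. The sketch below isolates them into two hypothetical helpers, `extract_json` and `normalize_total`; the thresholds mirror the ones above for integer totals, and the sample reply is made up.

```python
import json

def extract_json(text):
    """Parse the outermost {...} span of an LLM reply; return None on failure."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None

def normalize_total(total):
    """Map a 0-6 rubric total to the 1/3/5 scale and label used above."""
    if total >= 5:
        return 5, "excellent"
    if total >= 3:
        return 3, "good"
    return 1, "poor"

# Made-up reply with chatter around the JSON object.
reply = 'Sure! {"scores": {"total": 4}, "explanation": "ok"} Hope this helps.'
parsed = extract_json(reply)
print(normalize_total(parsed["scores"]["total"]))
```

Note that the mapping collapses seven possible totals into three values, so the normalized score can never equal a human rating of 2 or 4.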


## Orchestration Layer

In [29]:
# Controls the sequence: Thought → Action → Observation
class AgenticJudge:
    def __init__(self, llm):
        self.memory = MemoryModule()
        self.steps = EvaluationSteps(llm, self.memory)
        self.final_synthesizer = FinalSynthesisTool(llm, self.memory)

    def evaluate(self, source, translated, reference):
        print(f"Evaluating: \"{source}\" → \"{translated}\"\n")
        print("--- Model initialized successfully. ---\n")
        print("--- STARTING AGENTIC EVALUATION ---\n")

        self.steps.accuracy(source, translated, reference)
        self.steps.fluency(source, translated, reference)
        self.steps.coherence(source, translated, reference)
        self.steps.cultural_appropriateness(source, translated, reference)
        self.steps.guideline_adherence(source, translated, reference)
        self.steps.completeness(source, translated, reference)

        final_result = self.final_synthesizer.synthesize()
        print("\n--- FINAL EVALUATION ---")
        print(json.dumps(final_result, indent=2, ensure_ascii=False))
        return final_result


Accepting a translation pair

In [30]:
import json
import pandas as pd

In [31]:
# 1. Create the LLM "brain"
llm = LLMWrapper(model_name="google/gemini-2.5-pro")

# 2. Create the agent
judge = AgenticJudge(llm=llm)

# 3. Define your test inputs
source_text = "I love eating fresh mangoes in the summer."
translated_text = "Mahilig akong kumain ng bagong mangga tuwing bakasyon."
reference_text = "Mahilig akong kumain ng sariwang mangga tuwing tag-init."

# 4. Run the evaluation
result = judge.evaluate(source_text, translated_text, reference_text)


Evaluating: "I love eating fresh mangoes in the summer." → "Mahilig akong kumain ng bagong mangga tuwing bakasyon."

--- Model initialized successfully. ---

--- STARTING AGENTIC EVALUATION ---


--- Agent Thought: Executing accuracy ---
--- Agent Observation (accuracy): ---
Score: 0
Explanation: The translation is inaccurate. "Fresh" was translated as "bago" (new) instead of "sariwa". Additionally, "summer" was translated as "bakasyon" (vacation) instead of the correct term "tag-init". These are significant changes in meaning.

LanguageTool found 0 errors.
LanguageTool detected 0 errors in the translation.

--- Agent Thought: Executing fluency ---
--- Agent Observation (fluency): ---
Score: 0
Explanation: The translation is grammatically correct, but it is not fluent due to unnatural word choices. The word "bagong" translates to "new," which is not the correct term for "fresh" in the context of fruit; the proper word is "sariwa." Additionally, "bakasyon" means "vacation," not "summer.
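Each step replies in the `Score: <0 or 1>` format, so the per-criterion scores can also be read directly from the stored observations. A minimal sketch is shown below; `parse_step_score` is a hypothetical helper and the `step_outputs` strings are shortened illustrations, not actual model output.

```python
import re

def parse_step_score(response_text):
    """Extract the 0/1 score from a 'Score: <0 or 1>' line, or None if absent."""
    match = re.search(r"^\s*Score:\s*([01])\b", response_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None

# Illustrative observations keyed by step name, as MemoryModule would store them.
step_outputs = {
    "accuracy": "Score: 0\nExplanation: 'Fresh' was translated as 'bago'.",
    "fluency": "Score: 0\nExplanation: Unnatural word choices.",
    "coherence": "Score: 1\nExplanation: Same logical flow.",
    "completeness": "No score line here",
}

parsed = {step: parse_step_score(text) for step, text in step_outputs.items()}
total = sum(score for score in parsed.values() if score is not None)
print(parsed, "total:", total)
```

Summing parsed scores this way gives a deterministic total that can be compared against the one reported by the synthesis step.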

## Batch Processing of Data

### Finding the last completed index

In [37]:
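# Load original full validation dataframe; reset the index so that row labels
# match the row positions of the partial CSV (which is saved with index=False)
val_df_with_scores = val_df_cleaned.reset_index(drop=True)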

# Load partial results with normalized_total (may be shorter)
partial_df = pd.read_csv('validation_with_normalized_scores_partial.csv')

# Make sure partial_df has 'normalized_total' column
if 'normalized_total' in partial_df.columns:
    # Merge partial results by index into full df
    val_df_with_scores['normalized_total'] = None  # Initialize

    # Assign normalized_total from partial_df to full df for matching indices
    for idx, val in partial_df['normalized_total'].items():
        val_df_with_scores.at[idx, 'normalized_total'] = val
else:
    val_df_with_scores['normalized_total'] = None

# Find last completed index on full df
completed_rows = val_df_with_scores['normalized_total'].notnull()
last_completed_index = completed_rows[completed_rows].index.max() if completed_rows.any() else -1

print(f"Last completed index on full df: {last_completed_index}")

# Batch processing can then resume from last_completed_index + 1 on the full df


Last completed index on full df: 9
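Resuming from `last_completed_index + 1` assumes that scores were filled in order with no gaps, so a row that failed in the middle of a batch would be skipped. A minimal sketch of an alternative is shown below: it resumes at the first row that still lacks a score. `first_unscored_position` is a hypothetical helper, and the score list is made up.

```python
import math

def first_unscored_position(scores):
    """Return the position of the first missing score (None or NaN), or len(scores)."""
    for position, value in enumerate(scores):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return position
    return len(scores)

# Scores as they might look after reloading a partial CSV: NaN marks unscored rows.
scores = [3.0, 5.0, float("nan"), 5.0, None]
print(first_unscored_position(scores))
```

The helper works on a plain list such as the one returned by `Series.tolist()`, so it returns a row position rather than an index label.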


In [39]:
import json
import pandas as pd

batch_size = 10
val_df_with_scores = val_df_cleaned.copy()

# Initialize the full-length list with None
normalized_scores = [None] * len(val_df_with_scores)

for start_idx in range(0, len(val_df_with_scores), batch_size):
    end_idx = min(start_idx + batch_size, len(val_df_with_scores))
    print(f"Processing rows {start_idx} to {end_idx - 1}...")

    for idx in range(start_idx, end_idx):
        row = val_df_with_scores.iloc[idx]
        source = row['source_text']
        translated = row['target_text']

        try:
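            # The validation set has no reference translation, so the translated text is passed as its own reference
            final_eval = judge.evaluate(source, translated, reference=translated)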
        except Exception as e:
            print(f"Error evaluating row {idx}: {e}")
            final_eval = {"scores": {"normalized_total": None}}

        norm_score = final_eval.get("scores", {}).get("normalized_total", None)
        normalized_scores[idx] = norm_score  # Update specific position

        print("\n--- FINAL EVALUATION ---")
        print(json.dumps(final_eval, indent=2, ensure_ascii=False))

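    # Assign only the processed batch slice. iloc needs an integer column position,
    # and the slice end is already exclusive, so no "- 1" is needed.
    if 'normalized_total' not in val_df_with_scores.columns:
        val_df_with_scores['normalized_total'] = None
    col_pos = val_df_with_scores.columns.get_loc('normalized_total')
    val_df_with_scores.iloc[start_idx:end_idx, col_pos] = normalized_scores[start_idx:end_idx]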

    # Save intermediate results after each batch
    val_df_with_scores.iloc[:end_idx].to_csv('validation_with_normalized_scores_partial.csv', index=False, encoding='utf-8-sig')

    print(f"Saved partial results up to row {end_idx - 1}")

print("Batch processing complete!")

# Save final results
val_df_with_scores['normalized_total'] = normalized_scores
val_df_with_scores.to_csv('validation_with_normalized_scores_final.csv', index=False, encoding='utf-8-sig')
print("Saved final results as validation_with_normalized_scores_final.csv")


Processing rows 0 to 9...
Evaluating: "The children laughed and played under the afternoon sun." → "Ang mga bata ay nagtawanan at naglaro sa ilalim ng hapon na araw."

--- Model initialized successfully. ---

--- STARTING AGENTIC EVALUATION ---


--- Agent Thought: Executing accuracy ---
--- Agent Observation (accuracy): ---
Score: 1
Explanation: The translation is a perfect and literal match for the source sentence. All components are accurately translated: "The children" as "Ang mga bata," "laughed and played" as "nagtawanan at naglaro," and "under the afternoon sun" as "sa ilalim ng hapon na araw." The meaning is fully preserved without any omissions or alterations.

LanguageTool found 2 errors.
LanguageTool detected 2 errors in the translation.

--- Agent Thought: Executing fluency ---


KeyboardInterrupt: 

### Process by batches of 10

In [40]:
import json
import pandas as pd

batch_size = 10
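# Reset the index so row labels match the row positions of the partial CSV (saved with index=False)
val_df_with_scores = val_df_cleaned.reset_index(drop=True)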

# Load partial results CSV if exists
try:
    partial_df = pd.read_csv('validation_with_normalized_scores_partial.csv')
    if 'normalized_total' in partial_df.columns:
        val_df_with_scores['normalized_total'] = None
        for idx, val in partial_df['normalized_total'].items():
            val_df_with_scores.at[idx, 'normalized_total'] = val
except FileNotFoundError:
    val_df_with_scores['normalized_total'] = None

# Find last completed index
completed_rows = val_df_with_scores['normalized_total'].notnull()
last_completed_index = completed_rows[completed_rows].index.max() if completed_rows.any() else -1

print(f"Last completed index on full df: {last_completed_index}")

start_idx = last_completed_index + 1
total_rows = len(val_df_with_scores)

normalized_scores = val_df_with_scores['normalized_total'].tolist()  # get existing scores list (with Nones)

for batch_start in range(start_idx, total_rows, batch_size):
    batch_end = min(batch_start + batch_size, total_rows)
    print(f"Processing rows {batch_start} to {batch_end - 1}...")

    for idx in range(batch_start, batch_end):
        row = val_df_with_scores.iloc[idx]
        source = row['source_text']
        translated = row['target_text']

        try:
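            # The validation set has no reference translation, so the translated text is passed as its own reference
            final_eval = judge.evaluate(source, translated, reference=translated)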
        except Exception as e:
            print(f"Error evaluating row {idx}: {e}")
            final_eval = {"scores": {"normalized_total": None}}

        norm_score = final_eval.get("scores", {}).get("normalized_total", None)
        normalized_scores[idx] = norm_score

        print("\n--- FINAL EVALUATION ---")
        print(json.dumps(final_eval, indent=2, ensure_ascii=False))

    # Update dataframe column for this batch
    val_df_with_scores.iloc[batch_start:batch_end, val_df_with_scores.columns.get_loc('normalized_total')] = normalized_scores[batch_start:batch_end]


    # Save intermediate results
    val_df_with_scores.to_csv('validation_with_normalized_scores_partial.csv', index=False, encoding='utf-8-sig')
    print(f"Saved partial results up to row {batch_end - 1}")

print("Batch processing complete!")

# Save final results
val_df_with_scores['normalized_total'] = normalized_scores
val_df_with_scores.to_csv('validation_with_normalized_scores_final.csv', index=False, encoding='utf-8-sig')
print("Saved final results as validation_with_normalized_scores_final.csv")


Last completed index on full df: 9
Processing rows 10 to 19...
Evaluating: "I didn’t want to go to the dentist, but I had to bite the bullet and get it over with." → "Ayoko sanang pumunta sa dentista, pero kailangan kong tiisin ito para matapos na."

--- Model initialized successfully. ---

--- STARTING AGENTIC EVALUATION ---


--- Agent Thought: Executing accuracy ---
--- Agent Observation (accuracy): ---
Score: 1
Explanation: The translation is highly accurate. "Ayoko sanang pumunta sa dentista" perfectly captures the reluctance in "I didn’t want to go to the dentist." The idiom "bite the bullet" is appropriately translated into its meaning, which is to endure a difficult situation. "Kailangan kong tiisin ito" (I need to endure this) combined with "para matapos na" (so it will be over with) effectively conveys the entire meaning of "I had to bite the bullet and get it over with." All details and the overall sentiment are preserved.

LanguageTool found 0 errors.
LanguageTool detected 

In [42]:
val_df_with_scores = pd.read_csv("validation_with_normalized_scores_final.csv")

val_df_with_scores.head()

Unnamed: 0,source_text,target_text,score,Rater 1 Explanation,Rater 2 Explanation,Contributor,normalized_total
0,The children laughed and played under the afte...,Ang mga bata ay nagtawanan at naglaro sa ilali...,4.0,"Accurate, fluent, and natural translation. Cap...",Just slight error due to the literal translati...,Paul Ivan Enclonar/Alonzo Rimando,3.0
1,She took a break to gather her thoughts.,Nagpahinga siya para mag-isip-isip.,4.0,The translation is accurate. It was able to ca...,The translation would have been better if the ...,Paul Ivan Enclonar/Alonzo Rimando,5.0
2,The algorithm efficiently identifies patterns ...,Mabisang kinikilala ng algoritmo ang mga patte...,3.0,"The translation of ""identifies"" as ""kinikilala...",The translation would have been better if the ...,Paul Ivan Enclonar/Alonzo Rimando,5.0
3,Data normalization helps improve model perform...,Tumutulong sa pagpabuti ng model ang normalisa...,5.0,The translated text is natural and captures th...,The translation didn't literally translated th...,Paul Ivan Enclonar/Alonzo Rimando,5.0
4,alam mo ma'am masaya naman topics natin sa phi...,"You know, ma'am, we have a lot of fun philosop...",4.0,"flawed translation is close, but failed to tra...",,Charibeth Cheng,1.0


# Data analysis
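
The comparison in this section collapses the 1 to 5 human scores onto the judge's three-point scale (1, 3, 5) by taking the nearest value. The following is a minimal sketch of how ties resolve under that rule, assuming scores on a 1 to 5 scale; `snap_to_coarse` is a hypothetical standalone stand-in for the notebook's helper, not part of the original code.

```python
import numpy as np

def snap_to_coarse(score, choices=(1, 3, 5)):
    """Return the choice nearest to score (hypothetical helper for illustration)."""
    arr = np.array(choices)
    # np.argmin returns the first index of the minimum, so when a score is
    # equally far from two choices the earlier (lower) choice is selected.
    return int(arr[np.argmin(np.abs(arr - score))])

labels = {1: "poor", 3: "good", 5: "excellent"}

# Map every integer score on the human scale to its coarse label.
snapped = {s: labels[snap_to_coarse(s)] for s in [1, 2, 3, 4, 5]}
print(snapped)
```

Under this rule a human score of 4 counts as 'good' rather than 'excellent', so an LLM score of 5 on a translation that humans rated 4 is recorded as a label disagreement.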

In [43]:
df = val_df_with_scores.dropna(subset=['score','normalized_total']).copy()

# Ensure numeric types
df['score'] = pd.to_numeric(df['score'], errors='coerce')
df['normalized_total'] = pd.to_numeric(df['normalized_total'], errors='coerce')

df = df.dropna(subset=['score','normalized_total']).reset_index(drop=True)

print(f"Using {len(df)} examples for comparison.")

# 1) Spearman correlation between raw numeric human score and LLM normalized_total
rho, pval = spearmanr(df['score'], df['normalized_total'])
print(f"\nSpearman correlation (human score vs LLM normalized_total): ρ = {rho:.3f}, p-value = {pval:.3g}")

# 2) Map numeric -> coarse labels by snapping to nearest of [1,3,5], then text label
def snap_to_coarse_label(x):
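    # Nearest among 1, 3, 5. np.argmin returns the first minimum, so ties
    # resolve downward: a score of 2 snaps to 1 and a score of 4 snaps to 3.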
    choices = np.array([1,3,5])
    nearest = choices[np.argmin(np.abs(choices - x))]
    label_map = {1: 'poor', 3: 'good', 5: 'excellent'}
    return nearest, label_map[int(nearest)]

df['human_nearest_score'] = df['score'].apply(lambda x: snap_to_coarse_label(x)[0])
df['human_label'] = df['score'].apply(lambda x: snap_to_coarse_label(x)[1])

# For LLM normalized_total we expect values in {1,3,5} already; map to text
label_map = {1: 'poor', 3: 'good', 5: 'excellent'}
df['llm_label'] = df['normalized_total'].map(label_map)

# 3) Label agreement
# Count matches directly; int(mean * n) can truncate one short due to float rounding.
matches = (df['human_label'] == df['llm_label'])
label_agreement = matches.mean()
print(f"\nLabel agreement (human vs LLM): {label_agreement*100:.1f}% ({int(matches.sum())} / {len(df)})")

# Confusion matrix (counts)
confusion_counts = pd.crosstab(df['human_label'], df['llm_label'], rownames=['human'], colnames=['llm'], margins=True)
print("\nConfusion counts (human_label x llm_label):")
print(confusion_counts)

# Normalized confusion (row-normalized to see, for each human label, LLM distribution)
confusion_norm = pd.crosstab(df['human_label'], df['llm_label'], normalize='index')
print("\nConfusion (row-normalized):")
print(confusion_norm)

# 4) Top numeric disagreements (absolute diff)
df['abs_diff'] = (df['score'] - df['normalized_total']).abs()
top_disagreements = df.sort_values('abs_diff', ascending=False).head(10)
print("\nTop 10 numeric disagreements (human score vs LLM normalized_total):")
display_cols = ['source_text','target_text','score','normalized_total','human_label','llm_label','abs_diff']
print(top_disagreements[display_cols].to_string(index=False))

# 5) Quick summary
mean_abs_diff = df['abs_diff'].mean()
print(f"\nMean absolute difference (|human - llm|): {mean_abs_diff:.3f}")

df.to_csv("comparison_human_vs_llm.csv", index=False, encoding='utf-8-sig')
print("\nSaved detailed comparison to comparison_human_vs_llm.csv")

Using 57 examples for comparison.

Spearman correlation (human score vs LLM normalized_total): ρ = 0.384, p-value = 0.00318

Label agreement (human vs LLM): 35.1% (20 / 57)

Confusion counts (human_label x llm_label):
llm        excellent  good  poor  All
human                                
excellent         13     0     1   14
good              24     3     2   29
poor               7     3     4   14
All               44     6     7   57

Confusion (row-normalized):
llm_label    excellent      good      poor
human_label                               
excellent     0.928571  0.000000  0.071429
good          0.827586  0.103448  0.068966
poor          0.500000  0.214286  0.285714

Top 10 numeric disagreements (human score vs LLM normalized_total):
                                                                                                                                                                                                                                                 