# LLM-as-a-Judge Evaluation Pipeline Testing

## Validation Framework: LLM-Human Agreement Analysis

This notebook implements the second validation experiment from the thesis framework, focusing on **LLM-as-a-judge inter-rater reliability** testing. The analysis evaluates whether automated LLM evaluation can achieve substantial agreement with human expert judgments across the eight empirically derived quality metrics.

### Research Context
- **Objective**: Validate LLM-based evaluation reliability compared to human expert assessments
- **Sample**: Same job listings evaluated by both human experts and LLM judges
- **Methodology**: Dual LLM assessment (GPT-4o and Claude-3.5)
- **Significance**: Establishes foundation for automated quality evaluation in recruitment contexts

---

## 1. Evaluation Framework Architecture

### LLM-as-a-Judge Implementation
The `IndividualMetricEvaluator` class implements the core evaluation framework using structured prompts that incorporate the operational definitions from Table 2 of the thesis. The system employs:

**Dual Model Strategy:**
- **GPT-4o (`gpt-4o-2024-05-13`)**: Selected for demonstrated consistency with expert judgments in textual analysis
- **Claude 3.5 Sonnet (`claude-3-5-sonnet-20240620`)**: Chosen for minimal sensitivity to evaluation biases
- **Temperature = 0.0**: Minimizes sampling randomness so that repeated runs are as reproducible as possible for reliability assessment

**Quality Dimensions Assessment:**
- **Clarity**: Language level appropriateness and syntax/grammar adherence
- **Relevance**: Motivating text effectiveness and creativity level appropriateness  
- **Correctness**: Content rules compliance and tone of voice consistency
- **Completeness**: Applicant profile reflection and company information provision
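
As a rough illustration of how the eight metrics group into the four dimensions, the sketch below averages metric scores per dimension. The `*_score` keys match the result columns produced later in this notebook; the `dimension_means` helper, the equal-weight averaging, and the example scores are assumptions made here for illustration, not something the thesis prescribes.

```python
from statistics import mean

# Grouping of the eight metrics into the four quality dimensions listed above.
DIMENSIONS = {
    "clarity": ["language_level_score", "syntax_grammar_score"],
    "relevance": ["motivating_text_score", "creativity_level_score"],
    "correctness": ["content_rules_score", "tone_of_voice_score"],
    "completeness": ["applicant_profile_score", "company_information_score"],
}

def dimension_means(scores: dict) -> dict:
    """Hypothetical helper: unweighted mean of the two metrics in each dimension."""
    return {dim: mean(scores[m] for m in metrics)
            for dim, metrics in DIMENSIONS.items()}

# Made-up scores for a single listing, on the 1-5 scale.
example = {
    "language_level_score": 4, "syntax_grammar_score": 5,
    "motivating_text_score": 3, "creativity_level_score": 3,
    "content_rules_score": 2, "tone_of_voice_score": 4,
    "applicant_profile_score": 5, "company_information_score": 4,
}
print(dimension_means(example))
```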



In [12]:
import numpy as np
import pandas as pd
import re
import os

from datetime import datetime
from typing import Literal, Union

from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic



In [13]:
en_check = pd.read_excel('english_shuffled_job_listings.xlsx')
nl_check = pd.read_excel('dutch_shuffled_job_listings.xlsx')

## 1. Evaluation Metrics (TEST)
### 1.1 Quantifiable metrics (TEST)


In [14]:

class JobListingValidator:    
    @staticmethod
    def validate_word_count(text: str, min_words: int, max_words: int):
        word_count = len(text.split())
        return min_words <= word_count <= max_words

    @staticmethod
    def validate_sentence_length(text: str, max_length: int):
        # Split on sentence-ending punctuation; `re` is already imported at module level.
        sentences = re.split(r'[.!?]+', text)
        long_sentences = []

        for sentence in sentences:
            if len(sentence.split()) > max_length:
                long_sentences.append(sentence.strip())

        return long_sentences

    @staticmethod
    def check_gendered_language(text: str, language: str = "nl"):
        gendered_terms = {
            "nl": ["hij", "zij", "hem", "haar", "zijn", "de hare", "man", "vrouw", "kerel", 
                   "meid", "verkoper", "verkoopster", "zakenman", "zakenvrouw", "politieman", 
                   "politievrouw"],
            "en": [
                'he', 'she', 'him', 'her', 'his', 'hers', 'man', 'woman',
                'guy', 'gal', 'salesman', 'saleswoman', 'businessman', 'businesswoman',
                'policeman', 'policewoman', 'spokesman', 'spokeswoman']}
    
        terms = gendered_terms.get(language.lower(), [])
        found_terms = []
        text_lower = text.lower()
    
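        for term in terms:
            # Match whole words only, so that "he" does not fire on "the" or "other".
            if re.search(rf"\b{re.escape(term)}\b", text_lower):
                found_terms.append(term)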
    
        return found_terms
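
Term lists like the one above are sensitive to how a match is detected: short terms such as "he" or "her" occur inside many unrelated words. The sketch below contrasts substring matching with whole-word matching. `find_terms` is a hypothetical standalone helper and the sample sentence is made up; it is meant only to show the difference, not to replace the validator.

```python
import re

def find_terms(text: str, terms: list[str], whole_words: bool = True) -> list[str]:
    """Hypothetical helper: return the terms that occur in `text`."""
    text_lower = text.lower()
    if whole_words:
        # \b anchors the term at word boundaries on both sides.
        return [t for t in terms if re.search(rf"\b{re.escape(t)}\b", text_lower)]
    # Plain substring test: also matches inside longer words.
    return [t for t in terms if t in text_lower]

sample = "The team helps other departments."
terms = ["he", "she", "her", "man"]

substring_hits = find_terms(sample, terms, whole_words=False)
whole_word_hits = find_terms(sample, terms, whole_words=True)
print(substring_hits)   # hits caused by "the" and "other"
print(whole_word_hits)  # no gendered term appears as a word
```

Whole-word matching still cannot resolve genuine ambiguity (for example Dutch "zijn", which is both a possessive and the verb "to be"), so flagged terms are best treated as candidates for manual review.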

## 2. Experimental Design

### Sample Composition
Following the thesis methodology (Section 6.2):
- **Total listings**: 40 job listings (20 Dutch, 20 English)
- **Generation split**: 50% human-written, 50% LLM-generated per language
- **Level distribution**: 5 entry-level, 5 mid/senior-level positions per generation type
- **Company representation**: Companies A, C, D (Dutch); Companies B, E (English)
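
The composition above can be checked with a few lines of arithmetic. This is only a sanity-check sketch: the cell labels are paraphrased from the list, and `PER_CELL` is a name introduced here.

```python
from itertools import product

languages = ["dutch", "english"]
sources = ["human-written", "LLM-generated"]
levels = ["entry", "mid/senior"]
PER_CELL = 5  # listings per language x source x level combination

# One entry per design cell, each holding the number of listings in that cell.
design = {cell: PER_CELL for cell in product(languages, sources, levels)}

total = sum(design.values())
per_language = {
    lang: sum(n for (cell_lang, _, _), n in design.items() if cell_lang == lang)
    for lang in languages
}
print(total, per_language)
```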

### Bias Mitigation Strategy
The framework addresses **self-enhancement bias** concerns through:
- Multi-model evaluation approach (GPT-4o vs. Claude-3.5)
- Structured prompt engineering with explicit evaluation criteria
- Temperature fixed at 0.0 to reduce run-to-run variation in scores
- Blind evaluation (models unaware of generation source)

In [180]:


class IndividualMetricEvaluator:    
    def __init__(self, model: str = "gpt-4o", temperature: float = 0, language: str = "dutch"):
        self.model = model
        self.temperature = temperature
        self.language = language.lower()
        
        if self.language not in ["dutch", "english"]:
            raise ValueError("Language must be either 'dutch' or 'english'")
        
        if model.startswith("claude"):
            self.llm = ChatAnthropic(
                model_name=model, 
                temperature=temperature,
            )
        else:  
            self.llm = ChatOpenAI(model_name=model, temperature=temperature)
    
    def _evaluate_language_level(self, listing: str, applicant_profile: str = "") -> int:
        """Evaluate if language level is appropriate for potential applicants"""
        if self.language == "dutch":
            prompt = f"""
Je bent een expert in taalgebruik en recruitmentcommunicatie.

Evalueer deze vacaturetekst op taalniveau en geschiktheid voor de doelgroep.

VACATURETEKST:
{listing}

DOELGROEP PROFIEL: {applicant_profile}

EVALUATIECRITERIUM: Het taalgebruik in de vacaturetekst is passend voor de beoogde doelgroep.

Beoordeel specifiek:
- Is de woordkeuze geschikt voor het opleidingsniveau van kandidaten?
- Is de complexiteit van zinnen passend?
- Wordt vakjargon op het juiste niveau gebruikt?
- Is de tekst toegankelijk voor de beoogde kandidaten?

Geef een score volgens de schaal:
1 = Helemaal mee oneens (taal helemaal niet passend)
2 = Oneens (taal grotendeels ongeschikt)
3 = Voldoende (taal redelijk geschikt)
4 = Eens (taal goed geschikt)
5 = Helemaal mee eens (taal perfect geschikt)

Antwoord alleen met het cijfer (1, 2, 3, 4, of 5):"""
        else:
            prompt = f"""
You are an expert in language use and recruitment communication.

Evaluate this job listing on language level and suitability for the target audience.

JOB LISTING:
{listing}

TARGET AUDIENCE PROFILE: {applicant_profile}

EVALUATION CRITERION: The language used in the job listing is appropriate for the potential applicants.

Assess specifically:
- Is the vocabulary suitable for the education level of candidates?
- Is the sentence complexity appropriate?
- Is professional jargon used at the right level?
- Is the text accessible to the intended candidates?

Give a score according to the scale:
1 = Strongly disagree (language completely inappropriate)
2 = Somewhat disagree (language largely inappropriate)
3 = Satisfactory (language reasonably appropriate)
4 = Somewhat agree (language well-suited)
5 = Strongly agree (language perfectly appropriate)

Answer only with the number (1, 2, 3, 4, or 5):"""
        
        response = self.llm.invoke(prompt)
        return self._extract_score(response.content)
    
    def _evaluate_syntax_grammar(self, listing: str, syntax_rules: str = "") -> int:
        """Evaluate adherence to grammar and syntax conventions"""
        if self.language == "dutch":
            prompt = f"""
Je bent een taalexpert gespecialiseerd in zakelijke communicatie en recruitment.

Evalueer deze vacaturetekst op grammatica en zinsbouwregels.

VACATURETEKST:
{listing}

GRAMMATICA- EN ZINSBOUWREGELS VAN HET RECRUITMENTBEDRIJF:
{syntax_rules}

EVALUATIECRITERIUM: De vacaturetekst volgt de grammatica- en zinsbouwregels zoals vastgesteld door het recruitmentbedrijf.

Beoordeel met focus op:
- Correcte grammatica volgens Nederlandse taalregels
- Naleving van bedrijfsspecifieke stijlregels
- Juiste zinsbouw en interpunctie
- Consistentie in taalgebruik

Let op: Beoordeel realistisch - kleine onvolkomenheden hoeven niet tot een lage score te leiden als de communicatie helder blijft.

Score volgens schaal:
1 = Helemaal mee oneens (veel grammaticafouten)
2 = Oneens (enkele fouten maar nog leesbaar)
3 = Voldoende (acceptabel niveau, kleine onvolkomenheden)
4 = Eens (goed geschreven)
5 = Helemaal mee eens (uitstekende grammatica en zinsbouw)

Score: """
        else:
            prompt = f"""
You are a language expert specialized in business communication and recruitment.

Evaluate this job listing on grammar and syntax conventions.

JOB LISTING:
{listing}

GRAMMAR AND SYNTAX RULES FROM THE RECRUITMENT COMPANY:
{syntax_rules}

EVALUATION CRITERION: The job listing adheres to the grammar and syntax conventions set by the recruitment company.

Assess with focus on:
- Correct grammar according to English language rules
- Compliance with company-specific style rules
- Proper sentence structure and punctuation
- Consistency in language use

Note: Assess realistically - minor imperfections should not lead to a low score if communication remains clear.

Score according to scale:
1 = Strongly disagree (many grammar errors)
2 = Somewhat disagree (some errors but still readable)
3 = Satisfactory (acceptable level, minor imperfections)
4 = Somewhat agree (well-written)
5 = Strongly agree (excellent grammar and syntax)

Score: """
        
        response = self.llm.invoke(prompt)
        return self._extract_score(response.content)
    
    def _evaluate_motivating_text(self, listing: str) -> int:
        """Evaluate effectiveness in motivating candidates to apply"""
        if self.language == "dutch":
            prompt = f"""
Je bent een recruitment specialist met expertise in kandidaatmotivatie.

Evalueer deze vacaturetekst op motiverende werking.

VACATURETEKST:
{listing}

EVALUATIECRITERIUM: De vacaturetekst motiveert potentiële kandidaten op effectieve wijze om te solliciteren op deze functie.

Beoordeel specifiek:
- Wordt de functie aantrekkelijk gepresenteerd?
- Creëert de tekst enthousiasme voor de positie?
- Worden voordelen en kansen duidelijk gecommuniceerd?

Score volgens schaal:
1 = Helemaal mee oneens (niet motiverend)
2 = Oneens (beperkt motiverend)
3 = Voldoende (redelijk motiverend)
4 = Eens (goed motiverend)
5 = Helemaal mee eens (zeer effectief motiverend)

Score: """
        else:
            prompt = f"""
You are a recruitment specialist with expertise in candidate motivation.

Evaluate this job listing on motivational effectiveness.

JOB LISTING:
{listing}

EVALUATION CRITERION: The job listing effectively motivates potential candidates to apply for this position.

Assess specifically:
- Is the position presented attractively?
- Does the text create enthusiasm for the position?
- Are benefits and opportunities clearly communicated?

Score according to scale:
1 = Strongly disagree (not motivating)
2 = Somewhat disagree (limited motivation)
3 = Satisfactory (reasonably motivating)
4 = Somewhat agree (well-motivating)
5 = Strongly agree (very effectively motivating)

Score: """
        
        response = self.llm.invoke(prompt)
        return self._extract_score(response.content)
    
    def _evaluate_creativity_level(self, listing: str) -> int:
        """Evaluate appropriate level of creativity and originality"""
        if self.language == "dutch":
            prompt = f"""
Je bent een expert in creatieve communicatie binnen de recruitment sector.

Evalueer deze vacaturetekst op creativiteitsniveau en originaliteit.

VACATURETEKST:
{listing}

EVALUATIECRITERIUM: De vacaturetekst toont een passend niveau van creativiteit en originaliteit in de presentatie.

Beoordeel met recruitment context in gedachten:
- Is de presentatie origineel en onderscheidend?
- Past het creativiteitsniveau bij het type functie?
- Maakt creativiteit de tekst aantrekkelijker zonder onduidelijkheid?

Let op: Vacatureteksten hoeven niet extreem creatief te zijn - helderheid en functionaliteit zijn belangrijker dan pure originaliteit.

Score volgens schaal:
1 = Helemaal mee oneens (zeer droog, geen creativiteit)
2 = Oneens (overwegend standaard)
3 = Voldoende (evenwichtige mix van helderheid en aantrekkelijkheid)
4 = Eens (goed gebruik van creatieve elementen)
5 = Helemaal mee eens (uitstekend creatief binnen professionele context)

Score: """
        else:
            prompt = f"""
You are an expert in creative communication within the recruitment sector.

Evaluate this job listing on creativity level and originality.

JOB LISTING:
{listing}

EVALUATION CRITERION: The job listing demonstrates an appropriate level of creativity and originality in its presentation.

Assess with recruitment context in mind:
- Is the presentation original and distinctive?
- Does the creativity level match the type of position?
- Does creativity make the text more attractive without causing confusion?

Note: Job listings don't need to be extremely creative - clarity and functionality are more important than pure originality.

Score according to scale:
1 = Strongly disagree (very dry, no creativity)
2 = Somewhat disagree (predominantly standard)
3 = Satisfactory (balanced mix of clarity and attractiveness)
4 = Somewhat agree (good use of creative elements)
5 = Strongly agree (excellently creative within professional context)

Score: """
        
        response = self.llm.invoke(prompt)
        return self._extract_score(response.content)
    
    def _evaluate_content_rules(self, listing: str, content_rules: str = "") -> int:
        """Evaluate inclusion of all content rules set by recruitment company"""
        if self.language == "dutch":
            prompt = f"""
Je bent een compliance expert voor recruitment communicatie.

Evalueer deze vacaturetekst op naleving van inhoudsregels.

VACATURETEKST:
{listing}

INHOUDSREGELS VAN HET RECRUITMENTBEDRIJF:
{content_rules}

EVALUATIECRITERIUM: De vacaturetekst voldoet aan alle inhoudsregels die zijn opgesteld door het recruitmentbedrijf.

Controleer systematisch:
- Zijn alle verplichte inhoudselementen aanwezig?
- Is alle voorgeschreven informatie correct geïntegreerd?

Beoordeel strikt op compleetheid van de regelnaleving.

Score volgens schaal:
1 = Helemaal mee oneens (geen van de regels gevolgd)
2 = Oneens (enkele regels maar veel ontbreekt)
3 = Voldoende (meeste regels gevolgd, kleine tekortkomingen)
4 = Eens (bijna alle regels correct gevolgd)
5 = Helemaal mee eens (alle inhoudsregels perfect nageleefd)

Score: """
        else:
            prompt = f"""
You are a compliance expert for recruitment communication.

Evaluate this job listing on adherence to content rules.

JOB LISTING:
{listing}

CONTENT RULES FROM THE RECRUITMENT COMPANY:
{content_rules}

EVALUATION CRITERION: The job listing includes all content rules set by the recruitment company.

Check systematically:
- Are all mandatory content elements present?
- Is all prescribed information correctly incorporated?

Assess strictly on completeness of rule compliance.

Score according to scale:
1 = Strongly disagree (none of the rules followed)
2 = Somewhat disagree (some rules but much is missing)
3 = Satisfactory (most rules followed, minor shortcomings)
4 = Somewhat agree (almost all rules correctly followed)
5 = Strongly agree (all content rules perfectly adhered to)

Score: """
        
        response = self.llm.invoke(prompt)
        return self._extract_score(response.content)
    
    def _evaluate_tone_of_voice(self, listing: str, tone_of_voice: str = "") -> int:
        """Evaluate consistent adherence to specified tone of voice"""
        if self.language == "dutch":
            prompt = f"""
Je bent een brand communication specialist.

Evalueer deze vacaturetekst op tone of voice consistentie.

VACATURETEKST:
{listing}

VOORGESCHREVEN TONE OF VOICE VAN HET RECRUITMENTBEDRIJF:
{tone_of_voice}

EVALUATIECRITERIUM: De vacaturetekst houdt consequent de voorgeschreven tone of voice aan, passend bij de beoogde communicatiestijl.

Beoordeel specifiek:
- Is de tone of voice consistent door de hele tekst?
- Zijn er afwijkingen van de voorgeschreven stijl?

Score volgens schaal:
1 = Helemaal mee oneens (houdt zich niet aan tone of voice)
2 = Oneens (beperkte aansluiting)
3 = Voldoende (redelijke aansluiting)
4 = Eens (goede aansluiting)
5 = Helemaal mee eens (perfecte aansluiting bij tone of voice)

Score: """
        else:
            prompt = f"""
You are a brand communication specialist.

Evaluate this job listing on tone of voice consistency.

JOB LISTING:
{listing}

PRESCRIBED TONE OF VOICE FROM THE RECRUITMENT COMPANY:
{tone_of_voice}

EVALUATION CRITERION: The job listing consistently adheres to the specified tone of voice, aligning with the intended communicative style.

Assess specifically:
- Is the tone of voice consistent throughout the text?
- Are there deviations from the prescribed style?

Score according to scale:
1 = Strongly disagree (does not adhere to tone of voice)
2 = Somewhat disagree (limited alignment)
3 = Satisfactory (reasonable alignment)
4 = Somewhat agree (good alignment)
5 = Strongly agree (perfect alignment with tone of voice)

Score: """
        
        response = self.llm.invoke(prompt)
        return self._extract_score(response.content)
    
    def _evaluate_applicant_profile(self, listing: str, applicant_profile: str = "") -> int:
        """Evaluate reflection of competencies for target candidate"""
        if self.language == "dutch":
            prompt = f"""
Je bent een recruitment consultant gespecialiseerd in candidate profiling.

Evalueer deze vacaturetekst op reflectie van competenties voor de doelkandidaat.

VACATURETEKST:
{listing}

PROFIEL VAN DE BEOOGDE KANDIDAAT:
{applicant_profile}

EVALUATIECRITERIUM: De vacaturetekst weerspiegelt de competenties voor de beoogde kandidaat.

Beoordeel specifiek:
- Worden de vereiste competenties duidelijk weergegeven?
- Worden relevante verantwoordelijkheden genoemd?

Score volgens schaal:
1 = Helemaal mee oneens (competenties helemaal niet weerspiegeld)
2 = Oneens (beperkte weerspiegeling)
3 = Voldoende (redelijke weerspiegeling)
4 = Eens (goede weerspiegeling)
5 = Helemaal mee eens (perfecte weerspiegeling van competenties)

Score: """
        else:
            prompt = f"""
You are a recruitment consultant specialized in candidate profiling.

Evaluate this job listing on reflection of competencies for the target candidate.

JOB LISTING:
{listing}

PROFILE OF THE TARGET CANDIDATE:
{applicant_profile}

EVALUATION CRITERION: The job listing reflects the competencies for the target candidate.

Assess specifically:
- Are the required competencies clearly represented?
- Are relevant responsibilities mentioned?

Score according to scale:
1 = Strongly disagree (competencies not reflected at all)
2 = Somewhat disagree (limited reflection)
3 = Satisfactory (reasonable reflection)
4 = Somewhat agree (good reflection)
5 = Strongly agree (perfect reflection of competencies)

Score: """
        
        response = self.llm.invoke(prompt)
        return self._extract_score(response.content)
    
    def _evaluate_company_information(self, listing: str) -> int:
        """Evaluate provision of contextual, relevant company information"""
        if self.language == "dutch":
            prompt = f"""
Je bent een employer branding specialist.

Evalueer deze vacaturetekst op bedrijfsinformatie.

VACATURETEKST:
{listing}

EVALUATIECRITERIUM: De vacaturetekst bevat contextuele en relevante informatie over het bedrijf.

Beoordeel specifiek:
- Is er voldoende informatie over het bedrijf aanwezig?
- Is de bedrijfsinformatie relevant?
- Geeft de tekst een goed beeld van de organisatie?

Score volgens schaal:
1 = Helemaal mee oneens (geen relevante bedrijfsinformatie)
2 = Oneens (zeer beperkte informatie)
3 = Voldoende (basis bedrijfsinformatie aanwezig)
4 = Eens (goede contextuele informatie)
5 = Helemaal mee eens (uitstekende, relevante bedrijfsinformatie)

Score: """
        else:
            prompt = f"""
You are an employer branding specialist.

Evaluate this job listing on company information.

JOB LISTING:
{listing}

EVALUATION CRITERION: The job listing provides contextual, relevant information about the company.

Assess specifically:
- Is there sufficient information about the company present?
- Is the company information relevant?
- Does the text provide a good picture of the organization?

Score according to scale:
1 = Strongly disagree (no relevant company information)
2 = Somewhat disagree (very limited information)
3 = Satisfactory (basic company information present)
4 = Somewhat agree (good contextual information)
5 = Strongly agree (excellent, relevant company information)

Score: """
        
        # Send the prompt built above; `response` does not exist before this call.
        response = self.llm.invoke(prompt)
        return self._extract_score(response.content)

    def _extract_score(self, response_text: str) -> int:
        """Extract numerical score from response"""
        patterns = [
            r"score:\s*(\d+)",
            r"(\d+)\s*/\s*5",
            r"^(\d+)$",
            r"(\d+)\s*$",
            r"(\d+)\s*=",
            r"=\s*(\d+)",
        ]
        
        response_lower = response_text.lower().strip()
        
        for pattern in patterns:
            match = re.search(pattern, response_lower, re.MULTILINE)
            if match:
                try:
                    score = int(match.group(1))
                    if 1 <= score <= 5:
                        return score
                except (ValueError, IndexError):
                    continue
        
        all_numbers = re.findall(r'\b([1-5])\b', response_text)
        if all_numbers:
            return int(all_numbers[-1])
        print("Warning: no score could be parsed from the response; defaulting to 3")
        return 3
    
    def evaluate_all_metrics(self, listing: str, syntax_rules: str = "", 
                           content_rules: str = "", tone_of_voice: str = "", 
                           applicant_profile: str = "") -> dict:        
        results = {}
        

        # CLARITY
        results['language_level_score'] = self._evaluate_language_level(
            listing, applicant_profile)
        
        results['syntax_grammar_score'] = self._evaluate_syntax_grammar(
            listing, syntax_rules)
        
        # RELEVANCE
        results['motivating_text_score'] = self._evaluate_motivating_text(listing)
        
        results['creativity_level_score'] = self._evaluate_creativity_level(listing)
        
        # CORRECTNESS
        results['content_rules_score'] = self._evaluate_content_rules(
            listing, content_rules)
        
        results['tone_of_voice_score'] = self._evaluate_tone_of_voice(
            listing, tone_of_voice)
        
        # COMPLETENESS
        results['applicant_profile_score'] = self._evaluate_applicant_profile(
            listing, applicant_profile)
        
        results['company_information_score'] = self._evaluate_company_information(listing)
        
        # Add model_info
        results['evaluation_model'] = self.model
            
        return results

def process_dataset_research_framework(df, model: str = "gpt-4o", num_rows=None, batch_size=10, language="dutch"):    
    evaluator = IndividualMetricEvaluator(model=model, language=language)
    results = []
    
    test_df = df.head(num_rows) if num_rows else df
    print(f"Processing {len(test_df)} job listings with research framework metrics using {model}...")
    
    for i in range(0, len(test_df), batch_size):
        batch = test_df.iloc[i:i+batch_size]
        print(f"Batch {i//batch_size + 1}/{(len(test_df)-1)//batch_size + 1}")
        
        for idx, row in batch.iterrows():
            unique_id_str = str(row['unique_id'])
            third_char = unique_id_str[2] if len(unique_id_str) > 2 else '0'
            
            if third_char == '0':
                listing_text = row['job_listing']
                listing_type = "human-written"
            else:
                listing_text = row['generated_listing']
                listing_type = "AI-generated"
            
            if pd.isna(listing_text) or not str(listing_text).strip():
                print(f"Empty listing for {row['unique_id']}")
                continue
            
            scores = evaluator.evaluate_all_metrics(
                listing=str(listing_text),
                syntax_rules=str(row['additional_syntax_rules']) if pd.notna(row['additional_syntax_rules']) else "",
                content_rules=str(row['additional_content_rules']) if pd.notna(row['additional_content_rules']) else "",
                tone_of_voice=str(row['tone_of_voice']) if pd.notna(row['tone_of_voice']) else "",
                applicant_profile=str(row['ideal_candidate_traits']) if pd.notna(row['ideal_candidate_traits']) else ""
            )
            
            scores['unique_id'] = row['unique_id']
            scores['listing_type'] = listing_type
            results.append(scores)
            
            print(f"Processed {listing_type} - ID: {row['unique_id']} with {model}")
                

    
    results_df = pd.DataFrame(results)
    final_df = df.merge(results_df, on='unique_id', how='left')
    
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    model_name = model.replace(":", "_").replace("/", "_")
    filename = f'job_listings_research_framework_{model_name}_{timestamp}.csv'
    final_df.to_csv(filename, index=False)
    
    print(f"Complete! Saved to: {filename}")
    print(f"Total evaluations: {len(results_df)}")
    

    for metric in ['language_level_score', 'syntax_grammar_score', 'motivating_text_score', 
                   'creativity_level_score', 'content_rules_score', 'tone_of_voice_score',
                   'applicant_profile_score', 'company_information_score']:
        if metric in results_df.columns:
            mean_score = results_df[metric].mean()
            print(f"Mean {metric}: {mean_score:.2f}")
    
    return final_df




# Dutch evaluations
# results_claude = process_dataset_research_framework(nl_check, model="claude-3-5-sonnet-20240620", num_rows=50)
# results_gpt = process_dataset_research_framework(nl_check, model="gpt-4o-2024-05-13", num_rows=50)


# English evaluations
# results_claude_en = process_dataset_research_framework(en_check, model="claude-3-5-sonnet-20240620", num_rows=20, language="english")
# results_gpt_en = process_dataset_research_framework(en_check, model="gpt-4o-2024-05-13", num_rows=20, language="english")
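
The fallback to a score of 3 in `_extract_score` means that an unparseable reply is indistinguishable from a genuine mid-scale rating. One alternative, sketched here under the assumption that missing values are acceptable downstream, is to return `None` so that failures can be counted or retried. `parse_likert` is a hypothetical standalone function, not part of the class above, and the example replies are made up.

```python
import re
from typing import Optional

def parse_likert(response_text: str) -> Optional[int]:
    """Return a 1-5 score from a model reply, or None if nothing usable is found."""
    text = response_text.strip().lower()
    # Prefer explicit forms: "score: N", "N/5", or a line holding only the digit.
    explicit_patterns = (
        r"score:\s*([1-5])\b",
        r"\b([1-5])\s*/\s*5\b",
        r"^([1-5])$",
    )
    for pattern in explicit_patterns:
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            return int(match.group(1))
    # Otherwise accept the reply only if it mentions exactly one distinct 1-5 digit.
    candidates = set(re.findall(r"\b([1-5])\b", text))
    if len(candidates) == 1:
        return int(candidates.pop())
    return None  # ambiguous or empty: let the caller decide what to do

for reply in ["4", "Score: 2 because several rules are missing",
              "Somewhere between 3 and 4", "I cannot rate this text"]:
    print(repr(reply), "->", parse_likert(reply))
```

Rows with a `None` score can then be excluded from agreement statistics or re-queried, rather than pulling the results toward the scale midpoint.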



## 3. Formatting Adjustment for Comparison with Human Raters

In [110]:
df_dutch = pd.read_excel('/Users/Mick/Master_Thesis/3. evaluation_pipeline/Dutch_surveys_ICR_2.xlsx')

In [130]:

score_columns = ['language_level_score', 'syntax_grammar_score', 'motivating_text_score',
                'creativity_level_score', 'content_rules_score', 'tone_of_voice_score',
                'applicant_profile_score', 'company_information_score']


new_df = []
for idx, row in results_gpt.iterrows():
    for response_id, score_col in enumerate(score_columns, 1):
        new_df.append({
            'ResponseId': response_id,
            'unique_id': row['unique_id'],
            'LLM_judge_openai': row[score_col]
        })

df_results_long = pd.DataFrame(new_df)
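
The row-by-row loop above reshapes the wide results (one row per listing) into a long format (one row per listing and metric). The same reshape can be sketched with `DataFrame.melt`. The toy frame, the shortened `score_columns` list, the `unique_id` values, and the `to_long` helper are illustrative assumptions; only the column names come from this notebook.

```python
import pandas as pd

score_columns = ["language_level_score", "syntax_grammar_score"]  # shortened for the example

# Toy stand-in for the wide results frame (one row per listing).
wide = pd.DataFrame({
    "unique_id": ["NL01", "NL02"],
    "language_level_score": [4, 3],
    "syntax_grammar_score": [5, 2],
})

def to_long(wide_df: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Hypothetical helper: one row per (listing, metric), listings kept in input order."""
    tmp = wide_df.reset_index(drop=True)
    tmp["_pos"] = range(len(tmp))  # remember the original listing order
    long_df = tmp.melt(id_vars=["_pos", "unique_id"], value_vars=score_columns,
                       var_name="metric", value_name=value_name)
    # Number the metrics 1..n in the order of score_columns, like enumerate(..., 1).
    long_df["ResponseId"] = long_df["metric"].map(
        {col: i for i, col in enumerate(score_columns, 1)})
    # melt stacks column by column; re-sort so each listing's metrics are contiguous.
    long_df = long_df.sort_values(["_pos", "ResponseId"]).reset_index(drop=True)
    return long_df[["ResponseId", "unique_id", value_name]]

long_scores = to_long(wide, "LLM_judge_openai")
print(long_scores)
```

Keeping the reshape in one function would also avoid repeating the loop for each model and language.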


In [131]:
df_dutch['LLM_judge_openai'] = df_results_long.LLM_judge_openai
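
The assignment above aligns the two frames on their index, so it can silently produce `NaN` or misplaced scores if the survey sheet and the long-format results differ in length or row order. A minimal guard, assuming both frames are meant to share the same row order, might look like this. `attach_scores` is a hypothetical helper, and the toy frames merely stand in for `df_dutch` and `df_results_long`.

```python
import pandas as pd

def attach_scores(survey_df: pd.DataFrame, long_df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Copy `column` from long_df into survey_df by row position after a length check."""
    if len(survey_df) != len(long_df):
        raise ValueError(
            f"Row count mismatch: survey has {len(survey_df)}, results have {len(long_df)}")
    out = survey_df.copy()
    # .to_numpy() drops the index, so rows are matched purely by position.
    out[column] = long_df[column].to_numpy()
    return out

# Toy data: the survey frame deliberately has a non-default index.
survey = pd.DataFrame({"Generated": [0, 0, 1, 1]}, index=[10, 11, 12, 13])
results = pd.DataFrame({"LLM_judge_openai": [3, 4, 4, 5]})

merged = attach_scores(survey, results, "LLM_judge_openai")
print(merged)
```

If both frames carry a shared key for listing and metric, merging on that key would be more robust than relying on row order.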


In [None]:
new_df = []
for idx, row in results_claude.iterrows():
    for response_id, score_col in enumerate(score_columns, 1):
        new_df.append({
            'ResponseId': response_id,
            'unique_id': row['unique_id'],
            'LLM_judge_claude': row[score_col]
        })

df_results_long = pd.DataFrame(new_df)


In [126]:
df_dutch['LLM_judge_claude'] = df_results_long.LLM_judge_claude

In [145]:
df_dutch.groupby("Generated")[['LLM_judge_openai',"LLM_judge_claude"]].mean()

| Generated | LLM_judge_openai | LLM_judge_claude |
|---|---|---|
| 0 | 3.3 | 3.3875 |
| 1 | 3.6 | 3.7625 |


In [133]:
df_dutch.to_excel('IRR_testing_6.xlsx')

In [137]:
df_english = pd.read_excel('/Users/Mick/Master_Thesis/3. evaluation_pipeline/English_surveys_ICR_2.xlsx')

In [138]:
new_df = []
for idx, row in results_gpt_en.iterrows():
    for response_id, score_col in enumerate(score_columns, 1):
        new_df.append({
            'ResponseId': response_id,
            'unique_id': row['unique_id'],
            'LLM_judge_openai': row[score_col]
        })

df_results_long = pd.DataFrame(new_df)

df_english['LLM_judge_openai'] = df_results_long.LLM_judge_openai


In [140]:
new_df = []
for idx, row in results_claude_en.iterrows():
    for response_id, score_col in enumerate(score_columns, 1):
        new_df.append({
            'ResponseId': response_id,
            'unique_id': row['unique_id'],
            'LLM_judge_claude': row[score_col]
        })

df_results_long = pd.DataFrame(new_df)
df_english['LLM_judge_claude'] = df_results_long.LLM_judge_claude

In [143]:
df_english.groupby("Generated")[['LLM_judge_openai',"LLM_judge_claude"]].mean()

| Generated | LLM_judge_openai | LLM_judge_claude |
|---|---|---|
| 0 | 3.125 | 3.1125 |
| 1 | 3.675 | 3.7 |


In [144]:
df_english.to_excel('IRR_testing_en_2.xlsx')
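
With LLM scores stored next to the human ratings, agreement between two raters can be quantified. This notebook does not show which statistic is used downstream, so the sketch below swaps in quadratic-weighted Cohen's kappa, one common choice for ordinal 1-5 ratings, purely as an illustration. The function name and the rating vectors are made up and are not thesis data.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, min_rating=1, max_rating=5) -> float:
    """Cohen's kappa with quadratic weights for two raters on an ordinal scale."""
    a = np.asarray(a, dtype=int) - min_rating
    b = np.asarray(b, dtype=int) - min_rating
    k = max_rating - min_rating + 1

    # Observed contingency table of rating pairs, as proportions.
    observed = np.zeros((k, k))
    for i, j in zip(a, b):
        observed[i, j] += 1
    observed /= observed.sum()

    # Expected table if the two raters were independent given their marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))

    # Quadratic disagreement weights: 0 on the diagonal, 1 at maximum distance.
    idx = np.arange(k)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (k - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

human = [4, 3, 5, 2, 4, 3]  # made-up ratings for illustration
llm = [4, 3, 4, 2, 5, 3]
kappa = quadratic_weighted_kappa(human, llm)
print(round(kappa, 3))
```

Under the widely used Landis and Koch convention, kappa values between 0.61 and 0.80 are labelled "substantial" agreement. The function returns `nan` when both raters use a single identical value throughout, because the expected disagreement is then zero.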


## 4. Implications

### Framework Validation
This testing validates whether **LLM-based evaluation can replicate expert judgment patterns** while providing:
- **Scalable quality assessment** for recruitment applications
- **Consistent evaluation standards** across different contexts
- **Reduced human evaluation burden** while maintaining quality standards
- **Empirical foundation** for automated recruitment content evaluation

### Future Research Directions
Results inform:
- **Metric prompt refinement** based on agreement patterns
- **Sample size planning** for full-scale validation studies  
- **Language-specific adaptations** for international recruitment contexts
- **Integration strategies** for real-world recruitment workflows