# LLM-Based Evaluation Pipeline

This notebook implements an LLM-based evaluation pipeline for automatically assessing the quality of job listings.

## Key Components
1. **Evaluation Metrics**
   - Implements metrics derived from expert interviews and the literature
   - Provides quantitative scores across multiple quality dimensions, using several evaluation techniques

2. **Analysis Reporting**
   - Tracks efficiency through step counting
   - Documents pipeline performance across evaluation iterations
   - Generates detailed breakdowns of metric-specific assessments

3. **Experiments**
   - Measures the performance of each metric
   - Refines the metrics until they reach high agreement with human evaluations
   - Revises prompts so that the generated job listings are consistent
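
Agreement between an LLM judge and human evaluators can be quantified in several ways. The following is a minimal sketch, assuming binary labels (1 = clear, 0 = unclear). The function names and the example labels are hypothetical and are not part of the pipeline itself.

```python
from collections import Counter


def percent_agreement(human, judge):
    """Fraction of items on which the two label lists agree."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    p_observed = percent_agreement(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability that both raters pick the same label independently.
    p_expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human) | set(judge)
    )
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels for ten listings (1 = clear, 0 = unclear).
human_labels = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
judge_labels = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]

print("agreement:", percent_agreement(human_labels, judge_labels))
print("kappa:", round(cohens_kappa(human_labels, judge_labels), 3))
```

Raw agreement is easy to read but can look high when one label dominates, which is why a chance-corrected statistic such as kappa is often reported alongside it.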


In [18]:
import numpy as np
import pandas as pd
import os
from datetime import datetime


import dspy
import openai

import phoenix as px
from phoenix.evals import llm_classify  # used by evaluate_clarity below
from phoenix.experiments import run_experiment, evaluate_experiment
from phoenix.experiments.types import Example



## 1. Evaluation Metrics
Example quality metrics:
1. Clarity


In [3]:
CLARITY_LLM_JUDGE_PROMPT = """
In this task, you will be presented with a job listing. Your objective is to evaluate the clarity 
of the job listing. A "clear" job listing is one that is well-structured, concise, easy to read, 
and directly communicates the necessary information without ambiguity or unnecessary complexity. 
An "unclear" job listing is one that deviates from the specified job, is vague, disorganized, overly complex, or difficult to understand.

Your response should be a single word: either "clear" or "unclear," indicating whether the listing is easy to understand. Do not include any other text or characters in your answer.

After providing your response, you must write a detailed explanation justifying your reasoning. 
Avoid stating the final label at the beginning of your explanation. Your reasoning should focus on specific aspects of the job listing that affect clarity, such as grammar, organization, and conciseness.

[BEGIN DATA]
Input: {job_listing}
Answer: {response}
[END DATA]

EXPLANATION: Provide your reasoning step by step, evaluating aspects like structure, language, and readability.
LABEL: "clear" or "unclear"
"""

#### Think about section chunking for specific metrics
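
One possible approach, sketched under assumptions: split each listing into a title, a summary, and labeled sections so that a metric can be applied per section. The section markers (`Requirements`, `Responsibilities`) and the `chunk_listing` helper are hypothetical, chosen to fit listings shaped like "Title - summary. Requirements: ...".

```python
import re

# Hypothetical section markers; real listings would likely need a longer list.
SECTION_PATTERN = re.compile(r"\b(Requirements|Responsibilities)\b:?")


def chunk_listing(listing):
    """Split a listing into title, summary, and any labeled sections found."""
    title, separator, body = listing.partition(" - ")
    if not separator:
        # No "Title - body" structure: treat the whole text as the body.
        title, body = "", listing
    chunks = {"title": title.strip()}
    # With one capture group, split() returns [summary, marker, text, marker, text, ...].
    parts = SECTION_PATTERN.split(body)
    chunks["summary"] = parts[0].strip()
    for marker, text in zip(parts[1::2], parts[2::2]):
        chunks[marker.lower()] = text.strip()
    return chunks


example = (
    "Software Engineer - Join our team. "
    "Requirements: 3+ years of Python. "
    "Responsibilities include shipping features."
)
print(chunk_listing(example))
```

Listings without any marker fall back to a title and a summary only, so a per-section metric would need to handle missing sections.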

In [4]:
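def evaluate_clarity(output, input: dict) -> bool:
    # A failed task produces no output; score it as not clear.
    if output is None:
        return False
    # The task may return a dict with "final_output" or a plain string.
    final_output = output.get("final_output") if isinstance(output, dict) else output
    # Column names must match the template placeholders {job_listing} and {response}.
    df = pd.DataFrame({"job_listing": [input.get("job_listing")],
                       "response": [final_output]})
    response = llm_classify(
        data=df,
        template=CLARITY_LLM_JUDGE_PROMPT,
        rails=["clear", "unclear"],
        model=eval_model,  # judge model; must be defined before this cell runs
        provide_explanation=True
    )
    # llm_classify returns a DataFrame with one row per input row.
    return response["label"].iloc[0] == "clear"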

## 2. Analysis reporting

In [6]:
px.launch_app()

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


<phoenix.session.session.ThreadSession at 0x30d5ae8d0>

In [7]:
px_client = px.Client()

In [11]:
experiment_example_data = pd.DataFrame([
    {'job_listing': "Software Engineer - Join our dynamic team to build cutting-edge web applications. Requirements: 3+ years of experience with Python, JavaScript, and React. Responsibilities include developing new features, optimizing performance, and collaborating with cross-functional teams.",
     'rating': 1,
     'explanation': "This job listing clearly outlines responsibilities and requirements. The technical skills are specific (Python, JavaScript, React) and the experience level (3+ years) is clear. The listing would benefit from salary information and details about benefits or company culture."},
    {'job_listing': "Data Scientist - Looking for an experienced data scientist to analyze large datasets and develop machine learning models. Must have strong background in Python, R, statistics, and machine learning frameworks. PhD preferred.",
     'rating': 0,
     'explanation': "The listing provides a good description of required skills but lacks information about specific projects or problems the data scientist would work on. The PhD preference might unnecessarily limit the candidate pool. Missing details about team structure and career growth."},
    {'job_listing': "Marketing Manager - Drive our digital marketing strategy and manage a team of 5. Requirements: 5+ years of experience in digital marketing, proficiency in Google Analytics, and proven track record of successful campaigns.",
     'rating': 1,
     'explanation': "Excellent job listing with clear responsibilities (managing a team of 5) and specific skill requirements. The mention of digital marketing strategy and Google Analytics gives candidates a good understanding of the role. Could include more about KPIs and measurement of success."},
    {'job_listing': "UX Designer - Create intuitive user experiences for our mobile app. Requires expertise in Figma, user research, and prototyping. 2+ years of experience in a similar role required.",
     'rating': 0,
     'explanation': "This listing lacks specific details about the mobile app and industry. While it mentions required tools (Figma), it doesn't elaborate on the design process or team structure. The experience requirement is clear but overall the listing is too vague about day-to-day responsibilities."},
    {'job_listing': "DevOps Engineer - Seeking a skilled DevOps engineer to maintain and improve our cloud infrastructure. Experience with AWS, Docker, Kubernetes, and CI/CD pipelines required. Must be available for on-call rotations.",
     'rating': 0,
     'explanation': "Good technical requirements (AWS, Docker, Kubernetes) but the on-call requirement is mentioned without details about rotation frequency or compensation. The listing would be improved with information about team size and specific projects or infrastructure challenges."},
    {'job_listing': "Product Manager - Lead product development from conception to launch. Must have excellent communication skills, agile methodology experience, and technical background. MBA preferred.",
     'rating': 1,
     'explanation': "The listing provides a good overview of the role but uses generic phrases like 'excellent communication skills' without specifics. The MBA preference is mentioned without context. Would benefit from details about product types and industry-specific challenges."},
    {'job_listing': "Front-end Developer - Create responsive and accessible web interfaces using modern frameworks. Requirements: Proficiency in HTML, CSS, JavaScript, and experience with React or Vue.js.",
     'rating': 0,
     'explanation': "Clear technical requirements for front-end development but lacks information about the products or services being developed. The listing would be improved with details about the development team size and structure, as well as specific projects or challenges."},
    {'job_listing': "Customer Success Manager - Help our clients achieve their goals with our SaaS platform. Strong communication skills, customer service experience, and technical aptitude required.",
     'rating': 1,
     'explanation': "This job listing outlines the purpose of the role clearly but lacks specifics about the SaaS platform and customer base. The requirements are somewhat vague ('strong communication skills', 'technical aptitude') and would benefit from more concrete examples or metrics."},
    {'job_listing': "Machine Learning Engineer - Develop and deploy machine learning models at scale. Deep knowledge of neural networks, NLP, and computer vision required. PhD or equivalent experience preferred.",
     'rating': 0,
     'explanation': "The listing includes specific technical requirements but doesn't describe application domains or projects. The PhD preference might exclude qualified candidates with equivalent practical experience. Missing information about team structure and development processes."},
    {'job_listing': "Technical Writer - Create clear documentation for our APIs and software products. Experience with markup languages, technical writing tools, and ability to explain complex concepts simply.",
     'rating': 1,
     'explanation': "Very clear and specific job listing that outlines both the skills required (markup languages, technical writing tools) and the nature of the work (API documentation). The expectation to 'explain complex concepts simply' gives candidates a good understanding of the role's challenges."}
])

In [12]:
now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

dataset = px_client.upload_dataset(dataframe=experiment_example_data, 
                                   dataset_name=f"overall_experiment_inputs-{now}", 
                                   input_keys=["job_listing"], 
                                   output_keys=["rating", "explanation"])

📤 Uploading dataset...
💾 Examples uploaded: http://localhost:6006/datasets/RGF0YXNldDox/examples
🗄️ Dataset version ID: RGF0YXNldFZlcnNpb246MQ==


## 3. Experiments

In [None]:
def run_agent(messages):
    # `client`, `MODEL`, `tools`, and `handle_tool_calls` must be defined before this is called.
    print("Running agent with messages:", messages)
    if isinstance(messages, str):
        messages = [{"role": "user", "content": messages}]
        print("Converted string message to list format")

    # Add a system prompt if the caller did not supply one.
    if not any(
        isinstance(message, dict) and message.get("role") == "system" for message in messages
    ):
        system_prompt = {"role": "system", "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset."}
        messages.append(system_prompt)
        print("Added system prompt to messages")

    # Loop until the model answers without requesting further tool calls.
    while True:
        # Router call span
        print("Starting router")

        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=tools,
        )

        messages.append(response.choices[0].message.model_dump())
        tool_calls = response.choices[0].message.tool_calls
        print("Received response with tool calls:", bool(tool_calls))

        if tool_calls:
            print("Processing tool calls")
            messages = handle_tool_calls(tool_calls, messages)
        else:
            print("No tool calls, returning final response")
            return messages
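
Step counting for the analysis report could be derived from the message history that the agent loop returns. The sketch below assumes that each assistant message is a dict, as produced by `model_dump()`, with an optional `tool_calls` list. The `count_steps` helper and the sample history are hypothetical.

```python
def count_steps(messages):
    """Count router calls (assistant turns) and tool calls in a message history."""
    router_calls = 0
    tool_calls = 0
    for message in messages:
        # Only assistant messages correspond to a model (router) call.
        if not isinstance(message, dict) or message.get("role") != "assistant":
            continue
        router_calls += 1
        # `tool_calls` may be missing or None when the model answered directly.
        tool_calls += len(message.get("tool_calls") or [])
    return {"router_calls": router_calls, "tool_calls": tool_calls}


# Hypothetical history: one router call that requests two tools, then a final answer.
history = [
    {"role": "user", "content": "Evaluate this listing."},
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "assistant", "content": None, "tool_calls": [{"id": "a"}, {"id": "b"}]},
    {"role": "tool", "tool_call_id": "a", "content": "result a"},
    {"role": "tool", "tool_call_id": "b", "content": "result b"},
    {"role": "assistant", "content": "Final answer.", "tool_calls": None},
]
print(count_steps(history))
```

Fewer router and tool calls for the same final answer would indicate a more efficient run.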

In [16]:
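def run_agent_task(example: Example) -> str:
    # The dataset was uploaded with input_keys=["job_listing"], so read that key.
    job_listing = example.input.get("job_listing")
    print("Starting agent with messages:", job_listing)
    messages = [{"role": "user", "content": job_listing}]
    ret = run_agent(messages)
    # `process_messages` must be defined before this task runs.
    return process_messages(ret)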

In [19]:
experiment = run_experiment(dataset,
                            run_agent_task,
                            evaluators=[evaluate_clarity],
                            experiment_name="Testing Clarity Experiment",
                            experiment_description="Evaluating the Clarity")

🐌!! If running inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


🧪 Experiment started.
📺 View dataset experiments: http://localhost:6006/datasets/RGF0YXNldDox/experiments
🔗 View this experiment: http://localhost:6006/datasets/RGF0YXNldDox/compare?experimentId=RXhwZXJpbWVudDox


I0000 00:00:1744714654.835040 7601425 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers


running tasks |          | 0/10 (0.0%) | ⏳ 00:00<? | ?it/s

Starting agent with messages: Software Engineer - Join our dynamic team to build cutting-edge web applications. Requirements: 3+ years of experience with Python, JavaScript, and React. Responsibilities include developing new features, optimizing performance, and collaborating with cross-functional teams.
Traceback (most recent call last):
  File "/opt/anaconda3/envs/Thesis/lib/python3.12/site-packages/phoenix/experiments/functions.py", line 238, in sync_run_experiment
    _output = task(*bound_task_args.args, **bound_task_args.kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/folders/70/p7sg7nvs5zj0gb1rprkw31k40000gp/T/ipykernel_47009/3463184274.py", line 4, in run_agent_task
    ret = run_agent(messages)
          ^^^^^^^^^
NameError: name 'run_agent' is not defined

The above exception was the direct cause of the following exception:

RuntimeError: task failed for example id 'RGF0YXNldEV4YW1wbGU6MQ==', repetition 1
Starting agent with 



Starting agent with messages: Product Manager - Lead product development from conception to launch. Must have excellent communication skills, agile methodology experience, and technical background. MBA preferred.
Traceback (most recent call last):
  File "/opt/anaconda3/envs/Thesis/lib/python3.12/site-packages/phoenix/experiments/functions.py", line 238, in sync_run_experiment
    _output = task(*bound_task_args.args, **bound_task_args.kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/folders/70/p7sg7nvs5zj0gb1rprkw31k40000gp/T/ipykernel_47009/3463184274.py", line 4, in run_agent_task
    ret = run_agent(messages)
          ^^^^^^^^^
NameError: name 'run_agent' is not defined

The above exception was the direct cause of the following exception:

RuntimeError: task failed for example id 'RGF0YXNldEV4YW1wbGU6Ng==', repetition 1
Starting agent with messages: Front-end Developer - Create responsive and accessible web interfaces using modern 

running experiment evaluations |          | 0/10 (0.0%) | ⏳ 00:00<? | ?it/s


🔗 View this experiment: http://localhost:6006/datasets/RGF0YXNldDox/compare?experimentId=RXhwZXJpbWVudDox

Experiment Summary (04/15/25 12:57 PM +0200)
--------------------------------------------
| evaluator        |   n |   n_scores |   avg_score |   n_labels | top_2_labels   |
|:-----------------|----:|-----------:|------------:|-----------:|:---------------|
| evaluate_clarity |  10 |         10 |           0 |         10 | {'False': 10}  |

Tasks Summary (04/15/25 12:57 PM +0200)
---------------------------------------
|   n_examples |   n_runs |   n_errors | top_error                                    |
|-------------:|---------:|-----------:|:---------------------------------------------|
|           10 |       10 |         10 | NameError("name 'run_agent' is not defined") |
