# Application Evaluation with LLM
This notebook is to evaluate the application process of our community project.
Please note that the actual application selection process is done manually. We are using LLM for experimentation only.

We are going to evaluate the applicats based on:
  - Python Knowledge
  - Project Experience
  - Motivation and Eagerness
  - Availability
  - Teamwork Experience
  - Overall Suitability for the Program

### Define the Pydantic model
Eventhough this is a "simple" script, we'll use Pydantic to define a model for the data structure.
This ensures that the data is validated when we read the CSV file and can be easily manipulated later when we interact with LLMs.

Note the use of `Field` to map the CSV column to the model field. It is a neat feature that I really like in Pydantic.

In [1]:
from pydantic import BaseModel, Field


class Applicant(BaseModel):
    num: int
    name: str
    email: str
    location: str
    python_skills: str = Field(
        alias="How would you rate your Python programming skills?"
    )
    worked_on_python_projects: str = Field(
        alias="Have you worked on any Python projects?"
    )
    project_experience: str = Field(
        alias="Have you worked on any projects related to web development, AI, or mobile apps?"
    )
    completed_courses: str = Field(
        alias="Have you completed any coding courses or training programs?"
    )
    motivation: str = Field(
        alias="What motivates you to apply for the program and what do you hope to achieve from this program?"
    )
    career_goals: str = Field(
        alias="What are your long-term career goals, and how does our program align with your aspirations?"
    )
    hours_per_week: int = Field(
        alias="How many hours per week can you dedicate to our program?"
    )
    commitment: str = Field(
        alias="Are you willing to commit to the program for at least 6 months?"
    )
    additional_info: str = Field(
        alias="Any additional information you'd like to share about your application."
    )
    education_background: str = Field(alias="Education Background")

    class Config:
        populate_by_name = True

### Read and parse the CSV file with Pandas
First, we exported the applicants' information from the Form to a CSV file.
Then we read the CSV file with Pandas and parse the information into a list of Applicant objects.

In [2]:
import pandas as pd


def read_csv(file_path: str):
    df = pd.read_csv(file_path)
    applicants = []
    for row in df.to_dict(orient="records"):
        try:
            applicants.append(Applicant(**row))
        except Exception as e:
            print(f"Error processing row {row['num']}: {str(e)}")
    return applicants

In [3]:
file_path = "./data/applicants.csv"
applicants = read_csv(file_path)

applicants[0].model_dump()

{'num': 1,
 'name': 'Aung Kyaw',
 'email': 'aungkyaw@example.com',
 'location': 'Myanmar',
 'python_skills': 'Beginner (At least finished one course)',
 'worked_on_python_projects': 'I have never done a project. I have only learnt to do simple problem solvings with Python',
 'project_experience': 'NGO website for supporting stray dogs in Myanmar,\nProject Leader/Frontend Developer, Social media platform for educational content, Project Leader/Mobile Developer (Flutter)/Backend Developer (Laravel), Budget management application for an organization, Mobile Developer (Flutter)/Backend Developer (Laravel),\nJavaScript Games, Personal websites to practice JavaScript, HTML, CSS with games,\nRetail Store System, Java program for retail store management, IoT project for designing a biometric locker using Arduino',
 'completed_courses': 'Python 101 Beginner Course',
 'motivation': 'Although I know Python, I was never able to do projects and go back to the start is being in a cycle. \n- I want s

### Evaluate each applicant based on their answers
Before we evaluate the applicants with LLM, let's evaluate them based on their answers.
This is kind of similar to how we evaluate manually. The marks that we are giving here are arbitrary.
In reality, we read the sections like motivation, project experience, etc manually and evaluate based on those.
This however give us a good estimate of how well the applicant is doing.

In [4]:
def evaluate_applicant(applicant: Applicant) -> dict:
    scores = {
        "python_skills": rate_python_skills(applicant.python_skills),
        "project_experience": rate_project_experience(applicant.project_experience),
        "motivation": rate_motivation(applicant.motivation),
        "availability": rate_availability(applicant.hours_per_week),
        "commitment": rate_commitment(applicant.commitment),
        "location": rate_location(applicant.location),
    }
    overall_score = calculate_overall_score(scores)
    return {**scores, "overall_score": overall_score}


def calculate_overall_score(scores: dict) -> str:
    total_score = scores["python_skills"] * 2 + sum(
        value for key, value in scores.items() if key != "python_skills"
    )
    max_score = (len(scores) + 1) * 4  # +1 because python_skills is counted twice
    percentage = (total_score / max_score) * 100

    grade_boundaries = [
        (90, "A+"),
        (85, "A"),
        (80, "A-"),
        (75, "B+"),
        (70, "B"),
        (65, "B-"),
        (60, "C+"),
        (55, "C"),
        (50, "C-"),
    ]

    for boundary, grade in grade_boundaries:
        if percentage >= boundary:
            return grade
    return "D"


def rate_python_skills(python_skills: str) -> int:
    skill_levels = ["No experience", "Beginner", "Intermediate", "Advanced"]
    return skill_levels.index(
        next((level for level in skill_levels if level in python_skills), 0)
    )


# This ain't efficient and will not be good enough, but it's a quick way to get the score for progamatically doing.
def rate_project_experience(project_experience: str) -> int:
    keywords = [
        "web",
        "ai",
        "mobile",
        "app",
        "html",
        "css",
        "javascript",
        "js",
        "react",
        "angular",
        "vue",
        "android",
        "ios",
        "flask",
        "django",
        "fastapi",
        "machine learning",
        "data science",
        "backend",
        "frontend",
        "full stack",
        "game",
        "automation",
        "scripting",
    ]
    return min(sum(keyword in project_experience.lower() for keyword in keywords), 5)


def rate_motivation(motivation: str) -> int:
    # Detailed analysis can be done using an LLM, here we use a simple heuristic
    return min(len(motivation.split()) // 10, 4)


def rate_availability(hours_per_week: int) -> int:
    return min(hours_per_week // 8, 4)


def rate_commitment(commitment: str) -> int:
    return 2 if "yes" in commitment.lower() else 0


def rate_location(location: str) -> int:
    locations = {"Myanmar": 4, "Thailand": 2}
    return locations.get(location, 0)


In [5]:
report = evaluate_applicant(applicants[0])
report

{'python_skills': 1,
 'project_experience': 5,
 'motivation': 4,
 'availability': 2,
 'commitment': 2,
 'location': 4,
 'overall_score': 'B-'}

This example applicant is indeed doing well with their extensive project experience. However, the applicant lacks python skills which is kind of like a mandatory criteria for the program. Hence, the applicant ended up getting only a B- grade. This is kind of like a correct evaluation and align with our manual evaluation.

Let's see how we can evaluate them with LLM.

#### Agent Cost
Since it is so troublesome to check the costs on the dashboard, we are going to calculate the cost directly here.

In [6]:
def calculate_agent_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    pricing = {
        "claude-3-opus-20240229": (15.00, 75.00),
        "claude-3-haiku-20240307": (0.25, 1.25),
        "claude-3-sonnet-20240229": (3.00, 15.00),
        "claude-3-5-sonnet-20240620": (3.00, 15.00),
        "gpt-4o": (5.00, 15.00),
    }

    input_cost_per_mtok, output_cost_per_mtok = pricing[model]

    return (
        input_tokens * input_cost_per_mtok + output_tokens * output_cost_per_mtok
    ) / 1_000_000

### Evaluation with OpenAI GPT-4o
For this experiement, we are using OpenAI GPT-4o to evaluate the applicants. The prompt itself is refined with GPT-4o.
Note that I am using <tags> to format the output. It is one of the best practices that work well with Anthropic's LLMs.
I noticed that gpt-4o struggle to format the outputs within the <tags> I have defined if the temperature is low. Hence, I am using temperature of 0.5.

In [7]:
import os
from typing import Tuple

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

oai_client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
)

GPT_MODEL = "gpt-4o"


def detailed_evaluation(applicant: Applicant) -> Tuple[str, float]:
    messages = (
        {
            "role": "user",
            "content": f"""
                    Evaluate the following junior developer applicant for a community program aimed at gaining work experience.

                    Applicant Information:
                    Name: {applicant.name}
                    Location: {applicant.location}
                    Python Level: {applicant.python_skills}
                    Python Project Experience: {applicant.worked_on_python_projects}
                    Other Project Experience: {applicant.project_experience}
                    Completed Courses: {applicant.completed_courses}
                    Motivation: {applicant.motivation}
                    Career Goals: {applicant.career_goals}
                    Hours per Week: {applicant.hours_per_week} (8 hours/week is minimum, 16 hours/week is average, 24-48 hours/week is high)
                    Additional Information: {applicant.additional_info}
                    Education Background: {applicant.education_background}

                    Review the applicant's information, focusing on their Python knowledge, motivation, availability, and overall suitability for the program. If Python knowledge is lacking, consider their project experience and other technical skills.

                    Provide a detailed evaluation in a <reasoning> tag, addressing the following criteria:
                    - Python Knowledge: Assess the applicant's Python skills based on their courses, projects, and self-reported experience.
                    - Project Experience: Evaluate the applicant's hands-on project experience in other technology stack. Consider this if the applicant doesn't have python knowledge.
                    - Motivation and Eagerness: Determine the applicant's motivation and eagerness to learn and grow as a developer.
                    - Availability: Assess the applicant's time commitment to the program.
                    - Education Background: Consider the relevance of the applicant's educational background to the program.
                    - Overall Suitability: Summarize the applicant's strengths and weaknesses and their fit for the program.

                    Remember that python skill is a must for this program.

                    Conclude with an <overall_suitability> tag summarizing the applicant's fit for the program in 3-5 sentences, and a <score> tag with a letter grade from A+ to C-.
                """,
        },
    )

    response = oai_client.chat.completions.create(
        model=GPT_MODEL,
        messages=messages,
        temperature=0.5,
    )

    cost = calculate_agent_cost(
        GPT_MODEL,
        response.usage.prompt_tokens,
        response.usage.completion_tokens,
    )

    print(f"GPT-4o Cost: ${cost:.4f}")

    return response.choices[0].message.content, cost

### Evaluation with Claude 3.5 Sonnet
We are going to use Claude 3.5 Sonnet to evaluate the applicants.
Some modification has been made to the prompt to fit the format of the output we want.

In [8]:
import os

import anthropic
from dotenv import load_dotenv

load_dotenv()

client = anthropic.Anthropic(
    api_key=os.getenv("ANTHROPIC_API_KEY"),
)

SONNET_MODEL = "claude-3-5-sonnet-20240620"


def detailed_evaluation_with_claude(applicant: Applicant) -> Tuple[str, float]:
    response = client.messages.create(
        model=SONNET_MODEL,
        max_tokens=1000,
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"You are tasked with evaluating a junior developer applicant for a community program aimed at giving the work experience. The applicant's information will be provided to you in the following format:\n\n<applicant_info>\nName: {applicant.name}\nLocation: {applicant.location}\nPython Level: {applicant.python_skills}\nPython Project Experience: {applicant.worked_on_python_projects}\nOther Project Experience: {applicant.project_experience}\nCompleted Courses: {applicant.completed_courses}\nMotivation: {applicant.motivation}\nCareer Goals: {applicant.career_goals}\nHours per Week: {applicant.hours_per_week}\nAdditional Information: {applicant.additional_info}\nEducation Background: {applicant.education_background}\n</applicant_info>\n\nCarefully review the applicant's information, focusing on their Python knowledge, motivation, availability, and overall suitability for the program. If Python knowledge is lacking, consider their project experience on other technology stacks and other technical skills.\n\nProvide a detailed evaluation by addressing the following criteria:\n\n1. Python Knowledge: Assess the applicant's Python skills based on their courses, projects, and self-reported experience.\n2. Project Experience: Evaluate the applicant's hands-on project experience in both Python and other technology stacks. Consider non-Python experience if the applicant lacks Python knowledge.\n3. Motivation and Eagerness: Determine the applicant's motivation and eagerness to learn and grow as a developer.\n4. Availability: Assess the applicant's time commitment to the program. (Note: 8 hours/week is minimum, 16 hours/week is average, 24-48 hours/week is high)\n5. Education Background: Consider the relevance of the applicant's educational background to the program.\n6. Overall Suitability: Summarize the applicant's strengths and weaknesses and their fit for the program.\n\nRemember that Python skill is a must for this program. However, python project experience are not a mandatory.\n\nPresent your evaluation in the following format:\n\n<reasoning>\n[Provide a detailed evaluation addressing each of the criteria listed above. Use paragraph breaks to separate each criterion.]\n</reasoning>\n\n<overall_suitability>\n[Summarize the applicant's fit for the program in 3-5 sentences.]\n</overall_suitability>\n\n<score>\n[Provide a letter grade from A+ to C- based on your evaluation.]\n</score>\n\nEnsure that your reasoning is thorough and objective, and that your overall suitability summary and score accurately reflect your detailed evaluation.",
                    }
                ],
            }
        ],
    )

    cost = calculate_agent_cost(
        SONNET_MODEL,
        response.usage.input_tokens,
        response.usage.output_tokens,
    )

    print(f"Claude 3.5 Sonnet Cost: ${cost:.4f}")

    return response.content[0].text, cost


In our prompt, we added that the output should be formated with different kind of tags.
Thus, we need to parse those tags individually <reasoning>, <overall_suitability> and <score>. And then we extract the content within those tags.

For future use, we are generating our report in JSON format.

In [9]:
from typing import List


def generate_report(applicants: List[Applicant], evaluator: str) -> Tuple[List[dict], float]:
    report = []
    total_cost = 0
    for i, applicant in enumerate(applicants, start=1):
        print(f"Processing applicant {i} of {len(applicants)}")

        if evaluator == "claude":
            evaluation, cost = detailed_evaluation_with_claude(applicant)
        elif evaluator == "gpt":
            evaluation, cost = detailed_evaluation(applicant)
        else:
            raise ValueError(f"Invalid evaluator: {evaluator}")

        reasoning = extract_tag_content(evaluation, "reasoning")
        overall_suitability = extract_tag_content(evaluation, "overall_suitability")
        score = extract_tag_content(evaluation, "score")

        report.append(
            {
                "applicant_number": i,
                "name": applicant.name,
                "location": applicant.location,
                "python_knowledge": extract_section(reasoning, "Python Knowledge"),
                "project_experience": extract_section(reasoning, "Project Experience"),
                "motivation_and_eagerness": extract_section(
                    reasoning, "Motivation and Eagerness"
                ),
                "availability": extract_section(reasoning, "Availability"),
                "education_background": extract_section(
                    reasoning, "Education Background"
                ),
                "overall_suitability": overall_suitability,
                "score": score,
            }
        )

        total_cost += cost

    return report, total_cost


def extract_tag_content(text: str, tag: str) -> str:
    start_tag = f"<{tag}>"
    end_tag = f"</{tag}>"
    start = text.find(start_tag) + len(start_tag)
    end = text.find(end_tag)
    return text[start:end].strip()


def extract_section(text: str, section_name: str) -> str:
    start = text.find(f"{section_name}:") + len(section_name) + 1
    end = text.find("\n", start)
    return text[start:end].strip()

Now, let's see how well the LLM evaluated the applicants.

In [10]:
report, total_cost = generate_report(applicants, "gpt")
claude_report, claude_total_cost = generate_report(applicants, "claude")

print(f"GPT-4o Report generated {report}")
print(f"Total cost: ${total_cost:.4f}")

print("----------------")

print(f"Claude 3.5 Sonnet Report generated {claude_report}")
print(f"Total cost: ${claude_total_cost:.4f}")

Processing applicant 1 of 1
GPT-4o Cost: $0.0099
Processing applicant 1 of 1
Claude 3.5 Sonnet Cost: $0.0118
GPT-4o Report generated [{'applicant_number': 1, 'name': 'Aung Kyaw', 'location': 'Myanmar', 'python_knowledge': 'edge**: Aung Kyaw has completed a beginner-level Python course and has experience solving simple problems with Python. However, he lacks hands-on project experience in Python, which is a critical requirement for this program.', 'project_experience': 'ge**: Aung Kyaw has completed a beginner-level Python course and has experience solving simple problems with Python. However, he lacks hands-on project experience in Python, which is a critical requirement for this program.', 'motivation_and_eagerness': 'Aung Kyaw has completed a beginner-level Python course and has experience solving simple problems with Python. However, he lacks hands-on project experience in Python, which is a critical requirement for this program.', 'availability': 'nowledge**: Aung Kyaw has complete

Potential improvements can be done with parallel processing by sending multiple requests to the OpenAI API.
It will speed up the process if we have a lot of rows to process. Currently, it took around 10 minutes to process 100 rows.