<a href="https://colab.research.google.com/github/QiyuanTan/AI-study-guide-backend/blob/master/Quantitative-Eval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About

This notebook implements an AI-agent workflow that evaluates any set of items with a group of customizable personas and criteria.

The workflow goes as follows:  
1. Prompt setup
   1. Set the evaluation subject and items
   2. Create persona definitions and criteria
   3. Set up the prompt template for personas
2. Generate evaluations
   1. For each persona, apply the prompt template and make an API call
   2. Parse and store the evaluations
3. Process evaluations
   1. Normalize the results
   2. Apply weights
4. Display the results

# Instructions

## Get a Google API key
> You can skip this step if you already have a key in Google AI Studio

1. Select the `Secrets` tab on the left
2. Go to `Gemini API keys` -> `Manage API keys in Google AI Studio`



# Environment setup

In [None]:
!pip install -q -r https://raw.githubusercontent.com/AI-Agents-Prompts-to-Multi-Agent-Sys/Quantitative-Eval/master/requirements.txt

In [None]:
import os
import json
import operator
import re
from copy import deepcopy
from typing import TypedDict, Annotated, List

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.constants import END, START
from langgraph.graph import StateGraph
from tqdm import tqdm
from tenacity import retry, stop_after_attempt

# Load environment variables (GOOGLE_API_KEY should be set either in .env file or in the secrets)
try:
    from google.colab import userdata
    os.environ["GOOGLE_API_KEY"] = userdata.get('GOOGLE_API_KEY')
except ImportError:
    from dotenv import load_dotenv
    load_dotenv()
except KeyError:
    raise KeyError("Please set the GOOGLE_API_KEY in your secrets.")

# LLM config
# Here you can change the model, tweak its parameters, or even use a different LLM provider
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash-preview-05-20", temperature=0.7)

# Prompt setup

First, set up the evaluation subject and the items to evaluate.

In [None]:
# What to evaluate
EVALUATION_SUBJECT = "band"

# List of items to evaluate
ITEMS = [
    "The Beatles", "Led Zeppelin", "Pink Floyd", "Queen", "The Rolling Stones",
    "Metallica", "Megadeth", "Black Sabbath", "Iron Maiden", "Tool"
]

Use one of the following two cells to initialize the personas and the evaluation criteria.

In [None]:
# Initialize personas and criteria with an LLM
persona_nums = 5
criterion_nums = 5

prompt = f"""
You are setting up an experiment that asks a diverse group of virtual individuals to evaluate {len(ITEMS)} {EVALUATION_SUBJECT}s. You now need to define these personas and the criteria for them to follow.

The definition of personas should follow the following principle:
- The group should have a great diversity so that they could reflect the diverse opinions on {EVALUATION_SUBJECT}s.
- Do not give similar or repeated persona definitions.
- You are speaking to the personas when you give their definitions. Your persona definitions should start with "You're".

The definition of the criteria should follow the following principle:
- Be clear and specific
- Each criterion should measure a unique aspect
- The personas will fill in a number from 1-5 for each criterion, so make sure each one can be answered with a number
- The higher the score, the more that {EVALUATION_SUBJECT} is preferred.

You are asked to give {persona_nums} persona definitions and {criterion_nums} criteria

Your response should follow this JSON format
```json
{{
    "personas": {{
        "persona_name_here": "persona's worldview here",
        // ...More personas
    }},
    "criteria": {{
        "criterion_name_here": "criterion definition here",
        // ... More criteria
    }},
    "persona_role": "Describe the personas' identity in a few words, e.g. critics. You are not supposed to say they are virtual",
    "instruction": "Here you give a base instruction for the personas, e.g. you were asked to evaluate {EVALUATION_SUBJECT}s"
}}
```

Do not include any commentary outside the JSON block.
"""

def clean_json_string(text: str) -> str:
    cleaned = re.sub(r"```(?:json)?", "", text)
    return cleaned.replace("```", "").strip()

@retry(stop=stop_after_attempt(3))
def generate_persona_criteria():
    response = llm.invoke(prompt)
    response_cleaned = clean_json_string(response.content)
    data = json.loads(response_cleaned)
    return data["personas"], data['criteria'], data['persona_role'], data['instruction']

PERSONAS, CRITERIA, PERSONA_ROLE, INSTRUCTION = generate_persona_criteria()
print(PERSONAS)
print(CRITERIA)
print(PERSONA_ROLE)
print(INSTRUCTION)
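
The model's reply is expected to be a single JSON object, possibly wrapped in a Markdown code fence. A minimal sketch of the clean-then-parse step is shown below. The `sample_reply` text is made up for illustration and is not real model output, and the fence string is built programmatically only to keep the example readable.

```python
import json
import re

FENCE = "`" * 3  # three backticks

def clean_json_string(text: str) -> str:
    # Drop opening fences (with or without a "json" tag) and closing fences
    return re.sub(FENCE + r"(?:json)?", "", text).strip()

# Hypothetical reply, shaped like the format requested in the prompt
sample_reply = (
    FENCE
    + 'json\n{"personas": {"critic": "You\'re a jazz critic."}, '
    + '"criteria": {"Groove": "Rhythmic feel."}}\n'
    + FENCE
)

data = json.loads(clean_json_string(sample_reply))
print(data["personas"]["critic"])
```

If the reply is not valid JSON, `json.loads` raises `json.JSONDecodeError`; because the calling function is wrapped in `@retry(stop=stop_after_attempt(3))`, that failure triggers another attempt.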

In [None]:
# Initialize personas and criteria manually

# Persona definitions
# Each persona generates one call to the LLM, so please be aware of the rate limits of your LLM provider
# For Gemini, the rate limit is 10 calls per minute for 2.5 flash models and 5 calls per minute for 2.5 pro models.
PERSONAS = {
    "metalhead": "You're in your 30s, a lifelong metal fan. You value power, aggression, instrumental mastery, and complexity. You dismiss pop and overproduced music as shallow.",
    "popstar": "You're in your 20s, immersed in social media culture. You love global accessibility, emotional resonance, and catchy choruses. You believe great bands bring joy and unity.",
    "boomer": "You're in your 70s. You grew up during the golden age of rock and believe greatness is rooted in legacy, songwriting, and timeless appeal. Newer music feels synthetic to you.",
    "genz": "You're a teenager, online-native, and value diversity, identity, and innovation in music. You're drawn to bands that say something real or break genre rules.",
    "indie": "You're in your 30s, an art-school type who craves authenticity, emotion, and underground cool. You dislike commercial polish and love expressive weirdness.",
}

# Criteria for evaluation
CRITERIA = {
    "Musical Innovation": "Pioneering ideas, new sounds, genre blending.",
    "Cultural Impact": "Broader societal influence, pop culture penetration.",
    "Lyrical or Thematic Depth": "Narrative richness, philosophical weight, relatability.",
    "Technical Proficiency": "Musical complexity, virtuosity, performance execution.",
    "Live Performance Strength": "Energy, presence, crowd connection on stage.",
    "Legacy & Longevity": "Enduring influence across generations and artists."
}

# Instructions/background information for the personas
PERSONA_ROLE = "music critic"
INSTRUCTION = f"You have been asked to evaluate the greatness of {len(ITEMS)} historically significant bands across genres including rock, metal, pop, and progressive."

Establish the template and test the prompt

In [None]:
# Prompt template
def make_prompt(persona_description):
    return f"""You are a {PERSONA_ROLE} with the following worldview:

{persona_description}

{INSTRUCTION}

Evaluate each {EVALUATION_SUBJECT} based on the following {len(CRITERIA)} criteria, scoring from 1 (low) to 5 (high):

{"".join(f"{key}: {value}{chr(10)}" for key, value in CRITERIA.items())}
Here are the {EVALUATION_SUBJECT}s to evaluate:
{chr(10).join('- ' + item for item in ITEMS)}

Please respond ONLY in the following strict JSON format:

```json
{{
  "ratings": [
    {{
      "item": "the corresponding {EVALUATION_SUBJECT} name here, following the ordering in the given list"{"".join(f',{chr(10)}      "{criteria}": int' for criteria in CRITERIA)}
    }},
    // ...More {EVALUATION_SUBJECT} evaluations here
  ],
  "justification": "Your paragraph explaining the ratings here.",
  "ranking": ["{EVALUATION_SUBJECT}1", "{EVALUATION_SUBJECT}2", ..., "{EVALUATION_SUBJECT}{len(ITEMS)}"]
}}
```

- The ratings list must include all {len(ITEMS)} {EVALUATION_SUBJECT}s.
- The ranking list must be in your personal order (1st to {len(ITEMS)}th).
- Do not include any commentary outside the JSON block.
"""

print("Example prompt:")
print(make_prompt(list(PERSONAS.values())[0]))
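
The template asks for every item to be rated with an integer from 1 to 5 and for the ranking to cover all items. Those constraints can be checked before a reply is used. The sketch below assumes a hypothetical `validate_ratings` helper that is not part of this notebook, and it runs on made-up data.

```python
def validate_ratings(data, items, criteria):
    """Return a list of problems found in a parsed reply (an empty list means it looks fine)."""
    problems = []
    rated = [row.get("item") for row in data.get("ratings", [])]
    missing = [item for item in items if item not in rated]
    if missing:
        problems.append(f"missing items: {missing}")
    for row in data.get("ratings", []):
        for criterion in criteria:
            score = row.get(criterion)
            # Scores must be integers within the 1-5 scale requested by the prompt
            if not isinstance(score, int) or not 1 <= score <= 5:
                problems.append(f"{row.get('item')}: bad score for {criterion!r}: {score!r}")
    if sorted(data.get("ranking", [])) != sorted(items):
        problems.append("ranking does not list every item exactly once")
    return problems

items = ["Queen", "Tool"]
criteria = ["Cultural Impact"]
reply = {
    "ratings": [{"item": "Queen", "Cultural Impact": 5}, {"item": "Tool", "Cultural Impact": 7}],
    "justification": "placeholder",
    "ranking": ["Queen", "Tool"],
}
print(validate_ratings(reply, items, criteria))
```

One possible use is to raise an exception when the list is non-empty, so that a retry wrapper such as tenacity's `@retry` requests a fresh reply.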

# Generate Evaluations

In [None]:
print("Starting evaluation...")

# Clean LLM output
def clean_json_string(text: str) -> str:
    cleaned = re.sub(r"```(?:json)?", "", text)
    return cleaned.replace("```", "").strip()

# Parse JSON
def parse_json_response(response):
    response_cleaned = clean_json_string(response)
    data = json.loads(response_cleaned)
    ratings = data["ratings"]
    justification = data["justification"]
    ranking = data["ranking"]

    # Rank is the 1-based position of each item in the persona's ranking list
    ranking_column = [{"item": item, "rank": i} for i, item in enumerate(ranking, start=1)]

    df = pd.DataFrame(ratings)

    df = pd.merge(df, pd.DataFrame(ranking_column), on="item", how="left")
    df.columns = [EVALUATION_SUBJECT] + list(CRITERIA.keys()) + ["Rank"]
    return df, justification

# Get llm response
@retry(stop=stop_after_attempt(3))
async def get_llm_response(prompt):
    response = await llm.ainvoke(prompt)
    return parse_json_response(response.content)

# State definition
class Vote(TypedDict):
    df: pd.DataFrame
    justification: str
    persona: str

class State(TypedDict):
    votes: Annotated[List[Vote], operator.add]

# Initialize progress bar
try:
    pbar.close()
except NameError:
    pass

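# Pass the label as `desc`; the first positional argument of tqdm is the iterable, not the description
pbar = tqdm(desc=f"Evaluating {EVALUATION_SUBJECT}s with personas", total=len(PERSONAS), unit="persona")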

# Agent node
def make_agent_node(persona_key):
    async def node(state):
        persona = PERSONAS[persona_key]
        prompt = make_prompt(persona)
        df, justification = await get_llm_response(prompt)

        pbar.update(1)
        # Return only this node's contribution; the operator.add reducer on `votes` appends it
        return {"votes": [{
            "df": df,
            "justification": justification,
            "persona": persona_key,
        }]}
    return node

# Graph build
agent_keys = list(PERSONAS.keys())

graph = StateGraph(State)
for agent in agent_keys:
    graph.add_node(agent, make_agent_node(agent))

# Graph edges
for agent in agent_keys:
    graph.add_edge(START, agent)
# Passing a list of start nodes makes END wait for all persona nodes to finish
graph.add_edge(agent_keys, END)

# Run
compiled = graph.compile()
results = await compiled.ainvoke({
    "votes": [],
})

votes = results['votes']
criteria_keys = list(CRITERIA.keys())

# Display results
for vote in votes:
    print(f"Evaluation by persona: {vote['persona']}")
    display(vote['df'])
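
The `votes` field is annotated with `operator.add`, so the updates from the persona nodes are combined by list concatenation rather than overwriting each other. The sketch below emulates that merge step in plain Python, without LangGraph; the persona names and payloads are placeholders.

```python
import operator
from functools import reduce

# Each persona node contributes a partial state update holding exactly one vote
updates = [
    {"votes": [{"persona": "metalhead", "justification": "placeholder"}]},
    {"votes": [{"persona": "popstar", "justification": "placeholder"}]},
    {"votes": [{"persona": "boomer", "justification": "placeholder"}]},
]

# Emulate the reducer: start from the initial (empty) list and fold in every update
initial_votes = []
merged_votes = reduce(operator.add, (u["votes"] for u in updates), initial_votes)

print([v["persona"] for v in merged_votes])
```

Each vote carries its own `persona` key, so later cells can identify who produced a result without depending on list position.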

# Process Evaluations

## Optional: Normalize the scores
This step reduces the bias from individual scoring tendencies.
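
As an illustration of why this helps, the sketch below uses only the standard library and two made-up personas who have the same preferences but use different parts of the 1-5 scale. `pstdev` is the population standard deviation, which matches NumPy's default `std()` used in the cell that follows. The zero-variance guard is an addition for this example.

```python
from statistics import fmean, pstdev

# Two made-up personas with identical preferences but different scoring habits
harsh = {"A": [1, 2], "B": [2, 2], "C": [3, 2]}      # item -> scores per criterion
generous = {"A": [3, 4], "B": [4, 4], "C": [5, 4]}   # same pattern, two points higher

def z_normalize(scores):
    """Rescale one persona's scores by the mean and population std of all their scores."""
    flat = [s for row in scores.values() for s in row]
    mean, std = fmean(flat), pstdev(flat)
    if std == 0:
        # Every score is identical: return zeros rather than divide by zero
        return {item: [0.0] * len(row) for item, row in scores.items()}
    return {item: [(s - mean) / std for s in row] for item, row in scores.items()}

harsh_z = z_normalize(harsh)
generous_z = z_normalize(generous)

# Largest remaining difference between the two personas after normalization
print(max(abs(a - b) for item in harsh_z for a, b in zip(harsh_z[item], generous_z[item])))
```

Without normalization, the generous persona would contribute two extra points per score to every item; after it, both personas carry the same weight in the totals.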

In [None]:
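# Z-score each persona's scores using the mean and std of all of that persona's scores
for vote in votes:
    all_values = vote['df'][criteria_keys].values.flatten()
    mean = all_values.mean()
    std_dev = all_values.std()
    if std_dev == 0:
        # A persona that gave every score the same value has no spread; avoid dividing by zero
        std_dev = 1
    vote['df'][criteria_keys] = (vote['df'][criteria_keys] - mean) / std_dev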

# Shift all personas by the same global minimum so that every score is non-negative
min_z = min(vote['df'][criteria_keys].min().min() for vote in votes)
for vote in votes:
    vote['df'][criteria_keys] = vote['df'][criteria_keys] - min_z

# Display normalized scores
for vote in votes:
    print(f"Normalized scores for persona: {vote['persona']}")
    display(vote['df'])

## Set the weight for each criterion
> Please note that the size and order of the weights array must match the number of criteria defined in the `CRITERIA` dictionary.
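
Because the weights are matched to criteria by position, it is easy to misalign them. One way to reduce that risk, sketched here with a hypothetical `weights_by_name` dictionary that is not part of this notebook, is to declare weights by criterion name and derive the positional list from it.

```python
criteria_keys = ["Musical Innovation", "Cultural Impact", "Technical Proficiency"]  # example subset

# Hypothetical: weights declared by name, defaulting to 1 for any criterion not listed
weights_by_name = {"Musical Innovation": 2, "Technical Proficiency": 0.5}

unknown = set(weights_by_name) - set(criteria_keys)
if unknown:
    raise ValueError(f"Weights given for unknown criteria: {sorted(unknown)}")

# Positional list in the same order as criteria_keys
weights = [weights_by_name.get(key, 1) for key in criteria_keys]
print(weights)  # [2, 1, 0.5]

# Weighted sum for a single item's (made-up) scores
scores = {"Musical Innovation": 4, "Cultural Impact": 5, "Technical Proficiency": 2}
score_sum = sum(scores[key] * w for key, w in zip(criteria_keys, weights))
print(score_sum)  # 14.0
```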

In [None]:
# Here are your criteria keys for reference
for key in criteria_keys:
    print(key)

In [None]:
# Set the weight for each criterion

weights = [1, 1, 1, 1, 1, 1] # Change me
assert len(weights) == len(criteria_keys), "weights must have exactly one entry per criterion"

weighted_votes = [deepcopy(vote) for vote in votes]

for weighted_vote, vote in zip(weighted_votes, votes):
    for i, key in enumerate(criteria_keys):
        weighted_vote['df'][key] = vote['df'][key] * weights[i]
    weighted_vote['df']['score_sum'] = weighted_vote['df'][criteria_keys].sum(axis=1)

# Display weighted scores
for weighted_vote in weighted_votes:
    print(f"Weighted scores for persona: {weighted_vote['persona']}")
    display(weighted_vote['df'])

## Calculate the final scores

In [None]:
# Sum the weighted scores across personas.
# Rows are added by position, so this assumes every persona returned the items in the same order.
final_scores = weighted_votes[0]['df'].copy().drop(columns=['score_sum', 'Rank'])
for vote in weighted_votes[1:]:
    final_scores[criteria_keys] += vote['df'][criteria_keys]

# Sort by total score (ascending, so the highest scorer is drawn at the top of the horizontal bar chart)
final_scores['total score'] = final_scores[criteria_keys].sum(axis=1)
final_scores = final_scores.sort_values(by='total score')
final_scores = final_scores.drop(columns=['total score'])

final_scores = final_scores.set_index(EVALUATION_SUBJECT)

final_scores

# Visualization

## Score Breakdown
This shows the breakdown of scores for each item across all criteria.

In [None]:
final_scores.plot(kind='barh', stacked=True, figsize=(12, 7), colormap='tab20c')
plt.title("Score Breakdown")
plt.xlabel("Total Score")
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

## Radar Chart

In [None]:
items = final_scores.mean(axis=1).sort_values(ascending=False).index

for item in items:
    values = final_scores.loc[item, criteria_keys].values.flatten().tolist()
    values += values[:1]

    angles = np.linspace(0, 2 * np.pi, len(criteria_keys), endpoint=False).tolist()
    angles += angles[:1]

    plt.figure(figsize=(6, 6))
    ax = plt.subplot(111, polar=True)
    ax.plot(angles, values, linewidth=2, label=item)
    ax.fill(angles, values, alpha=0.3)
    ax.set_thetagrids(np.degrees(angles[:-1]), criteria_keys)
    ax.set_title(f"{item} Score Profile")
    plt.show()

## Heatmap of Criteria Correlation

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(final_scores[criteria_keys].corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Between Evaluation Criteria")
plt.show()

## Heatmap of individual rankings

In [None]:
rank_matrix = []
band_names = weighted_votes[0]['df'][EVALUATION_SUBJECT].tolist()
for weighted_vote in weighted_votes:
    rank_row = dict(zip(weighted_vote['df'][EVALUATION_SUBJECT], list(weighted_vote['df']['Rank'])))
    rank_matrix.append(rank_row)

rank_df = pd.DataFrame(rank_matrix, index=[i['persona'] for i in weighted_votes], columns=band_names)
plt.figure(figsize=(14, 6))
sns.heatmap(rank_df, cmap="coolwarm", annot=True, fmt="d", cbar_kws={"label": "Rank (lower is better)"})
plt.title(f"{EVALUATION_SUBJECT} Rankings by Persona")
plt.xlabel(EVALUATION_SUBJECT)
plt.ylabel("Persona")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Statistics

## Summary on rankings

In [None]:
item_ranks = {item: [] for item in ITEMS}
for vote in weighted_votes:
    for item, rank in zip(vote['df'][EVALUATION_SUBJECT], vote['df']['Rank']):
        item_ranks[item].append(rank)

stats = {
    item: {
        "mean_rank": np.mean(ranks),
        "std_dev": np.std(ranks),
        "min_rank": min(ranks),
        "max_rank": max(ranks)
    }
    for item, ranks in item_ranks.items()
}

# change this value to sort by a different statistic ⬇️. This currently shows the diversity of opinions.
rank_stats_df = pd.DataFrame(stats).T.sort_values("std_dev", ascending=False)

rank_stats_df
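
To illustrate how the standard deviation of ranks reflects disagreement, the sketch below compares two made-up items using only the standard library. `pstdev` is the population standard deviation, the same quantity `np.std` computes by default.

```python
from statistics import fmean, pstdev

# Made-up ranks given to two items by five personas (1 = best)
item_ranks = {
    "Item A": [1, 1, 2, 1, 2],   # broad agreement
    "Item B": [1, 10, 3, 9, 2],  # polarizing
}

stats = {
    item: {"mean_rank": fmean(ranks), "std_dev": pstdev(ranks)}
    for item, ranks in item_ranks.items()
}

# Pick the item with the largest std_dev, i.e. the most contested one
most_contested = max(stats, key=lambda item: stats[item]["std_dev"])
print(most_contested)  # Item B
```

Sorting by `mean_rank` instead would surface consensus favorites rather than contested items.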


## Summary on scores

In [None]:
item_scores = {f'{item}: {criteria}': [] for item in ITEMS for criteria in criteria_keys}
for vote in weighted_votes:
    for criteria in criteria_keys:
        for item, score in zip(vote['df'][EVALUATION_SUBJECT], vote['df'][criteria]):
            item_scores[f'{item}: {criteria}'].append(score)

stats = {
    item: {
        "mean_score": np.mean(scores),
        "std_dev": np.std(scores),
        "min_score": min(scores),
        "max_score": max(scores)
    }
    for item, scores in item_scores.items()
}

# change this value to sort by a different statistic ⬇️. This currently shows the diversity of opinions.
score_stats_df = pd.DataFrame(stats).T.sort_values("std_dev", ascending=False)

score_stats_df