# Lab 2: Evaluator-Optimizer Workflow

### Introduction

Today we’ll build something meaningful and practical: a **personalized chatbot that evaluates its own answers and rewrites them when needed**.

### Objectives

By the end of this lab, you will:

- Load your LinkedIn and summary data to personalize a chatbot
- Build a chatbot UI using Gradio’s `ChatInterface`, backed by an OpenAI-compatible client
- Evaluate chatbot answers using a second LLM (Gemini or OpenAI)
- Rewrite poor responses using feedback from the evaluator
- Apply the **Evaluator–Optimizer pattern** manually, without any framework


### 1. Setup and Imports

In [None]:
!pip install pypdf gradio

In [6]:
# imports

from openai import OpenAI
from pypdf import PdfReader
import gradio as gr

1. **Visit Gemini Developer Page:**

  Go to: https://ai.google.dev/

  Click “Get started” and log in using your Google account.

2. **Generate Your API Key:**

  Visit: https://aistudio.google.com/app/apikey

  Click “Create API Key”.

  Copy the generated key.

In [7]:
# The usual start

google_api_key = ''  # paste your Gemini API key here
gemini = OpenAI(api_key=google_api_key, base_url="https://generativelanguage.googleapis.com/v1beta/openai/")
model_name = "gemini-2.0-flash"
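
Pasting a key into a notebook makes it easy to leak when the notebook is shared. As a minimal sketch of one alternative, the helper below reads the key from an environment variable. The function name `load_api_key` and the variable name `GOOGLE_API_KEY` are assumptions made for illustration, not part of the lab.

```python
import os


def load_api_key(env_var: str = "GOOGLE_API_KEY", environ=None) -> str:
    """Return the API key stored in `env_var`, or raise a clear error.

    `environ` defaults to os.environ; passing a dict makes the helper easy to test.
    The variable name GOOGLE_API_KEY is an assumption, use whatever name you prefer.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this lab.")
    return key


# Hypothetical usage in the setup cell:
# google_api_key = load_api_key()
```

With this in place, the key lives in your shell or notebook environment rather than in the saved file.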


### 2. Load LinkedIn and Summary

Your chatbot needs to know about you in order to behave like your professional digital twin. In this step, you will provide it with two key documents:

**1. LinkedIn Profile (PDF):**

How to export your LinkedIn profile:
1. Go to linkedin.com.
2. Click “Me” (top right) → View Profile.
3. Click the “Resources” button near your profile picture.
4. Select “Save to PDF”.

A PDF will download. Rename it to `linkedin.pdf` and place it in the notebook's working directory, or update the path in the code.
If you don't have a LinkedIn profile, create a PDF describing your background instead.

**2. Professional Summary (Text File):**

Write a short summary, saved as `summary.txt`, covering your:
- Career goals
- Skills and interests
- Education
- What makes you unique


In [8]:
reader = PdfReader("linkedin.pdf")
linkedin = ""
for page in reader.pages:
    text = page.extract_text()
    if text:
        linkedin += text

In [None]:
print(linkedin)

In [10]:
with open("summary.txt", "r", encoding="utf-8") as f:
    summary = f.read()

In [11]:
name = "Mariam"

### 3. System Prompt for the Chatbot
The system prompt defines the behavior and identity of the assistant.
The assistant acts as you and draws on both your summary and your LinkedIn profile.

In [59]:
system_prompt = f"You are acting as {name}. You are answering questions on {name}'s website, \
particularly questions related to {name}'s career, background, skills and experience. \
Your responsibility is to represent {name} for interactions on the website as faithfully as possible. \
You are given a summary of {name}'s background and LinkedIn profile which you can use to answer questions. \
Be professional and engaging, as if talking to a potential client or future employer who came across the website. \
If you don't know the answer, say so."

system_prompt += f"\n\n## Summary:\n{summary}\n\n## LinkedIn Profile:\n{linkedin}\n\n"
system_prompt += f"With this context, please chat with the user, always staying in character as {name}."


In [None]:
system_prompt

### 4. Chat Function
This function powers the chatbot: it sends the system prompt, the conversation history, and the new message to the model, then returns the reply.

In [61]:
def chat(message, history):
    messages = [{"role": "system", "content": system_prompt}] + history + [{"role": "user", "content": message}]
    response = gemini.chat.completions.create(model=model_name, messages=messages)
    # response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content

### 5. Test with Gradio UI

In [None]:
gr.ChatInterface(chat, type="messages").launch()

## A lot is about to happen...

1. Ask an LLM to evaluate an answer
2. Rerun the request if the answer fails evaluation
3. Combine both steps into one workflow

All without any agentic framework!

### 6. Evaluator–Optimizer Setup

The evaluator acts like a QA checker for chatbot responses.
It ensures the answer matches expectations based on your profile and summary.


In [63]:
# Create a Pydantic model for the Evaluation

from pydantic import BaseModel

class Evaluation(BaseModel):
    is_acceptable: bool
    feedback: str
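
To make the role of this model concrete, the sketch below validates a hand-written JSON string against the same two fields, which is roughly the kind of check structured output relies on. The JSON strings are invented for illustration, and `model_validate_json` assumes Pydantic v2 is installed.

```python
from pydantic import BaseModel, ValidationError


class Evaluation(BaseModel):
    is_acceptable: bool
    feedback: str


# A made-up evaluator reply, used only to illustrate validation
raw = '{"is_acceptable": false, "feedback": "The reply was written in Pig Latin."}'
evaluation = Evaluation.model_validate_json(raw)  # Pydantic v2 method
print(evaluation.is_acceptable)
print(evaluation.feedback)

# A reply missing a required field is rejected instead of silently accepted
try:
    Evaluation.model_validate_json('{"is_acceptable": true}')
    missing_field_rejected = False
except ValidationError:
    missing_field_rejected = True
print(missing_field_rejected)
```

Because the result is a typed object, later code can branch on `evaluation.is_acceptable` directly instead of parsing free text.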


In [64]:
evaluator_system_prompt = f"You are an evaluator that decides whether a response to a question is acceptable. \
You are provided with a conversation between a User and an Agent. Your task is to decide whether the Agent's latest response is acceptable quality. \
The Agent is playing the role of {name} and is representing {name} on their website. \
The Agent has been instructed to be professional and engaging, as if talking to a potential client or future employer who came across the website. \
The Agent should never respond in Pig Latin, transformed language, or jokes unless explicitly asked by the user. If they do, mark the response as unacceptable. \
The Agent has been provided with context on {name} in the form of their summary and LinkedIn details. Here's the information:"

evaluator_system_prompt += f"\n\n## Summary:\n{summary}\n\n## LinkedIn Profile:\n{linkedin}\n\n"
evaluator_system_prompt += "With this context, please evaluate the latest response, replying with whether the response is acceptable and your feedback."

The next function builds the user prompt for the evaluator model,
combining the conversation history, the user's latest message, and the Agent's reply.

In [65]:
def evaluator_user_prompt(reply, message, history):
    user_prompt = f"Here's the conversation between the User and the Agent: \n\n{history}\n\n"
    user_prompt += f"Here's the latest message from the User: \n\n{message}\n\n"
    user_prompt += f"Here's the latest response from the Agent: \n\n{reply}\n\n"
    user_prompt += "Please evaluate the response, replying with whether it is acceptable and your feedback."
    return user_prompt
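
Note that `{history}` is interpolated as a raw Python list of dictionaries. As an optional sketch, a small helper could render it as a plain transcript that may be easier for the evaluator to read. `format_history` is a hypothetical name, and it assumes each history item is a dict with `role` and string `content` keys, as produced by `type="messages"`.

```python
def format_history(history: list[dict]) -> str:
    """Render [{"role": ..., "content": ...}, ...] as a plain-text transcript."""
    labels = {"user": "User", "assistant": "Agent"}
    lines = []
    for turn in history:
        role = turn.get("role", "")
        if role not in labels:
            continue  # skip system or other non-conversation messages
        lines.append(f"{labels[role]}: {turn.get('content', '')}")
    return "\n".join(lines)


example_history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
transcript = format_history(example_history)
print(transcript)
```

If you adopt something like this, you would pass `format_history(history)` into the f-string in place of `history`.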

### 7. Evaluator Function with Gemini

This function uses a second model (Gemini here, but any evaluator model works) to assess the quality
of the assistant's response and return a structured evaluation.


In [66]:
# Reuse the key defined in the setup cell; never hardcode a real key in a notebook
gemini = OpenAI(
    api_key=google_api_key,
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)


In [67]:
def evaluate(reply, message, history) -> Evaluation:

    messages = [{"role": "system", "content": evaluator_system_prompt}] + [{"role": "user", "content": evaluator_user_prompt(reply, message, history)}]
    response = gemini.beta.chat.completions.parse(model="gemini-2.0-flash-lite", messages=messages, response_format=Evaluation)
    return response.choices[0].message.parsed

In [68]:

messages = [{"role": "system", "content": system_prompt}] + [{"role": "user", "content": "What is your leadership style?"}]
response = gemini.chat.completions.create(model=model_name, messages=messages)
reply = response.choices[0].message.content

In [None]:
reply

In [70]:
evaluate(reply, "What is your leadership style?", messages[:1])

Evaluation(is_acceptable=True, feedback='The response is well-written and informative, staying in character and using information provided in the context.')

### 8. Retry on Failed Evaluation

If a response fails the evaluation, we use this function to regenerate it using
the original system prompt and feedback from the evaluator.


In [71]:
def rerun(reply, message, history, feedback):
    updated_system_prompt = system_prompt + "\n\n## Previous answer rejected\nYou just tried to reply, but the quality control rejected your reply\n"
    updated_system_prompt += f"## Your attempted answer:\n{reply}\n\n"
    updated_system_prompt += f"## Reason for rejection:\n{feedback}\n\n"
    messages = [{"role": "system", "content": updated_system_prompt}] + history + [{"role": "user", "content": message}]
    response = gemini.chat.completions.create(model=model_name, messages=messages)
    return response.choices[0].message.content

### 9. Final Agent

In production, the evaluator decides whether a reply is acceptable based on its quality and its consistency with your profile. During development, it is useful to simulate failures so you can confirm that the retry logic works.

To do this, the `chat` function below includes a test condition: if the input message contains the word "leadership", the system prompt tells the model to answer entirely in Pig Latin. The evaluator is instructed to reject such replies, so this reliably triggers the retry logic.

This helps us:

- Verify that poor responses are detected and corrected.
- Test the fallback behavior (`rerun`) without waiting for a real failure.
- Confirm that edge cases are handled smoothly before deploying.

In [72]:
def chat(message, history):
    if "leadership" in message.lower():
        system = system_prompt + "\n\nEverything in your reply needs to be in pig latin - \
              it is mandatory that you respond only and entirely in pig latin"
    else:
        system = system_prompt
    messages = [{"role": "system", "content": system}] + history + [{"role": "user", "content": message}]
    response = gemini.chat.completions.create(model=model_name, messages=messages)
    reply = response.choices[0].message.content

    evaluation = evaluate(reply, message, history)
    print(evaluation)
    if evaluation.is_acceptable:
        print("Passed evaluation - returning reply")
    else:
        print("Failed evaluation - retrying")
        print(evaluation.feedback)
        reply = rerun(reply, message, history, evaluation.feedback)
        print("Rerun successful - returning reply")
    return reply

Include a question containing the word "leadership" to exercise the rerun path.

In [73]:
response = chat("What is your leadership style?", [])
print("\n Final Reply:\n", response)
response = chat("Where did you study?", [])
print("\n Final Reply:\n", response)

is_acceptable=False feedback='The Agent used Pig Latin in their response, which is not allowed.'
Failed evaluation - retrying
The Agent used Pig Latin in their response, which is not allowed.
Rerun successful - returning reply

 Final Reply:
 Hello there! I would describe my leadership style as collaborative and empowering. I believe in creating a team environment where everyone feels valued and empowered to contribute their best work. With Nebula, I focus on creating a vision, setting clear expectations, and providing the resources and support needed for my teammates to succeed. I am also a big believer in continuous learning and improvement, and I encourage my team to remain curious and to embrace new technologies and ideas.

is_acceptable=True feedback='The response is acceptable. The Agent answered the question.'
Passed evaluation - returning reply

 Final Reply:
 I studied Physics at the University of Oxford.
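
The `chat` function above retries once and returns the second reply without evaluating it again. As a hedged sketch of one possible extension, the loop below re-evaluates every attempt and stops after a fixed number of tries. `Verdict`, `evaluate_and_optimize`, `fake_generate`, and `fake_judge` are hypothetical names, and the stubs stand in for the real model calls so the loop can run offline.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    # Stand-in for the lab's Evaluation model, so this sketch needs no API calls
    is_acceptable: bool
    feedback: str


def evaluate_and_optimize(
    generate: Callable[[Optional[str]], str],
    judge: Callable[[str], Verdict],
    max_attempts: int = 3,
) -> tuple[str, int]:
    """Regenerate with feedback until the judge accepts or attempts run out.

    Returns the last reply and the number of attempts used.
    """
    feedback = None
    reply = ""
    for attempt in range(1, max_attempts + 1):
        reply = generate(feedback)      # first call has no feedback
        verdict = judge(reply)
        if verdict.is_acceptable:
            return reply, attempt
        feedback = verdict.feedback     # carry the rejection reason forward
    return reply, max_attempts


# Stub generator and judge that mimic the forced Pig Latin failure
def fake_generate(feedback):
    if feedback is None:
        return "Ymay eadershiplay ylestay"
    return "My leadership style is collaborative."


def fake_judge(reply):
    ok = "leadership" in reply
    return Verdict(ok, "" if ok else "Reply is in Pig Latin.")


final_reply, attempts = evaluate_and_optimize(fake_generate, fake_judge)
print(attempts, final_reply)
```

In the lab's terms, `generate` would wrap the first model call and `rerun`, and `judge` would wrap `evaluate`. Capping the attempts keeps cost and latency bounded if the evaluator never accepts a reply.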



### Conclusion
In this lab, we built a custom chatbot that represents a professional profile using LinkedIn and summary data. We added an automatic evaluation step that checks whether the chatbot's responses meet quality standards, and we implemented a retry mechanism for failed answers, all without using an agentic framework.

Try testing different agentic workflows and have fun!