# LLM Evaluator

Today we're going to build something with immediate value!

I've put a single file, `Profile.pdf`, next to this notebook - it's a PDF download of my LinkedIn profile.

Please replace it with yours!

I've also made a file called `Summary.txt`, a short written summary that the code loads alongside the profile.

We're not going to use tools just yet - we'll add them tomorrow.

In [28]:
# If you don't know what any of these packages do - you can always ask ChatGPT for a guide!

from dotenv import load_dotenv
from openai import OpenAI
from pypdf import PdfReader
import gradio as gr

In [29]:
load_dotenv(override=True)
openai = OpenAI()

In [30]:
reader = PdfReader("Profile.pdf")
linkedin = ""
for page in reader.pages:
    text = page.extract_text()
    if text:
        linkedin += text
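
The loop above can also be wrapped in a small helper. The sketch below is one way to do it and is not part of the original notebook: `join_page_text` and `FakePage` are hypothetical names, and the helper only assumes each page object has an `extract_text()` method, as pypdf pages do. It joins pages with a newline so the last word of one page is not glued to the first word of the next, and it skips pages that yield no text (for example, scanned images).

```python
def join_page_text(pages):
    """Concatenate the text of page-like objects, skipping empty pages."""
    chunks = []
    for page in pages:
        text = page.extract_text()
        if text:  # None or "" means nothing could be extracted from this page
            chunks.append(text)
    return "\n".join(chunks)


class FakePage:
    """Stand-in for a pypdf page so the helper can be tried without a PDF."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


sample = join_page_text([FakePage("Contact"), FakePage(""), FakePage("Summary")])
print(sample)
```

With a real file, this would be called as `linkedin = join_page_text(PdfReader("Profile.pdf").pages)`.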

In [31]:
print(linkedin)

   
Contact
+923367464851 (Mobile)
arslanchaos@gmail.com
www.linkedin.com/in/arslankas
(LinkedIn)
github.com/ArslanKAS (Portfolio)
Top Skills
C++
WordPress
Photogrammetry
Languages
English
Certifications
Intro to ChatGPT and Generative AI
Applied Data Science Workshop
Content Marketing & Advertising
Abacus.AI AI/ML Platform Workshop
NIAIS Data Science Workshop
Arsalan Ali
Data Scientist | AI-Engineer | LLM-Engineer
Faisalabad District, Punjab, Pakistan
Summary
As a skilled professional in Data Science and Machine Learning, I
am passionate about utilizing data to drive business decisions and
improve organizational efficiency. My expertise includes designing
and implementing ML models, data analysis, visualization, and
proficiency in Python, SQL and data visualization tools like Tableau
and Power BI. I am a strong communicator and team player, able
to convey complex technical concepts and work collaboratively with
cross-functional teams. I am eager to continue my professional
growth in t

In [32]:
with open("Summary.txt", "r", encoding="utf-8") as f:
    summary = f.read()

In [33]:
name = "Arsalan Ali"

In [34]:
system_prompt = f"You are acting as {name}. You are answering questions on {name}'s website, \
particularly questions related to {name}'s career, background, skills and experience. \
Your responsibility is to represent {name} for interactions on the website as faithfully as possible. \
You are given a summary of {name}'s background and LinkedIn profile which you can use to answer questions. \
Be professional and engaging, as if talking to a potential client or future employer who came across the website. \
If you don't know the answer, say so."

system_prompt += f"\n\n## Summary:\n{summary}\n\n## LinkedIn Profile:\n{linkedin}\n\n"
system_prompt += f"With this context, please chat with the user, always staying in character as {name}."


In [35]:
system_prompt

"You are acting as Arsalan Ali. You are answering questions on Arsalan Ali's website, particularly questions related to Arsalan Ali's career, background, skills and experience. Your responsibility is to represent Arsalan Ali for interactions on the website as faithfully as possible. You are given a summary of Arsalan Ali's background and LinkedIn profile which you can use to answer questions. Be professional and engaging, as if talking to a potential client or future employer who came across the website. If you don't know the answer, say so.\n\n## Summary:\n**Summary:**\n\nArsalan Ali is a versatile and skilled AI Engineer and Data Scientist based in Faisalabad, Pakistan, with a strong background in data science, machine learning, and software engineering. Currently working at CX-EX, he brings expertise in Python, SQL, ML model development, automation (using tools like Selenium and BeautifulSoup), and integration of generative AI models such as ChatGPT. His experience spans multiple do

In [36]:
def chat(message, history):
    messages = [{"role": "system", "content": system_prompt}] + history + [{"role": "user", "content": message}]
    response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content
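
For reference, the `messages` list that `chat` sends has a simple shape: one system message, then the earlier turns, then the new user message. The sketch below builds that list with a hypothetical helper, `build_messages`, and makes no API call. It also keeps only the `role` and `content` keys of each history entry. This rests on an assumption: depending on the Gradio version, `type="messages"` history entries may carry extra keys (such as metadata) that some OpenAI-compatible endpoints may not accept.

```python
def build_messages(system_prompt, history, message):
    """Return the chat payload: system prompt, prior turns, then the new user turn."""
    # Keep only the keys the chat completions format needs.
    prior_turns = [{"role": turn["role"], "content": turn["content"]} for turn in history]
    return (
        [{"role": "system", "content": system_prompt}]
        + prior_turns
        + [{"role": "user", "content": message}]
    )


# Illustrative history; the "metadata" key imitates what a UI layer might add.
example_history = [
    {"role": "user", "content": "Hi", "metadata": None},
    {"role": "assistant", "content": "Hello! How can I help?", "metadata": None},
]
payload = build_messages("You are acting as ...", example_history, "What are your top skills?")
for entry in payload:
    print(entry["role"], "->", entry["content"])
```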

In [37]:
gr.ChatInterface(chat, type="messages").launch()

* Running on local URL:  http://127.0.0.1:7862
* To create a public link, set `share=True` in `launch()`.




## Evaluator - Optimizer

1. Be able to ask an LLM to evaluate an answer
2. Be able to rerun if the answer fails evaluation
3. Put this together into one workflow

All without any agentic framework!

### Step 1: Create a Class - An output structure we expect from the Evaluator LLM

In [42]:
# Create a Pydantic model for the Evaluation

from pydantic import BaseModel

class Evaluation(BaseModel):
    is_acceptable: bool
    feedback: str


### Step 2: Create System Prompt for the Evaluator - To assign it the role of an Evaluator

In [43]:
evaluator_system_prompt = f"You are an evaluator that decides whether a response to a question is acceptable. \
You are provided with a conversation between a User and an Agent. Your task is to decide whether the Agent's latest response is acceptable quality. \
The Agent is playing the role of {name} and is representing {name} on their website. \
The Agent has been instructed to be professional and engaging, as if talking to a potential client or future employer who came across the website. \
The Agent has been provided with context on {name} in the form of their summary and LinkedIn details. Here's the information:"

evaluator_system_prompt += f"\n\n## Summary:\n{summary}\n\n## LinkedIn Profile:\n{linkedin}\n\n"
evaluator_system_prompt += f"With this context, please evaluate the latest response, replying with whether the response is acceptable and your feedback."

### Step 3: Create User Prompt for Evaluator - So it knows what User and Agent LLM are discussing

In [44]:
def evaluator_user_prompt(reply, message, history):
    user_prompt = f"Here's the conversation between the User and the Agent: \n\n{history}\n\n"
    user_prompt += f"Here's the latest message from the User: \n\n{message}\n\n"
    user_prompt += f"Here's the latest response from the Agent: \n\n{reply}\n\n"
    user_prompt += f"Please evaluate the response, replying with whether it is acceptable and your feedback."
    return user_prompt
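
Note that `{history}` above is interpolated as the raw Python list of dictionaries. That works, but a plain transcript may be easier for the evaluator to read. The following is a sketch of one alternative, not the notebook's approach; `format_history` and the `User`/`Agent` labels are assumptions chosen to match the wording of the evaluator prompt.

```python
def format_history(history):
    """Render a list of {"role", "content"} dicts as a readable transcript."""
    labels = {"user": "User", "assistant": "Agent"}
    lines = []
    for turn in history:
        role = turn["role"]
        if role == "system":
            continue  # the system prompt is not part of the visible conversation
        lines.append(f"{labels.get(role, role)}: {turn['content']}")
    return "\n".join(lines)


history = [
    {"role": "user", "content": "Where are you based?"},
    {"role": "assistant", "content": "I'm based in Faisalabad, Pakistan."},
]
transcript = format_history(history)
print(transcript)
```

Inside `evaluator_user_prompt`, `{history}` could then become `{format_history(history)}`.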

### Step 4: Choose the LLM that will act as the Evaluator

In [45]:
import os
gemini = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"), 
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)
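
If `GOOGLE_API_KEY` is missing from the environment, `os.getenv` returns `None`, and the resulting failure can be confusing. A small guard can surface the problem before any client is created. This is only a sketch: `require_env` is a hypothetical helper, not part of the OpenAI SDK, and `DEMO_API_KEY` is a throwaway name used for the demonstration.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Demonstration with a throwaway variable (hypothetical, for illustration only).
os.environ["DEMO_API_KEY"] = "abc123"
print(require_env("DEMO_API_KEY"))
```

In the cell above, this would replace `os.getenv("GOOGLE_API_KEY")` with `require_env("GOOGLE_API_KEY")`.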

### Step 5: Create a Function that will:
* Assign the role (Evaluator) to the LLM
* Provide the User & Agent LLM discussion to the Evaluator

In [46]:
def evaluate(reply, message, history) -> Evaluation:

    messages = [{"role": "system", "content": evaluator_system_prompt}] + [{"role": "user", "content": evaluator_user_prompt(reply, message, history)}]
    response = gemini.beta.chat.completions.parse(model="gemini-2.0-flash", messages=messages, response_format=Evaluation)
    return response.choices[0].message.parsed

### Step 6: Manually Test if the Evaluator is Working:
* Get a response from the Agent LLM by asking it a question
* Use the Evaluator function to check that the Evaluator is working well

In [47]:
messages = [{"role": "system", "content": system_prompt}] + [{"role": "user", "content": "do you hold a patent?"}]
response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages)
reply = response.choices[0].message.content
reply

'I currently do not hold any patents. My focus has primarily been on developing skills in AI, machine learning, and data science, along with practical applications of these technologies. If you have any questions about my projects or areas of expertise, I would be more than happy to discuss them!'

In [48]:
evaluate(reply, "do you hold a patent?", messages[:1])

Evaluation(is_acceptable=True, feedback='The response is appropriate and professional. It acknowledges that Arsalan does not hold any patents and redirects the conversation back to his areas of expertise and projects. This is a good way to keep the conversation relevant and engaging.')

### Step 7: Add another function - If Evaluator thinks the Agent LLM response is not acceptable
* Provide the Agent LLM with the reason the Evaluator rejected its response
* Ask the Agent LLM to answer the user's query again, this time keeping the Evaluator's rejection in mind

In [49]:
def rerun(reply, message, history, feedback):
    updated_system_prompt = system_prompt + f"\n\n## Previous answer rejected\nYou just tried to reply, but the quality control rejected your reply\n"
    updated_system_prompt += f"## Your attempted answer:\n{reply}\n\n"
    updated_system_prompt += f"## Reason for rejection:\n{feedback}\n\n"
    messages = [{"role": "system", "content": updated_system_prompt}] + history + [{"role": "user", "content": message}]
    response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content

### Gradio - All the above in Gradio Chat
* To test the rerun path, the chat function watches for the word "patent" in the user's message
* If the message contains "patent", an instruction to reply entirely in pig latin is added to the Agent LLM's system prompt
* The Evaluator should reject the pig latin response, and the Agent LLM is then asked to reply again

In [50]:
def chat(message, history):
    if "patent" in message.lower():  # case-insensitive, so "Patent" also triggers the test
        system = system_prompt + (
            "\n\nEverything in your reply needs to be in pig latin - "
            "it is mandatory that you respond only and entirely in pig latin"
        )
    else:
        system = system_prompt
    messages = [{"role": "system", "content": system}] + history + [{"role": "user", "content": message}]
    response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages)
    reply = response.choices[0].message.content

    evaluation = evaluate(reply, message, history)
    
    if evaluation.is_acceptable:
        print("Passed evaluation - returning reply")
    else:
        print("Failed evaluation - retrying")
        print(evaluation.feedback)
        reply = rerun(reply, message, history, evaluation.feedback)       
    return reply
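
The `chat` function above retries once and returns the second reply without evaluating it again. A possible extension is to loop up to a fixed number of attempts, re-evaluating each new reply. The sketch below shows only the control flow, with the model calls passed in as plain functions so it runs without an API key. `Verdict`, `answer_with_retries`, `max_attempts`, and the stub functions are hypothetical names, not part of the notebook.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Stand-in for the notebook's Evaluation model (same two fields)."""
    is_acceptable: bool
    feedback: str


def answer_with_retries(generate, evaluate, regenerate, max_attempts=3):
    """Generate a reply, then regenerate with feedback until it passes or attempts run out."""
    reply = generate()
    for attempt in range(1, max_attempts + 1):
        verdict = evaluate(reply)
        if verdict.is_acceptable:
            return reply, attempt
        if attempt < max_attempts:
            reply = regenerate(reply, verdict.feedback)
    # Out of attempts: return the last reply even though it was rejected.
    return reply, max_attempts


# Stubs that imitate the pig latin test: the first draft fails, the rewrite passes.
def fake_generate():
    return "Iway oday otnay oldhay anyway atentspay."


def fake_evaluate(reply):
    if "atentspay" in reply:
        return Verdict(False, "Reply is in pig latin.")
    return Verdict(True, "Clear and professional.")


def fake_regenerate(previous_reply, feedback):
    return "I do not currently hold any patents."


final_reply, attempts_used = answer_with_retries(fake_generate, fake_evaluate, fake_regenerate)
print(final_reply, attempts_used)
```

In the notebook, `generate`, `evaluate`, and `regenerate` would wrap the first model call, `evaluate(reply, message, history)`, and `rerun(reply, message, history, feedback)` respectively. Capping the attempts keeps latency and API cost bounded if the evaluator keeps rejecting replies.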

In [51]:
gr.ChatInterface(chat, type="messages").launch()

* Running on local URL:  http://127.0.0.1:7863
* To create a public link, set `share=True` in `launch()`.




Failed evaluation - retrying
The response is not acceptable because it is written in pig latin and nonsensical. Arsalan does not have patents, so the response should have said that.
