<a href="https://colab.research.google.com/github/Tiru28/ModernAIPro/blob/main/Evaluating_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How do we evaluate LLMs?
Part of the Modern AI Pro series

In [None]:
!pip install -U -q accelerate langchain langchain-groq langchain_community

## Step 1: Set up an LLM to evaluate. We will use Phi-3 here.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")




In [None]:
# Build the pipeline once; rebuilding it on every call is unnecessary work.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Greedy decoding (do_sample=False), so no sampling randomness is involved.
generation_args = {
    "max_new_tokens": 2000,
    "return_full_text": False,
    "do_sample": False,
}

def generate(user_question, system_role="You are a helpful digital assistant. Please provide safe, ethical and accurate information to the user."):
    """Return the model's reply to a single user question."""
    messages = [{"role": "system", "content": system_role},
                {"role": "user", "content": user_question}]

    output = pipe(messages, **generation_args)
    generated_text = output[0]["generated_text"]
    return generated_text.strip()

In [None]:
print(generate("Write a 2 line poem on Indian elections."))



In the land of diverse dreams,
Elections bring hope, as the nation gleams.


## Step 2: Understand the evaluation process.

In [None]:
# We will use a simple utility to make the text wrap properly when printing.
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [None]:
import os
from google.colab import userdata
from langchain_groq import ChatGroq
os.environ["GROQ_API_KEY"] = userdata.get("GROQ_API_KEY")
llm = ChatGroq(model_name="llama3-70b-8192")

In [None]:
from langchain.evaluation import Criteria
list(Criteria)

[<Criteria.CONCISENESS: 'conciseness'>,
 <Criteria.RELEVANCE: 'relevance'>,
 <Criteria.CORRECTNESS: 'correctness'>,
 <Criteria.COHERENCE: 'coherence'>,
 <Criteria.HARMFULNESS: 'harmfulness'>,
 <Criteria.MALICIOUSNESS: 'maliciousness'>,
 <Criteria.HELPFULNESS: 'helpfulness'>,
 <Criteria.CONTROVERSIALITY: 'controversiality'>,
 <Criteria.MISOGYNY: 'misogyny'>,
 <Criteria.CRIMINALITY: 'criminality'>,
 <Criteria.INSENSITIVITY: 'insensitivity'>,
 <Criteria.DEPTH: 'depth'>,
 <Criteria.CREATIVITY: 'creativity'>,
 <Criteria.DETAIL: 'detail'>]

In [None]:
from langchain.evaluation import load_evaluator

In [None]:
question = "What's 2+2?"
prediction = generate(question)
prediction

'The sum of 2+2 is 4.'

In [None]:
evaluator = load_evaluator("criteria", llm=llm, criteria="helpfulness")
eval_result = evaluator.evaluate_strings(
    prediction=prediction,
    input=question,
)
eval_result

{'reasoning': 'To assess if the submission meets the "helpfulness" criteria, let\'s go through it step-by-step:\n\n1. Is the submission helpful?\n   - It provides the correct answer to the arithmetic problem "2+2".\n   - It explains that this is a basic addition problem.\n   - Therefore, the submission is helpful in answering the input question.\n\n2. Is the submission insightful?\n   - While it states that "2+2" is a basic arithmetic addition problem, it doesn\'t provide any particularly insightful information beyond the straightforward answer.\n   - However, the criteria doesn\'t explicitly require the submission to be insightful, only helpful and appropriate.\n\n3. Is the submission appropriate?\n   - The language used is appropriate and respectful.\n   - The content directly addresses the input question without any irrelevant or inappropriate information.\n   - Therefore, the submission is appropriate.\n\nBased on the step-by-step analysis, the submission meets the "helpfulness" cr

In [None]:
evaluator = load_evaluator("criteria", llm=llm, criteria="conciseness")
eval_result = evaluator.evaluate_strings(
    prediction=prediction,
    input=question,
)
eval_result

{'reasoning': 'To assess if the submission meets the conciseness criterion, I will break down my reasoning step-by-step:\n\n1. The input question is "What\'s 2+2?". This is a very concise and straightforward question.\n\n2. The submission answer is "2+2 equals 4. It\'s a basic arithmetic addition problem."\n\n3. The first sentence "2+2 equals 4" directly and concisely answers the question.\n\n4. However, the second sentence "It\'s a basic arithmetic addition problem" provides additional context that was not asked for in the original question.\n\n5. While the additional context is not incorrect, it does make the submission slightly less concise than it could have been.\n\n6. A more concise submission could have simply been "4" or "2+2 equals 4".\n\nBased on this reasoning, the submission does not fully meet the conciseness criterion, as it includes an extra sentence beyond directly answering the question.\n\nN',
 'value': 'N',
 'score': 0}
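The result dictionaries above carry three fields: the judge's free-text `reasoning`, a `value` of `Y` or `N`, and a numeric `score`. As a rough illustration of how a final verdict line can be mapped to those fields, here is a minimal sketch. The function name `parse_verdict` is hypothetical, and this is not LangChain's own parser, only an approximation of the pattern visible in the outputs.

```python
def parse_verdict(judge_text):
    """Split a judge response into reasoning, value and score.

    Assumes the verdict is the last non-empty line and is exactly "Y" or "N",
    as in the outputs shown above. Anything else yields value=None, score=None.
    """
    lines = [line.strip() for line in judge_text.strip().splitlines() if line.strip()]
    verdict = lines[-1].upper() if lines else ""
    if verdict == "Y":
        value, score = "Y", 1
    elif verdict == "N":
        value, score = "N", 0
    else:
        # No clean verdict line: leave the decision to a human reviewer.
        value, score = None, None
    return {"reasoning": judge_text.strip(), "value": value, "score": score}


example = parse_verdict("The answer adds an unrequested sentence.\n\nN")
print(example["value"], example["score"])  # prints: N 0
```

Returning `None` instead of guessing keeps an ambiguous judge response from being silently counted as a pass or a fail.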

### For completeness, we will also test responses that a good model like this one is unlikely to produce.

In [None]:
evaluator = load_evaluator("criteria", llm=llm, criteria="relevance")
eval_result = evaluator.evaluate_strings(
    prediction="What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.",
    input="Tell me if a circle can be squared?",
)
eval_result

{'reasoning': 'Let\'s break down the criteria and assess the submission step by step:\n\n**Relevance: Is the submission referring to a real quote from the text?**\n\n1. The input asks if a circle can be squared.\n2. The submission responds with "What\'s 2+2? That\'s an elementary question. The answer you\'re looking for is that two and two is four."\n3. There is no mention of the original question "can a circle be squared" in the submission.\n4. The submission appears to be a non-sequitur, unrelated to the original question.\n\nBased on this analysis, the submission does not meet the relevance criterion.\n\nN',
 'value': 'N',
 'score': 0}

In [None]:
india_freedom_essay = """India's journey to independence is a profound narrative of struggle, sacrifice, and triumph against colonial rule, marked predominantly by the period of British occupation that began in 1757 following the Battle of Plassey. This pivotal event set the stage for nearly two centuries of British dominance over the vast and diverse land that is today known as India.

The story of India's fight for freedom is intricate and layered, involving a multitude of strategies, from non-violent civil disobedience to outright rebellion. It is a story deeply entwined with the lives and legacies of numerous freedom fighters whose dedication to the cause of liberation was unwavering.

In the early 20th century, the Indian National Congress, initially established as a forum for airing Indian public concerns, gradually began to advocate for outright self-rule. Under the leadership of figures like Bal Gangadhar Tilak, who famously declared "Swaraj is my birthright, and I shall have it," the movement began to gain substantial momentum. However, it was the return of Mohandas Karamchand Gandhi from South Africa in 1915 that marked a significant turning point for the Indian freedom struggle.

Gandhi's philosophy of non-violence and satyagraha, or truth-force, resonated deeply with both the masses and the elite. His ability to mobilize a whole nation into non-violent resistance brought a new dimension to the struggle, making it a mass movement that encompassed every segment of Indian society. Gandhi's leadership in the Non-Cooperation Movement (1920-1922), the Salt March of 1930, and the Quit India Movement of 1942 were crucial in uniting the Indian populace against the British.

The call for Purna Swaraj, or complete self-rule, was formally proclaimed by the Indian National Congress on January 26, 1930, and this date was celebrated as Independence Day until 1947. During these years, other key figures also emerged, such as Jawaharlal Nehru, whose impassioned speeches and vision for an independent India galvanized many; Sardar Vallabhbhai Patel, whose diplomatic acumen was pivotal in uniting the princely states with the Indian Union post-independence; and Subhas Chandra Bose, whose radical approaches and formation of the Indian National Army challenged British authority in unprecedented ways.

The path to independence was fraught with challenges. The British employed various strategies to suppress the growing unrest, from brutal crackdowns to offering reforms such as the Montagu-Chelmsford Reforms and later, the Government of India Act of 1935, which sought to placate the demand for self-governance but ultimately fell short of true self-rule.

World War II played a significant role in the British decision to exit India. The post-war economic strain weakened Britain’s ability to sustain its overseas colonies. Moreover, the naval mutiny of 1946 and widespread revolts in the army signaled that British control over the Indian armed forces was waning. These elements combined with the constant pressure from the freedom movement compelled Britain to reassess its position.

Finally, in 1947, after years of relentless struggle and negotiations, the British Parliament passed the Indian Independence Act, and on August 15, 1947, India awakened to freedom. The joy of independence, however, was marred by the partition of India into two nations, India and Pakistan, which led to widespread communal riots and massive migrations.

India's independence from British rule is a testament to the resilience and tenacity of its people. The journey was long, with many sacrifices, but it forged a nation committed to democracy and social justice. The legacy of the freedom struggle continues to inspire future generations and shapes the ethos of the world’s largest democracy."""

In [None]:
# A "labeled_criteria" evaluator also takes a reference text to grade against.
correctness_evaluator = load_evaluator("labeled_criteria", llm=llm, criteria="correctness")

eval_result = correctness_evaluator.evaluate_strings(
    input="When did India attain freedom from British rule?",
    prediction="Aug 15, 1947",
    reference=india_freedom_essay,
)
print(f'With ground truth: {eval_result["score"]}')

With ground truth: 1


In [None]:
whatsapp_history = """
My ancal said that India worked very hard for freedom and got it on Sep 1, 1980. He and his chacha got it for us.
He was very proud and gave me Jalebi to celebrate it.
"""

correctness_evaluator = load_evaluator("labeled_criteria", llm=llm, criteria="correctness")

# The reference passed here contains a wrong date (Sep 1, 1980),
# so this run shows how much the judge leans on the reference.
eval_result = correctness_evaluator.evaluate_strings(
    input="When did India attain freedom from British rule?",
    prediction="Aug 15, 1947",
    reference=whatsapp_history,
)
print(f'With ground truth: {eval_result}')

With ground truth: {'reasoning': "Here's my step-by-step reasoning:\n\n**Criterion: Correctness**\n\n* The submission states that India attained freedom from British rule on August 15, 1947.\n* To verify this, I need to check if this date is correct, accurate, and factual.\n* Historical records confirm that India indeed gained independence from British rule on August 15, 1947.\n* Therefore, the submission meets the correctness criterion.\n\nY", 'value': 'Y', 'score': 1}
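In the run above the judge marked the answer as correct even though the supplied reference gives a different date, and its reasoning appeals to "historical records" rather than to the reference. One cheap way to notice that kind of disagreement is a deterministic cross-check between prediction and reference. The sketch below is a deliberately crude example: the helper `years_supported_by_reference` is hypothetical, it only compares four-digit years, and the two sample strings are shortened from the texts used earlier.

```python
import re

def years_supported_by_reference(prediction, reference):
    """Return True if every 4-digit year in the prediction also appears in the reference.

    Only years are compared, not full dates or meaning, so treat this as a
    sanity check that complements the LLM judge rather than a grader.
    """
    year_pattern = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")
    predicted_years = set(year_pattern.findall(prediction))
    reference_years = set(year_pattern.findall(reference))
    # With no years in the prediction there is nothing to verify.
    if not predicted_years:
        return True
    return predicted_years <= reference_years


trusted = "The British Parliament passed the Indian Independence Act, and on August 15, 1947, India awakened to freedom."
untrusted = "India worked very hard for freedom and got it on Sep 1, 1980."

print(years_supported_by_reference("Aug 15, 1947", trusted))    # True
print(years_supported_by_reference("Aug 15, 1947", untrusted))  # False
```

When the judge says `Y` but a check like this says the reference does not support the answer, either the reference or the judge deserves a closer look.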


## Step 3: Expand the criteria.

In [None]:
evaluation_criteria = {
    # Level 0
    "conciseness": "Is the output succinct and to the point?",
    "coherence": "Do the responses make logical sense?",
    "harmfulness": "Could the output cause harm or direct someone to cause harm?",
    "maliciousness": "Does the output show intent to cause harm or enable harmful behavior?",
    "helpfulness": "Is the information provided by the output useful and informative?",
    "controversiality": "Does the output potentially lead to disputes or controversy?",
    "sexism": "Does the output display or promote hatred towards any gender?",
    "criminality": "Does the output encourage or provide guidance on committing crimes?",
    "insensitivity": "Does the output show a lack of sensitivity towards certain groups or topics?",
    "depth": "Does the output provide thorough and detailed explanations?",
    "creativity": "Is the output original and creative?",
    "detail": "Are the responses rich in detail and explanation?",

    # Level 1
    "world_knowledge": "Does the output correctly reflect current and historical factual information?",
    "mathematical_correctness": "Does the output contain numeric or mathematical information, and is it accurate?",
    "classification_ability": "Can the model classify items correctly based on provided attributes?",
    "translation_accuracy": "How accurately does the model translate text between different languages?",
    "scientific_reasoning": "Does the model apply scientific principles correctly to reason through problems?",
    "legal_ethical_reasoning": "Does the model understand and apply legal and ethical principles correctly?",
    "literary_artistic_understanding": "Does the model show a deep understanding of literature and art?",
    "environmental_awareness": "Does the model demonstrate knowledge of ecological and environmental issues?",
    "emotional_intelligence": "Can the model recognize and appropriately respond to human emotions?",
    "technical_support_skills": "Does the model provide accurate and practical technical support and advice?",

    # Level 2
    "argumentative_validity": "Does the model construct valid and sound arguments?",
    "empathy": "Does the output reflect an understanding of the user's emotional state or perspective?",
    "humor": "Is the model capable of appropriately using humor where relevant?",
    "persuasiveness": "How effectively does the output persuade the reader or listener?",
    "contextual_understanding": "How well does the model understand and use context in its responses?"
}
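The levels in the dictionary above exist only as comments, so code cannot ask which level a criterion belongs to. If that information is needed later, one option is to store the grouping explicitly and flatten it on demand. The following is a minimal sketch under that assumption: `criteria_by_level`, `flatten_criteria`, and `level_of` are hypothetical names, and only a few entries are copied over to keep it short.

```python
# Hypothetical layout: the same kind of criteria, grouped by level explicitly.
criteria_by_level = {
    0: {
        "conciseness": "Is the output succinct and to the point?",
        "coherence": "Do the responses make logical sense?",
    },
    1: {
        "world_knowledge": "Does the output correctly reflect current and historical factual information?",
    },
    2: {
        "empathy": "Does the output reflect an understanding of the user's emotional state or perspective?",
    },
}

def flatten_criteria(grouped):
    """Merge all levels into one {name: description} dict, rejecting duplicate names."""
    flat = {}
    for level in sorted(grouped):
        for name, description in grouped[level].items():
            if name in flat:
                raise ValueError(f"criterion {name!r} appears in more than one level")
            flat[name] = description
    return flat

def level_of(name, grouped):
    """Return the level a criterion belongs to, or None if it is unknown."""
    for level, criteria in grouped.items():
        if name in criteria:
            return level
    return None

flat = flatten_criteria(criteria_by_level)
print(len(flat), level_of("empathy", criteria_by_level))  # prints: 4 2
```

The flattened dictionary has the same shape as `evaluation_criteria`, so it could be indexed the same way when building a single-criterion evaluator.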


In [None]:
technical_evaluation_criteria = {
    "algorithm_validity": "Does the output present an algorithm that is logically sound and applicable for solving the given problem?",
    "code_formatting": "Is the code following standard formatting conventions, including indentation, spacing, and bracket placement?",
    "code_conciseness_elegance": "Is the code written in a way that is both brief and expressive, using the most efficient methods and structures?",
    "code_compilability": "Will the code compile without errors, and does it include all necessary components such as libraries or headers?",
    "demonstration_of_computer_science_core_aspects": "Does the output reflect a strong understanding of computer science principles such as data structures, algorithms, runtime complexity, and design patterns?",
    "understanding_of_development_process": "Does the output demonstrate a grasp of the software development lifecycle, including planning, development, testing, deployment, and maintenance?",
    "suitability_of_test_cases": "Are the provided test cases comprehensive and do they cover a wide range of inputs, including edge cases and potential error conditions?",
    "code_modularity": "Is the code organized into modules or functions that can be reused and maintained easily?",
    "code_comments_documentation": "Does the code include comments and documentation that explain complex or non-obvious parts of the logic?",
    "error_handling": "Does the code include appropriate error handling and reporting mechanisms?",
    "security_best_practices": "Does the code adhere to security best practices to prevent common vulnerabilities?",
    "performance_optimization": "Is the code optimized for performance, minimizing resource usage and execution time where possible?",
    "coding_standards_compliance": "Does the code comply with industry coding standards specific to the language or environment?",
    "version_control_integration": "Does the output demonstrate knowledge of Git and of the errors that arise from merge conflicts?",
    "test_driven_development": "Does the output reflect a test-driven development approach with unit tests written before the actual code?",
    "scalability_and_maintainability": "Is the code structured in a way that it can be scaled and maintained easily over time?",
    "cross_platform_compatibility": "Does the code take into account different operating systems or platforms where it might be used?",
    "code_readability": "Is the code easily understandable by other developers, with clear variable names and straightforward logic?",
    "integration_with_existing_systems": "Does the code account for integration with existing systems and how it will fit into the larger ecosystem?"
}
}
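With this many hand-written criteria, small inconsistencies in key naming or phrasing are easy to miss, and the description text is what the judge model actually reads. A quick lint pass over the dictionary can catch them before any API calls are made. The sketch below assumes two conventions that the criteria in this notebook mostly follow: keys are snake_case and descriptions are phrased as questions. The function `find_criteria_issues` and the sample data are hypothetical.

```python
import re

def find_criteria_issues(criteria):
    """Return a list of (key, problem) pairs for entries that break simple conventions."""
    issues = []
    for key, description in criteria.items():
        # Keys: lowercase words or digits joined by single underscores.
        if not re.fullmatch(r"[a-z0-9]+(_[a-z0-9]+)*", key):
            issues.append((key, "key is not snake_case"))
        # Descriptions: the judge is asked a question, so expect a question mark.
        if not description.strip().endswith("?"):
            issues.append((key, "description is not phrased as a question"))
    return issues

# Made-up sample: the second entry breaks both conventions on purpose.
sample = {
    "code_readability": "Is the code easily understandable by other developers?",
    "Error-Handling": "Does the code include appropriate error handling",
}
for key, problem in find_criteria_issues(sample):
    print(f"{key}: {problem}")
```

The same call can be pointed at `evaluation_criteria` or `technical_evaluation_criteria`; an empty list means every entry follows both conventions.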


## Step 4: Evaluate the LLM.

In [None]:
from langchain.evaluation import EvaluatorType
criteria_questions = [
    ("world_knowledge", "What caused the fall of the Roman Empire?"),
    ("mathematical_correctness", "Can you solve the equation x^2 - 4x + 4 = 0?"),
    ("classification_ability", "Given the following list of financial instruments: '10-year Treasury Note', 'S&P 500 Index Fund', 'Series EE Savings Bonds', 'Corporate Convertible Bond', 'Credit Default Swap', 'Term Loan B', and 'Non-convertible Debentures', can you classify them into 'debt instruments', 'equity instruments', and 'derivatives'? Explain the rationale behind your classification."),
    ("translation_accuracy", "How would you translate 'Hello, how are you?' to French?"),
    ("scientific_reasoning", "Explain the theory of relativity."),
    ("legal_ethical_reasoning", "What are the legal implications of copyright infringement?"),
    ("literary_artistic_understanding", "Analyze the themes present in Shakespeare's 'Hamlet'."),
    ("environmental_awareness", "What are the primary causes of global warming?"),
    ("emotional_intelligence", "How would you comfort someone who has just lost a loved one?"),
    ("technical_support_skills", "How would you troubleshoot a router that's not connecting to the internet?")
]


results = []

for criterion, question in criteria_questions:
    # Build a single-criterion evaluator from our own description of the criterion.
    custom_criteria = {criterion: evaluation_criteria[criterion]}
    print("\n \n ",custom_criteria)
    evaluator = load_evaluator(EvaluatorType.CRITERIA, llm=llm, criteria=custom_criteria)

    # Phi-3 answers the question; the Groq-hosted Llama 3 model judges the answer.
    prediction = generate(question)
    print(question, prediction)

    eval_result = evaluator.evaluate_strings(
        prediction=prediction,
        input=question,
    )

    print(eval_result)
    results.append((criterion, question, prediction, eval_result))



 
  {'world_knowledge': 'Does the output correctly reflect current and historical factual information?'}
What caused the fall of the Roman Empire? The fall of the Roman Empire was a complex process that occurred over several centuries, with multiple contributing factors. Historians often cite a combination of internal weaknesses and external pressures that led to the decline and eventual fall of the empire. Here are some of the key factors:

1. Political instability: The Roman Empire faced frequent changes in leadership, with emperors being overthrown or assassinated. This led to a lack of continuity and stability in governance, making it difficult to address the empire's problems effectively.

2. Economic troubles: The empire faced significant economic challenges, including inflation, a decline in trade, and a heavy tax burden on its citizens. These issues weakened the empire's financial resources and made it difficult to maintain its military and infrastructure.

3. Military problem
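Reading the printed dictionaries one by one gets tedious once there are ten or more questions. Since the loop above stores each run as a `(criterion, question, prediction, eval_result)` tuple, the list can be condensed into a short report. The sketch below is one possible way to do that: `summarize` is a hypothetical helper, and the rows in `demo_results` are made up for illustration and are not results from this notebook.

```python
def summarize(results):
    """Print one line per evaluated question and return the overall pass rate.

    Expects (criterion, question, prediction, eval_result) tuples, where
    eval_result is a dict whose "score" is 1, 0, or missing/None.
    """
    scores = []
    for criterion, question, _prediction, eval_result in results:
        score = eval_result.get("score")
        if score is None:
            verdict = "no verdict"
        else:
            verdict = "pass" if score == 1 else "fail"
            scores.append(score)
        print(f"{criterion:<35} {verdict:<10} {question[:50]}")
    # Rows without a verdict are left out of the rate instead of counted as fails.
    return sum(scores) / len(scores) if scores else None

# Made-up rows for illustration only.
demo_results = [
    ("world_knowledge", "Example question A?", "example answer", {"score": 1, "value": "Y"}),
    ("mathematical_correctness", "Example question B?", "example answer", {"score": 0, "value": "N"}),
    ("translation_accuracy", "Example question C?", "example answer", {"score": None, "value": None}),
]
print(summarize(demo_results))  # prints the three rows, then 0.5
```

Calling `summarize(results)` after the loop gives the same overview for the real run.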

In [None]:
criteria_questions_advanced = [
    ("world_knowledge", "Analyze the economic factors that contributed to the fall of the Roman Empire."),
    ("mathematical_correctness", "How would you interpret the Eigenvalues of a matrix in the context of financial risk modeling?"),
    ("classification_ability", "Given the following list of financial instruments: '10-year Treasury Note', 'S&P 500 Index Fund', 'Series EE Savings Bonds', 'Corporate Convertible Bond', 'Credit Default Swap', 'Term Loan B', and 'Non-convertible Debentures', can you classify them into 'debt instruments', 'equity instruments', and 'derivatives'? Explain the rationale behind your classification."),
    ("translation_accuracy", "Translate the following financial statement term into Mandarin Chinese: 'Accrued Liabilities'."),
    ("scientific_reasoning", "Discuss the role of quantum entanglement in quantum computing."),
    ("legal_ethical_reasoning", "Evaluate the ethical considerations of using predictive policing algorithms in law enforcement."),
    ("literary_artistic_understanding", "Discuss the use of iambic pentameter as a narrative device in Shakespeare's plays."),
    ("environmental_awareness", "Critique the effectiveness of carbon credits in reducing global carbon emissions."),
    ("emotional_intelligence", "How would you navigate a negotiation between parties with a significant power imbalance?"),
    ("technical_support_skills", "Design a high-level plan for a company-wide upgrade from IPv4 to IPv6.")
]

# Use a separate list so the results of the basic run above are not overwritten.
advanced_results = []

for criterion, question in criteria_questions_advanced:
    custom_criteria = {criterion: evaluation_criteria[criterion]}
    print("\n \n ",custom_criteria)
    evaluator = load_evaluator(EvaluatorType.CRITERIA, llm=llm, criteria=custom_criteria)

    prediction = generate(question)
    print(question, prediction)

    eval_result = evaluator.evaluate_strings(
        prediction=prediction,
        input=question,
    )

    print(eval_result)
    advanced_results.append((criterion, question, prediction, eval_result))


 
  {'world_knowledge': 'Does the output correctly reflect current and historical factual information?'}
Analyze the economic factors that contributed to the fall of the Roman Empire. The fall of the Roman Empire was a complex process influenced by various economic factors. Here are some key economic factors that contributed to the decline of the Roman Empire:

1. Inflation and debasement of currency: The Roman Empire faced severe inflation due to the excessive minting of coins with lower silver content. This led to a decrease in the value of the currency, which in turn caused economic instability and reduced purchasing power for the citizens.

2. Heavy taxation: To fund the vast military expenses and maintain the empire's infrastructure, the Roman government imposed heavy taxes on its citizens. This placed a significant burden on the population, particularly on the lower classes, and led to widespread discontent and economic hardship.

3. Over-reliance on slave labor: The Roman econo