# Evaluate library

Learn to use the Hugging Face `evaluate` library.

**Note:** The library does not include an LLM-as-a-Judge metric.

https://huggingface.co/docs/evaluate/en/index

**Classes**

https://huggingface.co/docs/evaluate/v0.4.0/en/package_reference/evaluator_classes

https://huggingface.co/docs/evaluate/v0.4.0/en/package_reference/main_classes

## Installation

In [1]:
# !pip install --quiet evaluate langchain-groq

## 1. Evaluate basics

In [1]:
import evaluate

# Load the metric once and reuse the object for both printouts
accuracy_metric = evaluate.load("accuracy")
print("Accuracy description: ", accuracy_metric.description)
print("Accuracy features: ", accuracy_metric.features)

# print("Precision description: ", evaluate.load("precision").description)
# print("Precision features: ", evaluate.load("precision").features)

Accuracy description:  
Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
 Where:
TP: True positive
TN: True negative
FP: False positive
FN: False negative

Accuracy features:  {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}
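
As a quick check of the formula in the description above, accuracy can be computed by hand from confusion-matrix counts. This is a minimal sketch in plain Python. The helper name `accuracy_from_counts` and the example counts are made up for illustration and are not part of the `evaluate` library.

```python
def accuracy_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Illustrative counts (assumed): 2 true positives, 1 false negative, nothing else
example_accuracy = accuracy_from_counts(tp=2, tn=0, fp=0, fn=1)
print(example_accuracy)
```

With these counts, two of three cases are correct, so the value is 2/3.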


### Compute single metric

In [2]:
accuracy = evaluate.load("accuracy")

result = accuracy.compute(predictions=[1,0,1], references=[1,1,1])

result

{'accuracy': 0.6666666666666666}

### Compute multiple metrics

In [3]:
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
result = clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])
result

{'accuracy': 0.6666666666666666,
 'f1': 0.6666666666666666,
 'precision': 1.0,
 'recall': 0.5}
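
The four numbers above can be reproduced by hand, which helps when a combined result looks surprising. The sketch below assumes label `1` is the positive class, which is the usual default for binary precision, recall and F1. The variable names are illustrative and chosen so they do not overwrite anything defined in the notebook.

```python
predictions = [0, 1, 0]
references = [0, 1, 1]

pairs = list(zip(predictions, references))

# Confusion-matrix counts, treating label 1 as the positive class (assumption)
tp = sum(1 for p, r in pairs if p == 1 and r == 1)
fp = sum(1 for p, r in pairs if p == 1 and r == 0)
fn = sum(1 for p, r in pairs if p == 0 and r == 1)
tn = sum(1 for p, r in pairs if p == 0 and r == 0)

manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)
manual_metrics = {
    "accuracy": (tp + tn) / len(pairs),
    "f1": 2 * manual_precision * manual_recall / (manual_precision + manual_recall),
    "precision": manual_precision,
    "recall": manual_recall,
}
print(manual_metrics)
```

Here there is one true positive, one true negative and one false negative, so precision is 1.0, recall is 0.5, and accuracy and F1 both come out at 2/3.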

## 2. Metric evaluation: Accuracy

### Setup Chain

In [4]:
from dotenv import load_dotenv
import sys
import json

from langchain.prompts import PromptTemplate, ChatPromptTemplate


# Load the .env file that contains the API keys (e.g. GROQ_API_KEY for the Groq models used below)
load_dotenv('C:\\Users\\raj\\.jupyter\\.env')

# setting path
sys.path.append('../')





In [5]:

from utils.create_chat_llm import create_gpt_chat_llm, create_cohere_chat_llm, create_hugging_face_chat_llm, create_groq_chat_llm
from datasets import Dataset

system = """
You are a helpful assistant that assigns a sentiment classification to the given input text. Provide your output as a single number representing the sentiment:

1 for positive sentiment
2 for negative sentiment
0 for neutral sentiment
Examples of expected output:
Input: "it was beautiful"
Output: 1

Input: "i did not like it"
Output: 2

For each new input, output only the number corresponding to its sentiment, with no additional text or explanation.

"""

human = "{text}"
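# Pass the messages as a list: a set has no guaranteed order, so the system
# message could end up after the human message
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", human)])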

# Utility function for creating the chain


def setup_llm(model_name):
    llm = create_groq_chat_llm(model_name=model_name, args={"temperature":0})
    return llm
    
def setup_llm_chain(model_name):

    llm = setup_llm(model_name)

    chain = prompt | llm

    return chain


### Setup test dataset

In [6]:
test_dataset = [
  {"text": "I love the new design of this website!", "reference": 1},
  {"text": "The product stopped working after just two days.", "reference": 2},
  {"text": "It’s an average movie, nothing special.", "reference": 0},
  {"text": "The customer service was exceptional and helpful.", "reference": 1},
  {"text": "I am extremely disappointed with this purchase.", "reference": 2},
  {"text": "The presentation was neither good nor bad.", "reference": 0},
  {"text": "This is the best book I have read this year!", "reference": 1},
  {"text": "The food was cold and tasted terrible.", "reference": 2},
  {"text": "I feel indifferent about the changes.", "reference": 0},
  {"text": "The concert was a fantastic experience.", "reference": 1}
]


In [7]:
model_names = ["mixtral-8x7b-32768", "llama3-8b-8192"]
model_index = 0

model_name = model_names[model_index]

chain = setup_llm_chain(model_name)

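for i, test in enumerate(test_dataset):
    # Classify each example's own text; sending one fixed string for every row
    # would give every row the same prediction
    response = chain.invoke({"text": test["text"]})
    output = response.content
    test_dataset[i]['prediction'] = int(output.strip())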

test_dataset[:2]

[{'text': 'I love the new design of this website!',
  'reference': 1,
  'prediction': 1},
 {'text': 'The product stopped working after just two days.',
  'reference': 2,
  'prediction': 1}]
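
Calling `int()` directly on the model output raises a `ValueError` as soon as the model adds any extra words, which chat models sometimes do despite the prompt. A more forgiving parser might look like the sketch below. The helper `parse_sentiment_label` is hypothetical and not part of LangChain or `evaluate`. It assumes the three labels 0, 1 and 2 defined in the system prompt.

```python
import re
from typing import Optional


def parse_sentiment_label(raw: str) -> Optional[int]:
    """Return the first standalone 0, 1 or 2 in the model output, else None."""
    match = re.search(r"\b([012])\b", raw)
    return int(match.group(1)) if match else None


print(parse_sentiment_label("1"))                        # 1
print(parse_sentiment_label("Output: 2\n"))              # 2
print(parse_sentiment_label("I think it is positive."))  # None
```

Rows that come back as `None` can then be counted or retried separately instead of aborting the whole loop.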

### Evaluate: Accuracy

In [8]:
# Flatten the dataset (separate array for each column)
flattened_data = Dataset.from_list(test_dataset)

# Calculate accuracy
result = accuracy.compute(predictions=flattened_data['prediction'], references=flattened_data['reference'])

result

{'accuracy': 0.4}
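
To make the "one array per column" step concrete, the same idea can be sketched without any library: turn the list of row dictionaries into a dictionary of columns, then compare the two label columns. The three rows below are made-up sample data, not the notebook's results.

```python
# Made-up sample rows in the same shape as test_dataset
rows = [
    {"text": "great", "reference": 1, "prediction": 1},
    {"text": "awful", "reference": 2, "prediction": 1},
    {"text": "okay", "reference": 0, "prediction": 0},
]

# Row-oriented -> column-oriented: one list per key
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Accuracy is the share of positions where prediction and reference agree
matches = sum(p == r for p, r in zip(columns["prediction"], columns["reference"]))
manual_accuracy = matches / len(rows)
print(columns["prediction"], columns["reference"], manual_accuracy)
```

Two of the three sample rows agree, so the sketch prints an accuracy of 2/3.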

## 3. Task-specific metrics



### Check out the response format for the Q&A pipeline/task

In [9]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "deepset/roberta-base-squad2"

# a) Get predictions
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)

result = nlp(
    question='Why is model conversion important?',
    context='The option to convert models between FARM and transformers gives freedom to the user and let people easily switch between frameworks.'
)

result

{'score': 0.2117149531841278,
 'start': 59,
 'end': 84,
 'answer': 'gives freedom to the user'}
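
The `start` and `end` fields are character offsets into the context string, with `end` exclusive, so slicing the context with them recovers the answer text. The short check below reuses the context and the recorded output from the cell above. The names `qa_context` and `qa_result` are illustrative.

```python
qa_context = (
    "The option to convert models between FARM and transformers gives freedom "
    "to the user and let people easily switch between frameworks."
)

# Recorded pipeline output from the cell above
qa_result = {
    "score": 0.2117149531841278,
    "start": 59,
    "end": 84,
    "answer": "gives freedom to the user",
}

# Slice the context with the character offsets
answer_span = qa_context[qa_result["start"]:qa_result["end"]]
print(answer_span)  # gives freedom to the user
```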

### QuestionAnsweringEvaluator

https://github.com/huggingface/evaluate/blob/v0.4.0/src/evaluate/evaluator/question_answering.py#L143

https://huggingface.co/docs/evaluate/v0.4.0/en/package_reference/evaluator_classes#evaluate.QuestionAnsweringEvaluator

In [10]:
from datasets import Dataset, load_dataset

from evaluate import evaluator
from evaluate import QuestionAnsweringEvaluator


qa_evaluator = QuestionAnsweringEvaluator()

qa_dataset = load_dataset("rajpurkar/squad_v2", split="validation[:10]")

results = qa_evaluator.compute(
    model_or_pipeline='distilbert/distilbert-base-cased-distilled-squad',
    data=qa_dataset,
    metric="squad_v2",
    squad_v2_format=True
)

# Display results
print("Evaluation Results:", results)

Evaluation Results: {'exact': 60.0, 'f1': 60.0, 'total': 10, 'HasAns_exact': 100.0, 'HasAns_f1': 100.0, 'HasAns_total': 6, 'NoAns_exact': 0.0, 'NoAns_f1': 0.0, 'NoAns_total': 4, 'best_exact': 60.0, 'best_exact_thresh': 0.996135950088501, 'best_f1': 60.0, 'best_f1_thresh': 0.996135950088501, 'total_time_in_seconds': 1.6325096000218764, 'samples_per_second': 6.125538250964034, 'latency_in_seconds': 0.16325096000218764}
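
The aggregate fields in this output are related to each other in a simple way. The overall `exact` score is the average of the `HasAns` and `NoAns` scores weighted by their sample counts, and the timing fields follow from the total time and the sample count. The sketch below checks this against the recorded numbers. The variable names are illustrative. The `NoAns` scores of 0.0 are plausibly because this checkpoint was fine-tuned on SQuAD v1.1, which has no unanswerable questions.

```python
import math

# Values copied from the recorded output above
reported = {
    "exact": 60.0,
    "total": 10,
    "HasAns_exact": 100.0,
    "HasAns_total": 6,
    "NoAns_exact": 0.0,
    "NoAns_total": 4,
    "total_time_in_seconds": 1.6325096000218764,
    "samples_per_second": 6.125538250964034,
    "latency_in_seconds": 0.16325096000218764,
}

# Weighted average of the two subsets
overall_exact = (
    reported["HasAns_exact"] * reported["HasAns_total"]
    + reported["NoAns_exact"] * reported["NoAns_total"]
) / reported["total"]

# Throughput and per-sample latency derived from the total time
throughput = reported["total"] / reported["total_time_in_seconds"]
latency = reported["total_time_in_seconds"] / reported["total"]

print(overall_exact, throughput, latency)
```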


### Text Classification Evaluator

https://huggingface.co/docs/evaluate/v0.4.0/en/package_reference/evaluator_classes#evaluate.TextClassificationEvaluator

In [12]:
from evaluate import TextClassificationEvaluator

text_classification_evaluator = TextClassificationEvaluator() # evaluator("text-classification")

model_index = 1
model_name = model_names[model_index]

llm = setup_llm(model_name)



In [21]:
test_dataset = [
  {"text": "I love the new design of this website!", "label": 1.0},
  {"text": "The product stopped working after just two days.", "label": 2.0},
  {"text": "It’s an average movie, nothing special.", "label": 0.0},
  {"text": "The customer service was exceptional and helpful.", "label": 1.0}
]

label_mapping={"POSITIVE": 1.0, "NEGATIVE": 2.0, "NEUTRAL": 0.0}

test_dataset_flattened = Dataset.from_list(test_dataset)

results = text_classification_evaluator.compute(
    model_or_pipeline = "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    data = test_dataset_flattened,
    metric = "accuracy",
    label_mapping=label_mapping,
)

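# Display this evaluator's output, stored in `results` (the earlier `result` still holds the accuracy from section 2)
results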
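
The `label_mapping` argument translates the string labels produced by the pipeline into the values stored in the dataset's `label` column before the metric is computed. The sketch below imitates that step in plain Python. The pipeline outputs are made up for illustration, and the variable names are hypothetical. To my knowledge the SST-2 checkpoint only emits `POSITIVE` and `NEGATIVE`, so a row labeled neutral (0.0) can never be matched.

```python
label_mapping = {"POSITIVE": 1.0, "NEGATIVE": 2.0, "NEUTRAL": 0.0}

# Made-up pipeline outputs, one per dataset row (assumption, not real model output)
pipeline_outputs = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.98},
    {"label": "POSITIVE", "score": 0.71},
    {"label": "POSITIVE", "score": 0.95},
]
dataset_labels = [1.0, 2.0, 0.0, 1.0]

# Translate string labels into the dataset's label values, then compare
mapped_predictions = [label_mapping[out["label"]] for out in pipeline_outputs]
mapped_accuracy = sum(
    p == r for p, r in zip(mapped_predictions, dataset_labels)
) / len(dataset_labels)
print(mapped_predictions, mapped_accuracy)
```

With these made-up outputs, three of the four rows match, so the sketch gives an accuracy of 0.75.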