# Evaluating LLM Systems
* Comparing faithfulness, answer_relevancy, context_recall, and context_precision scores using the OpenAI API and the Grok API

[![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LinkedInLearning/generative-ai-and-llmops-deploying-and-managing-llms-in-production-4465782/blob/solution/ch-05/challenge_evaluating_LLM_systems.ipynb)

In [None]:
!pip install ragas -q


In [None]:
from datasets import load_dataset
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from ragas import evaluate
import os
import getpass

In [None]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key: ")

Enter your OpenAI API Key: ··········
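
Re-running the cell above prompts for the key again even when it is already set in the session. A minimal sketch of one way to avoid that, assuming a hypothetical helper named `ensure_api_key` (it is not part of ragas or of the original notebook):

```python
import getpass
import os


def ensure_api_key(name, prompt=getpass.getpass):
    """Return os.environ[name], prompting only when it is missing or empty.

    `ensure_api_key` is a hypothetical helper for illustration. `prompt` is
    injectable so the function can be exercised without an interactive terminal.
    """
    if not os.environ.get(name):
        os.environ[name] = prompt(f"Enter your {name}: ")
    return os.environ[name]


# Intended notebook usage (prompts interactively only if the variable is unset):
# ensure_api_key("OPENAI_API_KEY")
```

The same helper could be reused for any other key the notebook needs, since the environment variable name is just a parameter.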


In [None]:
fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
fiqa_eval

DatasetDict({
    baseline: Dataset({
        features: ['question', 'ground_truths', 'answer', 'contexts'],
        num_rows: 30
    })
})

In [None]:
data=fiqa_eval["baseline"].select(range(10))

In [None]:
data

Dataset({
    features: ['question', 'ground_truths', 'answer', 'contexts'],
    num_rows: 10
})

In [None]:
data[0]

{'question': 'How to deposit a cheque issued to an associate in my business into my business account?',
 'ground_truths': ["Have the check reissued to the proper payee.Just have the associate sign the back and then deposit it.  It's called a third party cheque and is perfectly legal.  I wouldn't be surprised if it has a longer hold period and, as always, you don't get the money if the cheque doesn't clear. Now, you may have problems if it's a large amount or you're not very well known at the bank.  In that case you can have the associate go to the bank and endorse it in front of the teller with some ID.  You don't even technically have to be there.  Anybody can deposit money to your account if they have the account number. He could also just deposit it in his account and write a cheque to the business."],
 'answer': '\nThe best way to deposit a cheque issued to an associate in your business into your business account is to open a business account with the bank. You will need a state-is

In [None]:
# Use map to add a 'reference' column that duplicates 'ground_truths' (context_recall reads 'reference')
data = data.map(lambda row: {'reference': row['ground_truths']})

In [None]:
data

Dataset({
    features: ['question', 'ground_truths', 'answer', 'contexts', 'reference'],
    num_rows: 10
})

In [None]:
data[0]

{'question': 'How to deposit a cheque issued to an associate in my business into my business account?',
 'ground_truths': ["Have the check reissued to the proper payee.Just have the associate sign the back and then deposit it.  It's called a third party cheque and is perfectly legal.  I wouldn't be surprised if it has a longer hold period and, as always, you don't get the money if the cheque doesn't clear. Now, you may have problems if it's a large amount or you're not very well known at the bank.  In that case you can have the associate go to the bank and endorse it in front of the teller with some ID.  You don't even technically have to be there.  Anybody can deposit money to your account if they have the account number. He could also just deposit it in his account and write a cheque to the business."],
 'answer': '\nThe best way to deposit a cheque issued to an associate in your business into your business account is to open a business account with the bank. You will need a state-is

In [None]:
# Convert the 'reference' column to a plain str: 'ground_truths' is a list of strings,
# and the metrics expect 'reference' to be a single string
data = data.map(lambda x: {"reference": str(x.get("reference", "No Reference"))})
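
Because `ground_truths` holds a list of strings, `str()` keeps the list's brackets and quotes inside the resulting text. The sketch below shows the difference on a made-up one-item list, alongside a join-based alternative. `to_reference` is a hypothetical helper that the notebook does not use, and switching to it would likely change the reported scores.

```python
def to_reference(ground_truths, default="No Reference"):
    """Join a list of ground-truth strings into one plain string."""
    if not ground_truths:
        return default
    return " ".join(ground_truths)


# Made-up stand-in for one row's 'ground_truths' value
sample = ["Have the associate sign the back."]

as_str = str(sample)           # same conversion as the map call above
joined = to_reference(sample)  # bracket-free alternative

print(as_str)
print(joined)
```

If the join-based form were preferred, it could be applied with `data.map(lambda row: {"reference": to_reference(row["ground_truths"])})`.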

In [None]:
# Evaluate the dataset with the ragas default evaluator, which uses OpenAI models via OPENAI_API_KEY
openairesult = evaluate(
    data,
    metrics=[
        answer_relevancy,  # How relevant the generated answer is to the question
        context_recall,    # How much of the reference answer is supported by the retrieved contexts
        context_precision, # Whether the retrieved contexts that are relevant to the reference rank near the top
        faithfulness       # How factually consistent the generated answer is with the retrieved contexts
    ]
)
print(openairesult)

{'answer_relevancy': 0.8559, 'context_recall': 0.6808, 'context_precision': 0.9000, 'faithfulness': 0.8877}
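
To make the ranking-based `context_precision` score more concrete, here is a toy calculation following the formula given in the ragas documentation (an assumption worth checking against the installed version): the mean of precision@k over the ranks k that hold a relevant chunk. The relevance verdicts are invented for illustration. In the real metric an LLM judges each chunk, so this is a sketch of the arithmetic and not the library's implementation.

```python
def context_precision_at_k(verdicts):
    """Average precision@k over the ranks that hold a relevant chunk.

    `verdicts` is a ranked list of 0/1 relevance flags (1 = relevant).
    """
    hits = 0
    weighted = 0.0
    for rank, relevant in enumerate(verdicts, start=1):
        if relevant:
            hits += 1
            weighted += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted / hits if hits else 0.0


print(round(context_precision_at_k([1, 0, 1]), 4))  # relevant chunks at ranks 1 and 3
print(round(context_precision_at_k([0, 1, 1]), 4))  # same number of relevant chunks, ranked lower
```

The first ranking scores higher than the second because its relevant chunks appear earlier, which is the behaviour this metric rewards.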


In [None]:
os.environ["Grok_API_KEY"] = getpass.getpass("Enter your Grok API Key: ")

Enter your Grok API Key: ··········
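
Storing a key in `Grok_API_KEY` does not by itself change which model ragas calls. `evaluate` accepts an `llm` argument for that purpose. The following is a hedged sketch of one way to wire it up, assuming that "Grok" refers to xAI's OpenAI-compatible endpoint at `https://api.x.ai/v1`, that the `langchain-openai` package is installed, and that `ragas.llms.LangchainLLMWrapper` is available in the installed ragas version. The model name is a placeholder, and `build_grok_evaluator` is a hypothetical helper.

```python
import os

GROK_BASE_URL = "https://api.x.ai/v1"  # assumption: xAI's OpenAI-compatible endpoint
GROK_MODEL = "grok-model-name"         # placeholder: substitute a model your key can access


def build_grok_evaluator(api_key=None):
    """Return a ragas-compatible evaluator LLM backed by the Grok endpoint (sketch)."""
    # Imported lazily so the function can be defined before the packages are installed.
    from langchain_openai import ChatOpenAI
    from ragas.llms import LangchainLLMWrapper

    chat = ChatOpenAI(
        model=GROK_MODEL,
        api_key=api_key or os.environ["Grok_API_KEY"],
        base_url=GROK_BASE_URL,
    )
    return LangchainLLMWrapper(chat)


# Intended usage (makes network calls, so it is left commented out):
# grokresult = evaluate(data, metrics=[...], llm=build_grok_evaluator())
```

Note that `answer_relevancy` also relies on an embedding model. Unless an `embeddings` argument is passed as well, that part would presumably still use the default OpenAI embeddings.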


In [None]:
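# Grok API
# NOTE: no `llm` argument is passed here, so as written this call uses the same
# default evaluator as the OpenAI run above; the Grok_API_KEY set earlier is not read.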
grokresult = evaluate(
    data,
    metrics=[
        answer_relevancy,
        context_recall,
        context_precision,
        faithfulness
    ]
)
print(grokresult)

{'answer_relevancy': 0.8579, 'context_recall': 0.7124, 'context_precision': 0.9000, 'faithfulness': 0.8377}
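
The two printed results can be compared metric by metric. A minimal sketch, with the scores copied by hand from the outputs above into plain dictionaries (the names `openai_scores`, `grok_scores`, and `score_deltas` are illustrative):

```python
openai_scores = {
    "answer_relevancy": 0.8559,
    "context_recall": 0.6808,
    "context_precision": 0.9000,
    "faithfulness": 0.8877,
}
grok_scores = {
    "answer_relevancy": 0.8579,
    "context_recall": 0.7124,
    "context_precision": 0.9000,
    "faithfulness": 0.8377,
}


def score_deltas(baseline, other):
    """Return other - baseline for every metric, rounded to 4 decimals."""
    return {name: round(other[name] - baseline[name], 4) for name in baseline}


# Positive values mean the second run scored higher than the first
for metric, delta in score_deltas(openai_scores, grok_scores).items():
    print(f"{metric:<18} {delta:+.4f}")
```

With only 10 rows per run, differences of a few hundredths are small and may fall within the run-to-run variation of an LLM-based judge.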
