# Code for Evaluating the LLMs

In [1]:
# Imports
import time
import pandas as pd
from llm import getChatChain
from app import load_documents_into_database
from langchain_community.llms import Ollama
from langchain.evaluation import load_evaluator
from langchain_community.vectorstores import Chroma

In [2]:
# Function that evaluates the precision and accuracy of the LLM.
# It appends one row per question to Stats.csv and returns nothing.
def evaluate(llm_model_name: str, db: Chroma, inicio: float) -> None:
    accuracy_criteria = {
    "accuracy": """
        Score 1: The answer is completely irrelevant or incoherent in relation to the reference.
        Score 2: The answer is mostly irrelevant, with few or no correct parts.
        Score 3: The answer has some relevance but is mostly incorrect or out of context.
        Score 4: The answer has moderate relevance but contains several significant inaccuracies.
        Score 5: The answer has moderate relevance but contains some notable inaccuracies.
        Score 6: The answer is generally correct but contains a reasonable number of minor errors or omissions.
        Score 7: The answer is mostly correct and relevant but contains some minor errors or omissions.
        Score 8: The answer is very correct and relevant, with only small inaccuracies or omissions.
        Score 9: The answer is almost entirely accurate and relevant, with only one or two small inaccuracies or omissions.
        Score 10: The answer is completely accurate and perfectly aligns with the reference, with no errors or omissions."""
    }

    evaluator = load_evaluator(
        "labeled_score_string",
        criteria=accuracy_criteria,
        llm=Ollama(model=llm_model_name),
    )

    chat = getChatChain(Ollama(model=llm_model_name), db)
    df = pd.read_csv("evaluate.csv")
    print("\n[INFO] Evaluating model: ", llm_model_name)

    # Append one row per question; the context manager closes the file even on error.
    with open("Stats.csv", "a") as f:
        for _, row in df.iterrows():
            question = row['question']
            reference_answer = row['answer']
            model_answer = chat(question=question)
            # Reset the score so a failed evaluation never reuses an earlier
            # (or still undefined) value.
            score = ''
            try:
                evaluation = evaluator.evaluate_strings(
                    prediction=model_answer,
                    reference=reference_answer,
                    input=question
                )
                score = evaluation.get('score', '')
                print(evaluation)
                print("\n[QUESTION] " + evaluation.get('reasoning', ''), score)
            except ValueError as e:
                # Raised when the judge output cannot be parsed into a score.
                print("\n[EXCEPTION] ", str(e))
            # Time is measured from `inicio`, so it is cumulative across questions.
            f.write(f"{llm_model_name},{score},{time.time() - inicio}\n")

# Start a fresh Stats.csv containing only the header row.
# Mode "w" overwrites any results left from earlier runs.
with open("Stats.csv", "w") as f:
    f.write("model,score,time\n")
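Once the evaluation cells have run, `Stats.csv` holds one `model,score,time` row per question. The following is a minimal sketch of how those rows could be summarized per model, not part of the original notebook. The function name `summarize_stats` is hypothetical, and the sketch assumes that an empty `score` field means the judge output could not be parsed and that `time` is cumulative since `inicio`, so the last row of each model gives its total elapsed time.

```python
import csv
from collections import defaultdict
from statistics import mean


def summarize_stats(path="Stats.csv"):
    """Return the mean score, number of scored rows, and total time per model."""
    scores = defaultdict(list)
    total_time = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            model = row["model"]
            # `time` is cumulative, so the last row seen for a model is its total.
            total_time[model] = float(row["time"])
            # Assumption: an empty score marks a row whose evaluation failed.
            if row["score"]:
                scores[model].append(float(row["score"]))
    return {
        model: {
            "mean_score": mean(scores[model]) if scores[model] else None,
            "scored": len(scores[model]),
            "total_time": seconds,
        }
        for model, seconds in total_time.items()
    }
```

Reporting the number of scored rows next to the mean matters here, because a model whose judge output rarely parses would otherwise be ranked on very few questions.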

# Mistral

In [3]:
# Evaluation of Mistral in terms of time, precision, and accuracy.
inicio = time.time()
db = load_documents_into_database("mistral", "nomic-embed-text", "../Final PDF Files", True)
evaluate("mistral", db, inicio)
fim = time.time()
print(f"The model took {round(fim - inicio, 2)} seconds to generate the answers.")

Loading documents
Loading .pdf files


100%|██████████| 25/25 [00:04<00:00,  6.07it/s]


Loading .md files


100%|██████████| 1/1 [00:00<00:00, 1100.29it/s]


Creating embeddings and loading documents into Chroma

[INFO] Evaluating model:  mistral
 I cannot answer your question with the provided research. The research focuses on the benefits of an exercise method for promoting joint stability and balanced muscular development, and does not mention anything about the number of parts in the human chest.
[EXCEPTION]  Invalid output:  Based on the information provided in the user question and the assistant's response, I would rate the quality of the assistant's response as follows:

Explanation: The assistant correctly identified that the given research does not provide an answer to the user question. However, it could have mentioned that the human chest is typically divided into three parts: the upper chest, middle chest, and lower chest. This information would have been helpful for the user. Therefore, the response was relevant but mostly incorrect in providing a complete answer.

Rating: 4.

The assistant's response did acknowledge that it co

# Llama2

In [4]:
# Evaluation of Llama2 in terms of time, precision, and accuracy.
inicio = time.time()
db = load_documents_into_database("llama2", "nomic-embed-text", "../Final PDF Files", True)
evaluate("llama2", db, inicio)
fim = time.time()
print(f"The model took {round(fim - inicio, 2)} seconds to generate the answers.")

Loading documents
Loading .pdf files


100%|██████████| 25/25 [00:04<00:00,  6.11it/s]


Loading .md files


100%|██████████| 1/1 [00:00<00:00, 1081.01it/s]


Creating embeddings and loading documents into Chroma

[INFO] Evaluating model:  llama2
Based on the provided research documents, there are three parts to the human chest: upper, middle, and lower. According to the "Training Muscles" document, these parts make up 80% of the chest mass, and it is recommended to focus on working these parts with more sets in flat bench presses/flyes than incline ones.
[EXCEPTION]  Invalid output: Rating: [8]

In my evaluation, the response provided by the AI assistant is generally correct and relevant to the question asked. The assistant accurately identified the three parts of the human chest (upper, middle, and lower) based on the provided research documents. The answer is well-structured and easy to understand, with clear explanations for each part of the chest.

However, I would deduct a point for minor inaccuracies or omissions. For instance, the assistant did not provide any information about the size or muscles of each part of the chest, which cou
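
Both exceptions shown so far come from judge outputs that do contain a rating, written as `Rating: 4.` and `Rating: [8]`, but not in the form the evaluator's parser accepts (as far as I know it looks for a double-bracketed verdict such as `[[8]]`). The sketch below is one possible fallback, not part of the original notebook: a hypothetical helper `extract_rating` that recovers an integer from 1 to 10 from the rating shapes seen in these outputs. The regular expression is an assumption based only on those two examples.

```python
import re
from typing import Optional

# Matches "Rating: 4", "Rating: [8]" and "Rating: [[8]]" (brackets optional).
_RATING_RE = re.compile(r"Rating:\s*\[{0,2}\s*(\d{1,2})\s*\]{0,2}")


def extract_rating(text: str) -> Optional[int]:
    """Return the first rating between 1 and 10 found in `text`, else None."""
    match = _RATING_RE.search(text)
    if match is None:
        return None
    value = int(match.group(1))
    # Reject numbers outside the 1-10 rubric instead of guessing.
    return value if 1 <= value <= 10 else None
```

One way to use it would be to call `extract_rating(str(e))` in the `except ValueError` branch and store the result as the score when it is not `None`. This assumes the exception message carries the part of the judge text that contains the rating, which should be checked before relying on it.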

# Zephyr

In [5]:
# Evaluation of Zephyr in terms of time, precision, and accuracy.
inicio = time.time()
db = load_documents_into_database("zephyr", "nomic-embed-text", "../Final PDF Files", True)
evaluate("zephyr", db, inicio)
fim = time.time()
print(f"The model took {round(fim - inicio, 2)} seconds to generate the answers.")

Loading documents
Loading .pdf files


100%|██████████| 25/25 [00:04<00:00,  6.15it/s]


Loading .md files


100%|██████████| 1/1 [00:00<00:00, 1059.44it/s]


Creating embeddings and loading documents into Chroma

[INFO] Evaluating model:  zephyr
The provided research documents mention three parts that make up the human chest:

1. Grains group: Whole grains such as brown rice and oats are included in this group. These foods provide carbohydrates, fiber, and other important nutrients. (Source: ../Final PDF Files/Nutrition_Facts.pdf, Page 1)

2. Meat, Fish, and Beans group: This group includes both animal and plant-based sources of protein such as chicken, fish, beans, and lentils. (Source: ../Final PDF Files/Nutrition_Facts.pdf, Page 1)

3. Milk group: Low-fat cheese is a part of this group that provides calcium, protein, and other essential nutrients. (Source: ../Final PDF Files/Nutrition_Facts.pdf, Page 1)

Note: The provided research documents do not explicitly mention the three parts of the human chest. This question seems to be unrelated to the given context. However, if you are asking about the respiratory system, then the answer would 