# LLM RAG Evaluation with MLflow using llama2-as-judge Example Notebook

In this notebook, we will demonstrate how to evaluate a RAG system with MLflow. We will use llama2-70b as the judge model, via a Databricks serving endpoint.

In [1]:
import os

We need to set our Databricks host and access token.

To set these credentials safely, either export them in a command-line terminal for your current session or, to make them available in all user sessions, add the following entries to your shell configuration file (e.g., .bashrc, .zshrc):

`DATABRICKS_HOST=<your Databricks workspace URL>`

`DATABRICKS_TOKEN=<your Databricks access token>`

In [2]:
os.environ["DATABRICKS_HOST"] = "REDACTED"
os.environ["DATABRICKS_TOKEN"] = "REDACTED"
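
Rather than writing credentials into the notebook, the cell above could read them from variables already exported in the shell. A minimal sketch, assuming a hypothetical helper `require_env` (it is not part of MLflow or Databricks) and a placeholder host value used only so the example runs:

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or fail loudly."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook.")
    return value


# Placeholder for demonstration only; normally the variable is exported in the shell.
os.environ.setdefault("DATABRICKS_HOST", "https://example.invalid")
host = require_env("DATABRICKS_HOST")
```

Failing early with a clear message avoids a less obvious authentication error later, when the serving endpoint is first called.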

Set the deployment target to "databricks" for use with Databricks served models.

In [4]:
from mlflow.deployments import set_deployments_target

set_deployments_target("databricks")

## Create a RAG system

Use LangChain and Chroma to create a RAG system that answers questions based on a local text document describing the basic structure of the Local High Voltage product.

In [2]:
# !pip install langchain
# !pip install transformers
# !pip install chromadb
# !pip install sentence_transformers
# !pip install accelerate
# !pip install mlflow
# !pip install bitsandbytes
# !pip install rank_bm25 > /dev/null
import os, glob, textwrap, time
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import TextLoader
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceBgeEmbeddings
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.vectorstores.chroma import Chroma
from langchain.chains import ConversationalRetrievalChain
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain import PromptTemplate
from langchain.prompts import SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate
from langchain.memory import ConversationBufferMemory
from langchain.chains.question_answering import load_qa_chain
import pandas as pd
import mlflow



In [3]:
def loadSplitDocuments(file_path, chunk_size, chunk_overlap):
    """Load a text file and split it into overlapping chunks."""
    loader = TextLoader(file_path)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return text_splitter.split_documents(documents)


text = loadSplitDocuments(
    "Basic Structure of the Local High Voltage Product _parsed.txt",
    chunk_size=600,
    chunk_overlap=60,
)

In [4]:
import torch

print(torch.__version__)

print("Torch version:",torch.__version__)

print("Is CUDA enabled?",torch.cuda.is_available())

2.2.1
Torch version: 2.2.1
Is CUDA enabled? True


In [5]:
model_name = "BAAI/bge-small-en"
model_kwargs = {"device": "cuda"}
encode_kwargs = {"normalize_embeddings": True}

bgeEmbeddings = HuggingFaceBgeEmbeddings(
                                     model_name=model_name,
                                     model_kwargs=model_kwargs,
                                     encode_kwargs=encode_kwargs)

In [6]:
directory = 'db'

vectordb = Chroma.from_documents(
                    documents = text,
                    embedding = bgeEmbeddings,
                    persist_directory= directory )

vectordb.persist()

In [7]:
directory = 'db'
vectordb = Chroma(persist_directory = directory, embedding_function= bgeEmbeddings)

retriever = vectordb.as_retriever(search_type = "similarity", search_kwargs = {"k":5})

docs = retriever.get_relevant_documents("When will the lab heating was started early morning at approximately?")
docs

[Document(page_content='Basic Structure of the Local High\n\ncase where  battery  relay  opens  then the inverter  output  off alert should  not be generated  since  we already\nknow  that the output  has turned  off due to relay  opening\n\n//Cloud  Manager\nThis is responsible  for uploading  all stats and alerts  to the cloud.  Usually  after every  2 minutes  the cloud  will\nrequest  information  from  the system.\n\nWiFi Wizard\nChecks  connection  of the system  with Wiﬁ.', metadata={'source': 'Basic Structure of the Local High Voltage Product _parsed.txt'}),
 Document(page_content='Basic Structure of the Local High\n\ncase where  battery  relay  opens  then the inverter  output  off alert should  not be generated  since  we already\nknow  that the output  has turned  off due to relay  opening\n\n//Cloud  Manager\nThis is responsible  for uploading  all stats and alerts  to the cloud.  Usually  after every  2 minutes  the cloud  will\nrequest  information  from  the system.\n\nW

In [1]:
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig, AutoModelForCausalLM, AutoTokenizer

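# Load WizardLM-7B with 4-bit quantization. This needs the `accelerate` and
# `bitsandbytes` packages; without them the call raises the ImportError shown below.
Tokenizer = LlamaTokenizer.from_pretrained("TheBloke/wizardLM-7B-HF")
model = LlamaForCausalLM.from_pretrained("TheBloke/wizardLM-7B-HF",
                                         load_in_4bit=True,
                                         torch_dtype=torch.float16)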

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


ImportError: Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes: `pip install -i https://pypi.org/simple/ bitsandbytes`

In [9]:
from langchain.llms import HuggingFacePipeline
from transformers import pipeline
import torch

pipe = pipeline(
               'text-generation',
                model=model,
                tokenizer=Tokenizer,
                max_length=1024,
)

wizard_llm = HuggingFacePipeline(pipeline=pipe)

In [10]:
recent_chats = """
User: What is the purpose of the High Voltage Battery Management System (HVBMS)?
Assistant: The HVBMS is designed to monitor, balance, and protect LiFePO4 battery cells, ensuring they operate safely under various conditions.

User: How many major portions does the HVBMS comprise, and what are they?
Assistant: The HVBMS comprises six major portions: one current sensing portion (CS), four cell balancing, monitoring, and protection portions (CBMPs), and one microcontroller portion (MCU).

User: What communication interfaces does the HVBMS board support?
Assistant: The HVBMS board supports RS485 and CAN-Bus interfaces for communication with the
"""

In [11]:
from langchain.chains.question_answering import load_qa_chain
from langchain.memory import ConversationBufferMemory

# memory = ConversationBufferMemory(memory_key="chat_history", input_key="question", return_messages=True)

prompt_template = """
                previous conversation:
                [{chat_history}]
                ---------------------
                \nConsidering the above previous conversations and the given context, please answer the following question, taking into account conversation awareness.
                If you don't know the answer, simply state "Contact NOC for further assistance" without attempting to invent an answer.
                Context: \n{context}\n\n
                Answer the question: \n{question}\n
                Answer:  """

prompt = PromptTemplate(template = prompt_template, input_variables = ["chat_history", "context", "question"])
chain = load_qa_chain(wizard_llm, chain_type="stuff", prompt=prompt)

## Evaluate the RAG system using `mlflow.evaluate()`

Create a simple function that runs each input through the RAG chain
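
The step above can be sketched as follows. This is a minimal, library-independent sketch: `answer_questions`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the stub callables stand in for the notebook's `retriever` and `chain`. The `result` and `source_documents` keys are chosen to match the column names used in the `mlflow.evaluate()` call later in this notebook.

```python
def answer_questions(questions, retrieve, generate, chat_history=""):
    """Run each question through a retrieve-then-generate RAG step.

    `retrieve(question)` returns a list of context strings and
    `generate(question, contexts, chat_history)` returns the answer text.
    Both callables are hypothetical stand-ins for the retriever and QA chain.
    """
    rows = []
    for question in questions:
        contexts = retrieve(question)
        answer = generate(question, contexts, chat_history)
        rows.append(
            {"questions": question, "result": answer, "source_documents": contexts}
        )
    return rows


# Stub components so the sketch runs without a model or a vector store.
def fake_retrieve(question):
    return ["The Cloud Manager uploads stats and alerts to the cloud."]


def fake_generate(question, contexts, chat_history):
    return f"Based on {len(contexts)} passage(s): {contexts[0]}"


rows = answer_questions(["What does the Cloud Manager do?"], fake_retrieve, fake_generate)
print(rows[0]["result"])
```

With the objects defined earlier, `retrieve` would wrap the retriever lookup and `generate` would call the QA chain with the retrieved documents, the question, and `recent_chats`; the returned rows can then be turned into a DataFrame with `pd.DataFrame(rows)`.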

Create an eval dataset

In [50]:
eval_df = pd.DataFrame(
    {
    "questions": [
    "Explore the interactions between the NGM app, BMS Manager, and Inverter Manager in ensuring efficient communication and data exchange across various system components, highlighting their roles in initialization and operational control.",
    "How do the Cloud Manager, WiFi Wizard, and Database Manager collaborate to maintain seamless connectivity and data synchronization between the local system and the cloud, considering their respective responsibilities and interactions?",
    "Discuss the collaborative efforts of the Bring Up Service, UI Manager, and Alerts Service in providing comprehensive system feedback to operators, considering their roles in initialization, user interface management, and alert generation.",
    "Explain how the Forecast Engine and Smart Flow Manager work together to optimize system performance based on forecasted data, emphasizing their roles in data processing and decision-making for backup and charge source strategies.",
    "Explore the interconnectedness of various components such as the NGM app, BMS Manager, and Inverter Manager in handling both routine operations and contingency scenarios, highlighting the importance of seamless coordination for overall system reliability.",
    "How do the NGM app, Cloud Manager, and Alerts Service contribute to system resilience in the face of network failures or hardware disconnectivity, considering their roles in system initialization, data transfer, and alert generation?",
    "Discuss the integration between the Database Manager and other components such as the Bring Up Service and UI Manager in maintaining consistent data access and user interface feedback, considering their roles in data management and system monitoring.",
    "Explain the collaborative efforts between the Inverter Manager, BMS Manager, and Smart Flow Manager in optimizing power flow and operational strategies based on real-time data and forecasted trends, emphasizing their roles in system stability and efficiency.",
    "How do the NGM app, WiFi Wizard, and Cloud Manager adapt to changing network conditions and ensure continuous system operation and data synchronization, considering their roles in network connectivity and cloud integration?",
    "Explore the overall system architecture and interactions among components such as the NGM app, Inverter Manager, and Alerts Service in ensuring reliable operation and timely response to various system events, highlighting the integration and coordination required for seamless functionality.",
        ],
    }
)

In [112]:
from mlflow.metrics.genai import EvaluationExample, faithfulness

# Create a good and bad example for faithfulness in the context of this problem
faithfulness_examples = [
    EvaluationExample(
        input="How does the NGM app facilitate system initialization?",
        output="The NGM app facilitates system initialization by providing configurations to components and ensuring communication channels are established.",
        score=5,
        justification="The output accurately describes how the NGM app facilitates system initialization, matching the information provided in the context.",
        grading_context={
            "context": "The NGM app is responsible for communicating with various components such as the BMS, Cloud, HV Control Board, and Inverter. It provides configurations for initial setup and ensures communication channels are established before the system starts functioning."
        },
    ),
    EvaluationExample(
        input="What is the role of the BMS Manager?",
        output="The BMS Manager is responsible for handling all NGM based actions related to the battery and cloud. It communicates with all BMSes attached to the system, reads data such as pack current and voltage, and processes this data.",
        score=2,
        justification="While the output provides some information about the BMS Manager, it lacks detail and does not fully capture its role as described in the context.",
        grading_context={
            "context": "The BMS Manager serves as a crucial link between higher-level actions related to the battery and cloud interactions. It communicates with all BMSes attached to the system, reads data such as pack current and voltage, and processes this data before passing relevant information to other processes."
        },
    ),
]


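# NOTE: `model` is expected to be a judge-model URI string (for example an
# `endpoints:/...` or `openai:/...` URI), not a pipeline object, and
# `mpt_pipeline` is only defined in a later cell.
faithfulness_metric = faithfulness(
    model=mpt_pipeline, examples=faithfulness_examples
)
print(faithfulness_metric)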

EvaluationMetric(name=faithfulness, greater_is_better=True, long_name=faithfulness, version=v1, metric_details=
Task:
You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's faithfulness based on the rubric
justification: Your reasoning about the model's faithfulness score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called faithfulness based on the input and output.
A definition of faithfulness and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.

Inp
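
The judge prompt printed above asks for a reply in two lines, `score:` followed by `justification:`. A minimal sketch of how such a reply could be parsed is shown here; `parse_judge_reply` is a hypothetical helper written for illustration and is not MLflow's own parser.

```python
import re


def parse_judge_reply(reply):
    """Extract (score, justification) from a 'score: ... / justification: ...' reply.

    Returns None for any field that cannot be found.
    """
    score_match = re.search(r"score:\s*(\d+)", reply, flags=re.IGNORECASE)
    just_match = re.search(
        r"justification:\s*(.+)", reply, flags=re.IGNORECASE | re.DOTALL
    )
    score = int(score_match.group(1)) if score_match else None
    justification = just_match.group(1).strip() if just_match else None
    return score, justification


reply = "score: 4\njustification: The answer is supported by the context."
parsed = parse_judge_reply(reply)
print(parsed)
```

Smaller judge models do not always follow the requested format, so returning `None` for a missing field makes malformed replies easy to detect and count.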

In [52]:
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
judge_tokenizer = AutoTokenizer.from_pretrained(model_name)
judge_model = AutoModelForCausalLM.from_pretrained(model_name,
                                         load_in_4bit=True,
                                         torch_dtype=torch.float16)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards: 100%|██████████| 3/3 [00:26<00:00,  8.72s/it]


In [53]:
pipe = pipeline(
               'text-generation',
                model=judge_model,
                tokenizer=judge_tokenizer,
                max_length=2048,
)

mistral_llm = HuggingFacePipeline(pipeline=pipe)

In [None]:
def model(input_df):
    """Return the Mistral answer for each question in `input_df`.

    Note that each question is sent to the LLM directly, without retrieval.
    """
    return [mistral_llm(question) for question in input_df["questions"]]

In [60]:
answers=model(eval_df)
eval_df["answers"] = answers

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [62]:
eval_df.head()

| | questions | answers |
|---|---|---|
| 0 | Explore the interactions between the NGM app, ... | \n\n## 1. Introduction\n\nThe integration of r... |
| 1 | How do the Cloud Manager, WiFi Wizard, and Dat... | \n\nThe Cloud Manager, WiFi Wizard, and Databa... |
| 2 | Discuss the collaborative efforts of the Bring... | \n\nThe Bring Up Service, UI Manager, and Aler... |
| 3 | Explain how the Forecast Engine and Smart Flow... | \n\nThe Forecast Engine and Smart Flow Manager... |
| 4 | Explore the interconnectedness of various comp... | \n\n## 1. Introduction\n\nThe rapid developmen... |


In [110]:
from mlflow.metrics.genai import relevance

# NOTE: as with faithfulness, `model` is expected to be a judge-model URI string.
relevance_metric = relevance(model=mpt_pipeline)
print(relevance_metric)
print("________________________________________________________________________")
print(faithfulness_metric)

EvaluationMetric(name=relevance, greater_is_better=True, long_name=relevance, version=v1, metric_details=
Task:
You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's relevance based on the rubric
justification: Your reasoning about the model's relevance score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called relevance based on the input and output.
A definition of relevance and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.

Input:
{input}

Outpu

In [116]:
mpt_pipeline = pipeline("text-generation", model="TheBloke/wizardLM-7B-HF", max_length=1000)

Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.49s/it]


In [117]:
signature = mlflow.models.infer_signature(
    model_input="What are the three primary colors?",
    model_output="The three primary colors are red, yellow, and blue.",
)

with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=mpt_pipeline,
        artifact_path="mpt-7b",
        signature=signature,
        registered_model_name="mpt-7b-chat",
    )

Registered model 'mpt-7b-chat' already exists. Creating a new version of this model...
Created version '6' of model 'mpt-7b-chat'.


In [118]:
results = mlflow.evaluate(
    model_info.model_uri,
    eval_df,
    model_type="question-answering",
    evaluators="default",
    predictions="result",
    extra_metrics=[faithfulness_metric, relevance_metric, mlflow.metrics.latency()],
    evaluator_config={
        "col_mapping": {
            "inputs": "questions",
            "context": "source_documents",
        }
    },
)
print(results.metrics)

Loading checkpoint shards: 100%|██████████| 66/66 [00:46<00:00,  1.41it/s]
2024/03/11 12:21:59 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2024/03/11 12:21:59 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.


ValueError: Input length of input_ids is 43, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_length` or, better yet, setting `max_new_tokens`.

In [None]:
results.tables["eval_results_table"]

NameError: name 'results' is not defined