## LLM RAG Evaluation with MLflow Example Notebook

Welcome to this tutorial on evaluating Retrieval-Augmented Generation (RAG) systems using MLflow. It walks through building a RAG system, connecting it to MLflow, and evaluating it in a realistic setting. It is written for data scientists, machine learning engineers, and anyone interested in applied AI.

### What You Will Learn:

1. **Setting Up the Environment**:
   - Learn how to set up your development environment with all the necessary tools and libraries, including MLflow, OpenAI, ChromaDB, LangChain, and more. This section ensures you have everything you need to start working with RAG systems.

2. **Understanding RAG Systems**:
   - Delve into the concept of Retrieval-Augmented Generation and its significance in modern AI applications. Understand how RAG systems leverage both retrieval and generation capabilities to provide accurate and contextually relevant responses.

3. **Deploying and Testing RAG Systems with MLflow**:
   - Learn how to create, deploy, and test RAG systems using MLflow. This includes setting up endpoints, deploying models, and querying them to see their responses in action.

4. **Evaluating Performance with MLflow**:
   - Dive into evaluating the RAG systems using MLflow's evaluation tools. Understand how to use metrics like relevance and latency to assess the performance of your RAG system.

5. **Experimenting with Chunking Strategies**:
   - Experiment with different text chunking strategies to optimize the performance of RAG systems. Understand how the size of text chunks affects retrieval accuracy and system responsiveness.

6. **Creating and Using Evaluation Datasets**:
   - Learn how to create and utilize evaluation datasets (Golden Datasets) to effectively assess the performance of your RAG system.

7. **Combining Retrieval and Generation for Question Answering**:
   - Gain insights into how retrieval and generation components work together in a RAG system to answer questions based on a given context or documentation.

By the end of this tutorial, you will understand how to evaluate and optimize RAG systems using MLflow, and you will be able to deploy, test, and refine them for practical applications.

In [1]:
%pip install -r requirements.txt -q

Note: you may need to restart the kernel to use updated packages.


In [2]:
import ast
import getpass
import warnings
from typing import List

import chromadb
import mlflow
import mlflow.deployments
import pandas as pd
from IPython.display import HTML, display
from langchain.chains import RetrievalQA
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)  # noqa
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain_community.embeddings import MlflowEmbeddings
from langchain_community.llms import Mlflow
from mlflow.metrics.genai.metric_definitions import relevance

warnings.filterwarnings("ignore")



In [3]:
# check mlflow version
mlflow.__version__

'2.13.2'

In [4]:
# check chroma version
chromadb.__version__

'0.5.0'

In [5]:
def pretty_print(df):
    return display(HTML(df.to_html().replace("\\n", "<br>")))

### Create and Test Endpoint on MLflow for AWS Bedrock and OpenAI

We will run the [MLflow Deployments Server](https://mlflow.org/docs/latest/llms/deployments/index.html) locally so that every model provider is reached through a single, uniform endpoint API. Run the following command to start the server:

```bash
mlflow deployments start-server --config-path mlflow-deployment.yaml --port 5000 --host localhost --workers 2
```
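
The contents of `mlflow-deployment.yaml` are not shown in this notebook. As a rough sketch, a configuration consistent with the three endpoints listed by `client.list_endpoints()` might look like the following. The AWS region, the environment variable names, the example file name, and the overall layout are assumptions based on the MLflow Deployments Server configuration format, so check them against the documentation for your MLflow version. The snippet writes the file from Python so that it can be run from a notebook cell:

```python
from pathlib import Path

# Hypothetical configuration: endpoint names and model names mirror the
# endpoints used in this notebook; the region and env vars are assumptions.
CONFIG_YAML = """\
endpoints:
  - name: completions
    endpoint_type: llm/v1/completions
    model:
      provider: bedrock
      name: anthropic.claude-v2:1
      config:
        aws_config:
          aws_region: us-east-1
          aws_access_key_id: $AWS_ACCESS_KEY_ID
          aws_secret_access_key: $AWS_SECRET_ACCESS_KEY
  - name: chat
    endpoint_type: llm/v1/chat
    model:
      provider: openai
      name: gpt-4
      config:
        openai_api_key: $OPENAI_API_KEY
  - name: embeddings
    endpoint_type: llm/v1/embeddings
    model:
      provider: openai
      name: text-embedding-ada-002
      config:
        openai_api_key: $OPENAI_API_KEY
"""

# Written under an example name so an existing config is not overwritten.
config_path = Path("mlflow-deployment.example.yaml")
config_path.write_text(CONFIG_YAML)
print(f"Wrote {config_path}")
```

To use a file like this, pass its path to `--config-path` in the command above.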

In [6]:
client = mlflow.deployments.get_deploy_client("http://127.0.0.1:5000")

endpoints = client.list_endpoints()

for endpoint in endpoints:
    print(endpoint)

name='completions' endpoint_type='llm/v1/completions' model=RouteModelInfo(name='anthropic.claude-v2:1', provider='bedrock') endpoint_url='http://127.0.0.1:5000/gateway/completions/invocations' limit=None
name='chat' endpoint_type='llm/v1/chat' model=RouteModelInfo(name='gpt-4', provider='openai') endpoint_url='http://127.0.0.1:5000/gateway/chat/invocations' limit=None
name='embeddings' endpoint_type='llm/v1/embeddings' model=RouteModelInfo(name='text-embedding-ada-002', provider='openai') endpoint_url='http://127.0.0.1:5000/gateway/embeddings/invocations' limit=None


In [7]:
print(
    client.predict(
        endpoint="completions",
        inputs={
            "prompt": "How is Pi calculated? Be very concise.",
            "max_tokens": 100,
        },
    )
)

{'id': None, 'object': 'text_completion', 'created': 1717804873, 'model': 'anthropic.claude-v2:1', 'choices': [{'index': 0, 'text': " Pi is calculated by approximating the ratio of a circle's circumference to its diameter. Modern computations use iterative algorithms to refine the approximation of Pi to extreme precision.", 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': None, 'completion_tokens': None, 'total_tokens': None}}


### Set MLflow Tracking URI

In [8]:
mlflow_tracking_uri = getpass.getpass(prompt="Enter your MLflow Tracking URI: ")
if mlflow_tracking_uri.endswith(".com") and (
    not mlflow_tracking_uri.startswith("http")
):
    mlflow_tracking_uri = "http://" + mlflow_tracking_uri

mlflow.set_tracking_uri(
    mlflow_tracking_uri
)  # please change this to your MLflow tracking URI
mlflow.set_experiment("llmops-demo")

<Experiment: artifact_location='s3://mlflow-artifacts-767397766072-fd891caf/1', creation_time=1717803484641, experiment_id='1', last_update_time=1717803484641, lifecycle_stage='active', name='llmops-demo', tags={}>

### Create RAG POC with LangChain and log with MLflow

Use LangChain and Chroma to create a RAG system that answers questions based on the MLflow documentation.

In [9]:
CHUNK_SIZE = 1000

loader = WebBaseLoader(
    [
        "https://mlflow.org/docs/latest/index.html",
        "https://mlflow.org/docs/latest/tracking/autolog.html",
        "https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html",
        "https://mlflow.org/docs/latest/python_api/mlflow.deployments.html",
    ]
)

documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# create the language model using MLflow deployment
llm = Mlflow(
    target_uri="http://127.0.0.1:5000",
    endpoint="completions",
)

# create the embedding function using MLflow deployment
embedding_function = MlflowEmbeddings(
    target_uri="http://127.0.0.1:5000",
    endpoint="embeddings",
)
docsearch = Chroma.from_documents(texts, embedding_function)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(fetch_k=3),
    return_source_documents=True,
)
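
Conceptually, the `stuff` chain type retrieves the most relevant chunks and pastes ("stuffs") them all into a single prompt for the LLM. The toy sketch below illustrates that flow in plain Python. The word-overlap scoring, the prompt wording, and the function names are stand-ins invented for illustration; LangChain and Chroma use embedding similarity and their own prompt template instead.

```python
from typing import List


def score(question: str, chunk: str) -> int:
    """Toy relevance score: number of distinct words shared with the question."""
    return len(set(question.lower().split()) & set(chunk.lower().split()))


def retrieve(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks with the highest toy score."""
    return sorted(chunks, key=lambda c: score(question, c), reverse=True)[:k]


def build_stuff_prompt(question: str, context_chunks: List[str]) -> str:
    """Concatenate all retrieved chunks into one prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the context to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up chunks standing in for the split documentation pages.
chunks = [
    "mlflow tracking records parameters and metrics",
    "autologging can be enabled with mlflow.autolog()",
    "chroma is a vector database",
]
question = "how is autologging enabled"

top = retrieve(question, chunks, k=2)
prompt = build_stuff_prompt(question, top)
print(prompt)
```

In the real chain, the prompt built this way is sent to the `completions` endpoint, and `return_source_documents=True` keeps the retrieved chunks alongside the answer.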



### Evaluate the Vector Database and Retrieval using `mlflow.evaluate()`

#### Create an eval dataset (Golden Dataset)

Rather than writing every test case by hand, we can [leverage an LLM to generate synthetic data for testing](https://mlflow.org/docs/latest/llms/rag/notebooks/question-generation-retrieval-evaluation.html). Whichever approach you choose, craft a dataset that mirrors the expected inputs and outputs of your RAG application.

In [10]:
EVALUATION_DATASET_PATH = "https://raw.githubusercontent.com/mlflow/mlflow/master/examples/llms/RAG/static_evaluation_dataset.csv"

synthetic_eval_data = pd.read_csv(EVALUATION_DATASET_PATH)

# Deserialize the source and retrieved doc ids (stored as strings in the CSV) into Python lists
synthetic_eval_data["source"] = synthetic_eval_data["source"].apply(ast.literal_eval)
synthetic_eval_data["retrieved_doc_ids"] = synthetic_eval_data[
    "retrieved_doc_ids"
].apply(ast.literal_eval)
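
The `ast.literal_eval` calls are needed because a CSV file stores each list as a plain string. A minimal sketch with a made-up one-row CSV (the row content is illustrative, not copied from the real dataset file) shows the difference:

```python
import ast
import csv
import io

# Illustrative CSV text: the list-valued column is serialized as a string.
csv_text = 'question,source\n"What is MLflow?","[\'index.html\']"\n'

row = next(csv.DictReader(io.StringIO(csv_text)))
raw = row["source"]             # still a str that merely looks like a list
parsed = ast.literal_eval(raw)  # a real Python list

print(type(raw).__name__, type(parsed).__name__)  # str list
```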

In [11]:
pretty_print(synthetic_eval_data)

Unnamed: 0,question,source,retrieved_doc_ids
0,What is the purpose of the MLflow Model Registry?,[model-registry.html],"[model-registry.html, introduction/index.html, introduction/index.html, deep-learning/index.html]"
1,What is the purpose of registering a model with the Model Registry?,[model-registry.html],"[model-registry.html, models.html, introduction/index.html, introduction/index.html]"
2,What can you do with registered models and model versions?,[model-registry.html],"[model-registry.html, models.html, deployment/deploy-model-to-kubernetes/index.html, deployment/index.html]"
3,"How can you add, modify, update, or delete a model in the Model Registry?",[model-registry.html],"[model-registry.html, models.html, deployment/deploy-model-to-kubernetes/index.html, introduction/index.html]"
4,How can you deploy and organize models in the Model Registry?,[model-registry.html],"[model-registry.html, deployment/index.html, deployment/index.html, models.html]"
5,What is the purpose of the mlflow.sklearn.log_model() method?,[model-registry.html],"[models.html, getting-started/intro-quickstart/index.html, deployment/deploy-model-to-kubernetes/index.html, getting-started/quickstart-1/index.html]"
6,What method do you use to create a new registered model?,[model-registry.html],"[model-registry.html, models.html, deployment/deploy-model-to-kubernetes/index.html, getting-started/quickstart-2/index.html]"
7,How can you deploy and organize models in the Model Registry?,[model-registry.html],"[model-registry.html, deployment/index.html, deployment/index.html, models.html]"
8,How can you fetch a specific model version?,[model-registry.html],"[models.html, model-registry.html, deployment/deploy-model-to-kubernetes/index.html, getting-started/quickstart-2/index.html]"
9,How can you fetch the latest model version in a specific stage?,[model-registry.html],"[models.html, model-registry.html, deployment/deploy-model-to-kubernetes/index.html, llms/prompt-engineering/index.html]"


### Evaluating the Embedding Model with MLflow

In this part of the tutorial, we focus on evaluating the embedding model's performance in the context of a retrieval system. The process involves a series of steps to assess how effectively the model can retrieve relevant documents based on given questions.

#### Creating Evaluation Data
- We start by defining a set of questions and their corresponding source URLs. This `eval_data` DataFrame acts as our evaluation dataset, allowing us to test the model's ability to link questions to the correct source documents.

#### The `evaluate_embedding` Function
- The `evaluate_embedding` function is designed to assess the performance of a given embedding function.
- **Chunking Strategy**: The function begins by splitting a list of documents into chunks using a `CharacterTextSplitter`. The size of these chunks is crucial, as it can influence the retrieval accuracy.
- **Retriever Initialization**: We then use `Chroma.from_documents` to create a retriever with the specified embedding function. This retriever is responsible for finding documents relevant to a given query.
- **Retrieval Process**: The function defines a `retriever_model_function` that applies the retriever to each question in the evaluation dataset. It retrieves document IDs that the model finds most relevant for each question.

#### MLflow Evaluation
- With `mlflow.start_run()`, we initiate an evaluation run. `mlflow.evaluate` is then called to evaluate our retriever model function against the evaluation dataset.
- We use the default evaluator with specified targets to assess the model's performance.
- The results of this evaluation, stored in `eval_results_of_retriever_df_bge`, are displayed, providing insights into the effectiveness of the embedding model in document retrieval.

#### Further Evaluation with Metrics
- Additionally, we perform a more detailed evaluation using various metrics like precision, recall, and NDCG at different 'k' values. These metrics offer a deeper understanding of the model's retrieval accuracy and ranking effectiveness.

This evaluation step is integral to understanding the strengths and weaknesses of our embedding model in a real-world RAG system. By analyzing these results, we can make informed decisions about model adjustments or optimizations to improve overall system performance.
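
To make the retrieval metrics concrete, here is a minimal pure-Python sketch of precision@k and recall@k for a single question. It is an illustration written for this tutorial, not MLflow's implementation, although with these definitions it yields the same precision and recall values that the results table reports for the first question at `k=3`. The shortened document ids are placeholders for the full URLs.

```python
from typing import List


def precision_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    relevant_set = set(relevant)
    return sum(doc in relevant_set for doc in top_k) / len(top_k)


def recall_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
    """Fraction of the relevant items that appear among the top-k retrieved."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(relevant_set & set(retrieved[:k])) / len(relevant_set)


# Placeholder ids: two chunks from the expected page, then two from another.
retrieved = ["index.html", "index.html", "deployments.html", "deployments.html"]
relevant = ["index.html"]

print(round(precision_at_k(retrieved, relevant, 3), 6))  # 0.666667
print(recall_at_k(retrieved, relevant, 3))               # 1.0
```

Note that several chunks from the same page count separately toward precision in this sketch, which is why two hits among the top three give 0.67 rather than 1.0.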


In [12]:
eval_data = pd.DataFrame(
    {
        "question": [
            "What is MLflow?",
            "What is Databricks?",
            "How to serve a model on Databricks?",
            "How to enable MLflow Autologging for my workspace by default?",
        ],
        "source": [
            ["https://mlflow.org/docs/latest/index.html"],
            [
                "https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html"
            ],
            ["https://mlflow.org/docs/latest/python_api/mlflow.deployments.html"],
            ["https://mlflow.org/docs/latest/tracking/autolog.html"],
        ],
    }
)

In [13]:
def evaluate_embedding(embedding_function):
    list_of_documents = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=0)
    docs = text_splitter.split_documents(list_of_documents)
    retriever = Chroma.from_documents(docs, embedding_function).as_retriever()

    def retrieve_doc_ids(question: str) -> List[str]:
        docs = retriever.get_relevant_documents(question)
        return [doc.metadata["source"] for doc in docs]

    def retriever_model_function(question_df: pd.DataFrame) -> pd.Series:
        return question_df["question"].apply(retrieve_doc_ids)

    with mlflow.start_run():
        return mlflow.evaluate(
            model=retriever_model_function,
            data=eval_data,
            model_type="retriever",
            targets="source",
            evaluators="default",
        )


result1 = evaluate_embedding(
    MlflowEmbeddings(
        target_uri="http://127.0.0.1:5000",
        endpoint="embeddings",
    )
)
# To validate the results of a different model, comment out the above line and uncomment the below line:
# result2 = evaluate_embedding(SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2"))

eval_results_of_retriever_df_bge = result1.tables["eval_results_table"]
# To validate the results of a different model, comment out the above line and uncomment the below line:
# eval_results_of_retriever_df_MiniLM = result2.tables["eval_results_table"]
pretty_print(eval_results_of_retriever_df_bge)

2024/06/08 07:01:37 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/06/08 07:01:40 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00,  1.64it/s]


Unnamed: 0,question,source,outputs,precision_at_3/score,recall_at_3/score,ndcg_at_3/score
0,What is MLflow?,[https://mlflow.org/docs/latest/index.html],"[https://mlflow.org/docs/latest/index.html, https://mlflow.org/docs/latest/index.html, https://mlflow.org/docs/latest/python_api/mlflow.deployments.html, https://mlflow.org/docs/latest/python_api/mlflow.deployments.html]",0.666667,1,1.0
1,What is Databricks?,[https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html],"[https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html]",1.0,1,1.0
2,How to serve a model on Databricks?,[https://mlflow.org/docs/latest/python_api/mlflow.deployments.html],"[https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html]",0.0,0,0.530721
3,How to enable MLflow Autologging for my workspace by default?,[https://mlflow.org/docs/latest/tracking/autolog.html],"[https://mlflow.org/docs/latest/tracking/autolog.html, https://mlflow.org/docs/latest/tracking/autolog.html, https://mlflow.org/docs/latest/tracking/autolog.html, https://mlflow.org/docs/latest/tracking/autolog.html]",1.0,1,1.0


### Evaluate Different Top-K Strategies with MLflow

In [15]:
with mlflow.start_run() as run:
    evaluate_results = mlflow.evaluate(
        data=eval_results_of_retriever_df_bge,
        targets="source",
        predictions="outputs",
        evaluators="default",
        extra_metrics=[
            mlflow.metrics.precision_at_k(1),
            mlflow.metrics.precision_at_k(2),
            mlflow.metrics.precision_at_k(3),
            mlflow.metrics.recall_at_k(1),
            mlflow.metrics.recall_at_k(2),
            mlflow.metrics.recall_at_k(3),
            mlflow.metrics.ndcg_at_k(1),
            mlflow.metrics.ndcg_at_k(2),
            mlflow.metrics.ndcg_at_k(3),
        ],
    )

pretty_print(evaluate_results.tables["eval_results_table"])

2024/06/08 07:02:20 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00,  1.34it/s]


Unnamed: 0,question,precision_at_3/score,recall_at_3/score,ndcg_at_3/score,source,outputs,precision_at_1/score,precision_at_2/score,recall_at_1/score,recall_at_2/score,ndcg_at_1/score,ndcg_at_2/score
0,What is MLflow?,0.666667,1,1.0,[https://mlflow.org/docs/latest/index.html],"[https://mlflow.org/docs/latest/index.html, https://mlflow.org/docs/latest/index.html, https://mlflow.org/docs/latest/python_api/mlflow.deployments.html, https://mlflow.org/docs/latest/python_api/mlflow.deployments.html]",1,1,1,1,1,1.0
1,What is Databricks?,1.0,1,1.0,[https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html],"[https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html]",1,1,1,1,1,1.0
2,How to serve a model on Databricks?,0.0,0,0.530721,[https://mlflow.org/docs/latest/python_api/mlflow.deployments.html],"[https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html]",0,0,0,0,0,0.386853
3,How to enable MLflow Autologging for my workspace by default?,1.0,1,1.0,[https://mlflow.org/docs/latest/tracking/autolog.html],"[https://mlflow.org/docs/latest/tracking/autolog.html, https://mlflow.org/docs/latest/tracking/autolog.html, https://mlflow.org/docs/latest/tracking/autolog.html, https://mlflow.org/docs/latest/tracking/autolog.html]",1,1,1,1,1,1.0


### Evaluate the Chunking Strategy with MLflow

In RAG systems, the way text is divided into chunks affects both retrieval effectiveness and overall system performance. This section explains why and how we evaluate different chunking strategies:

#### Importance of Chunking:
- **Influences Retrieval Accuracy**: The way text is chunked can significantly affect the retrieval component of RAG systems. Smaller chunks may lead to more focused and relevant document retrieval, while larger chunks might capture broader context.
- **Impacts System's Responsiveness**: The size of text chunks also influences the speed of document retrieval and processing. Smaller chunks can be processed more quickly but may require the system to evaluate more chunks overall.

#### Evaluating Different Chunk Sizes:
- **Purpose**: By evaluating different chunk sizes, we aim to find an optimal balance between retrieval accuracy and processing efficiency. This involves experimenting with various chunk sizes to see how they impact the system's performance.
- **Method**: We create text chunks of different sizes (100, 1000, and 5000 characters in the code below) and then evaluate how each chunking strategy affects the RAG system. Key aspects to observe include the relevance of retrieved documents and the system's latency.

In the example below, we use the default evaluation suite for retrievers to measure how chunk size affects the quality of the returned references. This lets us tune the chunk size toward a configuration that best handles our suite of test questions.
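
To build intuition for how chunk size changes the number of retrievable pieces, here is a simplified fixed-width character chunker. It is only a sketch: LangChain's `CharacterTextSplitter` first splits on a separator and then merges the pieces up to `chunk_size`, so its chunk boundaries will differ from this naive version. The sample text is made up.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into windows of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


# 27 characters repeated 400 times: 10,800 characters of filler text.
sample = "MLflow tracks experiments. " * 400

for size in (100, 1000, 5000):
    chunks = chunk_text(sample, size)
    print(f"chunk_size={size}: {len(chunks)} chunks")
# chunk_size=100: 108 chunks
# chunk_size=1000: 11 chunks
# chunk_size=5000: 3 chunks
```

Smaller chunks give the retriever more, narrower candidates to rank, while larger chunks give it fewer candidates that each carry more context.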

In [16]:
def evaluate_chunk_size(chunk_size):
    list_of_documents = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    docs = text_splitter.split_documents(list_of_documents)
    embedding_function = MlflowEmbeddings(
        target_uri="http://127.0.0.1:5000",
        endpoint="embeddings",
    )
    retriever = Chroma.from_documents(docs, embedding_function).as_retriever()

    def retrieve_doc_ids(question: str) -> List[str]:
        docs = retriever.get_relevant_documents(question)
        return [doc.metadata["source"] for doc in docs]

    def retriever_model_function(question_df: pd.DataFrame) -> pd.Series:
        return question_df["question"].apply(retrieve_doc_ids)

    with mlflow.start_run():
        return mlflow.evaluate(
            model=retriever_model_function,
            data=eval_data,
            model_type="retriever",
            targets="source",
            evaluators="default",
        )


result1 = evaluate_chunk_size(100)
result2 = evaluate_chunk_size(1000)
result3 = evaluate_chunk_size(5000)

pretty_print(result1.tables["eval_results_table"])
pretty_print(result2.tables["eval_results_table"])
pretty_print(result3.tables["eval_results_table"])

2024/06/08 07:03:55 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/06/08 07:03:58 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2024/06/08 07:04:12 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/06/08 07:04:15 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2024/06/08 07:04:25 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/06/08 07:04:27 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00,  1.74it/s]


Unnamed: 0,question,source,outputs,precision_at_3/score,recall_at_3/score,ndcg_at_3/score
0,What is MLflow?,[https://mlflow.org/docs/latest/index.html],"[https://mlflow.org/docs/latest/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/index.html, https://mlflow.org/docs/latest/index.html]",0.666667,1,0.919721
1,What is Databricks?,[https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html],"[https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/python_api/mlflow.deployments.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html]",0.666667,1,0.919721
2,How to serve a model on Databricks?,[https://mlflow.org/docs/latest/python_api/mlflow.deployments.html],"[https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html]",0.0,0,0.530721
3,How to enable MLflow Autologging for my workspace by default?,[https://mlflow.org/docs/latest/tracking/autolog.html],"[https://mlflow.org/docs/latest/tracking/autolog.html, https://mlflow.org/docs/latest/tracking/autolog.html, https://mlflow.org/docs/latest/tracking/autolog.html, https://mlflow.org/docs/latest/tracking/autolog.html]",1.0,1,1.0


Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00,  1.75it/s]


Unnamed: 0,question,source,outputs,precision_at_3/score,recall_at_3/score,ndcg_at_3/score
0,What is MLflow?,[https://mlflow.org/docs/latest/index.html],"[https://mlflow.org/docs/latest/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/index.html, https://mlflow.org/docs/latest/index.html]",0.666667,1,0.919721
1,What is Databricks?,[https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html],"[https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/python_api/mlflow.deployments.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html]",0.666667,1,0.919721
2,How to serve a model on Databricks?,[https://mlflow.org/docs/latest/python_api/mlflow.deployments.html],"[https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html]",0.0,0,0.530721
3,How to enable MLflow Autologging for my workspace by default?,[https://mlflow.org/docs/latest/tracking/autolog.html],"[https://mlflow.org/docs/latest/tracking/autolog.html, https://mlflow.org/docs/latest/tracking/autolog.html, https://mlflow.org/docs/latest/tracking/autolog.html, https://mlflow.org/docs/latest/tracking/autolog.html]",1.0,1,1.0


Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00,  1.74it/s]


Unnamed: 0,question,source,outputs,precision_at_3/score,recall_at_3/score,ndcg_at_3/score
0,What is MLflow?,[https://mlflow.org/docs/latest/index.html],"[https://mlflow.org/docs/latest/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/index.html, https://mlflow.org/docs/latest/index.html]",0.666667,1,0.919721
1,What is Databricks?,[https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html],"[https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/python_api/mlflow.deployments.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html]",0.666667,1,0.919721
2,How to serve a model on Databricks?,[https://mlflow.org/docs/latest/python_api/mlflow.deployments.html],"[https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html, https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html]",0.0,0,0.530721
3,How to enable MLflow Autologging for my workspace by default?,[https://mlflow.org/docs/latest/tracking/autolog.html],"[https://mlflow.org/docs/latest/tracking/autolog.html, https://mlflow.org/docs/latest/tracking/autolog.html, https://mlflow.org/docs/latest/tracking/autolog.html, https://mlflow.org/docs/latest/tracking/autolog.html]",1.0,1,1.0


### Evaluate the RAG system using `mlflow.evaluate()`

In this section, we'll delve into evaluating the Retrieval-Augmented Generation (RAG) systems using `mlflow.evaluate()`. This evaluation is crucial for assessing the effectiveness and efficiency of RAG systems in question-answering contexts. We focus on two key metrics: `relevance_metric` and `latency`.

#### Relevance Metric:
- **What It Measures**: The `relevance_metric` quantifies how relevant the RAG system's answers are to the input questions. This metric is critical for understanding the accuracy and contextual appropriateness of the system's responses.
- **Why It's Important**: In question-answering systems, relevance is paramount. The ability of a RAG system to provide accurate and contextually correct answers determines its utility and effectiveness in real-world applications, such as information retrieval and customer support.
- **Tutorial Context**: Within our tutorial, we utilize the `relevance_metric` to evaluate the quality of answers provided by the RAG system. It serves as a quantitative measure of the system's content accuracy, reflecting its capability to generate useful and precise responses.

#### Latency:
- **What It Measures**: The `latency` metric captures the response time of the RAG system. It measures the duration taken by the system to generate an answer after receiving a query.
- **Why It's Important**: Response time is a critical factor in user experience. In interactive systems, lower latency leads to a more efficient and satisfying user experience. High latency, conversely, can be detrimental to user satisfaction.
- **Tutorial Context**: In this tutorial, we assess the system's efficiency in terms of response time through the `latency` metric. This evaluation is vital for understanding the system's performance in a production environment, where timely responses are as important as their accuracy.
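
As a rough illustration of what a latency measurement involves, the sketch below times each call to a stand-in answer function with `time.perf_counter`. The `fake_rag_answer` and `timed_calls` functions are placeholders invented for this example; in the tutorial itself, `mlflow.metrics.latency()` takes care of the timing.

```python
import time
from typing import Callable, List, Tuple


def timed_calls(
    fn: Callable[[str], str], questions: List[str]
) -> List[Tuple[str, float]]:
    """Call fn on each question and record (answer, seconds elapsed)."""
    results = []
    for question in questions:
        start = time.perf_counter()
        answer = fn(question)
        results.append((answer, time.perf_counter() - start))
    return results


def fake_rag_answer(question: str) -> str:
    """Placeholder for a real RAG chain; sleeps briefly to simulate work."""
    time.sleep(0.01)
    return f"(answer to: {question})"


questions = ["What is MLflow?", "What is Databricks?"]
for answer, seconds in timed_calls(fake_rag_answer, questions):
    print(f"{seconds:.3f}s  {answer}")
```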

To start the evaluation, we create a simple function that runs each input question through the RAG chain:

In [17]:
def model(input_df):
    return input_df["questions"].map(qa).tolist()

### Create an evaluation dataset (Golden Dataset)

In [18]:
eval_df = pd.DataFrame(
    {
        "questions": [
            "What is MLflow?",
            "What is Databricks?",
            "How to serve a model on Databricks?",
            "How to enable MLflow Autologging for my workspace by default?",
        ],
    }
)
pretty_print(eval_df)

Unnamed: 0,questions
0,What is MLflow?
1,What is Databricks?
2,How to serve a model on Databricks?
3,How to enable MLflow Autologging for my workspace by default?


### Evaluate using LLM as a Judge and Basic Metrics

In this concluding section of the tutorial, we perform a final evaluation of our RAG system using MLflow's evaluation tools. This evaluation assesses both the answer quality and the efficiency of the question-answering model.

#### Key Steps in the Evaluation:

1. **Setting the Deployment Target**:
   - The deployment target is set to MLflow Deployments Server, enabling us to retrieve all available endpoints. This is essential for accessing our deployed models.

2. **Relevance Metric Setup**:
   - We initialize the `relevance` metric using a model hosted on MLflow Deployments Server. This metric assesses how relevant the answers generated by our RAG system are in response to the input questions.

3. **Running the Evaluation**:
   - An MLflow run is initiated, and `mlflow.evaluate()` is called to evaluate our RAG model against the prepared evaluation dataset.
   - The model is evaluated as a "question-answering" system using default evaluators.
   - Additional metrics, including the `relevance_metric` and `latency`, are specified. These metrics provide insights into the relevance of the answers and the response time of the model.
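   - The `evaluator_config` maps the `inputs` and `context` fields expected by the metrics to the `questions` and `source_documents` columns, so that the relevance metric scores each answer against the right question and retrieved documents.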

4. **Results and Metrics Display**:
   - The results of the evaluation, including key metrics, are displayed in a table format, providing a clear and structured view of the model's performance based on relevance and latency.

By assessing both the relevance of the answers and the latency of the responses, this evaluation step shows how effective and how efficient our RAG system is, and it guides any further optimization or deployment decisions.
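
To make the column mapping more concrete, the sketch below shows the kind of selection that `predictions="result"` and the `"context": "source_documents"` mapping describe. It is an illustration only, not MLflow's internal code. It assumes each chain output is a dictionary with `result` and `source_documents` keys, as the evaluation call in this tutorial suggests; `fake_qa` is a hypothetical stand-in for the real chain.

```python
def fake_qa(question):
    # Hypothetical stand-in for the RAG chain `qa`.
    # Assumption: the real chain returns a dict with these two keys.
    return {
        "result": f"Stub answer to: {question}",
        "source_documents": [f"stub document for: {question}"],
    }


questions = ["What is MLflow?", "What is Databricks?"]

# One output dictionary per input question, like the list returned by `model`
rows = [fake_qa(q) for q in questions]

# The answer text is taken from the field named by `predictions`
predictions = [row["result"] for row in rows]

# The retrieved documents are taken from the field mapped to `context`
contexts = [row["source_documents"] for row in rows]

for question, answer, context in zip(questions, predictions, contexts):
    print(question, "->", answer, "| context:", context)
```

If the chain used different key names, the `predictions` argument and the `col_mapping` entries would need to change to match them.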

In [19]:
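# Point MLflow at the local MLflow Deployments Server
mlflow.deployments.set_deployments_target("http://127.0.0.1:5000")
mlflow.deployments.get_deployments_target()

# LLM-as-a-judge relevance metric, scored by the model behind the "chat" endpoint
relevance_metric = relevance(model="endpoints:/chat")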

with mlflow.start_run():
    results = mlflow.evaluate(
        model,
        eval_df,
        model_type="question-answering",
        evaluators="default",
        predictions="result",
        extra_metrics=[relevance_metric, mlflow.metrics.latency()],
        evaluator_config={
            "col_mapping": {
                "inputs": "questions",
                "context": "source_documents",
            }
        },
    )
    print(results.metrics)

pretty_print(
    results.tables["eval_results_table"].drop(columns=["outputs", "source_documents"])
)

2024/06/08 07:04:41 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/06/08 07:05:06 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
100%|██████████| 1/1 [00:04<00:00,  4.82s/it]
100%|██████████| 4/4 [00:05<00:00,  1.26s/it]


{'latency/mean': 6.2395822405815125, 'latency/variance': 3.1304196862583176, 'latency/p90': 7.731202101707458, 'flesch_kincaid_grade_level/v1/mean': 9.25, 'flesch_kincaid_grade_level/v1/variance': 7.5024999999999995, 'flesch_kincaid_grade_level/v1/p90': 12.17, 'ari_grade_level/v1/mean': 12.375, 'ari_grade_level/v1/variance': 11.926875000000003, 'ari_grade_level/v1/p90': 15.920000000000002, 'relevance/v1/mean': 4.75, 'relevance/v1/variance': 0.1875, 'relevance/v1/p90': 5.0}


Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00,  1.73it/s]


Unnamed: 0,questions,latency,token_count,flesch_kincaid_grade_level/v1/score,ari_grade_level/v1/score,relevance/v1/score,relevance/v1/justification
0,What is MLflow?,3.343102,50,12.5,16.1,5,"The output provides a comprehensive answer to the question about what MLflow is. It uses the provided context effectively to explain that MLflow is an open-source platform designed to manage the machine learning lifecycle, making each stage manageable, traceable, and reproducible. The output is entirely relevant to the input and context."
1,What is Databricks?,7.509839,164,11.4,15.5,5,"The output provides a comprehensive answer to the question about what Databricks is. It uses the provided context effectively to explain the key features of Databricks, its usage, and its relevance to big data and machine learning. The output is highly relevant and directly addresses the input question."
2,How to serve a model on Databricks?,6.279316,156,6.9,9.5,5,"The output provides a comprehensive answer to the question about how to serve a model on Databricks. It uses the provided context effectively, detailing the steps required to serve a model on Databricks, and highlighting the difference between the production workspace and the free Community Edition. The output is highly relevant and directly addresses the input question."
3,How to enable MLflow Autologging for my workspace by default?,7.826072,162,6.2,8.4,4,"The output provides a relevant and accurate response to the question about enabling MLflow Autologging by default. It provides a step-by-step guide and also mentions the need for more specific information about the workspace and code structure. However, it could be improved by providing more specific examples or details about how to set an environment variable to always call it, which would make the response more comprehensive."
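
The aggregate values printed from `results.metrics` are consistent with the per-row scores in the table above: a mean, a population variance, and a 90th percentile computed with linear interpolation. The sketch below recomputes them from the rounded row values using only the standard library. It is a cross-check rather than MLflow's own code, and the `percentile` helper is a hypothetical function written for this example.

```python
import math
import statistics


def percentile(values, pct):
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Per-row values copied from the evaluation results table
per_row = {
    "latency": [3.343102, 7.509839, 6.279316, 7.826072],
    "relevance/v1": [5.0, 5.0, 5.0, 4.0],
}

summaries = {}
for name, scores in per_row.items():
    summaries[name] = {
        "mean": statistics.fmean(scores),
        "variance": statistics.pvariance(scores),  # population variance
        "p90": percentile(scores, 90),
    }
    print(name, {key: round(value, 4) for key, value in summaries[name].items()})
```

With only four rows, a single slow or low-scoring answer moves every aggregate noticeably, so the per-row table is the more informative view at this dataset size.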
