Author: Akshay Chougule

Originally Created On: Nov 19, 2023

Credit: Based on the [MLflow documentation](https://mlflow.org/docs/latest/model-evaluation/index.html) for experiment tracking

Goals of this notebook:
- Set up LangChain for a RAG-based system built on a web page or web document
- Learn experiment tracking with LLMs
- Learn to use SoTA LLMs to evaluate the responses of the LLM under evaluation (yes, an LLM evaluating an LLM)

In [1]:
#!pip install mlflow

In [4]:
import os
import pandas as pd
import mlflow




In [5]:
import sys
sys.path.insert(1, '/home/ubuntu/codebase/my_github/generative-ai-experiments/')
from Constants import OPENAI_API_KEY

In [8]:
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
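The key above is imported from a local `Constants` module on a machine-specific path. As a minimal sketch of an alternative, assuming the key has already been exported in the shell, a small helper can read it from the environment and fail early when it is missing. The helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return value


# Demonstration with a stand-in mapping instead of the real environment
fake_env = {"OPENAI_API_KEY": "sk-placeholder"}
api_key = require_env("OPENAI_API_KEY", fake_env)
print(api_key)
```

In the notebook itself the call would be `require_env("OPENAI_API_KEY")`, which avoids keeping the secret in a source file.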

In [6]:
from langchain.chains import RetrievalQA
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

In [9]:
loader = WebBaseLoader("https://mlflow.org/docs/latest/index.html")

documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)
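With `chain_type="stuff"`, the retrieved chunks are concatenated into a single prompt together with the question. The sketch below illustrates that idea in plain Python. The function name `build_stuff_prompt` and the prompt wording are assumptions made for illustration; LangChain's actual template is worded differently.

```python
def build_stuff_prompt(question, chunks):
    """Illustrative only: place every retrieved chunk into one context block."""
    # Chunks are joined with a blank line between them
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Stand-in chunk texts; in the notebook these would come from the retriever
prompt = build_stuff_prompt(
    "What is MLflow?",
    ["chunk one text", "chunk two text"],
)
print(prompt)
```

Because all chunks land in one prompt, the number of retrieved chunks multiplied by `chunk_size` has to fit within the model's context window.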

### Evaluate the RAG system using mlflow.evaluate()

Create a simple function that runs each input question through the RAG chain and returns the chain outputs.

In [17]:
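def model(input_df):
    # Run every question through the RAG chain and collect the chain outputs.
    # Each output is a dict; the answer text is stored under its "result" key.
    answer = []
    for _, row in input_df.iterrows():
        answer.append(qa(row["questions"]))

    return answer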
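Since the chain was built with `return_source_documents=True`, each call returns a dictionary with the keys `query`, `result`, and `source_documents`, which is why the evaluation call uses `predictions="result"` and maps `context` to `source_documents`. The following sketch shows that shape with a hypothetical stub `fake_qa` standing in for the real chain, so it runs without an API key.

```python
def fake_qa(question):
    # Hypothetical stand-in for `qa`; it only mirrors the keys the chain returns
    return {
        "query": question,
        "result": f"stub answer to: {question}",
        "source_documents": ["stub document"],
    }


def model_from_questions(questions):
    # Same loop as `model`, but over a plain list instead of a DataFrame
    return [fake_qa(q) for q in questions]


rows = model_from_questions(["What is MLflow?", "How to log_table()?"])

# The answer text is read from "result"; the grading context from "source_documents"
predictions = [row["result"] for row in rows]
contexts = [row["source_documents"] for row in rows]
print(predictions)
```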

In [10]:
# Create an eval dataset

eval_df = pd.DataFrame(
    {
        "questions": [
            "What is MLflow?",
            "How to run mlflow.evaluate()?",
            "How to log_table()?",
            "How to load_table()?",
        ],
    }
)

In [14]:
help(mlflow.metrics.genai)

Help on package mlflow.metrics.genai in mlflow.metrics:

NAME
    mlflow.metrics.genai

PACKAGE CONTENTS
    base
    genai_metric
    metric_definitions
    model_utils
    prompt_template
    prompts (package)
    utils

CLASSES
    builtins.object
        mlflow.metrics.genai.base.EvaluationExample
    
    class EvaluationExample(builtins.object)
     |  EvaluationExample(input: str, output: str, score: float, justification: str, grading_context: Union[Dict[str, str], str, NoneType] = None) -> None
     |  
     |  
     |  
     |  Stores the sample example during few shot learning during LLM evaluation
     |  
     |  :param input: The input provided to the model
     |  :param output: The output generated by the model
     |  :param score: The score given by the evaluator
     |  :param justification: The justification given by the evaluator
     |  :param grading_context: The grading_context provided to the evaluator for evaluation. Either
     |                          a dic

In [11]:
# Create a faithfulness metric

from mlflow.metrics.genai import faithfulness, EvaluationExample

# Create a good and bad example for faithfulness in the context of this problem
faithfulness_examples = [
    EvaluationExample(
        input="How do I disable MLflow autologging?",
        output="mlflow.autolog(disable=True) will disable autologging for all functions. In Databricks, autologging is enabled by default. ",
        score=2,
        justification="The output provides a working solution, using the mlflow.autolog() function that is provided in the context.",
        grading_context={
            "context": "mlflow.autolog(log_input_examples: bool = False, log_model_signatures: bool = True, log_models: bool = True, log_datasets: bool = True, disable: bool = False, exclusive: bool = False, disable_for_unsupported_versions: bool = False, silent: bool = False, extra_tags: Optional[Dict[str, str]] = None) → None[source] Enables (or disables) and configures autologging for all supported integrations. The parameters are passed to any autologging integrations that support them. See the tracking docs for a list of supported autologging integrations. Note that framework-specific configurations set at any point will take precedence over any configurations set by this function."
        },
    ),
    EvaluationExample(
        input="How do I disable MLflow autologging?",
        output="mlflow.autolog(disable=True) will disable autologging for all functions.",
        score=5,
        justification="The output provides a solution that is using the mlflow.autolog() function that is provided in the context.",
        grading_context={
            "context": "mlflow.autolog(log_input_examples: bool = False, log_model_signatures: bool = True, log_models: bool = True, log_datasets: bool = True, disable: bool = False, exclusive: bool = False, disable_for_unsupported_versions: bool = False, silent: bool = False, extra_tags: Optional[Dict[str, str]] = None) → None[source] Enables (or disables) and configures autologging for all supported integrations. The parameters are passed to any autologging integrations that support them. See the tracking docs for a list of supported autologging integrations. Note that framework-specific configurations set at any point will take precedence over any configurations set by this function."
        },
    ),
]

faithfulness_metric = faithfulness(model="openai:/gpt-4", examples=faithfulness_examples)
print(faithfulness_metric)


EvaluationMetric(name=faithfulness, greater_is_better=True, long_name=faithfulness, version=v1, metric_details=
Task:
You must return the following fields in your response one below the other:
score: Your numerical score for the model's faithfulness based on the rubric
justification: Your step-by-step reasoning about the model's faithfulness score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called faithfulness based on the input and output.
A definition of faithfulness and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.

Inpu

In [15]:
from mlflow.metrics.genai import relevance

relevance_metric = relevance(model="openai:/gpt-4")
print(relevance_metric)

EvaluationMetric(name=relevance, greater_is_better=True, long_name=relevance, version=v1, metric_details=
Task:
You must return the following fields in your response one below the other:
score: Your numerical score for the model's relevance based on the rubric
justification: Your step-by-step reasoning about the model's relevance score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called relevance based on the input and output.
A definition of relevance and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.

Input:
{input}

Output

In [18]:
results = mlflow.evaluate(
    model,
    eval_df,
    model_type="question-answering",
    evaluators="default",
    predictions="result",
    extra_metrics=[faithfulness_metric, relevance_metric, mlflow.metrics.latency()],
    evaluator_config={
        "col_mapping": {
            "inputs": "questions",
            "context": "source_documents",
        }
    },
)
print(results.metrics)

2023/11/19 17:42:45 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/11/19 17:42:45 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/11/19 17:42:51 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...



2023/11/19 17:43:00 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/11/19 17:43:00 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/11/19 17:43:00 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/11/19 17:43:00 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/11/19 17:43:00 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
2023/11/19 17:43:00 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: faithfulness



2023/11/19 17:43:23 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: relevance



{'faithfulness/v1/mean': 4.0, 'faithfulness/v1/variance': 3.0, 'faithfulness/v1/p90': 5.0, 'relevance/v1/mean': 4.5, 'relevance/v1/variance': 0.25, 'relevance/v1/p90': 5.0}
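The aggregate metrics can be reproduced by hand from the four per-row scores reported in `eval_results_table` (faithfulness 5, 5, 1, 5; relevance 5, 4, 5, 4). The sketch below assumes the variance is the population variance and that p90 uses linear interpolation between ranks. Both are assumptions about how the aggregation works, chosen because they match the printed numbers.

```python
from statistics import mean, pvariance

# Per-row scores copied from the eval results table
faithfulness_scores = [5, 5, 1, 5]
relevance_scores = [5, 4, 5, 4]


def p90(values):
    # 90th percentile with linear interpolation between neighbouring ranks
    ordered = sorted(values)
    rank = 0.9 * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


summary = {
    "faithfulness/v1/mean": mean(faithfulness_scores),
    "faithfulness/v1/variance": pvariance(faithfulness_scores),
    "faithfulness/v1/p90": p90(faithfulness_scores),
    "relevance/v1/mean": mean(relevance_scores),
    "relevance/v1/variance": pvariance(relevance_scores),
    "relevance/v1/p90": p90(relevance_scores),
}
print(summary)
```

The faithfulness variance of 3.0 is driven entirely by the single row scored 1, which is a reminder that a mean of 4.0 over four questions can hide one badly grounded answer.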


In [19]:
results.tables["eval_results_table"]


| | questions | outputs | query | source_documents | latency | token_count | faithfulness/v1/score | faithfulness/v1/justification | relevance/v1/score | relevance/v1/justification |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What is MLflow? | MLflow is an open-source platform, purpose-bu... | What is MLflow? | [{'lc_attributes': {}, 'lc_secrets': {}, 'meta... | 1.742607 | 53 | 5 | The output provided by the model is completely... | 5 | The output provides a comprehensive answer to ... |
| 1 | How to run mlflow.evaluate()? | mlflow.evaluate() is an API that allows you t... | How to run mlflow.evaluate()? | [{'lc_attributes': {}, 'lc_secrets': {}, 'meta... | 1.733826 | 61 | 5 | The output provided by the model is completely... | 4 | The output provides a relevant and accurate ex... |
| 2 | How to log_table()? | You can log a table with MLflow using the log... | How to log_table()? | [{'lc_attributes': {}, 'lc_secrets': {}, 'meta... | 1.221808 | 32 | 1 | The output claims that you can log a table wit... | 5 | The output provides a comprehensive answer to ... |
| 3 | How to load_table()? | You can't load_table() with MLflow. MLflow is... | How to load_table()? | [{'lc_attributes': {}, 'lc_secrets': {}, 'meta... | 1.029105 | 27 | 5 | The output claims that MLflow is a tool for ma... | 4 | The output provides a relevant and accurate re... |
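The table shows that relevance and faithfulness can disagree: the `log_table()` answer scores 5 for relevance but only 1 for faithfulness. A minimal sketch for surfacing such rows is given below. It uses plain dictionaries holding the scores from the table, and the threshold of 3 is an arbitrary assumption for illustration rather than an MLflow default.

```python
# Scores copied from the eval results table
rows = [
    {"question": "What is MLflow?", "faithfulness": 5, "relevance": 5},
    {"question": "How to run mlflow.evaluate()?", "faithfulness": 5, "relevance": 4},
    {"question": "How to log_table()?", "faithfulness": 1, "relevance": 5},
    {"question": "How to load_table()?", "faithfulness": 5, "relevance": 4},
]

# Assumed cut-off: anything below 3 is treated as poorly grounded
FAITHFULNESS_THRESHOLD = 3

flagged = [
    row["question"]
    for row in rows
    if row["faithfulness"] < FAITHFULNESS_THRESHOLD
]
print(flagged)
```

Rows flagged this way are good candidates for reading the full `faithfulness/v1/justification` text and inspecting the retrieved `source_documents`.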
