Author: Akshay Chougule

Originally Created On: Nov 19, 2023

Credit: Based on the [MLflow documentation](https://mlflow.org/docs/latest/model-evaluation/index.html) on model evaluation and experiment tracking

Goals of the notebook:
- Set up LangChain for a RAG-based system over a web page or web document
- Learn experiment tracking with LLMs
- Learn to use SoTA LLMs to evaluate the responses of the LLM under evaluation (yes, an LLM evaluating an LLM)

In [1]:
import os
import pandas as pd
import mlflow


* 'schema_extra' has been renamed to 'json_schema_extra'


In [2]:
import sys
sys.path.insert(1, '/home/ubuntu/codebase/my_github/generative-ai-experiments/')
from Constants import OPENAI_API_KEY

In [3]:
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [4]:
from langchain.chains import RetrievalQA
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

#### Since the GLP-1 trials are generating a lot of buzz, let's look at one of them.

In [5]:
loader = WebBaseLoader("https://classic.clinicaltrials.gov/ct2/show/NCT02496611")

documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)

Created a chunk of size 2376, which is longer than the specified 1000
Created a chunk of size 1744, which is longer than the specified 1000
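The two warnings above appear because `CharacterTextSplitter` splits on a single separator (a blank line by default) and then merges the pieces up to `chunk_size`. A piece that contains no separator cannot be cut any further, so it is emitted oversized. The sketch below is a simplified pure-Python illustration of that behaviour, not LangChain's actual implementation; `split_on_separator` is a hypothetical helper, the sample text is made up, and chunk overlap is ignored.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    Simplified illustration only: a piece that is itself longer than
    chunk_size is kept whole, because there is no separator to cut it on.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may already exceed chunk_size
    if current:
        chunks.append(current)
    return chunks


# Made-up sample: the middle paragraph has 30 characters and no blank line.
sample = "short paragraph" + "\n\n" + "x" * 30 + "\n\n" + "tail"
chunks = split_on_separator(sample, chunk_size=20)
oversized = [len(c) for c in chunks if len(c) > 20]
print(chunks)
print(oversized)
```

If oversized chunks are a problem for retrieval, one option is LangChain's `RecursiveCharacterTextSplitter`, which tries a sequence of separators rather than just one.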


### Evaluate the RAG system using mlflow.evaluate()

Create a simple function that runs each input through the RAG chain

In [13]:
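def model(input_df):
    """Run every question in input_df through the RAG chain and collect the raw outputs."""
    answer = []
    for index, row in input_df.iterrows():
        # Each call returns a dict; the generated answer is stored under "result".
        answer.append(qa(row["questions"]))

    return answer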
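To see what `model` returns without calling the OpenAI API, the sketch below swaps the chain for a hypothetical stub, `fake_qa`. The stub mimics the dictionary a `RetrievalQA` chain returns when `return_source_documents=True` (keys `query`, `result`, `source_documents`). Plain dictionaries stand in for DataFrame rows, and the question and answer text are placeholders, not real model output.

```python
def fake_qa(question):
    """Hypothetical stand-in for the RetrievalQA chain (makes no API call)."""
    return {
        "query": question,
        "result": f"placeholder answer to: {question}",
        "source_documents": ["placeholder source chunk"],
    }


def run_model(rows, chain):
    """Mirror of `model` above, taking an iterable of row-like dicts."""
    return [chain(row["questions"]) for row in rows]


# Illustrative question only; it is not part of the evaluation set.
rows = [{"questions": "Is this a randomized study?"}]
answers = run_model(rows, fake_qa)
print(sorted(answers[0].keys()))
```

Each returned element carries the answer under `result`, which is the key that `predictions="result"` selects in the `mlflow.evaluate()` call later in this notebook.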

In [14]:
# Create an eval dataset

eval_df = pd.DataFrame(
    {
        "questions": [
            "What type of masking or binding is provided for this study?",
            "Can older adults with age over 65 were part of this study?",
            "Were patients treated with growth harmones be part of this study?",
            "Are there any publications associated with this study?",
        ],
    }
)

In [8]:
mlflow.metrics.__dir__()

['__name__',
 '__doc__',
 '__package__',
 '__loader__',
 '__spec__',
 '__path__',
 '__file__',
 '__cached__',
 '__builtins__',
 'base',
 'MetricValue',
 'metric_definitions',
 '_accuracy_eval_fn',
 '_ari_eval_fn',
 '_f1_score_eval_fn',
 '_flesch_kincaid_eval_fn',
 '_mae_eval_fn',
 '_mape_eval_fn',
 '_max_error_eval_fn',
 '_mse_eval_fn',
 '_precision_at_k_eval_fn',
 '_precision_eval_fn',
 '_r2_score_eval_fn',
 '_recall_at_k_eval_fn',
 '_recall_eval_fn',
 '_rmse_eval_fn',
 '_rouge1_eval_fn',
 '_rouge2_eval_fn',
 '_rougeL_eval_fn',
 '_rougeLsum_eval_fn',
 '_token_count_eval_fn',
 '_toxicity_eval_fn',
 'EvaluationMetric',
 'make_metric',
 'experimental',
 'latency',
 'token_count',
 'toxicity',
 'flesch_kincaid_grade_level',
 'ari_grade_level',
 'exact_match',
 'rouge1',
 'rouge2',
 'rougeL',
 'rougeLsum',
 'precision_at_k',
 'recall_at_k',
 'mae',
 'mse',
 'rmse',
 'r2_score',
 'max_error',
 'mape',
 'recall_score',
 'precision_score',
 'f1_score',
 '__all__']
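`__dir__()` lists private helpers (the underscore-prefixed names) alongside the public metrics. Below is a minimal sketch for keeping only the public names, demonstrated on the standard-library `math` module so that it runs without MLflow installed. `public_names` is a hypothetical helper; the same call could be pointed at `mlflow.metrics`.

```python
import math


def public_names(module):
    """Return the attribute names of `module` that do not start with an underscore."""
    return sorted(name for name in dir(module) if not name.startswith("_"))


names = public_names(math)
print(names[:5])
```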

In [10]:
# Create a faithfulness metric

from mlflow.metrics.genai import faithfulness, EvaluationExample

# Create a bad and a good example of faithfulness in the context of this problem
faithfulness_examples = [
    EvaluationExample(
        input="Can a 25-year-old pregnant woman be part of this study?",
        output="Yes, it would be really beneficial for a 25-year-old pregnant woman, since this study can help with better neurological development of the fetus.",
        score=1,
        justification="The output provides incorrect information and cites reasons that are not part of the source.",
        grading_context={
            "context": "Exclusion Criteria. Females: currently pregnant, planning to become pregnant, or unwilling to use 2 or more acceptable methods of contraception when engaging in sexual activity throughout the study. Inclusion criteria: 12-17 years old"
        },
    ),
    EvaluationExample(
        input="Can a 25-year-old pregnant woman be part of this study?",
        output="A pregnant 25-year-old cannot be part of this trial, as she would not be eligible due to the exclusion criteria prohibiting females who are currently pregnant, planning to become pregnant, or unwilling to use 2 or more acceptable methods of contraception when engaging in sexual activity throughout the study. Additionally, there is an age restriction of 12 to 17 years, which would also exclude her from participating in the study.",
        score=5,
        justification="Every claim in the output is directly supported by the exclusion and inclusion criteria given in the context.",
        grading_context={
            "context": "Exclusion Criteria. Females: currently pregnant, planning to become pregnant, or unwilling to use 2 or more acceptable methods of contraception when engaging in sexual activity throughout the study. Inclusion criteria. 12-17 years old"
        },
    ),
]

faithfulness_metric = faithfulness(model="openai:/gpt-4", examples=faithfulness_examples)
print(faithfulness_metric)


EvaluationMetric(name=faithfulness, greater_is_better=True, long_name=faithfulness, version=v1, metric_details=
Task:
You must return the following fields in your response one below the other:
score: Your numerical score for the model's faithfulness based on the rubric
justification: Your step-by-step reasoning about the model's faithfulness score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called faithfulness based on the input and output.
A definition of faithfulness and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.

Inpu

In [11]:
from mlflow.metrics.genai import relevance, EvaluationExample

relevance_metric = relevance(model="openai:/gpt-4")
print(relevance_metric)

EvaluationMetric(name=relevance, greater_is_better=True, long_name=relevance, version=v1, metric_details=
Task:
You must return the following fields in your response one below the other:
score: Your numerical score for the model's relevance based on the rubric
justification: Your step-by-step reasoning about the model's relevance score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called relevance based on the input and output.
A definition of relevance and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.

Input:
{input}

Output

In [15]:
results = mlflow.evaluate(
    model,
    eval_df,
    model_type="question-answering",
    evaluators="default",
    predictions="result",
    extra_metrics=[faithfulness_metric, relevance_metric, mlflow.metrics.latency()],
    evaluator_config={
        "col_mapping": {
            "inputs": "questions",
            "context": "source_documents",
        }
    },
)
print(results.metrics)

2023/11/19 18:56:07 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/11/19 18:56:07 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/11/19 18:56:11 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


2023/11/19 18:56:20 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/11/19 18:56:20 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/11/19 18:56:20 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/11/19 18:56:20 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/11/19 18:56:20 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
2023/11/19 18:56:20 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: faithfulness


2023/11/19 18:56:38 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: relevance


{'faithfulness/v1/mean': 5.0, 'faithfulness/v1/variance': 0.0, 'faithfulness/v1/p90': 5.0, 'relevance/v1/mean': 5.0, 'relevance/v1/variance': 0.0, 'relevance/v1/p90': 5.0}
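These aggregate keys follow from the per-row scores: all four answers were scored 5 on both metrics (see the results table below), which gives a mean of 5.0, a variance of 0.0 and a 90th percentile of 5.0. The sketch below is a minimal check with the standard library. It assumes population variance and a linearly interpolated percentile, because the exact formulas MLflow uses are not shown in this notebook. `percentile` is a hypothetical helper.

```python
import statistics


def percentile(values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of a non-empty list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


scores = [5, 5, 5, 5]  # per-row faithfulness scores from the results table
summary = {
    "mean": statistics.mean(scores),
    "variance": statistics.pvariance(scores),
    "p90": percentile(scores, 90),
}
print(summary)
```

With only four questions and identical scores, these aggregates say little about how the chain would behave on harder questions.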


In [16]:
results.tables["eval_results_table"]

Unnamed: 0,questions,outputs,query,source_documents,latency,token_count,faithfulness/v1/score,faithfulness/v1/justification,relevance/v1/score,relevance/v1/justification
0,What type of masking or binding is provided fo...,"Quadruple (Participant, Care Provider, Invest...",What type of masking or binding is provided fo...,"[{'lc_attributes': {}, 'lc_secrets': {}, 'meta...",1.112689,16,5,"The output provided by the model is ""Quadruple...",5,The output directly and comprehensively answer...
1,Can older adults with age over 65 were part of...,"No, this study was only open to participants ...",Can older adults with age over 65 were part of...,"[{'lc_attributes': {}, 'lc_secrets': {}, 'meta...",0.81785,15,5,The output from the model states that the stud...,5,The output accurately and comprehensively answ...
2,Were patients treated with growth harmones be ...,"No, patients were not treated with growth hor...",Were patients treated with growth harmones be ...,"[{'lc_attributes': {}, 'lc_secrets': {}, 'meta...",1.023118,13,5,"The output claim that ""patients were not treat...",5,The output accurately and comprehensively answ...
3,Are there any publications associated with thi...,"Yes, there is a Study Protocol and Statistica...",Are there any publications associated with thi...,"[{'lc_attributes': {}, 'lc_secrets': {}, 'meta...",0.835928,20,5,The output claims that there is a Study Protoc...,5,The output directly and comprehensively answer...


In [19]:
# Show the full cell text instead of truncating it (None means no width limit)
pd.set_option('display.max_colwidth', None)

In [20]:
results.tables["eval_results_table"][['questions','outputs','query']]

Unnamed: 0,questions,outputs,query
0,What type of masking or binding is provided for this study?,"Quadruple (Participant, Care Provider, Investigator, Outcomes Assessor)",What type of masking or binding is provided for this study?
1,Can older adults with age over 65 were part of this study?,"No, this study was only open to participants aged 12-17.",Can older adults with age over 65 were part of this study?
2,Were patients treated with growth harmones be part of this study?,"No, patients were not treated with growth hormones in this study.",Were patients treated with growth harmones be part of this study?
3,Are there any publications associated with this study?,"Yes, there is a Study Protocol and Statistical Analysis Plan PDF document provided by the University of Minnesota.",Are there any publications associated with this study?


In [27]:
results.tables["eval_results_table"].iloc[0:4,4:9]

Unnamed: 0,latency,token_count,faithfulness/v1/score,faithfulness/v1/justification,relevance/v1/score
0,1.112689,16,5,"The output provided by the model is ""Quadruple (Participant, Care Provider, Investigator, Outcomes Assessor)"" in response to the input asking about the type of masking or binding provided for the study. This output is directly supported by the provided context, specifically in the document where it states ""Masking: Quadruple (Participant, Care Provider, Investigator, Outcomes Assessor)"". Therefore, all of the claims in the output are directly supported by the provided context, demonstrating high faithfulness to the provided context.",5
1,0.81785,15,5,"The output from the model states that the study was only open to participants aged 12-17. This is directly supported by the provided context, which states ""Ages Eligible for Study: 12 Years to 17 Years (Child)"". Therefore, all of the claims in the output are directly supported by the provided context, demonstrating high faithfulness to the provided context.",5
2,1.023118,13,5,"The output claim that ""patients were not treated with growth hormones in this study"" is directly supported by the provided context. The context documents describe the study as focusing on weight loss treatments, specifically GLP-1RA (exenatide) and placebo, with no mention of growth hormones being used. Therefore, the model's output is entirely faithful to the context.",5
3,0.835928,20,5,"The output claims that there is a Study Protocol and Statistical Analysis Plan PDF document provided by the University of Minnesota. This claim is directly supported by the provided context, specifically in the document that mentions ""Documents provided by University of Minnesota: Study Protocol and Statistical Analysis Plan [PDF] November 13, 2019"". Therefore, all of the claims in the output are directly supported by the provided context, demonstrating high faithfulness to the provided context.",5
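The latency and token-count columns can be summarised in a few lines. The sketch below uses the four recorded values from the table above; latency is assumed to be reported in seconds.

```python
import statistics

# Values copied from the results table above.
latencies = [1.112689, 0.81785, 1.023118, 0.835928]
token_counts = [16, 15, 13, 20]

mean_latency = statistics.mean(latencies)
slowest_row = latencies.index(max(latencies))
total_tokens = sum(token_counts)

print(f"mean latency: {mean_latency:.3f}")
print(f"slowest row: {slowest_row}")
print(f"total tokens: {total_tokens}")
```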
