# Evaluation of the fine-tuned and baseline models with the RAFT-generated eval dataset split

In this notebook, we use the evaluation dataset synthetically generated with the RAFT method in the [1_gen.ipynb](./1_gen.ipynb) notebook to evaluate both the baseline model and the fine-tuned model, then compare the two to analyse the impact of fine-tuning.

We introduce the `promptflow-evals` package and built-in evaluators. Then, we'll demonstrate how to use the `evaluate` API to assess data using these evaluators.

Finally, we'll draw a diagram showing the performance of the fine-tuned model against the baseline.

## Overview

- Testing
  - Run the baseline model on the evaluation split to get its predictions.
  - Run the fine-tuned model on the evaluation split to get its predictions.
- Answers formatting
  - Convert the baseline model answers to a format suitable for testing.
  - Convert the fine-tuned model answers to a format suitable for testing.
- Evaluation
  - Calculate the metrics (here, similarity and groundedness) based on the predictions from the baseline model.
  - Calculate the same metrics based on the predictions from the fine-tuned model.
- Compare metrics
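
The final "Compare metrics" step can be sketched as a small helper that lines up two metric dictionaries and computes the difference. This is a minimal sketch, not the notebook's own comparison code: the function name `compare_metrics` is hypothetical, the metric key names are an assumption about what the `evaluate` API returns, and the numbers are placeholders rather than results from this run.

```python
def compare_metrics(baseline: dict, finetuned: dict) -> dict:
    """Return {metric: (baseline, finetuned, delta)} for metrics present in both."""
    shared = sorted(set(baseline) & set(finetuned))
    return {
        name: (baseline[name], finetuned[name], finetuned[name] - baseline[name])
        for name in shared
    }


# Placeholder values for illustration only; they are NOT results from this notebook.
# The key names are an assumption about the shape of result["metrics"].
example_baseline = {"similarity.gpt_similarity": 2.0, "groundedness.gpt_groundedness": 3.0}
example_finetuned = {"similarity.gpt_similarity": 3.0, "groundedness.gpt_groundedness": 3.5}

for name, (base, tuned, delta) in compare_metrics(example_baseline, example_finetuned).items():
    print(f"{name}: baseline={base:.2f} finetuned={tuned:.2f} delta={delta:+.2f}")
```

In the notebook, the two arguments would be the `metrics` entries of the results returned for the baseline and fine-tuned runs.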

## Installing requirements

The requirements should have been installed automatically if you opened the project in a Dev Container or in Codespaces. If not, uncomment the following cell to install them.

In [1]:
#! pip install promptflow-evals

## Running time and cost

The RAFT evaluation script usually takes a few minutes on the default sample document but can take days on bigger domains depending on the number and size of documents and the number of questions being generated for each chunk.

The cost of running this RAFT script on the sample document should be a few dollars. But beware, running it on bigger domains can cost hundreds of dollars if not more.

## Testing

### Define variables we will need

In [29]:
import os
from dotenv import load_dotenv

# User provided values
load_dotenv(".env")

# Variables passed by previous notebooks
load_dotenv(".env.state")

# Capture the initial working directory because the evaluate function will change it
# (named initial_dir to avoid shadowing the built-in dir())
initial_dir = os.getcwd()

experiment_name = os.getenv("DATASET_NAME")
experiment_dir = f"{initial_dir}/dataset/{experiment_name}-files"

# Dataset generated by the gen notebook that we will evaluate the baseline and finetuned models on
dataset_path_hf_eval = f"{experiment_dir}/{experiment_name}-hf.eval.jsonl"

# Evaluated answer files
dataset_path_hf_eval_answer = f"{experiment_dir}/{experiment_name}-hf.eval.answer.jsonl"
dataset_path_hf_eval_answer_baseline = f"{experiment_dir}/{experiment_name}-hf.eval.answer.baseline.jsonl"

# Formatted answer evaluation files
dataset_path_eval_answer_finetuned = f"{experiment_dir}/{experiment_name}-eval.answer.finetuned.jsonl"
dataset_path_eval_answer_baseline = f"{experiment_dir}/{experiment_name}-eval.answer.baseline.jsonl"

# Scored answer files
dataset_path_eval_answer_score_finetuned = f"{experiment_dir}/{experiment_name}-eval.answer.score.finetuned.jsonl"
dataset_path_eval_answer_score_baseline = f"{experiment_dir}/{experiment_name}-eval.answer.score.baseline.jsonl"

BASELINE_OPENAI_DEPLOYMENT = os.getenv("BASELINE_OPENAI_DEPLOYMENT")
FINETUNED_OPENAI_DEPLOYMENT = os.getenv("FINETUNED_OPENAI_DEPLOYMENT")
FINETUNED_MODEL_FORMAT = os.getenv("FINETUNED_MODEL_FORMAT")

print(f"Evaluating the finetuned {FINETUNED_MODEL_FORMAT} model {FINETUNED_OPENAI_DEPLOYMENT} against the baseline model {BASELINE_OPENAI_DEPLOYMENT}")

Evaluating the finetuned chat model ft-job-finetune-registered-3838 against the baseline model raft-baseline-llama-2-7b-chat
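
Every later cell depends on values loaded from `.env` and `.env.state`, so it can help to fail early when one of them is missing. The following is a minimal sketch: the helper name `missing_env_vars` is hypothetical, and the list only contains the variables read in the cell above.

```python
import os

# Variables read by the cell above; extend the list as needed.
REQUIRED_VARS = [
    "DATASET_NAME",
    "BASELINE_OPENAI_DEPLOYMENT",
    "FINETUNED_OPENAI_DEPLOYMENT",
    "FINETUNED_MODEL_FORMAT",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing an explicit mapping as `env` makes the helper easy to check without touching the real environment.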


### Run the baseline model on the evaluation split

In [3]:
!env $(cat .env .env.state) python .gorilla/raft/eval.py \
    --question-file $dataset_path_hf_eval \
    --answer-file $dataset_path_hf_eval_answer_baseline \
    --model $BASELINE_OPENAI_DEPLOYMENT \
    --env-prefix BASELINE \
    --mode chat

cat: .env: No such file or directory
2024-09-13 17:22:51  INFO [    ] eval Using model: raft-baseline-llama-2-7b-chat
2024-09-13 17:22:51  INFO [    ] eval Using mode: chat
2024-09-13 17:22:51  INFO [    ] eval Using prompt key: instruction
2024-09-13 17:22:51  INFO [    ] eval Using answer key: answer
2024-09-13 17:22:51  INFO [    ] env_config Resolved OpenAI env vars with 'BASELINE' prefix:
2024-09-13 17:22:51  INFO [    ] env_config  - OPENAI_TYPE=azure
2024-09-13 17:22:51  INFO [    ] env_config  - OPENAI_BASE_URL=https://raft-baseline-llama-2-7b-chat.westus3.models.ai.azure.com
2024-09-13 17:22:51  INFO [    ] env_config  - OPENAI_DEPLOYMENT=raft-baseline-llama-2-7b-chat
2024-09-13 17:22:51  INFO [    ] env_config  - OPENAI

In [22]:
import pandas as pd
pd.read_json(dataset_path_hf_eval_answer_baseline, lines=True).head(2)

|   | id | type | question | context | oracle_context | cot_answer | instruction | answer |
|---|----|------|----------|---------|----------------|------------|-------------|--------|
| 0 | 06d347e6-9663-4a8d-a5c2-148ddfce46e1 | general | What type of surfing focuses on elegance and d... | {'sentences': [['Done for both exhibition and ... | Done for both exhibition and competitionsthe g... | To answer the question, we need to identify th... | <DOCUMENT>Done for both exhibition and competi... | Based on the text, it appears that the type of... |
| 1 | 98c2b7cd-1026-4eff-9cf1-3afebb73e552 | general | Is the emerging material lighter than traditio... | {'sentences': [['An emerging board material is... | An emerging board material is epoxy resinand E... | To answer the question, we need to determine i... | <DOCUMENT>An emerging board material is epoxy ... | Yes, according to the text, the emerging mater... |


### Run the fine-tuned model on the evaluation split

In [28]:
!env $(cat .env .env.state) python .gorilla/raft/eval.py \
    --question-file $dataset_path_hf_eval \
    --answer-file $dataset_path_hf_eval_answer \
    --model $FINETUNED_OPENAI_DEPLOYMENT \
    --env-prefix FINETUNED \
    --mode $FINETUNED_MODEL_FORMAT

cat: .env: No such file or directory
2024-09-13 17:34:40  INFO [    ] eval Using model: ft-job-finetune-registered-3838
2024-09-13 17:34:40  INFO [    ] eval Using mode: chat
2024-09-13 17:34:40  INFO [    ] eval Using prompt key: instruction
2024-09-13 17:34:40  INFO [    ] eval Using answer key: answer
2024-09-13 17:34:40  INFO [    ] env_config Resolved OpenAI env vars with 'FINETUNED' prefix:
2024-09-13 17:34:40  INFO [    ] env_config  - OPENAI_TYPE=azure
2024-09-13 17:34:40  INFO [    ] env_config  - OPENAI_API_KEY=............................uwnu
2024-09-13 17:34:40  INFO [    ] env_config  - OPENAI_DEPLOYMENT=ft-job-finetune-registered-3838
2024-09-13 17:34:40  INFO [    ] env_config  - OPENAI_BASE_URL=https://ft-job-fine

In [30]:
import pandas as pd
pd.read_json(dataset_path_hf_eval_answer, lines=True).head(2)

|   | id | type | question | context | oracle_context | cot_answer | instruction | answer |
|---|----|------|----------|---------|----------------|------------|-------------|--------|
| 0 | 98c2b7cd-1026-4eff-9cf1-3afebb73e552 | general | Is the emerging material lighter than traditio... | {'sentences': [['An emerging board material is... | An emerging board material is epoxy resinand E... | To answer the question, we need to determine i... | <DOCUMENT>An emerging board material is epoxy ... | To answer the question, we need to identify wh... |
| 1 | 06d347e6-9663-4a8d-a5c2-148ddfce46e1 | general | What type of surfing focuses on elegance and d... | {'sentences': [['Done for both exhibition and ... | Done for both exhibition and competitionsthe g... | To answer the question, we need to identify th... | <DOCUMENT>Done for both exhibition and competi... | To answer the question, we need to identify th... |
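
The two answer files list the same questions in a different row order (compare the `id` values in the two previews), so a side-by-side comparison has to join on `id` rather than on row position. Here is a minimal sketch using only the standard library; the helper names `load_jsonl` and `pair_answers_by_id` are hypothetical, and the column names are the ones shown in the previews above.

```python
import json


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def pair_answers_by_id(baseline_rows, finetuned_rows):
    """Pair baseline and fine-tuned answers that share the same question id."""
    finetuned_by_id = {row["id"]: row for row in finetuned_rows}
    pairs = []
    for row in baseline_rows:
        match = finetuned_by_id.get(row["id"])
        if match is not None:
            pairs.append({
                "id": row["id"],
                "question": row["question"],
                "baseline_answer": row["answer"],
                "finetuned_answer": match["answer"],
            })
    return pairs
```

In this notebook it could be called as `pair_answers_by_id(load_jsonl(dataset_path_hf_eval_answer_baseline), load_jsonl(dataset_path_hf_eval_answer))`.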


## Answers formatting

### Format baseline answers

Convert the baseline model answers to a format suitable for testing.

In [31]:
! python .gorilla/raft/format.py \
    --input $dataset_path_hf_eval_answer_baseline \
    --input-type jsonl \
    --output $dataset_path_eval_answer_baseline \
    --output-format eval

2024-09-13 17:39:15  INFO [    ] raft Dataset has 125 rows
2024-09-13 17:39:15  INFO [    ] raft Converting jsonl file /workspaces/llama-raft-recipe/dataset/surfing-1k-files/surfing-1k-hf.eval.answer.baseline.jsonl to jsonl eval file /workspaces/llama-raft-recipe/dataset/surfing-1k-files/surfing-1k-eval.answer.baseline.jsonl
Creating json from Arrow format: 100%|███████████| 1/1 [00:00<00:00, 174.52ba/s]


### Format fine-tuned model answers

Convert the fine-tuned model answers to a format suitable for testing.

In [32]:
! python .gorilla/raft/format.py \
    --input $dataset_path_hf_eval_answer \
    --input-type jsonl \
    --output $dataset_path_eval_answer_finetuned \
    --output-format eval

Generating train split: 125 examples [00:00, 14493.90 examples/s]
2024-09-13 17:39:18  INFO [    ] raft Dataset has 125 rows
2024-09-13 17:39:18  INFO [    ] raft Converting jsonl file /workspaces/llama-raft-recipe/dataset/surfing-1k-files/surfing-1k-hf.eval.answer.jsonl to jsonl eval file /workspaces/llama-raft-recipe/dataset/surfing-1k-files/surfing-1k-eval.answer.finetuned.jsonl
Filter out empty examples: 100%|█████| 125/125 [00:00<00:00, 7838.30 examples/s]
Map: 100%|███████████████████████████| 125/125 [00:00<00:00, 3480.89 examples/s]
Map: 100%|███████████████████████████| 125/125 [00:00<00:00, 8434.90 examples/s]
Map: 100%|███████████████████████████| 125/125 [00:00<00:00, 8320.31 examples/s]
Creating json from Arrow format: 100%|███████████| 1/1 [00:00<00:00, 235.23ba/s]


## Let's review the formatted files

### Fine-tuned model answers

In [33]:
import pandas as pd

In [34]:
pd.read_json(dataset_path_eval_answer_finetuned, lines=True).head(2)

|   | question | answer | gold_final_answer | final_answer | context |
|---|----------|--------|-------------------|--------------|---------|
| 0 | Is the emerging material lighter than traditio... | To answer the question, we need to identify wh... | Yes | Yes | <DOCUMENT>An emerging board material is epoxy ... |
| 1 | What type of surfing focuses on elegance and d... | To answer the question, we need to identify th... | Longboard surfing | Exhibition surfing | <DOCUMENT>Done for both exhibition and competi... |


### Baseline model answers

In [35]:
pd.read_json(dataset_path_eval_answer_baseline, lines=True).head(2)

|   | question | answer | gold_final_answer | final_answer | context |
|---|----------|--------|-------------------|--------------|---------|
| 0 | What type of surfing focuses on elegance and d... | Based on the text, it appears that the type of... | Longboard surfing | Based on the text, it appears that the type of... | <DOCUMENT>Done for both exhibition and competi... |
| 1 | Is the emerging material lighter than traditio... | Yes, according to the text, the emerging mater... | Yes | Yes, according to the text, the emerging mater... | <DOCUMENT>An emerging board material is epoxy ... |


## Evaluation

### Built-in Evaluators

The table below lists all the built-in evaluators we support. In the following sections, we will select a few of these evaluators to demonstrate how to use them.

| Category       | Namespace                                        | Evaluator Class           | Notes                                             |
|----------------|--------------------------------------------------|---------------------------|---------------------------------------------------|
| Quality        | promptflow.evals.evaluators                      | GroundednessEvaluator     | Measures how well the answer is entailed by the context and is not hallucinated |
|                |                                                  | RelevanceEvaluator        | How well the answer addresses the main aspects of the question, based on the context. Consider whether all and only the important aspects are contained in the answer when evaluating relevance. |
|                |                                                  | CoherenceEvaluator        | How well all the sentences fit together and sound natural as a whole. |
|                |                                                  | FluencyEvaluator          | Quality of individual sentences in the answer, and whether they are well-written and grammatically correct. |
|                |                                                  | SimilarityEvaluator       | Measures the similarity between the predicted answer and the correct answer |
|                |                                                  | F1ScoreEvaluator          | F1 score |
| Content Safety | promptflow.evals.evaluators.content_safety       | ViolenceEvaluator         |                                                   |
|                |                                                  | SexualEvaluator           |                                                   |
|                |                                                  | SelfHarmEvaluator         |                                                   |
|                |                                                  | HateUnfairnessEvaluator   |                                                   |
| Composite      | promptflow.evals.evaluators                      | QAEvaluator               | Built on top of individual quality evaluators.    |
|                |                                                  | ChatEvaluator             | Similar to QAEvaluator but designed for evaluating chat messages. |
|                |                                                  | ContentSafetyEvaluator    | Built on top of individual content safety evaluators. |



#### Quality Evaluator

In [36]:
import os
from promptflow.core import AzureOpenAIModelConfiguration

azure_endpoint = os.environ.get("SCORING_AZURE_OPENAI_ENDPOINT")
azure_deployment = os.environ.get("SCORING_AZURE_OPENAI_DEPLOYMENT")
api_key = os.environ.get("SCORING_AZURE_OPENAI_API_KEY")
api_version = os.environ.get("SCORING_OPENAI_API_VERSION")

print(f"azure_endpoint={azure_endpoint}")
print(f"azure_deployment={azure_deployment}")
print(f"api_version={api_version}")

# Initialize Azure OpenAI Connection
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=azure_endpoint,
    azure_deployment=azure_deployment,
    api_version=api_version,
    api_key=api_key
)

azure_endpoint=https://aoai-otgsljc2twqys.openai.azure.com/
azure_deployment=gpt-4
api_version=2023-07-01-preview


In [37]:
from promptflow.evals.evaluators import RelevanceEvaluator, SimilarityEvaluator, GroundednessEvaluator

# Initializing evaluators
similarity = SimilarityEvaluator(model_config)
groundedness = GroundednessEvaluator(model_config)

In [38]:
df = pd.read_json(dataset_path_eval_answer_finetuned, lines=True)
sample = df.iloc[1]
sample

question             What type of surfing focuses on elegance and d...
answer               To answer the question, we need to identify th...
gold_final_answer                                    Longboard surfing
final_answer                                        Exhibition surfing
context              <DOCUMENT>Done for both exhibition and competi...
Name: 1, dtype: object

In [39]:
# Running Groundedness Evaluator on single input row
groundedness_score = groundedness(
    answer=sample["final_answer"],
    context=sample["context"],
)
print(groundedness_score)

{'gpt_groundedness': 3.0}


In [40]:
# Running Similarity Evaluator on single input row
similarity_score = similarity(
    question=sample["question"],
    answer=sample["final_answer"],
    context=sample["context"],
    ground_truth=sample["gold_final_answer"],
)
print(similarity_score)

{'gpt_similarity': 2.0}


### Using the Evaluate API to calculate the metrics

In the previous section, we used built-in evaluators to score a single row. Now, we will use the same evaluators with the `evaluate` API to assess an entire dataset.

### Running the metrics

Now, we will invoke the `evaluate` API using the evaluators that we already initialized.

The column mapping passed as `evaluator_config` maps the dataset's `final_answer` column to the evaluators' `answer` input and the `gold_final_answer` column to `ground_truth`.

In [41]:
from promptflow.evals.evaluate import evaluate

def score_dataset(dataset, output_path=None):
    result = evaluate(
        data=dataset,
        evaluators={"similarity": similarity, "groundedness": groundedness},
        # column mapping
        evaluator_config={
            "similarity": {
                "question": "${data.question}",
                "answer": "${data.final_answer}",
                "ground_truth": "${data.gold_final_answer}",
                "context": "${data.context}",
            },
            "groundedness": {
                "answer": "${data.final_answer}",
                "context": "${data.context}",
            },
        },
    )

    if output_path:
        pd.DataFrame.from_dict(result["rows"]).to_json(output_path, orient="records", lines=True)

    return result
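
The `${data.<column>}` expressions in `evaluator_config` tell `evaluate` which dataset column feeds each evaluator input. The sketch below illustrates that idea on a single row. It is an illustration only, not promptflow's implementation: `resolve_mapping` is a hypothetical helper and it only supports the `${data.<column>}` form used in this notebook.

```python
import re

# Matches expressions such as "${data.final_answer}" and captures the column name.
_PLACEHOLDER = re.compile(r"^\$\{data\.(\w+)\}$")


def resolve_mapping(mapping: dict, row: dict) -> dict:
    """Build evaluator keyword arguments from one dataset row and a column mapping."""
    resolved = {}
    for param, template in mapping.items():
        match = _PLACEHOLDER.match(template)
        if match is None:
            raise ValueError(f"Unsupported mapping expression: {template!r}")
        resolved[param] = row[match.group(1)]
    return resolved


# Values taken from the sample row shown earlier in this notebook.
row = {"final_answer": "Exhibition surfing", "gold_final_answer": "Longboard surfing"}
mapping = {"answer": "${data.final_answer}", "ground_truth": "${data.gold_final_answer}"}
print(resolve_mapping(mapping, row))
```

The resolved dictionary has the same keys as the keyword arguments passed to the evaluators in the single-row examples above.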

#### Baseline model evaluation metrics

In [42]:
pd.read_json(dataset_path_eval_answer_baseline, lines=True).head(2)

|   | question | answer | gold_final_answer | final_answer | context |
|---|----------|--------|-------------------|--------------|---------|
| 0 | What type of surfing focuses on elegance and d... | Based on the text, it appears that the type of... | Longboard surfing | Based on the text, it appears that the type of... | <DOCUMENT>Done for both exhibition and competi... |
| 1 | Is the emerging material lighter than traditio... | Yes, according to the text, the emerging mater... | Yes | Yes, according to the text, the emerging mater... | <DOCUMENT>An emerging board material is epoxy ... |


In [43]:
from IPython.display import display, JSON

baseline_result = score_dataset(dataset_path_eval_answer_baseline, dataset_path_eval_answer_score_baseline)

display(JSON(baseline_result["metrics"]))

Starting prompt flow service...
Start prompt flow service on 127.0.0.1:23333, version: 1.14.0.


[2024-09-13 17:40:30 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run promptflow_evals_evaluators_similarity_similarity_similarityevaluator_9cjux5yq_20240913_174028_147971, log path: /home/vscode/.promptflow/.runs/promptflow_evals_evaluators_similarity_similarity_similarityevaluator_9cjux5yq_20240913_174028_147971/logs.txt
[2024-09-13 17:40:30 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run promptflow_evals_evaluators_groundedness_groundedness_groundednessevaluator_6r9mrsap_20240913_174028_149507, log path: /home/vscode/.promptflow/.runs/promptflow_evals_evaluators_groundedness_groundedness_groundednessevaluator_6r9mrsap_20240913_174028_149507/logs.txt


You can stop the prompt flow service with the following command:'pf service stop'.

You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=promptflow_evals_evaluators_similarity_similarity_similarityevaluator_9cjux5yq_20240913_174028_147971
You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=promptflow_evals_evaluators_groundedness_groundedness_groundednessevaluator_6r9mrsap_20240913_174028_149507


[2024-09-13 17:40:32 +0000][promptflow.core._prompty_utils][ERROR] - Exception occurs: RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded call rate limit of your current AIServices S0 pricing tier. Please retry after 10 seconds. Please contact Azure support service if you would like to further increase the default rate limit.'}}
[2024-09-13 17:40:32 +0000][promptflow.core._prompty_utils][ERROR] - E

2024-09-13 18:12:01 +0000   56186 execution.bulk     INFO     Process 56294 terminated.
2024-09-13 18:12:01 +0000   56186 execution.bulk     INFO     Process 56253 terminated.


[2024-09-13 18:12:01 +0000][promptflow.core._prompty_utils][ERROR] - Exception occurs: RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded token rate limit of your current AIServices S0 pricing tier. Please retry after 9 seconds. Please contact Azure support service if you would like to further increase the default rate limit.'}}
[2024-09-13 18:12:01 +0000][promptflow.core._prompty_utils][ERROR] - Exception occurs: RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded call rate limit of your current AIServices S0 pricing tier. Please retry after 10 seconds. Please contact Azure support service if you would like to further increase the default rate limit.'}}
[2024-09-13 18:12:01 +0000][promptflow.core._prompty_utils][ERROR] - E

2024-09-13 17:40:30 +0000   52105 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2024-09-13 17:40:30 +0000   52105 execution.bulk     INFO     Set process count to 4 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 125}.
2024-09-13 17:40:31 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-4:2)-Process id(56271)-Line number(0) start execution.
2024-09-13 17:40:31 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-4:3)-Process id(56284)-Line number(3) start execution.
2024-09-13 17:40:31 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-4:1)-Process id(56253)-Line number(1) start execution.
2024-09-13 17:40:31 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-4:4)-Process id(56294)-Line number(2) start execution.
2024-09-13 17:40:32 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-4:3)-Process id(56

  outputs.fillna(value="(Failed)", inplace=True)  # replace nan with explicit prompt
  result_df.replace("(Failed)", np.nan, inplace=True)
[2024-09-13 18:12:10 +0000][promptflow.core._prompty_utils][ERROR] - Exception occurs: RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded call rate limit of your current AIServices S0 pricing tier. Please retry after 1 second. Please contact Azure support service if you would like to further increase the default rate limit.'}}
[2024-09-13 18:12:10 +0000][promptflow.core._prompty_utils][ERROR] - Exception occurs: RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded call rate limit of your current AIServices S0 pricing tier. Please retry after 1 second. Please contact Azure support service 

2024-09-13 18:12:11 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-3:4)-Process id(56265)-Line number(113) completed.
2024-09-13 18:12:11 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-3:4)-Process id(56265)-Line number(121) start execution.


[2024-09-13 18:12:11 +0000][promptflow.core._prompty_utils][ERROR] - Exception occurs: RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded token rate limit of your current AIServices S0 pricing tier. Please retry after 49 seconds. Please contact Azure support service if you would like to further increase the default rate limit.'}}
[2024-09-13 18:12:11 +0000][promptflow.core._prompty_utils][ERROR] -

2024-09-13 18:12:31 +0000   52105 execution.bulk     INFO     [Process Pool] [Active processes: 4 / 4]
2024-09-13 18:12:31 +0000   52105 execution.bulk     INFO     [Lines] [Finished: 118] [Processing: 4] [Pending: 3]
2024-09-13 18:12:31 +0000   52105 execution.bulk     INFO     Processing Lines: line 118 (Process name(ForkProcess-3:3)-Process id(56258)-Line number(118)), line 119 (Process name(ForkProcess-3:2)-Process id(56252)-Line number(119)), line 120 (Process name(ForkProcess-3:1)-Process id(56247)-Line number(120)), line 121 (Process name(ForkProcess-3:4)-Process id(56265)-Line number(121)).


[2024-09-13 18:13:00 +0000][promptflow.core._prompty_utils][ERROR] - Exception occurs: RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded token rate limit of your current AIServices S0 pricing tier. Please retry after 1 second. Please contact Azure support service if you would like to further increase the default rate limit.'}}


2024-09-13 18:13:00 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-3:2)-Process id(56252)-Line number(119) completed.
2024-09-13 18:13:00 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-3:2)-Process id(56252)-Line number(122) start execution.
2024-09-13 18:13:01 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-3:1)-Process id(56247)-Line number(120) completed.
2024-09-13 18:13:01 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-3:1)-Process id(56247)-Line number(123) start execution.
2024-09-13 18:13:01 +0000   52105 execution.bulk     INFO     Finished 120 / 125 lines.
2024-09-13 18:13:01 +0000   52105 execution.bulk     INFO     Average execution time for completed lines: 16.25 seconds. Estimated time for incomplete lines: 81.25 seconds.
2024-09-13 18:13:01 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-3:4)-Process id(56265)-Line number(121) completed.
2024-09-13 18:13:01 +0000   52105 exe

[2024-09-13 18:13:01 +0000][promptflow.core._prompty_utils][ERROR] - Exception occurs: RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded token rate limit of your current AIServices S0 pricing tier. Please retry after 10 seconds. Please contact Azure support service if you would like to further increase the default rate limit.'}}


2024-09-13 18:13:11 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-3:4)-Process id(56265)-Line number(124) completed.


[2024-09-13 18:13:11 +0000][promptflow.core._prompty_utils][ERROR] - Exception occurs: RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded token rate limit of your current AIServices S0 pricing tier. Please retry after 49 seconds. Please contact Azure support service if you would like to further increase the default rate limit.'}}


2024-09-13 18:13:31 +0000   52105 execution.bulk     INFO     [Process Pool] [Active processes: 4 / 4]
2024-09-13 18:13:31 +0000   52105 execution.bulk     INFO     [Lines] [Finished: 124] [Processing: 1] [Pending: 0]
2024-09-13 18:13:31 +0000   52105 execution.bulk     INFO     Processing Lines: line 118 (Process name(ForkProcess-3:3)-Process id(56258)-Line number(118)).
2024-09-13 18:14:01 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-3:3)-Process id(56258)-Line number(118) completed.
2024-09-13 18:14:02 +0000   52105 execution.bulk     INFO     Finished 125 / 125 lines.
2024-09-13 18:14:02 +0000   52105 execution.bulk     INFO     Average execution time for completed lines: 16.09 seconds. Estimated time for incomplete lines: 0.0 seconds.
2024-09-13 18:14:02 +0000   52105 execution.bulk     INFO     The thread monitoring the process [56247-ForkProcess-3:1] will be terminated.
2024-09-13 18:14:02 +0000   52105 execution.bulk     INFO     The thread monitoring the 

[2024-09-13 18:14:03 +0000][promptflow.evals.evaluate._utils][ERROR] - Unable to log traces as trace destination was not defined.


2024-09-13 17:40:30 +0000   52105 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2024-09-13 17:40:30 +0000   52105 execution.bulk     INFO     Set process count to 4 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 125}.
2024-09-13 17:40:31 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-3:4)-Process id(56265)-Line number(0) start execution.
2024-09-13 17:40:31 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-3:3)-Process id(56258)-Line number(1) start execution.
2024-09-13 17:40:31 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-3:2)-Process id(56252)-Line number(2) start execution.
2024-09-13 17:40:31 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-3:1)-Process id(56247)-Line number(3) start execution.
2024-09-13 17:40:32 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-3:3)-Process id(56

<IPython.core.display.JSON object>

In [44]:
# Check the results using Azure AI Studio UI
if baseline_result["studio_url"]:
    print(f"Results uploaded to AI Studio {baseline_result['studio_url']}")
else:
    print("Results available at http://127.0.0.1:23333")

Results available at http://127.0.0.1:23333


#### Fine-tuned model evaluation metrics

In [45]:
pd.read_json(dataset_path_eval_answer_finetuned, lines=True).head(2)

|   | question | answer | gold_final_answer | final_answer | context |
|---|----------|--------|-------------------|--------------|---------|
| 0 | Is the emerging material lighter than traditio... | To answer the question, we need to identify wh... | Yes | Yes | <DOCUMENT>An emerging board material is epoxy ... |
| 1 | What type of surfing focuses on elegance and d... | To answer the question, we need to identify th... | Longboard surfing | Exhibition surfing | <DOCUMENT>Done for both exhibition and competi... |


In [46]:
from IPython.display import display, JSON

finetune_result = score_dataset(dataset_path_eval_answer_finetuned, dataset_path_eval_answer_score_finetuned)

display(JSON(finetune_result["metrics"]))

[2024-09-13 18:25:10 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run promptflow_evals_evaluators_similarity_similarity_similarityevaluator_yekzx791_20240913_182510_585693, log path: /home/vscode/.promptflow/.runs/promptflow_evals_evaluators_similarity_similarity_similarityevaluator_yekzx791_20240913_182510_585693/logs.txt
[2024-09-13 18:25:10 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run promptflow_evals_evaluators_groundedness_groundedness_groundednessevaluator_dswx2cxa_20240913_182510_586583, log path: /home/vscode/.promptflow/.runs/promptflow_evals_evaluators_groundedness_groundedness_groundednessevaluator_dswx2cxa_20240913_182510_586583/logs.txt


Prompt flow service has started...
Prompt flow service has started...
You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=promptflow_evals_evaluators_groundedness_groundedness_groundednessevaluator_dswx2cxa_20240913_182510_586583
You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=promptflow_evals_evaluators_similarity_similarity_similarityevaluator_yekzx791_20240913_182510_585693
2024-09-13 18:56:41 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-8:3)-Process id(66749)-Line number(107) completed.
2024-09-13 18:56:41 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-8:4)-Process id(66763)-Line number(106) completed.
2024-09-13 18:56:41 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-8:3)-Process id(66749)-Line number(110) start execution.
2024-09-13 18:56:41 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-8:4)-Process id(66763)-Line number(111) start exe

[2024-09-13 18:25:12 +0000][promptflow.core._prompty_utils][ERROR] - Exception occurs: RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded call rate limit of your current AIServices S0 pricing tier. Please retry after 10 seconds. Please contact Azure support service if you would like to further increase the default rate limit.'}}
[2024-09-13 18:25:12 +0000][promptflow.core._prompty_utils][ERROR] - E

2024-09-13 18:55:41 +0000   66644 execution.bulk     INFO     Process 66731 terminated.
2024-09-13 18:25:10 +0000   52105 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2024-09-13 18:25:10 +0000   52105 execution.bulk     INFO     Set process count to 4 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 125}.
2024-09-13 18:25:11 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-7:3)-Process id(66742)-Line number(0) start execution.
2024-09-13 18:25:11 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-7:4)-Process id(66757)-Line number(1) start execution.
2024-09-13 18:25:11 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-7:2)-Process id(66731)-Line number(2) start execution.
2024-09-13 18:25:11 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-7:1)-Process id(66720)-Line number(3) start execution.
2024-09-13 18:25:

  outputs.fillna(value="(Failed)", inplace=True)  # replace nan with explicit prompt
  result_df.replace("(Failed)", np.nan, inplace=True)
[2024-09-13 18:55:51 +0000][promptflow.core._prompty_utils][ERROR] - Exception occurs: RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded token rate limit of your current AIServices S0 pricing tier. Please retry after 50 seconds. Please contact Azure support service if you would like to further increase the default rate limit.'}}
[2024-09-13 18:55:51 +0000][promptflow.core._prompty_utils][ERROR] - Exception occurs: RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded token rate limit of your current AIServices S0 pricing tier. Please retry after 50 seconds. Please contact Azure support se

2024-09-13 18:56:10 +0000   52105 execution.bulk     INFO     [Process Pool] [Active processes: 4 / 4]
2024-09-13 18:56:10 +0000   52105 execution.bulk     INFO     [Lines] [Finished: 106] [Processing: 4] [Pending: 15]
2024-09-13 18:56:10 +0000   52105 execution.bulk     INFO     Processing Lines: line 106 (Process name(ForkProcess-8:4)-Process id(66763)-Line number(106)), line 107 (Process name(ForkProcess-8:3)-Process id(66749)-Line number(107)), line 108 (Process name(ForkProcess-8:2)-Process id(66735)-Line number(108)), line 109 (Process name(ForkProcess-8:1)-Process id(66721)-Line number(109)).


[2024-09-13 18:56:42 +0000][promptflow.core._prompty_utils][ERROR] - Exception occurs: RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded token rate limit of your current AIServices S0 pricing tier. Please retry after 60 seconds. Please contact Azure support service if you would like to further increase the default rate limit.'}}
[2024-09-13 18:56:42 +0000][promptflow.core._prompty_utils][ERROR] - Exception occurs: RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded call rate limit of your current AIServices S0 pricing tier. Please retry after 10 seconds. Please contact Azure support service if you would like to further increase the default rate limit.'}}
[2024-09-13 18:56:42 +0000][promptflow.core._prompty_utils][ERROR] - 

2024-09-13 18:57:10 +0000   52105 execution.bulk     INFO     [Process Pool] [Active processes: 4 / 4]
2024-09-13 18:57:10 +0000   52105 execution.bulk     INFO     [Lines] [Finished: 113] [Processing: 4] [Pending: 8]
2024-09-13 18:57:10 +0000   52105 execution.bulk     INFO     Processing Lines: line 113 (Process name(ForkProcess-8:1)-Process id(66721)-Line number(113)), line 114 (Process name(ForkProcess-8:4)-Process id(66763)-Line number(114)), line 115 (Process name(ForkProcess-8:3)-Process id(66749)-Line number(115)), line 116 (Process name(ForkProcess-8:2)-Process id(66735)-Line number(116)).


[2024-09-13 18:57:42 +0000][promptflow.core._prompty_utils][ERROR] - Exception occurs: RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded token rate limit of your current AIServices S0 pricing tier. Please retry after 60 seconds. Please contact Azure support service if you would like to further increase the default rate limit.'}}
[2024-09-13 18:57:42 +0000][promptflow.core._prompty_utils][ERROR] - Exception occurs: RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded call rate limit of your current AIServices S0 pricing tier. Please retry after 10 seconds. Please contact Azure support service if you would like to further increase the default rate limit.'}}
[2024-09-13 18:57:42 +0000][promptflow.core._prompty_utils][ERROR] - 

2024-09-13 18:58:10 +0000   52105 execution.bulk     INFO     [Process Pool] [Active processes: 4 / 4]
2024-09-13 18:58:10 +0000   52105 execution.bulk     INFO     [Lines] [Finished: 121] [Processing: 4] [Pending: 0]
2024-09-13 18:58:10 +0000   52105 execution.bulk     INFO     Processing Lines: line 120 (Process name(ForkProcess-8:4)-Process id(66763)-Line number(120)), line 121 (Process name(ForkProcess-8:1)-Process id(66721)-Line number(121)), line 122 (Process name(ForkProcess-8:3)-Process id(66749)-Line number(122)), line 124 (Process name(ForkProcess-8:2)-Process id(66735)-Line number(124)).
2024-09-13 18:58:43 +0000   66668 execution.bulk     INFO     Process 66749 terminated.
2024-09-13 18:58:43 +0000   66668 execution.bulk     INFO     Process 66721 terminated.
2024-09-13 18:58:43 +0000   66668 execution.bulk     INFO     Process 66735 terminated.
2024-09-13 18:58:43 +0000   66668 execution.bulk     INFO     Process 66763 terminated.


[2024-09-13 18:58:44 +0000][promptflow.evals.evaluate._utils][ERROR] - Unable to log traces as trace destination was not defined.


2024-09-13 18:25:10 +0000   52105 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2024-09-13 18:25:10 +0000   52105 execution.bulk     INFO     Set process count to 4 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 125}.
2024-09-13 18:25:11 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-8:3)-Process id(66749)-Line number(1) start execution.
2024-09-13 18:25:11 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-8:1)-Process id(66721)-Line number(0) start execution.
2024-09-13 18:25:11 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-8:2)-Process id(66735)-Line number(2) start execution.
2024-09-13 18:25:11 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-8:4)-Process id(66763)-Line number(3) start execution.
2024-09-13 18:25:12 +0000   52105 execution.bulk     INFO     Process name(ForkProcess-8:1)-Process id(66

<IPython.core.display.JSON object>


Now let's check the results that the `evaluate` API produced for the fine-tuned model.

In [None]:
# Print where to view the results: Azure AI Studio if they were uploaded, otherwise the local prompt flow UI
if finetune_result["studio_url"]:
    print(f"Results uploaded to AI Studio {finetune_result['studio_url']}")
else:
    print("Results available at http://127.0.0.1:23333")

## Let's look at examples

In [None]:
# Load both models' answers and join them on the question text for a side-by-side view
df_baseline = pd.read_json(dataset_path_eval_answer_baseline, lines=True)
df_finetuned = pd.read_json(dataset_path_eval_answer_finetuned, lines=True)
df_merged = pd.merge(df_baseline, df_finetuned, on="question", suffixes=("_baseline", "_finetuned"))
df_merged.head()
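
Once the two answer sets are merged on `question`, one way to find interesting examples is to flag the rows where the models disagree about correctness. The sketch below is only an illustration: it uses made-up toy rows rather than the notebook's data, and it assumes a simple case-insensitive exact match between `final_answer` and `gold_final_answer`, which is stricter than the GPT-based evaluators used above. The names `toy_baseline`, `toy_finetuned`, `toy_merged` and `exact_match` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df_baseline and df_finetuned; the rows are made up for illustration.
toy_baseline = pd.DataFrame({
    "question": ["q1", "q2", "q3"],
    "gold_final_answer": ["Yes", "Paris", "42"],
    "final_answer": ["No", "Paris", "41"],
})
toy_finetuned = pd.DataFrame({
    "question": ["q1", "q2", "q3"],
    "gold_final_answer": ["Yes", "Paris", "42"],
    "final_answer": ["yes", "Paris", "41"],
})

# Same merge as in the cell above: shared columns get a per-model suffix.
toy_merged = pd.merge(
    toy_baseline, toy_finetuned, on="question", suffixes=("_baseline", "_finetuned")
)


def exact_match(predicted: pd.Series, gold: pd.Series) -> pd.Series:
    """Compare two string columns, ignoring case and surrounding whitespace."""
    normalize = lambda s: s.astype(str).str.strip().str.lower()
    return normalize(predicted) == normalize(gold)


toy_merged["baseline_correct"] = exact_match(
    toy_merged["final_answer_baseline"], toy_merged["gold_final_answer_baseline"]
)
toy_merged["finetuned_correct"] = exact_match(
    toy_merged["final_answer_finetuned"], toy_merged["gold_final_answer_finetuned"]
)

# Questions the fine-tuned model answers correctly while the baseline does not.
fixed = toy_merged[toy_merged["finetuned_correct"] & ~toy_merged["baseline_correct"]]
fixed_questions = fixed["question"].tolist()
print(fixed_questions)
```

Applying the same two boolean columns to `df_merged` would give a quick list of questions worth reading by hand, keeping in mind that exact match will count paraphrased but correct answers as wrong.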

## Compare the metrics of the fine-tuned model against the baseline

In [None]:
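metrics = pd.DataFrame.from_dict({"baseline": baseline_result["metrics"], "finetuned": finetune_result["metrics"]})
# "improvement" is the ratio of the fine-tuned score to the baseline score:
# values above 1 mean the fine-tuned model scored higher on that metric
metrics["improvement"] = metrics["finetuned"] / metrics["baseline"]
metrics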
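
The `improvement` column is a ratio, so a value of 1.25 means the fine-tuned score is 25% higher than the baseline and 0.75 means it is 25% lower. The sketch below works through that arithmetic in plain Python. The metric names and scores are hypothetical placeholders, not results from this notebook, and `compare_scores` is an illustrative helper rather than part of `promptflow-evals`.

```python
# Hypothetical scores for illustration only; these are not results from this notebook.
baseline_scores = {"metric_a": 2.0, "metric_b": 4.0}
finetuned_scores = {"metric_a": 2.5, "metric_b": 3.0}


def compare_scores(baseline, finetuned):
    """Return the ratio and percent change for each metric present in both runs."""
    comparison = {}
    for name in sorted(baseline.keys() & finetuned.keys()):
        base, tuned = baseline[name], finetuned[name]
        if base == 0:
            # A ratio against a zero baseline is undefined, so skip the metric.
            continue
        ratio = tuned / base
        comparison[name] = {"ratio": ratio, "percent_change": (ratio - 1.0) * 100.0}
    return comparison


comparison = compare_scores(baseline_scores, finetuned_scores)
for name, values in comparison.items():
    print(f"{name}: ratio={values['ratio']:.2f}, change={values['percent_change']:+.1f}%")
```

Reporting the percent change next to the ratio can make regressions easier to spot, since a negative sign stands out more than a value slightly below 1.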

In [None]:
metrics.drop("improvement", axis=1).plot.bar(rot=0)