# NeMo Evaluator microservice: Retriever and RAG Evaluation

In the following notebook, we'll be exploring how to use [NeMo Evaluator microservice](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/overview.html) to evaluate [Retriever Models](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/evaluations/evaluations_retriever.html) as well as [Retrieval Augmented Generation (RAG) Models](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/models/models_rag.html)!

We'll look at the following examples: 

- [Retriever Model Evaluation on FiQA](#retriever-model-evaluation-on-fiqa)
- [Retriever + Reranking Evaluation on FiQA](#retriever--reranking-evaluation-on-fiqa)
- [Retrieval Augmented Generation (RAG) Evaluation on FiQA with Ragas Metrics](#retrieval-augmented-generation-rag-evaluation-on-fiqa-with-ragas-metrics)
- [Retrieval Augmented Generation (RAG) Evaluation on Synthetically Generated Data with Ragas Metrics](#retrieval-augmented-generation-rag-evaluation-on-synthetically-generated-data-with-ragas-metrics)

In order to get started, we'll need to make sure our Evaluation Microservice is running, alongside our Retriever, Re-Rank, and LLM NIMs.

## Initial Set-up and Notebook Dependencies

In order to run this notebook, the following will need to be up and running: 

- Evaluator Microservice, which can be deployed through the convenient [Deploying with Helm](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/deploy-helm.html)
- NVIDIA NIM Text Embedding, `nvidia/nv-embedqa-e5-v5`, which can be deployed using this [Getting Started](https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/getting-started.html) guide
- NVIDIA NIM Text Reranking, `nvidia/nv-rerank-qa-mistral-4b`, which can be deployed using this [Getting Started](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/getting-started.html) guide
- NVIDIA NIM for LLM (this notebook used Llama 3.1 8B Instruct), which can be deployed using this [Getting Started](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html) guide

Once all of our services are up and running, we can install the Python `requests` library, which we will use to communicate with the Evaluator API.

In [None]:
!pip install -qU requests 

We'll need to provide the Evaluation API URL in the cell below.

> NOTE: Your evaluation URL will be provided as part of your deployment. 

In [2]:
EVAL_URL = "<< YOUR EVAL URL HERE >>"

Now we can verify our Evaluation API is up and running with the built-in health check!

In [3]:
import requests

endpoint = f"{EVAL_URL}/health"
response = requests.get(endpoint).json()
print(response)

{'status': 'healthy'}


## Retriever Model Evaluation on FiQA

For our first evaluation, we're going to evaluate our Retrieval Model (`nvidia/nv-embedqa-e5-v5`) on the [FiQA](https://sites.google.com/view/fiqa/) retrieval task as part of the [BeIR](https://github.com/beir-cellar/beir) benchmark.

The core pieces we need to provide are: 

- `top_k`, how many documents to retrieve through our retriever model
- `query_embedding_url`, the address of your currently running `nvidia/nv-embedqa-e5-v5` NIM if you're following the notebook exactly.
- `query_embedding_model`, this will be `nvidia/nv-embedqa-e5-v5` if you're following the notebook exactly.
- `index_embedding_url`, which will mirror the `query_embedding_url` assuming that you're using the same NIM deployment for both Query Embedding and Index embedding.
- `index_embedding_model`, this will mirror the `query_embedding_model` assuming that you're using the same NIM deployment for both Query Embedding and Index embedding.

> NOTE: While it's possible to use different NIM *deployments* for Query/Index Embedding - you will need to ensure the underlying model is the same between both.

We'll also want to ensure we've set up our evaluations correctly by following the available [documentation](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/evaluations/evaluations_retriever.html) for Retriever evaluations.


In [13]:
retriever_eval_config = {
  "model": {
    "retriever": {
      "top_k": 10,
      "query_embedding_url": "http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings",
      "query_embedding_model": "nvidia/nv-embedqa-e5-v5",
      "index_embedding_url": "http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings",
      "index_embedding_model": "nvidia/nv-embedqa-e5-v5"
    }
  },
  "evaluations": [
    {
      "eval_type": "automatic",
      "eval_subtype": "beir",
      "dataset_path" : "fiqa",
      "metrics": "recall_5,ndcg_cut_5,recall_10,ndcg_cut_10",
      "dataset_format": "beir"
    }
  ],
  "tag": "retriever-eval-beir"
}
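Before submitting, the note about using the same underlying model for query and index embedding can be checked mechanically. The sketch below is a hypothetical helper (`embedding_models_match` is not part of the Evaluator API); it only compares the two model fields named in this notebook and does not contact any service.

```python
def embedding_models_match(retriever_config):
    """Return True when query and index embedding name the same model.

    Hypothetical helper for illustration: it only inspects the config dict.
    """
    query_model = retriever_config.get("query_embedding_model")
    index_model = retriever_config.get("index_embedding_model")
    return query_model is not None and query_model == index_model


# Trimmed-down copy of the `retriever` section used in this notebook.
example_retriever = {
    "query_embedding_model": "nvidia/nv-embedqa-e5-v5",
    "index_embedding_model": "nvidia/nv-embedqa-e5-v5",
}
print(embedding_models_match(example_retriever))
```

With the configuration above, the same check would be `embedding_models_match(retriever_eval_config["model"]["retriever"])`.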

We can now point to our `evaluations` endpoint at our Evaluation URL.

In [14]:
evaluator_endpoint = f"{EVAL_URL}/v1/evaluations"

The following cell will kick off an evaluation job and return the Evaluation ID, which can be used to monitor the evaluation and later download its results.

In [16]:
response = requests.post(evaluator_endpoint, json=retriever_eval_config).json()
retriever_evaluation_id = response["evaluation_id"]
print(f"Evaluation ID: {retriever_evaluation_id}")

Evaluation ID: eval-P75Yh9icc58inYuzn82AVo


We can check on the status of our evaluation in the cell below. 

> NOTE: When the evaluation `status` becomes `succeeded`, the `evaluation_results` field will become populated.

In [42]:
evaluation_id_endpoint = evaluator_endpoint + f"/{retriever_evaluation_id}"
response = requests.get(evaluation_id_endpoint).json()
response

{'evaluation_id': 'eval-P75Yh9icc58inYuzn82AVo',
 'status': 'succeeded',
 'model': {'llm_name': None,
  'inference_url': None,
  'llm': None,
  'retriever': {'top_k': 10,
   'query_embedding_url': 'http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings',
   'query_embedding_model': 'nvidia/nv-embedqa-e5-v5',
   'index_embedding_url': 'http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings',
   'index_embedding_model': 'nvidia/nv-embedqa-e5-v5',
   'ranker_model': None,
   'ranker_url': None},
  'rag': None},
 'evaluations': [{'eval_type': 'automatic',
   'eval_subtype': 'beir',
   'dataset_path': 'fiqa',
   'metrics': 'recall_5,ndcg_cut_5,recall_10,ndcg_cut_10',
   'dataset_format': 'beir'}],
 'tag': 'retriever-eval-beir',
 'created_at': '2024-09-25T22:03:28',
 'created_by': None,
 'evaluation_results': [{'level_name': 'evaluation',
   'isRecommended': True,
   'extra_grouping_fields': None,
   'metrics': [{'name': 'ndcg_cut_5',
     'value':

The `evaluation_results` field contains our `metrics`, each with a `name` and a `value` (its score).

```python
[
  {
    'name': 'ndcg_cut_5',
    'value': 0.43179850619730425,
    'metadata': {'name': 'beir'}
  },
  {
    'name': 'recall_10',
    'value': 0.5212761004427672,
    'metadata': {'name': 'beir'}
  },
  {
    'name': 'ndcg_cut_10',
    'value': 0.455153721565557,
    'metadata': {'name': 'beir'}
  },
  {
    'name': 'recall_5',
    'value': 0.4460219435913878,
    'metadata': {'name': 'beir'}
  }
]
```
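For later comparisons it can be handy to flatten that list into a simple name-to-value mapping. The following is a minimal sketch; `metrics_by_name` is a hypothetical helper, and the input values are copied from the list above. Based on the response shown earlier, the list appears to live under `response["evaluation_results"][0]["metrics"]`, though that path is an assumption about the response layout rather than documented behavior.

```python
def metrics_by_name(metrics):
    """Map each metric's name to its value, dropping the metadata."""
    return {metric["name"]: metric["value"] for metric in metrics}


# Values copied from the retriever-only FiQA run shown above.
fiqa_metrics = [
    {"name": "ndcg_cut_5", "value": 0.43179850619730425, "metadata": {"name": "beir"}},
    {"name": "recall_10", "value": 0.5212761004427672, "metadata": {"name": "beir"}},
    {"name": "ndcg_cut_10", "value": 0.455153721565557, "metadata": {"name": "beir"}},
    {"name": "recall_5", "value": 0.4460219435913878, "metadata": {"name": "beir"}},
]

scores = metrics_by_name(fiqa_metrics)
for name in sorted(scores):
    print(f"{name}: {scores[name]:.4f}")
```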

## Retriever + Reranking Evaluation on FiQA

For our second evaluation, we're going to evaluate our Retrieval Model (`nvidia/nv-embedqa-e5-v5`) on the [FiQA](https://sites.google.com/view/fiqa/) retrieval task as part of the [BeIR](https://github.com/beir-cellar/beir) benchmark.

Instead of simply using a Retriever model, however, this example will also leverage a Reranking model (`nvidia/nv-rerank-qa-mistral-4b`) to rerank the retrieved results.

We'll rerun the same evaluation configuration as we did above - with a few extra parameters in our `retriever` configuration:

- `ranker_url`, which will point to our reranking model
- `ranker_model`, which will contain the name of our reranking model

Aside from that change - this evaluation is identical so we can observe the improvements gained by using a reranking NIM.
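One way to express "the same configuration plus two fields" in code is to copy the first configuration and add the reranker entries. This is only a sketch: `with_reranker` is a hypothetical helper (not part of the Evaluator API), the base dictionary is a trimmed stand-in for the full configuration, and the ranker URL is a placeholder.

```python
import copy


def with_reranker(base_config, ranker_url, ranker_model, tag):
    """Return a copy of base_config whose retriever also uses a reranker."""
    config = copy.deepcopy(base_config)  # leave the original config untouched
    config["model"]["retriever"]["ranker_url"] = ranker_url
    config["model"]["retriever"]["ranker_model"] = ranker_model
    config["tag"] = tag
    return config


# Trimmed stand-in for the retriever-only configuration.
base = {
    "model": {"retriever": {"top_k": 10}},
    "evaluations": [],
    "tag": "retriever-eval-beir",
}
reranked = with_reranker(
    base,
    ranker_url="<< YOUR RANKER URL HERE >>",  # placeholder, not a real endpoint
    ranker_model="nvidia/nv-rerank-qa-mistral-4b",
    tag="retriever-reranker-eval-beir",
)
print(sorted(reranked["model"]["retriever"]))
```

Using a deep copy keeps the retriever-only configuration unchanged, so both jobs can be submitted from the same session.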

In [61]:
retriever_reranker_eval_config = {
  "model": {
    "retriever": {
      "top_k": 10,
      "query_embedding_url": "http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings",
      "query_embedding_model": "nvidia/nv-embedqa-e5-v5",
      "index_embedding_url": "http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings",
      "index_embedding_model": "nvidia/nv-embedqa-e5-v5",
      "ranker_url" : "https://ranking.dev.aire.nvidia.com/v1",
      "ranker_model" : "nvidia/nv-rerank-qa-mistral-4b"
    }
  },
  "evaluations": [
    {
      "eval_type": "automatic",
      "eval_subtype": "beir",
      "dataset_path" : "fiqa",
      "metrics": "recall_5,ndcg_cut_5,recall_10,ndcg_cut_10",
      "dataset_format": "beir"
    }
  ],
  "tag": "retriever-reranker-eval-beir"
}

Once again, we can kick off the evaluation job by sending a request to the evaluation URL.

In [62]:
response = requests.post(evaluator_endpoint, json=retriever_reranker_eval_config).json()
retriever_reranker_evaluation_id = response["evaluation_id"]
print(f"Evaluation ID: {retriever_reranker_evaluation_id}")

Evaluation ID: eval-J7HN4DmDTFuRaqBVp3WUxT


We can monitor the job using the following request.

In [86]:
evaluation_id_endpoint = evaluator_endpoint + f"/{retriever_reranker_evaluation_id}"
response = requests.get(evaluation_id_endpoint).json()
response

{'evaluation_id': 'eval-J7HN4DmDTFuRaqBVp3WUxT',
 'status': 'failed',
 'model': {'llm_name': None,
  'inference_url': None,
  'llm': None,
  'retriever': {'top_k': 10,
   'query_embedding_url': 'http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings',
   'query_embedding_model': 'nvidia/nv-embedqa-e5-v5',
   'index_embedding_url': 'http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings',
   'index_embedding_model': 'nvidia/nv-embedqa-e5-v5',
   'ranker_model': 'nvidia/nv-rerank-qa-mistral-4b',
   'ranker_url': 'https://ranking.dev.aire.nvidia.com/v1'},
  'rag': None},
 'evaluations': [{'eval_type': 'automatic',
   'eval_subtype': 'beir',
   'dataset_path': 'fiqa',
   'metrics': 'recall_5,ndcg_cut_5,recall_10,ndcg_cut_10',
   'dataset_format': 'beir'}],
 'tag': 'retriever-reranker-eval-beir',
 'created_at': '2024-09-26T00:11:52',
 'created_by': None,
 'evaluation_results': []}

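Once the job succeeds, we can observe the results, as compared to the retriever-only results, below! In the run captured above, the job `status` is `failed` and `evaluation_results` is empty, so there are no reranker scores to show yet.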

In [None]:
### NEED JOB TO SUCCEED
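When both jobs have succeeded, the comparison itself is a small computation. The sketch below assumes each run's metrics have already been collected into a plain `{name: value}` dictionary; `metric_deltas` is a hypothetical helper, and no reranker numbers are shown here because none are available in this notebook.

```python
def metric_deltas(baseline, candidate):
    """Return candidate minus baseline for every metric present in both runs."""
    shared = sorted(set(baseline) & set(candidate))
    return {name: candidate[name] - baseline[name] for name in shared}
```

A positive delta means the candidate run (for example, retriever plus reranker) scored higher than the baseline (retriever only) on that metric. Metrics that appear in only one of the two runs are skipped.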

## Retrieval Augmented Generation (RAG) Evaluation on FiQA with Ragas Metrics

With the most recent release of NeMo Evaluator microservice, not only can we evaluate Retrievers and Rerankers - we can also Evaluate RAG!

Once again, we're going to evaluate on the [FiQA](https://sites.google.com/view/fiqa/) retrieval task as part of the [BeIR](https://github.com/beir-cellar/beir) benchmark.

We're also going to evaluate our RAG pipeline on the [Ragas](https://docs.ragas.io/en/stable/howtos/index.html) metric ["Faithfulness"](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html). This can be done by extending our evaluation configuration in the following ways:

1. We can create the model type `rag`, and provide our `retriever` configuration we used in the first evaluation.
2. We need to provide a `context_ordering` parameter. In this case we'll use `desc`, which orders our context by descending score.
3. We need to provide a "generator" (LLM) that can be used to generate responses based on the retrieved context!

We'll also need to add in a number of `judge_` parameters to help calculate the Faithfulness metric.

Let's look at an example evaluation configuration below:

In [47]:
rag_eval_config = {
  "model": {
    "rag": {
        "context_ordering": "desc",
        "retriever": {
            "top_k": 10,
            "query_embedding_url": "http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings",
            "query_embedding_model": "nvidia/nv-embedqa-e5-v5",
            "index_embedding_url": "http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings",
            "index_embedding_model": "nvidia/nv-embedqa-e5-v5"
        },
        "llm": {
            "inference_url": "http://meta-llama3-1-8b-instruct.nim-meta-llama3-1-8b-instruct.svc.cluster.local:8000/v1",
            "llm_name": "meta/llama-3_1-8b-instruct"
        }
    }
  },
  "evaluations": [
    {
        "eval_type": "automatic",
        "eval_subtype": "beir",
        "dataset_path": "fiqa",
        "dataset_format" : "beir",
        "retriever_metrics": "recall_5,ndcg_cut_5,recall_10,ndcg_cut_10",
        "rag_metrics": "faithfulness",
        "judge_llm": "meta/llama-3_1-8b-instruct",
        "judge_llm_url": "http://meta-llama3-1-8b-instruct.nim-meta-llama3-1-8b-instruct.svc.cluster.local:8000/v1",
        "judge_llm_api_key": None,
        "judge_embeddings": "nvidia/nv-embedqa-e5-v5",
        "judge_embeddings_url": "http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings",
        "judge_embeddings_api_key": None,
        "judge_timeout": 120,
        "judge_max_retries": 2,
        "judge_max_workers": 16
    }
  ],
  "tag": "rag-eval-beir"
}

Now that we've set up our evaluation configuration, we're ready to fire off the evaluation job!

In [36]:
response = requests.post(evaluator_endpoint, json=rag_eval_config).json()
rag_evaluation_id = response["evaluation_id"]
print(f"Evaluation ID: {rag_evaluation_id}")

{'evaluation_id': 'eval-4JHowsb6JsttyYjdQ6JBdt', 'status': 'running', 'model': {'llm_name': None, 'inference_url': 'http://meta-llama3-8b-instruct.nim-meta-llama3-8b-instruct.svc.cluster.local:8000/v1', 'llm': None, 'retriever': None, 'rag': {'retriever': {'top_k': 10, 'query_embedding_url': 'http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings', 'query_embedding_model': 'nvidia/nv-embedqa-e5-v5', 'index_embedding_url': 'http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings', 'index_embedding_model': 'nvidia/nv-embedqa-e5-v5', 'ranker_model': None, 'ranker_url': None}, 'llm': {'llm_name': 'meta/llama-3_1-8b-instruct', 'inference_url': 'http://meta-llama3-1-8b-instruct.nim-meta-llama3-1-8b-instruct.svc.cluster.local:8000/v1'}}}, 'evaluations': [{'eval_type': 'automatic', 'eval_subtype': 'beir'}], 'tag': 'rag-eval-beir', 'created_at': '2024-09-25T22:15:55', 'created_by': None}
Evaluation ID: eval-4JHowsb6JsttyYjdQ6JBdt


Once again, we can poll the API to determine the status of our job.

In [60]:
evaluation_id_endpoint = evaluator_endpoint + f"/{rag_evaluation_id}"
response = requests.get(evaluation_id_endpoint).json()
response

{'evaluation_id': 'eval-4JHowsb6JsttyYjdQ6JBdt',
 'status': 'succeeded',
 'model': {'llm_name': None,
  'inference_url': 'http://meta-llama3-8b-instruct.nim-meta-llama3-8b-instruct.svc.cluster.local:8000/v1',
  'llm': None,
  'retriever': None,
  'rag': {'retriever': {'top_k': 10,
    'query_embedding_url': 'http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings',
    'query_embedding_model': 'nvidia/nv-embedqa-e5-v5',
    'index_embedding_url': 'http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings',
    'index_embedding_model': 'nvidia/nv-embedqa-e5-v5',
    'ranker_model': None,
    'ranker_url': None},
   'llm': {'llm_name': 'meta/llama-3_1-8b-instruct',
    'inference_url': 'http://meta-llama3-1-8b-instruct.nim-meta-llama3-1-8b-instruct.svc.cluster.local:8000/v1'}}},
 'evaluations': [{'eval_type': 'automatic',
   'eval_subtype': 'beir',
   'dataset_path': 'fiqa',
   'dataset_format': 'beir',
   'retriever_metrics': 'recall_5,ndcg_cut_5,
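Rather than re-running the status cell by hand, the check can be wrapped in a small loop. This is a minimal sketch, not the service's own client: `wait_for_evaluation` is a hypothetical helper, and the set of terminal statuses is an assumption based only on the values seen in this notebook's outputs (`running`, `succeeded`, `failed`).

```python
import time

# Assumption: these are the only terminal statuses, based on this notebook's outputs.
TERMINAL_STATUSES = {"succeeded", "failed"}


def wait_for_evaluation(fetch, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Call fetch() until the returned job dict reaches a terminal status.

    fetch is any zero-argument callable that returns the evaluation JSON as a dict.
    Raises TimeoutError if the job is still not finished after max_polls calls.
    """
    for attempt in range(max_polls):
        job = fetch()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if attempt < max_polls - 1:
            sleep(poll_seconds)  # wait before asking again
    raise TimeoutError(f"Evaluation still not finished after {max_polls} polls")
```

With the variables defined above, a call might look like `wait_for_evaluation(lambda: requests.get(evaluation_id_endpoint).json())`. Passing the fetch step in as a callable keeps the loop independent of how the request is made.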

Going beyond just looking at the results in the response object - we can download the results from the NeMo Datastore microservice as showcased below!

In [64]:
url = evaluator_endpoint + f"/{rag_evaluation_id}" + "/download-results"
headers = {'accept': 'application/json'}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    with open('result.zip', 'wb') as file:
        file.write(response.content)
    print("Results downloaded successfully.")
else:
    print(f"Failed to download results. Status code: {response.status_code}")

Results downloaded successfully.


Let's unzip the results.

In [68]:
!unzip result.zip -d results

Archive:  result.zip
 extracting: results/metadata.json   
 extracting: results/.gitattributes  
 extracting: results/automatic/beir/model_config_custom_rag_model.yaml  
 extracting: results/automatic/beir/haystack_yaml/index.yaml  
 extracting: results/automatic/beir/haystack_yaml/query.yaml  
 extracting: results/automatic/beir/results/cleanup_milvus-run.log  
 extracting: results/automatic/beir/results/scores.jsonl  
 extracting: results/automatic/beir/results/README.md  
 extracting: results/automatic/beir/results/rag-run.log  
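If the `unzip` command is not available in your environment, the same step can be done with Python's standard `zipfile` module. A minimal sketch, where `extract_results` is a hypothetical helper name:

```python
import zipfile
from pathlib import Path


def extract_results(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the archive's member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return sorted(archive.namelist())
```

Calling `extract_results("result.zip", "results")` would mirror the shell command above and return the list of extracted paths.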


And use a helper function to display them!

In [80]:
import json

def display_jsonl_scores(file_path):
    """Print every score in a .jsonl results file, grouped by category."""
    with open(file_path, 'r') as file:
        for line in file:
            # Each line is one JSON object mapping a category to its scores.
            data = json.loads(line.strip())
            for category, scores in data.items():
                print(f"{category.capitalize()} Scores:")
                for key, value in scores.items():
                    print(f"  {key}: {value:.4f}")
            print()  # Add a blank line between objects for readability

In [82]:
display_jsonl_scores("./results/automatic/beir/results/scores.jsonl")

Retriever Scores:
  ndcg_cut_10: 0.4548
  recall_5: 0.4455
  ndcg_cut_5: 0.4314
  recall_10: 0.5208

Generation Scores:
  faithfulness: 0.7632
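
To work with the scores programmatically instead of printing them, the same file can be loaded into a dictionary. This sketch assumes the layout that `display_jsonl_scores` relies on (each line is a JSON object mapping a category to a dictionary of scores); `load_jsonl_scores` is a hypothetical helper name.

```python
import json


def load_jsonl_scores(file_path):
    """Merge every JSON object in a .jsonl file into {category: {metric: value}}."""
    merged = {}
    with open(file_path, "r") as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            for category, scores in json.loads(line).items():
                merged.setdefault(category, {}).update(scores)
    return merged
```

For example, `load_jsonl_scores("./results/automatic/beir/results/scores.jsonl")` would return the retriever and generation scores as nested dictionaries, which makes it easier to compare one run against another.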



## Retrieval Augmented Generation (RAG) Evaluation on Synthetically Generated Data with Ragas Metrics

For our final evaluation, we're going to leverage the work done in [this](https://github.com/NVIDIA/GenerativeAIExamples/blob/main/nemo/retriever-synthetic-data-generation/notebooks/quickstart.ipynb) notebook to build a BeIR-format dataset using Synthetic Data Generation (SDG).

The output from the above notebook should be a dataset with the following items which can be found in the `outputs/sample_synthetic_data/beir/filtered/synthetic` directory after running the notebook:

- `corpus.jsonl`
- `qrels/test.tsv`
- `queries.jsonl`

This notebook assumes you've run the above notebook and have moved the `outputs/sample_synthetic_data/beir/filtered/synthetic` directory into the root folder of this notebook.
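
Before uploading, it can be worth confirming that the folder actually contains the three files listed above. The sketch below is a hypothetical pre-flight check (`missing_beir_files` is not part of any NVIDIA tooling), and the folder name `synthetic` is an assumption about where the directory ended up after being moved.

```python
from pathlib import Path

# The three files the synthetic data notebook is expected to produce.
REQUIRED_BEIR_FILES = ("corpus.jsonl", "queries.jsonl", "qrels/test.tsv")


def missing_beir_files(dataset_dir):
    """Return the required BeIR files that are not present under dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in REQUIRED_BEIR_FILES if not (root / name).is_file()]


# "synthetic" is an assumed folder name; adjust it to your local path.
print(missing_beir_files("synthetic"))
```

An empty list means all three files were found.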

We'll use the following utility script to upload the folder contents to the NeMo Datastore microservice under the name "SDG_BEIR".

In [85]:
!python upload_sdg_data.py

Dataset folder uploaded to: https://datastore.stg.llm.ngc.nvidia.com/datasets/nvidia/SDG_BEIR/tree/main/.


We can once again create our RAG evaluation configuration while making a small change, which is to simply point at the newly uploaded dataset.

> NOTE: We'll also include the [Answer Relevance](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html) metric from Ragas for this run!

In [91]:
rag_eval_sdg_config = {
  "model": {
    "rag": {
        "context_ordering": "desc",
        "retriever": {
            "top_k": 10,
            "query_embedding_url": "http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings",
            "query_embedding_model": "nvidia/nv-embedqa-e5-v5",
            "index_embedding_url": "http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings",
            "index_embedding_model": "nvidia/nv-embedqa-e5-v5"
        },
        "llm": {
            "inference_url": "http://meta-llama3-1-8b-instruct.nim-meta-llama3-1-8b-instruct.svc.cluster.local:8000/v1",
            "llm_name": "meta/llama-3_1-8b-instruct"
        }
    }
  },
  "evaluations": [
    {
        "eval_type": "automatic",
        "eval_subtype": "beir",
        "dataset_path": "nds:SDG_BEIR",
        "dataset_format" : "beir",
        "retriever_metrics": "recall_5,ndcg_cut_5,recall_10,ndcg_cut_10",
        "rag_metrics": "faithfulness,answer_relevancy",
        "judge_llm": "meta/llama-3_1-8b-instruct",
        "judge_llm_url": "http://meta-llama3-1-8b-instruct.nim-meta-llama3-1-8b-instruct.svc.cluster.local:8000/v1",
        "judge_llm_api_key": None,
        "judge_embeddings": "nvidia/nv-embedqa-e5-v5",
        "judge_embeddings_url": "http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings",
        "judge_embeddings_api_key": None,
        "judge_timeout": 120,
        "judge_max_retries": 2,
        "judge_max_workers": 16
    }
  ],
  "tag": "rag-eval-sdg-beir"
}


Let's fire off that evaluation job!

In [89]:
response = requests.post(evaluator_endpoint, json=rag_eval_sdg_config).json()
rag_sdg_evaluation_id = response["evaluation_id"]
print(f"Evaluation ID: {rag_sdg_evaluation_id}")

Evaluation ID: eval-7wNGSbr3vAWJrdkktjrc3J


Again, we can use the API to determine when the run has completed.

In [98]:
evaluation_id_endpoint = evaluator_endpoint + f"/{rag_sdg_evaluation_id}"
response = requests.get(evaluation_id_endpoint).json()
response

{'evaluation_id': 'eval-7wNGSbr3vAWJrdkktjrc3J',
 'status': 'succeeded',
 'model': {'llm_name': None,
  'inference_url': 'http://meta-llama3-8b-instruct.nim-meta-llama3-8b-instruct.svc.cluster.local:8000/v1',
  'llm': None,
  'retriever': None,
  'rag': {'retriever': {'top_k': 10,
    'query_embedding_url': 'http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings',
    'query_embedding_model': 'nvidia/nv-embedqa-e5-v5',
    'index_embedding_url': 'http://nemo-embedding-ms.nemo-retrieval.svc.cluster.local:8080/v1/embeddings',
    'index_embedding_model': 'nvidia/nv-embedqa-e5-v5',
    'ranker_model': None,
    'ranker_url': None},
   'llm': {'llm_name': 'meta/llama-3_1-8b-instruct',
    'inference_url': 'http://meta-llama3-1-8b-instruct.nim-meta-llama3-1-8b-instruct.svc.cluster.local:8000/v1'}}},
 'evaluations': [{'eval_type': 'automatic',
   'eval_subtype': 'beir',
   'dataset_path': 'nds:SDG_BEIR',
   'dataset_format': 'beir',
   'retriever_metrics': 'recall_5,ndc

Let's download, unzip, and view the results on our synthetically created evaluation set!

In [99]:
url = evaluator_endpoint + f"/{rag_sdg_evaluation_id}" + "/download-results"
headers = {'accept': 'application/json'}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    with open('result_sdg.zip', 'wb') as file:
        file.write(response.content)
    print("Results downloaded successfully.")
else:
    print(f"Failed to download results. Status code: {response.status_code}")

Results downloaded successfully.


In [100]:
!unzip result_sdg.zip -d results_sdg

Archive:  result_sdg.zip
 extracting: results_sdg/metadata.json  
 extracting: results_sdg/.gitattributes  
 extracting: results_sdg/automatic/beir/model_config_custom_rag_model.yaml  
 extracting: results_sdg/automatic/beir/haystack_yaml/index.yaml  
 extracting: results_sdg/automatic/beir/haystack_yaml/query.yaml  
 extracting: results_sdg/automatic/beir/results/cleanup_milvus-run.log  
 extracting: results_sdg/automatic/beir/results/scores.jsonl  
 extracting: results_sdg/automatic/beir/results/README.md  
 extracting: results_sdg/automatic/beir/results/rag-run.log  
 extracting: results_sdg/automatic/beir/SDG_BEIR/queries.jsonl  
 extracting: results_sdg/automatic/beir/SDG_BEIR/corpus.jsonl  
 extracting: results_sdg/automatic/beir/SDG_BEIR/.gitattributes  
 extracting: results_sdg/automatic/beir/SDG_BEIR/qrels/test.tsv  


In [101]:
display_jsonl_scores("./results_sdg/automatic/beir/results/scores.jsonl")

Retriever Scores:
  ndcg_cut_5: 1.0000
  ndcg_cut_10: 1.0000
  recall_5: 1.0000
  recall_10: 1.0000

Generation Scores:
  faithfulness: 0.7384
  answer_relevancy: 0.4565

