# Calculate retrieval recall accuracy with NV-Ingest and LlamaIndex

In this notebook, we'll use NV-Ingest and LlamaIndex to calculate the end-to-end recall accuracy of a retrieval pipeline made up of NV-Ingest's extraction, embedding, and vector database (VDB) upload tasks and LlamaIndex's retrieval functionality.

**Note:** To run this notebook, you'll need the NV-Ingest microservice running along with all of the other included microservices. To do this, make sure all of the services are uncommented in [docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml) and follow the [quickstart guide](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#quickstart) to start everything up.

### Calculating table recall

**Important:** The `nv_ingest_collection` collection in the Milvus VDB must be empty before you run this section, so that only the tables extracted here can be retrieved. The following code should therefore return a row count of `0`.

In [8]:
from pymilvus import MilvusClient

milvus_client = MilvusClient("http://localhost:19530")
milvus_client.get_collection_stats(collection_name='nv_ingest_collection')

{'row_count': 0}
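Because a non-empty collection would silently skew the recall numbers, it can help to turn this check into a guard that stops the notebook early. The sketch below uses a hypothetical helper, `require_empty_collection`, which is not part of pymilvus; it only assumes that the stats dictionary carries the `row_count` key shown in the output above.

```python
def require_empty_collection(stats: dict, collection_name: str) -> None:
    """Raise if the stats dict reports any rows (hypothetical helper)."""
    row_count = int(stats.get("row_count", 0))
    if row_count != 0:
        raise RuntimeError(
            f"{collection_name} holds {row_count} rows; empty it before ingesting."
        )


# Stand-in for milvus_client.get_collection_stats(collection_name=...)
example_stats = {"row_count": 0}
require_empty_collection(example_stats, "nv_ingest_collection")  # passes silently
```

In the notebook, the dictionary returned by `get_collection_stats` could be passed in place of `example_stats`.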

In [3]:
import os
import glob
import logging
import time
import json
from tqdm import tqdm
from collections import defaultdict
import numpy as np
import pandas as pd

from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.embeddings.nvidia import NVIDIAEmbedding

# TODO: Add your NVIDIA API key here
os.environ['NVIDIA_API_KEY'] = '<YOUR_NVIDIA_API_KEY>'

First, we'll use NV-Ingest's pdfium model ensemble to extract tables, use the included embedding NIM to generate embeddings for those tables, and store the embeddings in the Milvus VDB. We'll do this by creating a batch NV-Ingest job and adding tasks for table extraction, embedding, and VDB upload.

In [None]:
from nv_ingest_client.client import NvIngestClient
from nv_ingest_client.primitives import BatchJobSpec
from nv_ingest_client.primitives.tasks import ExtractTask
from nv_ingest_client.primitives.tasks import EmbedTask
from nv_ingest_client.primitives.tasks import VdbUploadTask

batch_job_spec = BatchJobSpec(
    [
        "data/bo767/*.pdf",
    ]
)

extract_task = ExtractTask(
    document_type="pdf",
    extract_text=False,
    extract_images=False,
    extract_tables=True,
    extract_charts=False,
)

embed_task = EmbedTask(
    text=False,
    tables=True,
)

vdb_upload_task = VdbUploadTask()

batch_job_spec.add_task(extract_task)
batch_job_spec.add_task(embed_task)
batch_job_spec.add_task(vdb_upload_task)

client = NvIngestClient(
    message_client_hostname="nv-ingest-ms-runtime",  # Host where nv-ingest-ms-runtime is running
    message_client_port=7670,  # REST port, defaults to 7670
)
job_ids = client.add_job(batch_job_spec)

client.submit_job(job_ids, "morpheus_task_queue", batch_size=10)
result = client.fetch_job_result(job_ids, timeout=60, verbose=True)

Now, our extracted tables and their corresponding embeddings are uploaded and indexed in the Milvus VDB.

In [11]:
milvus_client.get_collection_stats(collection_name='nv_ingest_collection')

{'row_count': 27198}

Next, we'll read in our set of queries and expected retrievals.

In [2]:
df_query = pd.read_csv('table_queries_cleaned_235.csv')[['query','pdf','page','table']]
df_query['pdf_page'] = df_query.apply(lambda x: f"{x.pdf}_{x.page}", axis=1)
df_query

Unnamed: 0,query,pdf,page,table,pdf_page
0,How much did Pendleton County spend out of the...,1003421,2,1003421_2_0,1003421_2
1,How many units are occupied by single families...,1008059,6,1008059_6_1,1008059_6
2,"In the Klamath county, what is the total valua...",1008059,6,1008059_6_1,1008059_6
3,How much did Nalco pay GRIDCO for electricity ...,1011810,21,1011810_21_0,1011810_21
4,How much coal is used at Alumina refinery of N...,1011810,21,1011810_21_2,1011810_21
...,...,...,...,...,...
230,How much is the rental income from water plant...,2407280,30,2407280_30_0,2407280_30
231,In 2020 how much were the supplemental taxes f...,2415001,65,not detected,2415001_65
232,"As of 2020, what is the total of collections a...",2415001,65,not detected,2415001_65
233,What was the net gain from the operations of t...,2416020,84,2416020_84_0,2416020_84


Then, we'll connect LlamaIndex to the Milvus VDB and create a retriever that uses the NVIDIA `nv-embedqa-e5-v5` model as its embedding function. When we query the retriever, LlamaIndex will use this model to embed the query and then use that embedding to retrieve entries from our Milvus VDB.

**Note:** You *must* use this specific embedding model or the query embeddings won't match up with the embeddings in Milvus. 
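To make the reason for this concrete, here is a minimal sketch of embedding-based retrieval on toy data. It assumes cosine similarity as the ranking metric (the notebook does not state which metric the Milvus collection uses) and replaces the 1024-dimensional model embeddings with hand-written 3-dimensional vectors. `cosine` and `top_k` are illustrative helpers, not LlamaIndex or Milvus functions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, stored, k):
    """Return the ids of the k stored vectors most similar to query_vec."""
    ranked = sorted(stored, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy vectors standing in for the stored table embeddings (values are made up).
stored = [
    ("1003421_2", [0.9, 0.1, 0.0]),
    ("1008059_6", [0.1, 0.9, 0.0]),
    ("1011810_21", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # toy query embedding
nearest = top_k(query_vec, stored, k=2)
print(nearest)  # ['1003421_2', '1008059_6']
```

The ranking is only meaningful when the query vector and the stored vectors come from the same embedding space, which is why the query has to be embedded with the model that produced the stored embeddings.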

In [3]:
embed_model = NVIDIAEmbedding(model="nvidia/nv-embedqa-e5-v5")

vector_store = MilvusVectorStore(
    uri="http://localhost:19530",
    collection_name="nv_ingest_collection",
    doc_id_field="pk",
    embedding_field="vector",
    text_key="text",
    dim=1024,
    output_fields=["source", "content_metadata"],
    overwrite=False
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store, embed_model=embed_model)
retriever = index.as_retriever(similarity_top_k=10)

Finally, we'll use the retriever to calculate recall scores by checking whether a table from the expected PDF and page is retrieved in the top `k` results for `k = 1, 3, 5, 10`.

In [None]:
#TODO: optionally set reranker

In [4]:
hits = defaultdict(list)

for i in tqdm(range(len(df_query))):
    query = df_query['query'][i]
    expected_pdf_page = df_query['pdf_page'][i]
    retrieved_answers = retriever.retrieve(query)
    retrieved_pdfs = [os.path.basename(json.loads(node.json())["node"]["metadata"]["source"]["source_id"]).split('.')[0] for node in retrieved_answers]
    retrieved_pages = [str(json.loads(node.json())["node"]["metadata"]["content_metadata"]["page_number"]) for node in retrieved_answers]
    retrieved_pdf_pages = [f"{pdf}_{page}" for pdf, page in zip(retrieved_pdfs, retrieved_pages)]

    for k in [1, 3, 5, 10]:
        hits[k].append(expected_pdf_page in retrieved_pdf_pages[:k])

for k in hits:
    print(f'  - Recall @{k}: {np.mean(hits[k]) :.3f}')

100%|██████████| 235/235 [01:49<00:00,  2.15it/s]

  - Recall @1: 0.447
  - Recall @3: 0.651
  - Recall @5: 0.736
  - Recall @10: 0.796
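The scoring logic can also be looked at in isolation from the retriever. The following sketch wraps the same top-`k` membership check in a hypothetical `recall_at_k` function and runs it on made-up ranked lists of `pdf_page` keys, so the expected values are easy to verify by hand.

```python
from collections import defaultdict


def recall_at_k(expected_keys, retrieved_lists, ks=(1, 3, 5, 10)):
    """Fraction of queries whose expected key appears in the top-k results."""
    hits = defaultdict(list)
    for expected, retrieved in zip(expected_keys, retrieved_lists):
        for k in ks:
            hits[k].append(expected in retrieved[:k])
    return {k: sum(hits[k]) / len(hits[k]) for k in ks}


# Hypothetical toy data: two queries, each with a ranked list of pdf_page keys.
expected_keys = ["1003421_2", "1008059_6"]
retrieved_lists = [
    ["1003421_2", "1011810_21", "1008059_6"],  # expected key at rank 1
    ["1011810_21", "1003421_2", "1008059_6"],  # expected key at rank 3
]
scores = recall_at_k(expected_keys, retrieved_lists, ks=(1, 3))
print(scores)  # {1: 0.5, 3: 1.0}
```

Note that a hit is counted at the page level: any retrieved table on the expected page counts, so the `table` column of the query set does not enter the score.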





### Calculating chart recall

Calculating the chart recall is almost exactly the same, but we'll need to empty our Milvus VDB and load the extracted charts and corresponding embeddings in.

**Note:** Once again, our Milvus VDB needs to be empty, meaning the following code should return a row count of `0`.

In [10]:
milvus_client.get_collection_stats(collection_name='nv_ingest_collection')

{'row_count': 0}

Then, we'll create another NV-Ingest job with the extract, embed, and VDB upload tasks, but this time we'll extract only charts instead of tables.

In [None]:
batch_job_spec = BatchJobSpec(
    [
        "data/bo767/*.pdf",
    ]
)

extract_task = ExtractTask(
    document_type="pdf",
    extract_text=False,
    extract_images=False,
    extract_tables=False,
    extract_charts=True,
)

embed_task = EmbedTask(
    text=False,
    tables=True,
)

vdb_upload_task = VdbUploadTask()

batch_job_spec.add_task(extract_task)
batch_job_spec.add_task(embed_task)
batch_job_spec.add_task(vdb_upload_task)

client = NvIngestClient(
    message_client_hostname="nv-ingest-ms-runtime",  # Host where nv-ingest-ms-runtime is running
    message_client_port=7670,  # REST port, defaults to 7670
)
job_ids = client.add_job(batch_job_spec)

client.submit_job(job_ids, "morpheus_task_queue", batch_size=10)
result = client.fetch_job_result(job_ids, timeout=60, verbose=True)

Now, our extracted charts and their corresponding embeddings are uploaded and indexed in the Milvus VDB.

In [12]:
milvus_client.get_collection_stats(collection_name='nv_ingest_collection')

{'row_count': 5665}

Then, we'll read in our set of queries and expected retrievals.

In [5]:
df_query = pd.read_csv('charts_with_page_num_fixed.csv')[['query','pdf','page']]
df_query['page'] = df_query['page'] - 1  # page numbers in this CSV start at 1; shift them to start at 0
df_query['pdf_page'] = df_query.apply(lambda x: f"{x.pdf}_{x.page}", axis=1) 
df_query

Unnamed: 0,query,pdf,page,pdf_page
0,What are the top three consumer complaint cate...,1009210,11,1009210_11
1,Which 3 categories did extremely well in terms...,1009210,11,1009210_11
2,What's the longest recent US recession?,1010876,0,1010876_0
3,Is the 12-Month default rate usually higher th...,1010876,0,1010876_0
4,Which allegation is submitted highest to RTAs ...,1014669,0,1014669_0
...,...,...,...,...
263,"After the 2008 recession, what percentage of p...",2384395,6,2384395_6
264,what were the top 3 major religious groups in ...,2392676,5,2392676_5
265,What percentage of people in the world identif...,2392676,5,2392676_5
266,"Between 2003 and 2019, has the household mortg...",2410699,189,2410699_189
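The page shift matters because the expected key and the retrieved key are compared as plain strings. The sketch below shows both sides of that comparison with two hypothetical helpers, `expected_key` and `retrieved_key`. The metadata dictionary is a made-up example shaped like the fields the recall loop reads (`source.source_id` and `content_metadata.page_number`), and the file path in it is an assumption for illustration.

```python
import os


def expected_key(pdf, page, one_based=False):
    """Build the pdf_page key, shifting 1-based page numbers to 0-based."""
    return f"{pdf}_{page - 1 if one_based else page}"


def retrieved_key(metadata):
    """Build the same key from a retrieved node's metadata dictionary."""
    source_id = metadata["source"]["source_id"]
    pdf = os.path.basename(source_id).split(".")[0]  # file name without extension
    page = metadata["content_metadata"]["page_number"]
    return f"{pdf}_{page}"


# Hypothetical metadata; the real source_id format may differ.
metadata = {
    "source": {"source_id": "data/bo767/1009210.pdf"},
    "content_metadata": {"page_number": 11},
}

# A 1-based page 12 in the CSV lines up with page_number 11 in the metadata.
assert expected_key(1009210, 12, one_based=True) == retrieved_key(metadata)
```

Without the shift, every expected key would point one page past the retrieved one and the recall scores would be understated.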


And finally, we can use our LlamaIndex retriever from earlier to calculate recall scores in the same manner.

In [None]:
# TODO: optionally set reranker 

In [7]:
hits = defaultdict(list)

for i in tqdm(range(len(df_query))):
    query = df_query['query'][i]
    expected_pdf_page = df_query['pdf_page'][i]
    retrieved_answers = retriever.retrieve(query)
    retrieved_pdfs = [os.path.basename(json.loads(node.json())["node"]["metadata"]["source"]["source_id"]).split('.')[0] for node in retrieved_answers]
    retrieved_pages = [str(json.loads(node.json())["node"]["metadata"]["content_metadata"]["page_number"]) for node in retrieved_answers]
    retrieved_pdf_pages = [f"{pdf}_{page}" for pdf, page in zip(retrieved_pdfs, retrieved_pages)]

    for k in [1, 3, 5, 10]:
        hits[k].append(expected_pdf_page in retrieved_pdf_pages[:k])

for k in hits:
    print(f'  - Recall @{k}: {np.mean(hits[k]) :.3f}')

100%|██████████| 268/268 [02:16<00:00,  1.96it/s]

  - Recall @1: 0.593
  - Recall @3: 0.713
  - Recall @5: 0.757
  - Recall @10: 0.810



