# Calculate retrieval recall accuracy with NV-Ingest and LlamaIndex

In this notebook, we'll use NV-Ingest and LlamaIndex to calculate the end-to-end recall accuracy of a retrieval pipeline. The pipeline combines NV-Ingest's extraction, embedding, and vector database (VDB) upload tasks with LlamaIndex's retrieval functionality.

**Note:** To run this notebook, you'll need the NV-Ingest microservice running along with all of the other included microservices. To do this, make sure all of the services are uncommented in [docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml) and follow the [quickstart guide](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#quickstart) to start everything up.

To start, make sure the required LlamaIndex packages and `pymilvus` are installed and up to date.

In [None]:
pip install -qU llama_index llama-index-embeddings-nvidia llama-index-postprocessor-nvidia-rerank llama-index-vector-stores-milvus pymilvus

## Calculating table recall

**Important:** The `nv_ingest_collection` collection in the Milvus VDB must be empty before you run this section, so that only the tables extracted here can be retrieved. The following code should therefore return a row count of `0`.

In [8]:
from pymilvus import MilvusClient

milvus_client = MilvusClient("http://localhost:19530")
milvus_client.get_collection_stats(collection_name='nv_ingest_collection')

{'row_count': 0}

In [37]:
import os
import glob
import logging
import time
import json
from tqdm import tqdm
from collections import defaultdict
import numpy as np
import pandas as pd

from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.embeddings.nvidia import NVIDIAEmbedding

First, we'll use NV-Ingest's pdfium model ensemble to extract tables, use the included embedding NIM to generate embeddings for those tables, and store the embeddings in the Milvus VDB. We'll do this by creating a batch NV-Ingest job and adding tasks for table extraction, embedding, and VDB upload.

In [None]:
from nv_ingest_client.client import NvIngestClient
from nv_ingest_client.primitives import BatchJobSpec
from nv_ingest_client.primitives.tasks import ExtractTask
from nv_ingest_client.primitives.tasks import EmbedTask
from nv_ingest_client.primitives.tasks import VdbUploadTask

batch_job_spec = BatchJobSpec(
    [
        "../data/bo767/*.pdf",
    ]
)

extract_task = ExtractTask(
    document_type="pdf",
    extract_text=False,
    extract_images=False,
    extract_tables=True,
    extract_charts=False,
)

embed_task = EmbedTask(
    text=False,
    tables=True,
)

vdb_upload_task = VdbUploadTask()

batch_job_spec.add_task(extract_task)
batch_job_spec.add_task(embed_task)
batch_job_spec.add_task(vdb_upload_task)

client = NvIngestClient(
    message_client_hostname="localhost",
    message_client_port=7670,
)
job_ids = client.add_job(batch_job_spec)

client.submit_job(job_ids, "morpheus_task_queue", batch_size=10)
result = client.fetch_job_result(job_ids, timeout=60, verbose=True)

Now our extracted tables and their corresponding embeddings are uploaded and indexed in the Milvus VDB.

In [11]:
milvus_client.get_collection_stats(collection_name='nv_ingest_collection')

{'row_count': 27198}

Next, we'll read in our set of queries and expected retrievals.

In [2]:
df_query = pd.read_csv('../data/table_queries_cleaned_235.csv')[['query','pdf','page','table']]
df_query['pdf_page'] = df_query.apply(lambda x: f"{x.pdf}_{x.page}", axis=1)
df_query

Unnamed: 0,query,pdf,page,table,pdf_page
0,How much did Pendleton County spend out of the...,1003421,2,1003421_2_0,1003421_2
1,How many units are occupied by single families...,1008059,6,1008059_6_1,1008059_6
2,"In the Klamath county, what is the total valua...",1008059,6,1008059_6_1,1008059_6
3,How much did Nalco pay GRIDCO for electricity ...,1011810,21,1011810_21_0,1011810_21
4,How much coal is used at Alumina refinery of N...,1011810,21,1011810_21_2,1011810_21
...,...,...,...,...,...
230,How much is the rental income from water plant...,2407280,30,2407280_30_0,2407280_30
231,In 2020 how much were the supplemental taxes f...,2415001,65,not detected,2415001_65
232,"As of 2020, what is the total of collections a...",2415001,65,not detected,2415001_65
233,What was the net gain from the operations of t...,2416020,84,2416020_84_0,2416020_84


Then, we'll connect LlamaIndex to the Milvus VDB and create a retriever that uses the NVIDIA `nv-embedqa-e5-v5` model as its embedding function. When we query the retriever, LlamaIndex will use this model to embed the query and then use that embedding to retrieve entries from our Milvus VDB.

**Note:** You *must* use this specific embedding model or the query embeddings won't match up with the embeddings in Milvus. 

In [13]:
embed_model = NVIDIAEmbedding(base_url="http://localhost:8012/v1", model="nvidia/nv-embedqa-e5-v5")

vector_store = MilvusVectorStore(
    uri="http://localhost:19530",
    collection_name="nv_ingest_collection",
    doc_id_field="pk",
    embedding_field="vector",
    text_key="text",
    dim=1024,
    output_fields=["source", "content_metadata"],
    overwrite=False
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store, embed_model=embed_model)
retriever = index.as_retriever(similarity_top_k=10)
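
Conceptually, retrieval ranks the stored vectors by their similarity to the query embedding and keeps the top `k`. The toy sketch below illustrates that idea with hand-written 3-dimensional vectors and cosine similarity. It is not how Milvus or LlamaIndex implement search, and the similarity metric actually configured for `nv_ingest_collection` is not stated in this notebook. `cosine` and `top_k` are hypothetical helper names, and the vectors are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, stored, k):
    """Return the ids of the k stored vectors most similar to query_vec."""
    ranked = sorted(stored, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy "collection": (id, embedding) pairs with made-up values.
stored = [
    ("table_a", [1.0, 0.0, 0.0]),
    ("table_b", [0.0, 1.0, 0.0]),
    ("table_c", [0.9, 0.1, 0.0]),
]
query_vec = [1.0, 0.0, 0.0]
print(top_k(query_vec, stored, k=2))  # ['table_a', 'table_c']
```

Similarity scores are only meaningful when the query and the stored vectors come from the same embedding model, which is why the note above insists on `nv-embedqa-e5-v5`.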

Finally, we'll use the retriever to calculate recall scores by checking whether a table from the expected PDF and page is retrieved in the top `k` results for `k = 1, 3, 5, 10`.

In [4]:
hits = defaultdict(list)

for i in tqdm(range(len(df_query))):
    query = df_query['query'][i]
    expected_pdf_page = df_query['pdf_page'][i]
    retrieved_answers = retriever.retrieve(query)

    retrieved_pdfs = [os.path.basename(json.loads(node.json())["node"]["metadata"]["source"]["source_id"]).split('.')[0] for node in retrieved_answers]
    retrieved_pages = [str(json.loads(node.json())["node"]["metadata"]["content_metadata"]["page_number"]) for node in retrieved_answers]
    retrieved_pdf_pages = [f"{pdf}_{page}" for pdf, page in zip(retrieved_pdfs, retrieved_pages)]

    for k in [1, 3, 5, 10]:
        hits[k].append(expected_pdf_page in retrieved_pdf_pages[:k])

for k in hits:
    print(f'  - Recall @{k}: {np.mean(hits[k]) :.3f}')

100%|██████████| 235/235 [01:49<00:00,  2.15it/s]

  - Recall @1: 0.447
  - Recall @3: 0.651
  - Recall @5: 0.736
  - Recall @10: 0.796
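
To make the metric concrete, here is a minimal worked example of the same hit-counting logic on three invented rankings. The keys reuse the `<pdf>_<page>` format from the query table, but the rankings are made up for illustration, and `hit_at_k` is a hypothetical helper name.

```python
from collections import defaultdict


def hit_at_k(expected, retrieved, k):
    """True if the expected key appears among the first k retrieved keys."""
    return expected in retrieved[:k]


# Made-up rankings for three queries: (expected key, retrieved keys in rank order).
examples = [
    ("1003421_2", ["1003421_2", "1008059_6", "1011810_21"]),   # hit at rank 1
    ("1008059_6", ["1011810_21", "1003421_2", "1008059_6"]),   # hit at rank 3
    ("1011810_21", ["1003421_2", "1008059_6", "2407280_30"]),  # miss
]

hits = defaultdict(list)
for expected, retrieved in examples:
    for k in (1, 3):
        hits[k].append(hit_at_k(expected, retrieved, k))

# Recall@k is the fraction of queries whose expected key was found in the top k.
recall = {k: sum(v) / len(v) for k, v in hits.items()}
print(recall)
```

Here one of three queries hits at `k = 1` and two of three hit at `k = 3`, so the scores are 1/3 and 2/3. Because each query has exactly one expected page, this hit rate is the recall.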





## Calculating chart recall

Calculating chart recall works almost exactly the same way, but we first need to empty our Milvus VDB and then load in the extracted charts and their corresponding embeddings.

**Note:** Once again, our Milvus VDB needs to be empty, meaning the following code should return a row count of `0`.

In [10]:
milvus_client.get_collection_stats(collection_name='nv_ingest_collection')

{'row_count': 0}

Then, we'll create another NV-Ingest job with the extract, embed, and VDB upload tasks, but this time we'll extract only charts instead of tables.

In [None]:
batch_job_spec = BatchJobSpec(
    [
        "../data/bo767/*.pdf",
    ]
)

extract_task = ExtractTask(
    document_type="pdf",
    extract_text=False,
    extract_images=False,
    extract_tables=False,
    extract_charts=True,
)

embed_task = EmbedTask(
    text=False,
    tables=True,
)

vdb_upload_task = VdbUploadTask()

batch_job_spec.add_task(extract_task)
batch_job_spec.add_task(embed_task)
batch_job_spec.add_task(vdb_upload_task)

client = NvIngestClient(
    message_client_hostname="localhost",
    message_client_port=7670,
)
job_ids = client.add_job(batch_job_spec)

client.submit_job(job_ids, "morpheus_task_queue", batch_size=10)
result = client.fetch_job_result(job_ids, timeout=60, verbose=True)

Now our extracted charts and their corresponding embeddings are uploaded and indexed in the Milvus VDB.

In [12]:
milvus_client.get_collection_stats(collection_name='nv_ingest_collection')

{'row_count': 5665}

Then, we'll read in our set of queries and expected retrievals.

In [9]:
df_query = pd.read_csv('../data/charts_with_page_num_fixed.csv')[['query','pdf','page']]
df_query['page'] = df_query['page'] - 1  # subtract 1 because page numbers in this CSV start at 1
df_query['pdf_page'] = df_query.apply(lambda x: f"{x.pdf}_{x.page}", axis=1) 
df_query

Unnamed: 0,query,pdf,page,pdf_page
0,What are the top three consumer complaint cate...,1009210,11,1009210_11
1,Which 3 categories did extremely well in terms...,1009210,11,1009210_11
2,What's the longest recent US recession?,1010876,0,1010876_0
3,Is the 12-Month default rate usually higher th...,1010876,0,1010876_0
4,Which allegation is submitted highest to RTAs ...,1014669,0,1014669_0
...,...,...,...,...
263,"After the 2008 recession, what percentage of p...",2384395,6,2384395_6
264,what were the top 3 major religious groups in ...,2392676,5,2392676_5
265,What percentage of people in the world identif...,2392676,5,2392676_5
266,"Between 2003 and 2019, has the household mortg...",2410699,189,2410699_189
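
The page shift matters because the expected key is compared as a string against keys built from the retrieved `page_number`. The sketch below shows the effect on one key. It assumes, as the subtraction in the cell above implies, that the retrieved page numbers are 0-based. `pdf_page_key` is a hypothetical helper name.

```python
def pdf_page_key(pdf, page, one_based=False):
    """Build a '<pdf>_<page>' key, shifting 1-based pages to 0-based."""
    page = int(page) - 1 if one_based else int(page)
    return f"{pdf}_{page}"


# A row whose CSV page is 12 (1-based) should match a retrieved page_number of 11.
expected = pdf_page_key(1009210, 12, one_based=True)
retrieved = pdf_page_key("1009210", 11)
print(expected, retrieved, expected == retrieved)  # 1009210_11 1009210_11 True
```

Without the shift, the expected key would be `1009210_12` and the comparison would count as a miss even when the right page was retrieved.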


Finally, we can use our LlamaIndex retriever from earlier to calculate recall scores in the same manner.

In [18]:
hits = defaultdict(list)

for i in tqdm(range(len(df_query))):
    query = df_query['query'][i]
    expected_pdf_page = df_query['pdf_page'][i]
    retrieved_answers = retriever.retrieve(query)
    retrieved_pdfs = [os.path.basename(json.loads(node.json())["node"]["metadata"]["source"]["source_id"]).split('.')[0] for node in retrieved_answers]
    retrieved_pages = [str(json.loads(node.json())["node"]["metadata"]["content_metadata"]["page_number"]) for node in retrieved_answers]
    retrieved_pdf_pages = [f"{pdf}_{page}" for pdf, page in zip(retrieved_pdfs, retrieved_pages)]


    for k in [1, 3, 5, 10]:
        hits[k].append(expected_pdf_page in retrieved_pdf_pages[:k])

for k in hits:
    print(f'  - Recall @{k}: {np.mean(hits[k]) :.3f}')

100%|██████████| 268/268 [01:21<00:00,  3.31it/s]

  - Recall @1: 0.593
  - Recall @3: 0.720
  - Recall @5: 0.746
  - Recall @10: 0.802





## Calculating text recall

To calculate text recall, we'll use three popular question answering (QA) datasets. We'll use NV-Ingest to extract the text and once again embed and upload to Milvus.

The corpus for each of the QA datasets comes as a JSONL file of text snippets and their corresponding IDs, which are used to check whether the correct snippet was returned. To ingest these snippets with NV-Ingest, we'll need to split them into individual text files, with the ID used as the file name.

In [None]:
from pathlib import Path

for dataset in ["fiqa", "nq_sample35", "hotpotqa_sample35"]:
    Path(f"../data/{dataset}/textfiles").mkdir(parents=False, exist_ok=True)
    corpus_df = pd.read_json(f"../data/{dataset}/corpus.jsonl", lines=True)

    for i in tqdm(range(len(corpus_df))):
        text_id = str(corpus_df["_id"][i])
        text = corpus_df["text"][i]
    
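        # Write with an explicit encoding so non-ASCII snippets don't depend on the platform default
        with open(f"../data/{dataset}/textfiles/{text_id}.txt", "w", encoding="utf-8") as f: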
            f.write(text)
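
Using the ID as the file name is what lets the recall cells further down recover the snippet ID from each retrieved node's `source_id` with `os.path.basename(...).split('.')[0]`. The sketch below shows that round trip. It also shows a caveat: an ID containing a dot would be truncated by that expression. The IDs displayed in this notebook contain no dots, so the `doc.v2` path is a hypothetical case, and `id_from_source` and `id_from_source_safe` are illustrative names.

```python
import os


def id_from_source(source_id):
    """Recover the snippet ID from a file path by dropping directory and extension."""
    return os.path.basename(source_id).split(".")[0]


def id_from_source_safe(source_id):
    """Variant that only strips the final extension, so dots in the ID survive."""
    return os.path.splitext(os.path.basename(source_id))[0]


print(id_from_source("../data/fiqa/textfiles/566392.txt"))       # 566392
print(id_from_source("../data/demo/textfiles/doc.v2.txt"))       # doc
print(id_from_source_safe("../data/demo/textfiles/doc.v2.txt"))  # doc.v2
```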

If you ran the table and chart recall cells, you should already have a retriever connected to the Milvus VDB. If not, we can create one here.

In [None]:
embed_model = NVIDIAEmbedding(base_url="http://localhost:8012/v1", model="nvidia/nv-embedqa-e5-v5")

vector_store = MilvusVectorStore(
    uri="http://localhost:19530",
    collection_name="nv_ingest_collection",
    doc_id_field="pk",
    embedding_field="vector",
    text_key="text",
    dim=1024,
    output_fields=["source", "content_metadata"],
    overwrite=False
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store, embed_model=embed_model)
retriever = index.as_retriever(similarity_top_k=10)

### FiQA

First, we'll create an NV-Ingest job that extracts the text from our text files, embeds it, and uploads it to our Milvus VDB.

In [30]:
batch_job_spec = BatchJobSpec(
    [
        "../data/fiqa/textfiles/*.txt",
    ]
)

extract_task = ExtractTask(
    document_type="text",
    extract_text=True,
    extract_images=False,
    extract_tables=False,
    extract_charts=False,
)

embed_task = EmbedTask(
    text=True,
    tables=False,
)

vdb_upload_task = VdbUploadTask()

batch_job_spec.add_task(extract_task)
batch_job_spec.add_task(embed_task)
batch_job_spec.add_task(vdb_upload_task)

client = NvIngestClient(
    message_client_hostname="localhost",
    message_client_port=7670,
)
job_ids = client.add_job(batch_job_spec)

client.submit_job(job_ids, "morpheus_task_queue", batch_size=10)
result = client.fetch_job_result(job_ids, timeout=60, verbose=True)

Next, we'll load in our query dataset, which contains the questions and their corresponding IDs.

In [39]:
df_query = pd.read_json("../data/fiqa/queries.jsonl", lines=True)
df_query

Unnamed: 0,_id,text,metadata
0,0,What is considered a business expense on a bus...,{}
1,4,Business Expense - Car Insurance Deductible Fo...,{}
2,5,Starting a new online business,{}
3,6,“Business day” and “due date” for bills,{}
4,7,New business owner - How do taxes work for the...,{}
...,...,...,...
6643,4102,How can I determine if my rate of return is “g...,{}
6644,3566,Where can I buy stocks if I only want to inves...,{}
6645,94,Using credit card points to pay for tax deduct...,{}
6646,2551,How to find cheaper alternatives to a traditio...,{}


Then we'll load in our evaluation set, which maps each query to its relevant corpus text snippets.

In [27]:
df_test = pd.read_csv('../data/fiqa/qrels/test.tsv', sep='\t')
df_test

Unnamed: 0,query-id,corpus-id,score
0,8,566392,1
1,8,65404,1
2,15,325273,1
3,18,88124,1
4,26,285255,1
...,...,...,...
1701,11039,330058,1
1702,11039,91183,1
1703,11054,155053,1
1704,11054,321015,1


Finally, we'll iterate through our evaluation set, check whether the expected answer for each query is in the top 1, 3, 5, and 10 results, and then average those results to get our recall scores.

In [28]:
hits = defaultdict(list)

for i in tqdm(range(len(df_test))):
    query_id = df_test['query-id'][i]
    answer_id = str(df_test['corpus-id'][i])
    query = df_query[df_query['_id'] == query_id]['text'].to_list()[0]

    retrieved_answers = retriever.retrieve(query)
    retrieved_ids = [os.path.basename(node.metadata["source"]["source_id"]).split('.')[0] for node in retrieved_answers]

    for k in [1, 3, 5, 10]:
        hits[k].append(answer_id in retrieved_ids[:k])

for k in hits:
    print(f'  - Recall @{k}: {np.mean(hits[k]) :.3f}')

100%|██████████| 1706/1706 [08:47<00:00,  3.24it/s]

  - Recall @1: 0.174
  - Recall @3: 0.328
  - Recall @5: 0.407
  - Recall @10: 0.494





For NQ_Sample35 and HotpotQA_Sample35, we'll repeat exactly the same process as for FiQA, just with different datasets.
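
Since the evaluation loop is identical for every dataset, it could be wrapped in a helper instead of being copied. The following is a minimal sketch under stated assumptions, not the notebook's own code: `evaluate_recall` is a hypothetical name, the retriever is passed in as a plain callable that returns ranked corpus IDs, and results are cached per query on the assumption that retrieval is deterministic. The toy data and fake rankings exist only so the sketch runs without any services.

```python
from collections import defaultdict


def evaluate_recall(retrieve_ids, qrels, queries, ks=(1, 3, 5, 10)):
    """Pair-level recall@k.

    retrieve_ids: callable mapping query text -> ranked list of corpus IDs.
    qrels: iterable of (query_id, corpus_id) pairs.
    queries: dict mapping query_id -> query text.
    """
    hits = defaultdict(list)
    cache = {}  # avoid re-running retrieval for queries with several answers
    for query_id, corpus_id in qrels:
        if query_id not in cache:
            cache[query_id] = retrieve_ids(queries[query_id])
        retrieved = cache[query_id]
        for k in ks:
            hits[k].append(str(corpus_id) in retrieved[:k])
    return {k: sum(v) / len(v) for k, v in hits.items()}


# Toy data and a fake retriever (all values invented).
queries = {"q1": "first question", "q2": "second question"}
qrels = [("q1", "doc1"), ("q1", "doc2"), ("q2", "doc9")]
fake_rankings = {
    "first question": ["doc1", "doc5", "doc2"],
    "second question": ["doc3", "doc4", "doc5"],
}

scores = evaluate_recall(fake_rankings.get, qrels, queries, ks=(1, 3))
print(scores)
```

With the notebook's objects, `queries` could be built as `dict(zip(df_query['_id'], df_query['text']))`, `qrels` as `zip(df_test['query-id'], df_test['corpus-id'])`, and `retrieve_ids` as a small function wrapping `retriever.retrieve` and the `os.path.basename(...)` expression used in the cells here.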

### NQ_Sample35

In [None]:
from nv_ingest_client.client import NvIngestClient
from nv_ingest_client.primitives import BatchJobSpec
from nv_ingest_client.primitives.tasks import ExtractTask
from nv_ingest_client.primitives.tasks import EmbedTask
from nv_ingest_client.primitives.tasks import VdbUploadTask

batch_job_spec = BatchJobSpec(
    [
        "../data/nq_sample35/textfiles/*.txt",
    ]
)

extract_task = ExtractTask(
    document_type="text",
    extract_text=True,
    extract_images=False,
    extract_tables=False,
    extract_charts=False,
)

embed_task = EmbedTask(
    text=True,
    tables=False,
)

vdb_upload_task = VdbUploadTask()

batch_job_spec.add_task(extract_task)
batch_job_spec.add_task(embed_task)
batch_job_spec.add_task(vdb_upload_task)

client = NvIngestClient(
    message_client_hostname="localhost",
    message_client_port=7670,
)
job_ids = client.add_job(batch_job_spec)

client.submit_job(job_ids, "morpheus_task_queue", batch_size=10)
result = client.fetch_job_result(job_ids, timeout=60, verbose=True)

In [40]:
df_query = pd.read_json("../data/nq_sample35/queries.jsonl", lines=True)
df_query

Unnamed: 0,_id,text,metadata
0,test0,what is non controlling interest on balance sheet,{}
1,test1,how many episodes are in chicago fire season 4,{}
2,test2,who sings love will keep us alive by the eagles,{}
3,test3,who is the leader of the ontario pc party,{}
4,test4,nitty gritty dirt band fishin in the dark album,{}
...,...,...,...
3447,test3447,when is the met office leaving the bbc,{}
3448,test3448,where does junior want to go to find hope,{}
3449,test3449,who does eric end up with in that 70s show,{}
3450,test3450,where does the great outdoors movie take place,{}


In [41]:
df_test = pd.read_csv('../data/nq_sample35/qrels/test.tsv', sep='\t')
df_test

Unnamed: 0,query-id,corpus-id,score
0,test0,doc0,1
1,test0,doc1,1
2,test1,doc6,1
3,test2,doc10,1
4,test3,doc17,1
...,...,...,...
4196,test3449,doc117643,1
4197,test3449,doc117646,1
4198,test3450,doc117662,1
4199,test3450,doc117663,1


In [42]:
hits = defaultdict(list)

for i in tqdm(range(len(df_test))):
    query_id = df_test['query-id'][i]
    answer_id = str(df_test['corpus-id'][i])
    query = df_query[df_query['_id'] == query_id]['text'].to_list()[0]

    retrieved_answers = retriever.retrieve(query)
    retrieved_ids = [os.path.basename(node.metadata["source"]["source_id"]).split('.')[0] for node in retrieved_answers]

    for k in [1, 3, 5, 10]:
        hits[k].append(answer_id in retrieved_ids[:k])

for k in hits:
    print(f'  - Recall @{k}: {np.mean(hits[k]) :.3f}')

100%|██████████| 4201/4201 [15:04<00:00,  4.64it/s]

  - Recall @1: 0.309
  - Recall @3: 0.522
  - Recall @5: 0.615
  - Recall @10: 0.728





### HotpotQA_Sample35

In [None]:
batch_job_spec = BatchJobSpec(
    [
        "../data/hotpotqa_sample35/textfiles/*.txt",
    ]
)

extract_task = ExtractTask(
    document_type="text",
    extract_text=True,
    extract_images=False,
    extract_tables=False,
    extract_charts=False,
)

embed_task = EmbedTask(
    text=True,
    tables=False,
)

vdb_upload_task = VdbUploadTask()

batch_job_spec.add_task(extract_task)
batch_job_spec.add_task(embed_task)
batch_job_spec.add_task(vdb_upload_task)

client = NvIngestClient(
    message_client_hostname="localhost",
    message_client_port=7670,
)
job_ids = client.add_job(batch_job_spec)

client.submit_job(job_ids, "morpheus_task_queue", batch_size=10)
result = client.fetch_job_result(job_ids, timeout=60, verbose=True)

In [12]:
df_query = pd.read_json("../data/hotpotqa_sample35/queries.jsonl", lines=True)
df_query

Unnamed: 0,_id,text,metadata
0,5ab6d31155429954757d3384,What country of origin does House of Cosbys an...,"{'answer': 'American', 'supporting_facts': [['..."
1,5ac0d92f554299012d1db645,"How many fountains where present ""World of Col...","{'answer': '1,200 musical water fountains', 's..."
2,5abd01335542993a06baf9fc,Chris Larceny directed the music video Gon Joc...,"{'answer': 'the Fugees', 'supporting_facts': [..."
3,5abff8c95542994516f4555c,The person where local tradition says Cross La...,"{'answer': 'the Iroquois Confederacy', 'suppor..."
4,5adec8ad55429975fa854f8f,"The actor who played Carl Sweetchuck in the ""P...","{'answer': 'Denise DeClue', 'supporting_facts'..."
...,...,...,...
97847,5ab92307554299753720f72d,What Pakistani actor and writer from Islamabad...,"{'answer': 'Yasir Hussain', 'supporting_facts'..."
97848,5abba3b1554299642a094aed,Are both Volvic and Canfield's Diet Chocolate ...,"{'answer': 'no', 'supporting_facts': [['Volvic..."
97849,5a8173fa554299260e20a28e,Are Billy and Barak both breeds of scenthound?...,"{'answer': 'yes', 'supporting_facts': [['Bosni..."
97850,5a8caf1d554299585d9e3720,Were both of the following rock groups formed ...,"{'answer': 'yes', 'supporting_facts': [['Dig (..."


In [11]:
df_test = pd.read_csv('../data/hotpotqa_sample35/qrels/test.tsv', sep='\t')
df_test

Unnamed: 0,query-id,corpus-id,score
0,5a8b57f25542995d1e6f1371,2816539,1
1,5a8b57f25542995d1e6f1371,10520,1
2,5a8c7595554299585d9e36b6,33022480,1
3,5a8c7595554299585d9e36b6,804602,1
4,5a85ea095542994775f606a8,12342237,1
...,...,...,...
14805,5a8173fa554299260e20a28e,6920767,1
14806,5a8caf1d554299585d9e3720,22220767,1
14807,5a8caf1d554299585d9e3720,1993862,1
14808,5ac132a755429964131be17c,1085678,1


In [None]:
hits = defaultdict(list)

for i in tqdm(range(len(df_test))):
    query_id = df_test['query-id'][i]
    answer_id = str(df_test['corpus-id'][i])
    query = df_query[df_query['_id'] == query_id]['text'].to_list()[0]

    retrieved_answers = retriever.retrieve(query)
    retrieved_ids = [os.path.basename(node.metadata["source"]["source_id"]).split('.')[0] for node in retrieved_answers]

    for k in [1, 3, 5, 10]:
        hits[k].append(answer_id in retrieved_ids[:k])

for k in hits:
    print(f'  - Recall @{k}: {np.mean(hits[k]) :.3f}')