# Aim:
Assess the performance of different RAG retrieval methods for financial document analysis (preliminary background research for consulting or stock purchases).

# Assessment criteria:
1. relevance
2. length of retrieved context
3. <s>speed</s>
4. <s>cost</s>

# Methods to test:
1. Dense Embeddings
   1.1 parameters
   1.2 <s>fine-tune embedding model (needs a GPU machine, too expensive for now)</s>
2. ColBERT
3. Hybrid retriever and rerank
4. <s>Knowledge Augmented Generation (KAG, needs a domain-specific architecture built from scratch)</s>
5. <s>Contextual retrieval preprocessing (uses an LLM to search through all chunks, too expensive)</s>
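
A hybrid retriever is often built by normalizing the scores of two retrievers and combining them with fixed weights. The sketch below illustrates that idea only; it is not necessarily how the retriever in this study was implemented. The function names, the min-max normalization, the 0.7/0.3 default weights, and the toy scores are all assumptions made for illustration, and the reranking step is left out.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(semantic_scores, keyword_scores, w_semantic=0.7, w_keyword=0.3):
    """Weighted sum of the two normalized score lists (weights are assumed)."""
    semantic = min_max(semantic_scores)
    keyword = min_max(keyword_scores)
    return [w_semantic * s + w_keyword * k for s, k in zip(semantic, keyword)]


def top_k(fused, k=5):
    """Indices of the k highest fused scores, best first."""
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return order[:k]


# Toy scores for four candidate chunks (made-up numbers, illustration only).
semantic = [0.82, 0.40, 0.75, 0.10]   # e.g. from a neural retriever
keyword = [1.0, 9.0, 3.0, 0.0]        # e.g. from BM25
fused = fuse_scores(semantic, keyword)
ranking = top_k(fused, k=3)
print(ranking)
```

Normalizing before the weighted sum matters because BM25 scores and neural similarity scores live on different scales; without it, one retriever would dominate regardless of the weights.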

# Test questions:

The test questions were designed to assess retrieval performance under different scenarios. The first four questions (Q1 to Q4) require a coarse search and return text content; Q3 and Q4 require large pieces of context. The last seven questions (Q5 to Q11) need a precise search that returns one specific number, and some of these numbers come from tables. Several pairs share keywords but differ in one detail: Q5 and Q6 differ only in the year (2025 revenue of NVIDIA vs. 2024 revenue of NVIDIA), Q7 and Q10 contrast total liabilities with total current liabilities, and Q9 and Q10 contrast total current assets with total current liabilities.


In [1]:
import os
import numpy as np
import pandas as pd

In [25]:
def multiple_strreplace(string, replace_dic):
    """Apply every placeholder -> value substitution in replace_dic to string."""
    for k, v in replace_dic.items():
        string = string.replace(k, v)
    return string

def parse_queries(qa_fp, replace_dic):
    """Read the question templates from a CSV and fill in the placeholders."""
    qa = pd.read_csv(qa_fp)
    return [multiple_strreplace(query, replace_dic) for query in qa['question']]

In [30]:
company_name = 'NVDIA'
year1, year2 = 2025, 2024
qa_fp = '../inputs/Q-A.csv'

In [31]:
replace_dic = {'{company_name}': company_name,
               '{year1}': str(year1),
               '{year2}': str(year2)}
parse_queries(qa_fp, replace_dic)

['What kind of products or services is NVDIA providing?',
 'Who are the customers of NVDIA or what types of markets are NVDIA operating in?',
 'Who are the competitors of NVDIA?',
 "What are the risk factors and uncertainties that could affect the NVDIA's future performance?",
 'What is the 2025 revenue of NVDIA?',
 'What is the 2024 revenue of NVDIA?',
 'What is the 2025 total liabilities?',
 "What is the 2025 total shareholders' equity?",
 'What is the 2025 total current assets?',
 'What is the 2025 total current liabilities?',
 'What is the 2025 gross margin?']

## Results

The eleven questions listed above were used as test queries for each retrieval method. Each method returns 5 documents per question. Columns 'Q1' to 'Q11' below give the number of retrieved documents that are relevant to the corresponding question and contain the information needed to answer it.
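
The per-question score described above can be sketched as a simple count. This is a minimal illustration that assumes relevance judgments are available as a set of chunk ids; the function name, the ids, and the judgments below are hypothetical, and the notebook does not state how relevance was actually judged.

```python
def count_relevant(retrieved_ids, relevant_ids, k=5):
    """Number of the top-k retrieved chunks that were judged relevant."""
    relevant = set(relevant_ids)
    return sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)


# Hypothetical retrieval result and relevance judgments for one question.
retrieved = ["c12", "c03", "c44", "c07", "c19", "c50"]
relevant = {"c03", "c07", "c99"}

score = count_relevant(retrieved, relevant)
print(score)
```

With 5 documents returned per question, each score lies between 0 and 5, and a score above 0 means the question could be answered from the retrieved context.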

In [33]:
scores = pd.read_csv('../outputs/retriever_score.csv')
scores.head(2)

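| Unnamed: 0 | method | model | parameters | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | Q11 | ave |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | dense embed | minishlab/potion-base-8M | 500/100 | 1 | 0 | 0 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0.545455 |
| 1 | dense embed | BAAI/bge-base-en-v1.5 | 500/100 | 0 | 0 | 1 | 3 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 0.727273 |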


In [None]:
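# Per-question score columns (Q1..Q11)
s_cols = [c for c in scores.columns if c.startswith('Q')]
# Average number of relevant documents per question
scores['ave'] = scores[s_cols].mean(axis=1)
# Number of questions with at least one relevant document retrieved
scores['answered'] = (scores[s_cols] > 0).sum(axis=1)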

In [40]:
scores

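| Unnamed: 0 | method | model | parameters | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | Q11 | ave | answered |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | dense embed | minishlab/potion-base-8M | 500/100 | 1 | 0 | 0 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0.545455 | 4 |
| 1 | dense embed | BAAI/bge-base-en-v1.5 | 500/100 | 0 | 0 | 1 | 3 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 0.727273 | 5 |
| 2 | dense embed | minishlab/potion-base-8M | 800/100 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0.636364 | 7 |
| 3 | dense embed | BAAI/bge-base-en-v1.5 | 800/100 | 0 | 1 | 1 | 3 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0.909091 | 8 |
| 4 | colbert | colbertv2.0 | 512 | 2 | 4 | 0 | 5 | 3 | 3 | 3 | 1 | 1 | 1 | 2 | 2.272727 | 10 |
| 5 | hybrid | colbertv2.0+BM25+BAAI/bge-reranker-base | 0.7/0.3 | 3 | 1 | 2 | 5 | 3 | 3 | 1 | 1 | 1 | 1 | 4 | 2.272727 | 11 |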


## Conclusion:
The hybrid retriever (ColBERT + BM25 + reranker) performed best: it was the only method that retrieved at least one relevant document for all 11 questions. Its average score (2.27) equals that of ColBERT alone, which missed Q3. ColBERT clearly outperformed the dense embeddings, whose average scores ranged from 0.55 to 0.91. Keyword search contributed little to the hybrid result, but reranking moved the most relevant content to the top, especially for precise searches. Of the two dense embedding models, the one with more parameters (BAAI/bge-base-en-v1.5) worked better, and the larger context setting (800/100 instead of 500/100) improved both models. Due to limited funding and time, I did not explore larger or fine-tuned embedding models, which may outperform all the methods included in this study.
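
As a quick cross-check, the summary figures behind this comparison can be recomputed from the per-question counts in the results table. The sketch below copies those counts by hand into a plain dictionary; the row labels are shorthand chosen here, not names used elsewhere in the notebook.

```python
# Per-question counts (Q1..Q11) copied from the results table.
counts = {
    "dense potion-8M 500/100": [1, 0, 0, 3, 1, 0, 0, 1, 0, 0, 0],
    "dense bge-base 500/100":  [0, 0, 1, 3, 1, 0, 0, 0, 2, 1, 0],
    "dense potion-8M 800/100": [1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    "dense bge-base 800/100":  [0, 1, 1, 3, 1, 1, 0, 1, 1, 1, 0],
    "colbert":                 [2, 4, 0, 5, 3, 3, 3, 1, 1, 1, 2],
    "hybrid":                  [3, 1, 2, 5, 3, 3, 1, 1, 1, 1, 4],
}

# Average relevant documents per question, and number of questions
# with at least one relevant document among those retrieved.
summary = {
    name: {
        "ave": round(sum(q) / len(q), 3),
        "answered": sum(1 for c in q if c > 0),
    }
    for name, q in counts.items()
}

for name, stats in summary.items():
    print(f"{name:<26} ave={stats['ave']:.3f} answered={stats['answered']}")
```

Reporting both numbers is useful here because the average alone does not separate ColBERT from the hybrid retriever, while the count of answered questions does.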