# ICAIF 2024 Finance-RAG Challenge Baseline

This notebook demonstrates a **baseline solution** for the **ICAIF 2024 Finance-RAG Challenge**. The goal of the challenge is to create a **Retrieval-Augmented Generation (RAG)** system for financial data. Participants are tasked with developing systems that retrieve relevant documents from a large corpus and provide accurate, context-aware responses to user queries.

---

## System Components

The system is divided into two main components:

1. **Retrieval**: Retrieves relevant financial documents from a large corpus based on a user query.
2. **Reranking**: Refines the ranking of the retrieved documents to ensure the most relevant information is prioritized.
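
The two components above can be read as a two-stage pipeline: a cheap scorer ranks the whole corpus, and a costlier scorer re-scores only a shortlist. The sketch below is a minimal illustration under stated assumptions, not the `financerag` implementation. The names `retrieve_then_rerank`, `token_overlap`, and `overlap_ratio` are hypothetical, and the token-overlap functions are toy stand-ins for the embedding and cross-encoder models.

```python
from typing import Callable, Dict

Scores = Dict[str, float]  # doc_id -> relevance score


def retrieve_then_rerank(
    query: str,
    corpus: Dict[str, str],
    retrieve_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    top_k: int = 100,
) -> Scores:
    """Score every document cheaply, then re-score only the top_k candidates."""
    # Stage 1 (retrieval): a cheap score for every document in the corpus.
    first_pass = {doc_id: retrieve_score(query, text) for doc_id, text in corpus.items()}
    candidates = sorted(first_pass, key=first_pass.get, reverse=True)[:top_k]
    # Stage 2 (reranking): a costlier score for the shortlisted documents only.
    return {doc_id: rerank_score(query, corpus[doc_id]) for doc_id in candidates}


def token_overlap(query: str, text: str) -> float:
    """Toy stand-in for a retrieval model: number of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def overlap_ratio(query: str, text: str) -> float:
    """Toy stand-in for a reranker: shared tokens normalised by document length."""
    tokens = text.lower().split()
    return token_overlap(query, text) / max(len(tokens), 1)


# Made-up corpus for illustration only.
corpus = {
    "d1": "Adobe revenue grew in fiscal 2023",
    "d2": "Adobe revenue and subscription revenue details for fiscal 2023 and prior years",
    "d3": "Weather report for Tuesday",
}
reranked = retrieve_then_rerank("Adobe revenue 2023", corpus, token_overlap, overlap_ratio, top_k=2)
```

With `top_k=2`, `d3` is dropped before the second stage, so the reranker only scores `d1` and `d2`.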

---

## Model Overview

This baseline notebook uses a combination of `SentenceTransformer` and `CrossEncoder` models to perform these tasks:

- The **retrieval model** is responsible for encoding both the queries and documents into embeddings.
- The **reranking model** evaluates the relevance of the retrieved documents and reorders them accordingly.

In this example, the baseline task is **FinDER**, one of the seven tasks available in the FinanceRAG project. Retrieval uses `intfloat/e5-large-v2`, and reranking uses `cross-encoder/ms-marco-MiniLM-L-12-v2`. Either model can be swapped for another model supported by the `sentence_transformers` library to experiment with performance.

---

## Goal

The goal of this notebook is to provide a **solid foundation** for participants to build more advanced solutions for the challenge. Feel free to customize the task, retrieval model, and reranking model as needed.

---

## Repository Setup and Environment Configuration

The repository for this project is available on GitHub at [linq-rag/FinanceRAG](https://github.com/linq-rag/FinanceRAG).

To clone the repository and set up the environment, follow these steps:

### 1. Clone the repository:

```bash
git clone https://github.com/linq-rag/FinanceRAG.git
cd FinanceRAG
```

### 2. Set up the Python environment:

#### If using `venv` (Python 3.11 or higher required):

```bash
python3 -m venv .venv
source .venv/bin/activate  # On Windows use .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
```

#### If using `conda`:

```bash
conda create -n financerag python=3.11
conda activate financerag
pip install -r requirements.txt
```
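
Since the `venv` instructions above state that Python 3.11 or higher is required, a quick interpreter check before running the notebook may save a confusing import error later. This is a small sketch; `meets_minimum_python` is a hypothetical helper, not part of the repository.

```python
import sys
from typing import Tuple


def meets_minimum_python(minimum: Tuple[int, int] = (3, 11)) -> bool:
    """Return True if the running interpreter is at least the given (major, minor)."""
    return sys.version_info[:2] >= minimum


if not meets_minimum_python():
    found = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python 3.11+ is expected, found {found}")
```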

You should now be ready to run the baseline notebook!

In [1]:
# Add parent directory to Python path to find the financerag package
import sys
from pathlib import Path

# Add the parent directory (FinanceRAG root) to the path
project_root = Path().resolve().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

In [2]:
# Step 1: Import necessary libraries
# --------------------------------------
# Import required libraries for document retrieval, reranking, and logging setup.
import logging

from sentence_transformers import CrossEncoder

from financerag.rerank import CrossEncoderReranker
from financerag.retrieval import DenseRetrieval, SentenceTransformerEncoder
from financerag.tasks import FinDER

# Setup basic logging configuration to show info level messages.
logging.basicConfig(level=logging.INFO)



In [3]:
# Step 2: Initialize FinDER Task
# --------------------------
# In this baseline example, we are using the FinDER task, one of the seven available tasks in this project.
# If you want to use a different task, for example, 'OtherTask', you can change the task initialization as follows:
#
# Example:
# from financerag.tasks import OtherTask
# finder_task = OtherTask()
#
# For this baseline, we proceed with FinDER.
finder_task = FinDER()


INFO:financerag.common.loader:Loading Corpus...
INFO:financerag.common.loader:Loaded 13867 Documents.
INFO:financerag.common.loader:Corpus Example: {'id': 'ADBE20230004', 'title': 'ADBE OVERVIEW', 'text': 'Adobe is a global technology company with a mission to change the world through personalized digital experiences. For over four decades, Adobe’s innovations have transformed how individuals, teams, businesses, enterprises, institutions, and governments engage and interact across all types of media. Our products, services and solutions are used around the world to imagine, create, manage, deliver, measure, optimize and engage with content across surfaces and fuel digital experiences. We have a diverse user base that includes consumers, communicators, creative professionals, developers, students, small and medium businesses and enterprises. We are also empowering creators by putting the power of artificial intelligence (“AI”) in their hands, and doing so in ways we believe are responsi

In [None]:
# Step 3: Initialize the encoder for dense retrieval
# --------------------------------------------------
# Initialize the retrieval model using SentenceTransformers. This model will be responsible
# for encoding both the queries and documents into embeddings.
#
# You can replace 'intfloat/e5-large-v2' with any other model supported by SentenceTransformers.
# For example: 'BAAI/bge-large-en-v1.5', 'Linq-AI-Research/Linq-Embed-Mistral', etc.
encoder_model = SentenceTransformerEncoder(
    model_name_or_path='intfloat/e5-large-v2',
    query_prompt='query: ',
    doc_prompt='passage: ',
)
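
The `query_prompt` and `doc_prompt` values in the cell above follow the E5 convention of prefixing inputs with `query: ` and `passage: ` before encoding. The sketch below illustrates that prefixing and how two embeddings might then be compared. It makes several assumptions: `add_prompt` and `cosine_similarity` are hypothetical helpers rather than `financerag` functions, the vectors are toy values rather than real model output, and the use of cosine similarity as the scoring function is assumed, not confirmed by this notebook.

```python
import math
from typing import List, Sequence


def add_prompt(texts: Sequence[str], prompt: str) -> List[str]:
    """Prepend a fixed prompt to each text (hypothetical helper)."""
    return [prompt + text for text in texts]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


queries = add_prompt(["What was Adobe's revenue?"], "query: ")
passages = add_prompt(["Adobe is a global technology company."], "passage: ")

# Toy 3-dimensional vectors standing in for real model embeddings.
query_vec = [1.0, 0.0, 1.0]
close_vec = [2.0, 0.0, 2.0]  # same direction as the query vector
far_vec = [0.0, 1.0, 0.0]    # orthogonal to the query vector

close_score = cosine_similarity(query_vec, close_vec)
far_score = cosine_similarity(query_vec, far_vec)
```

A model that was not trained with such prefixes would typically need different prompts or none, so these two arguments are worth revisiting when swapping the encoder.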



In [None]:
# Step 4: Perform retrieval
# ---------------------
# Use the model to retrieve relevant documents for given queries.
retrieval_model = DenseRetrieval(
    model=encoder_model
)

retrieval_result = finder_task.retrieve(
    retriever=retrieval_model
)

# Print a portion of the retrieval results to verify the output.
print(f"Retrieved results for {len(retrieval_result)} queries. Here's an example of the top 5 documents for the first query:")

for q_id, result in retrieval_result.items():
    print(f"\nQuery ID: {q_id}")
    # Sort the results to print the top 5 document IDs and their scores
    sorted_results = sorted(result.items(), key=lambda x: x[1], reverse=True)

    for i, (doc_id, score) in enumerate(sorted_results[:5]):
        print(f"  Document {i + 1}: Document ID = {doc_id}, Score = {score}")

    break  # Only show the first query
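
The loop above sorts every scored document for a query just to show five of them. When only the best few are needed, `heapq.nlargest` from the standard library avoids the full sort. This is a minimal sketch that assumes the nested `{query_id: {doc_id: score}}` shape implied by the loop; `top_k_documents` is a hypothetical helper and the IDs and scores are made up.

```python
import heapq
from typing import Dict, List, Tuple


def top_k_documents(scores: Dict[str, float], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Toy data shaped like the nested result dict; not real retrieval output.
example_result = {"q1": {"docA": 0.71, "docB": 0.93, "docC": 0.12, "docD": 0.88}}
top_two = top_k_documents(example_result["q1"], k=2)
```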


In [None]:
# Step 5: Initialize CrossEncoder Reranker
# --------------------------------------
# The CrossEncoder model will be used to rerank the retrieved documents based on relevance.
#
# You can replace 'cross-encoder/ms-marco-MiniLM-L-12-v2' with any other model supported by CrossEncoder.
# For example: 'cross-encoder/ms-marco-TinyBERT-L-2', 'cross-encoder/stsb-roberta-large', etc.
reranker = CrossEncoderReranker(
    model=CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
)


In [None]:
# Step 6: Perform reranking
# -------------------------
# Rerank the top 100 retrieved documents using the CrossEncoder model.
reranking_result = finder_task.rerank(
    reranker=reranker,
    results=retrieval_result,
    top_k=100,  # Rerank the top 100 documents
    batch_size=32
)

# Print a portion of the reranking results to verify the output.
print(f"Reranking results for {len(reranking_result)} queries. Here's an example of the top 5 documents for the first query:")

for q_id, result in reranking_result.items():
    print(f"\nQuery ID: {q_id}")
    # Sort the results to print the top 5 document IDs and their scores
    sorted_results = sorted(result.items(), key=lambda x: x[1], reverse=True)

    for i, (doc_id, score) in enumerate(sorted_results[:5]):
        print(f"  Document {i + 1}: Document ID = {doc_id}, Score = {score}")

    break  # Only show the first query
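
Retrieval and reranking scores generally live on different scales, so comparing the raw numbers from the two stages is rarely meaningful. Comparing rank positions is one way to see what the reranker changed. The sketch below assumes both results are `{doc_id: score}` dicts for a single query; `rank_positions` and `rank_shifts` are hypothetical helpers and the scores are invented for illustration.

```python
from typing import Dict


def rank_positions(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each doc_id to its 1-based rank when sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc_id: rank for rank, doc_id in enumerate(ordered, start=1)}


def rank_shifts(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, int]:
    """Rank change per document; positive means it moved up after reranking."""
    old, new = rank_positions(before), rank_positions(after)
    return {doc_id: old[doc_id] - new[doc_id] for doc_id in new if doc_id in old}


# Made-up scores for a single query; real scores come from the cells above.
retrieval_scores = {"docA": 0.90, "docB": 0.85, "docC": 0.80}
rerank_scores = {"docA": -1.2, "docB": 3.4, "docC": 0.5}
shifts = rank_shifts(retrieval_scores, rerank_scores)
```

In this toy case `docA` falls from first to third place while `docB` and `docC` each move up one position.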


In [None]:
# Step 7: Save results
# -------------------
# Save the results to the specified output directory as a CSV file.
output_dir = './results'
finder_task.save_results(output_dir=output_dir)

# Confirm the results have been saved.
print(f"Results have been saved to {output_dir}/FinDER/results.csv")
