# **Investigating Polarity Manipulation in LLM-Based Query Expansion for Financial Opinion Search**

This notebook presents a systematic investigation of LLM-based Query Expansion in financial information retrieval, with a specific focus on the effects of sentiment polarity manipulation. While large language models have shown strong capabilities in enriching search queries with domain-specific knowledge, the consequences of deliberately steering query expansions toward positive or negative sentiment remain underexplored, particularly in the context of financial opinion retrieval.

We designed a controlled experimental framework that compares a hybrid retrieval baseline (BM25 combined with Bi-Encoder re-ranking) against three sentiment-controlled query expansion strategies: neutral, positive (bullish) and negative (bearish). Query expansions are generated using an instruction-following language model and are cached to ensure reproducibility across experiments.

The central research question addressed in the study is:


  *To what extent does strategically inducing polarity (positive or negative sentiment) in LLM-generated Query Expansion alter ranking behaviour and introduce measurable bias in retrieval metrics within a hybrid financial IR system?*

To answer this question, we quantify the impact of sentiment-controlled expansions using standard information retrieval metrics, including MAP, nDCG, Precision@k and Recall@k. In addition, we explicitly verify the induced polarity of the expansions using a financial sentiment classifier, allowing us to separate the effects of query enrichment from those of sentiment bias. Through this analysis, we assess whether sentiment-oriented expansions improve the retrieval of financial opinions or instead introduce semantic noise that degrades relevance and objectivity.  

## **Import Necessary Libraries**

Importing necessary libraries for data processing, retrieval and model inference.

In [1]:
!pip install python-terrier
!pip install spacy
!python -m spacy download en_core_web_sm
!pip -q install -U transformers accelerate sentencepiece

Collecting python-terrier
  Downloading python_terrier-1.0-py3-none-any.whl.metadata (987 bytes)
Collecting pyterrier>=1.0 (from pyterrier[all]>=1.0->python-terrier)
  Downloading pyterrier-1.0.1-py3-none-any.whl.metadata (7.3 kB)
Collecting ir_datasets>=0.3.2 (from pyterrier>=1.0->pyterrier[all]>=1.0->python-terrier)
  Downloading ir_datasets-0.5.11-py3-none-any.whl.metadata (12 kB)
Collecting deprecated (from pyterrier>=1.0->pyterrier[all]>=1.0->python-terrier)
  Downloading deprecated-1.3.1-py2.py3-none-any.whl.metadata (5.9 kB)
Collecting ir_measures>=0.4.1 (from pyterrier>=1.0->pyterrier[all]>=1.0->python-terrier)
  Downloading ir_measures-0.4.3-py3-none-any.whl.metadata (7.0 kB)
Collecting pytrec_eval_terrier>=0.5.3 (from pyterrier>=1.0->pyterrier[all]>=1.0->python-terrier)
  Downloading pytrec_eval_terrier-0.5.10-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (1.1 kB)
Collecting lz4 (from pyterrier>=1.0->pyterrier[all]>=1.0->python-terrier)
  Downloading lz

In [2]:
import pyterrier as pt
import re
from collections import defaultdict
import textwrap
import numpy as np
import spacy
import os
import shutil
from pyterrier.measures import nDCG, P, R
!pip -q install sentence-transformers
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer
import json
from typing import List
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSequenceClassification, pipeline
import requests
import time

## **Exploratory Data Analysis**

This phase focuses on auditing the FiQA dataset to understand its structure, quality and linguistic nuances before indexing. Our analysis covers five key dimensions:

1. Corpus statistics
2. Query Linguistics
3. Lexical Overlap
4. Query Intent: Opinion Analysis
5. Relevance Sparsity



Before performing any analysis, we load the FiQA dataset and perform a quick inspection of a single document to understand the available metadata and schema.

In [3]:
dataset = pt.get_dataset('irds:beir/fiqa/test')

for doc in dataset.get_corpus_iter():
    print("Document keys:", doc.keys())
    print("Document example:", doc)
    break

[INFO] [starting] building docstore
[INFO] [starting] opening zip file
[INFO] If you have a local copy of https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip, you can symlink it here to avoid downloading it again: /root/.ir_datasets/downloads/17918ed23cd04fb15047f73e6c3bd9d9
[INFO] [starting] https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip
docs_iter:   0%|                                     | 0/57638 [00:01<?, ?doc/s]

beir/fiqa/test documents:   0%|          | 0/57638 [00:00<?, ?it/s]

Document keys: dict_keys(['text', 'docno'])
Document example: {'text': "I'm not saying I don't like the idea of on-the-job training too, but you can't expect the company to do that. Training workers is not their job - they're building software. Perhaps educational systems in the U.S. (or their students) should worry a little about getting marketable skills in exchange for their massive investment in education, rather than getting out with thousands in student debt and then complaining that they aren't qualified to do anything.", 'docno': '3'}


This confirms that the documents are structured as dictionaries and, importantly, reveals that there is no "Title" field, only `text` and `docno`. The sample shows an informal, forum-style tone, typical of the FiQA dataset.

To ensure the reliability of our downstream analysis, we perform a data quality check on a sample to identify missing or empty fields.

In [4]:
fields_to_check = ["text", "docno"]
missing = defaultdict(int)
total_docs = 0

MAX_DOCS = 20000  # sample-based check

for doc in dataset.get_corpus_iter():
    total_docs += 1
    for f in fields_to_check:
        if not doc.get(f) or doc.get(f).strip() == "":
            missing[f] += 1

    if total_docs >= MAX_DOCS:
        break

print("\n--- DATA QUALITY CHECK (sampled) ---")
print(f"Documents checked: {total_docs}")
for field in ["docno", "text"]:
    print(
        f"{field}: missing in {missing[field]} documents "
        f"({missing[field] / total_docs:.2%})"
    )

beir/fiqa/test documents:   0%|          | 0/57638 [00:00<?, ?it/s]


--- DATA QUALITY CHECK (sampled) ---
Documents checked: 20000
docno: missing in 0 documents (0.00%)
text: missing in 17 documents (0.08%)


The results show 0% missing IDs and a negligible percentage of missing text (17 of the 20,000 sampled documents). This confirms the dataset is clean enough for further processing without intensive data cleaning.

### **Corpus statistics**

To support exploratory analysis and corpus characterization, we define a lightweight preprocessing pipeline based on spaCy. Text is lowercased, lemmatized, and cleaned by removing punctuation and standard stopwords. Common financial jargon and abbreviations are normalized (“btc” to “bitcoin”) to ensure lexical consistency.

Tokenization is designed specifically for analytical purposes rather than retrieval. In addition to standard linguistic normalization, we remove a set of high-frequency domain-specific financial terms. This prevents ubiquitous concepts (such as “money”, “investment”, or “market”) from dominating corpus-level statistics.

The resulting tokens are used exclusively for exploratory analyses, including document length estimation, vocabulary size computation, and lexical overlap inspection. All retrieval and ranking experiments rely on the original document representations provided by the IR models and are not affected by this preprocessing step.

In [5]:
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner", "textcat"])

FINANCIAL_STOPWORDS = {
    "market", "stock", "price", "rates", "investment", "investor", "finance",
    "financial", "risk", "return", "value", "assets", "capital", "fund",
    "bank", "bond", "company", "share", "economy", "economic"
}

GERGO_MAP = {
    "feds": "fed",
    "stonks": "stocks",
    "crypto": "cryptocurrency",
    "btc": "bitcoin",
    "eth": "ethereum",
    "nav": "net_asset_value",
    "ev": "enterprise_value",
    "pe": "price_earnings_ratio",
    "qe": "quantitative_easing",
    "ipo": "initial_public_offering"
}

def normalize_jargon(token):
    return GERGO_MAP.get(token, token)

def get_doc_text(doc):
    # The corpus has no title field, so only the text is used
    text = doc.get("text", "") or ""
    return text.strip()

def tokenize_analysis(text):
    if text is None:
        return []

    text = str(text).lower().strip()
    if not text:
        return []

    doc = nlp(text)  # spaCy processing

    tokens = []
    for token in doc:
        if token.is_punct or token.is_space:
            continue
        if token.is_stop:
            continue

        lemma = token.lemma_.strip()
        lemma = normalize_jargon(lemma)

        if len(lemma) <= 2:
            continue
        if lemma in FINANCIAL_STOPWORDS:
            continue

        tokens.append(lemma)

    return tokens

Because the FiQA corpus is heterogeneous, containing documents from diverse sources, we define three categories based on token length:

*   Microblogs: 50 tokens or fewer.
*   News: between 51 and 200 tokens.
*   Reports: more than 200 tokens.

These thresholds help us characterize the distribution of document types in the collection.

In [6]:
SHORT_MAX = 50      # up to 50 tokens -> microblog
MEDIUM_MAX = 200    # 51-200 tokens -> news
# more than 200 tokens -> report

def classify_doc_length(num_tokens):
    if num_tokens <= SHORT_MAX:
        return "microblog"
    elif num_tokens <= MEDIUM_MAX:
        return "news"
    else:
        return "report"

Subsequently, we iterate through every document in the collection to apply our preprocessing pipeline.

Specifically, the goal is to:

1. Count the number of documents.
2. Retrieve the ID of each document.
3. Extract the raw text.
4. Apply the preprocessing pipeline.
5. Update the global statistics initialized at the beginning.
6. Assign each document to a category (microblog, news or report).
7. Save the qualitative examples.




In [7]:
total_docs = 0
total_tokens = 0

type_counts = defaultdict(int)
type_token_sums = defaultdict(int)

vocab = set()   # global vocabulary

examples = {
    "microblog": [],
    "news": [],
    "report": []
}
MAX_EXAMPLES_PER_TYPE = 2

for doc in dataset.get_corpus_iter():
    total_docs += 1

    docno = doc.get("docno", None) or doc.get("doc_id", None)
    text = get_doc_text(doc)
    tokens = tokenize_analysis(text)
    doc_len = len(tokens)

    total_tokens += doc_len
    vocab.update(tokens)

    doc_type = classify_doc_length(doc_len)
    type_counts[doc_type] += 1
    type_token_sums[doc_type] += doc_len

    if len(examples[doc_type]) < MAX_EXAMPLES_PER_TYPE:
        examples[doc_type].append({
            "docno": docno,
            "len": doc_len,
            "text": text[:500]
        })

beir/fiqa/test documents:   0%|          | 0/57638 [00:00<?, ?it/s]

Once the processing is completed, we generate a high-level summary. Crucially, we break down the collection by document type to understand the balance between short social-media-style posts and longer financial reports.

In [8]:
def line():
    print("=" * 70)

def section(title):
    line()
    print(f"{title}".center(70))
    line()

section("DOCUMENT COLLECTION STATISTICS (FinQA test)")

print(f"Total number of documents     : {total_docs:,}")
print(f"Vocabulary size (unique tokens): {len(vocab):,}")

if total_docs > 0:
    avg_len = total_tokens / total_docs
    print(f"Average document length        : {avg_len:.2f} tokens")

section("DOCUMENT TYPES (by heuristic length)")

print(f"{'Type':12s} | {'Count':>8s} | {'Avg Length':>12s}")
print("-" * 40)

for doc_type in ["microblog", "news", "report"]:
    count = type_counts[doc_type]
    avg_len_type = (type_token_sums[doc_type] / count) if count > 0 else 0
    print(f"{doc_type.capitalize():12s} | {count:8d} | {avg_len_type:12.2f}")

             DOCUMENT COLLECTION STATISTICS (FinQA test)              
Total number of documents     : 57,638
Vocabulary size (unique tokens): 86,492
Average document length        : 55.29 tokens
                 DOCUMENT TYPES (by heuristic length)                 
Type         |    Count |   Avg Length
----------------------------------------
Microblog    |    36612 |        28.51
News         |    19710 |        89.42
Report       |     1316 |       288.84


The output reveals that the FiQA dataset is heavily skewed toward what we classified as Microblogs (approximately 64% of the corpus), which have a very short average length of ~28 tokens. While Reports are the minority (only 1,316 documents), they are significantly denser, averaging nearly 289 tokens each. This confirms that the retrieval system must be robust enough to handle both "noisy" short text and detailed technical content.
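The shares quoted above follow directly from the reported counts. The short sketch below recomputes them; the dictionary name `reported_type_counts` is introduced here for illustration only and simply copies the numbers from the table.

```python
# Counts copied from the "DOCUMENT TYPES" table above
reported_type_counts = {"microblog": 36612, "news": 19710, "report": 1316}

total_reported = sum(reported_type_counts.values())

# Share of each document type as a percentage of the whole corpus
type_shares = {
    doc_type: 100.0 * count / total_reported
    for doc_type, count in reported_type_counts.items()
}

for doc_type, share in type_shares.items():
    print(f"{doc_type:10s}: {share:5.1f}%")
```

The three counts add up to the 57,638 documents in the corpus, so no document is left unclassified.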

The final step in this section is to display real examples from the categorized groups.  

In [9]:
section("EXAMPLE 'NOISY' MICROBLOGS (short docs)")

for ex in examples["microblog"]:
    print(f"[docno={ex['docno']} | len={ex['len']} tokens]")
    wrapped = textwrap.fill(ex["text"], width=70)
    print(wrapped + "\n")

section("EXAMPLE LONG REPORTS (long docs)")

for ex in examples["report"]:
    print(f"[docno={ex['docno']} | len={ex['len']} tokens]")
    wrapped = textwrap.fill(ex["text"], width=70)
    print(wrapped + "\n")

               EXAMPLE 'NOISY' MICROBLOGS (short docs)                
[docno=3 | len=29 tokens]
I'm not saying I don't like the idea of on-the-job training too, but
you can't expect the company to do that. Training workers is not their
job - they're building software. Perhaps educational systems in the
U.S. (or their students) should worry a little about getting
marketable skills in exchange for their massive investment in
education, rather than getting out with thousands in student debt and
then complaining that they aren't qualified to do anything.

[docno=31 | len=31 tokens]
So nothing preventing false ratings besides additional scrutiny from
the market/investors, but there are some newer controls in place to
prevent institutions from using them. Under the DFA banks can no
longer solely rely on credit ratings as due diligence to buy a
financial instrument, so that's a plus. The intent being that if
financial institutions do their own leg work then *maybe* they'll
figure out that a 

The examples confirm that documents labelled as Microblogs often contain informal, forum-style posts, whereas Reports contain more structured, explanatory content.

### **Query Linguistics**

We begin by loading the `topics` from the dataset and performing a quick inspection of the first rows to understand the typical phrasing.

In [10]:
topics = dataset.get_topics()
print("First rows of topics:")
print(topics.head())

First rows of topics:
     qid                                              query
0   4641  Where should I park my rainy-day / emergency f...
1   5503  Tax considerations for selling a property belo...
2   7803  Can the Delta be used to calculate the option ...
3   7017                 Basic Algorithmic Trading Strategy
4  10152  What does a high operating margin but a small ...


[INFO] [starting] opening zip file
[INFO] [finished] opening zip file [1ms]
[INFO] [starting] opening zip file
[INFO] [finished] opening zip file [0ms]


The initial sample shows a mix of personal finance questions and highly technical inquiries involving financial metrics.

To quantify the "expert" nature of the queries, we define a list of Financial Jargon and iterate through the topics. For each query, we calculate its token length and check for the presence of these specialized terms. This allows us to separate general questions from those requiring deep domain knowledge.

In [11]:
FINANCIAL_JARGON = {
    "dividend", "yield", "leverage", "derivative", "option", "options",
    "futures", "etf", "ipo", "nav", "spread", "hedge", "hedging",
    "shorting", "short", "long", "equity", "bond", "portfolio",
    "liquidity", "volatility", "margin", "swap", "swaption",
    "mutual_fund", "asset", "valuation", "underlying", "call", "put"
}

query_lengths = []
short_vague_queries = []       # examples of short/vague queries
MAX_SHORT_EXAMPLES = 5

jargon_queries = []            # examples of queries with financial jargon
MAX_JARGON_EXAMPLES = 5

num_queries_with_jargon = 0

SHORT_QUERY_MAX_TOKENS = 3

for _, row in topics.iterrows():
    qid = row["qid"]
    qtext = row["query"]

    # tokenize using the pipeline
    tokens = tokenize_analysis(qtext)
    q_len = len(tokens)
    query_lengths.append(q_len)

    # detect short/vague queries
    if q_len <= SHORT_QUERY_MAX_TOKENS and len(short_vague_queries) < MAX_SHORT_EXAMPLES:
        short_vague_queries.append({
            "qid": qid,
            "len": q_len,
            "query": qtext
        })

    # detect presence of financial jargon
    token_set = set(tokens)
    jargon_in_query = token_set.intersection(FINANCIAL_JARGON)
    if jargon_in_query:
        num_queries_with_jargon += 1
        if len(jargon_queries) < MAX_JARGON_EXAMPLES:
            jargon_queries.append({
                "qid": qid,
                "len": q_len,
                "query": qtext,
                "jargon": sorted(list(jargon_in_query))
            })

num_queries = len(query_lengths)
avg_q_len = float(np.mean(query_lengths)) if num_queries > 0 else 0.0
min_q_len = int(np.min(query_lengths)) if num_queries > 0 else 0
max_q_len = int(np.max(query_lengths)) if num_queries > 0 else 0

percent_with_jargon = (num_queries_with_jargon / num_queries * 100.0) if num_queries > 0 else 0.0

section("QUERY ANALYSIS (FinQA test)")

print(f"Total number of queries        : {num_queries}")
print(f"Average query length (tokens)  : {avg_q_len:.2f}")
print(f"Min query length               : {min_q_len} tokens")
print(f"Max query length               : {max_q_len} tokens")

print()
print(f"Queries containing financial jargon: {num_queries_with_jargon} "
      f"({percent_with_jargon:.2f}% of all queries)")

                     QUERY ANALYSIS (FinQA test)                      
Total number of queries        : 648
Average query length (tokens)  : 5.14
Min query length               : 1 tokens
Max query length               : 13 tokens

Queries containing financial jargon: 106 (16.36% of all queries)


With 648 total queries and an average length of only 5.14 tokens, these queries are concise. Interestingly, over 16% of queries contain specific financial jargon, indicating that the retrieval system must understand specialized terminology to be effective.

Subsequently, we categorize the queries into bins based on their token count. Understanding this distribution is vital because extremely short queries are often underspecified and vague, making them harder for traditional keyword-based search engines to answer accurately.

In [12]:
bins = [0, 1, 3, 5, 10, 20, 100]
bin_labels = ["=0", "1", "2-3", "4-5", "6-10", "11-20", ">20"]
counts = [0] * len(bin_labels)

for L in query_lengths:
    if L == 0:
        counts[0] += 1
    elif L == 1:
        counts[1] += 1
    elif 2 <= L <= 3:
        counts[2] += 1
    elif 4 <= L <= 5:
        counts[3] += 1
    elif 6 <= L <= 10:
        counts[4] += 1
    elif 11 <= L <= 20:
        counts[5] += 1
    else:
        counts[6] += 1

section("QUERY LENGTH DISTRIBUTION (approximate bins)")

print(f"{'Length bin':15s} | {'Count':>8s}")
print("-" * 30)
for label, c in zip(bin_labels, counts):
    print(f"{label:15s} | {c:8d}")

             QUERY LENGTH DISTRIBUTION (approximate bins)             
Length bin      |    Count
------------------------------
=0              |        0
1               |        9
2-3             |      136
4-5             |      250
6-10            |      240
11-20           |       13
>20             |        0


The distribution is concentrated in the 2-10 token range. There are no queries longer than 20 tokens, confirming that users typically provide very little context, placing a heavy burden on the retriever to match the right concepts.
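The binning above is written as an explicit `if`/`elif` chain. As a more compact alternative, the same buckets can be derived from a list of upper bin edges with the standard-library `bisect` module. This is only a sketch: `UPPER_EDGES`, `length_bin` and `bin_length_counts` are names introduced here, not part of the notebook.

```python
from bisect import bisect_left
from collections import Counter

# Inclusive upper edge of every bin except the open-ended last one
UPPER_EDGES = [0, 1, 3, 5, 10, 20]
BIN_LABELS = ["=0", "1", "2-3", "4-5", "6-10", "11-20", ">20"]

def length_bin(n_tokens):
    """Return the label of the bin that contains n_tokens."""
    # bisect_left finds the first edge >= n_tokens; values above the
    # last edge map to index len(UPPER_EDGES), i.e. the ">20" bin.
    return BIN_LABELS[bisect_left(UPPER_EDGES, n_tokens)]

def bin_length_counts(lengths):
    """Count lengths per bin, in the same order as BIN_LABELS."""
    counts = Counter(length_bin(n) for n in lengths)
    return [counts.get(label, 0) for label in BIN_LABELS]

# Toy input covering both boundaries of every bin
toy_lengths = [0, 1, 2, 3, 4, 5, 6, 10, 11, 20, 21]
toy_counts = bin_length_counts(toy_lengths)
for label, c in zip(BIN_LABELS, toy_counts):
    print(f"{label:15s} | {c:8d}")
```

Keeping the edges in one list means that changing a boundary only requires editing `UPPER_EDGES` and the matching label.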

Finally, we display examples of queries that triggered our jargon detection.

In [13]:
section("EXAMPLES OF QUERIES WITH FINANCIAL JARGON")

for ex in jargon_queries:
    print(f"[qid={ex['qid']} | len={ex['len']} tokens | jargon={ex['jargon']}]")
    wrapped = textwrap.fill(ex["query"], width=70)
    print(wrapped + "\n")

              EXAMPLES OF QUERIES WITH FINANCIAL JARGON               
[qid=7803 | len=7 tokens | jargon=['option']]
Can the Delta be used to calculate the option premium given a certain
target?

[qid=10152 | len=7 tokens | jargon=['margin']]
What does a high operating margin but a small but positive ROE imply
about a company?

[qid=10809 | len=4 tokens | jargon=['leverage']]
Definitions of leverage and of leverage factor

[qid=7105 | len=6 tokens | jargon=['equity']]
What is the difference between fixed-income duration and equity
duration?

[qid=6807 | len=4 tokens | jargon=['dividend']]
How to incorporate dividends while calculating annual return of a
Stock



Hence, the FiQA dataset requires a system capable of handling specialized financial terminology rather than just simple keyword matching.

### **Lexical Overlap**

In this section, we analyze the lexical overlap between user queries and their relevant documents. The goal is to show that many relevant documents share only a small portion of query terms, highlighting the vocabulary mismatch problem that affects lexical retrieval models such as BM25.

To perform this comparison efficiently, we first build a fast-lookup dictionary (`doc_store`) to access document content by ID. We then organize our relevance judgements to identify exactly which documents should be retrieved for every given query.

In [14]:
qrels = dataset.get_qrels()
relevant_docnos = set(qrels["docno"].astype(str).unique())

# Build a fast lookup dictionary: docno -> document
doc_store = {}

for doc in dataset.get_corpus_iter():
    # Retrieve document identifier
    docno = str(doc.get("docno") or doc.get("doc_id"))

    # Store document if it is among the relevant ones
    if docno in relevant_docnos:
        doc_store[docno] = doc


# Group relevant documents by query id
qrels_by_qid = defaultdict(list)

for _, row in qrels.iterrows():
    qid = row["qid"]
    docno = str(row["docno"])
    qrels_by_qid[qid].append(docno)

beir/fiqa/test documents:   0%|          | 0/57638 [00:00<?, ?it/s]

For each query-document pair in our ground truth, we calculate the Overlap Score: the percentage of unique query tokens that are present in the relevant document. An overlap of 1.0 means that the document contains all query terms, while 0.0 indicates total mismatch.
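To make the definition concrete, here is a toy version of the score on hand-written token lists. The helper name `overlap_score` and the example tokens are illustrative assumptions; the notebook computes the same ratio inline on the output of `tokenize_analysis`.

```python
def overlap_score(query_tokens, doc_tokens):
    """Fraction of unique query tokens that also occur in the document."""
    q = set(query_tokens)
    if not q:
        return None  # undefined for an empty query
    return len(q & set(doc_tokens)) / len(q)

# Hypothetical, already-preprocessed tokens
toy_query = ["dividend", "tax", "rate"]
toy_doc = ["dividend", "income", "taxed", "bracket"]

toy_overlap = overlap_score(toy_query, toy_doc)
print(f"overlap = {toy_overlap:.3f}")
```

Only "dividend" is shared, so one of three query tokens is covered. Note that "tax" and "taxed" do not match unless lemmatization maps them to the same form, which is the kind of mismatch this analysis is meant to surface.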

In [15]:
# Containers for statistics
overlap_scores = []            # overlap values for all query-document pairs
query_avg_overlap = {}         # average overlap per query
low_overlap_examples = []      # qualitative examples

MAX_EXAMPLES = 5

# Iterate over all queries
for _, row in topics.iterrows():
    qid = row["qid"]
    query_text = row["query"]

    # Tokenize query
    q_tokens = set(tokenize_analysis(query_text))
    if len(q_tokens) == 0:
        continue

    overlaps_for_query = []

    # Iterate over relevant documents for this query
    for docno in qrels_by_qid.get(qid, []):
        doc = doc_store.get(docno)
        if doc is None:
            continue

        # Extract and tokenize document text
        doc_text = get_doc_text(doc)
        d_tokens = set(tokenize_analysis(doc_text))

        # Compute lexical overlap
        overlap = len(q_tokens & d_tokens) / len(q_tokens)
        overlaps_for_query.append(overlap)
        overlap_scores.append(overlap)

    # Store average overlap per query
    if overlaps_for_query:
        avg_overlap = np.mean(overlaps_for_query)
        query_avg_overlap[qid] = avg_overlap

        # Save examples with very low overlap
        if avg_overlap < 0.3 and len(low_overlap_examples) < MAX_EXAMPLES:
            low_overlap_examples.append({
                "qid": qid,
                "query": query_text,
                "avg_overlap": avg_overlap
            })

We summarize the findings to see the "average" difficulty of the retrieval task.

In [16]:
print("\n--- QUERY–DOCUMENT LEXICAL OVERLAP STATISTICS ---")

if overlap_scores:
    print(f"Average overlap : {np.mean(overlap_scores):.3f}")
    print(f"Median overlap  : {np.median(overlap_scores):.3f}")
    print(f"Minimum overlap : {np.min(overlap_scores):.3f}")
    print(f"Maximum overlap : {np.max(overlap_scores):.3f}")
else:
    print("No overlap scores computed.")


--- QUERY–DOCUMENT LEXICAL OVERLAP STATISTICS ---
Average overlap : 0.453
Median overlap  : 0.444
Minimum overlap : 0.000
Maximum overlap : 1.000


These statistics reveal that, even for relevant documents, lexical overlap with the query is limited: the average is only 0.453. In other words, on average more than half of the unique terms in a query are missing from the very documents that answer it. The minimum of 0.0 shows that some relevant documents share no terms with the query after preprocessing, making them "invisible" to purely lexical models like BM25.

To see if the low average is caused by a few outliers or a general trend, we bucket the scores into ranges.

In [17]:
# Define bins for overlap distribution
bins = [0.0, 0.1, 0.25, 0.5, 0.75, 1.0]
bin_counts = defaultdict(int)

for score in overlap_scores:
    for i in range(len(bins) - 1):
        # Half-open interval [lo, hi): a score of exactly 1.0 matches no bin
        if bins[i] <= score < bins[i + 1]:
            label = f"{bins[i]}–{bins[i + 1]}"
            bin_counts[label] += 1
            break

print("\n--- OVERLAP DISTRIBUTION ---")
for label, count in bin_counts.items():
    print(f"{label:10s} : {count}")


--- OVERLAP DISTRIBUTION ---
0.75–1.0   : 158
0.25–0.5   : 481
0.1–0.25   : 181
0.5–0.75   : 543
0.0–0.1    : 197


The distribution confirms a significant challenge: of the 1,560 pairs counted above, 859 (about 55%) have an overlap below 0.5. Because the bins are half-open, pairs with an overlap of exactly 1.0 do not appear in this table. This high density of low-overlap pairs suggests that "semantic search" is likely necessary to achieve high performance on the dataset.
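The bucketing loop above tests `bins[i] <= score < bins[i + 1]`, so a score of exactly 1.0 falls outside every interval. One possible way to keep such pairs is `np.histogram`, whose last bin is closed on the right. The sketch below runs on toy scores only and makes no claim about the counts on the real data; `OVERLAP_EDGES` and `overlap_histogram` are names introduced here.

```python
import numpy as np

OVERLAP_EDGES = [0.0, 0.1, 0.25, 0.5, 0.75, 1.0]

def overlap_histogram(scores, edges=OVERLAP_EDGES):
    """Count scores per bin; the last bin also includes its right edge."""
    counts, _ = np.histogram(scores, bins=edges)
    return {
        f"{lo}-{hi}": int(c)
        for lo, hi, c in zip(edges[:-1], edges[1:], counts)
    }

# Toy scores, including the two boundary cases 0.0 and 1.0
toy_scores = [0.0, 0.05, 0.3, 0.5, 1.0]
toy_hist = overlap_histogram(toy_scores)
for label, count in toy_hist.items():
    print(f"{label:10s} : {count}")
```

Because the result is built from the ordered edge list, the bins are also printed in ascending order rather than in order of first appearance.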

### **Query Intent: Opinion Analysis**

Afterwards, we analyze the queries to understand whether and to what extent they express opinion-oriented, rather than purely factual, information needs. The goal is to demonstrate that a significant portion of the queries concern sentiment, judgments, concerns, and expectations, justifying the non-classical nature of the retrieval task.

Let's start by defining a list of keywords that typically signal the presence of opinions, ratings or perceptions in financial queries, and by initializing the analysis structures.

In [18]:
# List of opinion-oriented cue words commonly used in financial queries
OPINION_CUE_WORDS = {
    "opinion", "view", "views",
    "think", "believe", "expect", "expectation", "expectations",
    "sentiment", "confidence", "fear", "concern", "concerns",
    "risk", "risks", "outlook", "reaction", "reactions",
    "bullish", "bearish", "optimistic", "pessimistic",
    "positive", "negative", "uncertainty", "pressure"
}

num_opinion_queries = 0
opinion_queries_examples = []

MAX_OPINION_EXAMPLES = 5

For each query, we tokenize the text and check whether at least one token belongs to the set of opinion cue words. Then we compute the percentage of queries that present opinion-oriented linguistic signals.

In [19]:
# Iterate over all queries in the dataset
for _, row in topics.iterrows():
    qid = row["qid"]
    query_text = row["query"]

    # Tokenize query using the existing pipeline
    q_tokens = set(tokenize_analysis(query_text))

    # Detect presence of opinion cue words
    matched_cues = q_tokens.intersection(OPINION_CUE_WORDS)

    if matched_cues:
        num_opinion_queries += 1

        # Save a few qualitative examples
        if len(opinion_queries_examples) < MAX_OPINION_EXAMPLES:
            opinion_queries_examples.append({
                "qid": qid,
                "query": query_text,
                "matched_cues": sorted(list(matched_cues))
            })

# percentage of opinion-oriented queries
total_queries = len(topics)

percent_opinion_queries = (
    num_opinion_queries / total_queries * 100
    if total_queries > 0 else 0.0
)

section("QUERY TYPE ANALYSIS: OPINION-ORIENTED QUERIES")

print(f"Total number of queries           : {total_queries}")
print(f"Opinion-oriented queries detected : {num_opinion_queries}")
print(f"Percentage of opinion queries     : {percent_opinion_queries:.2f}%")

            QUERY TYPE ANALYSIS: OPINION-ORIENTED QUERIES             
Total number of queries           : 648
Opinion-oriented queries detected : 9
Percentage of opinion queries     : 1.39%


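Only a small fraction (1.39%) of the queries explicitly contain opinion cue words. This figure is best read as a lower bound, because `tokenize_analysis` removes "risk" as a financial stopword, so the cue words "risk" and "risks" can never match. Even so, it suggests that many opinion-oriented information needs are expressed implicitly rather than through direct subjective markers. As a result, detecting and retrieving financial opinions requires semantic understanding beyond simple keyword matching.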
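Since `tokenize_analysis` lemmatizes tokens and filters out `FINANCIAL_STOPWORDS` (which include "risk"), a complementary check is to match cue words on raw lowercase tokens instead. The sketch below illustrates the idea with a plain regular-expression tokenizer. The helper `match_cues_raw`, the reduced cue set and the first example query are hypothetical, and the snippet says nothing about how the reported percentage would change.

```python
import re

# Small subset of OPINION_CUE_WORDS, repeated here to keep the sketch self-contained
CUE_SUBSET = {"risk", "risks", "concern", "concerns", "bullish", "bearish", "outlook"}

def match_cues_raw(query, cues=CUE_SUBSET):
    """Match cue words on raw lowercase word tokens (no lemmatization, no stopword removal)."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    return sorted(tokens & cues)

print(match_cues_raw("What are the risks of holding bond funds?"))
print(match_cues_raw("Basic Algorithmic Trading Strategy"))
```

Matching on surface forms is cruder than lemma matching, but it is independent of the analysis stopword list, so the two counts can be compared.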

### **Relevance Sparsity**

We start by assessing the total number of query-document associations annotated as relevant. This gives us a first look at the density of the relevance signal relative to the size of the corpus.

In [20]:
# Total number of relevance judgements
total_qrels = len(qrels)

# Number of relevant and non-relevant judgements
relevance_counts = qrels["label"].value_counts().to_dict()

print("\n--- QRELS OVERVIEW ---")
print(f"Total relevance judgements: {total_qrels}")
for label, count in relevance_counts.items():
    print(f"Label {label}: {count} judgements")


--- QRELS OVERVIEW ---
Total relevance judgements: 1706
Label 1: 1706 judgements


All judgments carry label 1: the qrels list only relevant pairs, and any unjudged document is treated as non-relevant, so there is no graded relevance.
Evaluation therefore relies on binary relevance, with metrics such as MAP and nDCG computed with binary gains.
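As a reminder of what these metrics measure under binary relevance, the following sketch computes Average Precision and nDCG@k for a single toy ranking. It uses the textbook definitions with a gain of 1 for relevant documents; the function names and document IDs are illustrative, and the actual experiments rely on the evaluation measures shipped with PyTerrier rather than on this code.

```python
import math

def average_precision(ranked, relevant):
    """Mean of precision@rank taken at the rank of each relevant document."""
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranked, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k):
    """nDCG@k with binary gains: a relevant document at rank r contributes 1/log2(r + 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, docno in enumerate(ranked[:k], start=1)
        if docno in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy ranking: the two relevant documents appear at ranks 2 and 4
toy_ranking = ["d7", "d2", "d9", "d4"]
toy_relevant = {"d2", "d4"}

toy_ap = average_precision(toy_ranking, toy_relevant)
toy_ndcg = ndcg_at_k(toy_ranking, toy_relevant, k=4)
print(f"AP = {toy_ap:.3f}, nDCG@4 = {toy_ndcg:.3f}")
```

MAP is the mean of the per-query AP values. With binary gains, nDCG differs from AP mainly in how steeply it discounts lower ranks.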

For each query, we count how many documents are judged relevant. This allows us to understand whether the queries have many possible answers or whether relevance is concentrated in just a few documents.

In [21]:
# Count relevant documents per query
relevant_docs_per_query = defaultdict(int)

for _, row in qrels.iterrows():
    if row["label"] > 0:
        relevant_docs_per_query[row["qid"]] += 1

# Convert to list for statistics
relevant_counts = list(relevant_docs_per_query.values())

print("\n--- RELEVANT DOCUMENTS PER QUERY ---")
print(f"Number of queries with at least one relevant document: {len(relevant_counts)}")
print(f"Average relevant docs per query: {np.mean(relevant_counts):.2f}")
print(f"Median relevant docs per query: {np.median(relevant_counts):.2f}")
print(f"Min relevant docs per query: {np.min(relevant_counts)}")
print(f"Max relevant docs per query: {np.max(relevant_counts)}")


--- RELEVANT DOCUMENTS PER QUERY ---
Number of queries with at least one relevant document: 648
Average relevant docs per query: 2.63
Median relevant docs per query: 2.00
Min relevant docs per query: 1
Max relevant docs per query: 15


With an average of only 2.63 relevant documents per query, the signal is extremely sparse. In a corpus of 57,638 documents, finding these two or three items is a significant challenge for any retrieval model.
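To put the sparsity in numbers, the sketch below relates the number of positive judgments to the number of possible query-document pairs, using only figures reported earlier in this notebook. The variable names are introduced here for illustration.

```python
# Figures reported in the outputs above
n_queries = 648
n_docs = 57638
n_relevant_pairs = 1706

avg_relevant_per_query = n_relevant_pairs / n_queries

# Fraction of all possible (query, document) pairs that are labelled relevant
relevance_density = n_relevant_pairs / (n_queries * n_docs)

print(f"Average relevant docs per query : {avg_relevant_per_query:.2f}")
print(f"Relevance density               : {relevance_density:.6%}")
```

Fewer than one in ten thousand query-document pairs is labelled relevant, which is why even small ranking errors translate into large metric differences.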

Finally, we bucket the queries based on their number of relevant documents to assess how widespread this sparsity is.

In [22]:
# Bucket queries by number of relevant documents
bins = {
    "1 relevant doc": 0,
    "2-3 relevant docs": 0,
    "4-5 relevant docs": 0,
    ">5 relevant docs": 0
}

for count in relevant_counts:
    if count == 1:
        bins["1 relevant doc"] += 1
    elif 2 <= count <= 3:
        bins["2-3 relevant docs"] += 1
    elif 4 <= count <= 5:
        bins["4-5 relevant docs"] += 1
    else:
        bins[">5 relevant docs"] += 1

print("\n--- RELEVANCE SPARSITY DISTRIBUTION ---")
for k, v in bins.items():
    print(f"{k:20s}: {v}")


--- RELEVANCE SPARSITY DISTRIBUTION ---
1 relevant doc      : 220
2-3 relevant docs   : 287
4-5 relevant docs   : 85
>5 relevant docs    : 56


The distribution shows that 220 queries (about 34%) have exactly one relevant document. This high level of concentration makes the retrieval task "winner takes all" for those queries: missing that single document results in a score of zero for that query.
Such sparsity motivates the use of semantic retrieval and re-ranking strategies, as simple keyword matching is likely to miss these rare relevant items if the vocabulary does not align perfectly.
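
For a query with a single relevant document, Average Precision reduces to the reciprocal of the rank at which that document is retrieved. The sketch below illustrates this on hypothetical rankings; `average_precision` is an illustrative helper, not the evaluation code used in the experiments.

```python
def average_precision(ranked_labels, n_relevant):
    # ranked_labels: 1/0 relevance flags in ranked order
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            total += hits / rank   # precision at this rank
    return total / n_relevant if n_relevant else 0.0

# One relevant document, placed at different ranks
for rank in (1, 2, 5, 10):
    labels = [0] * (rank - 1) + [1]
    print(f"relevant doc at rank {rank:2d} -> AP = {average_precision(labels, 1):.3f}")

# The relevant document is not retrieved at all
print("not retrieved -> AP =", average_precision([0] * 10, 1))
```

This is why small rank changes for single-answer queries can move MAP noticeably.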

## **Baseline Retrieval Experiments**

In this phase, we establish a performance floor by implementing classical lexical retrieval baselines. The objective is to quantify the effectiveness of exact keyword matching on the full FiQA dataset before introducing advanced semantic models.

In particular, we have implemented three mandatory retrieval strategies:


1.   TF-IDF: A standard statistical baseline that weights terms based on their frequency and rarity.
2.   BM25: The industry-standard probabilistic model for lexical retrieval, which addresses term frequency saturation.
3. BM25 + RM3: A classical query expansion technique using Pseudo-Relevance Feedback (PRF) to mitigate the vocabulary mismatch problem.



These baselines are specifically chosen to highlight the limitations of lexical systems.
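
The term frequency saturation mentioned for BM25 can be illustrated with the textbook form of its term-frequency factor. This is a sketch under assumptions: the parameter values `k1=1.2` and `b=0.75` are conventional defaults, the IDF factor is omitted, and Terrier's internal implementation may differ in detail.

```python
def bm25_tf_component(tf, k1=1.2, b=0.75, doc_len=100, avg_doc_len=100):
    # Textbook BM25 term-frequency factor (IDF omitted); bounded above by k1 + 1
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + norm)

# The contribution grows quickly at first, then flattens out
for tf in (1, 2, 5, 20, 100):
    print(f"tf={tf:3d} -> factor={bm25_tf_component(tf):.3f}")
```

A raw term-frequency weight would keep growing linearly with `tf`, whereas this factor approaches `k1 + 1`, so repeating a keyword many times yields diminishing returns.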

To ensure academic validity and avoid "preprocessing bias", we index the raw document text using Terrier's default internal pipeline.

In [23]:
queries = dataset.get_topics()
qrels = dataset.get_qrels()

# Function to clean queries and avoid TerrierQL parser errors
def final_query_cleaner(text):
    # Remove Terrier special characters that trigger syntax errors during parsing
    clean = re.sub(r'[:\-^~*]', ' ', text)
    return re.sub(r'\s+', ' ', clean).strip()

# Prepare the topics for retrieval by applying the cleaner
queries_to_run = queries.copy()
queries_to_run['query'] = queries_to_run['query'].apply(final_query_cleaner)

# Initialize and build the standard Terrier index on raw text
index_path = "./fiqa_index_standard"
if os.path.exists(index_path):
    shutil.rmtree(index_path)

# Indexing the full corpus without manual pre-tokenization
indexer = pt.IterDictIndexer(index_path, overwrite=True, meta={'docno': 20, 'text': 4096})
indexref = indexer.index(dataset.get_corpus_iter())


Java started (triggered by TerrierIndexer.__init__) and loaded: pyterrier.java.colab, pyterrier.java, pyterrier.java.24, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]


beir/fiqa/test documents:   0%|          | 0/57638 [00:00<?, ?it/s]

14:04:57.650 [ForkJoinPool-1-worker-1] WARN org.terrier.structures.indexing.Indexer -- Indexed 39 empty documents


We define the retrieval models and execute a comprehensive experiment using all mandatory evaluation metrics to measure precision, recall and ranking quality.

In [24]:
bm25 = pt.terrier.Retriever(indexref, wmodel="BM25")
tfidf = pt.terrier.Retriever(indexref, wmodel="TF_IDF")

rm3 = bm25 >> pt.rewrite.RM3(indexref) >> bm25

metrics = [
    P@1, P@5, P@10, R@5, R@10, nDCG@5, nDCG@10, "map"
]

results = pt.Experiment(
    [tfidf, bm25, rm3],
    queries_to_run,
    dataset.get_qrels(),
    eval_metrics=metrics,
    names=["TF_IDF", "BM25", "BM25+RM3"],
    verbose=True
)

print("\n--- FINAL STANDARD BASELINES ---")
print(results.sort_values(by="map", ascending=False))

pt.Experiment:   0%|          | 0/3 [00:00<?, ?system/s]


--- FINAL STANDARD BASELINES ---
       name       map       P@1       P@5      P@10       R@5      R@10  \
1      BM25  0.210381  0.236111  0.106481  0.070370  0.247471  0.309708   
0    TF_IDF  0.209956  0.236111  0.108642  0.071142  0.249805  0.313278   
2  BM25+RM3  0.206529  0.220679  0.107099  0.067593  0.243568  0.303302   

     nDCG@5   nDCG@10  
1  0.230060  0.252589  
0  0.231547  0.253659  
2  0.225140  0.245375  


The results indicate that BM25 and TF-IDF perform similarly, achieving a MAP of approximately 0.21. While these are solid lexical scores, the Recall@10 of ~0.31 highlights a critical failure: lexical models miss nearly 70% of relevant documents within the top 10 results.

Furthermore, the slight decrease in performance for RM3 (MAP 0.207) suggests that statistical query expansion introduces "noise" from the social-media-style documents in FiQA, causing query drift. This identifies a clear opportunity for using semantic models to bridge the vocabulary gap more accurately.
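
To illustrate how pseudo-relevance feedback can pull noisy terms into a query, the following is a simplified sketch of the RM3 idea: a relevance model is estimated from the top-ranked documents and interpolated with the original query. The function name, the toy documents and the interpolation weight `lam` are assumptions made for illustration; this is not Terrier's implementation.

```python
from collections import Counter

def rm3_weights(query_terms, feedback_docs, doc_weights, fb_terms=3, lam=0.6):
    # Relevance model: P(w|R) = sum over feedback docs of P(w|d) * weight(d)
    rel_model = Counter()
    total_w = sum(doc_weights)
    for doc, w in zip(feedback_docs, doc_weights):
        for term, count in Counter(doc).items():
            rel_model[term] += (count / len(doc)) * (w / total_w)
    # Keep the fb_terms most probable terms and renormalise them
    top = dict(rel_model.most_common(fb_terms))
    z = sum(top.values())
    top = {t: p / z for t, p in top.items()}
    # Original query model: uniform over the query terms
    q_model = {t: 1 / len(query_terms) for t in query_terms}
    # RM3: interpolate the original query model with the relevance model
    vocab = set(q_model) | set(top)
    return {t: lam * q_model.get(t, 0.0) + (1 - lam) * top.get(t, 0.0) for t in vocab}

# Hypothetical feedback documents written in an informal, forum-like style
docs = [["bond", "yield", "lol", "moon"], ["moon", "lol", "stock"]]
weights = rm3_weights(["bond", "yield"], docs, doc_weights=[2.0, 1.0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:6s} {w:.3f}")
```

In this toy case, the informal tokens that recur across the feedback documents receive non-zero weight in the expanded query, which is the mechanism behind query drift.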

## **Advanced Retrieval Pipeline**

Following the establishment of our lexical baselines, we now transition to the development of our Advanced Baseline. This stage is critical for our research as it builds the retrieval infrastructure required to test LLM-induced bias.

The pipeline consists of a two-stage process:
*   **BM25 candidate retrieval**: we utilize our existing index to retrieve the top k (100) candidate documents.
*   **Bi-encoder semantic reranking**: we implement a sentence-transformer (`sentence-transformers/all-MiniLM-L6-v2`) to re-order these candidates based on semantic similarity.

The bi-encoder is strictly employed as a reranker on the fixed BM25 candidate set, rather than a full dense retriever, ensuring a controlled and manageable experiment environment. This hybrid architecture produces performance scores for the original, unexpanded queries and serves as the primary comparison point for all subsequent QE scenarios.
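
The re-ranking step can be sketched on toy vectors before involving the actual model. The embeddings and document IDs below are hypothetical; in the real pipeline they would come from the bi-encoder and the BM25 candidate list.

```python
import numpy as np

def rerank_by_cosine(query_vec, doc_vecs, doc_ids):
    # L2-normalise both sides, so the dot product equals cosine similarity
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    d = doc_vecs / (np.linalg.norm(doc_vecs, axis=1, keepdims=True) + 1e-12)
    sims = d @ q
    order = np.argsort(-sims)   # highest similarity first
    return [(doc_ids[i], float(sims[i])) for i in order]

# Hypothetical 3-dimensional embeddings for three BM25 candidates
query_vec = np.array([1.0, 0.0, 1.0])
doc_vecs = np.array([
    [0.0, 1.0, 0.0],   # d1: orthogonal to the query
    [2.0, 0.0, 2.0],   # d2: same direction as the query, larger norm
    [1.0, 1.0, 0.0],   # d3: partial overlap
])
ranked = rerank_by_cosine(query_vec, doc_vecs, ["d1", "d2", "d3"])
for doc_id, sim in ranked:
    print(f"{doc_id}: {sim:.3f}")
```

Because the candidate set is fixed, this step can only reorder documents that BM25 already retrieved; it cannot recover relevant documents that fall outside the candidate list.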




In [25]:
# 1) BM25 baseline retriever (candidate generator)
bm25 = pt.terrier.Retriever(indexref, wmodel="BM25")

# 2) Transformer to attach raw text from the index to retrieved results
get_text = pt.text.get_text(indexref, "text")

# 3) Bi-encoder model
bi_encoder_name = "sentence-transformers/all-MiniLM-L6-v2"
biencoder = SentenceTransformer(bi_encoder_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
biencoder = biencoder.to(device)

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # a: [d], b: [n, d]
    a = a / (np.linalg.norm(a) + 1e-12)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    return b @ a

# 4) PyTerrier re-ranker: for each query, re-rank its candidate docs using cosine similarity
def biencoder_rerank(df: pd.DataFrame,
                     top_k: int = 100,
                     batch_size: int = 64) -> pd.DataFrame:
    if df.empty:
        return df

    # Keep only top_k BM25 candidates for efficiency
    df = df.sort_values("score", ascending=False).head(top_k).copy()

    # Query embedding
    query_text = df["query"].iloc[0]
    q_emb = biencoder.encode(query_text, convert_to_numpy=True, device=device)

    # Document embeddings in batches
    docs = df["text"].fillna("").tolist()
    doc_embs = []
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i+batch_size]
        emb = biencoder.encode(batch, convert_to_numpy=True, device=device)
        doc_embs.append(emb)
    doc_embs = np.vstack(doc_embs)

    # Cosine similarity scores
    sims = cosine_sim(q_emb, doc_embs)

    # Replace score with neural score
    df["score"] = sims

    # Recompute rank (PyTerrier expects rank starting from 0)
    df = df.sort_values("score", ascending=False).reset_index(drop=True)
    df["rank"] = np.arange(len(df), dtype=int)

    return df

neural_reranker = pt.apply.by_query(biencoder_rerank)

# 5) Full hybrid pipeline:BM25 -> attach text -> bi-encoder rerank
hybrid_bm25_biencoder = bm25 >> get_text >> neural_reranker



We evaluate the hybrid system with the same metrics used in Phase I, so we can compare fairly and quantify whether semantic re-ranking improves the quality of the top results.

In [26]:
# Required metrics
metrics = [
    P@1, P@5, P@10,
    R@5, R@10,
    nDCG@5, nDCG@10,
    "map"
]

qrels = dataset.get_qrels()

# Run evaluation
results_phase2_e1 = pt.Experiment(
    retr_systems=[
        pt.terrier.Retriever(indexref, wmodel="BM25"),
        hybrid_bm25_biencoder
    ],
    topics=queries_to_run,
    qrels=qrels,
    eval_metrics=metrics,
    names=["BM25 (Phase I ref)", "Hybrid BM25 + Bi-Encoder"],
    verbose=True,
    precompute_prefix=True  # avoids running shared prefixes multiple times
)

display(results_phase2_e1.sort_values(by="map", ascending=False))

  warn('precompute_prefix was True for pt.Experiment, but no common pipeline prefix was found among %d pipelines' % len(retr_systems))


pt.Experiment:   0%|          | 0/2 [00:00<?, ?system/s]

| | name | map | P@1 | P@5 | P@10 | R@5 | R@10 | nDCG@5 | nDCG@10 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Hybrid BM25 + Bi-Encoder | 0.299058 | 0.350309 | 0.160802 | 0.097377 | 0.355704 | 0.423885 | 0.340156 | 0.360562 |
| 0 | BM25 (Phase I ref) | 0.210381 | 0.236111 | 0.106481 | 0.07037 | 0.247471 | 0.309708 | 0.23006 | 0.252589 |


As shown in the results, the BM25 + Bi-Encoder system substantially outperforms the lexical reference on every metric, with MAP rising from 0.210 to 0.299. Moreover, we observe a major increase in P@1 (from 0.236 to 0.350), demonstrating that semantic re-ranking is effective at placing relevant documents at the very top of the list. Even with a fixed candidate set, R@10 improves from 0.310 to 0.424, showing that neural re-ranking elevates relevant documents that lexical matching had ranked below the top 10.
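
The size of these improvements is easier to compare across metrics when expressed as relative gains. The short computation below uses the values reported in the results table; the dictionary names are illustrative.

```python
# Values copied from the results table above
baseline = {"map": 0.210381, "P@1": 0.236111, "R@10": 0.309708, "nDCG@10": 0.252589}
hybrid   = {"map": 0.299058, "P@1": 0.350309, "R@10": 0.423885, "nDCG@10": 0.360562}

# Relative gain of the hybrid system over the BM25 reference
relative_gain = {m: (hybrid[m] - baseline[m]) / baseline[m] for m in baseline}
for metric, gain in relative_gain.items():
    print(f"{metric:8s} {gain:+.1%}")
```

These are descriptive differences only; a per-query paired test would be needed before calling them statistically significant.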

## **Prompt Engineering and LLM Generation**

In this core experimental stage, we design three controlled prompt templates to expand each financial opinion question in a consistent and reproducible manner. The primary goal is to generate domain-relevant expansion terms while systematically manipulating sentiment polarity:

* **Neutral QE**: Expands the query without adding sentiment.
* **Positive (Bullish) QE**: Injects optimistic or growth-oriented financial terminology.
* **Negative (Bearish) QE**: Injects pessimistic or risk-oriented financial terminology.

We have implemented strict constraints to limit the LLM to a fixed number of expansion terms per query. To guarantee the total reproducibility of our retrieval experiments, all generated expansions are saved and reused consistently across all ranking scenarios. This setup allows us to isolate the impact of sentiment bias on the subsequent hybrid retrieval process.


To ensure consistency across different execution environments, we implement an automated synchronization mechanism. This cell checks for the local presence of our pre-generated expansions; if missing, it fetches the authoritative version from our project repository. This prevents redundant LLM API calls and ensures that every run uses the exact same expansion set.

In [27]:
GITHUB_RAW_URL = "https://raw.githubusercontent.com/BeatriceCamera/FinQA_IR_project/refs/heads/main/llm_query_expansions.json"
EXPANSION_FILE = "llm_query_expansions.json"

def download_from_github(url, target_path):
    if not os.path.exists(target_path):
        print("File not found. Downloading from GitHub...")
        try:
            # A timeout prevents the cell from hanging on a stalled connection
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            with open(target_path, "wb") as f:
                f.write(response.content)
            print("Download completed successfully.")
        except Exception as e:
            print(f"Error while downloading: {e}")
    else:
        print(f"File '{target_path}' already present locally.")

download_from_github(GITHUB_RAW_URL, EXPANSION_FILE)

File not found. Downloading from GitHub...
Download completed successfully.
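
One way to confirm that every run uses the exact same expansion set is to compare the file against a pinned checksum. The sketch below is an assumption about how such a check could look; the helper names are hypothetical, and the demo hashes a throwaway file rather than the real cache.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=1 << 20):
    # Stream the file in chunks so large caches do not need to fit in memory
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def cache_matches(path, expected_hex):
    return sha256_of_file(path) == expected_hex.lower()

# Demo on a temporary file (not the real expansion cache)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json")
    with open(demo_path, "wb") as f:
        f.write(b'{"q1": ["term"]}')
    pinned = sha256_of_file(demo_path)          # would be recorded once and stored with the notebook
    unchanged = cache_matches(demo_path, pinned)
    print("cache unchanged:", unchanged)
```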


We define the control variables for the LLM generation task. We set a fixed target of 8 terms per expansion to maintain a balanced query length across all variants.

In [28]:
RUN_QE_GENERATION = False
FORCE_REGEN = False           # True -> overwrite/regen the JSON file

N_TERMS = 8
SAVE_EVERY = 15
MAX_RETRIES = 2
MAX_NEW_TOKENS = 80

We utilize `Qwen2.5-1.5B-Instruct`, a compact instruction-tuned language model that is small enough to run on a single GPU.

In [29]:
MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

gen = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)


Device set to use cuda:0


To minimize variance in our results, we wrap the generation process with fixed decoding parameters. Setting `do_sample=False` makes the model perform greedy decoding, so that for any given prompt the output is deterministic and reproducible on a given hardware and software setup. The `temperature=0.0` argument is passed for explicitness only: it has no effect when sampling is disabled.

In [30]:
def llm_generate(prompt: str, max_new_tokens: int = MAX_NEW_TOKENS) -> str:
    messages = [
        {"role": "system", "content": "You strictly follow instructions and output only what is requested."},
        {"role": "user", "content": prompt},
    ]

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    out = gen(
        text,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        temperature=0.0,
        return_full_text=False
    )
    return out[0]["generated_text"]
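
The difference between greedy decoding and sampling can be shown on a toy next-token distribution. The logits below are hypothetical and no language model is involved; the sketch only illustrates why disabling sampling removes run-to-run variation.

```python
import numpy as np

def greedy_pick(logits):
    # Greedy decoding: always take the highest-scoring token
    return int(np.argmax(logits))

def sample_pick(logits, temperature, rng):
    # Temperature sampling: rescale logits, apply softmax, then draw a token
    z = np.asarray(logits, dtype=float) / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

logits = [2.0, 1.5, 0.3]          # hypothetical scores for three candidate tokens
rng = np.random.default_rng(0)

greedy_runs = [greedy_pick(logits) for _ in range(5)]
sampled_runs = [sample_pick(logits, 1.0, rng) for _ in range(5)]
print("greedy :", greedy_runs)
print("sampled:", sampled_runs)
```

Note that the temperature only enters the sampling branch, which is why it plays no role once sampling is switched off.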

One of the primary challenges in financial QE is lexical drift, where a financial term (for example "parking cash") is misinterpreted by the LLM as a general-purpose concept (for example "parking a car"). We implement a three-part filtering system:
1. `FINANCE_CLUES`: A whitelist of tokens whose presence confirms financial relevance and overrides the blacklist.
2. `BAD_NONFINANCE` and `BAD_NONFINANCE_TOKENS`: Blacklists of phrases and single tokens (Automotive, Household, General) used to detect and discard off-topic drift when no finance clue is present.
3. Truncation checks: Terms cut off by the token limit (ending in a dangling preposition or punctuation mark) are removed.

In [31]:
FINANCE_CLUES = {
    "fund", "funds", "money", "cash", "capital", "investment", "investing",
    "portfolio", "equity", "bond", "stock", "stocks", "market", "interest", "yield",
    "return", "risk", "savings", "account", "bank", "financial", "finance", "economy",
    "asset", "assets", "liquidity", "wealth", "valuation", "tax", "taxes",
    "option", "options", "derivative", "derivatives", "premium", "delta", "volatility",
    "inflation", "rate", "rates", "fed", "treasury", "credit", "debt", "loan", "mortgage",
    "dividend", "earnings", "eps", "pe", "p/e", "etf", "ipo", "bitcoin", "crypto"
}

BAD_NONFINANCE = {
    # Automotive/Physical Parking
    "parked car", "garage", "garages", "valet", "sidewalk",
    "pavement", "traffic", "scooter", "bicycle", "pedestrian", "towing",
    "parking meter", "parking lot", "street parking", "bus", "metro", "train",

    # Household/Physical
    "washing machine", "laundry", "detergent", "soap", "swimming", "swim",
    "chlorine", "lifeguard", "pool party", "aquarium", "water tank", "gas station",

    # Locations/General
    "near work", "near home", "neighborhood", "camera", "cameras",
    "public transit", "grocery", "mall", "store", "gps", "map",

    # Meta language
    "here are", "keywords", "phrases", "original query", "related to", "search for"
}

BAD_NONFINANCE_TOKENS = {
    "parking", "garage", "valet", "towing", "sidewalk", "pavement",
    "laundry", "detergent", "chlorine", "lifeguard", "aquarium",
    "gps", "metro", "bus", "train"
}

TRUNC_ENDINGS = {"in", "to", "with", "and", "of", "for", "on", "at", "linked"}

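def _contains_any(text: str, phrases) -> bool:
    # Whole-word matching: "pe" must not fire inside "pedestrian",
    # and "bus" must not fire inside "business"
    return any(re.search(rf"\b{re.escape(p)}\b", text) for p in phrases)

def has_finance_signal(text: str) -> bool:
    return _contains_any(text.lower(), FINANCE_CLUES)

def is_non_finance_drift(text: str) -> bool:
    low = text.lower()

    # A finance clue overrides the blacklist
    if has_finance_signal(low):
        return False

    # Otherwise, any blacklisted phrase or token marks the term as drift
    return _contains_any(low, BAD_NONFINANCE | BAD_NONFINANCE_TOKENS)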

def is_truncated_term(text: str) -> bool:
    low = text.lower().strip()
    if low.endswith((":", "-", "—", "...")):
        return True
    last = low.split()[-1] if low.split() else ""
    return last in TRUNC_ENDINGS
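
A keyword filter of this kind is sensitive to how clues are matched. The sketch below contrasts substring matching with whole-word matching on a small, hypothetical subset of clues; the `CLUES` set and the test phrases are illustrative only.

```python
import re

CLUES = {"pe", "rate", "tax"}   # small illustrative subset of short clue tokens

def substring_signal(text):
    # Fires whenever a clue appears anywhere inside the text
    low = text.lower()
    return any(c in low for c in CLUES)

def word_signal(text):
    # Fires only when a clue appears as a whole word
    low = text.lower()
    return any(re.search(rf"\b{re.escape(c)}\b", low) for c in CLUES)

for phrase in ["pedestrian crossing", "taxi stand", "tax rate comparison"]:
    print(f"{phrase!r}: substring={substring_signal(phrase)}, word={word_signal(phrase)}")
```

Short clues such as "pe" occur inside many unrelated words, so substring matching tends to treat off-topic phrases as financial, while whole-word matching does not.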

Raw LLM outputs often contain noise such as bullet points, repetitions, or echoing of the original query. The `clean_terms` function applies rigorous post-processing to normalize the text, remove duplicates, and ensure each term meets our strict length and quality requirements (between 1 and 6 words).

In [32]:
def clean_terms(text: str, original_query: str, n_terms: int) -> List[str]:
    lines = [ln.strip() for ln in str(text).splitlines() if ln.strip()]

    oq = original_query.lower()
    oq_norm = re.sub(r"\W+", " ", oq).strip()

    terms = []
    for ln in lines:
        ln = re.sub(r"^[-*•\d\.\)\]]+\s*", "", ln).strip()
        ln = ln.strip('"').strip("'")

        if not ln:
            continue

        ln_low = ln.lower()
        ln_norm = re.sub(r"\W+", " ", ln_low).strip()

        if ln_low == oq or ln_norm == oq_norm:
            continue

        if ln.endswith("?"):
            continue
        if ln_low.startswith(("how ", "what ", "why ", "when ", "where ", "is ", "are ", "can ")):
            continue

        num_words = len(ln.split())
        if not (1 <= num_words <= 6):
            continue

        if len(ln) < 3:
            continue

        if is_non_finance_drift(ln):
            continue
        if is_truncated_term(ln):
            continue

        terms.append(ln)

    seen = set()
    deduped = []
    for t in terms:
        key = t.lower()
        if key not in seen:
            seen.add(key)
            deduped.append(t)

    return deduped[:n_terms]

def is_valid_terms(terms: list, n_terms: int) -> bool:
    if not isinstance(terms, list):
        return False
    if len(terms) != n_terms:
        return False
    for t in terms:
        if not isinstance(t, str) or len(t.strip()) < 3:
            return False
        if not (1 <= len(t.split()) <= 6):
            return False
    return True
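
The list-marker stripping used during cleaning can be examined in isolation. The sketch below applies the same regular expression to a hypothetical raw LLM output; `strip_list_marker` is an illustrative helper that mirrors only the first two normalization steps.

```python
import re

def strip_list_marker(line):
    # Drop leading bullets, digits, dots and brackets, then surrounding quotes
    line = re.sub(r"^[-*•\d\.\)\]]+\s*", "", line).strip()
    return line.strip('"').strip("'")

# Hypothetical raw model output with mixed list formatting
raw_output = """1. Dividend yield analysis
- "Bond ladder strategy"
* Index fund fees
401k rollover options"""

cleaned = [strip_list_marker(ln) for ln in raw_output.splitlines()]
for before, after in zip(raw_output.splitlines(), cleaned):
    print(f"{before!r:32} -> {after!r}")
```

One side effect worth keeping in mind is visible in the last line: because digits are part of the stripped prefix, a term that legitimately begins with a number loses its leading digits.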

To maintain a consistent query length in our retrieval experiments, we must ensure every query has exactly 8 terms. If the LLM generation or our quality filters result in fewer terms, the `pad_with_fallback` function injects domain-consistent terms (standardized by sentiment mode) to fulfill the count. This prevents performance variations caused simply by differing query lengths.

In [33]:
FALLBACK_TERMS = {
    "neutral": [
        "risk management", "asset allocation", "liquidity management",
        "capital preservation", "interest rate risk",
        "portfolio diversification", "tax implications", "expected returns"
    ],
    "positive": [
        "growth investing", "bullish outlook", "upside potential",
        "return maximization", "compound growth",
        "earnings growth", "market optimism", "positive sentiment"
    ],
    "negative": [
        "downside risk", "bearish outlook", "volatility risk",
        "drawdown risk", "recession fears",
        "uncertainty risk", "default risk", "negative sentiment"
    ]
}

def pad_with_fallback(terms: list, mode: str, n_terms: int = N_TERMS) -> list:
    terms = list(terms) if isinstance(terms, list) else []
    seen = {t.lower() for t in terms if isinstance(t, str)}
    for t in FALLBACK_TERMS[mode]:
        if len(terms) >= n_terms:
            break
        if t.lower() not in seen:
            terms.append(t)
            seen.add(t.lower())
    return terms[:n_terms]
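
Since fallback terms are identical across queries, it can be useful to track how much of an expansion they make up. The following self-contained sketch mirrors the padding logic on a shortened, hypothetical fallback list and computes that share; `pad_terms` and `share_fallback` are illustrative names.

```python
def pad_terms(terms, fallback, n_terms):
    # Append fallback terms (case-insensitive de-duplication) until n_terms is reached
    padded = list(terms)
    seen = {t.lower() for t in padded}
    for t in fallback:
        if len(padded) >= n_terms:
            break
        if t.lower() not in seen:
            padded.append(t)
            seen.add(t.lower())
    return padded[:n_terms]

fallback = ["growth investing", "bullish outlook", "upside potential"]
llm_terms = ["High-yield savings", "Bullish outlook"]   # hypothetical LLM output

padded = pad_terms(llm_terms, fallback, n_terms=4)
# Share of the final expansion that was injected by the padding step
share_fallback = (len(padded) - len(llm_terms)) / len(padded)
print(padded, f"fallback share = {share_fallback:.0%}")
```

A high fallback share for a given mode would mean that part of the measured polarity comes from the fixed padding vocabulary rather than from the LLM.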

The prompt is the most critical component for controlling LLM behavior. In this cell, we define a strict template with explicit rule-based constraints (the prompt contains no worked examples, so it is zero-shot rather than few-shot) to prevent common pitfalls such as echoing the question or drifting into non-financial meanings of polysemous words. We also implement a retry mechanism: the system makes up to `MAX_RETRIES` generation attempts per query and mode, returns as soon as an attempt yields the full set of valid terms, and otherwise pads the best attempt with fallback terms, so that every query in the final dataset has a complete expansion.

In [34]:
def build_qe_prompt(query: str, n_terms: int, mode: str) -> str:
    polarity_rules = {
        "neutral": "Neutral tone only.",
        "positive": "Bullish, opportunity-focused language. Emphasize growth and upside.",
        "negative": "Bearish, risk-focused language. Emphasize downside and uncertainty."
    }
    if mode not in polarity_rules:
        raise ValueError(f"Unknown mode: {mode}")

    return f"""
Generate {n_terms} {mode.upper()} query expansion keyword phrases for a financial search engine.

Original query:
{query}

Rules:
- This is STRICTLY a FINANCE / investing / personal finance query expansion task.
- If any word is ambiguous (e.g., "park"), ALWAYS use the FINANCIAL meaning (parking money).
- NEVER interpret queries as car/vehicle/real-world parking, travel, places, neighborhoods, or physical storage.
- Output EXACTLY {n_terms} lines
- One keyword phrase per line (1 to 6 words)
- NO numbering, NO bullets
- NO questions
- DO NOT repeat the original query
- {polarity_rules[mode]}
- Use finance-specific concepts, metrics, entities

Return EXACTLY {n_terms} lines.
""".strip()

def generate_qe_with_retry(query: str, mode: str, n_terms: int = N_TERMS, max_retries: int = MAX_RETRIES) -> List[str]:
    best_terms = []
    for _ in range(max_retries):
        raw = llm_generate(build_qe_prompt(query, n_terms, mode))
        terms = clean_terms(raw, original_query=query, n_terms=n_terms)

        if len(terms) > len(best_terms):
            best_terms = terms

        if len(terms) == n_terms:
            return terms

    return pad_with_fallback(best_terms, mode, n_terms=n_terms)

In [35]:
def load_expansions(path: str = EXPANSION_FILE) -> dict:
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    return {}

def save_expansions(data: dict, path: str = EXPANSION_FILE):
    with open(path, "w") as f:
        json.dump(data, f, indent=2)

Generating expansions for the entire FiQA query set is a resource-intensive task. This orchestration function iterates through the topics, implementing a caching strategy: it only triggers the LLM for missing or invalid entries. To protect against runtime interruptions, we include a checkpointing system that saves progress to disk every `SAVE_EVERY` (15) newly generated queries, allowing the process to resume without losing data.

In [36]:
def build_all_query_expansions(topics_df, n_terms: int = N_TERMS, save_every: int = SAVE_EVERY):
    expansions = load_expansions()
    total = len(topics_df)

    if FORCE_REGEN:
        expansions = {}

    t0 = time.time()
    n_generated = 0

    def count_cached_qids():
        return len([k for k in expansions.keys() if k != "_meta"])

    for _, row in topics_df.iterrows():
        qid = str(row["qid"])
        query = row["query"]

        cached = expansions.get(qid, {})

        if not FORCE_REGEN:
            reuse_ok = (
                isinstance(cached, dict)
                and cached.get("original") == query
                and all(is_valid_terms(cached.get(m, []), n_terms) for m in ["neutral", "positive", "negative"])
            )
            if reuse_ok:
                continue

        expansions[qid] = {
            "original": query,
            "neutral": generate_qe_with_retry(query, "neutral", n_terms),
            "positive": generate_qe_with_retry(query, "positive", n_terms),
            "negative": generate_qe_with_retry(query, "negative", n_terms),
        }

        n_generated += 1

        if n_generated % 5 == 0:
            done = count_cached_qids()
            elapsed = (time.time() - t0) / 60
            print(f"[{done}/{total}] generated | elapsed: {elapsed:.1f} min")

        if n_generated % save_every == 0:
            save_expansions(expansions)
            print("  -> checkpoint saved to disk")

    save_expansions(expansions)
    print("Done. Saved:", EXPANSION_FILE)
    return expansions
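
A checkpoint is only useful if an interruption during the write itself cannot corrupt the file. One common way to guard against that, shown here as a sketch and not as the notebook's own method, is to write to a temporary file and then swap it into place; `save_json_atomic` is a hypothetical helper.

```python
import json
import os
import tempfile

def save_json_atomic(data, path):
    # Write to a temporary file in the same directory, then swap it into place
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
        os.replace(tmp_path, path)   # replaces the target in a single step
    except BaseException:
        # Clean up the partial temporary file and re-raise
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Demo in a throwaway directory
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "checkpoint.json")
save_json_atomic({"q1": {"neutral": ["asset allocation"]}}, demo_path)
with open(demo_path) as f:
    restored = json.load(f)
print(restored)
```

With this pattern, a reader either sees the previous complete checkpoint or the new complete one, never a half-written file.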

This cell controls the main workflow of the Query Expansion stage. By toggling the `RUN_QE_GENERATION` flag, we can switch between the Generation Phase (producing new expansions) and the Evaluation Phase (loading the existing authoritative cache). This modularity is essential for running IR experiments without re-incurring the computational cost of LLM inference.

In [37]:
expansions = load_expansions()

if RUN_QE_GENERATION:
    # Safety: keep a backup copy before overwriting. Copy rather than rename,
    # so the existing cache stays in place and valid entries can be reused.
    if os.path.exists(EXPANSION_FILE):
        backup_path = EXPANSION_FILE.replace(".json", "_OLD.json")
        shutil.copy2(EXPANSION_FILE, backup_path)
        print("Backup created:", backup_path)

    expansions = build_all_query_expansions(topics, n_terms=N_TERMS, save_every=SAVE_EVERY)
    save_expansions(expansions)
else:
    if not expansions:
        raise FileNotFoundError(f"Missing cache file: {EXPANSION_FILE}. Upload it to /content before running.")

print("Loaded expansions (qids):", len([k for k in expansions.keys() if k != "_meta"]))

Loaded expansions (qids): 648


Before proceeding to quantitative evaluation, it is vital to perform a qualitative "sanity check". We define a formatting utility to visualize how a single query has been expanded across the three different sentiment modes.

In [38]:
def line(width: int = 70):
    print("=" * width)

def section(title: str, width: int = 70):
    line(width)
    print(f"{title}".center(width))
    line(width)

def print_qe_example(expansions: dict, qid: str, width: int = 70):
    if qid not in expansions:
        raise KeyError(f"QID '{qid}' not found in expansions.")

    data = expansions[qid]

    section("LLM QUERY EXPANSION EXAMPLE", width)
    print(f"QID     : {qid}")
    print(f"Original: {data.get('original', '')}")

    for mode in ["neutral", "positive", "negative"]:
        line(width)
        print(f"{mode.upper()} EXPANSION".center(width))
        line(width)
        terms = data.get(mode, [])
        for i, t in enumerate(terms, 1):
            print(f"{i:02d}. {t}")
    line(width)

# The disambiguation challenge
print_qe_example(expansions, "10639")

                     LLM QUERY EXPANSION EXAMPLE                      
QID     : 10639
Original: Short term parking of a large inheritance?
                          NEUTRAL EXPANSION                           
01. Investment opportunities with short-term gains
02. Maximizing returns on inherited assets
03. Strategies for managing large inheritances
04. Tax implications of immediate investment decisions
05. Predictive models for successful investments
06. Leveraging market trends in inherited funds
07. Diversification techniques for inherited wealth
08. Retirement planning with inherited capital
                          POSITIVE EXPANSION                          
01. Maximizing short-term investment returns
02. Unlocking potential with quick cash flow
03. Boosting your wealth through timely savings
04. Expanding your portfolio with high-yield opportunities
05. Growing your retirement
06. growth investing
07. bullish outlook
08. upside potential
                          NEGATIVE EXPA

This query exemplifies the successful implementation of semantic guardrails. Despite the metaphorical use of the word "parking", the expansion remains strictly within the Wealth Management domain. The model correctly identifies "parking" as a temporary capital allocation for an inheritance, avoiding any lexical drift toward the automotive domain while maintaining a clear divergence between return optimization (Positive) and capital erosion concerns (Negative).

In [39]:
# technical depth
print_qe_example(expansions, "6252")

                     LLM QUERY EXPANSION EXAMPLE                      
QID     : 6252
Original: Is this mortgage advice good, or is it hooey?
                          NEUTRAL EXPANSION                           
01. Mortgage analysis tools
02. Loan repayment strategies
03. Interest rate comparison
04. Credit score impact study
05. Debt-to-income ratio guide
06. Investment portfolio review
07. Savings account growth tracker
08. Retirement fund evaluation
                          POSITIVE EXPANSION                          
01. Mortgage investment strategies
02. Pros of refinancing options
03. Equity market outlook analysis
04. Yield potential in real estate
05. Home equity loan benefits
06. Rental property profitability study
07. Property tax savings comparison
08. Cash flow enhancement techniques
                          NEGATIVE EXPANSION                          
01. Mortgage scam detection tools
03. Credit score decline indicators
04. Market volatility prediction models
05. Stock

This example highlights the model's ability to interpret colloquial language and translate subjective user intent into actionable financial search terms. By analyzing the informal expression 'hooey' (slang for nonsense), the system correctly identifies the underlying need for skepticism and verification. This is reflected in the Negative expansion, which shifts the focus toward 'scam detection', 'fraud warning signs', and 'credibility ratings'. This suggests that the pipeline can bridge the gap between natural, everyday language and the specialized vocabulary of financial auditing and risk assessment.

In [40]:
# market mechanism
print_qe_example(expansions, "8332")

                     LLM QUERY EXPANSION EXAMPLE                      
QID     : 8332
Original: Why do put option prices go higher when the underlying stock tanks (drops)?
                          NEUTRAL EXPANSION                           
01. Stock market volatility impact analysis
02. Underlying asset performance correlation study
03. Option pricing model sensitivity test
04. Market sentiment influence on price fluctuations
05. Risk premium calculation in financial markets
06. Historical data trend examination of stocks
07. Liquidity risk assessment during downturns
08. Investor psychology effect on investment decisions
                          POSITIVE EXPANSION                          
01. High volatility stocks attract premium
02. Market panic triggers bullish options
03. Underlying decline fuels call premiums
04. Fear of loss drives put demand
05. Stock crash boosts option value
06. Investors seek upside protection
07. Panic selling pushes up strikes
08. Risk-on strategy see

This query illustrates the model's handling of market dynamics and inverse sentiment. The model correctly interprets the financial slang "tanks" as a sharp price decline. Notably, the Positive expansion includes terms that treat a falling price as favorable for a put option holder ("*Stock crash boosts option value*"), which suggests that the bias induction is sensitive to context and investment logic rather than relying on simple word association. Not every generated term is equally coherent, however (for example, "*Underlying decline fuels call premiums*").

## **Bias Verification (sentiment analysis)**

After generating the query expansions, it is essential to verify that the intentional steering of sentiment has been successful. Simply prompting an LLM to be "bullish" or "bearish" does not guarantee that the resulting terms carry the intended emotional weight.

In this stage, we employ FinBERT, a pre-trained NLP model based on the BERT architecture and fine-tuned on financial corpora (Financial PhraseBank). We use FinBERT to perform a quantitative sentiment analysis on every generated expansion term. By calculating a Polarity Index (defined as the difference between the positive and negative confidence scores), we can measure the magnitude of the induced bias. This verification serves as a "Sanity Check" to ensure that our experimental variables (Neutral vs Positive vs Negative expansions) are measurably distinct before they are fed into the retrieval pipeline.

We load the ProsusAI/finbert model and establish a high-performance classification pipeline. Unlike general-purpose sentiment models, FinBERT understands the nuance of financial language (such as distinguishing between a "market tanking" and a physical "tank"). We also implement a defensive label mapping to ensure consistency in scoring regardless of the model's output formatting.

In [41]:
# Load a finance-domain sentiment model (FinBERT)
SENT_MODEL = "ProsusAI/finbert"

sent_tokenizer = AutoTokenizer.from_pretrained(SENT_MODEL)
sent_model = AutoModelForSequenceClassification.from_pretrained(SENT_MODEL)

sentiment_pipe = pipeline(
    "text-classification",
    model=sent_model,
    tokenizer=sent_tokenizer,
    device=0 if torch.cuda.is_available() else -1,
    top_k=None,
    truncation=True
)

# Normalize label variants defensively
LABEL_MAP = {
    "positive": "positive", "Positive": "positive", "POSITIVE": "positive",
    "negative": "negative", "Negative": "negative", "NEGATIVE": "negative",
    "neutral":  "neutral",  "Neutral":  "neutral",  "NEUTRAL":  "neutral",
}

Device set to use cuda:0


To move from categorical labels to a continuous scale, we define a Polarity Index. This scalar value allows us to map the expansions onto a spectrum where +1 represents absolute bullishness and -1 represents absolute bearishness. The following functions handle batch processing of expansion terms to maximize throughput while maintaining robust error handling for empty or malformed strings.
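
As a small worked example of the Polarity Index, the sketch below applies the same "positive minus negative" rule to three score dictionaries. The scores and term descriptions are made up for illustration and are not FinBERT outputs from this notebook.

```python
def polarity_index(scores: dict) -> float:
    """Positive-class probability minus negative-class probability."""
    return float(scores.get("positive", 0.0) - scores.get("negative", 0.0))

# Hypothetical FinBERT-style probabilities (invented values; each row sums to 1).
examples = {
    "bullish-looking term": {"positive": 0.80, "negative": 0.05, "neutral": 0.15},
    "bearish-looking term": {"positive": 0.10, "negative": 0.70, "neutral": 0.20},
    "neutral-looking term": {"positive": 0.05, "negative": 0.05, "neutral": 0.90},
}

for name, scores in examples.items():
    print(f"{name:22s} polarity = {polarity_index(scores):+.2f}")
```

Note that a term classified as mostly neutral lands near zero regardless of how the small remaining probability is split, so averages over many short terms tend to stay well inside the [-1, +1] range.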

In [42]:
def polarity_index(score_dict: dict) -> float:
    """Scalar polarity: positive minus negative."""
    return float(score_dict.get("positive", 0.0) - score_dict.get("negative", 0.0))

def finbert_scores_batch(texts, batch_size: int = 32):
    """
    Run FinBERT on a list of short texts (query expansion terms).
    Returns: list[dict] with keys {positive, negative, neutral, polarity}
    """
    if texts is None:
        return []
    if isinstance(texts, str):
        texts = [texts]

    texts = [("" if t is None else str(t).strip()) for t in texts]
    if len(texts) == 0:
        return []

    outs = sentiment_pipe(texts, batch_size=batch_size)

    scored = []
    for out in outs:
        s = {"positive": 0.0, "negative": 0.0, "neutral": 0.0}
        for x in out:
            lab = LABEL_MAP.get(x.get("label", ""), None)
            if lab is not None:
                s[lab] = float(x.get("score", 0.0))
        s["polarity"] = polarity_index(s)
        scored.append(s)

    return scored

def score_terms(terms, batch_size: int = 32) -> pd.DataFrame:
    """
    Score each expansion term with FinBERT and return a dataframe:
    columns = ["text","pos","neg","neu","polarity"]
    """
    if not terms:
        return pd.DataFrame(columns=["text", "pos", "neg", "neu", "polarity"])

    scores = finbert_scores_batch(terms, batch_size=batch_size)
    rows = []
    for t, s in zip(terms, scores):
        rows.append({
            "text": t,
            "pos": s["positive"],
            "neg": s["negative"],
            "neu": s["neutral"],
            "polarity": s["polarity"]
        })
    return pd.DataFrame(rows)

In this final verification cell, we iterate through the entire expansion dataset and aggregate scores at the query level. We calculate the mean and median polarity for each mode (Neutral, Positive, Negative).

The most important part of this cell is the Sanity Check: for each query, we check whether the relationship `Positive > Neutral > Negative` holds for the mean polarity. A high pass rate indicates that our "sentiment steering" is not just noise, but a systematic shift in the semantic orientation of the search terms. The resulting summary table provides a high-level view of the "Sentiment Gap" between the three modes.
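
The ordering check can be pictured on a toy example. The sketch below uses plain dictionaries with invented per-query means (the query IDs and values are hypothetical); it mirrors the idea that queries lacking a score in any of the three modes are dropped before the comparison.

```python
# Hypothetical per-query mean polarity by mode (values invented for illustration).
mean_polarity = {
    "q1": {"positive": 0.20, "neutral": 0.01, "negative": -0.30},  # ordering holds
    "q2": {"positive": 0.05, "neutral": 0.08, "negative": -0.10},  # positive < neutral
    "q3": {"positive": 0.15, "neutral": -0.02},                    # negative mode missing
}

MODES = ("positive", "neutral", "negative")

def ordering_holds(row: dict) -> bool:
    """True when positive > neutral > negative for one query."""
    return row["positive"] > row["neutral"] > row["negative"]

# Keep only queries that have a score in all three modes.
complete = {qid: row for qid, row in mean_polarity.items() if all(m in row for m in MODES)}
passed = [qid for qid, row in complete.items() if ordering_holds(row)]
share = len(passed) / len(complete) if complete else float("nan")

print(f"Ordering holds for {len(passed)}/{len(complete)} complete queries ({share:.1%}).")
```

Because incomplete queries are excluded, the denominator of the pass rate can be smaller than the number of queries scored in any single mode.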

In [43]:
BATCH_SIZE = 32
results = []
term_level_rows = []

for _, row in topics.iterrows():
    qid = str(row["qid"])
    query = row["query"]

    if qid not in expansions:
        continue

    for mode in ["neutral", "positive", "negative"]:
        terms = expansions[qid].get(mode, [])
        df_terms = score_terms(terms, batch_size=BATCH_SIZE)

        if df_terms.empty:
            continue

        results.append({
            "qid": qid,
            "query": query,
            "mode": mode,
            "n_terms_scored": int(len(df_terms)),
            "mean_pos": float(df_terms["pos"].mean()),
            "mean_neg": float(df_terms["neg"].mean()),
            "mean_neu": float(df_terms["neu"].mean()),
            "mean_polarity": float(df_terms["polarity"].mean()),
            "median_polarity": float(df_terms["polarity"].median()),
        })

        for _, r in df_terms.iterrows():
            term_level_rows.append({
                "qid": qid,
                "mode": mode,
                "term": r["text"],
                "pos": float(r["pos"]),
                "neg": float(r["neg"]),
                "neu": float(r["neu"]),
                "polarity": float(r["polarity"]),
            })

bias_df = pd.DataFrame(results)
term_level_df = pd.DataFrame(term_level_rows)

print("BIAS VERIFICATION SUMMARY (FinBERT on cached expansion terms)")

if bias_df.empty:
    print("No scores computed (bias_df is empty). Check that expansions are present and non-empty.")
else:
    summary = bias_df.groupby("mode").agg(
        queries_scored=("qid", "count"),
        avg_n_terms=("n_terms_scored", "mean"),
        avg_mean_pos=("mean_pos", "mean"),
        avg_mean_neg=("mean_neg", "mean"),
        avg_mean_neu=("mean_neu", "mean"),
        avg_mean_polarity=("mean_polarity", "mean"),
        avg_median_polarity=("median_polarity", "mean"),
    ).reset_index()

    display(summary.sort_values("avg_mean_polarity", ascending=False))

    # Sanity check: for each qid, is mean_polarity(positive) > mean_polarity(neutral) > mean_polarity(negative)?
    pivot = bias_df.pivot_table(index="qid", columns="mode", values="mean_polarity", aggfunc="mean")
    expected = pivot.dropna().copy()

    if len(expected) > 0 and all(m in expected.columns for m in ["positive", "neutral", "negative"]):
        ok = (expected["positive"] > expected["neutral"]) & (expected["neutral"] > expected["negative"])
        print(f"\nSanity check (pos > neu > neg) holds for {ok.sum()}/{len(ok)} queries "
              f"({(ok.mean()*100):.1f}%).")
    else:
        print("\nSanity check skipped: not enough data for all three modes per query.")

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


BIAS VERIFICATION SUMMARY (FinBERT on cached expansion terms)


| | mode | queries_scored | avg_n_terms | avg_mean_pos | avg_mean_neg | avg_mean_neu | avg_mean_polarity | avg_median_polarity |
|---|---|---|---|---|---|---|---|---|
| 2 | positive | 639 | 7.674491 | 0.164352 | 0.051094 | 0.784554 | 0.113258 | 0.080602 |
| 1 | neutral | 635 | 7.845669 | 0.068208 | 0.070054 | 0.861738 | -0.001846 | -0.000051 |
| 0 | negative | 640 | 7.851562 | 0.060081 | 0.289059 | 0.650861 | -0.228978 | -0.183533 |



Sanity check (pos > neu > neg) holds for 493/625 queries (78.9%).


The table summarizes FinBERT sentiment statistics for the LLM-generated query expansions. On average, positive expansions have a positive polarity (0.113), neutral expansions sit close to zero (-0.002), and negative expansions show the largest shift (-0.229), so the steering moves polarity in the intended direction. The neutral class still dominates in every mode (mean neutral probability between 0.65 and 0.86), which keeps the absolute polarity values modest. The sanity check shows that the intended ordering (positive > neutral > negative) holds for 493 of the 625 queries that have scores in all three modes (78.9%). The remaining queries, roughly one in five, deviate from it, plausibly because of semantic ambiguity and classifier noise.
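
To make the size of the induced shift concrete, the following sketch does simple arithmetic on the `avg_mean_polarity` values reported in the summary table. The variable names are ours, and the calculation is a reading aid rather than part of the original pipeline.

```python
# Average mean polarity per mode, copied from the summary table above.
avg_mean_polarity = {"positive": 0.113258, "neutral": -0.001846, "negative": -0.228978}

# Distance of each steered mode from the neutral condition.
up_shift = avg_mean_polarity["positive"] - avg_mean_polarity["neutral"]
down_shift = avg_mean_polarity["neutral"] - avg_mean_polarity["negative"]
total_gap = avg_mean_polarity["positive"] - avg_mean_polarity["negative"]

print(f"positive - neutral : {up_shift:.3f}")
print(f"neutral - negative : {down_shift:.3f}")
print(f"positive - negative: {total_gap:.3f}")
print(f"down/up ratio      : {down_shift / up_shift:.2f}")
```

On these averages, negative steering moves polarity roughly twice as far from neutral as positive steering does, so the two steered conditions are not equally "strong" treatments according to FinBERT.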

## **Full Hybrid Retrieval Pipeline Implementation**

This section implements the complete hybrid retrieval pipeline used in our experiments. The pipeline combines traditional lexical retrieval with neural semantic re-ranking and integrates LLM-based query expansion in a strictly controlled and reproducible manner.

Query expansion terms are generated offline and cached to disk. During execution, the system first checks whether expansions already exist for each query and only triggers batched LLM generation when entries are missing or inconsistent. This design ensures computational efficiency, reproducibility, and strict separation between generation and evaluation phases.

For retrieval, each query variant (Original, Neutral QE, Positive QE, Negative QE) is processed through the same hybrid architecture: BM25 is used to retrieve an initial candidate set, which is then re-ranked using a Bi-Encoder model. Expanded queries are constructed by appending a fixed number of cached expansion terms to the original query, followed by a final normalization step to ensure consistent lexical processing.
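
The construction of an expanded query can be sketched in isolation. In the snippet below, `simple_cleaner` is a hypothetical stand-in for the notebook's `final_query_cleaner` (whose exact behaviour is defined elsewhere), and `expand_query` is an illustrative helper, not a function from the pipeline. The query and terms are taken from the QID 8332 example.

```python
import re

def simple_cleaner(text: str) -> str:
    """Hypothetical normalizer: lowercase, replace punctuation, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def expand_query(query: str, terms: list[str], n_terms: int = 8) -> str:
    """Append up to `n_terms` cached terms to the query, then normalize the result."""
    kept = terms[:n_terms]
    if not kept:
        # No usable expansion: fall back to the cleaned original query.
        return simple_cleaner(query)
    return simple_cleaner(query + " " + " ".join(kept))

query = "Why do put option prices go higher when the underlying stock tanks (drops)?"
neutral_terms = [
    "Stock market volatility impact analysis",
    "Option pricing model sensitivity test",
]

print(expand_query(query, neutral_terms, n_terms=2))
```

With eight multi-word terms appended, the expansion text can easily outnumber the tokens of the original query, which is worth keeping in mind when reading the retrieval results.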

Evaluation is performed independently for each query variant using standard IR metrics (MAP, Precision@k, Recall@k, and nDCG@k). This controlled setup enables a direct comparison of how sentiment-driven query expansion alters ranking behavior while holding the retrieval architecture constant.
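
For readers less familiar with these metrics, the sketch below implements textbook binary-relevance versions of Precision@k, Recall@k, Average Precision, and nDCG@k for a single query. It is an illustration only: the notebook relies on PyTerrier's measures, which may handle graded relevance and edge cases differently, and the document IDs are invented.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant) if relevant else 0.0

def average_precision(ranked, relevant):
    """Sum of precision at each relevant hit, divided by the number of relevant docs."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k):
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

# Hypothetical ranking and relevance judgments for one query.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}

print("P@5    :", precision_at_k(ranked, relevant, 5))
print("R@5    :", round(recall_at_k(ranked, relevant, 5), 4))
print("AP     :", round(average_precision(ranked, relevant), 4))
print("nDCG@5 :", round(ndcg_at_k(ranked, relevant, 5), 4))
```

MAP is then the mean of the per-query Average Precision values over all evaluated queries.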

In [44]:
# Config
N_TERMS = 8
MAX_QUERIES = None
BATCH_SIZE = 8
MAX_NEW_TOKENS = 80

if not pt.java.started():
    pt.init()

# batched generation only to fill missing cache
def _to_chat_input(prompt: str) -> str:
    msgs = [
        {"role": "system", "content": "You strictly follow instructions and output only what is requested."},
        {"role": "user", "content": prompt},
    ]
    return tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)

@torch.inference_mode()
def generate_batch(prompts: list[str], max_new_tokens: int = MAX_NEW_TOKENS) -> list[str]:
    chat_texts = [_to_chat_input(p) for p in prompts]
    enc = tokenizer(
        chat_texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
    ).to(device)

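    # Greedy decoding: with do_sample=False the temperature is not used, so it is omitted.
    gen_ids = model.generate(
        **enc,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )

    # Keep only the newly generated tokens (this slice assumes the tokenizer pads on the left).
    new_ids = gen_ids[:, enc["input_ids"].shape[1]:]
    return tokenizer.batch_decode(new_ids, skip_special_tokens=True)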

def fill_missing_expansions_batched(
    topics_df: pd.DataFrame,
    expansions: dict,
    n_terms: int = N_TERMS,
    batch_size: int = BATCH_SIZE,
    save_every: int = 10
) -> dict:
    rows = topics_df[["qid", "query"]].copy()
    rows["qid"] = rows["qid"].astype(str)

    # Decide which qids need generation
    to_gen = []
    for _, r in rows.iterrows():
        qid = r["qid"]
        q = r["query"]
        if (qid not in expansions) or (expansions[qid].get("original") != q):
            to_gen.append((qid, q))

    if not to_gen:
        return expansions

    # Prepare prompts
    qids = [x[0] for x in to_gen]
    qs   = [x[1] for x in to_gen]

    neu_prompts = [build_qe_prompt(q, n_terms=n_terms, mode="neutral") for q in qs]
    pos_prompts = [build_qe_prompt(q, n_terms=n_terms, mode="positive") for q in qs]
    neg_prompts = [build_qe_prompt(q, n_terms=n_terms, mode="negative") for q in qs]

    newly_saved = 0

    # Generate in batches (3 passes per batch)
    for start in range(0, len(qs), batch_size):
        neu_raw = generate_batch(neu_prompts[start:start + batch_size])
        pos_raw = generate_batch(pos_prompts[start:start + batch_size])
        neg_raw = generate_batch(neg_prompts[start:start + batch_size])

        for i in range(len(neu_raw)):
            idx = start + i
            qid = qids[idx]
            q   = qs[idx]

            neu_terms = clean_terms(neu_raw[i], original_query=q, n_terms=n_terms)
            pos_terms = clean_terms(pos_raw[i], original_query=q, n_terms=n_terms)
            neg_terms = clean_terms(neg_raw[i], original_query=q, n_terms=n_terms)

            # Cache even if slightly short
            expansions[qid] = {
                "original": q,
                "neutral": neu_terms,
                "positive": pos_terms,
                "negative": neg_terms,
            }

            newly_saved += 1
            if newly_saved % save_every == 0:
                save_expansions(expansions)

    save_expansions(expansions)
    return expansions


# Build topic variants from cached expansions
def build_topic_variants_from_cache(
    topics_df: pd.DataFrame,
    expansions: dict,
    n_terms: int = N_TERMS,
    max_queries: int | None = MAX_QUERIES
):
    base = topics_df[["qid", "query"]].copy()
    base["qid"] = base["qid"].astype(str)

    if max_queries is not None:
        base = base.head(max_queries)

    base = base.reset_index(drop=True)

    def _expand(qid: str, q: str, mode: str) -> str:
        data = expansions.get(qid, None)
        if not data or data.get("original") != q:
            return final_query_cleaner(q)

        terms = (data.get(mode, []) or [])[:n_terms]
        if len(terms) == 0:
            return final_query_cleaner(q)

        expanded = q + " " + " ".join(terms)
        return final_query_cleaner(expanded)

    qids = base["qid"].tolist()
    qs   = base["query"].tolist()

    topics_original = base.copy()
    topics_original["query"] = topics_original["query"].apply(final_query_cleaner)

    topics_neutral  = pd.DataFrame({"qid": qids, "query": [_expand(qid, q, "neutral")  for qid, q in zip(qids, qs)]})
    topics_positive = pd.DataFrame({"qid": qids, "query": [_expand(qid, q, "positive") for qid, q in zip(qids, qs)]})
    topics_negative = pd.DataFrame({"qid": qids, "query": [_expand(qid, q, "negative") for qid, q in zip(qids, qs)]})

    return topics_original, topics_neutral, topics_positive, topics_negative

# Fill missing cache first
model.eval()
device = next(model.parameters()).device

expansions = fill_missing_expansions_batched(queries_to_run, expansions, n_terms=N_TERMS, batch_size=BATCH_SIZE, save_every=10)

# build topic variants from the (possibly updated) cache
topics_original, topics_neutral, topics_positive, topics_negative = build_topic_variants_from_cache(
    queries_to_run, expansions, n_terms=N_TERMS, max_queries=MAX_QUERIES
)

# Ensure qrels qid dtype matches topics
qrels_use = qrels.copy()
qrels_use["qid"] = qrels_use["qid"].astype(str)

This code block was originally used to download the generated `llm_query_expansions.json` file after forcing query expansion generation (`FORCE_REGEN = True`). In that setting, the JSON file was created locally by the LLM and then exported for reuse and reproducibility.

In the final setup, query expansions are precomputed and stored in the repository, and the pipeline reuses the existing JSON file without regeneration. Therefore, the download command is commented out, as no file export is required during normal execution.

In [45]:
# from google.colab import files
# files.download("/content/llm_query_expansions.json")

## **Strategic Impact Analysis**

In this section, we analyze how sentiment polarity introduced through LLM-based query expansion affects the behaviour of a fixed hybrid retrieval system. By keeping the retrieval pipeline unchanged and varying only the expanded query text, we isolate the strategic impact of sentiment manipulation on ranking quality. We compare four query variants:
*   original queries (no expansion)
*   neutral LLM-based query expansion
*   positive LLM-based query expansion
*   negative LLM-based query expansion


In [46]:
metrics = [
    P@1, P@5, P@10,
    R@5, R@10,
    nDCG@5, nDCG@10,
    "map"
]

if not pt.java.started():
    pt.init()

# helper function
def run_variant(name: str, topics_df: pd.DataFrame) -> pd.DataFrame:
    """
    Run the hybrid BM25 + Bi-Encoder retriever on a given query variant.
    Returns a DataFrame with evaluation metrics.
    """
    return pt.Experiment(
        retr_systems=[hybrid_bm25_biencoder],
        topics=topics_df,
        qrels=qrels_use,
        eval_metrics=metrics,
        names=[name],
        verbose=True,
        perquery=False
    )

# Run experiments for all query variants
results_original = run_variant("Original", topics_original)
results_neutral  = run_variant("Neutral QE", topics_neutral)
results_positive = run_variant("Positive QE", topics_positive)
results_negative = run_variant("Negative QE", topics_negative)

# Concatenate results into a single table
impact_results = pd.concat(
    [results_original, results_neutral, results_positive, results_negative],
    axis=0
).reset_index(drop=True)

display(impact_results)

| | name | map | P@1 | P@5 | P@10 | R@5 | R@10 | nDCG@5 | nDCG@10 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Original | 0.299058 | 0.350309 | 0.160802 | 0.097377 | 0.355704 | 0.423885 | 0.340156 | 0.360562 |
| 1 | Neutral QE | 0.222501 | 0.263889 | 0.114506 | 0.069136 | 0.268204 | 0.314921 | 0.254743 | 0.27003 |
| 2 | Positive QE | 0.185194 | 0.21142 | 0.098148 | 0.060185 | 0.23302 | 0.277325 | 0.213013 | 0.228209 |
| 3 | Negative QE | 0.213184 | 0.246914 | 0.110802 | 0.067593 | 0.249132 | 0.298835 | 0.240125 | 0.255508 |


The baseline (Original) outperforms all expansion strategies on every reported metric, and the ordering is the same across metrics: *Original > Neutral > Negative > Positive*. MAP, for example, falls from 0.299 (Original) to 0.223 (Neutral QE), 0.213 (Negative QE), and 0.185 (Positive QE). The drop suggests that although the LLM-generated terms are domain-relevant, they introduce a 'semantic dilution' that shifts the focus away from the concise intent of the original financial queries. Among the expanded variants, Neutral QE degrades the least and Positive QE the most, suggesting that bullish terminology is particularly prone to introducing noise in this specific corpus.

To explicitly highlight the strategic impact of sentiment, we compute performance deltas between selected pairs of conditions.

In [47]:
# Set system name as index for easier comparison
impact_idx = impact_results.set_index("name")

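# Define pairwise comparisons of interest.
# The first three deltas are "second-named condition minus first-named condition";
# the last one is Positive QE minus Negative QE.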
comparisons = {
    "Original vs Neutral QE": impact_idx.loc["Neutral QE"] - impact_idx.loc["Original"],
    "Neutral vs Positive QE": impact_idx.loc["Positive QE"] - impact_idx.loc["Neutral QE"],
    "Neutral vs Negative QE": impact_idx.loc["Negative QE"] - impact_idx.loc["Neutral QE"],
    "Positive vs Negative QE": impact_idx.loc["Positive QE"] - impact_idx.loc["Negative QE"],
}

# Convert to DataFrame for visualization
delta_df = pd.DataFrame(comparisons).T

# Display deltas (a positive value means the left operand of the subtraction scores higher)
display(delta_df)

| | map | P@1 | P@5 | P@10 | R@5 | R@10 | nDCG@5 | nDCG@10 |
|---|---|---|---|---|---|---|---|---|
| Original vs Neutral QE | -0.076558 | -0.08642 | -0.046296 | -0.028241 | -0.0875 | -0.108964 | -0.085413 | -0.090532 |
| Neutral vs Positive QE | -0.037307 | -0.052469 | -0.016358 | -0.008951 | -0.035183 | -0.037596 | -0.04173 | -0.041821 |
| Neutral vs Negative QE | -0.009316 | -0.016975 | -0.003704 | -0.001543 | -0.019072 | -0.016087 | -0.014618 | -0.014522 |
| Positive vs Negative QE | -0.02799 | -0.035494 | -0.012654 | -0.007407 | -0.016111 | -0.021509 | -0.027112 | -0.0273 |


Pairwise deltas confirm a performance penalty across all expansion scenarios. They also reveal a polarity asymmetry: relative to Neutral QE, expansion with negative (bearish) terms loses less (MAP -0.009) than expansion with positive (bullish) terms (MAP -0.037). These deltas are aggregate differences and no significance test is applied here, so the gap should be read as indicative. One possible explanation is that the vocabulary of financial risk and distress is better aligned with the technical nature of the FiQA dataset than the more generic language of financial optimism.

These findings indicate that inducing sentiment doesn't just add noise, but strategically alters the ranking behavior, with 'fear-based' language being less disruptive than 'opportunity-based' language.
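
Absolute deltas can be hard to compare across metrics with different scales, so the sketch below expresses the MAP and nDCG@10 changes relative to the Original baseline. The input values are copied from the results table above; the helper `relative_change` is ours and is not part of the notebook.

```python
# MAP and nDCG@10 per query variant, copied from the results table above.
reported = {
    "Original":    {"map": 0.299058, "nDCG@10": 0.360562},
    "Neutral QE":  {"map": 0.222501, "nDCG@10": 0.270030},
    "Positive QE": {"map": 0.185194, "nDCG@10": 0.228209},
    "Negative QE": {"map": 0.213184, "nDCG@10": 0.255508},
}

def relative_change(value: float, baseline: float) -> float:
    """Change of `value` with respect to `baseline`, as a fraction of the baseline."""
    return (value - baseline) / baseline

baseline = reported["Original"]
relative = {
    name: {metric: relative_change(score, baseline[metric]) for metric, score in scores.items()}
    for name, scores in reported.items()
    if name != "Original"
}

for name, changes in relative.items():
    print(f"{name:12s} MAP {changes['map']:+.1%}   nDCG@10 {changes['nDCG@10']:+.1%}")
```

On these figures, MAP drops by roughly 26% for Neutral QE, 29% for Negative QE, and 38% for Positive QE relative to the unexpanded queries, which gives a sense of how large the penalty is in practical terms.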

## **Conclusion**

In this study, we conducted a systematic evaluation of how induced sentiment polarity in Large Language Model (LLM) query expansion influences the effectiveness of financial information retrieval. Our investigation began with an extensive baseline phase, establishing reference performance with traditional lexical methods, including TF-IDF, BM25, and the BM25+RM3 pseudo-relevance feedback mechanism. These benchmarks were subsequently compared against a hybrid retrieval pipeline, which used BM25 for candidate retrieval followed by a Bi-Encoder semantic re-ranker without query expansion. This hybrid architecture served as the baseline for our primary experiments.

The core of our research involved the generation of three sentiment-controlled query expansion variants: Neutral, Positive (Bullish), and Negative (Bearish). To check the integrity of our experimental variables, we verified the induced bias via sentiment analysis using the FinBERT classifier. This step showed that the average polarity of the generated terms shifted in the intended direction for each mode, with the expected ordering holding for about 79% of queries, providing a reasonable foundation for evaluating our hybrid pipeline under four distinct query conditions (Original, Neutral, Positive, and Negative).

The empirical results demonstrate that while the hybrid pipeline is highly effective for original queries, the introduction of sentiment-oriented expansion terms leads to a measurable decrease in retrieval metrics such as MAP and nDCG. This phenomenon is primarily attributed to Query Drift: by intentionally steering the expansion toward specific polarities, the semantic focus of the query is shifted away from the user’s original information need. In the financial domain, where precision is paramount, even domain-relevant terms can act as semantic noise if they emphasize sentiment over the core factual or advisory intent of the query.

Furthermore, our analysis revealed a notable polarity asymmetry. Negative-biased expansions proved to be more resilient than positive ones, suffering a less severe degradation in performance. This suggests that the vocabulary associated with financial risk and bearish outlooks is more semantically aligned with the technical and cautious nature of the FiQA dataset. Ultimately, these findings indicate that while LLMs are powerful tools for knowledge enrichment, the strategic induction of sentiment introduces a trade-off between semantic breadth and retrieval precision, highlighting the sensitivity of hybrid IR systems to linguistic steering.

### **Limitations and Future Work**

Despite the insights gained, this study is subject to certain limitations. First, the use of a relatively small language model (`Qwen-1.5B`) may have limited the linguistic diversity and technical depth of the expansions compared to larger frontier models. Second, our analysis was restricted to the FiQA dataset, which primarily consists of forum-style peer-to-peer discussions; consequently, the observed "polarity asymmetry" might vary in more formal financial news or regulatory corpora.

Future research should explore dynamic expansion filtering, where LLMs are instructed to prune noise-heavy terms before the retrieval phase. Additionally, investigating cross-domain resilience would determine if sentiment-induced query drift is unique to finance or a universal characteristic of hybrid IR systems. Finally, implementing Chain-of-Thought (CoT) prompting could help the model better align expansions with the specific financial logic of the query, potentially mitigating the performance decay observed in this study.