# RAG Implementation using Financial PhraseBank Dataset
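This notebook implements a small Retrieval-Augmented Generation (RAG) pipeline
over the Hugging Face Financial PhraseBank dataset: sentences are embedded with
`all-MiniLM-L6-v2`, indexed with FAISS, and the closest matches are passed to
GPT-2 as context for answering a question.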


In [1]:
!pip install datasets sentence-transformers faiss-cpu transformers -q


In [4]:
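# Replace the existing datasets installation with a pinned newer release.
# Restart the kernel afterwards so the new version is the one that gets imported.
!pip uninstall -y datasets
!pip install datasets==2.18.0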



Found existing installation: datasets 2.12.0
Uninstalling datasets-2.12.0:
  Successfully uninstalled datasets-2.12.0
Collecting datasets==2.18.0
  Downloading datasets-2.18.0-py3-none-any.whl.metadata (20 kB)
Collecting pyarrow>=12.0.0 (from datasets==2.18.0)
  Downloading pyarrow-23.0.0-cp311-cp311-macosx_12_0_arm64.whl.metadata (3.0 kB)
Collecting pyarrow-hotfix (from datasets==2.18.0)
  Downloading pyarrow_hotfix-0.7-py3-none-any.whl.metadata (3.6 kB)
Collecting fsspec<=2024.2.0,>=2023.1.0 (from fsspec[http]<=2024.2.0,>=2023.1.0->datasets==2.18.0)
  Downloading fsspec-2024.2.0-py3-none-any.whl.metadata (6.8 kB)
Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Downloading fsspec-2024.2.0-py3-none-any.whl (170 kB)
Downloading pyarrow-23.0.0-cp311-cp311-macosx_12_0_arm64.whl (34.3 MB)
Downloading pyarrow_hotfix-0.7-py3-none-any.whl (7.9 kB)

In [5]:
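from datasets import load_dataset

# "sentences_allagree" keeps only sentences on which every annotator agreed;
# take the first 500 of them.
dataset = load_dataset("financial_phrasebank", "sentences_allagree", split="train[:500]")

# Keep the raw sentence text; the sentiment labels are not used here.
documents = [item['sentence'] for item in dataset]

print("Total documents:", len(documents))
print(documents[:5])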

Found cached dataset financial_phrasebank (file:///Users/abhinavroyce/.cache/huggingface/datasets/financial_phrasebank/sentences_allagree/1.0.0/550bde12e6c30e2674da973a55f57edde5181d53f5a5a34c1531c53f93b7e141)


NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.
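
This `NotImplementedError` is often reported when an older `datasets` release runs against a newer `fsspec`. A kernel that imported `datasets` before the reinstall also keeps the old module in memory until it is restarted. The sketch below is one way to compare what is installed on disk with what the running kernel has loaded; the helper name `version_report` is hypothetical and not part of any library.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def version_report(packages=("datasets", "fsspec", "pyarrow")):
    """Map each package to its installed version and its in-memory version.

    "installed" is read from the package metadata on disk; "loaded" is the
    __version__ of the module already imported in this process, if any.
    Either value is None when it cannot be determined.
    """
    report = {}
    for name in packages:
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        module = sys.modules.get(name)
        loaded = getattr(module, "__version__", None) if module else None
        report[name] = {"installed": installed, "loaded": loaded}
    return report


for name, info in version_report().items():
    print(f"{name}: installed={info['installed']} loaded={info['loaded']}")
```

If `installed` and `loaded` differ for `datasets`, restarting the kernel and re-running the cell is the first thing to try.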

In [2]:
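from sentence_transformers import SentenceTransformer

# Small general-purpose sentence encoder
model = SentenceTransformer('all-MiniLM-L6-v2')

# One embedding vector per document, returned as a NumPy array
embeddings = model.encode(documents)

print("Embedding shape:", embeddings.shape)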


BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Embedding shape: (500, 384)


In [3]:
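import faiss
import numpy as np

# Exact (brute-force) index over L2 distance, sized to the embedding dimension
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)

# FAISS expects a contiguous float32 array
index.add(np.ascontiguousarray(embeddings, dtype="float32"))

print("Total vectors in index:", index.ntotal)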

Total vectors in index: 500
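
To make the search step concrete, here is a dependency-free sketch of what an exact flat L2 index computes: the squared Euclidean distance from the query to every stored vector, then the `k` smallest. The function name `flat_l2_search` and the toy vectors are illustrative only; FAISS does the same comparison far more efficiently.

```python
def flat_l2_search(vectors, query, k):
    """Return (squared_distances, indices) of the k vectors nearest to query."""
    scored = []
    for i, vec in enumerate(vectors):
        # Squared L2 distance; no square root is needed for ranking.
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # ascending by distance, ties broken by index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-dimensional "embeddings" (made up for illustration)
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(toy_vectors, [0.9, 0.1], k=2)
print(indices)  # → [1, 0]
```

Every query is compared against all stored vectors, which is fine for 500 sentences but grows linearly with the corpus.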


In [4]:
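def retrieve(query, k=3):
    """Return the k documents whose embeddings are closest to the query."""
    query_embedding = model.encode([query])  # shape (1, dimension)
    distances, indices = index.search(np.asarray(query_embedding, dtype="float32"), k)
    return [documents[i] for i in indices[0]]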

query = "Why did the company profit increase?"
results = retrieve(query)

print("Query:", query)
print("\nRetrieved Documents:")
for doc in results:
    print("-", doc)

Query: Why did the company profit increase?

Retrieved Documents:
- Previously , the company anticipated its operating profit to improve over the same period .
- The company 's net profit rose 11.4 % on the year to 82.2 million euros in 2005 on sales of 686.5 million euros , 13.8 % up on the year , the company said earlier .
- It estimates the operating profit to further improve from the third quarter .
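
The retrieval function discards the distances and indexes `documents` directly. FAISS fills result slots with the label `-1` when it finds fewer than `k` neighbours, and `documents[-1]` would then silently return the last sentence. A possible variant that keeps the distances and skips those slots is sketched below. `retrieve_with_scores` is a hypothetical name, and `encode` and `search` are passed in as parameters (standing in for `model.encode` and `index.search`) so the function can be tried with stubs.

```python
def retrieve_with_scores(query, encode, search, documents, k=3):
    """Return [(document, distance), ...] for the nearest documents.

    encode and search are stand-ins for model.encode and index.search;
    they are parameters here only so the sketch runs without a model.
    """
    query_embedding = encode([query])
    distances, indices = search(query_embedding, k)
    results = []
    for dist, i in zip(distances[0], indices[0]):
        if i < 0:  # label -1 marks an unfilled result slot
            continue
        results.append((documents[int(i)], float(dist)))
    return results


# Stubs for illustration only: two documents, but three neighbours requested.
def fake_encode(texts):
    return [[0.0] for _ in texts]


def fake_search(query_embedding, k):
    return [[0.1, 0.7, float("inf")]], [[1, 0, -1]]


docs = ["profit rose", "sales fell"]
print(retrieve_with_scores("q", fake_encode, fake_search, docs))
# → [('sales fell', 0.1), ('profit rose', 0.7)]
```

In the notebook this would be called as `retrieve_with_scores(query, model.encode, index.search, documents)`; smaller distances mean closer matches.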


In [5]:
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

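def generate_answer(query):
    """Generate an answer with GPT-2, using the retrieved sentences as context."""
    context = " ".join(retrieve(query))
    prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
    # max_length counts the prompt tokens as well as the generated ones
    response = generator(prompt, max_length=200, num_return_sequences=1)
    # generated_text contains the prompt followed by the model's continuation
    return response[0]['generated_text']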

print(generate_answer("What affected company earnings?"))

GPT2LMHeadModel LOAD REPORT from: gpt2
Key                  | Status     |  | 
---------------------+------------+--+-
h.{0...11}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Passing `generation_config` together with generation-related arguments=({'num_return_sequences', 'max_length'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Context: The transactions would increase earnings per share in the first quarter by some EUR0 .28 . Earnings per share for January-June 2010 were EUR0 .30 , an increase of 20 % year-on-year EUR0 .25 . Earnings per share EPS amounted to EUR0 .03 , up from the loss of EUR0 .08 .

Question: What affected company earnings?
Answer: Generally, EPS for companies which have substantial cash flows has a small impact on the company's earnings. The majority of companies in the Eurozone have sufficient cash and are considered to have sufficient cash flows to maintain their fiscal position within the euro area, while European companies have no cash flows at all. The impact of government measures such as tax, VAT and other revenue policies may be small, but they may have a large impact in the short term.

Question: What is the impact of the fiscal policies of the countries in the Eurozone?

Answer: The fiscal policies of the countries in the Eurozone are the main drivers of the company's earnings. T
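
The printed text repeats the prompt and then runs on into questions the model invents for itself. One possible post-processing step, sketched here with the hypothetical helpers `build_prompt` and `extract_answer`, is to strip the echoed prompt and cut the continuation at the first follow-up `Question:` line. The prompt format is the one used in `generate_answer`.

```python
def build_prompt(query, context_docs):
    """Assemble the same prompt layout that generate_answer uses."""
    context = " ".join(context_docs)
    return f"Context: {context}\n\nQuestion: {query}\nAnswer:"


def extract_answer(generated_text, prompt):
    """Return only the first answer from a text-generation output."""
    # Drop the echoed prompt if the output starts with it.
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    # Stop at the first follow-up question the model produces on its own.
    answer = completion.split("\nQuestion:", 1)[0]
    return answer.strip()


prompt = build_prompt("Q?", ["a", "b"])
generated = prompt + " Costs fell.\n\nQuestion: Another?\n\nAnswer: x"
print(extract_answer(generated, prompt))  # → Costs fell.
```

This only tidies the output; it does not make the answer more faithful to the retrieved context.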