# Executive Summary

To facilitate accurate Q&A over financial documents (OCBC credit reports), a Retrieval-Augmented Generation (RAG) approach can be used together with an LLM.  
Different retrieval methods were compared, and the Maximum Marginal Relevance (MMR) method was found to work best at retrieving diverse, non-repeating results from the vectorstore.  
A few test questions gave encouraging results: the LLM pulled correct answers from relevant documents and summarized them succinctly.

Basic workflow: Financial text -> Split into chunks -> Embed with OpenAI embeddings -> Load into vectorstore  
Ask question -> Retrieve most relevant chunks from vectorstore -> Pass to LLM as context -> Get answer

Some possible follow-ups that extend beyond the scope of this project:
* Can this workflow scale? Three documents were used, but how about a hundred?
* Chatbot functionality with memory and a clean interface could be built for non-technical stakeholders.
* How can performance drift be tracked for LLMs? What if the quality of embeddings/results deteriorates over time?
* Can tabular data be produced by specifying the output format? This could serve as input to ML/DL predictive models.

Inspired by: https://learn.deeplearning.ai/langchain-chat-with-your-data/

## Load libraries

In [1]:
import os
import numpy as np
from dotenv import load_dotenv, find_dotenv
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, NLTKTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate


load_dotenv(find_dotenv())
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
# os.environ["OPENAI_API_KEY"] = your_key_here

## Initialize LLM

In [2]:
# initialize LLM
llm = ChatOpenAI(
    temperature=0,
    openai_api_key=OPENAI_API_KEY,
    model_name="gpt-4",
    request_timeout=600,
)

## Select an appropriate text splitter

### Load a single PDF

In [3]:
loader = PyMuPDFLoader("../data/sg_credit_outlook_1H2023.pdf")
pages = loader.load_and_split()

In [4]:
len(pages)

122

In [5]:
pages[0].page_content[0:500]

'OCBC CREDIT RESEARCH \nSGD Credit Outlook 2023 \n \n Wednesday, January 04, 2023 \n \nTreasury Research & Strategy                                                                                                                                    i \n \n \n \nTreasury Advisory \nCorporate FX & Structured \nProducts  \nTel: 6349-1888 / 1881 \n \nFixed Income & Structured \nProducts \nTel: 6349-1810 \n \nInterest Rate Derivatives \nTel: 6349-1899 \n \nInvestments & Structured \nProducts \nTel: 6349-1886 \n \n \n \n \nOCBC Cre'

In [6]:
pages[0].metadata
pages[120].metadata

{'source': '../data/sg_credit_outlook_1H2023.pdf',
 'file_path': '../data/sg_credit_outlook_1H2023.pdf',
 'page': 80,
 'total_pages': 81,
 'format': 'PDF 1.7',
 'title': 'Credit Outlook –',
 'author': 'trt2',
 'subject': '',
 'keywords': '',
 'creator': 'Microsoft® Word for Microsoft 365',
 'producer': 'Microsoft® Word for Microsoft 365',
 'creationDate': "D:20230104164538+08'00'",
 'modDate': "D:20230105095819+08'00'",
 'trapped': ''}

### Compare between text splitters

Let's choose which TextSplitter to use. Here I'll compare results between `RecursiveCharacterTextSplitter` and `NLTKTextSplitter`.

In [7]:
# small parameters for now to conveniently assess results
chunk_size = 1000
chunk_overlap = 200

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=["\n\n", "\n", " ", ""],  # default values
)

nltk_splitter = NLTKTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

In [8]:
# random page
snippet = pages[25].page_content[0:3000]
snippet

'OCBC CREDIT RESEARCH \nSGD Credit Outlook 2023 \n \n Wednesday, January 04, 2023 \n \nTreasury Research & Strategy                                                                                                                                    18 \n• Assuming new debt for the purpose of the redemption of the perpetuals will increase LMRT leverage and reduce the debt \nheadroom available for the development of LMRT’s existing assets during the uncertain macroeconomic outlook. \n• Current market conditions are not favourable for LMIR Trust for the issuance of perpetual securities at a lower yield than the \nreset distribution rate. \n \nOCBC Credit Research commentary: \n• Perpetual reset date coincided with call date, distribution rate stepped up to 8.096% from only 6.6%. \n• In our view LMRT would have found it difficult to assess primary markets without external guarantees, given the protracted \nrecovery at LMRT’s underlying properties. LMRT is facing a tight adjusted interest cov

In [9]:
r_res = r_splitter.split_text(snippet)
len(r_res)

4

In [10]:
r_res

['OCBC CREDIT RESEARCH \nSGD Credit Outlook 2023 \n \n Wednesday, January 04, 2023 \n \nTreasury Research & Strategy                                                                                                                                    18 \n• Assuming new debt for the purpose of the redemption of the perpetuals will increase LMRT leverage and reduce the debt \nheadroom available for the development of LMRT’s existing assets during the uncertain macroeconomic outlook. \n• Current market conditions are not favourable for LMIR Trust for the issuance of perpetual securities at a lower yield than the \nreset distribution rate. \n \nOCBC Credit Research commentary: \n• Perpetual reset date coincided with call date, distribution rate stepped up to 8.096% from only 6.6%. \n• In our view LMRT would have found it difficult to assess primary markets without external guarantees, given the protracted',
 '• In our view LMRT would have found it difficult to assess primary markets without 

In [11]:
nltk_res = nltk_splitter.split_text(snippet)
len(nltk_res)

4

In [12]:
nltk_res

['OCBC CREDIT RESEARCH \nSGD Credit Outlook 2023 \n \n Wednesday, January 04, 2023 \n \nTreasury Research & Strategy                                                                                                                                    18 \n• Assuming new debt for the purpose of the redemption of the perpetuals will increase LMRT leverage and reduce the debt \nheadroom available for the development of LMRT’s existing assets during the uncertain macroeconomic outlook.\n\n• Current market conditions are not favourable for LMIR Trust for the issuance of perpetual securities at a lower yield than the \nreset distribution rate.\n\nOCBC Credit Research commentary: \n• Perpetual reset date coincided with call date, distribution rate stepped up to 8.096% from only 6.6%.\n\n• In our view LMRT would have found it difficult to assess primary markets without external guarantees, given the protracted \nrecovery at LMRT’s underlying properties.',
 '• In our view LMRT would have found it 

Chunks produced by `RecursiveCharacterTextSplitter` don't end at a sentence boundary or punctuation mark, whereas `NLTKTextSplitter` respects sentence boundaries.  
Let's go with `NLTKTextSplitter`.
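
To make the difference concrete, here is a minimal, library-free sketch of the two strategies. It is an illustration only: `split_by_chars` and `split_by_sentences` are hypothetical helpers, and the regex sentence splitter is a crude stand-in for NLTK's tokenizer, not what `NLTKTextSplitter` does internally.

```python
import re


def split_by_chars(text: str, chunk_size: int) -> list[str]:
    """Cut every `chunk_size` characters, ignoring sentence boundaries."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


def split_by_sentences(text: str, chunk_size: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `chunk_size` chars.

    A single sentence longer than `chunk_size` is kept whole, so a chunk
    can exceed the limit.
    """
    # Assumption: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Made-up sample text loosely modelled on the report's wording.
sample = (
    "Perpetual reset date coincided with call date. "
    "Distribution rate stepped up. "
    "Refinancing costs have increased."
)
char_chunks = split_by_chars(sample, 60)
sentence_chunks = split_by_sentences(sample, 60)
print(char_chunks)
print(sentence_chunks)
```

Keeping sentences whole is also a likely reason a sentence-based splitter can report a chunk longer than the requested size: a single long sentence cannot be cut without breaking the boundary rule.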

### Split PDF and assess results

In [13]:
chunk_size = 3000
chunk_overlap = int(chunk_size * 0.1)  # keep the overlap an integer number of characters

nltk_splitter = NLTKTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

split_res = nltk_splitter.split_documents(pages)

Created a chunk of size 3503, which is longer than the specified 3000


In [14]:
len(split_res)

182

In [15]:
split_res[50]

Document(page_content='OCBC CREDIT RESEARCH \nSGD Credit Outlook 2023 \n \n Wednesday, January 04, 2023 \n \nTreasury Research & Strategy                                                                                                                                    25 \nChina \n• \n2018: Announced non-participation in the pilot \nphase of the CORSIA \n• \nSep: Released "2022 China Civil Aviation Green \nDevelopment Policy and Action", targeting a \ncumulative use of 50,000 tons of SAF by 2025.\n\nSource: ICAO  \n \nSingapore is well-positioned to become an established, regional petrochemical hub that can offer a conducive \nenvironment for developing and introducing sustainable aviation products.\n\nFor instance, Neste, the world’s largest \nproducer of SAF, is expanding its production capacity in Singapore in 2023.\n\nIt aims to be able to roll out as much as 1 \nmillion metric tons of SAF per annum at its facility, making Singapore Neste’s main SAF production site globally.\n\nShe

## Load all PDFs and split

In [16]:
path = "../data/"

loaders = []

for file in os.listdir(path):
    if file.endswith(".pdf"):
        loaders.append(PyMuPDFLoader(os.path.join(path, file)))

loaders

[<langchain.document_loaders.pdf.PyMuPDFLoader at 0x2304ec3bb50>,
 <langchain.document_loaders.pdf.PyMuPDFLoader at 0x2304ec50350>,
 <langchain.document_loaders.pdf.PyMuPDFLoader at 0x2304fd8b8d0>]

In [17]:
docs = []
for loader in loaders:
    docs.extend(loader.load())

len(docs)

208

In [18]:
# splits docs

splits = nltk_splitter.split_documents(docs)

Created a chunk of size 3503, which is longer than the specified 3000


In [19]:
len(splits)

352

## Use OpenAI text embeddings

In [20]:
embedding = OpenAIEmbeddings()

In [21]:
sentence1 = (
    "The sun sets in the evening, casting a warm orange glow across the horizon."
)
sentence2 = "Twilight descends upon the land as the day draws to a close, painting the sky with hues of red and gold."
sentence3 = "Baby JJ crawled up the mattress to get his milk."

In [22]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [23]:
np.dot(embedding1, embedding2)

0.903089832355181

In [24]:
np.dot(embedding1, embedding3), np.dot(embedding2, embedding3),

(0.7355978857170626, 0.7370830702024485)
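
A raw dot product can only be read as a similarity score when the vectors have unit length, in which case it equals the cosine similarity. OpenAI's documentation describes its embeddings as normalized to length 1, which would make the `np.dot` shortcut above valid, though that is worth verifying for the model in use. The sketch below uses made-up 2-d vectors (not real embeddings) and a hypothetical `cosine_similarity` helper to show the difference when vectors are not normalized.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors of length 5, chosen for easy arithmetic (not real embeddings).
u = [3.0, 4.0]
v = [4.0, 3.0]

raw_dot = sum(x * y for x, y in zip(u, v))  # scales with vector length
cos = cosine_similarity(u, v)               # depends only on direction
print(raw_dot, cos)
```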

## Initialize Vectorstore

Embedding databases (also known as vector databases/stores) store embeddings and let you search by nearest neighbors rather than by substring matching, as in a traditional database.  
Here, Chroma is used.

In [25]:
persist_directory = "docs/chroma"

In [26]:
vectordb = Chroma.from_documents(
    documents=splits, embedding=embedding, persist_directory=persist_directory
)

vectordb.persist()

100%|██████████| 1/1 [00:05<00:00,  5.68s/it]


In [27]:
print(vectordb._collection.count())

1056


In [28]:
# does vectordb count tally with total splits?
vectordb._collection.count() == len(splits)

False
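
The count is exactly three times the number of splits (1056 = 3 × 352). One plausible explanation, which this notebook does not confirm, is that `docs/chroma` already held embeddings from earlier runs and the new ones were appended to them. Under that assumption, the sketch below shows one way to detect repeats before loading: derive a stable ID from each chunk's content and origin, and keep only the first occurrence. `chunk_id` and `dedupe_chunks` are hypothetical helpers, and the chunks are plain dicts with made-up data rather than LangChain `Document` objects.

```python
import hashlib


def chunk_id(text: str, source: str, page: int) -> str:
    """Stable ID derived from the chunk's content and origin."""
    key = f"{source}|{page}|{text}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()


def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Keep the first occurrence of each (source, page, text) chunk."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        cid = chunk_id(chunk["text"], chunk["source"], chunk["page"])
        if cid not in seen:
            seen.add(cid)
            unique.append(chunk)
    return unique


# Simulate loading the same two chunks three times (made-up data).
run = [
    {"text": "What next for Evergrande?", "source": "a.pdf", "page": 29},
    {"text": "Sector wide issue", "source": "a.pdf", "page": 30},
]
stored = run * 3
unique_chunks = dedupe_chunks(stored)
print(len(stored), len(unique_chunks))
```

Clearing the persist directory before rebuilding would be a simpler alternative if the assumption holds.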

## Comparing retrieval methods


With the documents and embeddings in the vectorstore, there are several ways to retrieve this information.  
Here three methods are compared: `similarity_search`, `max_marginal_relevance_search` (MMR) and `ContextualCompressionRetriever`, which wraps a base retriever and uses an LLM to keep only the passages relevant to the query.

Similarity search: returns the chunks whose embeddings are most similar to the embedding of the query. LangChain's example selectors, described at the link below, use the same idea and rank by cosine similarity.
https://python.langchain.com/docs/modules/model_io/prompts/example_selectors/similarity

MMR: selects chunks that are similar to the query while also optimizing for diversity.  
It starts from the chunks most similar to the query, then adds them one at a time, penalizing each candidate for its closeness to chunks already selected.
https://python.langchain.com/docs/modules/model_io/prompts/example_selectors/mmr
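
The MMR selection loop can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not LangChain's implementation: the `mmr` and `cosine` helpers, the trade-off weight `lambda_mult=0.5`, and the toy 2-d vectors are all chosen here for demonstration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(query, candidates, k, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by Maximum Marginal Relevance.

    score = lambda_mult * sim(query, c)
            - (1 - lambda_mult) * max(sim(c, s) for s already selected)
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,  # nothing selected yet, so no penalty
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.5],    # 0: closest to the query
    [1.0, 0.51],   # 1: near-duplicate of 0
    [1.0, -0.6],   # 2: slightly less similar, but points a different way
]
top_by_similarity = sorted(
    range(len(candidates)), key=lambda i: -cosine(query, candidates[i])
)[:2]
top_by_mmr = mmr(query, candidates, k=2)
print(top_by_similarity, top_by_mmr)
```

Plain similarity ranking picks the two near-duplicates, whereas MMR keeps the top hit and swaps the duplicate for the more distinct candidate.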

### Similarity Search

In [29]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

smalldb = Chroma.from_texts(texts, embedding=embedding)

100%|██████████| 1/1 [00:01<00:00,  1.13s/it]


In [30]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [31]:
smalldb.similarity_search(question, k=2)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).', metadata={})]

### Maximum Marginal Relevance

MMR penalizes the text that similarity search ranked second (the "epigeous fruiting body" sentence) for its closeness to the top result, and instead returns the Death Cap text, which is related yet different.  
Let's try it on our docs.

In [32]:
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.', metadata={})]

In [33]:
docs_qn = "What is the latest update for China Evergrande?"

Using similarity search, all three results are identical, at least in their first 300 characters.

In [34]:
ss_res = vectordb.similarity_search(docs_qn, k=3)
[res.page_content[:300] for res in ss_res[:3]]

['What next for Evergrande?\n\nOn 3 December 2021, EVERRE announced that it may not have sufficient funds to perform its \nfinancial obligations given the group’s current liquidity situation and planned to work with creditors on a restructuring plan \nfor its offshore debt.\n\nOn 6 December 2021, the compan',
 'What next for Evergrande?\n\nOn 3 December 2021, EVERRE announced that it may not have sufficient funds to perform its \nfinancial obligations given the group’s current liquidity situation and planned to work with creditors on a restructuring plan \nfor its offshore debt.\n\nOn 6 December 2021, the compan',
 'What next for Evergrande?\n\nOn 3 December 2021, EVERRE announced that it may not have sufficient funds to perform its \nfinancial obligations given the group’s current liquidity situation and planned to work with creditors on a restructuring plan \nfor its offshore debt.\n\nOn 6 December 2021, the compan']

With MMR, there are no repeated results.

In [35]:
mmr_res = vectordb.max_marginal_relevance_search(docs_qn, k=3, fetch_k=10)
[res.page_content[:200] for res in mmr_res[:3]]

['What next for Evergrande?\n\nOn 3 December 2021, EVERRE announced that it may not have sufficient funds to perform its \nfinancial obligations given the group’s current liquidity situation and planned to',
 'Sector wide issue though China Evergrande is emblematic of the situation: What started at EVERRE in terms of liquidity \nstress has snowballed to the rest of the market.\n\nIn part, the industry challeng',
 'OCBC CREDIT RESEARCH \nSGD Credit Outlook 2022 \n \nFriday, December 31, 2021 \n \nTreasury Research & Strategy                                                                                              ']

In [36]:
# take a closer look at the third search result
mmr_res[2].page_content

'OCBC CREDIT RESEARCH \nSGD Credit Outlook 2022 \n \nFriday, December 31, 2021 \n \nTreasury Research & Strategy                                                                                                                                    31 \n \nFigure 24: China Residential Buildings Price Change (70 Cities) \n \nSource: National Bureau of Statistics \n \nWhat we can glean from the situation: Aside from an adverse industry outlook, certain company-level characteristics have \nexacerbated the situation.\n\nWhile EVERRE is the largest issuer with ~USD19bn of bonds outstanding (including those issued \nby Scenery Journey Ltd, an indirect wholly-owned subsidiary of EVERRE), certain of the credit considerations leading to its \nvulnerabilities in an adverse situation are shared by other bond issuers.\n\nAside from USD-bonds, EVERRE also has onshore \nbonds issued by Hengda Real Estate Group Company Limited (“Hengda”) where Hengda is 59.9%-owned by EVERRE as at \n31 December 2020.\n\nW

One observation: with a low `fetch_k` and a low `k`, all results come from the same document.  
However, the other documents also contain information about Evergrande.

In [37]:
[(res.metadata["source"], res.metadata["page"]) for res in mmr_res[:3]]

[('../data/singapore credit outlook 2022 shell_abridged.pdf', 31),
 ('../data/singapore credit outlook 2022 shell_abridged.pdf', 29),
 ('../data/singapore credit outlook 2022 shell_abridged.pdf', 30)]

In [38]:
mmr_res = vectordb.max_marginal_relevance_search(docs_qn, k=10, fetch_k=200)

Increasing `k` and `fetch_k` seems to fix this. All three documents are now being cited.
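
One way to see why `fetch_k` matters: MMR only re-ranks the `fetch_k` candidates that plain similarity search returns first, so a source that is absent from that pool can never appear in the final results. The sketch below illustrates this with a hypothetical `sources_in_pool` helper and made-up similarity scores; the file names are shortened stand-ins for the three PDFs.

```python
def sources_in_pool(ranked, fetch_k):
    """Sources represented among the top `fetch_k` similarity hits."""
    return {source for source, _score in ranked[:fetch_k]}


# (source, similarity) pairs sorted by similarity; scores are invented.
ranked = [
    ("2022.pdf", 0.91),
    ("2022.pdf", 0.90),
    ("2022.pdf", 0.89),
    ("1H2023.pdf", 0.85),
    ("2H2023.pdf", 0.84),
]

small_pool = sources_in_pool(ranked, fetch_k=3)
large_pool = sources_in_pool(ranked, fetch_k=5)
print(small_pool, large_pool)
```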

In [39]:
sorted(
    [(res.metadata["source"], res.metadata["page"]) for res in mmr_res],
    key=lambda x: (x[0], x[1]),
)

[('../data/sg_credit_outlook_1H2023.pdf', 12),
 ('../data/sg_credit_outlook_1H2023.pdf', 54),
 ('../data/sg_credit_outlook_2H2023.pdf', 17),
 ('../data/sg_credit_outlook_2H2023.pdf', 32),
 ('../data/singapore credit outlook 2022 shell_abridged.pdf', 29),
 ('../data/singapore credit outlook 2022 shell_abridged.pdf', 30),
 ('../data/singapore credit outlook 2022 shell_abridged.pdf', 31),
 ('../data/singapore credit outlook 2022 shell_abridged.pdf', 31),
 ('../data/singapore credit outlook 2022 shell_abridged.pdf', 32),
 ('../data/singapore credit outlook 2022 shell_abridged.pdf', 80)]

In [40]:
[res.page_content[:100] for res in mmr_res]

['What next for Evergrande?\n\nOn 3 December 2021, EVERRE announced that it may not have sufficient fund',
 'Sector wide issue though China Evergrande is emblematic of the situation: What started at EVERRE in ',
 'OCBC CREDIT RESEARCH \nSGD Credit Outlook 2022 \n \nFriday, December 31, 2021 \n \nTreasury Research & St',
 'OCBC CREDIT RESEARCH \nSGD Credit Outlook 2022 \n \nFriday, December 31, 2021 \n \nTreasury Research & St',
 'While certain private debt funds \n(with their higher demands on information rights) may still be wil',
 'Chiefly, Heungkuk Life Insurance Co, a South Korean insurer, announced to delay its early \nrepayment',
 'China earlier in 2021 \nannounced that it would monitor financial risks related to climate change and',
 'OCBC CREDIT RESEARCH \nSGD Credit Outlook 2022 \n \nFriday, December 31, 2021 \n \nTreasury Research & St',
 'In other words, corporations or governments that \nissue the most debt represent the largest proporti',
 'Most still \nhold significant in

### Contextual Compression

In [41]:
# create compressor
compressor = LLMChainExtractor.from_llm(llm)

In [42]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=vectordb.as_retriever()
)

In [43]:
# Helper function for printing docs
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

In [44]:
compressed_docs = compression_retriever.get_relevant_documents(docs_qn)
pretty_print_docs(compressed_docs)



Document 1:

On 3 December 2021, EVERRE announced that it may not have sufficient funds to perform its financial obligations given the group’s current liquidity situation and planned to work with creditors on a restructuring plan for its offshore debt. On 6 December 2021, the company announced that it has received a demand to perform its obligations under a guarantee amounting to ~USD260mn (likely on the Jumbo bond). The company subsequently announced that it has set up a seven-person risk management committee where five members comprise of non-company representatives. These five members include a representative from China Cinda Asset Management Co, an asset manager focused on bad debts and a representative from Guangdong Holdings Limited, a provincial-level state-owned company.
----------------------------------------------------------------------------------------------------
Document 2:

On 3 December 2021, EVERRE announced that it may not have sufficient funds to perform its financ

`vectordb.as_retriever()` returns a `VectorStoreRetriever` which, by default, uses similarity search.  
Hence we see repeated results once more. Let's examine the results using MMR.

In [45]:
compression_retriever_mmr = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=vectordb.as_retriever(search_type="mmr")
)

In [46]:
compressed_docs = compression_retriever_mmr.get_relevant_documents(docs_qn)
pretty_print_docs(compressed_docs)

Document 1:

On 3 December 2021, EVERRE announced that it may not have sufficient funds to perform its financial obligations given the group’s current liquidity situation and planned to work with creditors on a restructuring plan for its offshore debt.

On 6 December 2021, the company announced that it has received a demand to perform its obligations under a guarantee amounting to ~USD260mn (likely on the Jumbo bond).

The company subsequently announced that it has set up a seven-person risk management committee where five members comprise of non-company representatives.

These five members include a representative from China Cinda Asset Management Co, an asset manager focused on bad debts and a representative from Guangdong Holdings Limited, a provincial-level state-owned company.
----------------------------------------------------------------------------------------------------
Document 2:

Sector wide issue though China Evergrande is emblematic of the situation: What started at EVE

As we saw earlier, MMR gives more diverse, less repetitive results than similarity search.

### Retrieval without vectorstores

In [47]:
from langchain.retrievers import SVMRetriever

In [48]:
all_docs_text = [d.page_content for d in docs]
joined_docs_text = " ".join(all_docs_text)

In [49]:
nltk_splitter = NLTKTextSplitter(chunk_size=1000, chunk_overlap=100)

docs_splits = nltk_splitter.split_text(joined_docs_text)

Created a chunk of size 1106, which is longer than the specified 1000
Created a chunk of size 2035, which is longer than the specified 1000
Created a chunk of size 1511, which is longer than the specified 1000
Created a chunk of size 1355, which is longer than the specified 1000
Created a chunk of size 5089, which is longer than the specified 1000
Created a chunk of size 1173, which is longer than the specified 1000
Created a chunk of size 2671, which is longer than the specified 1000
Created a chunk of size 1954, which is longer than the specified 1000
Created a chunk of size 2116, which is longer than the specified 1000
Created a chunk of size 1163, which is longer than the specified 1000
Created a chunk of size 2524, which is longer than the specified 1000
Created a chunk of size 1109, which is longer than the specified 1000
Created a chunk of size 1067, which is longer than the specified 1000
Created a chunk of size 1489, which is longer than the specified 1000
Created a chunk of s

In [50]:
svm_retriever = SVMRetriever.from_texts(docs_splits, embedding)

In [51]:
docs_qn = "What is the latest update for China Evergrande?"
docs_svm = svm_retriever.get_relevant_documents(docs_qn)
docs_svm[0]



Document(page_content='This points towards further defaults for the sector.\n\nAmong the limited number of \nhigh grade and “crossover” Chinese property developers, refinancing costs have increased even if access is still available.\n\nWhat next for Evergrande?\n\nOn 3 December 2021, EVERRE announced that it may not have sufficient funds to perform its \nfinancial obligations given the group’s current liquidity situation and planned to work with creditors on a restructuring plan \nfor its offshore debt.\n\nOn 6 December 2021, the company announced that it has received a demand to perform its obligations \nunder a guarantee amounting to ~USD260mn (likely on the Jumbo bond).\n\nThe company subsequently announced that it \nhas set up a seven-person risk management committee where five members comprise of non-company representatives.', metadata={})

It works well, but the metadata is missing. There is probably a way to include metadata, but that is outside the scope of this project.
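
The metadata is lost because all pages were joined into a single string before splitting. One possible workaround, sketched here with plain dicts rather than LangChain objects, is to split each page on its own and keep a lookup from chunk text back to that page's metadata. `split_with_metadata` and `naive_sentence_split` are hypothetical helpers, and the sample pages are made up; whether the retriever can accept metadata directly is not verified here.

```python
def split_with_metadata(pages, split_fn):
    """Split each page separately so every chunk keeps its page's metadata."""
    texts, metadatas = [], []
    for page in pages:
        for chunk in split_fn(page["page_content"]):
            texts.append(chunk)
            metadatas.append(page["metadata"])
    return texts, metadatas


def naive_sentence_split(text):
    """Crude stand-in for a real text splitter: one chunk per sentence."""
    return [s.strip() for s in text.split(".") if s.strip()]


# Made-up pages mimicking the shape of loaded documents.
pages = [
    {
        "page_content": "Evergrande may default. Creditors are negotiating.",
        "metadata": {"source": "2022.pdf", "page": 29},
    },
    {
        "page_content": "Bond indices track segments.",
        "metadata": {"source": "2H2023.pdf", "page": 17},
    },
]

texts, metadatas = split_with_metadata(pages, naive_sentence_split)
metadata_by_text = dict(zip(texts, metadatas))

retrieved_text = texts[1]  # pretend the retriever returned this chunk
print(retrieved_text, metadata_by_text[retrieved_text])
```

A text-keyed lookup breaks down if two chunks have identical text, so a real version would need a more robust key.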

In [52]:
docs_svm

[Document(page_content='This points towards further defaults for the sector.\n\nAmong the limited number of \nhigh grade and “crossover” Chinese property developers, refinancing costs have increased even if access is still available.\n\nWhat next for Evergrande?\n\nOn 3 December 2021, EVERRE announced that it may not have sufficient funds to perform its \nfinancial obligations given the group’s current liquidity situation and planned to work with creditors on a restructuring plan \nfor its offshore debt.\n\nOn 6 December 2021, the company announced that it has received a demand to perform its obligations \nunder a guarantee amounting to ~USD260mn (likely on the Jumbo bond).\n\nThe company subsequently announced that it \nhas set up a seven-person risk management committee where five members comprise of non-company representatives.', metadata={}),
 Document(page_content='As at 30 June 2021, EVERRE’s contract liabilities which would be where such \nobligations are likely to sit was repor

## Question Answering

In [53]:
qa_chain = RetrievalQA.from_chain_type(llm, retriever=vectordb.as_retriever())

In [54]:
docs_qn

'What is the latest update for China Evergrande?'

In [55]:
result = qa_chain({"query": docs_qn})

In [56]:
result["result"]

'The latest update for China Evergrande, as of the information provided, is that the company is facing significant liquidity challenges and may not have sufficient funds to perform its financial obligations. It has received a demand to perform its obligations under a guarantee amounting to approximately USD260 million. In response to these challenges, Evergrande has set up a seven-person risk management committee, five of whom are non-company representatives. These representatives include individuals from China Cinda Asset Management Co and Guangdong Holdings Limited. The company is also planning to work with creditors on a restructuring plan for its offshore debt.'

In [57]:
# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, 
just say that you don't know, don't try to make up an answer.
Keep the answer as concise as possible. The tone should be informative. Use bullet points.
For each answer, indicate the year that that answer is applicable to.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
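
For intuition, the default "stuff" chain type places all retrieved chunks into the `{context}` slot of a single prompt. The sketch below mimics that with plain string formatting; `build_stuff_prompt` is a hypothetical helper, the template is shortened, the chunks are made up, and the blank-line separator between chunks is an assumption rather than a confirmed detail of LangChain's behaviour.

```python
template = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""


def build_stuff_prompt(template: str, chunks: list[str], question: str) -> str:
    """Join all retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return template.format(context=context, question=question)


# Made-up chunks standing in for retrieved document text.
chunks = [
    "EVERRE announced that it may not have sufficient funds.",
    "The company set up a seven-person risk management committee.",
]
prompt = build_stuff_prompt(
    template, chunks, "What is the latest update for China Evergrande?"
)
print(prompt)
```

Because every chunk goes into one prompt, this approach is limited by the model's context window as `k` or the chunk size grows.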

In [58]:
# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)

In [59]:
qa_res = qa_chain(
    {"query": "What are some key developments in the high yield bond market?"}
)

In [60]:
print(qa_res["result"])

- High yield bonds, also known as junk bonds, were initially issued by former investment grade issuers who lost their investment grade status, known as "fallen angels" (No specific year mentioned).
- From the 1980s, high yield bonds were issued from the outset and heavily marketed by US investment banks, becoming a fast-growing segment of the bond market (1980s).
- The Asiadollar high yield bond market, excluding Japan, is a large and liquid market with a significant institutional investor following (No specific year mentioned).
- As of 14 December 2021, the amount of high yield bonds outstanding in the market was USD286.6bn (~SGD391.6bn), with 44% of these bonds issued by property developers in China and another 19% by issuers from India and Indonesia (2021).
- The SGD corporate credit market is mainly an unrated market, leading to variations in yields and secondary market liquidity due to differing perceptions of credit risk among market participants (2021).
- OCBC Credit Research ma

In [61]:
qa_chain_refine = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="refine",
)

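The output from `chain_type="refine"` reads as a more cohesive narrative: connected paragraphs rather than separate bullet points. Note that this chain uses the default prompts, not `QA_CHAIN_PROMPT`.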
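
Conceptually, a refine-style chain drafts an answer from the first retrieved chunk and then revises that draft once for each further chunk. The sketch below shows that loop with a stand-in model; `refine_answer` and `fake_llm` are hypothetical, and the prompt wording is invented for illustration, not taken from LangChain.

```python
def refine_answer(question, chunks, llm):
    """Draft from the first chunk, then revise once per remaining chunk.

    Assumes `chunks` contains at least one item.
    """
    answer = llm(f"Context: {chunks[0]}\nQuestion: {question}\nAnswer:")
    for chunk in chunks[1:]:
        answer = llm(
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"New context: {chunk}\n"
            "Refine the existing answer if the new context helps."
        )
    return answer


calls = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: records the prompt, returns a canned reply."""
    calls.append(prompt)
    return f"draft {len(calls)}"


final = refine_answer("q?", ["chunk A", "chunk B", "chunk C"], fake_llm)
print(final, len(calls))
```

The loop makes one model call per chunk, so this approach trades extra latency and cost for an answer that is built up incrementally.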

In [62]:
refine_result = qa_chain_refine(
    {"query": "What are some key developments in the high yield bond market?"}
)
print(refine_result["result"])

The high yield bond market, also known as junk bonds or speculative grade bonds, has seen several key developments. Initially, these bonds were mainly issued by former investment grade issuers who lost their investment grade status, known as "fallen angels". However, starting from the 1980s, bonds issued as high yield from the outset were launched and heavily marketed by US investment banks, becoming a fast-growing segment of the bond market.

The Asiadollar (excluding Japan) high yield bond market has grown into a large and liquid market with a significant institutional investor following. As of 14 December 2021, the amount of high yield bonds outstanding in the market was calculated at USD286.6bn (~SGD391.6bn). The market is highly concentrated with 44% of these bonds issued by property developers with their main operations in China, and another 19% are issuers from India and Indonesia.

The SGD high yield market is mainly an unrated market, meaning a lack of explicitly available cre

I did some checks and found that most of the results were coming from only one PDF (singapore credit outlook 2022 shell_abridged.pdf).  
I reckon it's because this PDF discusses high yield (HY) bonds more than the other two do.  

I then asked about a topic (bond indices) found only in sg_credit_outlook_2H2023.pdf, and the chain did return relevant output from this PDF, assuaging my concern that only splits from one document were being sent to the LLM.
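
A check of this kind can be as simple as tallying the `source` field of the documents a chain returns. The sketch below assumes the chain was built with `return_source_documents=True` so that those documents are available; `Doc` is a minimal stand-in for a document object, `count_sources` is a hypothetical helper, and the data is made up.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document with metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def count_sources(source_documents):
    """Count how many retrieved documents come from each source file."""
    return Counter(doc.metadata["source"] for doc in source_documents)


# Made-up retrieval result: three chunks from one PDF, one from another.
source_documents = [
    Doc("...", {"source": "2022.pdf", "page": 29}),
    Doc("...", {"source": "2022.pdf", "page": 30}),
    Doc("...", {"source": "2022.pdf", "page": 31}),
    Doc("...", {"source": "2H2023.pdf", "page": 17}),
]
source_counts = count_sources(source_documents)
print(source_counts.most_common())
```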

In [65]:
refine_result = qa_chain_refine(
    {"query": "Tell me about key characteristics of bond indices."}
)
print(refine_result["result"])

Bond indices are crucial tools for tracking the performance of specific segments of the bond market and are often used as benchmarks for bond index funds (BIFs). These funds are diversified portfolios of bonds that align with the performance of a specific bond index and can come in various forms, including bond mutual funds and exchange-traded funds (ETFs) that invest in bonds.

However, while bond indices offer benefits such as diversification and low-cost investments, they also present certain challenges. For instance, bond indices are typically weighted by the amount of debt outstanding by the issuers, meaning corporations or governments that issue the most debt represent the largest proportion of the index. This can lead to a situation where the index is heavily weighted towards the most indebted issuers.

Moreover, replicating a bond index can be difficult due to the high number of constituents and the relatively illiquid nature of bonds. The high trading cost of bonds can also po