## Objective

To enable accurate Q&A over financial documents (OCBC credit reports), we can combine Retrieval-Augmented Generation (RAG) with an LLM.  
The basic workflow has two stages:  
Indexing: Financial text -> Split into chunks -> Embed chunks with OpenAI embeddings -> Load into vectorstore  
Querying: Ask a question -> Look up vectorstore -> Retrieve the most relevant chunks -> Pass them to the LLM -> Get an answer grounded in the documents

Inspired by: https://learn.deeplearning.ai/langchain-chat-with-your-data/

## Load libraries

In [185]:
import os
import numpy as np
from dotenv import load_dotenv, find_dotenv
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, NLTKTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate


load_dotenv(find_dotenv())
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
# os.environ["OPENAI_API_KEY"] = your_key_here

## Initialize LLM

In [36]:
# initialize LLM
llm = ChatOpenAI(
    temperature=0,
    openai_api_key=OPENAI_API_KEY,
    model_name="gpt-4",
    request_timeout=600,
)

## Select an appropriate text splitter

### Load a single PDF

In [2]:
loader = PyMuPDFLoader("../data/sg_credit_outlook_1H2023.pdf")
pages = loader.load_and_split()

In [3]:
len(pages)

122

In [4]:
pages[0].page_content[0:500]

'OCBC CREDIT RESEARCH \nSGD Credit Outlook 2023 \n \n Wednesday, January 04, 2023 \n \nTreasury Research & Strategy                                                                                                                                    i \n \n \n \nTreasury Advisory \nCorporate FX & Structured \nProducts  \nTel: 6349-1888 / 1881 \n \nFixed Income & Structured \nProducts \nTel: 6349-1810 \n \nInterest Rate Derivatives \nTel: 6349-1899 \n \nInvestments & Structured \nProducts \nTel: 6349-1886 \n \n \n \n \nOCBC Cre'

In [5]:
pages[0].metadata
pages[120].metadata

{'source': '../data/sg_credit_outlook_1H2023.pdf',
 'file_path': '../data/sg_credit_outlook_1H2023.pdf',
 'page': 80,
 'total_pages': 81,
 'format': 'PDF 1.7',
 'title': 'Credit Outlook –',
 'author': 'trt2',
 'subject': '',
 'keywords': '',
 'creator': 'Microsoft® Word for Microsoft 365',
 'producer': 'Microsoft® Word for Microsoft 365',
 'creationDate': "D:20230104164538+08'00'",
 'modDate': "D:20230105095819+08'00'",
 'trapped': ''}

### Compare between text splitters

Let's choose which TextSplitter to use. Here I'll compare results between `RecursiveCharacterTextSplitter` and `NLTKTextSplitter`.

In [6]:
# small parameters for now to conveniently assess results
chunk_size = 1000
chunk_overlap = 200

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=["\n\n", "\n", " ", ""],  # default values
)

nltk_splitter = NLTKTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

In [7]:
# random page
snippet = pages[25].page_content[0:3000]
snippet

'OCBC CREDIT RESEARCH \nSGD Credit Outlook 2023 \n \n Wednesday, January 04, 2023 \n \nTreasury Research & Strategy                                                                                                                                    18 \n• Assuming new debt for the purpose of the redemption of the perpetuals will increase LMRT leverage and reduce the debt \nheadroom available for the development of LMRT’s existing assets during the uncertain macroeconomic outlook. \n• Current market conditions are not favourable for LMIR Trust for the issuance of perpetual securities at a lower yield than the \nreset distribution rate. \n \nOCBC Credit Research commentary: \n• Perpetual reset date coincided with call date, distribution rate stepped up to 8.096% from only 6.6%. \n• In our view LMRT would have found it difficult to assess primary markets without external guarantees, given the protracted \nrecovery at LMRT’s underlying properties. LMRT is facing a tight adjusted interest cov

In [8]:
r_res = r_splitter.split_text(snippet)
len(r_res)

4

In [9]:
r_res

['OCBC CREDIT RESEARCH \nSGD Credit Outlook 2023 \n \n Wednesday, January 04, 2023 \n \nTreasury Research & Strategy                                                                                                                                    18 \n• Assuming new debt for the purpose of the redemption of the perpetuals will increase LMRT leverage and reduce the debt \nheadroom available for the development of LMRT’s existing assets during the uncertain macroeconomic outlook. \n• Current market conditions are not favourable for LMIR Trust for the issuance of perpetual securities at a lower yield than the \nreset distribution rate. \n \nOCBC Credit Research commentary: \n• Perpetual reset date coincided with call date, distribution rate stepped up to 8.096% from only 6.6%. \n• In our view LMRT would have found it difficult to assess primary markets without external guarantees, given the protracted',
 '• In our view LMRT would have found it difficult to assess primary markets without 

In [10]:
nltk_res = nltk_splitter.split_text(snippet)
len(nltk_res)

4

In [11]:
nltk_res

['OCBC CREDIT RESEARCH \nSGD Credit Outlook 2023 \n \n Wednesday, January 04, 2023 \n \nTreasury Research & Strategy                                                                                                                                    18 \n• Assuming new debt for the purpose of the redemption of the perpetuals will increase LMRT leverage and reduce the debt \nheadroom available for the development of LMRT’s existing assets during the uncertain macroeconomic outlook.\n\n• Current market conditions are not favourable for LMIR Trust for the issuance of perpetual securities at a lower yield than the \nreset distribution rate.\n\nOCBC Credit Research commentary: \n• Perpetual reset date coincided with call date, distribution rate stepped up to 8.096% from only 6.6%.\n\n• In our view LMRT would have found it difficult to assess primary markets without external guarantees, given the protracted \nrecovery at LMRT’s underlying properties.',
 '• In our view LMRT would have found it 

The chunks produced by `RecursiveCharacterTextSplitter` don't end on sentence boundaries or punctuation, whereas `NLTKTextSplitter` splits between sentences.  
Let's go with `NLTKTextSplitter`.
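
To illustrate why sentence-aware splitting gives cleaner chunk endings, here is a minimal sketch in plain Python. It is not the `NLTKTextSplitter` implementation: the regex sentence splitter, the helper names `split_sentences` and `pack_sentences`, and the toy text are assumptions for illustration only.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def pack_sentences(sentences, chunk_size):
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    A single sentence longer than chunk_size is kept whole, so a chunk can
    exceed the limit.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Toy text, not taken verbatim from the reports.
sample = (
    "Perpetual reset date coincided with call date. "
    "Distribution rate stepped up. "
    "LMRT is facing a tight interest coverage ratio."
)

# Fixed-width slicing can cut a sentence in half.
char_chunks = [sample[i:i + 60] for i in range(0, len(sample), 60)]

# Sentence packing always ends a chunk on a sentence boundary.
sentence_chunks = pack_sentences(split_sentences(sample), chunk_size=60)

print(char_chunks)
print(sentence_chunks)
```

Because a sentence is never cut, a chunk can end up longer than `chunk_size` when a single sentence exceeds the limit.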

### Split PDF and assess results

In [12]:
chunk_size = 3000
chunk_overlap = int(chunk_size * 0.1)  # keep the overlap a whole number of characters

nltk_splitter = NLTKTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

split_res = nltk_splitter.split_documents(pages)

Created a chunk of size 3503, which is longer than the specified 3000


In [122]:
len(split_res)

182

In [14]:
split_res[50]

Document(page_content='OCBC CREDIT RESEARCH \nSGD Credit Outlook 2023 \n \n Wednesday, January 04, 2023 \n \nTreasury Research & Strategy                                                                                                                                    25 \nChina \n• \n2018: Announced non-participation in the pilot \nphase of the CORSIA \n• \nSep: Released "2022 China Civil Aviation Green \nDevelopment Policy and Action", targeting a \ncumulative use of 50,000 tons of SAF by 2025.\n\nSource: ICAO  \n \nSingapore is well-positioned to become an established, regional petrochemical hub that can offer a conducive \nenvironment for developing and introducing sustainable aviation products.\n\nFor instance, Neste, the world’s largest \nproducer of SAF, is expanding its production capacity in Singapore in 2023.\n\nIt aims to be able to roll out as much as 1 \nmillion metric tons of SAF per annum at its facility, making Singapore Neste’s main SAF production site globally.\n\nShe

## Load all PDFs and split

In [15]:
path = "../data/"

loaders = []

for file in os.listdir(path):
    if file.endswith(".pdf"):
        loaders.append(PyMuPDFLoader(os.path.join(path, file)))

loaders

[<langchain.document_loaders.pdf.PyMuPDFLoader at 0x1db232a1990>,
 <langchain.document_loaders.pdf.PyMuPDFLoader at 0x1db2411c090>,
 <langchain.document_loaders.pdf.PyMuPDFLoader at 0x1db2411c3d0>]

In [16]:
docs = []
for loader in loaders:
    docs.extend(loader.load())

len(docs)

208

In [17]:
# splits docs

splits = nltk_splitter.split_documents(docs)

Created a chunk of size 3503, which is longer than the specified 3000


In [18]:
len(splits)

352

## Use OpenAI text embeddings

In [19]:
embedding = OpenAIEmbeddings()

In [20]:
sentence1 = (
    "The sun sets in the evening, casting a warm orange glow across the horizon."
)
sentence2 = "Twilight descends upon the land as the day draws to a close, painting the sky with hues of red and gold."
sentence3 = "Baby JJ crawled up the mattress to get his milk."

In [21]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [22]:
np.dot(embedding1, embedding2)

0.903089832355181

In [23]:
np.dot(embedding1, embedding3), np.dot(embedding2, embedding3),

(0.7355978857170626, 0.7370830702024485)
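
The dot products above can be read as similarity scores: the two sunset sentences score higher with each other than either does with the unrelated third sentence. A dot product equals cosine similarity only when both vectors have unit length. OpenAI embeddings are, as far as I know, normalized that way, but treat that as an assumption. A minimal sketch with made-up 3-dimensional vectors (not real embeddings) shows the difference:

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(a):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(a, a))
    return [x / length for x in a]


# Made-up toy vectors, not real embeddings.
u = [3.0, 4.0, 0.0]
v = [6.0, 8.0, 0.0]  # same direction as u, twice the length

print(dot(u, v))                        # depends on vector length
print(cosine_similarity(u, v))          # depends only on direction
print(dot(normalize(u), normalize(v)))  # matches cosine once normalized
```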

## Initialize Vectorstore

Embeddings databases (also known as vector databases/stores) store embeddings and allow you to search by nearest neighbors rather than by substrings like a traditional database.  
Here, Chroma is used.
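
To make "search by nearest neighbors" concrete, here is a minimal sketch of a brute-force in-memory store. It is not how Chroma works internally; `ToyVectorStore` and the hand-written 2-d vectors are hypothetical stand-ins for a real store and real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


class ToyVectorStore:
    """Hypothetical in-memory store with brute-force nearest-neighbour search."""

    def __init__(self):
        self._items = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self._items.append((text, vector))

    def search(self, query_vector, k=2):
        """Return the k texts whose vectors are most similar to the query."""
        ranked = sorted(
            self._items,
            key=lambda item: cosine(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
# Hand-written 2-d vectors standing in for real embeddings.
store.add("bond yields rose", [0.9, 0.1])
store.add("credit spreads widened", [0.8, 0.3])
store.add("mushroom foraging tips", [0.1, 0.9])

print(store.search([1.0, 0.2], k=2))
```

The query never has to share a substring with the stored text; only the vectors are compared.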

In [24]:
persist_directory = "docs/chroma"

In [25]:
vectordb = Chroma.from_documents(
    documents=splits, embedding=embedding, persist_directory=persist_directory
)

vectordb.persist()

100%|██████████| 1/1 [00:05<00:00,  5.47s/it]


In [26]:
print(vectordb._collection.count())

704


In [27]:
# does vectordb count tally with total splits?
vectordb._collection.count() == len(splits)

False
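
The count (704) is exactly twice `len(splits)` (352). One plausible but unverified explanation is that the same splits were written twice to the persisted collection, for example by running the `from_documents` cell against a `persist_directory` that already held them. Duplicate entries would also be consistent with identical top results in a similarity search. The sketch below uses a plain dict as a stand-in for a collection (it is not Chroma's API) to show how deriving ids from a content hash makes repeated inserts idempotent; `content_id` and `add_once` are hypothetical helper names.

```python
import hashlib


def content_id(text):
    """Stable id derived from the chunk text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def add_once(collection, texts):
    """Insert texts keyed by content hash; re-adding the same text is a no-op."""
    for text in texts:
        collection.setdefault(content_id(text), text)
    return collection


chunks = ["chunk one", "chunk two", "chunk three"]

# Appending blindly doubles the count on a second run.
naive = []
naive.extend(chunks)
naive.extend(chunks)

# Hash-keyed inserts stay at one entry per distinct chunk.
deduped = {}
add_once(deduped, chunks)
add_once(deduped, chunks)

print(len(naive), len(deduped))
```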

## Comparing retrieval methods


With the documents and embeddings in the vectorstore, there are several ways to retrieve this information.  
Here three methods are compared: `similarity_search`, `max_marginal_relevance_search` (MMR) and `ContextualCompressionRetriever`.

Similarity search: Selects documents based on similarity to the query. It does this by finding the documents whose embeddings have the greatest cosine similarity with the query embedding.
https://python.langchain.com/docs/modules/model_io/prompts/example_selectors/similarity

MMR: Selects documents based on a combination of similarity to the query and diversity among the selected documents.  
It does this by finding the documents whose embeddings have the greatest cosine similarity with the query, and then adding them iteratively while penalizing closeness to documents already selected.
https://python.langchain.com/docs/modules/model_io/prompts/example_selectors/mmr
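
The MMR selection rule described above can be sketched in a few lines of plain Python. This is an illustrative implementation, not LangChain's code: the function name `mmr`, the `lambda_mult` trade-off weight and the toy 2-d vectors are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance.

    Each step picks the candidate maximizing
    lambda_mult * sim(query, doc) - (1 - lambda_mult) * max sim(doc, selected).
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy vectors: docs 0 and 1 are near-duplicates, doc 2 is related but different.
query = [1.0, 1.0]
docs = [[1.0, 0.21], [1.0, 0.2], [0.1, 1.0]]

by_similarity = sorted(
    range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True
)[:2]
by_mmr = mmr(query, docs, k=2)

print(by_similarity)  # the two near-duplicates
print(by_mmr)         # one of the near-duplicates plus the different doc
```

With `lambda_mult=1.0` the redundancy penalty disappears and the ranking reduces to plain similarity search.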

### Similarity Search

In [28]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

smalldb = Chroma.from_texts(texts, embedding=embedding)

100%|██████████| 1/1 [00:00<00:00,  1.99it/s]


In [29]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [30]:
smalldb.similarity_search(question, k=2)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).', metadata={})]

### Maximum Marginal Relevance

MMR penalizes the second similarity-search result for being too similar to the first, and instead returns the third text, which is related yet different.  
Let's try it on our docs.

In [31]:
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.', metadata={})]

In [32]:
docs_qn = "What is the latest update for China Evergrande?"

Using similarity search, the first two search results are the same.

In [33]:
ss_res = vectordb.similarity_search(docs_qn, k=3)
[res.page_content[:300] for res in ss_res[:3]]

['What next for Evergrande?\n\nOn 3 December 2021, EVERRE announced that it may not have sufficient funds to perform its \nfinancial obligations given the group’s current liquidity situation and planned to work with creditors on a restructuring plan \nfor its offshore debt.\n\nOn 6 December 2021, the compan',
 'What next for Evergrande?\n\nOn 3 December 2021, EVERRE announced that it may not have sufficient funds to perform its \nfinancial obligations given the group’s current liquidity situation and planned to work with creditors on a restructuring plan \nfor its offshore debt.\n\nOn 6 December 2021, the compan',
 'OCBC CREDIT RESEARCH \nSGD Credit Outlook 2022 \n \nFriday, December 31, 2021 \n \nTreasury Research & Strategy                                                                                                                                    30 \nCase Study: China Evergrande Group (“EVERRE”) - Towards a ']

With MMR, there are no repeated results.

In [92]:
mmr_res = vectordb.max_marginal_relevance_search(docs_qn, k=3, fetch_k=10)
[res.page_content[:200] for res in mmr_res[:3]]

['What next for Evergrande?\n\nOn 3 December 2021, EVERRE announced that it may not have sufficient funds to perform its \nfinancial obligations given the group’s current liquidity situation and planned to',
 'Sector wide issue though China Evergrande is emblematic of the situation: What started at EVERRE in terms of liquidity \nstress has snowballed to the rest of the market.\n\nIn part, the industry challeng',
 'OCBC CREDIT RESEARCH \nSGD Credit Outlook 2022 \n \nFriday, December 31, 2021 \n \nTreasury Research & Strategy                                                                                              ']

In [93]:
# take a closer look at the third search result
mmr_res[2].page_content

'OCBC CREDIT RESEARCH \nSGD Credit Outlook 2022 \n \nFriday, December 31, 2021 \n \nTreasury Research & Strategy                                                                                                                                    31 \n \nFigure 24: China Residential Buildings Price Change (70 Cities) \n \nSource: National Bureau of Statistics \n \nWhat we can glean from the situation: Aside from an adverse industry outlook, certain company-level characteristics have \nexacerbated the situation.\n\nWhile EVERRE is the largest issuer with ~USD19bn of bonds outstanding (including those issued \nby Scenery Journey Ltd, an indirect wholly-owned subsidiary of EVERRE), certain of the credit considerations leading to its \nvulnerabilities in an adverse situation are shared by other bond issuers.\n\nAside from USD-bonds, EVERRE also has onshore \nbonds issued by Hengda Real Estate Group Company Limited (“Hengda”) where Hengda is 59.9%-owned by EVERRE as at \n31 December 2020.\n\nW

One thing I observed: with a low `fetch_k` and a low `k`, all results come from the same document.  
However, the other documents also contain information about Evergrande.

In [96]:
[(res.metadata["source"], res.metadata["page"]) for res in mmr_res[:3]]

[('../data/singapore credit outlook 2022 shell_abridged.pdf', 31),
 ('../data/singapore credit outlook 2022 shell_abridged.pdf', 29),
 ('../data/singapore credit outlook 2022 shell_abridged.pdf', 30)]

In [97]:
mmr_res = vectordb.max_marginal_relevance_search(docs_qn, k=10, fetch_k=200)

Increasing `k` and `fetch_k` seems to fix this. All three documents are now being cited.

In [108]:
sorted(
    [(res.metadata["source"], res.metadata["page"]) for res in mmr_res],
    key=lambda x: (x[0], x[1]),
)

[('../data/sg_credit_outlook_1H2023.pdf', 12),
 ('../data/sg_credit_outlook_1H2023.pdf', 54),
 ('../data/sg_credit_outlook_1H2023.pdf', 64),
 ('../data/sg_credit_outlook_1H2023.pdf', 66),
 ('../data/sg_credit_outlook_2H2023.pdf', 22),
 ('../data/singapore credit outlook 2022 shell_abridged.pdf', 29),
 ('../data/singapore credit outlook 2022 shell_abridged.pdf', 31),
 ('../data/singapore credit outlook 2022 shell_abridged.pdf', 32),
 ('../data/singapore credit outlook 2022 shell_abridged.pdf', 60),
 ('../data/singapore credit outlook 2022 shell_abridged.pdf', 80)]
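
A quick tally makes the spread across files explicit. The sketch below simply counts the `(source, page)` pairs copied from the sorted output above; in the notebook the same pairs could be built from `mmr_res`.

```python
from collections import Counter

# (source, page) pairs copied from the sorted output above.
hits = [
    ("../data/sg_credit_outlook_1H2023.pdf", 12),
    ("../data/sg_credit_outlook_1H2023.pdf", 54),
    ("../data/sg_credit_outlook_1H2023.pdf", 64),
    ("../data/sg_credit_outlook_1H2023.pdf", 66),
    ("../data/sg_credit_outlook_2H2023.pdf", 22),
    ("../data/singapore credit outlook 2022 shell_abridged.pdf", 29),
    ("../data/singapore credit outlook 2022 shell_abridged.pdf", 31),
    ("../data/singapore credit outlook 2022 shell_abridged.pdf", 32),
    ("../data/singapore credit outlook 2022 shell_abridged.pdf", 60),
    ("../data/singapore credit outlook 2022 shell_abridged.pdf", 80),
]

# Count how many retrieved chunks each PDF contributed.
per_source = Counter(source for source, _ in hits)
for source, count in per_source.most_common():
    print(count, source)
```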

In [100]:
[res.page_content[:100] for res in mmr_res]

['What next for Evergrande?\n\nOn 3 December 2021, EVERRE announced that it may not have sufficient fund',
 'Sector wide issue though China Evergrande is emblematic of the situation: What started at EVERRE in ',
 'OCBC CREDIT RESEARCH \nSGD Credit Outlook 2022 \n \nFriday, December 31, 2021 \n \nTreasury Research & St',
 'Chiefly, Heungkuk Life Insurance Co, a South Korean insurer, announced to delay its early \nrepayment',
 'China earlier in 2021 \nannounced that it would monitor financial risks related to climate change and',
 'OCBC CREDIT RESEARCH \nSGD Credit Outlook 2022 \n \nFriday, December 31, 2021 \n \nTreasury Research & St',
 'Crédit Agricole Group \nACAFP\n3.800%\n30-Apr-26\nSGD325mn\n95.96\n5.14%\nWe are overweight the Crédit Agri',
 'In July \n2021, REITs again broke new ground with Frasers Logistics and Commercial Trust issuing the ',
 'By this time, our Macro colleagues expect greater clarity on central \nbank resolve and peak interest',
 'In May 2023, MINT disclosed th

### Contextual Compression

In [37]:
# create compressor
compressor = LLMChainExtractor.from_llm(llm)

In [38]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=vectordb.as_retriever()
)

In [39]:
# Helper function for printing docs
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

In [None]:
compressed_docs = compression_retriever.get_relevant_documents(docs_qn)
pretty_print_docs(compressed_docs)

`vectordb.as_retriever()` returns a `VectorStoreRetriever` which, by default, uses similarity search.  
Hence we see repeated results once more. Let's examine the results using MMR.

In [41]:
compression_retriever_mmr = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=vectordb.as_retriever(search_type="mmr")
)

In [None]:
compressed_docs = compression_retriever_mmr.get_relevant_documents(docs_qn)
pretty_print_docs(compressed_docs)

As we saw earlier, MMR gives more diverse results than similarity search.
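
Conceptually, contextual compression post-processes each retrieved document and keeps only the parts relevant to the query. `LLMChainExtractor` delegates that judgement to the LLM. The sketch below swaps in a crude keyword-overlap filter purely for illustration, so `compress`, `keywords`, the stop-word list and the toy document are all hypothetical.

```python
import re

STOP_WORDS = {"what", "is", "the", "for", "a", "of", "to", "in", "and"}


def keywords(text):
    """Lower-cased words minus a tiny stop-word list."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def compress(document, query):
    """Keep only the sentences that share at least one keyword with the query."""
    query_terms = keywords(query)
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if keywords(s) & query_terms]
    return " ".join(kept)


# Toy document for illustration only.
doc = (
    "Evergrande planned a restructuring of its offshore debt. "
    "Neste is expanding its production capacity in Singapore. "
    "Creditors demanded that Evergrande honour a guarantee."
)

compressed = compress(doc, "What is the latest update for China Evergrande?")
print(compressed)
```

An LLM-based extractor can judge relevance far better than keyword overlap, at the cost of one extra LLM call per retrieved document.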

### Retrieval without vectorstores

In [118]:
from langchain.retrievers import SVMRetriever

In [115]:
all_docs_text = [d.page_content for d in docs]
joined_docs_text = " ".join(all_docs_text)

In [123]:
nltk_splitter = NLTKTextSplitter(chunk_size=1000, chunk_overlap=100)

docs_splits = nltk_splitter.split_text(joined_docs_text)

Created a chunk of size 1106, which is longer than the specified 1000
Created a chunk of size 2035, which is longer than the specified 1000
Created a chunk of size 1511, which is longer than the specified 1000
Created a chunk of size 1355, which is longer than the specified 1000
Created a chunk of size 5089, which is longer than the specified 1000
Created a chunk of size 1173, which is longer than the specified 1000
Created a chunk of size 2671, which is longer than the specified 1000
Created a chunk of size 1954, which is longer than the specified 1000
Created a chunk of size 2116, which is longer than the specified 1000
Created a chunk of size 1163, which is longer than the specified 1000
Created a chunk of size 2524, which is longer than the specified 1000
Created a chunk of size 1109, which is longer than the specified 1000
Created a chunk of size 1067, which is longer than the specified 1000
Created a chunk of size 1489, which is longer than the specified 1000
Created a chunk of s

In [124]:
svm_retriever = SVMRetriever.from_texts(docs_splits, embedding)

In [126]:
docs_qn = "What is the latest update for China Evergrande?"
docs_svm = svm_retriever.get_relevant_documents(docs_qn)
docs_svm[0]



Document(page_content='This points towards further defaults for the sector.\n\nAmong the limited number of \nhigh grade and “crossover” Chinese property developers, refinancing costs have increased even if access is still available.\n\nWhat next for Evergrande?\n\nOn 3 December 2021, EVERRE announced that it may not have sufficient funds to perform its \nfinancial obligations given the group’s current liquidity situation and planned to work with creditors on a restructuring plan \nfor its offshore debt.\n\nOn 6 December 2021, the company announced that it has received a demand to perform its obligations \nunder a guarantee amounting to ~USD260mn (likely on the Jumbo bond).\n\nThe company subsequently announced that it \nhas set up a seven-person risk management committee where five members comprise of non-company representatives.', metadata={})

It works well, but the metadata is missing. There is probably a way to include metadata, but that's out of scope for this project.
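
One possible way to keep track of sources, sketched here in plain Python rather than with `SVMRetriever`, is to split each page separately instead of joining all pages into one string, and to keep a lookup from chunk text back to its origin. The `Page` tuple, the `index_chunks` helper and the toy pages are hypothetical.

```python
from collections import namedtuple

Page = namedtuple("Page", ["text", "metadata"])


def index_chunks(pages, split):
    """Split each page separately and remember where every chunk came from."""
    chunk_texts, origin = [], {}
    for page in pages:
        for chunk in split(page.text):
            chunk_texts.append(chunk)
            # First occurrence wins if the same text appears on several pages.
            origin.setdefault(chunk, page.metadata)
    return chunk_texts, origin


# Toy pages; in the notebook these would come from the loaded PDFs.
pages = [
    Page("Alpha sentence. Beta sentence.", {"source": "a.pdf", "page": 0}),
    Page("Gamma sentence.", {"source": "b.pdf", "page": 3}),
]

# Stand-in splitter: one chunk per sentence.
chunk_texts, origin = index_chunks(
    pages,
    split=lambda text: [s.strip() + "." for s in text.split(".") if s.strip()],
)

retrieved = "Gamma sentence."  # pretend a retriever returned this text
print(origin[retrieved])
```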

In [131]:
docs_svm

[Document(page_content='This points towards further defaults for the sector.\n\nAmong the limited number of \nhigh grade and “crossover” Chinese property developers, refinancing costs have increased even if access is still available.\n\nWhat next for Evergrande?\n\nOn 3 December 2021, EVERRE announced that it may not have sufficient funds to perform its \nfinancial obligations given the group’s current liquidity situation and planned to work with creditors on a restructuring plan \nfor its offshore debt.\n\nOn 6 December 2021, the company announced that it has received a demand to perform its obligations \nunder a guarantee amounting to ~USD260mn (likely on the Jumbo bond).\n\nThe company subsequently announced that it \nhas set up a seven-person risk management committee where five members comprise of non-company representatives.', metadata={}),
 Document(page_content='As at 30 June 2021, EVERRE’s contract liabilities which would be where such \nobligations are likely to sit was repor

## Question Answering

In [146]:
qa_chain = RetrievalQA.from_chain_type(llm, retriever=vectordb.as_retriever())

In [151]:
docs_qn

'What is the latest update for China Evergrande?'

In [147]:
result = qa_chain({"query": docs_qn})

In [150]:
result["result"]

'The latest update for China Evergrande, as of 6 December 2021, is that the company has received a demand to perform its obligations under a guarantee amounting to approximately USD260 million. This is likely related to the Jumbo bond. In response to its financial difficulties, Evergrande has set up a seven-person risk management committee, five of whom are non-company representatives. These representatives include individuals from China Cinda Asset Management Co, an asset manager focused on bad debts, and Guangdong Holdings Limited, a provincial-level state-owned company. Evergrande is planning to work with creditors on a restructuring plan for its offshore debt due to its current liquidity situation.'

In [172]:
# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, 
just say that you don't know, don't try to make up an answer.
Keep the answer as concise as possible. The tone should be informative. Use bullet points.
For each answer, indicate the year that that answer is applicable to.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

In [173]:
# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)

In [174]:
qa_res = qa_chain(
    {"query": "What are some key developments in the high yield bond market?"}
)

In [175]:
print(qa_res["result"])

- High yield bonds are more suited for investors with higher holding power due to their illiquidity, especially in stressed liquidity scenarios (2020).
- High yield bonds can offer equity-like returns, with potential for sizable capital gains in the riskier parts of the market (2020).
- High yield bonds are less sensitive to interest rate rises, making them an ideal investment in a rising interest rate environment (2020).
- The SGD high yield market lacks depth and liquidity, with no reliable default rate indicators and highly concentrated defaults since 2015 (2020).
- The opportunity set for high yield bonds expands with the inclusion of "crossover" bullets and perpetuals (2020).
- Primary market issuances may have peaked in 2021 due to significant frontloading of issuances for essential refinancing (2021).
- China's deleveraging campaign could pose a potential headwind to the high yield bond market (2021).
- Mergers and acquisitions activities could drive issuances across investment 

In [186]:
qa_chain_refine = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="refine",
)

The output from `chain_type="refine"` is a lot more structured. 

In [187]:
refine_result = qa_chain_refine(
    {"query": "What are some key developments in the high yield bond market?"}
)
print(refine_result["result"])

The high yield bond market has seen several key developments:

1. Peak in Primary Market Issuances: There is a belief that significant frontloading of issuances for essential refinancing could mean that primary market issuances may have already peaked in 2021.

2. China's Deleveraging Campaign: The Chinese government's deleveraging policies could serve as a potential headwind. Chinese companies have contributed significantly to global bond issuances, and increased regulatory risk in certain sectors such as property, technology, tuition, food delivery, and commodities highlight the increased regulatory risk.

3. Uptick in Mergers and Acquisitions: M&A activities are expected to serve as a key driver for issuances across investment grade and high-yield bonds. With many companies having covered much of their refinancing needs in 2021, corporate priorities may begin to shift towards long-term growth, business diversification and supply chain strengthening strategies. Sectors such as techno
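
The difference between the default `stuff` chain type and `refine` can be sketched without any LLM. The code below is a simplified illustration, not LangChain's implementation: `CountingLLM` is a hypothetical stand-in for a chat model, and the prompt wording is invented.

```python
class CountingLLM:
    """Hypothetical stand-in for a chat model that counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prompt):
        self.calls += 1
        return f"draft {self.calls}"


def stuff_answer(llm, question, chunks):
    """'Stuff': put every retrieved chunk into a single prompt, one LLM call."""
    context = "\n\n".join(chunks)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def refine_answer(llm, question, chunks):
    """'Refine': answer from the first chunk, then revise once per remaining chunk."""
    answer = llm(f"Context:\n{chunks[0]}\n\nQuestion: {question}")
    for chunk in chunks[1:]:
        answer = llm(
            f"Existing answer: {answer}\n"
            f"New context:\n{chunk}\n\n"
            f"Refine the existing answer to the question: {question}"
        )
    return answer


chunks = ["chunk A", "chunk B", "chunk C"]
stuff_llm, refine_llm = CountingLLM(), CountingLLM()

stuff_answer(stuff_llm, "What changed?", chunks)
refine_answer(refine_llm, "What changed?", chunks)

print(stuff_llm.calls, refine_llm.calls)
```

The refine strategy makes one call per chunk, so it is slower and costs more, but each call sees less text and the answer is built up step by step.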

I did some checks and found that most of the results came from only one PDF (singapore credit outlook 2022 shell_abridged.pdf). I reckon it's because this PDF discusses high yield (HY) more than the other two.  

Asking a question on bond indices, a topic found only in sg_credit_outlook_2H2023.pdf, returns a relevant answer, which allays my concern that only splits from one document were being sent to the LLM.

In [189]:
refine_result = qa_chain_refine(
    {"query": "Tell me about key characteristics of bond indices."}
)
print(refine_result["result"])

Bond indices are essential tools for tracking the performance of specific segments of the bond market. They are often used as benchmarks for Bond Index Funds (BIFs), which are portfolios of bonds that aim to match the performance of a particular bond index. Some of the most common bond indices track the U.S. investment-grade bond market and the Asiadollar credit market.

However, passive investing in bonds through index replication may not always be the best strategy. This is primarily due to several reasons. Firstly, bond indices are typically weighted by the amount of debt outstanding by the various issuers. This means that the most indebted issuers represent the largest proportion of the index, which can lead to a heavy weighting towards these issuers.

Secondly, due to the high number of constituents, bond indices can be challenging to replicate accurately, leading to a higher tracking error. Thirdly, the relatively illiquid nature of bonds can pose challenges for replication. Last