# Retrieval-Augmented Generation (RAG)
This notebook is used for 'Methods for Fintech and Artificial Intelligence in Finance'. It experiments with different ways of adding valuable information to the answer generated by an LLM (retrieval-augmented generation, few-shot learning). It does not consider fine-tuning an LLM, as this typically comes at significant cost.

# Setting up the environment

https://community.anaconda.cloud/t/how-do-i-use-an-existing-environment-on-a-new-computer/55641

Load environment with conda
* cd ~/affine

* conda env create -f environment.yml

(If this is the first time, install a kernel)

* python -m ipykernel install --user --name=**kernel-name**


Remove kernel from system
* jupyter kernelspec list
* jupyter kernelspec uninstall **kernel-name**

Example code:

conda env create -n affine-project-env -f environment.yml

conda activate affine-project-env

python -m ipykernel install --user --name=affine-project-kernel


## Additional notes:
Export the current environment to a file so it can be recreated elsewhere:
* conda env export > environment.yml

In [1]:
%pip --version

pip 24.2 from /opt/miniconda3/envs/affine-project-env/lib/python3.12/site-packages/pip (python 3.12)
Note: you may need to restart the kernel to use updated packages.


In [None]:
# Note: use %pip install for quick tests; once a package should be part of the environment, install it with conda instead
#dotenv
%pip install python-dotenv

# langchain set-up packages
%pip install --upgrade --quiet langchain
%pip install -qU "langchain-chroma>=0.1.2"
%pip install --upgrade --quiet  langchain-google-genai
%pip install --upgrade --quiet langchain-openai
%pip install --upgrade --quiet langchain-unstructured
# %pip install "unstructured[all-docs]"
#CHANGED BUT NOT TESTED YET
%pip install --upgrade --quiet unstructured-client
%pip install unstructured

In [4]:
%pip freeze | grep langchain

langchain==0.3.2
langchain-chroma==0.1.4
langchain-core==0.3.9
langchain-google-genai==2.0.0
langchain-openai==0.2.2
langchain-text-splitters==0.3.0
langchain-unstructured==0.1.5
Note: you may need to restart the kernel to use updated packages.


In [65]:
from dotenv import dotenv_values
ENV = dotenv_values(".env")
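
`dotenv_values` returns a dict-like mapping, so a key that is missing from `.env` only shows up later as `None` when a client is constructed. A minimal sketch of an early check, assuming the key names read from `ENV` elsewhere in this notebook; `missing_keys` is a hypothetical helper and not part of python-dotenv:

```python
# Keys that later cells read from ENV (taken from this notebook).
REQUIRED_KEYS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
    "GEMINI_API_KEY",
]

def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `env`."""
    return [key for key in required if not env.get(key)]

# Stand-in dict for illustration; in the notebook you would pass ENV instead.
example_env = {"AZURE_OPENAI_ENDPOINT": "https://example.invalid", "GEMINI_API_KEY": ""}
print(missing_keys(example_env))
```

Running the check right after loading `.env` makes a configuration problem visible before any API call is attempted.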

In [None]:
# Imports from a tutorial written for langchain 0.2; kept because they may still be useful
# import os
# from langchain.vectorstores import Chroma
# from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
# from langchain.schema.runnable import RunnablePassthrough
# from langchain_core.output_parsers import StrOutputParser
# from langchain.prompts import (
#     ChatPromptTemplate,
#     FewShotChatMessagePromptTemplate,
# )
# # from operator import itemgetter

# Prepare data
The data covers Apple:
* 10-Q filings for Q2-2024 and Q3-2024
* Apple's Q3 earnings call
* 100 articles from last month related to Apple

The 10-Q PDF files have been converted to JSON files using high-resolution models because the PDFs contain tables. More information:
* https://docs.unstructured.io/api-reference/how-to/choose-hi-res-model
* Also see: preprocessing_10-Q.py
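
The conversion script itself is not shown here. As a rough sketch of how such JSON could be flattened into text for chunking, assuming each file is a list of elements with `type`, `text`, and (for tables) `metadata.text_as_html`, which is the shape Unstructured element JSON commonly has; `elements_to_text` and the sample data are hypothetical:

```python
import json

def elements_to_text(elements):
    """Join element texts; prefer the HTML rendering for tables when present."""
    parts = []
    for element in elements:
        metadata = element.get("metadata") or {}
        if element.get("type") == "Table" and metadata.get("text_as_html"):
            parts.append(metadata["text_as_html"])
        elif element.get("text"):
            parts.append(element["text"])
    # Blank lines between parts match the "\n\n" separator used when splitting.
    return "\n\n".join(parts)

# Tiny made-up sample standing in for a converted 10-Q file.
sample = json.loads("""[
  {"type": "Title", "text": "CONDENSED CONSOLIDATED BALANCE SHEETS"},
  {"type": "Table", "text": "Total assets 1 2",
   "metadata": {"text_as_html": "<table><tr><td>Total assets</td></tr></table>"}}
]""")
print(elements_to_text(sample))
```

Keeping the HTML form of a table preserves row and column structure that the plain `text` field loses.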

In [2]:
# from langchain_unstructured import UnstructuredLoader
# loader = UnstructuredLoader('book-Copy1.txt')
# raw_doc = loader.load()

In [10]:
from langchain_text_splitters import CharacterTextSplitter
from os import listdir
from os.path import isfile, join
# Read every file in a folder and split the contents into overlapping chunks

def load_texts(input_folder_path):
  # Read each file in the folder as one document
  documents = []
  file_names = [f for f in listdir(input_folder_path) if isfile(join(input_folder_path, f))]
  for file_name in file_names:
    file_path = join(input_folder_path, file_name)
    with open(file_path, "r") as f:
      documents.append(f.read())

  text_splitter = CharacterTextSplitter(
      separator="\n\n",
      chunk_size=1000,
      chunk_overlap=200,
      length_function=len,
      is_separator_regex=False,
  )
  texts = text_splitter.create_documents(documents)
  return texts
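
`CharacterTextSplitter` splits only on the given separator and then merges the pieces up to `chunk_size`, so a passage with no blank line inside it can end up longer than 1000 characters. The following is a simplified pure-Python sketch of that behavior (overlap is ignored); `split_on_separator` is illustrative and not the library's implementation:

```python
def split_on_separator(text, separator="\n\n", chunk_size=1000):
    """Greedily merge separator-delimited pieces; overlap is ignored here."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than chunk_size is kept whole.
            current = piece
    if current:
        chunks.append(current)
    return chunks

text = "a" * 30 + "\n\n" + "b" * 30 + "\n\n" + "c" * 150
print([len(chunk) for chunk in split_on_separator(text, chunk_size=100)])
```

If a strict size limit matters, `RecursiveCharacterTextSplitter` from the same package is one option to consider, since it falls back to finer separators when a piece is too long.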

In [23]:
# texts = load_texts('./data/final_data') #load_texts('./rawdata/apple_10-Q-Q2-2024-As-Filed.pdf')

In [4]:
from langchain_openai import AzureOpenAIEmbeddings

def text_embedding_3_small_azure():
  AZURE_OPENAI_ENDPOINT = ENV.get('AZURE_OPENAI_ENDPOINT')
  AZURE_OPENAI_API_KEY = ENV.get('AZURE_OPENAI_API_KEY')
  AZURE_OPENAI_API_VERSION = ENV.get('AZURE_OPENAI_API_VERSION')

  embedding_model = AzureOpenAIEmbeddings(
      model="text-embedding-3-small",
      # dimensions: Optional[int] = None, # Can specify dimensions with new text-embedding-3 models
      azure_endpoint=AZURE_OPENAI_ENDPOINT, #If not provided, will read env variable AZURE_OPENAI_ENDPOINT
      api_key=AZURE_OPENAI_API_KEY, # Can provide an API key directly. If missing read env variable AZURE_OPENAI_API_KEY
      openai_api_version=AZURE_OPENAI_API_VERSION, # If not provided, will read env variable AZURE_OPENAI_API_VERSION
  )
  return embedding_model

In [None]:
KDBAI_ENDPOINT = ENV.get('KDBAI_ENDPOINT')
KDBAI_TOKEN = ENV.get('KDBAI_TOKEN')

# Adding table description to table embedding
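
The retrieved context shown later in this notebook consists of a prose description followed by the table's HTML. A minimal sketch of how such a chunk might be assembled before embedding, assuming the description is produced elsewhere (for example by an LLM during preprocessing); `build_table_chunk` and `fake_describe` are hypothetical names:

```python
def build_table_chunk(table_html, describe_table):
    """Prefix a table's HTML with a description so the embedding reflects its meaning."""
    description = describe_table(table_html).strip()
    return description + "\n\n" + table_html

# Stand-in describer; in practice this could be an LLM call (assumption).
def fake_describe(table_html):
    return "This table lists non-current liabilities for two periods."

chunk = build_table_chunk("<table><tr><td>Term debt</td></tr></table>", fake_describe)
print(chunk)
```

The idea is that raw table markup embeds poorly on its own, while a natural-language summary gives the similarity search something closer to the wording of a user query.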

In [5]:
# DOCS:
# Note: Chroma uses the embedding model both when building the vector store and when embedding the query, so the same model must be used for both
from langchain_chroma import Chroma
# Document Embedding with Chromadb

def create_vector_store(texts, embedding_model, persist_directory):
  vector_store = Chroma.from_documents(texts, embedding_model, persist_directory=persist_directory)
  return vector_store

def load_vector_store(embedding_model, persist_directory):
  vector_store = Chroma(embedding_function=embedding_model, persist_directory=persist_directory)
  return vector_store

#Chroma class https://python.langchain.com/api_reference/chroma/vectorstores/langchain_chroma.vectorstores.Chroma.html
def retrieve_relevant_chunks(chroma_vector_store, query):
  # Wrap the vector store in a retriever that returns the 4 most similar chunks
  retriever = chroma_vector_store.as_retriever(
      search_type="similarity",
      search_kwargs={'k':4}
  )
  
  chunks = retriever.invoke(query)
  return chunks

def print_chunks(chunks):
  # Print the text of each retrieved chunk, separated by a banner
  for d in chunks:
    print('--------------------------------NEW DOCS ----------------------------------------')
    print(d.page_content)

# Helper to join the text of all docs returned by the retriever

# def format_docs(docs):
#   return "\n\n".join(doc.page_content for doc in docs)
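
Conceptually, the `similarity` search with `k=4` ranks stored chunk embeddings by closeness to the query embedding and returns the top four. A small pure-Python sketch of that idea using cosine similarity; Chroma's actual index and distance metric may differ, and `top_k` is a hypothetical helper:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, chunk_vectors, k=4):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vector, v), i) for i, v in enumerate(chunk_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy 2-dimensional "embeddings" for illustration only.
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(top_k([1.0, 0.1], chunk_vectors, k=2))
```

A larger `k` gives the LLM more context at the cost of a longer prompt and a higher chance of including unrelated chunks.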

In [13]:
# creating a new vector database
def create_vector_store_test_pipeline(document_path, persist_directory):
  texts = load_texts(document_path)
  embedding_model = text_embedding_3_small_azure()
  vector_store = Chroma.from_documents(texts, embedding_model, persist_directory=persist_directory)
  return vector_store

In [57]:
#basic_vector_store = create_vector_store_test_pipeline("./data/final_data", "./vector_stores/basic_vector_store")

In [58]:
basic_vector_store = load_vector_store(text_embedding_3_small_azure(), "./vector_stores/basic_vector_store")
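
Creating a store embeds every chunk again, while loading reuses what is already persisted, which is why the creation cell above is commented out once the store exists. A minimal sketch of a guard that picks between the two, assuming a non-empty persist directory means the store was already built; `get_or_create_store` is a hypothetical helper, and the two callables would wrap `create_vector_store_test_pipeline` and `load_vector_store` in this notebook:

```python
import os
import tempfile

def get_or_create_store(persist_directory, create_fn, load_fn):
    """Load the store if its directory exists and is non-empty, else create it."""
    if os.path.isdir(persist_directory) and os.listdir(persist_directory):
        return load_fn(persist_directory)
    return create_fn(persist_directory)

# Demonstration with stand-in callables instead of real vector stores.
with tempfile.TemporaryDirectory() as tmp:
    first = get_or_create_store(tmp, lambda d: "created", lambda d: "loaded")
    open(os.path.join(tmp, "marker"), "w").close()
    second = get_or_create_store(tmp, lambda d: "created", lambda d: "loaded")
print(first, second)
```

This avoids paying for embeddings twice and reduces the risk of writing the same documents into an existing store a second time.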

In [78]:
enhanced_table_vector_store = create_vector_store_test_pipeline("./data/final_v2_data", "./vector_stores/enhanced_table_vector_store")

Created a chunk of size 1622, which is longer than the specified 1000
Created a chunk of size 2377, which is longer than the specified 1000
Created a chunk of size 1825, which is longer than the specified 1000
Created a chunk of size 1480, which is longer than the specified 1000
Created a chunk of size 2411, which is longer than the specified 1000
Created a chunk of size 3085, which is longer than the specified 1000
Created a chunk of size 1379, which is longer than the specified 1000
Created a chunk of size 1207, which is longer than the specified 1000
Created a chunk of size 2249, which is longer than the specified 1000
Created a chunk of size 2553, which is longer than the specified 1000
Created a chunk of size 1204, which is longer than the specified 1000
Created a chunk of size 1132, which is longer than the specified 1000
Created a chunk of size 2303, which is longer than the specified 1000
Created a chunk of size 1316, which is longer than the specified 1000
Created a chunk of s

In [79]:
enhanced_table_vector_store = load_vector_store(text_embedding_3_small_azure(), "./vector_stores/enhanced_table_vector_store")

In [76]:
# retrieve_relevant_chunks(vector_store, "Does Pebblebee's news tracker work with Apple find my network")

# Compare answers with and without context

In [80]:
from langchain_google_genai import GoogleGenerativeAI
GEMINI_API_KEY = ENV.get('GEMINI_API_KEY')
# initializing the LLM

def compare_added_context_through_rag(query, vector_store, log = False):
  llm = GoogleGenerativeAI(model="models/gemini-1.5-flash", api_key=GEMINI_API_KEY)
  response_without_context = llm.invoke(query)

  chunks = retrieve_relevant_chunks(vector_store, query)
  context = "\n\n------\n\n".join([chunk.page_content for chunk in chunks])
  rag_query = context + "\n\n" + query
  response_with_context = llm.invoke(rag_query)
  
  if log:
    print("response without context:\n")
    print(response_without_context + '\n\n')
    print("context:\n")
    print(context + "\n")
    print("response with context:\n")
    print(response_with_context)
  return response_without_context, response_with_context, context
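
The function above passes the retrieved chunks and the question to the model as plain concatenated text. One possible variant is to wrap them in an explicit instruction. This is a sketch only: the template wording is an assumption that has not been tested in this notebook, and `build_rag_prompt` is a hypothetical helper:

```python
# Instruction-style template; the wording is illustrative, not a tested prompt.
RAG_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context is insufficient, say so.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)

def build_rag_prompt(chunks, question, separator="\n\n------\n\n"):
    """Join chunk texts and place them in an instruction-style prompt."""
    context = separator.join(chunks)
    return RAG_TEMPLATE.format(context=context, question=question)

prompt = build_rag_prompt(["chunk one", "chunk two"], "What changed?")
print(prompt)
```

In the notebook, `chunks` would be the `page_content` strings of the retrieved documents, and the resulting prompt would be passed to `llm.invoke`.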

In [33]:
query = "Does Pebblebee's news tracker work with Apple find my network?"
response_without_context, response_with_context, context = compare_added_context_through_rag(query, basic_vector_store, log=True)

response without context:

Pebblebee's trackers do **not** work with Apple's Find My network. 

Here's why:

* **Different networks:** Pebblebee uses its own proprietary network and app, while Apple's Find My network is its own separate system.
* **Bluetooth-based:** Pebblebee trackers primarily rely on Bluetooth connectivity, while Apple's Find My network utilizes a combination of Bluetooth, Ultra Wideband (UWB), and crowdsourced location data. 

Therefore, you cannot use Pebblebee trackers with Apple's Find My app or network. 



context:

Pebblebee’s new item trackers works with both Apple and Google 'Find My' networks
Apple’s Find My network and Google’s Find My Device are both smart tracking solutions to help us prevent losing our items, but devices made for one aren’t typically compatible with the other. However, Pebblebee is changing this by introducing its Pebblebee Un…

How to mirror your iPhone on macOS Sequoia
With macOS Sequoia and iOS 18, Apple has a handy new way to hop b

In [38]:
query = "Based on recent earning call is the company apple on track to reach its 2024 goals?"
response_without_context, response_with_context, context = compare_added_context_through_rag(query, basic_vector_store, log=True)

response without context:

I do not have access to real-time information, including recent earnings calls or company goals. Therefore, I cannot provide an answer to whether Apple is on track to reach its 2024 goals. 

To get this information, I recommend the following:

* **Check Apple's Investor Relations website:**  You can find transcripts of recent earnings calls, press releases, and other investor-related information on Apple's official website.
* **Read financial news articles:** Major business news outlets like Bloomberg, Reuters, and The Wall Street Journal will publish articles analyzing Apple's earnings calls and their implications for the company's future.
* **Consult with a financial advisor:** If you have specific questions about Apple's financial performance or goals, a financial advisor can provide you with personalized insights.

Remember, financial information is constantly changing, and it is important to stay up-to-date on the latest developments. 



context:

Apple

In [39]:
query = "Are sales of the IPhone 15 up compared to IPhone 14"
response_without_context, response_with_context, context = compare_added_context_through_rag(query, basic_vector_store, log=True)

response without context:

I do not have access to real-time information, including sales figures for the iPhone 15. 

To get the most up-to-date information on iPhone 15 sales compared to iPhone 14 sales, I recommend checking:

* **Apple's official website:** They often release press releases or statements regarding sales performance.
* **Reputable tech news websites:** Sites like CNET, The Verge, TechCrunch, and others will report on sales figures and analysis.
* **Market research firms:** Companies like IDC, Gartner, and Canalys track smartphone sales data and publish reports.

Please note that sales figures can vary significantly depending on the source and the time period analyzed. 



context:

If you look at iPhone in particular for Greater China, the installed base set a record. We also in Mainland China set a June quarter record for upgraders and so that's a very strong signal and in fact from Kantar -- the survey from Kantar this quarter showed that iPhones were the top three

In [55]:
query = "How was Apple's Q3 performance compared to Q2"
response_without_context, response_with_context, context = compare_added_context_through_rag(query, basic_vector_store, log=True)

response without context:

I do not have access to real-time information, including financial data like Apple's quarterly performance. To get the most up-to-date comparison of Apple's Q3 performance to Q2, I recommend checking the following sources:

* **Apple's Investor Relations Website:** You can find official earnings releases, transcripts of conference calls, and other financial information on Apple's website.
* **Financial News Websites:** Websites like Bloomberg, Reuters, and The Wall Street Journal provide coverage of Apple's earnings reports and analysis of their performance.
* **Financial Data Providers:** Services like Yahoo Finance, Google Finance, and Morningstar provide financial data and analysis for Apple and other companies.

Please note that the information on these websites will be the most accurate and up-to-date. 



context:

iPhone net sales decreased during the second quarter of 2024 compared to the second quarter of 2023 due to lower net sales of Pro models. Ye

In [56]:
query = "Could you return a balance sheet for Apple's 3rd Quarter?"
response_without_context, response_with_context, context = compare_added_context_through_rag(query, basic_vector_store, log=True)

response without context:

I do not have access to real-time financial data, including Apple's balance sheet for their 3rd quarter. 

To get the most up-to-date information, I recommend checking these resources:

* **Apple's Investor Relations website:** You can find their latest earnings releases and financial statements here: [https://investor.apple.com/](https://investor.apple.com/)
* **SEC Edgar Database:** This database contains all public company filings, including Apple's 10-Q reports, which include their balance sheets. You can access it here: [https://www.sec.gov/edgar/searchedgar/companysearch.html](https://www.sec.gov/edgar/searchedgar/companysearch.html)
* **Financial News Websites:** Websites like Bloomberg, Reuters, and Yahoo Finance often publish summaries of companies' financial results, including their balance sheets.

Please note that the specific quarter you are referring to will affect the data you find. 



context:

Apple Inc. CONDENSED CONSOLIDATED BALANCE SHEETS

In [81]:
query = "Could you return a table depicting the balance sheet for Apple's 3rd Quarter?"
response_without_context, response_with_context, context = compare_added_context_through_rag(query, enhanced_table_vector_store, log=True)

response without context:

I do not have access to real-time data, including financial statements like balance sheets for specific companies. To get Apple's 3rd Quarter balance sheet, I recommend checking the following sources:

* **Apple's Investor Relations Website:** Visit Apple's official investor relations website. You can usually find their latest earnings releases, including the balance sheet, in the "Financial Information" or "Earnings" section.
* **SEC Filings:** Apple, as a publicly traded company, is required to file its financial reports with the Securities and Exchange Commission (SEC). You can access these filings, including the 10-Q for the 3rd quarter, on the SEC's EDGAR database (https://www.sec.gov/edgar/searchedgar/companysearch.html).
* **Financial News Websites:** Websites like Yahoo Finance, Google Finance, and Bloomberg usually provide access to company financials, including balance sheets. 

By accessing these resources, you can find Apple's most recent 3rd Quar

In [82]:
query = "Could you return a table depicting the total liabilities and shareholders' equity?"
response_without_context, response_with_context, context = compare_added_context_through_rag(query, enhanced_table_vector_store, log=True)

response without context:

Please provide me with the financial statements or data for the company you are interested in. I need the following information to create a table depicting total liabilities and shareholders' equity:

* **Balance Sheet:**  This will contain the information needed for the table. Specifically, it will show the liabilities and equity sections.
* **Specific Company:**  Please let me know the name of the company you want the table for. 

Once you provide this information, I will be able to generate the table you requested. 



context:

The table starts with a summary of current liabilities, followed by a detailed breakdown of non-current liabilities. The last row shows the total liabilities for each period.  The table helps to understand the company's financial health by showcasing its total debt obligations and how they are distributed between short-term and long-term liabilities. 

<table><tbody><tr><td colspan="3">Non-current liabilities:</td></tr><tr><td>Term

## Apple Inc. - Total Liabilities and Shareholders' Equity

| Period | Total Liabilities | Shareholders' Equity | Total Liabilities & Shareholders' Equity |
|---|---|---|---|
| Current Year | $264,904 | $74,194 | $337,411 |
| Prior Year | $290,437 | $62,146 | $352,583 | 


In [86]:
query = "Could you provide a percentage of the total liabilities vs shareholders' equity?"
response_without_context, response_with_context, context = compare_added_context_through_rag(query, enhanced_table_vector_store, log=True)

response without context:

Please provide me with the financial statements (balance sheet) of the company you are interested in. I need the total liabilities and shareholders' equity figures to calculate the percentage. 

Once you provide the information, I can calculate the percentage for you using the following formula:

**Percentage of Total Liabilities to Shareholders' Equity = (Total Liabilities / Shareholders' Equity) * 100** 



context:

Shareholders’ equity:
This table presents the company's liabilities and shareholders' equity for two periods, likely representing the current year and the prior year. The table is divided into three main sections: Liabilities, Shareholders' Equity, and Total Liabilities and Shareholders' Equity.

**Liabilities** are categorized into current liabilities and non-current liabilities. Current liabilities include accounts payable, other current liabilities, deferred revenue, commercial paper, and term debt. Non-current liabilities include term debt 