<div align="center">
    <div><img src="../assets/redis_logo.svg" style="width: 130px"> </div>
    <div style="display: inline-block; text-align: center; margin-bottom: 10px;">
        <span style="font-size: 36px;"><b>Multi-document Single-index RAG with LangChain and Redis Hybrid Search</b></span>
        <br />
    </div>
    <br />
</div>

## Environment Setup

In [1]:
import os
import warnings
import dotenv
# mute warnings
warnings.filterwarnings('ignore')
# load env vars from .env file
dotenv.load_dotenv()

dir_path = os.getcwd()
parent_directory = os.path.dirname(dir_path)
os.environ["ROOT_DIR"] = parent_directory
print(dir_path)
print(parent_directory)

/Users/rouzbeh.farahmand/PycharmProjects/boa-financial-rag-workshop/2_RAG_patterns_with_redis
/Users/rouzbeh.farahmand/PycharmProjects/boa-financial-rag-workshop


### Install Python Dependencies

In [2]:
%pip install -r $ROOT_DIR/requirements.txt

Note: you may need to restart the kernel to use updated packages.


### Configure your Redis Stack


In [3]:
REDIS_URL = os.getenv("REDIS_URL")
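
If `REDIS_URL` is not set, `os.getenv` returns `None` and the problem only surfaces later, when the client tries to connect. A minimal sketch of an earlier check follows. The helper name `resolve_redis_url` and the local default `redis://localhost:6379` are assumptions for illustration, not part of the workshop code.

```python
import os
from urllib.parse import urlparse

# Assumed local default; adjust to your own deployment.
DEFAULT_REDIS_URL = "redis://localhost:6379"


def resolve_redis_url(env=os.environ):
    """Return the Redis URL from the environment, or a local default if unset."""
    url = env.get("REDIS_URL") or DEFAULT_REDIS_URL
    scheme = urlparse(url).scheme
    # redis://, rediss:// (TLS) and unix:// are the schemes redis-py accepts.
    if scheme not in ("redis", "rediss", "unix"):
        raise ValueError(f"Unsupported Redis URL scheme: {url!r}")
    return url


print(resolve_redis_url({}))
```

Passing a plain dict instead of `os.environ` keeps the check easy to try without touching the real environment.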

### SentenceTransformerEmbeddings Models Cache folder
We use `SentenceTransformerEmbeddings` in this demo and specify its cache folder here. If you have already downloaded the models to a local file system, point this folder at them; otherwise the library downloads the models into this folder.

In particular, this model will be downloaded if it is not present in the cache folder:

models/models--sentence-transformers--all-MiniLM-L6-v2


In [4]:
# set the folder that holds the locally downloaded sentence-transformer models
os.environ["TRANSFORMERS_CACHE"] = f"{parent_directory}/models"
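
To see whether a download will be triggered, you can check for the model folder before creating the embeddings. The sketch below assumes the `models--<org>--<name>` folder naming shown above; the helper names are hypothetical, and the presence of the folder does not prove the download completed.

```python
import os
import tempfile


def cached_model_dir(cache_folder, model_name):
    """Map a model id like 'org/name' to the 'models--org--name' folder."""
    return os.path.join(cache_folder, "models--" + model_name.replace("/", "--"))


def is_model_cached(cache_folder, model_name):
    """True if the expected model folder already exists in the cache folder."""
    return os.path.isdir(cached_model_dir(cache_folder, model_name))


# Demo against a throwaway directory rather than the real cache.
name = "sentence-transformers/all-MiniLM-L6-v2"
with tempfile.TemporaryDirectory() as tmp:
    before = is_model_cached(tmp, name)
    os.makedirs(cached_model_dir(tmp, name))
    after = is_model_cached(tmp, name)

print(before, after)
```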

## RAG with LangChain

### Create Custom index based on your data using RedisVL

In [5]:
from redisvl.index import SearchIndex
from redisvl.schema import IndexSchema
from redis import Redis
index_name = 'langchain'
prefix = 'chunk'
schema = IndexSchema.from_yaml('sec_index.yaml')
client = Redis.from_url(REDIS_URL)
# create an index from schema and the client
index = SearchIndex(schema, client)
index.create(overwrite=True, drop=True)

16:17:45 redisvl.index.index INFO   Index already exists, overwriting.
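
The contents of `sec_index.yaml` are not shown in this notebook. As a rough illustration only, a schema of this kind might look like the dict below. The index name, prefix, and HASH storage match the cell above, and the `ticker`, `doc_type`, and `content` field names match the filters used later; the vector field name `content_vector`, the `hnsw` algorithm, and the `cosine` metric are assumptions. The 384 dimensions correspond to the output size of `all-MiniLM-L6-v2`.

```python
# Hypothetical stand-in for sec_index.yaml, written as a Python dict.
schema_dict = {
    "index": {"name": "langchain", "prefix": "chunk", "storage_type": "hash"},
    "fields": [
        {"name": "ticker", "type": "tag"},
        {"name": "doc_type", "type": "tag"},
        {"name": "content", "type": "text"},
        {
            "name": "content_vector",  # assumed field name
            "type": "vector",
            "attrs": {
                "dims": 384,  # embedding size of all-MiniLM-L6-v2
                "algorithm": "hnsw",  # assumed
                "distance_metric": "cosine",  # assumed
                "datatype": "float32",
            },
        },
    ],
}


def field_names(schema):
    """List the field names declared in a schema dict."""
    return [field["name"] for field in schema["fields"]]


print(field_names(schema_dict))
```

RedisVL can also build a schema from a dict of this shape with `IndexSchema.from_dict`, which is convenient when the schema is generated in code rather than kept in a YAML file.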


In [6]:
# get info about the index
!rvl index info -i langchain

16:17:46 [RedisVL] INFO   Using Redis address from environment variable, REDIS_URL


Index Information:
╭──────────────┬────────────────┬────────────┬─────────────────┬────────────╮
│ Index Name   │ Storage Type   │ Prefixes   │ Index Options   │   Indexing │
├──────────────┼────────────────┼────────────┼─────────────────┼────────────┤
│ langchain    │ HASH           │ ['chunk']  │ []              │          0 │
╰──────────────┴────────────────┴────────────┴─────────────────┴────────────╯
Index Fields:
╭────────────────┬────────────────┬─────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬─────────────────┬────────────────╮
│ Name           │ Attribute      │ Type    │ Field Option   │ Option Value   │ Field Option   │ Option Value   │ Field Option   │   Option Value │ Field Option    │ Option Value   │
├────────────────┼────────────────┼─────────┼────────────────┼────────────────┼──────

### Ingestion and Indexing



In [7]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings 
embeddings = SentenceTransformerEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2", cache_folder=os.getenv("TRANSFORMERS_CACHE", f"{parent_directory}/models"))

In [8]:
from helpers.ingestion import get_sec_data
from helpers.ingestion import redis_bulk_upload 
sec_data = get_sec_data()

 ✅ Loaded doc info for  111 tickers...


In [9]:
redis_bulk_upload(sec_data, index, embeddings, tickers=['AAPL', 'AMZN'])

✅ Loaded 108 10K chunks for ticker=AAPL from AAPL-2021-10K.pdf
✅ Loaded 94 10K chunks for ticker=AAPL from AAPL-2023-10K.pdf
✅ Loaded 103 10K chunks for ticker=AAPL from AAPL-2022-10K.pdf
✅ Loaded 27 earning_call chunks for ticker=AAPL from 2018-May-01-AAPL.txt
✅ Loaded 31 earning_call chunks for ticker=AAPL from 2019-Oct-30-AAPL.txt
✅ Loaded 30 earning_call chunks for ticker=AAPL from 2016-Jan-26-AAPL.txt
✅ Loaded 31 earning_call chunks for ticker=AAPL from 2020-Jul-30-AAPL.txt
✅ Loaded 30 earning_call chunks for ticker=AAPL from 2017-Aug-01-AAPL.txt
✅ Loaded 29 earning_call chunks for ticker=AAPL from 2020-Jan-28-AAPL.txt
✅ Loaded 34 earning_call chunks for ticker=AAPL from 2016-Apr-26-AAPL.txt
✅ Loaded 29 earning_call chunks for ticker=AAPL from 2017-Jan-31-AAPL.txt
✅ Loaded 28 earning_call chunks for ticker=AAPL from 2019-Apr-30-AAPL.txt
✅ Loaded 26 earning_call chunks for ticker=AAPL from 2017-Nov-02-AAPL.txt
✅ Loaded 31 earning_call chunks for ticker=AAPL from 2016-Oct-25-AAPL.tx

## Vector Search with LangChain
**Important Note-2**: The LangChain Redis vector store does not support the JSON storage type yet; it only supports HASH for now. JSON support should be coming soon.

In [10]:
from langchain_community.vectorstores import Redis as LangChainRedis
from helpers.utils import create_langchain_schemas_from_redis_schema

index_name = 'langchain'

# derive the LangChain schemas from the same YAML file used to create the index
vec_schema, main_schema = create_langchain_schemas_from_redis_schema('sec_index.yaml')

rds = LangChainRedis.from_existing_index(embedding=embeddings,
                                         index_name=index_name,
                                         schema=main_schema)

### Query the database
Now we can use the LangChain vector store class to run similarity searches against Redis.

In [11]:
from langchain.vectorstores.redis import RedisText
from langchain.vectorstores.redis import RedisTag

In [12]:
f = RedisTag("ticker") == "AAPL"
rds.similarity_search(query="How many employees work at this company???", k=4, distance_threshold=0.8, filter=f)

[Document(page_content='The Company has historically experienced higher net sales in its first quarter compared to other quarters in its fiscal year due in part to seasonal holiday demand. Additionally, new product and service introductions can significantly impact net sales, cost of sales and operating expenses. The timing of product introductions can also impact the Company’s net sales to its indirect distribution channels as these channels are filled with new inventory following a product launch, and channel inventory of an older product often declines as the launch of a newer product approaches. Net sales can also be affected when consumers and distributors anticipate a product introduction.\n\nHuman Capital\n\nThe Company believes it has a talented, motivated, and dedicated team, and is committed to supporting the development of all of its team members and to continuously building on its strong culture. As of September 25, 2021, the Company had approximately 154,000 full-time equi
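
The `distance_threshold` argument drops results that are too far from the query vector, and `k` caps how many are returned. The sketch below shows the idea in plain Python. It assumes the index uses cosine distance, which this notebook does not confirm, and the function names are illustrative rather than LangChain or Redis APIs.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def within_threshold(query_vec, doc_vecs, distance_threshold, k):
    """Keep at most k documents at or under the threshold, closest first."""
    scored = sorted((cosine_distance(query_vec, v), i) for i, v in enumerate(doc_vecs))
    return [(i, d) for d, i in scored if d <= distance_threshold][:k]


query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
hits = within_threshold(query_vec, doc_vecs, distance_threshold=0.8, k=4)
print([i for i, _ in hits])
```

Only the first two toy vectors pass: the orthogonal vector has distance 1 and the opposite vector has distance 2, both above 0.8.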

In [13]:
f = RedisTag("doc_type") == "10K"
rds.similarity_search(query="What did Tim Cook said in 2020 earning calls regarding NANDs?", k=4, distance_threshold=0.8, filter=f)

[Document(page_content='David A. Zapolsky. Mr. Zapolsky has served as Senior Vice President, General Counsel, and Secretary since May 2014, Vice President, General Counsel, and Secretary from September 2012 to May 2014, and as Vice President and Associate General Counsel for Litigation and Regulatory matters from April 2002 until September 2012.\n\nBoard of Directors Name\n\nAge\n\nPosition\n\nJeffrey P. Bezos\n\n58\n\nExecutive Chair\n\nAndrew R. Jassy\n\n54\n\nPresident and Chief Executive Officer\n\nKeith B. Alexander\n\n70\n\nCo-CEO, President, and Chair of IronNet Cybersecurity, Inc.\n\nEdith W. Cooper\n\n60\n\nFormer Executive Vice President, Goldman Sachs Group, Inc.\n\nJamie S. Gorelick\n\n71\n\nPartner, Wilmer Cutler Pickering Hale and Dorr LLP\n\nDaniel P. Huttenlocher\n\n63\n\nDean, MIT Schwarzman College of Computing\n\nJudith A. McGrath\n\n69\n\nFormer Chair and CEO, MTV Networks\n\nIndra K. Nooyi\n\n66\n\nFormer Chief Executive Officer, PepsiCo, Inc.\n\nJonathan J. Rubins

In [14]:
f = RedisTag("doc_type") == "earning_call"
rds.similarity_search(query="What did Tim Cook said in 2020 earning calls regarding NANDs?", k=4, distance_threshold=0.8, filter=f)

[Document(page_content="Thank you. Good afternoon, and thanks to everyone for joining us. Speaking first today is Apple's CEO, Tim Cook; and he'll be followed by CFO, Luca Maestri. After that, we'll open the call to questions from analysts. Please note that some of the information you'll hear during our discussion today will consist of forward-looking statements, including, without limitation, those regarding revenue, gross margin, operating expenses, other income and expense, taxes, capital allocation and future business outlook. Actual results or trends could differ materially from our forecast. For more information, please refer to the risk factors discussed in Apple's most recently filed periodic reports on Form 10-K and Form 10-Q and the Form 8-K filed with the SEC today, along with the associated press release. Apple assumes no obligation to update any forward-looking statements or information which speak as of their respective dates. I'd now like to turn the call over to Tim for

In [15]:
# vector search with combinations of metadata filtering
f = (RedisText("content") % "profit") | (RedisText("content") % "revenue")

rds.similarity_search_with_score(query="Apple company revenue", k=4, filter=f)


[(Document(page_content="Earlier this month, released macOS Catalina with all new entertainment apps, innovative Sidecar feature that uses iPad to expand Mac workspace and new accessibility tools that enable users to control their Mac entirely with their voice. 1. Catalina brings Apple Arcade experience to Mac. 1. Already seeing some third-party developers bring their iPad apps to Mac App Store with Mac Catalyst, including Twitter, Post-it and more. 4. Launching newly redesigned Mac Pro this fall, which Co. is manufacturing in Austin, Texas. 7. Others: 1. In FY19, crossed $100b in revenue in US for first time. 2. Introduce new services from Apple Card to Apple TV+ and generated over $46b in total Services revenue, setting new yearly Services records in all five geographic segments and driving Services business to size of Fortune 70 co. 3. Delivered new hardware in all device categories. 4. Wearables business showed explosive growth and generated more annual revenue than two-thirds of c
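
Filter expressions like the one above end up as Redis query strings, and later cells in this notebook pass such strings to `filter=` directly. A small sketch of how those strings can be composed by hand follows. The helper names `tag`, `text`, `any_of`, and `all_of` are hypothetical and not part of LangChain.

```python
def tag(field, value):
    """Exact-match filter on a TAG field, e.g. @ticker:{AAPL}."""
    return f"@{field}:{{{value}}}"


def text(field, terms):
    """Full-text filter on a TEXT field, e.g. @content:(revenue)."""
    return f"@{field}:({terms})"


def any_of(*exprs):
    """Union: a document matches if it satisfies at least one expression."""
    return "(" + " | ".join(exprs) + ")"


def all_of(*exprs):
    """Intersection: in the Redis query syntax, space-separated clauses are ANDed."""
    return "(" + " ".join(exprs) + ")"


f = all_of(
    tag("ticker", "AAPL"),
    any_of(text("content", "profit"), text("content", "revenue")),
)
print(f)
# (@ticker:{AAPL} (@content:(profit) | @content:(revenue)))
```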

## RAG with Ollama running Llama 3 LLM

### Initialize a Llama LLM served via Ollama
To connect to a local LLM served by Ollama, use the LLM below. If you have a local OpenAI-compatible server running via vLLM, add your LLM here.

In [16]:
from helpers.utils import get_llm
llm = get_llm()

### Setup prompt
`PromptTemplate` defines the exact text of the prompt that is fed to the LLM. This step is optional: the default prompts usually work well for OpenAI models but might fall short for other models.

In [17]:
def get_prompt():
    """Create the prompt template used by the QA chain."""
    from langchain.prompts import PromptTemplate

    # Define our prompt
    prompt_template = """Use the following pieces of context from financial 10k filings data to answer the user question at the end. Only use the result from tools and evidence provided to you. If you don't know the answer, say that you don't know, don't try to make up an answer. Provide the source of the document that you used to get the answer.

    This should be in the following format:

    Question: [question here]
    Answer: [answer here]
    Source: [source document here]

    Begin!

    Context:
    ---------
    {context}
    ---------
    Question: {question}
    Answer:"""

    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    return prompt
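
To see what the LLM actually receives, it helps to fill the two placeholders by hand. The sketch below uses plain `str.format` on a shortened copy of the template as a stand-in for `PromptTemplate`; the helper name `fill_prompt` is hypothetical, and the sample context is the deferred-revenue sentence from the Apple 2022 10-K.

```python
# Shortened copy of the template above, keeping only the placeholder section.
short_template = (
    "Context:\n"
    "---------\n"
    "{context}\n"
    "---------\n"
    "Question: {question}\n"
    "Answer:"
)


def fill_prompt(template, context, question):
    """Substitute the two placeholders, as an f-string-style template would."""
    return template.format(context=context, question=question)


filled = fill_prompt(
    short_template,
    context="the Company had total deferred revenue of $12.4 billion and $11.9 billion, respectively.",
    question="what was the deferred revenue of Apple in 2022?",
)
print(filled)
```

The prompt ends with `Answer:` so that the model continues from that point.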

### Putting it all together

This is where LangChain brings all the components together into a simple RAG application over the financial documents.

In [18]:
from langchain.chains import RetrievalQA

def get_search_kwargs(filters, distance_threshold):
    return {"distance_threshold":distance_threshold,"filter":filters}
    

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=rds.as_retriever(search_type="similarity_distance_threshold",
                               search_kwargs={"distance_threshold":0.8, 'include_metadata': True}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": get_prompt()},
    verbose=True
)
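
With `chain_type="stuff"`, every retrieved chunk is concatenated into a single context string that fills the `{context}` placeholder. A minimal sketch of that step, using stand-in objects with a `page_content` attribute instead of real LangChain documents (the function names are illustrative):

```python
from types import SimpleNamespace


def stuff_documents(docs, separator="\n\n"):
    """Join the text of every retrieved document into one context string."""
    return separator.join(doc.page_content for doc in docs)


def build_stuff_prompt(template, docs, question):
    """Fill the prompt with the stuffed context and the user question."""
    return template.format(context=stuff_documents(docs), question=question)


docs = [
    SimpleNamespace(page_content="chunk one"),
    SimpleNamespace(page_content="chunk two"),
]
demo_template = "Context:\n{context}\nQuestion: {question}\nAnswer:"
print(build_stuff_prompt(demo_template, docs, "what was revenue of Apple in 2022?"))
```

Because nothing is ranked or summarized at this stage, an irrelevant chunk is as visible to the model as a relevant one, so answer quality depends heavily on what the retriever returns.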

### Finally - let's ask questions!



In [19]:
query = "What was Apple's revenue last year compared to this year??"
res = qa(query)
res['result']

> Entering new RetrievalQA chain...

> Finished chain.


"Based on the provided context from Apple's 10K filing, we can answer the question as follows:\n\nLast year (presumably FY2015), Apple's revenue was almost $118 billion.\n\nThis year (presumably FY2016), Apple's revenue was around $6.1 billion in Q1, with a guidance range of $50-53 billion for Q2.\n\nSo, compared to last year, this year's revenue is significantly lower, indicating a decline of nearly 95% from the previous year's revenue."

In [20]:
query = "How many products does Nike offer? What is the industry that Nike is part of?"
res = qa(query)
res['result']

> Entering new RetrievalQA chain...

> Finished chain.


"I apologize, but the context provided appears to be related to Apple Inc., not Nike. Therefore, I cannot answer the question about how many products Nike offers or what industry Nike is part of.\n\nHowever, if you provide me with a 10-K filing from Nike's annual report, I would be happy to help answer your question based on that information."

In [21]:
query = "what was the deferred revenue of Apple in 2022?"
res = qa(query)
res

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'what was the deferred revenue of Apple in 2022?',
 'result': 'The answer is not available from the provided context. The financial statements and notes do not disclose the deferred revenue of Apple for 2022. The information provided focuses on cash flow, debt, and other financial metrics, but does not specifically mention deferred revenue.',
 'source_documents': [Document(page_content='The Company’s proportion of net sales by disaggregated revenue source was generally consistent for each reportable segment in Note 11, “Segment Information and Geographic Data” for 2022, 2021 and 2020, except in Greater China, where iPhone revenue represented a moderately higher proportion of net sales in 2022 and 2021.\n\nAs of September 24, 2022 and September 25, 2021, the Company had total deferred revenue of $12.4 billion and $11.9 billion, respectively. As of September 24, 2022, the Company expects 64% of total deferred revenue to be realized in less than a year, 27% within one-to-two yea

Wrong answer: the model did not use the right chunk, although it appears among the source documents above. From the Apple 2022 10-K: "As of September 24, 2022 and September 25, 2021, the Company had total deferred revenue of $12.4 billion and $11.9 billion, respectively. As of September 24, 2022, the Company expects 64% of total deferred revenue to be realized in less than a year, 27% within one-to-two years, 7% within two-to-three years and 2% in greater than three years."

In [22]:
query = "what was revenue of Apple in 2022?"
res = qa(query)
res['result']

> Entering new RetrievalQA chain...

> Finished chain.


"According to the provided financial data, the revenue for Apple in 2022 is not explicitly stated. However, we can extract some relevant information from the report:\n\n* The table showing net sales by reportable segment for 2023 and 2022 (in millions) is:\n\t+ 2022: $394,328\n\t+ Change: (3)% from 2021\n* The same table shows that the total net sales for each segment in 2022 are:\n\t+ Americas: $169,658\n\t+ Europe: $95,118\n\t+ Greater China: $74,200\n\t+ Japan: $29,375\n\t+ Rest of Asia Pacific: $25,977\n\nUsing this information, we can estimate the total revenue for Apple in 2022. Assuming a similar distribution of revenue across segments as in 2023 (although this is not explicitly stated), we can approximate the revenue for each segment in 2022:\n\n* Americas: $169,658\n* Europe: ~88% of 2023's $94,294 = $82,841\n* Greater China: ~81% of 2023's $72,559 = $58,934\n* Japan: ~84% of 2023's $29,615 = $24,821\n* Rest of Asia Pacific: ~87% of 2023's $25,977 = $22,593\n\nAdding up these 

In [23]:
query = "How many employees work at Nike???"
res = qa(query)
res

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'How many employees work at Nike???',
 'result': 'Question: How many employees work at Amazon?\nAnswer: The document states that "We welcomed more than 175,000 new employees in March and April..." and later mentions that "As of September 25, 2021, the Company had approximately 154,000 full-time equivalent employees." This suggests that there are over 154,000 full-time equivalent employees at Amazon.\n\nSource: [20] Amazon.com, Inc. - Senior VP & CFO Brian T. Olsavsky and [22] Amazon.com, Inc. - Human Capital section of the filing.',
 'source_documents': [Document(page_content="This included changes to over 150 of our processes to provide for social distancing as well as costs to onboard and train over 175,000 new employees who are hired to meet the higher customer demand. This $4 billion also included investments in personal protective equipment for employees and enhanced cleaning for our facilities. Our consolidated revenue and operating income significantly exceeded the top

Wrong answer: the index contains no Nike data, yet the model hallucinated an answer from the unrelated context that retrieval returned.

### Adding query analysis and hybrid search in QA chain

In [24]:
from helpers.custom_ners import get_redis_filters

 ✅ Loaded doc info for  111 tickers...


In [25]:
# Plug in your own query analysis here, e.g. NER, topic detection, intent detection, or semantic routing.
def query_analysis(q):
    filters = get_redis_filters(q)
    print(filters)
    return filters


def ask_question(question,
                 filters=None,
                 filter_strategy='AND',
                 distance_threshold=0.8,
                 search_type="similarity_distance_threshold"):

    q_filters = query_analysis(question)
    print(f"inferred filters: {q_filters}")
    if filters is None:
        filters = q_filters
    elif q_filters:
        # combine the inferred filters with the caller-supplied ones;
        # if nothing was inferred, keep the caller-supplied filters as they are
        filters = " ( " + q_filters + " ) " + filter_strategy + " ( " + filters + " ) "

    print(f"Final filters: {filters} to apply")
    search_args = {"distance_threshold": distance_threshold,
                   'include_metadata': True}
    # only pass a filter when there is one to apply
    if filters is not None:
        search_args['filter'] = filters

    fqa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=rds.as_retriever(search_type=search_type,
                                   search_kwargs=search_args),
        return_source_documents=True,
        chain_type_kwargs={"prompt": get_prompt()},
        verbose=True
    )
    response = fqa(question)
    return response
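
The implementation of `get_redis_filters` is not shown in this notebook. As a rough idea of what a query-analysis step can look like, here is a hypothetical rule-based stand-in that maps company names in the question to a ticker tag filter. The alias table and the function name `infer_ticker_filter` are assumptions; a real version would be built from the ingested tickers or an NER model.

```python
import re

# Hypothetical alias table covering the two tickers ingested above.
TICKER_ALIASES = {
    "aapl": "AAPL",
    "apple": "AAPL",
    "amzn": "AMZN",
    "amazon": "AMZN",
}


def infer_ticker_filter(question):
    """Return a tag filter such as '@ticker:{AAPL}', or None if no known company is named."""
    tokens = re.findall(r"[a-z]+", question.lower())
    tickers = sorted({TICKER_ALIASES[t] for t in tokens if t in TICKER_ALIASES})
    if not tickers:
        return None
    # Several values inside one tag clause are ORed by Redis.
    return "@ticker:{" + " | ".join(tickers) + "}"


print(infer_ticker_filter("what is the revenue of aapl?"))
print(infer_ticker_filter("How many employees work at Nike???"))
```

Returning `None` for an unknown company gives the caller a chance to decline the question instead of searching the whole index.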

In [26]:
ask_question("what is the revenue of aapl?")

@ticker:{AAPL}
inferred filters: @ticker:{AAPL}
Final filters: @ticker:{AAPL} to apply

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'what is the revenue of aapl?',
 'result': 'Here is the answer:\n\nQuestion: What is the revenue of AAPL?\nAnswer: $58.3b (as reported in the 2Q18 Financials)\nSource: Apple Inc.\'s Form 10-K filing for 2019, specifically the section titled "II. 2Q18 Financials"',
 'source_documents': [Document(page_content="Generated almost $34b in earnings in six month; bullish on Co.'s future. 12. Has best pipeline of products and services Co. ever had. 1. Has huge installed base of active devices that is growing across all products. 1. Has highest customer loyalty and satisfaction in industry. 13. Services business is growing dramatically. 14. Balance sheet and cash flow generation are strong. 1. Allows Co. to invest significantly in product roadmap and still return meaningful amount of capital to shareholders. 15. Recent corporate tax reform enables Co. to deploy global cash more efficiently. 1. In US, expects direct investment in economy to exceed $350b over next five years, including $

In [27]:
ask_question("what is the revenue of aapl in 2022?", filters = "@doc_type:{10K}")

@ticker:{AAPL}
inferred filters: @ticker:{AAPL}
Final filters:  ( @ticker:{AAPL} ) AND ( @doc_type:{10K} )  to apply

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'what is the revenue of aapl in 2022?',
 'result': 'Based on the provided data, we can find the net sales for each year:\n\n* 2021: $133,803 + $68,366 + $163,648 = $365,817 million\n* 2020: $109,197 + $40,308 + $125,010 = $274,515 million\n* 2019: $102,266 + $43,678 + $114,230 = $260,174 million\n\nTo find the net sales for 2022, we can use the data from the segment operating performance table in the 2023 Form 10-K filing:\n\n* Total net sales: $394,328 million\n* Breakdown by reportable segment:\n\t+ Americas: $169,658 million\n\t+ Europe: $95,118 million\n\t+ Greater China: $74,200 million\n\t+ Japan: $25,977 million\n\t+ Rest of Asia Pacific: $29,375 million\n\nNote that the provided data does not explicitly mention the net sales for 2022. However, we can infer that the total net sales for 2022 is likely to be around $394 million based on the pattern of growth in previous years.',
 'source_documents': [Document(page_content='Research and Development\n\nThe year-over-year g

In [28]:
ask_question("what is the total deferred revenue of aapl in 2022?", filters = "@doc_type:{10K} AND @content:(deferred revenue)")

@ticker:{AAPL}
inferred filters: @ticker:{AAPL}
Final filters:  ( @ticker:{AAPL} ) AND ( @doc_type:{10K} AND @content:(deferred revenue) )  to apply

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'what is the total deferred revenue of aapl in 2022?',
 'result': 'Based on the provided financial statements, the deferred revenue for Apple Inc. (AAPL) in 2022 is not explicitly stated.\n\nHowever, we can find the relevant information by analyzing the context:\n\n1. In the "Note 3 – Revenue" section, it\'s mentioned that "Deferred revenue" was $5,742 million as of September 24, 2022.\n2. In the "Deferred Tax Assets and Liabilities" section, there is an entry under "Deferred revenue" with a value of $5,742 million as of September 25, 2021 (previous year).\n3. Since there is no update on deferred revenue in the 2022 filing, we can assume that the value remains the same.\n\nTherefore, the total deferred revenue for AAPL in 2022 is approximately **$5,742 million**.',
 'source_documents': [Document(page_content='The Company’s proportion of net sales by disaggregated revenue source was generally consistent for each reportable segment in Note 11, “Segment Information and Geographi

Correct retrieval by Redis Search, but wrong extraction and generation by the LLM!
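
One way to tell a retrieval failure from a generation failure is to look for the expected figure in both the retrieved chunks and the final answer. The sketch below assumes a response shaped like the dicts printed above (`result` plus `source_documents` with a `page_content` attribute); the `diagnose` helper is hypothetical and the sample texts are abbreviated.

```python
from types import SimpleNamespace


def diagnose(response, expected):
    """Classify a QA response by where the expected text shows up."""
    retrieved = any(expected in d.page_content for d in response["source_documents"])
    answered = expected in response["result"]
    if answered:
        return "correct"
    return "generation error" if retrieved else "retrieval error"


response = {
    "result": "the total deferred revenue for AAPL in 2022 is approximately $5,742 million",
    "source_documents": [
        SimpleNamespace(
            page_content="the Company had total deferred revenue of $12.4 billion and $11.9 billion, respectively."
        )
    ],
}
print(diagnose(response, "$12.4 billion"))  # generation error
```

A substring match is a crude check, but it is enough to separate the two failure modes when the expected value is known in advance.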

In [29]:
rds.similarity_search_with_score(query="what is the total deferred revenue of Apple in 2022?", k=5, filter='(@content:(deferred) | @content:(revenue))')

[(Document(page_content='The Company’s proportion of net sales by disaggregated revenue source was generally consistent for each reportable segment in Note 11, “Segment Information and Geographic Data” for 2022, 2021 and 2020, except in Greater China, where iPhone revenue represented a moderately higher proportion of net sales in 2022 and 2021.\n\nAs of September 24, 2022 and September 25, 2021, the Company had total deferred revenue of $12.4 billion and $11.9 billion, respectively. As of September 24, 2022, the Company expects 64% of total deferred revenue to be realized in less than a year, 27% within one-to-two years, 7% within two-to-three years and 2% in greater than three years.\n\nApple Inc. | 2022 Form 10-K | 37\n\n2020\n\n137,781 28,622 23,724 30,620 53,768 274,515\n\nNote 3 – Financial Instruments\n\nCash, Cash Equivalents and Marketable Securities\n\nThe following tables show the Company’s cash, cash equivalents and marketable securities by significant investment category as

Correct Retrieval by Redis Search!

## Cleanup

Clean up the index and data.

In [30]:
#rds.drop_index(index_name=index_name, redis_url=REDIS_URL, delete_documents=True)