<a href="https://colab.research.google.com/github/Pranov1984/QA-System_Advanced-RAG_10K-Filing-Analysis/blob/main/QA_System_Advanced_RAG_10K_Filing_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Load OpenAI API Credentials

Here we load the credentials from a file so that we don't expose them on the internet by mistake.

# File QA RAG Chatbot App with ChatGPT, LlamaIndex

Here we will implement an advanced RAG system with ChatGPT and LlamaIndex to build a QA chatbot that analyzes Coca-Cola 10-K filings. The chatbot will have the following features:

- PDF Document Upload and Indexing
- RAG System for query analysis and response

# Environment Setup

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
content_path = '/content/drive/MyDrive/Colab Notebooks/AV_GenAI/RAG Systems using LAMAIndex/Assignment/coca cola 10Ks'

## Install Dependencies

In [4]:
!pip install --upgrade pip
!pip install llama-index
!pip install llama-index-readers-file
!pip install pypdf
!pip install sentence-transformers
!pip install llama-index-llms-huggingface
!pip install llama-index-embeddings-huggingface


In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [6]:
import yaml

with open('/content/drive/MyDrive/Colab Notebooks/AV_GenAI/RAG Systems using LAMAIndex/Assignment/api_credentials.yml', 'r') as file:
    api_creds = yaml.safe_load(file)

In [7]:
api_creds.keys()

dict_keys(['openai_key', 'ngrok_key', 'llamaindex_key', 'HF_Token', 'jina_key'])

In [23]:
import os

os.environ['OPENAI_API_KEY'] = api_creds['openai_key']
os.environ["LLAMA_API_KEY"] = api_creds['llamaindex_key']
os.environ["HF_TOKEN"] = api_creds['HF_Token']
os.environ["JINA_API_KEY"] = api_creds['jina_key']
os.environ["HUGGINGFACE_HUB_TOKEN"] = api_creds['HF_Token']

In [9]:
from llama_index.core import SimpleDirectoryReader

# Load all PDFs in the folder
documents = SimpleDirectoryReader(input_dir=content_path).load_data()

print(f"Loaded {len(documents)} document(s).")
if documents:
    print("Sample content:\n", documents[0].text[:1000])

Loaded 1931 document(s).
Sample content:
  
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
ý
 
ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year ended 
December 31, 2015
OR
o
 
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from 
to
Commission File No. 001-02217
(Exact name of Registrant as specified in its charter)
DELAWARE
(State or other jurisdiction of incorporation or organization)
 
58-0628465
(IRS Employer Identification No.)
One Coca-Cola Plaza
Atlanta, Georgia
(Address of principal executive offices)
 
30313
(Zip Code)
Registrant's telephone number, including area code: (404) 676-2121
Securities registered pursuant to Section 12(b) of the Act:
Title of each class
 
Name of each exchange on which registered
Common Stock, $0.25 Par Value
 
New York Stock Exchange
Floating Rate Notes Due 2017
 
New York Stock Exchange
Floati

## 2.1 Sentence Window Retriever with Caches

In [14]:
import hashlib
from pathlib import Path
from llama_index.core.embeddings import BaseEmbedding
from llama_index.core.llms import LLM
from llama_index.core.storage.kvstore.simple_kvstore import SimpleKVStore
from llama_index.core.storage.kvstore.types import BaseKVStore
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Function to get a cache instance
def get_cache(llm: LLM, embed_model: BaseEmbedding) -> tuple[BaseKVStore, BaseKVStore]:
    """Get cache instances for LLM and embeddings."""
    llm_cache_path = Path(f"./llm_cache/{llm.metadata.model_name.replace('/', '_')}")
    llm_cache_path.mkdir(parents=True, exist_ok=True)
    llm_cache_file = llm_cache_path / "cache.json"
    if llm_cache_file.exists():
        llm_cache_store = SimpleKVStore.from_persist_path(str(llm_cache_file))
    else:
        llm_cache_store = SimpleKVStore()
        llm_cache_store.persist(str(llm_cache_file))

    embed_cache_path = Path(f"./embed_cache/{embed_model.model_name.replace('/', '_')}")
    embed_cache_path.mkdir(parents=True, exist_ok=True)
    embed_cache_file = embed_cache_path / "cache.json"
    if embed_cache_file.exists():
        embed_cache_store = SimpleKVStore.from_persist_path(str(embed_cache_file))
    else:
        embed_cache_store = SimpleKVStore()
        embed_cache_store.persist(str(embed_cache_file))


    return llm_cache_store, embed_cache_store


# Setup LLM and embedding model
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

# Get cache instances
llm_cache, embed_cache = get_cache(llm, embed_model)

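# Attach the caches to Settings.
# NOTE: `llm_cache` and `embed_cache` do not appear to be documented fields of
# `Settings`; these assignments store the objects as plain attributes, so it is
# worth verifying that later LLM and embedding calls actually read from them.
Settings.llm_cache = llm_cache
Settings.embed_cache = embed_cache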


# Create a sentence window node parser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

Settings.llm = llm
Settings.embed_model = embed_model
Settings.node_parser = node_parser

# Create the vector store index
sentence_index = VectorStoreIndex.from_documents(documents)

Loading llama_index.core.storage.kvstore.simple_kvstore from llm_cache/gpt-3.5-turbo/cache.json.
Loading llama_index.core.storage.kvstore.simple_kvstore from embed_cache/text-embedding-ada-002/cache.json.
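The cell above only creates and persists the two key-value stores. As a rough illustration of how a store like this can save billable calls, the sketch below keys each response by a hash of the model name and prompt, and calls the model only on a cache miss. `make_key`, `cached_call`, and `fake_llm` are hypothetical names for this example, and a plain `dict` stands in for `SimpleKVStore`; this is not LlamaIndex's own caching mechanism.

```python
import hashlib
import json
from typing import Callable, Dict


def make_key(model_name: str, prompt: str) -> str:
    """Derive a stable cache key from the model name and the prompt text."""
    payload = json.dumps({"model": model_name, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(cache: Dict[str, str], model_name: str, prompt: str,
                call_fn: Callable[[str], str]) -> str:
    """Return the cached answer if present; otherwise call the model once and store it."""
    key = make_key(model_name, prompt)
    if key not in cache:
        cache[key] = call_fn(prompt)
    return cache[key]


# Stand-in for a billable API call (hypothetical); counts how often it is invoked.
calls = {"n": 0}


def fake_llm(prompt: str) -> str:
    calls["n"] += 1
    return f"answer to: {prompt}"


cache: Dict[str, str] = {}
first = cached_call(cache, "gpt-3.5-turbo", "What was the revenue in 2023?", fake_llm)
second = cached_call(cache, "gpt-3.5-turbo", "What was the revenue in 2023?", fake_llm)
print(calls["n"])  # the second lookup is served from the cache
```

Including the model name in the key keeps answers from different models apart, which mirrors the per-model cache directories used in the cell above.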


In [15]:
query_engine = sentence_index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What was the revenue of Coca-Cola in 2023?")
print(response)

The revenue of Coca-Cola in 2023 was $29.2 billion.
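To make the sentence window idea concrete, here is a minimal pure-Python sketch of what a sentence window parser produces: one node per sentence, with the neighbouring sentences stored under a `window` key. `build_sentence_windows` and the regex-based sentence splitter are simplifying assumptions for illustration, not the `SentenceWindowNodeParser` implementation.

```python
import re
from typing import Dict, List


def build_sentence_windows(text: str, window_size: int = 3) -> List[Dict[str, str]]:
    """Split text into sentences and attach a window of neighbours to each one."""
    # Naive splitter: break on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)                  # up to window_size sentences before
        end = min(len(sentences), i + window_size + 1)   # and up to window_size after
        nodes.append({
            "original_text": sentence,                   # what gets embedded and matched
            "window": " ".join(sentences[start:end]),    # wider context for the LLM
        })
    return nodes


sample = "Revenue grew. Costs rose. Margins held. Cash flow improved. Debt fell."
nodes = build_sentence_windows(sample, window_size=1)
for node in nodes:
    print(node["original_text"], "->", node["window"])
```

The single sentence is what similarity search matches against; the window is meant to be handed to the LLM at answer time. In LlamaIndex that swap is usually done by adding a `MetadataReplacementPostProcessor` with `target_metadata_key="window"` to the query engine. The query engine above does not configure one, so it may be worth checking whether the window text is actually reaching the LLM.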


### 2.2 Auto-merging Retriever

In [18]:
from llama_index.core.node_parser import HierarchicalNodeParser
from llama_index.core.node_parser import get_leaf_nodes
from llama_index.core import StorageContext
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

# create a hierarchical node parser
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)
nodes = node_parser.get_nodes_from_documents(documents)
for node in nodes:
    node.metadata['content_info'] = 'Coca-Cola 10-K filings'
leaf_nodes = get_leaf_nodes(nodes)

# create a storage context
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

# create the vector store index
automerging_index = VectorStoreIndex(
    leaf_nodes, storage_context=storage_context
)

# create the retriever
base_retriever = automerging_index.as_retriever(similarity_top_k=12)
retriever = AutoMergingRetriever(base_retriever, storage_context, verbose=True)

# create the query engine
query_engine = RetrieverQueryEngine.from_args(retriever)
response = query_engine.query("What was the revenue of Coca-Cola in 2023?")
print(response)

The revenue of Coca-Cola in 2023 was $29.2 billion.
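The merging step can be sketched in a few lines. The example below is a single-level simplification under stated assumptions: `auto_merge`, `ratio_threshold`, and the toy node ids are hypothetical, and the real `AutoMergingRetriever` works on the node hierarchy stored in the docstore rather than on plain dictionaries.

```python
from collections import defaultdict
from typing import Dict, List


def auto_merge(retrieved_leaves: List[str],
               parent_of: Dict[str, str],
               children_of: Dict[str, List[str]],
               ratio_threshold: float = 0.5) -> List[str]:
    """Replace sibling leaves with their parent when enough of them were retrieved."""
    hits_by_parent = defaultdict(list)
    for leaf in retrieved_leaves:
        hits_by_parent[parent_of[leaf]].append(leaf)

    merged: List[str] = []
    for parent, hits in hits_by_parent.items():
        ratio = len(hits) / len(children_of[parent])
        if ratio > ratio_threshold:
            merged.append(parent)   # enough siblings matched: return the larger chunk
        else:
            merged.extend(hits)     # too few matched: keep the individual leaves
    return merged


# Toy hierarchy (hypothetical ids): two parents, each with four leaf chunks.
children_of = {"P1": ["a", "b", "c", "d"], "P2": ["e", "f", "g", "h"]}
parent_of = {leaf: parent for parent, leaves in children_of.items() for leaf in leaves}

result = auto_merge(["a", "b", "c", "e"], parent_of, children_of)
print(result)
```

Three of the four leaves under `P1` were retrieved, so they collapse into `P1`, while the lone hit under `P2` stays a leaf. A lower threshold merges more eagerly and returns more surrounding text per hit.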


### 2.3 Auto Retriever

In [21]:
from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores.types import VectorStoreInfo, MetadataInfo
from llama_index.core.query_engine import RetrieverQueryEngine

# create the vector store info
vector_store_info = VectorStoreInfo(
    content_info="Coca-Cola 10-K filings",
    metadata_info=[
        MetadataInfo(
            name="content_info",
            type="str",
            description="Information about the content of the document, in this case, Coca-Cola 10-K filings.",
        )
    ]
)

# create the retriever
retriever = VectorIndexAutoRetriever(
    automerging_index,
    vector_store_info=vector_store_info,
    similarity_top_k=10,
    verbose=True,
)

# create the query engine
query_engine = RetrieverQueryEngine.from_args(retriever)
response = query_engine.query("What was the revenue of Coca-Cola in 2023?")
print(response)

Using query str: revenue of Coca-Cola in 2023
Using filters: [('content_info', '==', 'Coca-Cola 10-K filings')]
Coca-Cola's revenue in 2023 was $29.2 billion.
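The log above shows the auto retriever inferring both a query string and a metadata filter. A minimal sketch of the filtering half follows, using the same `(key, operator, value)` tuples that the log prints. `apply_filters` and the toy nodes are hypothetical, and only the `==` operator is handled.

```python
from typing import Any, Dict, List, Tuple

Filter = Tuple[str, str, Any]  # (metadata key, operator, value), as printed in the log


def apply_filters(nodes: List[Dict[str, Any]], filters: List[Filter]) -> List[Dict[str, Any]]:
    """Keep only nodes whose metadata satisfies every equality filter."""
    kept = []
    for node in nodes:
        metadata = node.get("metadata", {})
        ok = True
        for key, op, value in filters:
            if op != "==":
                raise ValueError(f"unsupported operator in this sketch: {op}")
            if metadata.get(key) != value:
                ok = False
                break
        if ok:
            kept.append(node)
    return kept


# Toy nodes (made up for illustration).
nodes = [
    {"text": "risk factors", "metadata": {"content_info": "Coca-Cola 10-K filings"}},
    {"text": "revenue table", "metadata": {"content_info": "Coca-Cola 10-K filings"}},
    {"text": "unrelated memo", "metadata": {"content_info": "other"}},
]
filters = [("content_info", "==", "Coca-Cola 10-K filings")]

filtered = apply_filters(nodes, filters)
unfiltered = apply_filters(nodes, [])  # an empty filter list keeps every node
print(len(filtered), len(unfiltered))
```

In this notebook every node was tagged with the same `content_info` value, so this particular filter cannot narrow the candidate set. A per-filing field such as the fiscal year (not present in the notebook) would give the auto retriever something useful to filter on.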


## 3. Rerank the documents

In [24]:
from llama_index.core.postprocessor import SentenceTransformerRerank

# create a reranker
reranker = SentenceTransformerRerank(
    model="BAAI/bge-reranker-base", top_n=2
)

# create the query engine with reranker
query_engine = sentence_index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker],
)

# query the engine
response = query_engine.query("What was the revenue of Coca-Cola in 2023?")
print(response)

The revenue of Coca-Cola in 2023 was $45,754 million.
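The cell above retrieves ten candidates and keeps the two that the reranker scores highest. The sketch below shows that second stage, with a toy token-overlap score standing in for the cross-encoder. `token_overlap`, `rerank`, and the candidate passages are made up for illustration and say nothing about how `BAAI/bge-reranker-base` scores text.

```python
from typing import Callable, List, Tuple


def token_overlap(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def rerank(query: str, candidates: List[str],
           score_fn: Callable[[str, str], float],
           top_n: int = 2) -> List[Tuple[float, str]]:
    """Score every first-stage candidate against the query and keep the best top_n."""
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


query = "net operating revenues 2023"
candidates = [
    "The company owns bottling plants in several countries.",
    "Net operating revenues for 2023 are reported in the income statement.",
    "Revenues by segment appear in a later note.",
]

top = rerank(query, candidates, token_overlap, top_n=2)
for score, passage in top:
    print(f"{score:.2f}  {passage}")
```

The first stage is tuned for recall (a generous `similarity_top_k`), and the second for precision (a small `top_n`), which is why the reranked engine can return a different answer from the plain sentence window engine for the same question.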


## 4. Configure the Query Engine and Sub-Question Query Engine

In [25]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine

# create a query engine tool
query_engine_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="coca_cola_10k_filings",
        description="Provides information about Coca-Cola 10-K filings.",
    ),
)

# create the sub-question query engine
query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[query_engine_tool],
    use_async=True,
)

# ask a complex question
response = query_engine.query(
    "What were the key risk factors for Coca-Cola in 2023, and how have they changed from 2022?"
)
print(response)

Generated 2 sub questions.
[coca_cola_10k_filings] Q: What were the key risk factors for Coca-Cola in 2023?
[coca_cola_10k_filings] Q: How have the key risk factors for Coca-Cola changed from 2022 to 2023?
[coca_cola_10k_filings] A: The key risk factors for Coca-Cola in 2023 included the potential exacerbation of risks due to the COVID-19 pandemic, such as changes in the retail landscape, loss of key customers, fluctuations in input costs, inflation rates, and foreign currency exchange rates. Additionally, there were concerns about the ability of third-party service providers and business partners to fulfill their commitments in a timely manner and as per agreed terms.
[coca_cola_10k_filings] A: The key risk factors for Coca-Cola have evolved from uncertainties associated with the scope, severity, and duration of the global COVID-19 pandemic in 2022 to the continued negative impact of th

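With `use_async=True`, the sub-questions are dispatched concurrently rather than one after another. The sketch below illustrates that pattern with `asyncio.gather`. `answer` and `answer_all` are hypothetical stand-ins, and the delays simulate API latency; this is not the `SubQuestionQueryEngine` implementation.

```python
import asyncio
from typing import List, Tuple


async def answer(question: str, delay: float) -> str:
    """Stand-in for one tool call; the delay mimics variable API latency."""
    await asyncio.sleep(delay)
    return f"answer to: {question}"


async def answer_all(sub_questions: List[Tuple[str, float]]) -> List[str]:
    """Run every sub-question concurrently and return answers in question order."""
    tasks = [answer(question, delay) for question, delay in sub_questions]
    return await asyncio.gather(*tasks)


sub_questions = [
    ("What were the key risk factors in 2023?", 0.02),     # slower call
    ("How did the risk factors change from 2022?", 0.01),  # faster call
]

# In a notebook, use `await answer_all(sub_questions)` instead of asyncio.run,
# because the notebook already has a running event loop.
answers = asyncio.run(answer_all(sub_questions))
for text in answers:
    print(text)
```

`asyncio.gather` returns results in the order the tasks were submitted, even when a later task finishes first. Log lines printed as each call completes can therefore appear in a different order from the questions.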
## 5. Evaluate performance

In [28]:
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator, AnswerRelevancyEvaluator
from llama_index.core import Settings

# create the evaluators
faithfulness_evaluator = FaithfulnessEvaluator(llm=Settings.llm)
relevancy_evaluator = RelevancyEvaluator(llm=Settings.llm)
answer_relevancy_evaluator = AnswerRelevancyEvaluator(llm=Settings.llm)

# define a sample query
query = "What were the key risk factors for Coca-Cola in 2023, and how have they changed from 2022?"

# get the response from the query engine
response = query_engine.query(query)

# evaluate faithfulness
faithfulness_result = faithfulness_evaluator.evaluate_response(
    response=response
)
print(f"Faithfulness Score: {faithfulness_result.score}")

# evaluate relevancy
relevancy_result = relevancy_evaluator.evaluate_response(
    query=query,
    response=response
)
print(f"Relevancy Score: {relevancy_result.score}")

# evaluate answer relevancy
answer_relevancy_result = answer_relevancy_evaluator.evaluate_response(
    query=query,
    response=response
)
print(f"Answer Relevancy Score: {answer_relevancy_result.score}")

Generated 2 sub questions.
[coca_cola_10k_filings] Q: What were the key risk factors for Coca-Cola in 2023?
[coca_cola_10k_filings] Q: How have the key risk factors for Coca-Cola changed from 2022 to 2023?
[coca_cola_10k_filings] A: The key risk factors for Coca-Cola in 2023 included the potential exacerbation of risks due to the COVID-19 pandemic, such as changes in the retail landscape, loss of key customers, fluctuations in input costs, inflation rates, and foreign currency exchange rates. Additionally, there were concerns about the ability of third-party service providers and business partners to fulfill their commitments in a timely manner and as per agreed terms.
[coca_cola_10k_filings] A: The key risk factors for Coca-Cola have evolved from uncertainties associated with the scope, severity, and duration of the global COVID-19 pandemic in 2022 to the ongoing negative impact of the

## 6. Comparative Evaluation of Retrievers

In [29]:
# Define a set of evaluation questions
eval_questions = [
    "What was the total revenue of Coca-Cola in 2021?",
    "Compare the company's performance in 2022 vs 2023 in terms of revenue and net income.",
    "Summarize the key business risks outlined in the 2023 filing.",
]

In [30]:
import pandas as pd
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
from llama_index.core import Settings
import asyncio


async def evaluate_query_engine(query_engine, questions):
    """
    Evaluates a query engine on a list of questions and returns the results
    in a pandas DataFrame.
    """
    faithfulness_evaluator = FaithfulnessEvaluator(llm=Settings.llm)
    relevancy_evaluator = RelevancyEvaluator(llm=Settings.llm)

    results = []

    for question in questions:
        print(f"Evaluating question: {question}")
        response = await query_engine.aquery(question)

        faithfulness_result = await faithfulness_evaluator.aevaluate_response(
            response=response
        )

        relevancy_result = await relevancy_evaluator.aevaluate_response(
            query=question, response=response
        )

        results.append({
            "Question": question,
            "Response": response.response,
            "Faithfulness": faithfulness_result.score,
            "Relevancy": relevancy_result.score
        })

    return pd.DataFrame(results)

In [31]:
# Evaluate the Sentence Window Retriever
sentence_window_engine = sentence_index.as_query_engine(similarity_top_k=5, llm=Settings.llm)
sentence_window_results = await evaluate_query_engine(sentence_window_engine, eval_questions)

print("Sentence Window Retriever Results:")
display(sentence_window_results)

Evaluating question: What was the total revenue of Coca-Cola in 2021?
Evaluating question: Compare the company's performance in 2022 vs 2023 in terms of revenue and net income.
Evaluating question: Summarize the key business risks outlined in the 2023 filing.
Sentence Window Retriever Results:


| | Question | Response | Faithfulness | Relevancy |
|---|---|---|---|---|
| 0 | What was the total revenue of Coca-Cola in 2021? | The total revenue of Coca-Cola in 2021 was $38... | 1.0 | 1.0 |
| 1 | Compare the company's performance in 2022 vs 2... | In 2022, the company's net operating revenues ... | 1.0 | 1.0 |
| 2 | Summarize the key business risks outlined in t... | The key business risks outlined in the 2023 fi... | 1.0 | 1.0 |


In [32]:
# Evaluate the Auto-Merging Retriever
automerging_retriever = AutoMergingRetriever(
    automerging_index.as_retriever(similarity_top_k=12),
    storage_context,
    verbose=True
)
automerging_engine = RetrieverQueryEngine.from_args(automerging_retriever, llm=Settings.llm)
automerging_results = await evaluate_query_engine(automerging_engine, eval_questions)

print("Auto-Merging Retriever Results:")
display(automerging_results)

Evaluating question: What was the total revenue of Coca-Cola in 2021?
Evaluating question: Compare the company's performance in 2022 vs 2023 in terms of revenue and net income.
Evaluating question: Summarize the key business risks outlined in the 2023 filing.
Auto-Merging Retriever Results:


| | Question | Response | Faithfulness | Relevancy |
|---|---|---|---|---|
| 0 | What was the total revenue of Coca-Cola in 2021? | The total revenue of Coca-Cola in 2021 was $38... | 1.0 | 0.0 |
| 1 | Compare the company's performance in 2022 vs 2... | In 2022, the company's net operating revenues ... | 1.0 | 1.0 |
| 2 | Summarize the key business risks outlined in t... | The key business risks outlined in the 2023 fi... | 1.0 | 1.0 |


In [33]:
# Evaluate the Auto Retriever
auto_retriever = VectorIndexAutoRetriever(
    automerging_index,
    vector_store_info=vector_store_info,
    similarity_top_k=10,
    verbose=True,
)
auto_retriever_engine = RetrieverQueryEngine.from_args(auto_retriever, llm=Settings.llm)
auto_retriever_results = await evaluate_query_engine(auto_retriever_engine, eval_questions)

print("Auto Retriever Results:")
display(auto_retriever_results)

Evaluating question: What was the total revenue of Coca-Cola in 2021?
Using query str: total revenue of Coca-Cola in 2021
Using filters: []
Evaluating question: Compare the company's performance in 2022 vs 2023 in terms of revenue and net income.
Using query str: company performance in 2022 vs 2023 in terms of revenue and net income
Using filters: [('content_info', '==', 'Coca-Cola 10-K filings')]
Evaluating question: Summarize the key business risks outlined in the 2023 filing.
Using query str: key business risks 2023 filing
Using filters: [('content_info', '==', 'Coca-Cola 10-K filings')]
Auto Retriever Results:


| | Question | Response | Faithfulness | Relevancy |
|---|---|---|---|---|
| 0 | What was the total revenue of Coca-Cola in 2021? | Coca-Cola's total revenue in 2021 was $38,655 ... | 1.0 | 1.0 |
| 1 | Compare the company's performance in 2022 vs 2... | In 2022, the company's net operating revenues ... | 1.0 | 1.0 |
| 2 | Summarize the key business risks outlined in t... | The key business risks outlined in the 2023 fi... | 1.0 | 1.0 |


In [34]:
# Add a 'Retriever' column to each DataFrame
sentence_window_results['Retriever'] = 'Sentence Window'
automerging_results['Retriever'] = 'Auto-Merging'
auto_retriever_results['Retriever'] = 'Auto Retriever'

# Concatenate the results into a single DataFrame
all_results = pd.concat([sentence_window_results, automerging_results, auto_retriever_results])

# Reorder the columns for better readability
all_results = all_results[['Retriever', 'Question', 'Response', 'Faithfulness', 'Relevancy']]

# Display the consolidated results
display(all_results)

| | Retriever | Question | Response | Faithfulness | Relevancy |
|---|---|---|---|---|---|
| 0 | Sentence Window | What was the total revenue of Coca-Cola in 2021? | The total revenue of Coca-Cola in 2021 was $38... | 1.0 | 1.0 |
| 1 | Sentence Window | Compare the company's performance in 2022 vs 2... | In 2022, the company's net operating revenues ... | 1.0 | 1.0 |
| 2 | Sentence Window | Summarize the key business risks outlined in t... | The key business risks outlined in the 2023 fi... | 1.0 | 1.0 |
| 0 | Auto-Merging | What was the total revenue of Coca-Cola in 2021? | The total revenue of Coca-Cola in 2021 was $38... | 1.0 | 0.0 |
| 1 | Auto-Merging | Compare the company's performance in 2022 vs 2... | In 2022, the company's net operating revenues ... | 1.0 | 1.0 |
| 2 | Auto-Merging | Summarize the key business risks outlined in t... | The key business risks outlined in the 2023 fi... | 1.0 | 1.0 |
| 0 | Auto Retriever | What was the total revenue of Coca-Cola in 2021? | Coca-Cola's total revenue in 2021 was $38,655 ... | 1.0 | 1.0 |
| 1 | Auto Retriever | Compare the company's performance in 2022 vs 2... | In 2022, the company's net operating revenues ... | 1.0 | 1.0 |
| 2 | Auto Retriever | Summarize the key business risks outlined in t... | The key business risks outlined in the 2023 fi... | 1.0 | 1.0 |


### Analysis of the Results:

Based on the evaluation, here are some key observations:

#### Faithfulness:

All three retrievers scored 1.0 on faithfulness for all three questions, meaning the evaluator judged every answer to be supported by the retrieved 10-K context. Faithfulness measures grounding in that context, not whether the right passage was retrieved, and three questions are a small sample.

#### Relevancy:

The Sentence Window Retriever and the Auto Retriever scored 1.0 on relevancy for every question.

The Auto-Merging Retriever scored 0.0 on relevancy for the first question, even though the displayed (truncated) answer begins identically to the Sentence Window answer. Because the relevancy evaluator considers the retrieved context together with the response, the low score may reflect the merged context rather than the answer itself. One possible cause is that the merging strategy is sometimes too aggressive and pulls in irrelevant text, but a single 0.0 is not enough to confirm this.
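The observations above can be condensed into one average per retriever and metric. The sketch below does this with the standard library only, using the scores copied from the consolidated table; the `rows` layout and the `summary` name are choices made for this example.

```python
from collections import defaultdict
from statistics import mean

# (retriever, faithfulness, relevancy) rows copied from the consolidated results table.
rows = [
    ("Sentence Window", 1.0, 1.0),
    ("Sentence Window", 1.0, 1.0),
    ("Sentence Window", 1.0, 1.0),
    ("Auto-Merging", 1.0, 0.0),
    ("Auto-Merging", 1.0, 1.0),
    ("Auto-Merging", 1.0, 1.0),
    ("Auto Retriever", 1.0, 1.0),
    ("Auto Retriever", 1.0, 1.0),
    ("Auto Retriever", 1.0, 1.0),
]

# Group the per-question scores by retriever.
scores = defaultdict(lambda: {"faithfulness": [], "relevancy": []})
for retriever, faithfulness, relevancy in rows:
    scores[retriever]["faithfulness"].append(faithfulness)
    scores[retriever]["relevancy"].append(relevancy)

# Average each metric per retriever.
summary = {
    retriever: {metric: mean(values) for metric, values in metrics.items()}
    for retriever, metrics in scores.items()
}

for retriever, metrics in summary.items():
    print(f"{retriever}: faithfulness={metrics['faithfulness']:.2f}, "
          f"relevancy={metrics['relevancy']:.2f}")
```

With only three questions per retriever, a single 0.0 moves the Auto-Merging relevancy average from 1.0 to about 0.67, so these averages are best read as a rough signal rather than a ranking.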

### Conclusion:

On these three questions, the Sentence Window Retriever and the Auto Retriever were the most reliable choices: both produced answers that were rated faithful and relevant every time. The Auto-Merging Retriever, while a powerful technique, may need more careful tuning before it is effective across a wide range of questions.

These results inform the choice of retriever for the final pipeline, although a larger question set would be needed to separate the Sentence Window Retriever from the Auto Retriever.

# Task
A question-answering system needs to be built to answer questions about Coca-Cola 10-K filings. The performance of three retrievers (sentence window, auto-merging, and auto retriever) is compared. After the best retriever is selected, the retrieved documents are re-ranked and the system is integrated with a query engine and a sub-question query engine. The final RAG pipeline is then evaluated on a new set of three evaluation questions. Throughout the process, caching is used to minimize API calls and billing.

## Define a new set of evaluation questions

In [35]:
new_eval_questions = [
    "Analyze the trend of Coca-Cola's net operating revenues over the past three years and discuss the primary drivers of this trend.",
    "What are the company's principal international markets, and what are the associated risks and mitigation strategies mentioned in the recent filings?",
    "Describe Coca-Cola's key strategic initiatives for future growth as detailed in the latest 10-K report, including any significant investments or divestitures."
]

print("New evaluation questions defined:")
for i, question in enumerate(new_eval_questions):
    print(f"{i+1}. {question}")

New evaluation questions defined:
1. Analyze the trend of Coca-Cola's net operating revenues over the past three years and discuss the primary drivers of this trend.
2. What are the company's principal international markets, and what are the associated risks and mitigation strategies mentioned in the recent filings?
3. Describe Coca-Cola's key strategic initiatives for future growth as detailed in the latest 10-K report, including any significant investments or divestitures.


## Create the advanced RAG pipeline

Create the advanced RAG pipeline by combining the Sentence Window Retriever with the reranker and the sub-question query engine. This involves creating a query engine with reranking, wrapping it in a `QueryEngineTool`, and then using that tool to create a `SubQuestionQueryEngine`.


In [36]:
# 1. Create a query engine with reranker
query_engine_with_reranker = sentence_index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker],
)

# 2. Create a QueryEngineTool
query_engine_tool = QueryEngineTool(
    query_engine=query_engine_with_reranker,
    metadata=ToolMetadata(
        name="coca_cola_10k_filings_advanced",
        description="Provides information about Coca-Cola 10-K filings, with advanced reranking capabilities.",
    ),
)

# 3. Create a SubQuestionQueryEngine
advanced_query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[query_engine_tool],
    use_async=True,
)

print("Advanced RAG pipeline created successfully.")

Advanced RAG pipeline created successfully.


## Evaluate the advanced RAG pipeline

Evaluate the advanced RAG pipeline on the new set of evaluation questions using the same evaluation framework as before: call `evaluate_query_engine` with `advanced_query_engine` and `new_eval_questions`, store the results in `advanced_rag_results`, and display them.


In [37]:
advanced_rag_results = await evaluate_query_engine(advanced_query_engine, new_eval_questions)

print("Advanced RAG Pipeline Results:")
display(advanced_rag_results)

Evaluating question: Analyze the trend of Coca-Cola's net operating revenues over the past three years and discuss the primary drivers of this trend.
Generated 2 sub questions.
[coca_cola_10k_filings_advanced] Q: What are Coca-Cola's net operating revenues for the past three years?
[coca_cola_10k_filings_advanced] Q: What are the primary drivers of Coca-Cola's net operating revenue trend over the past three years?
[coca_cola_10k_filings_advanced] A: The primary drivers of Coca-Cola's net operating revenue trend over the past three years include volume growth, changes in price, product and geographic mix, foreign currency exchange rate fluctuations, and acquisitions and divestitures.
[coca_cola_10k_filings_advanced] A: Coca-Cola's net operating revenues for the past three years are as follows:
- 2016: $41,863 million
- 2015: $44,294 million
- 2014: $45,998 million
Evaluating question:

| | Question | Response | Faithfulness | Relevancy |
|---|---|---|---|---|
| 0 | Analyze the trend of Coca-Cola's net operating... | Coca-Cola's net operating revenues have shown ... | 1.0 | 1.0 |
| 1 | What are the company's principal international... | The company's principal international markets ... | 1.0 | 1.0 |
| 2 | Describe Coca-Cola's key strategic initiatives... | Coca-Cola's key strategic initiatives for futu... | 1.0 | 1.0 |


## Display the results

In [38]:
advanced_rag_results['Retriever'] = 'Advanced RAG'
advanced_rag_results = advanced_rag_results[['Retriever', 'Question', 'Response', 'Faithfulness', 'Relevancy']]
display(advanced_rag_results)

| | Retriever | Question | Response | Faithfulness | Relevancy |
|---|---|---|---|---|---|
| 0 | Advanced RAG | Analyze the trend of Coca-Cola's net operating... | Coca-Cola's net operating revenues have shown ... | 1.0 | 1.0 |
| 1 | Advanced RAG | What are the company's principal international... | The company's principal international markets ... | 1.0 | 1.0 |
| 2 | Advanced RAG | Describe Coca-Cola's key strategic initiatives... | Coca-Cola's key strategic initiatives for futu... | 1.0 | 1.0 |


## Summary:

### Q&A
**What are the company's principal international markets, and what are the associated risks and mitigation strategies mentioned in the recent filings?**

The principal international markets for Coca-Cola, as outlined in their 10-K filings, include Mexico, Brazil, Japan, and China. The company faces several risks in these markets, such as currency fluctuations, political and economic instability, and regulatory changes. To mitigate these risks, Coca-Cola employs strategies like hedging against currency fluctuations, diversifying its product portfolio to cater to local tastes, and maintaining strong relationships with local partners and governments.

**Describe Coca-Cola's key strategic initiatives for future growth as detailed in the latest 10-K report, including any significant investments or divestitures.**

Coca-Cola's key strategic initiatives for future growth focus on expanding its portfolio of beverages to include more low- and no-sugar options, as well as investing in emerging categories like plant-based drinks and premium water. The company is also focused on digital transformation to better engage with consumers and optimize its supply chain. Recent significant investments include the acquisition of Costa Coffee and a minority stake in BodyArmor, while divestitures have involved refranchising bottling operations to streamline operations and improve profitability.

**Analyze the trend of Coca-Cola's net operating revenues over the past three years and discuss the primary drivers of this trend.**

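Over the past three years, Coca-Cola's net operating revenues have shown a positive trend, with a notable increase in the most recent year. The primary drivers of this trend include strong growth in emerging markets, successful marketing campaigns, and price increases in key markets. The company has also benefited from the recovery of away-from-home consumption channels, such as restaurants and cinemas, following the easing of pandemic-related restrictions. This summary is not consistent with the sub-question trace logged for the same question, which reported net operating revenues of $45,998 million (2014), $44,294 million (2015), and $41,863 million (2016), a year-over-year decline, so it should be verified against the source nodes.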

### Data Analysis Key Findings
- The advanced RAG pipeline, which combines a Sentence Window Retriever, a reranker, and a sub-question query engine, was evaluated on a new set of three complex questions.
- The pipeline scored 1.0 on both "Faithfulness" and "Relevancy" for all three questions, indicating that the evaluator judged each answer to be supported by the retrieved context and relevant to the user's query. These scores do not by themselves verify factual accuracy.
- The sub-question query engine successfully broke down each complex question into smaller, more manageable sub-questions, which were then answered by the RAG pipeline.

### Insights or Next Steps
- The advanced RAG pipeline handled complex questions that require synthesizing information from multiple parts of the filings.
- Breaking a complex query into sub-questions may improve answer quality. Confirming this would require running the same questions through a pipeline without the sub-question query engine, which was not done here.


## 7. Inspecting the Source of Responses

In [39]:
import textwrap

for question in new_eval_questions:
    print(f"Question: {question}\n")
    response = await advanced_query_engine.aquery(question)

    print("Answer:")
    print(textwrap.fill(str(response), 100))
    print("\n" + "="*50 + "\n")

    print("Source Nodes:")
    for i, source_node in enumerate(response.source_nodes):
        # We wrap the text for better readability
        print(f"--- Source Node {i+1} ---")
        print(textwrap.fill(source_node.get_content(), 100))
        print("\n" + "-"*50 + "\n")

Question: Analyze the trend of Coca-Cola's net operating revenues over the past three years and discuss the primary drivers of this trend.

Generated 2 sub questions.
[coca_cola_10k_filings_advanced] Q: What are Coca-Cola's net operating revenues for the past three years?
[coca_cola_10k_filings_advanced] Q: What are the primary drivers of Coca-Cola's net operating revenue trend over the past three years?
[coca_cola_10k_filings_advanced] A: The primary drivers of Coca-Cola's net operating revenue trend over the past three years are volume growth (concentrate sales volume or unit case volume), changes in price, product and geographic mix, foreign currency exchange rate fluctuations, and acquisitions and divestitures (including structural changes).
[coca_cola_10k_filings_advanced] A: Coca-Cola's net operating revenues for the past three years are as follows:
- 2016: $41,863 million
- 2015: 