
Group #: 86
Group members:


1.   K B N Ramesh
2.   KISTIPATI PRASANTH KUMAR REDDY
3.   KOWSALYA S
4.   SHIVANGI RANA

Problem Statement:
Implement an Advanced RAG system that uses a vector store of your choice for retrieval and a pre-trained LLM of your choice for generation. Two services must be implemented:

*   Ingestion Service: ingests data from files into a vector store (text data only).
*   Retrieval Service: takes a user query, retrieves relevant data from the vector store, and generates an answer to the query.

You can use the LLM and embedding models of your choice.
The services must be implemented with either LangChain or LlamaIndex.




# Setup

In [363]:
!pip install llama-index-core
!pip install llama-index-embeddings-openai
!pip install llama-index-llms-openai
!pip install llama-index-readers-file
!pip install llama-index-retrievers-bm25

!pip install pymupdf




# Import

In [390]:
import logging
import os

import pymupdf
from IPython.display import Markdown, display


from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.readers.file import PyMuPDFReader
from llama_index.core.schema import TransformComponent
from llama_index.core.ingestion import IngestionPipeline

from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.node_parser import (
    SemanticSplitterNodeParser,
)
from llama_index.core.settings import Settings


from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine.retriever_query_engine import (
    RetrieverQueryEngine,
)
from llama_index.core.postprocessor.types import BaseNodePostprocessor
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core import PromptTemplate
from llama_index.core.response.notebook_utils import display_response
from llama_index.core.postprocessor import LLMRerank


import Stemmer
import nest_asyncio

# Read the OpenAI API key from Colab secrets (stored there under the name OPEN_API_TOKEN)
from google.colab import userdata
os.environ['OPENAI_API_KEY'] = userdata.get('OPEN_API_TOKEN')


# Ingestion Service

#### Load Data

We used the PyMuPDFReader in LlamaIndex to load the text from the PDF (Pfizer Inc.'s 10-K filing for 2023). The file is uploaded to the ephemeral storage in Colab (/content/pfe-20231231.pdf).

In [365]:
loader = PyMuPDFReader()
documents = loader.load("/content/pfe-20231231.pdf")


#### Use SemanticSplitter to convert text into chunks

Instead of chunking text with a fixed chunk size, the semantic splitter adaptively picks the breakpoint between sentences using embedding similarity. This ensures that a "chunk" contains sentences that are semantically related to each other.

In [366]:
# Define SemanticSplitter (https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking/)
embed_model = OpenAIEmbedding(model="text-embedding-3-large")
semantic_chunker = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)
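
To make the breakpoint rule concrete, here is a minimal, library-free sketch of percentile-based splitting. It is not the `SemanticSplitterNodeParser` implementation: the toy bag-of-words `embed` function stands in for the OpenAI embedding model, the sentence buffer is omitted, and the names `embed`, `cosine_distance`, `percentile` and `semantic_split` are hypothetical helpers introduced only for illustration.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy embedding: bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(sentence.lower().replace(".", "").split())


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, pct):
    """Percentile with linear interpolation between neighbouring values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_split(sentences, threshold_pct=95):
    """Start a new chunk wherever the distance between adjacent sentences
    exceeds the given percentile of all adjacent distances."""
    if len(sentences) < 2:
        return [list(sentences)]
    vectors = [embed(s) for s in sentences]
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    cutoff = percentile(distances, threshold_pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > cutoff:
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


sentences = [
    "Pfizer sells vaccines.",
    "Pfizer vaccines prevent disease.",
    "Revenue fell in 2023.",
    "Revenue in 2023 fell sharply.",
]
for chunk in semantic_split(sentences):
    print(chunk)
```

A higher `threshold_pct` yields fewer, larger chunks because only the sharpest topic changes count as breakpoints.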


In [367]:
Semantic_Chunker_Pipeline = IngestionPipeline(
    transformations=[
        semantic_chunker,
    ],
)


In [368]:
Semantic_Chunker_Nodes = Semantic_Chunker_Pipeline.run(documents=documents)


print(f"Nodes in Semantic Chunker: {len(Semantic_Chunker_Nodes)}")

Nodes in Semantic Chunker: 363


In [369]:
print(Semantic_Chunker_Nodes[0].text)

UNITED STATES SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☒
ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year ended December 31, 2023
☐
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from to
Commission file number 1-3619
PFIZER INC.
(Exact name of registrant as specified in its charter)
Delaware
13-5315170
(State or other jurisdiction of incorporation or organization)
(I.R.S. Employer Identification Number)
66 Hudson Boulevard East, New York, New York 10001-2192
(Address of principal executive offices) (zip code)
(212) 733-2323
(Registrant’s telephone number, including area code)
Securities registered pursuant to Section 12(b) of the Act:
Title of each class
Trading Symbol(s)
Name of each exchange on which registered
Common Stock, $0.05 par value
PFE
New York Stock Exchange
1.000% Notes due 2027
PFE27
New York Stock Exchange
Secur

#### Ingest data into Vector Store

In [370]:
vector_index = VectorStoreIndex(Semantic_Chunker_Nodes)


# Retrieval Service

#### 1. User provides a text query.

#### User Queries - Simple, Medium & Complex

In [371]:
# Ask question
userQuery1 = "List all vacines and what their use" # Simple: Easy and direct question whose answer is present on a single page
userQuery2 = "What are risks related to industry"  # Medium: Question whose answer is present on multiple pages [2-3 pages of pdf]
userQuery3 = "summarize management of cyber security and also financial statements for 2023" # Complex: Multiple questions asked in a single query


#### 2. Query cleaning and rewriting
#### Simple & Query Decomposition

In [372]:
# Query rewrite
basic_query_cleanup_prompt = """\
You are a helpful assistant that rewrites user queries. You need to perform the operations below:
1. You only need to correct the grammar of the query.
2. Correct any spelling mistakes.
3. Convert the original query into a question if it is not already a question.
4. Decompose a single question into multiple questions if needed.
Original Query: {query}
Corrected Query:
"""
basic_query_cleanup_Prompt_Template = PromptTemplate(basic_query_cleanup_prompt)


In [373]:
llm = OpenAI(model="gpt-4")
cleanedup_user_query1 = llm.predict(basic_query_cleanup_Prompt_Template,query=userQuery1)
cleanedup_user_query2 = llm.predict(basic_query_cleanup_Prompt_Template,query=userQuery2)
cleanedup_user_query3 = llm.predict(basic_query_cleanup_Prompt_Template,query=userQuery3)
print(cleanedup_user_query1 + "\n======\n")
print(cleanedup_user_query2 + "\n======\n")
print(cleanedup_user_query3 + "\n======\n")

What are all the vaccines and what are their uses?
======

What are the risks related to the industry?
======

1. Can you summarize the management of cyber security?
2. Can you also provide the financial statements for 2023?
======
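
The third rewritten query comes back as a numbered list of sub-questions. One option is to split such output so that each sub-question can be retrieved and answered separately. The sketch below assumes the LLM numbers decomposed questions as `1.`, `2.`, and so on; `split_subquestions` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re

NUMBERED_LINE = re.compile(r"^\d+[.)]\s*")


def split_subquestions(rewritten_query):
    """Split a rewritten query into sub-questions.

    Lines that start with "1.", "2.", ... are treated as separate questions.
    A query without numbering is returned as a single-item list.
    """
    lines = [line.strip() for line in rewritten_query.splitlines() if line.strip()]
    numbered = [NUMBERED_LINE.sub("", line) for line in lines if NUMBERED_LINE.match(line)]
    return numbered if numbered else [" ".join(lines)]


example = (
    "1. Can you summarize the management of cyber security?\n"
    "2. Can you also provide the financial statements for 2023?"
)
for question in split_subquestions(example):
    print(question)
```

Each returned question could then be passed to a query engine in a loop, which keeps retrieval focused on one topic at a time.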



#### Retriever 1: BM25 Retriever

In [374]:
bm25_retriever = BM25Retriever.from_defaults(
    nodes=Semantic_Chunker_Nodes,
    similarity_top_k=10,
    # Optional: We can pass in the stemmer and set the language for stopwords
    # This is important for removing stopwords and stemming the query + text
    # The default is english for both
    stemmer=Stemmer.Stemmer("english"),
    language="english",
)
bm25_retriever.persist("./bm25_retriever")
loaded_bm25_retriever = BM25Retriever.from_persist_dir("./bm25_retriever")
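
BM25 ranks chunks by lexical overlap with the query: rare query terms count for more, and long chunks are penalised. The following is a minimal sketch of the classic Okapi BM25 formula, not the `bm25s` implementation that `BM25Retriever` relies on. It skips stemming and stopword removal, and the `k1=1.5` and `b=0.75` values are conventional defaults chosen here for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for a whitespace-tokenized query."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "pfizer vaccine prevents covid",
    "revenue declined in 2023",
    "vaccine pipeline includes lyme disease vaccine",
]
print(bm25_scores("vaccine covid", docs))
```

The second document shares no term with the query and therefore scores zero, which is the main weakness of purely lexical retrieval and the reason it is paired with a semantic retriever below.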


#### Retriever 2: Semantic Retriever

In [375]:
vector_retriever = vector_index.as_retriever(similarity_top_k=10)
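
The semantic retriever returns the `similarity_top_k` chunks whose embeddings are most similar to the query embedding. A minimal sketch of that idea with cosine similarity follows. The three-dimensional vectors are made up for illustration, `top_k_by_cosine` is a hypothetical helper, and a real vector store may use a different metric or an approximate search.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_cosine(query_vec, chunk_vecs, k):
    """Return (chunk_id, similarity) pairs for the k most similar chunks."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up embeddings: each axis loosely stands for one topic
chunk_vecs = {
    "vaccines": [0.9, 0.1, 0.0],
    "risks": [0.1, 0.9, 0.1],
    "financials": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]
print(top_k_by_cosine(query_vec, chunk_vecs, k=2))
```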

#### Retriever 3: Fusion Retriever

In [376]:
fusion_retriever = QueryFusionRetriever(
    [loaded_bm25_retriever, vector_retriever],
    similarity_top_k=10,
    num_queries=1,  # set this to 1 to disable query generation
    # mode="reciprocal_rerank",  # left disabled, so the default fusion mode is used
    use_async=True,
    verbose=True,
)
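
BM25 scores and embedding similarities live on different scales, so a rank-based fusion such as the commented-out `reciprocal_rerank` mode can be a fairer way to combine the two retrievers. Below is a minimal sketch of reciprocal rank fusion, assuming 1-based ranks and the commonly used constant `k=60`. `reciprocal_rank_fusion` is a hypothetical helper and not the `QueryFusionRetriever` code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of node IDs.

    Each node receives 1 / (k + rank) from every list it appears in,
    so nodes ranked well by several retrievers rise to the top.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, node_id in enumerate(ranking, start=1):
            fused[node_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


bm25_ranking = ["a", "b", "c"]
vector_ranking = ["b", "d", "a"]
for node_id, score in reciprocal_rank_fusion([bm25_ranking, vector_ranking]):
    print(node_id, round(score, 5))
```

Only ranks enter the formula, so the raw score scale of each retriever does not matter.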

#### 4a. Implement reranking technique to select most relevant chunks

*   LLMRerank

5a. Precision can be improved by selecting the nodes that are actually relevant to the query.




In [377]:
# LLM-based reranking: keep only the retrieved nodes that the LLM judges relevant to the query.

class NodePostprocessorLLMRerank(BaseNodePostprocessor):
    """Node postprocessor that reranks retrieved nodes with LLMRerank.

    The query engine applies it to every retrieval result before response synthesis.
    """
    def _postprocess_nodes(self, nodes, query_bundle):
        # Score nodes in batches of 5 and keep at most the 6 most relevant ones
        reranker = LLMRerank(choice_batch_size=5, top_n=6)
        return reranker.postprocess_nodes(nodes, query_bundle)
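
The effect of `choice_batch_size` and `top_n` can be illustrated without an LLM. In the sketch below, `score_batch` is a hypothetical stand-in for the LLM call: it receives one batch of candidates and returns relevance scores only for those it considers relevant. This is a simplified picture of batch reranking, not the `LLMRerank` source code.

```python
def rerank_in_batches(candidates, score_batch, batch_size=5, top_n=6):
    """Score candidates batch by batch and keep the top_n most relevant.

    score_batch(batch) must return {candidate: relevance}; candidates it
    leaves out are treated as irrelevant and dropped.
    """
    scored = []
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        scored.extend(score_batch(batch).items())
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def keyword_scorer(batch):
    """Stub scorer: relevance is the number of times 'vaccine' appears."""
    return {text: text.count("vaccine") for text in batch if "vaccine" in text}


candidates = [
    "vaccine vaccine", "revenue", "vaccine", "tax",
    "debt", "vaccine vaccine vaccine", "lease",
]
print(rerank_in_batches(candidates, keyword_scorer, batch_size=5, top_n=2))
```

Because irrelevant candidates are dropped, the reranker can return fewer than `top_n` nodes, which helps precision at the cost of extra LLM calls.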

#### 5. Response Generation






#### Hallucinations can be minimized by using response synthesizers for:

*   **Consolidation of information.** Response synthesizers can aggregate and integrate information from multiple retrieved documents. By synthesizing several relevant sources into a cohesive response, they reduce the likelihood of generating content that is not backed by any source, which helps keep the output grounded in factual information.

*   **Clarity and coherence.** By refining and structuring the output, response synthesizers can enhance clarity and coherence, making it less likely for the model to generate vague or ambiguous statements that might lead to hallucination. A clear presentation of accurate information helps significantly in preventing misunderstandings.

##### Precision can be improved by Node

In [378]:
compact_response_synthesizer = get_response_synthesizer(
    response_mode="compact", llm=llm, verbose=True
)

refine_response_synthesizer = get_response_synthesizer(
    response_mode="refine", llm=llm, verbose=True,
)
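
The two modes differ mainly in how many LLM calls they make: `refine` processes one chunk per call, while `compact` first packs chunks together and then refines over the packed texts. The sketch below illustrates this with a stub LLM. It measures size in characters through a hypothetical `max_chars` limit, whereas a real synthesizer works with the model's token limit, and all function names are illustrative.

```python
def refine_answer(chunks, llm):
    """Refine mode: one LLM call per chunk, each call refining the running answer."""
    answer = ""
    for chunk in chunks:
        answer = llm(existing_answer=answer, context=chunk)
    return answer


def pack_chunks(chunks, max_chars):
    """Greedily concatenate chunks so that each packed text stays within max_chars."""
    packed, current = [], ""
    for chunk in chunks:
        candidate = f"{current}\n{chunk}" if current else chunk
        if current and len(candidate) > max_chars:
            packed.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        packed.append(current)
    return packed


def compact_answer(chunks, llm, max_chars):
    """Compact mode: pack chunks first, then refine over the fewer packed texts."""
    return refine_answer(pack_chunks(chunks, max_chars), llm)


calls = []


def stub_llm(existing_answer, context):
    """Stub LLM: records each call and appends the context to the answer."""
    calls.append(context)
    return (existing_answer + " " + context).strip()


chunks = ["aaaa", "bbbb", "cccc", "dddd"]
refine_answer(chunks, stub_llm)
print("refine calls:", len(calls))
calls.clear()
compact_answer(chunks, stub_llm, max_chars=9)
print("compact calls:", len(calls))
```

Fewer calls make `compact` cheaper, while `refine` looks at each chunk in isolation, which may explain why the two synthesizers give different answers to the same question later in this notebook.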

#### Use Fusion Retriever

In [379]:
# apply nested async to run in a notebook
nest_asyncio.apply()
nodes_with_scores = fusion_retriever.retrieve(cleanedup_user_query1)

In [380]:
for node in nodes_with_scores:
    print(f"Node Score: {node.score:.2f} - Node ID: {node.node_id}\n")

Node Score: 4.86 - Node ID: 24ae6b8d-1f91-43bc-b14e-7e7c0ea83ffb

Node Score: 4.39 - Node ID: 4d0f5291-6868-4e5a-aa73-87bb2fe630f1

Node Score: 4.07 - Node ID: a8d46d53-a407-452d-9139-2dd8a7b5f607

Node Score: 3.12 - Node ID: 7e77ddb5-583a-4dc1-b9f6-2d65c50ba2af

Node Score: 2.55 - Node ID: f49bdcf6-03fe-4163-84dd-9412d59b802c

Node Score: 2.54 - Node ID: 76c457e8-55e4-46ea-a3e1-e03760aabecc

Node Score: 2.27 - Node ID: 8fb4cac6-e71b-4206-94f3-1520691e52c0

Node Score: 2.26 - Node ID: 1f6d1db5-5791-4686-995a-0bb06f92350a

Node Score: 2.21 - Node ID: e494fa52-0462-46aa-91d4-ab3d52408699

Node Score: 2.16 - Node ID: 152bfa52-4790-400b-ae80-83b2ffef1670



#### 6a. Answer with Compact Response synthesizer

In [381]:
query_engine1 = RetrieverQueryEngine(
    retriever=fusion_retriever,
    node_postprocessors=[NodePostprocessorLLMRerank()],
    response_synthesizer=compact_response_synthesizer,
)

response = query_engine1.query(cleanedup_user_query1)

In [382]:
display_response(response)

**`Final Response:`** The vaccines mentioned in the context and their uses are:

1. Comirnaty (COVID-19 Vaccine, mRNA, 2023-2024 Formula): This is used for active immunization to prevent COVID-19 caused by SARS-CoV-2. It is approved for individuals 6 months through 4 years of age, 5 through 11 years of age, and 12 years of age and older.

2. Pfizer-BioNTech COVID-19 Vaccine (2023-2024 Formula): This is also used for active immunization to prevent COVID-19 caused by SARS-CoV-2. It is authorized for emergency use for individuals 6 months through 11 years of age.

3. Abrysvo (vaccine): This is used for active immunization to prevent RSV infection in adults aged 18-59.

4. PF-06425090 (Vaccine): This is used for immunization to prevent primary clostridioides difficile infection.

5. VLA15 (PF-07307405) vaccine: This is used for immunization to prevent Lyme disease.

6. PF-07252220 (quadrivalent mRNA-based vaccine): This is used for immunization to prevent influenza.

7. PF-07926307 (COVID/flu combo vaccine): This is used for immunization to prevent COVID infection and influenza.

In [383]:
response.metadata

{'f49bdcf6-03fe-4163-84dd-9412d59b802c': {'total_pages': 131,
  'file_path': '/content/pfe-20231231.pdf',
  'source': '52'},
 '152bfa52-4790-400b-ae80-83b2ffef1670': {'total_pages': 131,
  'file_path': '/content/pfe-20231231.pdf',
  'source': '55'}}
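
The metadata can be turned into a short list of cited pages for the reader. This is a small sketch that assumes the `source` field holds the page number as a string, as the output above suggests; `cited_pages` is a hypothetical helper.

```python
def cited_pages(metadata):
    """Return the sorted, de-duplicated page numbers found in response metadata."""
    return sorted(
        {int(info["source"]) for info in metadata.values() if "source" in info}
    )


# Metadata copied from the response above
example_metadata = {
    "f49bdcf6-03fe-4163-84dd-9412d59b802c": {
        "total_pages": 131,
        "file_path": "/content/pfe-20231231.pdf",
        "source": "52",
    },
    "152bfa52-4790-400b-ae80-83b2ffef1670": {
        "total_pages": 131,
        "file_path": "/content/pfe-20231231.pdf",
        "source": "55",
    },
}
print(cited_pages(example_metadata))  # [52, 55]
```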

#### Answer with Refine response synthesizer

In [384]:
query_engine = RetrieverQueryEngine(
    retriever=fusion_retriever,
    node_postprocessors=[NodePostprocessorLLMRerank()],
    response_synthesizer=refine_response_synthesizer,
)

response = query_engine.query(cleanedup_user_query1)
display_response(response)
response.metadata

> Refine context: total_pages: 131
file_path: /content/pfe-202312...


**`Final Response:`** The vaccines mentioned in the context are PF-06425090, VLA15 (PF-07307405), PF-07252220, and PF-07926307. PF-06425090 is used for immunization to prevent primary clostridioides difficile infection. VLA15 (PF-07307405) vaccine is used for immunization to prevent Lyme disease. PF-07252220, a quadrivalent mRNA-based vaccine, is used for immunization to prevent influenza. PF-07926307 is a COVID/flu combo vaccine used for immunization to prevent COVID infection and influenza.

{'f49bdcf6-03fe-4163-84dd-9412d59b802c': {'total_pages': 131,
  'file_path': '/content/pfe-20231231.pdf',
  'source': '52'},
 '152bfa52-4790-400b-ae80-83b2ffef1670': {'total_pages': 131,
  'file_path': '/content/pfe-20231231.pdf',
  'source': '55'}}

#### 6b. Answer for - Medium: Question whose answer is present on multiple pages [2-3 pages of pdf]

In [385]:
response = query_engine1.query(cleanedup_user_query2)
display_response(response)
response.metadata

**`Final Response:`** The risks related to the industry include managed care trends where private payors and other managed care entities continue to manage the utilization and costs of drugs in the U.S. This has led to increased negotiating power of Managed Care Organizations (MCOs) and other private third-party payors due to consolidation. They, along with state and federal governments, increasingly employ formularies to control costs and encourage utilization of certain drugs. This may lead to demands for rebates from biopharmaceutical manufacturers for preferred placement on a drug formulary. 

There are also risks related to third-party payors using measures such as new-to-market blocks, exclusion lists, indication-based pricing and value-based pricing/contracting to improve their cost containment efforts and cost efficiency. As the U.S. private third-party payor market consolidates further, there may be greater pricing pressure from private third-party payors as they continue to drive more of their patients to use lower cost alternatives.

Additionally, there are risks related to intellectual property, technology, and security. This includes the risk of significant breakdown or interruption of IT systems and infrastructure, business disruption, theft of confidential or proprietary information, security threats on facilities or infrastructure, and risks related to the use of artificial intelligence-based software. There are also risks to products, patents, and other intellectual property, such as claims of invalidity, patent infringement, and pressure from various stakeholders or governments that could result in not seeking intellectual property protection or agreeing not to enforce or being restricted from enforcing intellectual property rights related to products. 

Furthermore, there are risks related to government regulation and legal proceedings. This includes the impact of any U.S. healthcare reform or legislation or any significant spending reduction or cost control efforts affecting Medicare, Medicaid or other publicly funded or subsidized health programs. There are also risks related to the impact of product recalls, withdrawals, and other unusual items, including uncertainties related to regulator-directed risk evaluations and assessments. 

Lastly, there are risks related to the impact of disruptions related to climate change and natural disasters, changes in business, political and economic conditions due to actual or threatened terrorist activity, geopolitical instability, political or civil unrest or military action, and the impact of, and risks and uncertainties related to, restructurings and internal reorganizations, as well as any other corporate strategic initiatives and growth strategies, and cost-reduction and productivity initiatives.

{'799b8f96-94fd-4833-a703-01ca53b66d17': {'total_pages': 131,
  'file_path': '/content/pfe-20231231.pdf',
  'source': '26'},
 'af663196-2293-4625-97bf-1de3c7235886': {'total_pages': 131,
  'file_path': '/content/pfe-20231231.pdf',
  'source': '12'},
 '417c0e90-fac1-4b6e-a701-5fbfb14adbe6': {'total_pages': 131,
  'file_path': '/content/pfe-20231231.pdf',
  'source': '11'},
 '7e77ddb5-583a-4dc1-b9f6-2d65c50ba2af': {'total_pages': 131,
  'file_path': '/content/pfe-20231231.pdf',
  'source': '10'}}

#### 6c. Answer for - Complex: Multiple questions asked in a single query.


In [386]:
response = query_engine1.query(cleanedup_user_query3)
display_response(response)
response.metadata

**`Final Response:`** 1. The management of cybersecurity involves extensive reliance on sophisticated IT systems, including cloud services, to operate the business. Large amounts of confidential information, including personal data and intellectual property, are produced, collected, processed, stored, and transmitted. Various technical and procedural controls are deployed to maintain the confidentiality, integrity, and availability of this information. The company also manages relationships with many third-party providers who may have access to their confidential information. Despite investments in data protection and IT, service interruptions, data theft, or unauthorized information disclosure can still occur. The company maintains cyber liability insurance, but it may not be sufficient to cover all losses resulting from an interruption or breach of their systems.

2. The financial statements for 2023 indicate that total revenues were $58.5 billion and net cash flow from operations was $8.7 billion. Both of these figures represent a decrease from 2022, with revenues down by 42% and net cash flow down by 70%. The reported diluted EPS for 2023 was $0.37, a decrease of 93% compared to 2022. The adjusted diluted EPS (Non-GAAP) was $1.84, a decrease of 72% compared to 2022.

{'c897932f-1d72-4041-ac61-a8573c152c4a': {'total_pages': 131,
  'file_path': '/content/pfe-20231231.pdf',
  'source': '33'},
 'a39aca0f-57b4-4f90-be7b-155985a2f2aa': {'total_pages': 131,
  'file_path': '/content/pfe-20231231.pdf',
  'source': '40'},
 '1ebedbb8-4278-4985-a254-f86a8e1f09ac': {'total_pages': 131,
  'file_path': '/content/pfe-20231231.pdf',
  'source': '123'},
 '12da0138-a3b7-4304-a4f4-55a16a556942': {'total_pages': 131,
  'file_path': '/content/pfe-20231231.pdf',
  'source': '63'}}