# **RAG Pipeline**

<img src="Images\langchain_rag.jpg">

**Credits:** [ThatAIGuy GitHub Repository](https://github.com/bansalkanav/Generative-AI-Scratch-2-Advance-By-ThatAIGuy)

## **Building An Insight Retrieval System for Pharmaceutical Sciences**

### **Steps**
- Step 1: Initialize an Embedding Model
- Step 2: Initialize the ChromaDB Connection
- Step 3: Load necessary documents
- Step 4: Split the documents into chunks
- Step 5: Add Chunks to ChromaDB
- Step 6: Create a Retriever Object and apply Similarity Search
- Step 7: Initialize a Chat Prompt Template
- Step 8: Initialize a Generator (i.e. Chat Model)
- Step 9: Initialize an Output Parser
- Step 10: Define a RAG Chain
- Step 11: Invoke the Chain

### **Step 1: Initialize an Embedding Model**

In [1]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embedding_model = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
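
`GoogleGenerativeAIEmbeddings` needs Google API credentials, which `langchain_google_genai` reads from the `GOOGLE_API_KEY` environment variable when no key is passed explicitly. The notebook only calls `load_dotenv()` in Step 8, so on a fresh kernel it may help to confirm the key is visible before this cell runs. A minimal sketch; the helper name `require_env` is hypothetical and not part of any library:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it or add it to a .env file "
            "and call load_dotenv() before creating the model."
        )
    return value


# Usage (assumes the key is stored under GOOGLE_API_KEY):
# require_env("GOOGLE_API_KEY")
```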

### **Step 2: Initialize the ChromaDB Connection**

In [2]:
from langchain_chroma import Chroma

db = Chroma(collection_name="pharma_database",
            embedding_function=embedding_model,
            persist_directory='./pharma_db')

In [3]:
db.get()

{'ids': [],
 'embeddings': None,
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None}

### **Step 3: Load necessary documents**

In [4]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyPDFLoader

loader = DirectoryLoader(
    path="research-papers", glob="*.pdf", show_progress=True, loader_cls=PyPDFLoader
)

data = loader.load()

100%|██████████| 5/5 [00:02<00:00,  1.81it/s]


In [6]:
print("Type of Data Variable: ", type(data))
print()
print("Number of Documents: ", len(data))
print()
print("Type of each datapoints:", type(data[0]))
print()
print("Metadata: ", data[0].metadata)
print()
print("Page Content:")
print(data[0].page_content[:200])

Type of Data Variable:  <class 'list'>

Number of Documents:  39

Type of each datapoints: <class 'langchain_core.documents.base.Document'>

Metadata:  {'source': 'research-papers\\2060-AI-in-Life-Sciences.pdf', 'page': 0}

Page Content:
Executive Insights
Artificial Intelligence in Life Sciences: The Formula for Pharma Success Across the Drug Lifecycle was written by  
Clay Heskett, Partner, Ben Faircloth, Partner, and Stephen Roper,


In [8]:
data[38].metadata

{'source': 'research-papers\\Artificial-Intelligence-in-Pharma-and-Biotech.pdf',
 'page': 10}
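
`PyPDFLoader` yields one `Document` per PDF page, which is why five files produce 39 documents. Grouping by the `source` metadata key shows how the pages are distributed across files. The sketch below uses stand-in metadata dictionaries with hypothetical file names rather than the real `data` list, so the counts are illustrative only:

```python
from collections import Counter

# Stand-in for [doc.metadata for doc in data]; file names are hypothetical.
sample_metadata = [
    {"source": "research-papers/a.pdf", "page": 0},
    {"source": "research-papers/a.pdf", "page": 1},
    {"source": "research-papers/b.pdf", "page": 0},
]


def pages_per_source(metadata_list):
    """Count how many page-level documents came from each source file."""
    return Counter(meta["source"] for meta in metadata_list)


page_counts = pages_per_source(sample_metadata)
for source, n_pages in page_counts.items():
    print(f"{source}: {n_pages} page(s)")
```

With the notebook's own objects, the equivalent call would be `pages_per_source(doc.metadata for doc in data)`.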

### **Step 4: Split the documents into chunks**

In [9]:
doc_metadata = [doc.metadata for doc in data]
doc_content = [doc.page_content for doc in data]

In [18]:
doc_metadata[0], doc_content[0][:100]

({'source': 'research-papers\\2060-AI-in-Life-Sciences.pdf', 'page': 0},
 'Executive Insights\nArtificial Intelligence in Life Sciences: The Formula for Pharma Success Across t')

In [19]:
from langchain_text_splitters.sentence_transformers import SentenceTransformersTokenTextSplitter

st_text_splitter = SentenceTransformersTokenTextSplitter(model_name="sentence-transformers/all-mpnet-base-v2", 
                                                         chunk_size=100, 
                                                         chunk_overlap=50)

st_chunks = st_text_splitter.create_documents(doc_content, doc_metadata)
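
With `chunk_size=100` and `chunk_overlap=50`, each chunk shares roughly half of its tokens with the previous one, so consecutive windows start about 50 tokens apart. The sketch below illustrates that sliding-window idea on a plain list of integers standing in for token IDs. It is not the splitter's actual implementation, and `sliding_windows` is a hypothetical helper:

```python
def sliding_windows(tokens, chunk_size, chunk_overlap):
    """Split a token list into overlapping windows of at most chunk_size tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return windows


tokens = list(range(250))  # stand-in for 250 token IDs
windows = sliding_windows(tokens, chunk_size=100, chunk_overlap=50)
print([(w[0], w[-1]) for w in windows])
```

The real splitter counts tokens with the tokenizer of `all-mpnet-base-v2` rather than characters or words, which is likely also why the chunk text comes back lowercased with spaces around punctuation.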

In [29]:
print("Total number of documents inside chunks:", len(st_chunks))
print()
for i, chunk in enumerate(st_chunks, start=1):
    print(f"Document {i} metadata: {chunk.metadata}")
    print(f"Document {i} chunks: {chunk.page_content[:100]}")
    if i == 5: break
    print("-" * 100)

Total number of documents inside chunks: 95

Document 1 metadata: {'source': 'research-papers\\2060-AI-in-Life-Sciences.pdf', 'page': 0}
Document 1 chunks: executive insights artificial intelligence in life sciences : the formula for pharma success across 
----------------------------------------------------------------------------------------------------
Document 2 metadata: {'source': 'research-papers\\2060-AI-in-Life-Sciences.pdf', 'page': 0}
Document 2 chunks: autonomy and iteratively optimize their processes. within life sciences, we apply the term “ ai ” to
----------------------------------------------------------------------------------------------------
Document 3 metadata: {'source': 'research-papers\\2060-AI-in-Life-Sciences.pdf', 'page': 1}
Document 3 chunks: executive insights page 2 l. e. k. consulting / ex ecutive insights, volume xx, issue 60ai ’ s abili
----------------------------------------------------------------------------------------------------
Document 4 metad

### **Step 5: Add Chunks to ChromaDB**

In [30]:
db.add_documents(st_chunks)

['3ff85bd2-094a-4e76-a6d3-2509a951aafc',
 '97268648-87a0-41b8-8ca9-333ab3b8e7dd',
 '43cc2aff-d3f4-4771-89a2-4b97bdf5d65c',
 'a2db085e-8803-4118-a44f-d6caa05524d9',
 'd747c9b7-885f-4125-805a-985ace9c1d50',
 '7f8a6aa7-dd8f-46ad-9dc7-c3f28455391a',
 'd01ca10d-12ed-47a3-bb3f-7d231b62a1e6',
 '5503ab5e-29fd-40aa-976b-85f04f0aee3f',
 '35585c3d-14af-4053-b7a8-04e601572948',
 '3c325288-4804-475c-b44d-8ad5d05919c8',
 'd65df7f8-774a-4c35-99be-14d7b2a75b91',
 '73dde220-2f4d-4002-b1fb-2205ea2b6812',
 'dd669d2a-24e6-432b-847a-3927d799f6a4',
 '5c21cff9-148c-404d-bff0-b371140b6e27',
 '50fd0bdc-7cce-4328-9378-880e59c3005d',
 '0af11247-4d2e-4cf8-af43-ea5ae61a1073',
 '1a214041-7d36-49b6-bb65-7938cf1eb1ad',
 '4127fb2b-736f-4a1a-9786-c208be0c443e',
 '422a5fe4-959b-4fd1-877d-f6a5630b95f0',
 '32412ab0-8335-46e7-9e20-c1be6caa214c',
 '8da8e847-a2e6-45e3-960b-208d082efcd5',
 '8dbb4feb-9727-4296-bbe1-af38bc702cea',
 'bcbf086c-21e7-48ff-94c3-28b452d0e488',
 '4da567bf-008d-457f-a7be-58045e3a23cb',
 '23ce66fd-7a75-

In [45]:
db.get()['documents'][1][:100]

'who should take this course? this program is designed for business leaders in pharmaceutical science'

In [38]:
db.get()['metadatas'][:5]

[{'page': 0, 'source': 'research-papers\\AI-In-Pharmaceutical-Industries.pdf'},
 {'page': 2,
  'source': 'research-papers\\Artificial-Intelligence-in-Pharma-and-Biotech.pdf'},
 {'page': 1, 'source': 'research-papers\\AI-In-Pharmaceutical-Industries.pdf'},
 {'page': 1, 'source': 'research-papers\\4839-AI-In-Pharmacy-Article.pdf'},
 {'page': 7, 'source': 'research-papers\\4839-AI-In-Pharmacy-Article.pdf'}]

> ---> Pharmaceutical Sciences Database is ready! <---
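
The IDs returned by `add_documents` above are random UUIDs, so running that cell a second time would store every chunk again under new IDs. One common way to make the step repeatable is to derive each ID from the chunk itself. The sketch below hashes source, page, and text with SHA-256; `chunk_id` is a hypothetical helper, and passing the result as `db.add_documents(st_chunks, ids=ids)` is an assumption about how you would wire it in, not something the original notebook does:

```python
import hashlib


def chunk_id(source: str, page: int, text: str) -> str:
    """Build a stable ID from a chunk's source file, page number, and text."""
    key = f"{source}|{page}|{text}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()


# Stand-in chunks as (metadata, text) pairs; values are hypothetical.
sample_chunks = [
    ({"source": "research-papers/a.pdf", "page": 0}, "first chunk of text"),
    ({"source": "research-papers/a.pdf", "page": 0}, "second chunk of text"),
]

ids = [chunk_id(meta["source"], meta["page"], text) for meta, text in sample_chunks]
print(ids[0][:12], ids[1][:12])
```

Two chunks with identical source, page, and text would get the same ID under this scheme, so a real pipeline may need to deduplicate them or add a position counter to the hashed key.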

### **Step 6: Create a Retriever Object and apply Similarity Search**

In [46]:
retriever = db.as_retriever(search_type="similarity", search_kwargs={'k': 5})
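
`search_type="similarity"` with `k=5` asks the store for the five chunks whose embeddings are closest to the query embedding. The sketch below shows the idea with cosine similarity over tiny hand-made vectors. Chroma's actual distance metric and index are configured inside the database, so treat this as an illustration rather than what `db.as_retriever` runs; `top_k` and the vectors are hypothetical:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
doc_vecs = {
    "drug discovery": [0.9, 0.1, 0.0],
    "clinical trials": [0.6, 0.4, 0.0],
    "supply chain": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.0, 0.0], doc_vecs, k=2)
print([name for name, _ in results])
```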

### **Step 7: Initialize a Chat Prompt Template**

In [47]:
from langchain_core.prompts import ChatPromptTemplate

PROMPT_TEMPLATE = """
You are a highly knowledgeable assistant specializing in pharmaceutical sciences. 
Answer the question based only on the following context:
{context}

Answer the question based on the above context:
{question}

Use the provided context to answer the user's question accurately and concisely.
Don't justify your answers.
Don't give information not mentioned in the CONTEXT INFORMATION.
Do not say "according to the context" or "mentioned in the context" or similar.
"""

prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
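
`ChatPromptTemplate.from_template` treats `{context}` and `{question}` as placeholders that are filled in when the chain runs. To see roughly what the model receives, plain `str.format` performs the same substitution for a simple template like this one. The shortened template and sample values below are hypothetical:

```python
# Shortened stand-in for PROMPT_TEMPLATE, with the same two placeholders.
demo_template = (
    "Answer the question based only on the following context:\n"
    "{context}\n\n"
    "Question: {question}\n"
)

filled_prompt = demo_template.format(
    context="AI is used in drug discovery and clinical studies.",
    question="Where is AI used in pharma?",
)
print(filled_prompt)
```

On the real object, `prompt_template.invoke({"context": ..., "question": ...})` performs the substitution and wraps the result in a chat message.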

### **Step 8: Initialize a Generator (i.e. Chat Model)**

In [49]:
from dotenv import load_dotenv
load_dotenv()

True

In [50]:
from langchain_google_genai import ChatGoogleGenerativeAI

chat_model = ChatGoogleGenerativeAI(
    model="gemini-1.5-pro",
    temperature=1
)

### **Step 9: Initialize an Output Parser**

In [52]:
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()

### **Step 10: Define a RAG Chain**

In [53]:
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt_template
    | chat_model
    | output_parser
)
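
The `|` operator pipes each step's output into the next. The leading dictionary runs both of its entries on the same input: the query goes through the retriever and `format_docs` to become `context`, and passes through unchanged as `question`. The sketch below mimics that data flow with ordinary functions; `fake_retrieve` and `fake_model` are hypothetical stand-ins for the real retriever and chat model:

```python
def fake_retrieve(query):
    """Stand-in retriever: returns canned text chunks regardless of the query."""
    return ["AI speeds up drug discovery.", "AI supports clinical studies."]


def format_chunks(chunks):
    """Join chunk texts with blank lines, like format_docs does for Documents."""
    return "\n\n".join(chunks)


def build_prompt(inputs):
    """Fill a tiny prompt from the context/question dictionary."""
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"


def fake_model(prompt):
    """Stand-in chat model: reports the prompt size instead of calling an API."""
    return f"(model saw {len(prompt)} characters)"


def run_chain(query):
    # The dict stage fans the query out to two branches.
    inputs = {"context": format_chunks(fake_retrieve(query)), "question": query}
    # Prompt, then model; the output parser would return the text as-is.
    return fake_model(build_prompt(inputs))


answer = run_chain("How is AI used in pharma?")
print(answer)
```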

### **Step 11: Invoke the Chain**

In [55]:
query = "What is Pharmaceutical industry?"

rag_chain.invoke(query)

'The pharmaceutical industry uses AI in many areas including drug discovery and development, drug repurposing, accelerating pharmaceutical manufacturing, and clinical studies.  Top pharmaceutical companies are using AI in manufacturing for research and development.\n'

#### **Render the Response as Markdown (Optional)**

In [56]:
from IPython.display import display
from IPython.display import Markdown
import textwrap

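def to_markdown(text):
    """Render text as a Markdown blockquote, converting '•' bullets to '*'."""
    text = text.replace('•', '  *')
    # predicate=lambda _: True also prefixes blank lines, keeping the quote unbroken.
    return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))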
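
`to_markdown` turns the response into a Markdown blockquote by prefixing every line with `> `. The `predicate=lambda _: True` argument matters because `textwrap.indent` skips whitespace-only lines by default, which would split the quote at each blank line. A small demonstration on a hypothetical response string, without the IPython dependency:

```python
import textwrap

sample = "AI is used in:\n\n• Drug discovery"

# Same transformation as to_markdown, minus the Markdown() wrapper.
as_bullets = sample.replace("•", "  *")
quoted_all = textwrap.indent(as_bullets, "> ", predicate=lambda _: True)
quoted_default = textwrap.indent(as_bullets, "> ")

print(quoted_all)
print("---")
print(quoted_default)
```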

In [57]:
query = "What are the different AI applications used in pharmaceutical industry?"

response = rag_chain.invoke(query)

to_markdown(response)

> AI is used in:
> 
> * **Target discovery and drug discovery:**  Companies like BenevolentAI, Atomwise, and XtalPi use AI for this purpose.
> * **Optimizing clinical processes:**  Companies like Antidote and Bullfrog AI use patient data for recruitment and monitoring.
> * **Post-development activities:** This includes patient monitoring (CardioDiagnostics), compliance monitoring (AiCure), and marketing optimization (Eularis).
> * **Patient support:** Novo Nordisk's chatbot Sofia answers patient questions.  
> * **Drug manufacturing:** AI optimizes processes, controls quality, manages supply chains, and performs predictive maintenance.
