# Healthcare Assistant Chatbot using RAG

#### This project implements an AI-powered healthcare assistant that answers medical queries from healthcare-related PDF documents. It uses Retrieval-Augmented Generation (RAG): Google's Gemini API embeds the document chunks, Zilliz Cloud stores and searches the vectors, and an LLM called through OpenRouter generates each answer with the retrieved chunks supplied as context.

### Reading PDF Files

In [5]:
import os
from PyPDF2 import PdfReader


pdf_folder = "healthcare_pdfs"

# Function to read PDF content
def read_pdf_text(file_path):
    reader = PdfReader(file_path)
    text = ''
    for page in reader.pages:
        page_text = page.extract_text()
        if page_text:
            text += page_text + '\n'
    return text

# Read and print first few lines from each PDF
for pdf_file in os.listdir(pdf_folder):
    if pdf_file.endswith(".pdf"):
        file_path = os.path.join(pdf_folder, pdf_file)
        print(f"\n📄 Reading file: {pdf_file}")
        text = read_pdf_text(file_path)
        lines = text.strip().splitlines()
        print("🧾 First 5 lines:")
        for line in lines[:5]:
            print(f"   {line}")



📄 Reading file: disease-handbook-complete.pdf
🧾 First 5 lines:
   Disease Handbook  
   for 
   Childcare Providers  
        
   New Hampshire Department of Health and Human Services  

📄 Reading file: Guide-to-Common-Childhood-Infections-2023_Final-Approved.pdf
🧾 First 5 lines:
   Signs and 
   Symptoms of 
   Infectious  Diseases  How Infectious 
   Diseases  Spread  How to Prevent 
   Spread  of Infectious 

📄 Reading file: Outpatient Guide 508.pdf
🧾 First 5 lines:
   GUIDE TO INFECTION PREVENTION  
   FOR OUTPATIENT SETTINGS:
    MINIMUM EXPECTATIONS FOR SAFE CARE
   National Center for Emerging and Zoonotic Infectious Diseases
   Division of Healthcare Quality Promotion

📄 Reading file: quick-guide-to-common-childhood-diseases.pdf
🧾 First 5 lines:
   A 
   Quick Guide 
   To 
   Common Childhood  
   Diseases 
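
The extracted lines above show typical PDF extraction artifacts: titles broken across several lines, doubled spaces, and whitespace-only lines. As a minimal sketch, assuming it is acceptable to collapse whitespace before chunking, a cleanup step could look like this. `normalize_whitespace` is a hypothetical helper and is not part of the original notebook.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and drop whitespace-only lines."""
    cleaned_lines = []
    for line in text.splitlines():
        # Collapse internal runs of spaces or tabs into a single space
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:  # skip lines that were empty or whitespace-only
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)

# Sample text shaped like the extraction output shown above
sample = "Disease Handbook  \nfor \nChildcare Providers  \n     \nNew Hampshire"
print(normalize_whitespace(sample))
```

It could be applied to the return value of `read_pdf_text` before splitting; whether that improves retrieval for these documents is untested here.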


### Splitting Text into Chunks

In [6]:
import os
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter


pdf_folder = './healthcare_pdfs'

# Split text into overlapping chunks and tag each chunk with an ID and its source document
def split_documents(text, document_name):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100,
        separators=["\n\n", "\n", " ", ""]
    )
    chunks = text_splitter.split_text(text)

    documents = []
    for i, chunk in enumerate(chunks):
        documents.append({
            "id": f"{document_name}_{i}",
            "text": chunk,
            "source": document_name
        })
    
    return documents

# Function to extract text from a PDF
def read_pdf_text(file_path):
    reader = PdfReader(file_path)
    text = ''
    for page in reader.pages:
        page_text = page.extract_text()
        if page_text:
            text += page_text + '\n'
    return text

# Process all PDFs and split into chunks
all_docs = []

for pdf_file in os.listdir(pdf_folder):
    if pdf_file.endswith(".pdf"):
        file_path = os.path.join(pdf_folder, pdf_file)
        print(f"📄 Reading and chunking: {pdf_file}")
        text = read_pdf_text(file_path)
        chunks = split_documents(text, pdf_file.replace('.pdf', ''))
        all_docs.extend(chunks)

# Output basic stats
print(f"\n✅ Total chunks created: {len(all_docs)}")
print(f"🧾 Sample chunk (first 100 chars):\n{all_docs[0]['text'][:100]}...")


📄 Reading and chunking: disease-handbook-complete.pdf
📄 Reading and chunking: Guide-to-Common-Childhood-Infections-2023_Final-Approved.pdf
📄 Reading and chunking: Outpatient Guide 508.pdf
📄 Reading and chunking: quick-guide-to-common-childhood-diseases.pdf

✅ Total chunks created: 524
🧾 Sample chunk (first 100 chars):
Disease Handbook  
for 
Childcare Providers  
     
New Hampshire Department of Health and Human Ser...
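
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` additionally prefers to break at the listed separators, so its chunk boundaries will differ. `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the 100-character overlap plays in the cell above: a sentence cut at a boundary still appears whole in at least one chunk more often.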


### Generating Embeddings

In [7]:
import os
from langchain_google_genai import GoogleGenerativeAIEmbeddings

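# Set the Google Gemini API key (placeholder; replace with your own key and avoid committing it)
os.environ["GOOGLE_API_KEY"] = "MY API KEY"  

# Initialize the Gemini embedding model
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# Embed each chunk and keep its ID, text, and source alongside the vector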
def generate_embeddings(documents):
    embedded_docs = []
    for doc in documents:
        try:
            embedding_vector = embeddings.embed_query(doc["text"])
            embedded_doc = {
                "id": doc["id"],
                "text": doc["text"],
                "source": doc["source"],
                "embedding": embedding_vector
            }
            embedded_docs.append(embedded_doc)
        except Exception as e:
            print(f"❌ Error embedding document {doc['id']}: {str(e)}")
    
    return embedded_docs

# Generate and store embeddings in memory
embedded_docs = generate_embeddings(all_docs)

# Summary
print(f"✅ Generated embeddings for {len(embedded_docs)} documents.")
print(f"🔢 Sample embedding (first 5 values): {embedded_docs[0]['embedding'][:5]}")


✅ Generated embeddings for 524 documents.
🔢 Sample embedding (first 5 values): [-0.0005785746616311371, -0.028399024158716202, -0.04462270438671112, 0.003386411117389798, 0.08531650900840759]
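
The cell above makes one API call per chunk, 524 in total. A possible alternative is to embed chunks in batches. The sketch below assumes a batch-capable embedding callable is available; `batched`, `embed_in_batches`, and the stand-in `fake_embed_batch` are hypothetical names used only for illustration.

```python
from typing import Callable, Iterator

def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(documents: list, embed_batch: Callable, batch_size: int = 50) -> list:
    """Embed documents batch by batch; embed_batch maps a list of texts to a list of vectors."""
    embedded = []
    for batch in batched(documents, batch_size):
        vectors = embed_batch([doc["text"] for doc in batch])
        for doc, vector in zip(batch, vectors):
            # Copy the original fields and attach the vector
            embedded.append({**doc, "embedding": vector})
    return embedded

# Stand-in embedder for demonstration only: one-dimensional "vector" = text length
def fake_embed_batch(texts):
    return [[float(len(t))] for t in texts]

docs = [{"id": f"d_{i}", "text": "x" * i, "source": "demo"} for i in range(5)]
result = embed_in_batches(docs, fake_embed_batch, batch_size=2)
print(len(result), result[3]["embedding"])  # 5 [3.0]
```

With the notebook's objects, `embed_batch` could be `embeddings.embed_documents`, the list-based method of the LangChain embeddings interface. The batch size of 50 is an arbitrary assumption; check your API limits before choosing one.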


### Set up Zilliz Cloud connection

In [8]:

from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType, utility

def connect_to_milvus():
    try:
        connections.connect(
            alias="default",
            uri="Zilliz public endpoint",
            token="My Token",
            secure=True
        )
        print("Connected to Zilliz Cloud")
        return True
    except Exception as e:
        print(f"Error connecting to Zilliz Cloud: {e}")
        return False  # explicit failure value instead of an implicit None

In [14]:
from pymilvus import FieldSchema, CollectionSchema, DataType, Collection, utility

# Create or load the collection
def create_collection(collection_name="healthcare_docs", dimension=768):
    if utility.has_collection(collection_name):
        print(f"✅ Collection '{collection_name}' already exists. Loading it.")
        return Collection(collection_name)

    fields = [
        FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=150),
        FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=8000),
        FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=150),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dimension),
    ]

    schema = CollectionSchema(fields, description="Healthcare document chunks with Gemini embeddings")
    collection = Collection(name=collection_name, schema=schema)

    # ✅ Create vector index
    index_params = {
        "metric_type": "L2",
        "index_type": "IVF_FLAT",
        "params": {"nlist": 1024}
    }
    collection.create_index(field_name="embedding", index_params=index_params)
    collection.load()
    print(f"✅ Collection '{collection_name}' created and loaded.")
    return collection

# ✅ Insert documents
def insert_documents(collection, embedded_docs):
    # Build column-wise lists in the same order as the schema fields
    ids = [doc["id"] for doc in embedded_docs]
    texts = [doc["text"] for doc in embedded_docs]
    sources = [doc["source"] for doc in embedded_docs]
    vectors = [doc["embedding"] for doc in embedded_docs]  # renamed to avoid shadowing the global `embeddings` model

    entities = [ids, texts, sources, vectors]
    collection.insert(entities)
    collection.flush()
    print(f"✅ Inserted {len(ids)} documents into '{collection.name}' collection.")
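
`insert_documents` inserts whatever it is given, so calling it again on a collection that already holds the same chunks can store them a second time. A minimal sketch of a guard follows, assuming the IDs already in the collection have been fetched into a Python collection beforehand (how they are fetched is left out). `filter_new_docs` is a hypothetical helper.

```python
def filter_new_docs(embedded_docs, existing_ids):
    """Return only the documents whose ID is not already stored."""
    seen = set(existing_ids)
    new_docs = []
    for doc in embedded_docs:
        if doc["id"] not in seen:
            new_docs.append(doc)
            seen.add(doc["id"])  # also guards against duplicates within this batch
    return new_docs

# Toy data: "doc_0" is already stored, "doc_1" appears twice in the new batch
candidates = [{"id": "doc_0"}, {"id": "doc_1"}, {"id": "doc_1"}, {"id": "doc_2"}]
to_insert = filter_new_docs(candidates, existing_ids=["doc_0"])
print([doc["id"] for doc in to_insert])  # ['doc_1', 'doc_2']
```

The filtered list could then be passed to `insert_documents` in place of the full `embedded_docs`.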


In [19]:
# 🧠 Vector search function (using the query embedding)
def search_similar_docs(collection, query_embedding, top_k=5):
    collection.load()
    search_params = {"metric_type": "L2", "params": {"nprobe": 10}}

    results = collection.search(
        data=[query_embedding],
        anns_field="embedding",
        param=search_params,
        limit=top_k,
        output_fields=["id", "text", "source"]
    )

    similar_docs = []
    for hits in results:
        for hit in hits:
            similar_docs.append({
                "id": hit.entity.get("id"),
                "text": hit.entity.get("text"),
                "source": hit.entity.get("source"),
                "distance": hit.distance
            })

    return similar_docs


In [30]:
# 🔌 Connect to Zilliz
connect_to_milvus()

# 📦 Create or load collection
collection = create_collection("healthcare_docs", dimension=768)

# ⬆️ Insert embedded documents
insert_documents(collection, embedded_docs)

# 🔍 Test search
query = "What are symptoms of diabetes?"
query_embedding = embeddings.embed_query(query)

similar_docs = search_similar_docs(collection, query_embedding, top_k=3)

# 🧾 Print top docs
for doc in similar_docs:
    print(f"\n📄 Source: {doc['source']}\n✏️ Chunk: {doc['text'][:200]}...\n📐 Distance: {doc['distance']}")


Connected to Zilliz Cloud
✅ Collection 'healthcare_docs' already exists. Loading it.
✅ Inserted 524 documents into 'healthcare_docs' collection.

📄 Source: disease-handbook-complete
✏️ Chunk: What are the symptoms?  
There are a wide range of signs and symptoms 
seen in HIV -infected children.  Symptoms may 
include failure to  thrive, weight loss, fever, mild 
or severe developmental dela...
📐 Distance: 0.7271164655685425

📄 Source: disease-handbook-complete
✏️ Chunk: adults aged 55-59.  Most cases of Lyme disease 
occur between April and October. Current data 
indicates that it is possible for someone to get Lyme disease more than once.  
 What are the symptoms?  ...
📐 Distance: 0.7748762965202332

📄 Source: Guide-to-Common-Childhood-Infections-2023_Final-Approved
✏️ Chunk: *Indicates a reportable disease           Return to top  Rotavirus  
What  are the 
symptoms?  
Fever,  vomiting  followed  
by watery  diarrhea.  
Symptoms  usually  last 
four  to 
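
The `Distance` values above come from the `L2` metric configured on the index: a smaller value means the chunk's embedding lies closer to the query's. As an illustration of what the search ranks by, here is a brute-force sketch over in-memory vectors. It computes exact Euclidean distances with no index, whereas an `IVF_FLAT` index only probes some clusters, and the absolute values Milvus reports may be scaled differently. `brute_force_l2_search` and the toy vectors are hypothetical.

```python
import math

def brute_force_l2_search(query_vector, docs, top_k=3):
    """Rank docs by Euclidean distance between query_vector and each doc's embedding."""
    scored = [
        {**doc, "distance": math.dist(query_vector, doc["embedding"])}
        for doc in docs
    ]
    scored.sort(key=lambda d: d["distance"])  # smallest distance first
    return scored[:top_k]

# Toy two-dimensional "embeddings" for demonstration only
toy_docs = [
    {"id": "a", "embedding": [0.0, 0.0]},
    {"id": "b", "embedding": [3.0, 4.0]},
    {"id": "c", "embedding": [1.0, 0.0]},
]
hits = brute_force_l2_search([0.0, 0.0], toy_docs, top_k=2)
print([(h["id"], h["distance"]) for h in hits])  # [('a', 0.0), ('c', 1.0)]
```

The search always returns the `top_k` nearest chunks, even when none is about the query topic, which is why a query on diabetes can return chunks on HIV, Lyme disease, and rotavirus.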

In [34]:
import requests


OPENROUTER_API_KEY = "My API KEY" 

# Function to call OpenRouter LLM API
def generate_answer(query, retrieved_docs):
    # 🔧 Build context from top retrieved chunks
    context = "\n\n".join([doc['text'] for doc in retrieved_docs])

    # 🧠 Format prompt for LLM
    prompt = f"""You are a helpful medical assistant chatbot. Use the following context to answer the user's question ethically and accurately.

Context:
{context}

Question:
{query}

Answer:"""

    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json"
    }

    body = {
        "model": "openai/gpt-3.5-turbo", 
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.2
    }

    # A timeout keeps the call from hanging indefinitely if the API does not respond
    response = requests.post("https://openrouter.ai/api/v1/chat/completions", headers=headers, json=body, timeout=30)

    if response.status_code == 200:
        answer = response.json()['choices'][0]['message']['content']
        return answer.strip()
    else:
        print("❌ Error:", response.text)
        return "Sorry, I couldn't generate an answer at the moment."


In [32]:
# 🔎 User query
user_query = "What are the side effects of paracetamol?"

# 🔗 Embed query using Gemini
query_embedding = embeddings.embed_query(user_query)

# 🔍 Retrieve similar docs from Zilliz
top_docs = search_similar_docs(collection, query_embedding, top_k=3)

# 🧠 Generate answer using OpenRouter LLM
final_answer = generate_answer(user_query, top_docs)

# 📢 Output
print("\n💬 Chatbot Answer:\n", final_answer)



💬 Chatbot Answer:
 Common side effects of paracetamol (also known as acetaminophen) include nausea, vomiting, and liver damage if taken in high doses. It is important to follow the recommended dosage instructions and consult with a healthcare provider if you experience any concerning side effects.


In [33]:
# 🔎 User query
user_query = "What are the symptoms of malaria?"

# 🔗 Embed query using Gemini
query_embedding = embeddings.embed_query(user_query)

# 🔍 Retrieve similar docs from Zilliz
top_docs = search_similar_docs(collection, query_embedding, top_k=3)

# 🧠 Generate answer using OpenRouter LLM
final_answer = generate_answer(user_query, top_docs)

# 📢 Output
print("\n💬 Chatbot Answer:\n", final_answer)



💬 Chatbot Answer:
 Malaria is not listed in the provided context, but I can still provide you with information on its symptoms. Symptoms of malaria typically include fever, chills, sweats, headaches, muscle aches, fatigue, nausea, and vomiting. In more severe cases, malaria can lead to jaundice, seizures, coma, or even death. It is important to seek medical attention if you suspect you have malaria, as it is a serious and potentially life-threatening disease.
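
In the malaria answer above, the model says the topic is not in the provided context and then answers from its own general knowledge. If answers should stay tied to the PDFs, one option is to drop weak matches and tell the model to decline when the context lacks the answer. The following is a sketch under assumptions: `filter_by_distance`, `build_grounded_prompt`, `NO_CONTEXT_REPLY`, and the threshold value are hypothetical, and the threshold would need tuning against real queries.

```python
NO_CONTEXT_REPLY = "I could not find this in the provided documents."

def filter_by_distance(docs, max_distance):
    """Keep only retrieved chunks whose distance is at or below max_distance."""
    return [doc for doc in docs if doc["distance"] <= max_distance]

def build_grounded_prompt(query, docs):
    """Build a prompt restricted to the given chunks; return None if no chunks remain."""
    if not docs:
        return None  # caller can answer with NO_CONTEXT_REPLY without calling the LLM
    context = "\n\n".join(doc["text"] for doc in docs)
    return (
        "You are a helpful medical assistant chatbot. "
        "Answer using only the context below. "
        f'If the context does not contain the answer, reply exactly: "{NO_CONTEXT_REPLY}"\n\n'
        f"Context:\n{context}\n\nQuestion:\n{query}\n\nAnswer:"
    )

# Toy retrieval results for demonstration only
retrieved = [
    {"text": "Rotavirus causes fever and vomiting.", "distance": 0.4},
    {"text": "Lyme disease occurs April to October.", "distance": 0.9},
]
kept = filter_by_distance(retrieved, max_distance=0.5)  # 0.5 is an arbitrary example value
prompt = build_grounded_prompt("What are the symptoms of rotavirus?", kept)
print(prompt)
```

The resulting prompt could replace the one built inside `generate_answer`; when `build_grounded_prompt` returns `None`, the caller can return `NO_CONTEXT_REPLY` directly.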
