# Eunoia: A RAG-based Generative AI for Women's Health

This project implements a **Retrieval-Augmented Generation (RAG)** system to provide **accurate and context-aware answers** on women’s health topics.  

- RAG combines **retrieval-based methods** with **generative AI**, enabling the model to access **external knowledge sources dynamically** while generating responses.  
- This helps keep answers **grounded in the retrieved sources**, reducing the hallucinations that are common in standard generative models.  
- The system is designed to assist users with **common health questions, symptom guidance, and educational content** in a **reliable and explainable** manner.

In [None]:
import logging
from pathlib import Path
from docling.document_converter import DocumentConverter

logging.basicConfig(level=logging.INFO)

# Folder with input PDFs
input_folder = Path('/Users/sathammai/Desktop/Research Paper')
# Folder to save Markdown outputs
output_folder = Path('/Users/sathammai/Desktop/ResearchPaper_Markups')
output_folder.mkdir(parents=True, exist_ok=True)

converter = DocumentConverter()

# Iterate over all PDFs in the input folder
for pdf_path in input_folder.glob('*.pdf'):
    logging.info(f'Processing {pdf_path.name}...')

    # Convert PDF
    result = converter.convert(pdf_path)

    # Markdown filename = same as PDF but with .md extension
    md_filename = output_folder / f"{pdf_path.stem}.md"

    # Save Markdown
    result.document.save_as_markdown(md_filename)
    logging.info(f"Saved Markdown: {md_filename}")

print("All PDFs converted to Markdown!")


## Preprocessing

1. Medical papers covering various aspects of women’s health were **manually collected**.  
2. Texts were **extracted from the PDFs** and converted to Markdown using `docling`, preserving headings and structure.  
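3. The Markdown documents were then **split at headings**, and consecutive sections were packed into chunks with a target **range of 300–600 words** (the code below enforces only the 600-word upper bound).  
   - A section longer than 600 words was split into 600-word windows with a **chunk overlap of 100 words** to preserve context between consecutive windows.  
   - Chunks with fewer than 50 words were **merged with the following chunk**, or with the previous chunk when no following chunk exists, to avoid tiny, context-less passages.  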

These steps ensure that each chunk is **self-contained, coherent, and suitable for retrieval**, enabling the RAG system to generate **accurate and context-aware responses**.
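
The overlap rule for oversized sections can be illustrated in isolation. The sketch below is a simplified stand-in, not the notebook's chunker: `sliding_windows` is a hypothetical helper name, and the 1,400-word input is synthetic.

```python
def sliding_windows(words, max_words=600, overlap=100):
    """Split a word list into windows of at most max_words that share `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    step = max_words - overlap  # how far the window start advances each time
    windows = []
    start = 0
    while start < len(words):
        windows.append(words[start:start + max_words])
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
        start += step
    return windows


# Synthetic 1,400-word "section" (illustrative data only)
words = [f"w{i}" for i in range(1400)]
windows = sliding_windows(words)

print([len(w) for w in windows])               # [600, 600, 400]
print(windows[0][-100:] == windows[1][:100])   # True: 100 shared words
```

Each window repeats the last 100 words of the previous one, so a sentence that straddles a boundary is still seen whole in at least one chunk.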


In [None]:
import re

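def smart_hybrid_chunk(text, min_words=300, max_words=600, chunk_overlap=100, tiny_threshold=50):
    """Split Markdown into heading-aware chunks of at most max_words words.

    Note: min_words is currently unused; only tiny_threshold controls
    the merging of small chunks.
    """
    # Split on Markdown headings, anchored at line start so that a '#'
    # appearing mid-line does not trigger a split
    heading_split_pattern = r"(?m)^(#+\s.*\n)"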
    parts = re.split(heading_split_pattern, text)
    parts = [p.strip() for p in parts if p.strip()]

    chunks = []
    buffer = ""

    for part in parts:
        part_words = part.split()
        buffer_words = buffer.split()

        # If adding this part keeps us under max_words, append it
        if len(buffer_words) + len(part_words) <= max_words:
            buffer += " " + part if buffer else part
        else:
            # Otherwise flush the current buffer as a finished chunk
            if buffer_words:
                chunks.append(buffer.strip())
            # If part itself is bigger than max_words, split it
            while len(part_words) > max_words:
                chunks.append(" ".join(part_words[:max_words]))
                part_words = part_words[max_words - chunk_overlap:]
            buffer = " ".join(part_words)

    # Append any leftover buffer
    if buffer:
        chunks.append(buffer.strip())

    # Merge tiny chunks smartly
    final_chunks = []
    i = 0
    while i < len(chunks):
        current = chunks[i]
        current_words = current.split()

        # Merge with next if tiny
        while len(current_words) < tiny_threshold and i + 1 < len(chunks):
            current += " " + chunks[i + 1]
            current_words = current.split()
            i += 1

        # Merge with previous if still tiny
        if len(current_words) < tiny_threshold and final_chunks:
            final_chunks[-1] += " " + current
        else:
            final_chunks.append(current)

        i += 1

    return final_chunks


In [None]:
from pathlib import Path

markdown_folder = Path('/Users/sathammai/Desktop/ResearchPaper_Markups')
output_folder = Path('/Users/sathammai/Desktop/NewChunks')
output_folder.mkdir(exist_ok=True)

for md_file in markdown_folder.glob("*.md"):
    text = md_file.read_text(encoding="utf-8")
    chunks = smart_hybrid_chunk(text)

    for i, chunk in enumerate(chunks):
        chunk_file = output_folder / f"{md_file.stem}_chunk_{i+1}.md"
        chunk_file.write_text(chunk, encoding="utf-8")


### Embedding Chunks using BioBERT

- All text chunks were converted into **vector embeddings** using Hugging Face's **BioBERT** model.  
  - **BioBERT** was chosen because it is pre-trained on biomedical literature (PubMed abstracts), enabling it to **better capture subtle distinctions in medical terminology** than a standard BERT model.

- For each chunk, embeddings were obtained by **mean pooling over all token embeddings** rather than using the `[CLS]` token.  
  - Although mean pooling is slightly more computationally expensive, it provides a **more representative vector** of the entire chunk by aggregating information from all tokens.  
  - Positional information is already encoded within the token embeddings, so mean pooling effectively summarizes the semantic content of the chunk.

- Each chunk was **tokenized** and passed through BioBERT to compute its **embedding vector**, which forms the knowledge base for the RAG retriever.
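
To make the pooling step concrete, here is a minimal pure-Python sketch of masked mean pooling on toy vectors. It mirrors the idea of the `mean_pooling` function in the next cell without requiring `torch`; the helper name `masked_mean_pool` and the numbers are illustrative assumptions only.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                total[j] += value
    count = max(count, 1)  # guard against an all-padding sequence
    return [t / count for t in total]


# Two real tokens followed by one padding token (toy values)
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector does not influence the result, which is why the attention mask is applied before averaging.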


In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
import os

# 1. Load BioBERT
model_name = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # ensure we're in evaluation mode

# 2. Mean pooling function
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state  # [batch_size, seq_len, hidden_dim]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, dim=1)
    sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
    return sum_embeddings / sum_mask

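# 3. Folder with chunks
# NOTE: the chunking cell above writes to .../Desktop/NewChunks; make sure this
# path points at the folder that actually holds the chunk files.
chunks_folder = "/Users/sathammai/Desktop/NewerChunks"
all_chunks = []

# Read all .md files in sorted order so chunk IDs are reproducible across runs
for filename in sorted(os.listdir(chunks_folder)):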
    if filename.endswith(".md"):
        with open(os.path.join(chunks_folder, filename), "r", encoding="utf-8") as f:
            chunk_text = f.read().strip()
            if chunk_text:  # only add non-empty chunks
                all_chunks.append(chunk_text)

print(f"Total non-empty chunks: {len(all_chunks)}")

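# 4. Tokenize and embed chunks in small batches to keep memory bounded
def embed_texts(texts, max_length=512, batch_size=16):
    batches = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # truncation=True drops tokens beyond max_length (BERT's 512-token limit),
        # so very long chunks are only partially embedded
        encoded_input = tokenizer(
            batch, padding=True, truncation=True, return_tensors="pt", max_length=max_length
        )
        with torch.no_grad():
            model_output = model(**encoded_input)
        batches.append(mean_pooling(model_output, encoded_input["attention_mask"]))
    if not batches:
        return torch.empty((0, model.config.hidden_size))
    return torch.cat(batches, dim=0)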

# 5. Compute embeddings
embeddings = embed_texts(all_chunks)

print("Number of chunks:", len(all_chunks))
print("Shape of embeddings:", embeddings.shape)  # [num_chunks, hidden_dim]


### Storing Embeddings in ChromaDB

- The chunk embeddings generated by BioBERT were stored in **ChromaDB**, a vector database that supports **efficient similarity search**.  
- While ChromaDB may not be the most advanced or production-grade solution, it was chosen here for **prototyping purposes**.  
- This setup allows the **RAG system** to quickly retrieve relevant chunks during inference.  
- The database can be **switched to a more robust solution** (e.g., Milvus, Pinecone, or Weaviate) in the future without changing the overall architecture.
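
Conceptually, similarity search ranks stored vectors by how close they are to the query vector. The brute-force sketch below uses cosine similarity on toy 2-D vectors; it is an illustration only, since a vector database relies on its own index and configured distance metric, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy "chunk embeddings" (illustrative values, not BioBERT output)
chunk_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
nearest = top_k([1.0, 0.05], chunk_vecs, k=2)
print(nearest)  # [0, 1]
```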


In [None]:
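#!pip install chromadb

import chromadb

# Persist to disk so the LangChain retriever below, which opens
# "./chroma_db", reads the same collection (chromadb.Client() is in-memory only)
client = chromadb.PersistentClient(path="./chroma_db")

# get_or_create_collection avoids an error when this cell is re-run
collection = client.get_or_create_collection(name="womens_health")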


In [None]:

# Convert embeddings to list
embeddings_list = embeddings.tolist()

# Create IDs for each chunk
ids = [str(i) for i in range(len(all_chunks))]

# Insert into collection
collection.add(
    ids=ids,
    embeddings=embeddings_list,
    metadatas=[{"text": text} for text in all_chunks],
    documents=all_chunks
)
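
The QA pipeline later prints each source document's metadata, so it can help to store where a chunk came from. The sketch below derives an ID and metadata from the `<stem>_chunk_<n>.md` naming used by the chunking cell; `chunk_record` and the example filename are hypothetical, and the notebook itself does not currently record this information.

```python
import re
from pathlib import Path

# Matches names written by the chunking cell: "<source stem>_chunk_<number>"
CHUNK_NAME = re.compile(r"^(?P<source>.+)_chunk_(?P<index>\d+)$")


def chunk_record(path):
    """Build an id and a metadata dict from a chunk file path."""
    stem = Path(path).stem
    match = CHUNK_NAME.match(stem)
    if match is None:
        # Fall back to the bare stem when the name does not follow the pattern
        return {"id": stem, "metadata": {"source": stem}}
    return {
        "id": stem,
        "metadata": {
            "source": match.group("source"),
            "chunk_index": int(match.group("index")),
        },
    }


# Hypothetical chunk file name, for illustration
record = chunk_record("NewChunks/pcos_review_chunk_3.md")
print(record["id"])        # pcos_review_chunk_3
print(record["metadata"])  # {'source': 'pcos_review', 'chunk_index': 3}
```

Records like these could be passed as `ids` and `metadatas` to `collection.add`, making retrieved chunks traceable to their source paper.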


In [None]:
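!ollama pull llama3.2
# `ollama run llama3.2` opens an interactive session and would block the notebook;
# the LangChain Ollama wrapper only needs the Ollama server to be running.
!pip install tensorflow==2.19.0 tensorboard==2.19.0 tf-keras==2.19.0 --quiet
!pip install -U langchain-chroma --quiet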

## RAG-based QA Pipeline

This code implements the **final Retrieval-Augmented Generation (RAG) pipeline** for answering women's health queries:

1. **Query input:**  
   - The user provides a question in natural language (e.g., `"What are the common symptoms of PCOS?"`).

2. **Query embedding & retrieval:**  
   - The query is **embedded using BioBERT** and compared against the **precomputed chunk embeddings** stored in ChromaDB.  
   - The **top-k most relevant chunks** are retrieved to serve as the context for the LLM.

3. **Prompting the LLM:**  
   - The retrieved context is inserted into a **prompt template** and passed to **LLaMA 3.2** via Ollama.  
   - The prompt instructs the model to **answer concisely, accurately, and based only on the provided context**.

4. **Answer generation:**  
   - The LLM generates a **grounded answer** based on the retrieved chunks.  
   - Both the **answer** and the **source documents** used for retrieval are returned for transparency and verification.

This pipeline ensures that answers are **factually informed**, context-aware, and traceable to the original source documents.
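
Step 3 amounts to concatenating the retrieved chunks and substituting them into the template. The sketch below shows that idea in plain Python; `build_prompt`, the shortened template, and the blank-line separator are assumptions for illustration and not LangChain's internal implementation.

```python
# Shortened stand-in for the notebook's prompt template
TEMPLATE = (
    "Answer the user's question based on the following context.\n"
    'If the answer is not in the context, say "Not enough information provided."\n\n'
    "Context:\n{context}\n\n"
    "Question:\n{question}\n\n"
    "Answer:\n"
)


def build_prompt(question, chunks, separator="\n\n"):
    """Join retrieved chunks into one context string and fill the template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


# Placeholder chunk texts stand in for retrieved passages
retrieved = ["[chunk A text]", "[chunk B text]"]
prompt_text = build_prompt("What are the common symptoms of PCOS?", retrieved)
print(prompt_text)
```

Because every retrieved chunk is pasted into a single prompt, the number of chunks (`k`) and their length must fit within the model's context window.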


In [None]:
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_chroma import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings


# -----------------------------
# Load LLaMA 3.2 via Ollama
# -----------------------------
llm = Ollama(model="llama3.2", temperature=0.2)

# -----------------------------
# Connect ChromaDB with BioBERT embeddings
# -----------------------------
# Named embedding_fn so it does not overwrite the `embeddings` tensor
# computed in the embedding cell above
embedding_fn = HuggingFaceEmbeddings(
    model_name="dmis-lab/biobert-base-cased-v1.1"
)

vectordb = Chroma(
    collection_name="womens_health",
    embedding_function=embedding_fn,
    persist_directory="./chroma_db"
)

# -----------------------------
# Create a Prompt Template
# -----------------------------
prompt_template = """
You are a knowledgeable medical assistant specialized in women's reproductive health.
Answer the user's question based on the following context. Be concise, clear, and accurate.
If the answer is not in the context, say "Not enough information provided."

Context:
{context}

Question:
{question}

Answer:
"""

prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

# -----------------------------
# Build the RAG QA Chain
# -----------------------------
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(search_kwargs={"k": 5}),
    chain_type="stuff",
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)

# -----------------------------
# Test the chain with a query
# -----------------------------
query = "What are the common symptoms of PCOS?"
result = qa_chain.invoke({"query": query})

print("Answer:", result['result'])
print("Source docs:", [doc.metadata for doc in result['source_documents']])


No sentence-transformers model found with name dmis-lab/biobert-base-cased-v1.1. Creating a new one with mean pooling.


Answer: PCOS (Polycystic Ovary Syndrome) is a hormonal disorder that affects women of reproductive age. The common symptoms of PCOS include:

* Irregular menstrual cycles or amenorrhea (absence of periods)
* Weight gain and obesity
* Acne
* Excess hair growth on the face, chest, back, and buttocks
* Male pattern baldness
* Cysts on the ovaries (detected by ultrasound)
* Infertility or difficulty getting pregnant
* High levels of androgens (male hormones) in the blood
* Insulin resistance and high blood sugar levels

These symptoms can vary in severity and may not be present in all women with PCOS.
Source docs: [{}, {}, {}, {}, {}]


In [5]:
query = "What are the long-term metabolic and reproductive implications of polycystic ovary syndrome (PCOS) in adolescents, and how do lifestyle interventions compare with pharmacological treatments in mitigating these effects?"
result = qa_chain.invoke({"query": query})
print(result["result"])


The long-term metabolic and reproductive implications of PCOS in adolescents can be significant. Lifestyle interventions, such as diet and physical activity changes, have been shown to improve insulin sensitivity, reduce body mass index (BMI), and promote menstrual regularity. However, lifestyle modification alone may not be sufficient to mitigate all the effects of PCOS.

Pharmacological treatments, particularly metformin, have also been found to be effective in improving metabolic outcomes, such as lowering BMI and subcutaneous adipose tissue, and promoting menstruation. The combination of lifestyle modification and metformin has been shown to have an additive effect on improving cardio-metabolic outcomes in high-risk groups.

In terms of reproductive implications, lifestyle interventions can help improve fertility by regulating menstrual cycles and ovulation. Pharmacological treatments, such as birth control pills, may also be used to regulate menstrual cycles and reduce symptoms of

In [6]:
query = "For patients undergoing laparoscopic excision of deep infiltrating endometriosis, what are the post-surgical recurrence rates, fertility outcomes, and potential complications, and how do these outcomes vary with different surgical techniques or adjunctive medical therapies?"
result = qa_chain.invoke({"query": query})
print(result["result"])

Based on the provided context, here is a concise answer to your question:

For patients undergoing laparoscopic excision of deep infiltrating endometriosis, the post-surgical recurrence rates are as follows:

* Persistence or recurrence rate: 22% at 2 years and 40%-50% at 5 years
* Advanced stage of disease: significantly fewer eggs collected, lower fertilization rate

Fertility outcomes for patients undergoing laparoscopic excision of deep infiltrating endometriosis include:

* Similar reproductive outcomes compared to those without the disease, although cycle cancellation rates are higher
* Natural conception rates of 2% to 4.5% per 30 days for mild cases and <2% for moderate and severe cases

Potential complications associated with laparoscopic excision of deep infiltrating endometriosis include:

* Centralized pain before undergoing a hysterectomy, which may lead to refractory pain
* Reduced ovarian reserve and function due to cystectomy (excision of endometriomas)

The outcomes va

In [7]:
query = "How do aberrant inflammatory cytokine profiles and dysregulated estrogen receptor signaling contribute to the progression of endometriosis, and what are the current therapeutic strategies targeting these pathways?"
result = qa_chain.invoke({"query": query})
print(result["result"])

Aberrant inflammatory cytokine profiles and dysregulated estrogen receptor signaling play significant roles in the progression of endometriosis. Inflammation is a key component of endometriosis pathogenesis, with various cytokines contributing to the development and maintenance of ectopic lesions. Pro-inflammatory cytokines such as TNF-alpha, IL-1beta, and IL-6 promote inflammation, angiogenesis, and cell proliferation in endometriotic tissues.

Dysregulated estrogen receptor signaling is also crucial in endometriosis. Estrogen stimulates the growth and survival of endometrial cells, and its dysregulation can lead to increased cell proliferation, angiogenesis, and tissue adhesion properties. The imbalance of estrogen receptors, particularly the estrogen receptor beta (ERβ), has been linked to endometriosis.

Current therapeutic strategies targeting these pathways include:

1. Anti-inflammatory agents: Nonsteroidal anti-inflammatory drugs (NSAIDs) and corticosteroids can help reduce inf