# Research Analyst Agent for Hypoglycemia Analysis

## Overview
This notebook implements a **Research Analyst Agent** that processes medical literature about hypoglycemia (low blood sugar) in diabetic patients. The agent uses:
- **Document Processing**: Load and chunk PDF research papers
- **Vector Search**: Find relevant information using semantic similarity
- **LLM Summarization**: Generate evidence-based summaries

## Architecture
1. **Data Ingestion**: Load PDF documents from medical literature
2. **Embedding Model**: Convert text to vectors using HuggingFace embeddings
3. **Vector Store**: FAISS (Facebook AI Similarity Search) for efficient retrieval
4. **LLM**: FLAN-T5 for generating research summaries

## Use Case
Given a query about hypoglycemia, the agent:
1. Searches through medical papers
2. Retrieves the most relevant information
3. Generates an evidence-based summary with citations

---

## Step 1: Install Required Packages

**What**: Install the LangChain community package
**Why**: Provides document loaders, vector stores, and integrations with various LLMs
**Key**: LangChain is modular - we need `langchain-community` for PDF loading and FAISS integration

In [2]:
pip install langchain-community

Note: you may need to restart the kernel to use updated packages.


## Step 2: Import Core Libraries

**Imports explained**:
- `PyPDFLoader`: Extracts text from PDF files page by page
- `RecursiveCharacterTextSplitter`: Splits large documents into smaller chunks while preserving context
- `FAISS`: Vector database for similarity search (used by Facebook, very fast)
- `HuggingFaceEmbeddings`: Converts text to numerical vectors (embeddings)
- `OpenAIEmbeddings`, `ChatOpenAI`: Alternative options (not used in final version)
- `HumanMessage`, `SystemMessage`: For structuring conversations with LLMs

In [2]:
import os
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage


In [11]:
from langchain_community.embeddings import HuggingFaceEmbeddings

In [23]:
%pip install sentence-transformers


Note: you may need to restart the kernel to use updated packages.


## Step 3: Initialize Embedding Model

**What**: Create an embedding model using HuggingFace
**Model**: `all-MiniLM-L6-v2`
- Small, fast, and efficient (22M parameters)
- Converts text to 384-dimensional vectors
- Runs locally (no API costs, data stays private)
- Works well for general-purpose semantic search

**Why this model?**:
- ✅ Free and open-source
- ✅ Good balance of speed vs. accuracy
- ✅ Suitable for medical text (though domain-specific models like BiomedBERT could be better)
- ✅ Downloads automatically on first run (~90MB)

In [None]:
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

## Step 4: Set Data Directory

**What**: Point to the folder containing hypoglycemia research PDFs
**Path**: Absolute path ensures it works regardless of where notebook is run
**Data**: Contains multiple PDF research papers about hypoglycemia from various sources (WHO, medical journals, etc.)

In [24]:
DATA_DIR = Path("/home/santanu/code/SciPrimeX/Glucoza-Agent/research_analyst/data/hypoglycemia")


## Step 5: Load All PDF Documents

**Process**:
1. Iterate through all PDF files in the directory
2. For each PDF, use `PyPDFLoader` to extract text page by page
3. Add source filename to metadata for citation tracking
4. Combine all documents into one list

**Output**: List of Document objects, each containing:
- `page_content`: The actual text from a PDF page
- `metadata`: Source filename and page number

**Why metadata?**: Essential for citing sources in the final summary

In [25]:
documents = []

for pdf_path in DATA_DIR.glob("*.pdf"):
    loader = PyPDFLoader(str(pdf_path))
    docs = loader.load()
    for d in docs:
        d.metadata["source"] = pdf_path.name
    documents.extend(docs)

len(documents)


154

## Step 6: Split Documents into Chunks

**Why chunk?**:
- PDFs are too long for embedding models (which have token limits)
- Smaller chunks = more precise retrieval
- Better to retrieve 3 relevant paragraphs than 1 entire paper

**Parameters**:
- `chunk_size=800`: Each chunk is ~800 characters (roughly 150-200 words)
- `chunk_overlap=150`: 150 characters overlap between chunks to preserve context
  - Prevents sentences from being cut mid-thought
  - Ensures important info near chunk boundaries isn't lost

**RecursiveCharacterTextSplitter**: Intelligently splits on:
1. Paragraphs first (preserves structure)
2. Then sentences
3. Then words
4. Finally characters (last resort)

**Output**: Larger number of smaller, more manageable chunks

In [26]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150
)

chunks = text_splitter.split_documents(documents)
len(chunks)


963

## Step 7: Create Vector Database (FAISS)

**What happens here**:
1. Each chunk is converted to a 384-dimensional vector using the embedding model
2. All vectors are stored in a FAISS index
3. FAISS builds an efficient search structure (enables fast similarity search)

**FAISS (Facebook AI Similarity Search)**:
- Optimized for finding similar vectors quickly
- Can search millions of vectors in milliseconds
- Used in production by Facebook, Google, etc.

**Why vector search?**:
- Traditional keyword search: "hypoglycemia" only matches exact word
- Vector search: Understands "low blood sugar", "glucose levels drop", etc.
- Captures semantic meaning, not just keywords

**Time**: This step may take a minute (embedding all chunks)

In [27]:
vectorstore = FAISS.from_documents(
    chunks,
    embedding=embeddings
)

## Step 8: Query the Vector Database

**What**: Perform semantic search to find relevant information

**Process**:
1. Query is converted to a vector (same 384-dimensional space)
2. FAISS finds the `k=4` most similar chunks
3. Similarity is measured using cosine similarity (or L2 distance)

**Why k=4?**:
- Balance between breadth and depth
- Too few (k=1): Might miss important info
- Too many (k=10): Too much noise, LLM gets confused
- 3-5 is the sweet spot for most RAG systems

**Output**: 4 most relevant document chunks with their source information

In [33]:
query = "Symptoms and recommended response to hypoglycemia in diabetic patients"

retrieved_docs = vectorstore.similarity_search(query, k=4)

for i, doc in enumerate(retrieved_docs, 1):
    print(f"\n--- Result {i} ({doc.metadata['source']}) ---")
    print(doc.page_content[:200])



--- Result 1 (Hypoglycemia in Adults.pdf) ---
hypoglycemia in a given year, with a higher incidence in people
with type 1 diabetes [11]. Hypoglycemia is rare in individuals with
type 2 diabetes who are not using insulin or insulin secretagogues,


--- Result 2 (ADA_guidelines.pdf) ---
abetes, and caregiver should increase
vigilance for hypoglycemia.B
Hypoglycemia Deﬁnitions and Event
Rates
Hypoglycemia is often the major limiting
factor in the glycemic management of
type 1 and type

--- Result 3 (review_hypoglycemia_symptoms.pdf) ---
the last 20 years, albeit some recent studies have reported decreasing trends, especially 
among patients with type 2 diabetes[10-12].
In patients  with diabetes,  it is not easy to determine  a speci

--- Result 4 (ADA_guidelines.pdf) ---
Shah ND, Wermers RA, Smith SA. Increased
mortality of patients with diabetes reporting severe
hypoglycemia. Diabetes Care 2012;35:1897–1901
80. Bloomﬁeld HE, Greer N, Newman D, et al.
Predictors and C


## Step 9: Load Language Model (FLAN-T5)

**What**: Load Google's FLAN-T5-base model for text generation

**FLAN-T5 characteristics**:
- **Size**: Base model (~250M parameters)
- **Training**: Instruction-tuned (follows prompts well)
- **Strengths**: Good at summarization, Q&A, reasoning tasks
- **Limits**: 512 token input limit (why we needed chunking)

**Components**:
- `AutoTokenizer`: Converts text to numbers (tokens)
- `AutoModelForSeq2SeqLM`: The actual model (sequence-to-sequence)
- Downloads automatically (~1GB, may take a few minutes)

**Why FLAN-T5?**:
- ✅ Free and runs locally
- ✅ Good instruction-following
- ✅ Suitable for scientific summarization
- Alternative: Could use MedGemma for medical-specific tasks

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

MODEL_NAME = "google/flan-t5-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)


Loading weights:   0%|          | 0/282 [00:00<?, ?it/s]



## Step 10: Create Text Generation Pipeline

**What**: Wrap the model in a pipeline for easier use

**Parameters**:
- `task="text-generation"`: Generates text based on input
- `max_length=512`: Maximum output length (in tokens)
- `truncation=True`: Automatically truncates if input > 512 tokens

**Pipeline benefits**:
- Handles tokenization automatically
- Simplifies the generation process
- Manages model input/output formatting

In [50]:
summarizer = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    truncation=True
)

The model 'T5ForConditionalGeneration' is not supported for text-generation. Supported models are ['PeftModelForCausalLM', 'AfmoeForCausalLM', 'ApertusForCausalLM', 'ArceeForCausalLM', 'AriaTextForCausalLM', 'BambaForCausalLM', 'BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BitNetForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'BltForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'CwmForCausalLM', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'DeepseekV2ForCausalLM', 'DeepseekV3ForCausalLM', 'DiffLlamaForCausalLM', 'DogeForCausalLM', 'Dots1ForCausalLM', 'ElectraForCausalLM', 'Emu3ForCausalLM', 'ErnieForCausalLM', 'Ernie4_5ForCausalLM', 'Ernie4_5_MoeForCausalLM', 'Exaone4ForCausalLM', 'FalconForCausalLM', 'FalconH1ForCausalLM', 'FalconMambaForCausa

## Step 11: Prepare Context for LLM

**What**: Combine all retrieved documents into a single context string

**Format**:
```
Source: paper1.pdf
<content from paper 1>

Source: paper2.pdf
<content from paper 2>
...
```

**Why this format?**:
- Clear source attribution
- LLM can cite specific papers
- Structured input leads to better structured output

In [None]:
context = "\n\n".join(
    [f"Source: {d.metadata['source']}\n{d.page_content}"
     for d in retrieved_docs]
)


## Step 12: Create Structured Prompt

**Prompt Engineering Best Practices**:

1. **Role definition**: "You are a medical research analyst"
   - Sets the context and expertise level
   
2. **Clear task**: "Summarize the SCIENTIFIC EVIDENCE ONLY"
   - Specific, actionable instruction
   
3. **Strict rules**:
   - "Do NOT give medical advice" - Critical for safety/legal reasons
   - "Do NOT invent facts" - Prevents hallucinations
   - "Cite sources explicitly" - Ensures traceability
   
4. **Output format**: Bullet points with clear sections
   - Makes output consistent and parseable
   
5. **Context injection**: Actual retrieved documents
   - RAG (Retrieval-Augmented Generation) - grounds the LLM in facts

**Why this matters**: Without proper constraints, LLMs can:
- Hallucinate facts
- Give dangerous medical advice
- Produce inconsistent formats

In [47]:
prompt = f"""
You are a medical research analyst.

Task:
Summarize the SCIENTIFIC EVIDENCE ONLY from the context below.

Rules:
- Do NOT give medical advice
- Do NOT invent facts
- Use bullet points
- Cite sources explicitly
- Keep it concise

Context:
{context}

Output format:
- Summary:
- Key findings:
- Sources:
"""


## Step 13: Generate Summary

**What**: Send prompt to FLAN-T5 and get generated summary

**Process**:
1. Prompt is tokenized (converted to numbers)
2. Model generates tokens autoregressively (one at a time)
3. Output is decoded back to text
4. Pipeline returns a list of dictionaries with `generated_text`

**Expected output**: Evidence-based summary with:
- Key findings about hypoglycemia
- Symptoms and responses
- Citations to source papers

In [48]:
result = summarizer(prompt)
print(result[0]["generated_text"])



You are a medical research analyst.

Task:
Summarize the SCIENTIFIC EVIDENCE ONLY from the context below.

Rules:
- Do NOT give medical advice
- Do NOT invent facts
- Use bullet points
- Cite sources explicitly
- Keep it concise

Context:
Source: Hypoglycemia in Adults.pdf
disorders are at high risk of future hypoglycemic episodes,
Table 1
Symptoms of hypoglycemia
Adrenergic
(autonomic)
Neuroglycopenic
/C15 Trembling
/C15 Palpitations
/C15 Sweating
/C15 Anxiety
/C15 Hunger
/C15 Nausea
/C15 Tingling
/C15 Difﬁculty concentrating
/C15 Confusion, weakness, drowsiness, vision changes
/C15 Slurred speech, headache, dizziness
I.C. Lega et al. / Can J Diabetes 47 (2023) 548e559 549

Source: ADA_guidelines.pdf
abetes, and caregiver should increase
vigilance for hypoglycemia.B
Hypoglycemia Deﬁnitions and Event
Rates
Hypoglycemia is often the major limiting
factor in the glycemic management of
type 1 and type 2 diabetes. Recommen-
dations regarding the classiﬁcation of hy-
poglycemia are outline

## Step 14: Package Final Output

**What**: Structure the agent's response as a JSON-like dictionary

**Components**:
1. **scientific_summary**: The LLM-generated text
2. **sources**: List of unique PDF filenames used
3. **reasoning_trace**: Step-by-step explanation of what the agent did

**Why reasoning_trace?**:
- **Transparency**: Users know how the answer was generated
- **Debugging**: Helps identify where things went wrong
- **Trust**: Users can verify the process
- **Auditability**: Important in medical/regulated domains

**Output format**: Easily serializable to JSON for APIs or logging



In [45]:
research_output = {
    "scientific_summary": result[0]["generated_text"],
    "sources": list(
        set(d.metadata["source"] for d in retrieved_docs)
    ),
    "reasoning_trace": [
        "Received hypoglycemia-related query from Coordinator",
        "Performed vector similarity search on medical corpus",
        "Selected top-k relevant documents",
        "Generated evidence-based summary using LLM_3"
    ]
}

research_output


{'scientific_summary': '\nYou are a medical research analyst.\n\nTask:\nSummarize the SCIENTIFIC EVIDENCE ONLY from the context below.\n\nRules:\n- Do NOT give medical advice\n- Do NOT invent facts\n- Use bullet points\n- Cite sources explicitly\n- Keep it concise\n\nContext:\nSource: Hypoglycemia in Adults.pdf\ndisorders are at high risk of future hypoglycemic episodes,\nTable 1\nSymptoms of hypoglycemia\nAdrenergic\n(autonomic)\nNeuroglycopenic\n/C15 Trembling\n/C15 Palpitations\n/C15 Sweating\n/C15 Anxiety\n/C15 Hunger\n/C15 Nausea\n/C15 Tingling\n/C15 Difﬁculty concentrating\n/C15 Confusion, weakness, drowsiness, vision changes\n/C15 Slurred speech, headache, dizziness\nI.C. Lega et al. / Can J Diabetes 47 (2023) 548e559 549\n\nSource: ADA_guidelines.pdf\nabetes, and caregiver should increase\nvigilance for hypoglycemia.B\nHypoglycemia Deﬁnitions and Event\nRates\nHypoglycemia is often the major limiting\nfactor in the glycemic management of\ntype 1 and type 2 diabetes. Recommen-\n

---

## Summary of the Complete Pipeline

```
User Query
    ↓
Vector Search (FAISS) → Retrieves relevant chunks
    ↓
Context Preparation → Combines chunks with metadata
    ↓
Prompt Engineering → Structured instruction to LLM
    ↓
LLM Generation (FLAN-T5) → Evidence-based summary
    ↓
Output Packaging → JSON response with sources & trace
```

**Key Advantages**:
- ✅ **Evidence-based**: Grounded in actual research papers
- ✅ **Transparent**: Full citation and reasoning trace
- ✅ **Safe**: No medical advice, only evidence summary
- ✅ **Scalable**: Can handle thousands of papers
- ✅ **Cost-effective**: Runs locally, no API costs