**Group Members**

* Rameesha | 24F-8014 | MSDS
* Ajwa Rafiq | 24F-7810 | MSCS

# Task 1: Medical RAG QA System

**Problem:**
Develop a Retrieval-Augmented Generation (RAG) pipeline using LangChain that answers medical questions using a publicly available clinical or biomedical dataset. The system should retrieve context-sensitive chunks and generate accurate, citation-aware responses.

**Dataset:**
[View here](https://www.kaggle.com/datasets/tboyle10/medicaltranscriptions)

**Objective:**
To create a safe medical assistant capable of retrieving evidence-based information and producing grounded answers using LangChain retrievers and the Gemini API.

**Deliverables:**
* Preprocessed dataset with chunks and metadata.
* FAISS/Chroma vector store creation script.
* RAG pipeline (Retriever + Gemini LLM chain).
* Streamlit/Gradio medical QA demo.
* Evaluation on at least 30 medical queries.

**Required packages**

In [None]:
!pip install langchain==0.1.16 langchain-core==0.1.40 langchain-community==0.0.42 langchain-google-genai==0.0.10


Collecting langchain==0.1.16
  Downloading langchain-0.1.16-py3-none-any.whl.metadata (13 kB)
Collecting langchain-core==0.1.40
  Downloading langchain_core-0.1.40-py3-none-any.whl.metadata (5.9 kB)
ERROR: Ignored the following yanked versions: 0.0.9, 0.2.14, 1.0.0a1
ERROR: Could not find a version that satisfies the requirement langchain-community==0.0.42 (from versions: 0.0.1rc1, 0.0.1rc2, 0.0.1, 0.0.2, 0.0.3, 0.0.4, 0.0.5, 0.0.6, 0.0.7, 0.0.8, 0.0.10, 0.0.11, 0.0.12, 0.0.13, 0.0.14, 0.0.15, 0.0.16, 0.0.17, 0.0.18, 0.0.19, 0.0.20, 0.0.21, 0.0.22, 0.0.23, 0.0.24, 0.0.25, 0.0.26, 0.0.27, 0.0.28, 0.0.29, 0.0.30, 0.0.31, 0.0.32, 0.0.33, 0.0.34, 0.0.35, 0.0.36, 0.0.37, 0.0.38, 0.2.0rc1, 0.2.0, 0.2.1, 0.2.2, 0.2.3, 0.2.4, 0.2.5, 0.2.6, 0.2.7, 0.2.9, 0.2.10, 0.2.11, 0.2.12, 0.2.13, 0.2.15, 0.2.16, 0.2.17, 0.2.18, 0.2.19, 0.3.0.dev1, 0.3.0.dev2, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5, 0.3.6, 0.3.7, 0.3.8, 0.3.9, 0.3.10, 0.3.11, 0.3.12, 0.3.13, 0.3.14, 0.3.15, 0.3.16,

In [2]:
!pip install -q langchain-google-genai sentence-transformers faiss-cpu pandas

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-generativeai 0.8.5 requires google-ai-generativelanguage==0.6.15, but you have google-ai-generativelanguage 0.9.0 which is incompatible.

**Libraries**

In [14]:
import pandas as pd
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from langchain_google_genai import ChatGoogleGenerativeAI
import warnings
warnings.filterwarnings('ignore')
from google.colab import drive
from getpass import getpass

In [4]:
print("Mounting Google Drive...")
drive.mount('/content/drive')

Mounting Google Drive...
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


The dataset consists of medical transcription samples. Each column contains the following:

- **description**: A brief description of the medical case/procedure
- **medical_specialty**: The medical field (like Cardiology, Surgery, etc.)
- **sample_name**: Name/title of the sample
- **transcription**: The actual medical transcription (this is what we'll mainly use!)
- **keywords**: Important medical terms related to the case

We'll mainly use the 'transcription' column for our RAG system since it contains the detailed medical information.

In [5]:
dataset_path = '/content/drive/MyDrive/ANLP/project_3/task_1/mtsamples.csv'
df = pd.read_csv(dataset_path)

print(f"\n data loaded")
print(f"Total record = {len(df)}")
print(f"\n data columns = {df.columns.tolist()}")

df.head()


 data loaded
Total record = 4999

 data columns = ['Unnamed: 0', 'description', 'medical_specialty', 'sample_name', 'transcription', 'keywords']


Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."
3,3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple..."
4,4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo..."


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Unnamed: 0         4999 non-null   int64 
 1   description        4999 non-null   object
 2   medical_specialty  4999 non-null   object
 3   sample_name        4999 non-null   object
 4   transcription      4966 non-null   object
 5   keywords           3931 non-null   object
dtypes: int64(1), object(5)
memory usage: 234.5+ KB


**Data cleaning and preprocessing**

In [7]:
print("Missing values in each column = ")
print(df.isnull().sum())

# Drop rows where the transcription is missing; .copy() avoids pandas'
# SettingWithCopyWarning when columns are assigned on the result below
df_clean = df.dropna(subset=['transcription']).copy()
print(f"Removed rows with missing transcriptions; remaining rows: {len(df_clean)}")

# Fill missing values in the other columns so they can be used as metadata
df_clean['description'] = df_clean['description'].fillna('')
df_clean['medical_specialty'] = df_clean['medical_specialty'].fillna('General')
df_clean['keywords'] = df_clean['keywords'].fillna('')

# Create a combined text field that includes description + transcription
df_clean['full_text'] = df_clean['description'] + '\n\n' + df_clean['transcription']

print(f"Data cleaning done; final data size: {len(df_clean)}")

print("\n Sample data (first 270 characters):")
print(df_clean['full_text'].iloc[0][:270])


Missing values in each column = 
Unnamed: 0              0
description             0
medical_specialty       0
sample_name             0
transcription          33
keywords             1068
dtype: int64
Removed rows with missing transcriptions; remaining rows: 4966
Data cleaning done; final data size: 4966

 Sample data (first 270 characters):
 A 23-year-old white female presents with complaint of allergies.

SUBJECTIVE:,  This 23-year-old white female presents with complaint of allergies.  She used to have allergies when she lived in Seattle but she thinks they are worse here.  In the past, she has tried Cla


**Text chunking**

In [10]:
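def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks. Returns a list of text chunks."""
    # A chunk can be cut back to just over half of chunk_size (see below),
    # so a larger overlap could stop the window from advancing.
    if overlap > chunk_size * 0.5:
        raise ValueError("overlap must be at most half of chunk_size")

    chunks = []
    start = 0
    text_len = len(text)

    while start < text_len:
        # Tentative end point of the current window
        end = start + chunk_size

        # If this is not the last chunk, try to end it at a sentence boundary
        if end < text_len:
            window = text[start:end]
            # Position of the last sentence-ending character in the window
            break_point = max(window.rfind('.'), window.rfind('?'), window.rfind('!'))

            # Only cut at the boundary if more than half of the chunk is kept
            if break_point > chunk_size * 0.5:
                end = start + break_point + 1

        # Add the chunk, skipping empty ones
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)

        # Step back by `overlap` so consecutive chunks share some context
        start = end - overlap if end < text_len else text_len

    return chunks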

# lists to store our chunks and metadata
all_texts = []
all_metadatas = []

print("docs processing")
for n, (idx, row) in enumerate(df_clean.iterrows(), start=1):
    # Split into chunks
    chunks = chunk_text(row['full_text'], chunk_size=500, overlap=100)

    # Store each chunk along with its metadata
    for chunk in chunks:
        all_texts.append(chunk)
        all_metadatas.append({
            'medical_specialty': row['medical_specialty'],
            'description': row['description'][:100],  # First 100 chars only
            'source': f"Document_{idx}"  # idx is the original DataFrame index label
        })

    # Count rows by position: idx has gaps after dropna and can exceed len(df_clean)
    if n % 100 == 0:
        print(f"processed {n}/{len(df_clean)} docs")

print(f"Total chunks created: {len(all_texts)}")
print(f"\nsample chunk:")
print(all_texts[0][:300])


docs processing
processed 100/4966 docs
processed 200/4966 docs
processed 300/4966 docs
processed 400/4966 docs
processed 500/4966 docs
processed 600/4966 docs
processed 700/4966 docs
processed 800/4966 docs
processed 900/4966 docs
processed 1000/4966 docs
processed 1100/4966 docs
processed 1200/4966 docs
processed 1300/4966 docs
processed 1400/4966 docs
processed 1500/4966 docs
processed 1600/4966 docs
processed 1700/4966 docs
processed 1800/4966 docs
processed 1900/4966 docs
processed 2000/4966 docs
processed 2100/4966 docs
processed 2200/4966 docs
processed 2300/4966 docs
processed 2400/4966 docs
processed 2500/4966 docs
processed 2600/4966 docs
processed 2700/4966 docs
processed 2800/4966 docs
processed 2900/4966 docs
processed 3000/4966 docs
processed 3100/4966 docs
processed 3200/4966 docs
processed 3300/4966 docs
processed 3400/4966 docs
processed 3500/4966 docs
processed 3600/4966 docs
processed 3700/4966 docs
processed 3800/4966 docs
processed 3900/4966 docs
processed 4000/496
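
To make the effect of `chunk_size` and `overlap` concrete, here is a minimal sketch of the fixed-window part of the chunking logic only, without the sentence-boundary adjustment. The function name `fixed_overlap_chunks` and the toy string are illustrative assumptions and are not part of the notebook.

```python
def fixed_overlap_chunks(text, chunk_size, overlap):
    """Cut text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # the last window reached the end of the text
        start = end - overlap  # step back so the next window repeats the tail
    return chunks


# Toy input (hypothetical): 10 characters, windows of 4 with 2 shared characters.
toy_chunks = fixed_overlap_chunks("abcdefghij", chunk_size=4, overlap=2)
print(toy_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk starts with the last two characters of the previous one, which is the same mechanism that lets a sentence split across a boundary still appear whole in at least one chunk when the overlap is large enough.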

**Create Embeddings using HuggingFace**

In [11]:
# Initialize the HuggingFace embedding model
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
print("Embedding model loaded")

test_text = "diabetes is a chronic disease"
test_embedding = embedding_model.encode(test_text)

print(f"\nTest embedding created")
print(f"Embedding dimension: {len(test_embedding)}")
print(f"First 5 values: {test_embedding[:5]}")

Embedding model loaded

Test embedding created
Embedding dimension: 384
First 5 values: [-0.01058776  0.04252453 -0.06571977  0.10387181 -0.03524163]


**Create FAISS Vector Store**

In [15]:
# Create embeddings for all chunks
print("encoding all text chunks")
all_embeddings = embedding_model.encode(all_texts, show_progress_bar=True)
all_embeddings = np.array(all_embeddings).astype('float32')
print(f"\nCreated {len(all_embeddings)} embeddings")
print(f"embeding dimension: {all_embeddings.shape[1]}")

# Create FAISS index
dimension = all_embeddings.shape[1]  # Dimension of embeddings (384 for MiniLM)
index = faiss.IndexFlatL2(dimension)  # L2 distance (Euclidean)

# Add embeddings to index
index.add(all_embeddings)

print(f"\nFAISS vector store created")
print(f"Total vectors in store= {index.ntotal}")

#vector store test
test_query = "What are symptoms of diabetes?"
test_query_embedding = embedding_model.encode([test_query]).astype('float32')

# Search for top 3 similar chunks
k = 3
distances, indices = index.search(test_query_embedding, k)

print(f"\n Test search for: '{test_query}'")
print(f"\nTop result:")
print(all_texts[indices[0][0]][:200])

encoding all text chunks


Created 46019 embeddings
embeding dimension: 384

FAISS vector store created
Total vectors in store= 46019

 Test search for: 'What are symptoms of diabetes?'

Top result:
Type 1 diabetes mellitus, insulin pump requiring.  Chronic kidney disease, stage III.  Sweet syndrome, hypertension, and dyslipidemia.

PROBLEMS LIST:,1.  Type 1 diabetes mellitus, insulin pump requir
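
`faiss.IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances, so smaller values mean closer matches. If the embeddings are unit length (an assumption here; `SentenceTransformer.encode` accepts `normalize_embeddings=True` to enforce it), ranking by squared L2 distance gives the same order as ranking by cosine similarity, because for unit vectors the squared distance equals 2 - 2·cos. The sketch below checks this on toy 3-dimensional vectors in plain Python; the vectors are made-up values, not real model output.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy "embeddings", normalized to unit length.
corpus = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0])]
query = normalize([0.9, 0.1, 0.0])

# Rank corpus items by ascending distance and by descending similarity.
by_l2 = sorted(range(len(corpus)), key=lambda i: squared_l2(query, corpus[i]))
by_cosine = sorted(range(len(corpus)), key=lambda i: -cosine(query, corpus[i]))
print(by_l2, by_cosine)  # both rankings are [0, 1, 2]
```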


**Create Retriever**

In [16]:
# retriever function
def retrieve_documents(query, k=4):
    """Retrieve the top-k most relevant chunks for a given query.

    Returns a list of dicts with the keys 'text', 'metadata', and 'distance'.
    """

    query_embedding = embedding_model.encode([query]).astype('float32')    # Encode query
    distances, indices = index.search(query_embedding, k)   # Search in FAISS index

    # Get the retrieved texts and metadata
    results = []
    for i, idx in enumerate(indices[0]):
        results.append({
            'text': all_texts[idx],
            'metadata': all_metadatas[idx],
            'distance': distances[0][i]
        })

    return results
print("Retriever created")

Retriever created


**Retriever Test**

In [17]:
# test the retriever
test_docs = retrieve_documents("How is hypertension treated?", k=4)
print(f"\nTest retrieval for 'How is hypertension treated?'")
print(f"Retrieved {len(test_docs)} docs")
print(f"\nFirst retrieved chunk:")
print(test_docs[0]['text'][:200])
print(f"Medical Specialty: {test_docs[0]['metadata']['medical_specialty']}")


Test retrieval for 'How is hypertension treated?'
Retrieved 4 docs

First retrieved chunk:
void the beta-blocker for vasospasm protection and will favor using calcium channel blocker for now. If, however, we run into trouble with this, I would prefer to switch her to Brevibloc or an Esmolol
Medical Specialty:  General Medicine
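
`retrieve_documents` always returns `k` hits, however far they are from the query. One possible refinement, sketched below, is to drop hits beyond a distance cutoff before they reach the prompt. The helper name `filter_by_distance`, the sample hits, and the cutoff of `1.0` are illustrative assumptions; a usable cutoff would have to be tuned on real queries.

```python
def filter_by_distance(results, max_distance):
    """Keep hits whose distance is at or below max_distance, closest first."""
    kept = [r for r in results if r["distance"] <= max_distance]
    return sorted(kept, key=lambda r: r["distance"])


# Hypothetical hits in the same shape that retrieve_documents returns.
sample_hits = [
    {"text": "chunk about blood pressure medication",
     "metadata": {"source": "Document_10"}, "distance": 0.62},
    {"text": "chunk about knee surgery",
     "metadata": {"source": "Document_42"}, "distance": 1.48},
    {"text": "chunk about calcium channel blockers",
     "metadata": {"source": "Document_7"}, "distance": 0.85},
]

kept_hits = filter_by_distance(sample_hits, max_distance=1.0)
print([hit["metadata"]["source"] for hit in kept_hits])  # ['Document_10', 'Document_7']
```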


**Gemini API key**

In [46]:
from getpass import getpass
import os

# Get Gemini API key
GOOGLE_API_KEY = getpass("Enter your Gemini API key: ")
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    temperature=0.2,
    google_api_key=GOOGLE_API_KEY
)

print("Using gemini-2.5-flash model")
print("Testing API connection...")

# Simple test without retries
try:
    test_response = llm.invoke("Hello")
    print(f"API is working. Response: {test_response.content}")
except Exception as e:
    print(f"Error: {str(e)[:200]}")

Enter your Gemini API key: ··········
Using gemini-2.5-flash model
Testing API connection...
API is working. Response: Hello! How can I help you today?
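
The connection test above makes a single call without retries. The evaluation run later in this notebook reports a free-tier limit of 10 requests per minute, so a retry wrapper with exponential backoff may be useful. The following is a minimal sketch under that assumption; `call_with_backoff`, its parameters, and the simulated `flaky` function are hypothetical and are not part of LangChain or the Gemini client.

```python
import time


def call_with_backoff(fn, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))


# Stand-in for an API call that fails twice before succeeding.
call_count = {"n": 0}


def flaky():
    call_count["n"] += 1
    if call_count["n"] < 3:
        raise RuntimeError("simulated quota error")
    return "ok"


waits = []  # record the delays instead of actually sleeping
outcome = call_with_backoff(flaky, sleep=waits.append)
print(outcome, waits)  # ok [2.0, 4.0]
```

In the notebook this could wrap the model call, for example `call_with_backoff(lambda: llm.invoke("Hello"))`, with the default `sleep=time.sleep` so the delays are real.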


**RAG pipeline**

In [47]:
def medical_qa_pipeline(query, k=4):

    # Retrieve relevant docs
    print(f"Searching for: {query}")
    retrieved_docs = retrieve_documents(query, k=k)

    #  Format context with source num
    context_parts = []
    for i, doc in enumerate(retrieved_docs):
        specialty = doc['metadata']['medical_specialty']
        text = doc['text']
        context_parts.append(f"[Source {i+1}] ({specialty}):\n{text}")

    context = "\n\n".join(context_parts)

    #medical-specific prompt
    prompt = f"""You are a medical AI assistant. Answer the medical question based ONLY on the provided context from clinical transcriptions.

IMPORTANT RULES:
- Use ONLY information from the context
- Cite sources using [Source N] format
- If context doesn't contain the answer, say "The provided transcriptions don't contain sufficient information"
- Be precise and medical-accurate

Context:
{context}

Question: {query}

Answer (with citations):"""

    # Get response from Gemini
    print("Generating ans")
    response = llm.invoke(prompt)

    return {
        'query': query,
        'answer': response.content,
        'sources': retrieved_docs,
        'num_sources': len(retrieved_docs)
    }

print("RAG Pipeline created")

RAG Pipeline created
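
The context passed to the model labels each chunk with `[Source N]` and its specialty, but not with the `source` identifier stored in the metadata. A possible variant, sketched below, adds that identifier so a citation can be traced back to a transcription. The helper name `format_context` and the sample documents are assumptions for illustration; the `.strip()` is there because the retriever test output shows the specialty value with a leading space.

```python
def format_context(docs):
    """Build a numbered context string that includes specialty and document id."""
    parts = []
    for n, doc in enumerate(docs, start=1):
        meta = doc["metadata"]
        specialty = meta["medical_specialty"].strip()
        header = f"[Source {n}] ({specialty}, {meta['source']})"
        parts.append(f"{header}:\n{doc['text']}")
    return "\n\n".join(parts)


# Hypothetical retrieved documents in the shape used by the pipeline.
sample_docs = [
    {"text": "Patient started on a calcium channel blocker.",
     "metadata": {"medical_specialty": " General Medicine", "source": "Document_12"}},
    {"text": "Blood pressure rechecked at follow-up.",
     "metadata": {"medical_specialty": " Cardiovascular / Pulmonary", "source": "Document_98"}},
]

context_text = format_context(sample_docs)
print(context_text)
```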


**Evaluation**

In [48]:
# Simple evaluation on 30 medical queries

import time

# 30 test questions
questions = [
    "What are symptoms of diabetes?",
    "How is heart attack diagnosed?",
    "What treatments exist for asthma?",
    "What causes high blood pressure?",
    "How is pneumonia treated?",
    "What are signs of stroke?",
    "How is arthritis managed?",
    "What indicates kidney disease?",
    "How is cancer detected?",
    "What are treatments for depression?",
    "How is a CT scan performed?",
    "What medications treat diabetes?",
    "What are complications of surgery?",
    "How is blood pressure monitored?",
    "What are side effects of chemotherapy?",
    "How is thyroid tested?",
    "What causes chest pain?",
    "How is anemia diagnosed?",
    "What treatments help with pain?",
    "How is infection prevented?",
    "What are risks of obesity?",
    "How is cholesterol managed?",
    "What causes fatigue?",
    "How is liver function tested?",
    "What are symptoms of allergies?",
    "How is migraine treated?",
    "What causes dizziness?",
    "How is glucose monitored?",
    "What are signs of dehydration?",
    "How is wound care done?"
]

results = []

print(f"Testing {len(questions)} queries...\n")

for i, question in enumerate(questions, 1):
    print(f"{i}. {question}")

    try:
        result = medical_qa_pipeline(question, k=3)
        results.append({
            'question': question,
            'answer': result['answer'],
            'sources': result['num_sources']
        })
        print(f"   Done ({result['num_sources']} sources)\n")
        time.sleep(3)  # wait between requests

    except Exception as e:
        print(f"   ✗ Error: {e}\n")
        results.append({
            'question': question,
            'answer': 'Error',
            'sources': 0
        })

print(f"\nCompleted {len(results)} queries")
print(f"Successful: {sum(1 for r in results if r['answer'] != 'Error')}")

Testing 30 queries...

1. What are symptoms of diabetes?
Searching for: What are symptoms of diabetes?
Generating ans
   Done (3 sources)

2. How is heart attack diagnosed?
Searching for: How is heart attack diagnosed?
Generating ans
   Done (3 sources)

3. What treatments exist for asthma?
Searching for: What treatments exist for asthma?
Generating ans
   Done (3 sources)

4. What causes high blood pressure?
Searching for: What causes high blood pressure?
Generating ans
   Done (3 sources)

5. How is pneumonia treated?
Searching for: How is pneumonia treated?
Generating ans
   Done (3 sources)

6. What are signs of stroke?
Searching for: What are signs of stroke?
Generating ans
   Done (3 sources)

7. How is arthritis managed?
Searching for: How is arthritis managed?
Generating ans
   Done (3 sources)

8. What indicates kidney disease?
Searching for: What indicates kidney disease?
Generating ans
   Done (3 sources)

9. How is cancer detected?
Searching for: How is cancer detected?
Gen

* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 10, model: gemini-2.5-flash
Please retry in 45.506206691s. [links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, violations {
  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerMinutePerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 10
}
, retry_delay {
  seconds: 45
}
].


23. What causes fatigue?
Searching for: What causes fatigue?
Generating ans
   Done (3 sources)

24. How is liver function tested?
Searching for: How is liver function tested?
Generating ans
   Done (3 sources)

25. What are symptoms of allergies?
Searching for: What are symptoms of allergies?
Generating ans
   Done (3 sources)

26. How is migraine treated?
Searching for: How is migraine treated?
Generating ans
   Done (3 sources)

27. What causes dizziness?
Searching for: What causes dizziness?
Generating ans
   Done (3 sources)

28. How is glucose monitored?
Searching for: How is glucose monitored?
Generating ans
   Done (3 sources)

29. What are signs of dehydration?
Searching for: What are signs of dehydration?
Generating ans
   Done (3 sources)

30. How is wound care done?
Searching for: How is wound care done?
Generating ans
   Done (3 sources)


Completed 30 queries
Successful: 30
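
`Successful: 30` counts every query whose call did not raise an exception; it does not say whether an answer was grounded. A rough way to break the results down further is sketched below: it separates errors, refusals (the fallback sentence from the prompt), answers that cite at least one `[Source N]`, and answers with no citation. The function name `summarize_answers` and the sample entries are illustrative assumptions, and this string matching is only a coarse proxy for answer quality.

```python
import re

REFUSAL = "The provided transcriptions don't contain sufficient information"
CITATION = re.compile(r"\[Source \d+")


def summarize_answers(entries):
    """Count errors, refusals, cited answers, and uncited answers."""
    summary = {"error": 0, "refusal": 0, "cited": 0, "uncited": 0}
    for entry in entries:
        answer = entry["answer"]
        if answer == "Error":
            summary["error"] += 1
        elif REFUSAL in answer:
            summary["refusal"] += 1
        elif CITATION.search(answer):
            summary["cited"] += 1
        else:
            summary["uncited"] += 1
    return summary


# Hypothetical entries in the same shape as the notebook's `results` list.
sample_entries = [
    {"question": "Q1", "answer": REFUSAL + ".", "sources": 3},
    {"question": "Q2", "answer": "Treated with antibiotics [Source 1, Source 2].", "sources": 3},
    {"question": "Q3", "answer": "Error", "sources": 0},
    {"question": "Q4", "answer": "Rest and fluids are advised.", "sources": 3},
]

answer_summary = summarize_answers(sample_entries)
print(answer_summary)  # {'error': 1, 'refusal': 1, 'cited': 1, 'uncited': 1}
```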


In [50]:
# View all answers
print("answers:")

for i, r in enumerate(results, 1):
    print(f"\n{i}. {r['question']}")
    print(f"Answer: {r['answer']}")
    print(f"Sources used: {r['sources']}  ")

answers:

1. What are symptoms of diabetes?
Answer: The provided transcriptions don't contain sufficient information.
Sources used: 3  

2. How is heart attack diagnosed?
Answer: The provided transcriptions don't contain sufficient information.
Sources used: 3  

3. What treatments exist for asthma?
Answer: The provided transcriptions don't contain sufficient information.
Sources used: 3  

4. What causes high blood pressure?
Answer: The provided transcriptions don't contain sufficient information.
Sources used: 3  

5. How is pneumonia treated?
Answer: Pneumonia can be treated on an outpatient basis [Source 1, Source 2]. Amoxicillin was given for yellow nasal discharge by the primary care provider [Source 1, Source 2]. In cases with rapid sepsis and respiratory failure, treatment involves aggressive measures such as mechanical ventilation, intubation, and other supportive measures [Source 3].
Sources used: 3  

6. What are signs of stroke?
Answer: The provided transcriptions don't con