# Medical Chatbot using RAG

## Setup

In [1]:
!pip install PyMuPDF



In [22]:
!pip uninstall -qqy jupyterlab kfp  # Remove unused conflicting packages
!pip install -qU "google-genai==1.7.0" "chromadb==0.6.3"

## Data Preparation

### Function to extract text from a PDF

In [4]:
import fitz

In [16]:
# Extracts and returns all text from a PDF file using PyMuPDF (fitz)
def extract_text_from_pdf(pdf_path):
    # The context manager closes the document once the text has been read
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)

In [18]:
document_text = extract_text_from_pdf("/content/medicine.pdf")

### Cleaning the text

In [7]:
import re

# Cleans extracted text: collapses all whitespace (line breaks included) into single spaces,
# then removes every character outside printable ASCII
def clean_text(text):
    text = re.sub(r'\s+', ' ', text)  # collapse whitespace runs, newlines included
    text = re.sub(r'[^ -~]', '', text)  # remove non-printable and non-ASCII characters
    return text.strip()

In [19]:
clean_document_text = clean_text(document_text)
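
The printable-ASCII filter in `clean_text` discards every character outside that range. If the PDF stores "ff" or "fi" as single ligature characters, that would explain words such as "aects" and "dierentiated" in the extracted text. The sketch below assumes that is the cause and normalizes with NFKC before filtering. `clean_text_keep_ligatures` is a hypothetical name for illustration, not a function used elsewhere in this notebook, and it will not help if the PDF font maps its glyphs to unrelated code points.

```python
import re
import unicodedata

def clean_text_keep_ligatures(text):
    # Hypothetical variant of clean_text (an assumption, not the notebook's function).
    # NFKC rewrites compatibility characters such as U+FB00 (the "ff" ligature)
    # as plain letters, so the ASCII filter below no longer deletes them.
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text)    # collapse whitespace runs, newlines included
    text = re.sub(r"[^ -~]", "", text)  # drop anything outside printable ASCII
    return text.strip()

sample = "how a disease a\ufb00ects\n a particular organ"
print(clean_text_keep_ligatures(sample))  # how a disease affects a particular organ
```

Swapping this in would change the chunk contents, so the outputs recorded in the rest of the notebook would no longer match exactly.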

### Splitting the text into chunks

In [8]:
# Splits the input text into chunks of chunk_size words; consecutive chunks share `overlap` words
def chunk_text(text, chunk_size=500, overlap=100):
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # each chunk starts this many words after the previous one
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks

In [20]:
chunk_clean_document_text = chunk_text(clean_document_text)
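
To see how the overlap behaves, here is a small worked example that runs the same chunking logic on ten placeholder words. The toy sizes (`chunk_size=4`, `overlap=2`) and the `toy_` names are chosen only for illustration.

```python
def chunk_text(text, chunk_size=500, overlap=100):
    # Same logic as the notebook's chunk_text, repeated so this cell runs on its own
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks

toy_text = " ".join(f"w{n}" for n in range(10))  # "w0 w1 ... w9"
toy_chunks = chunk_text(toy_text, chunk_size=4, overlap=2)
for chunk in toy_chunks:
    print(chunk)
# w0 w1 w2 w3
# w2 w3 w4 w5
# w4 w5 w6 w7
# w6 w7 w8 w9
# w8 w9
```

Each chunk starts `chunk_size - overlap` words after the previous one. The final chunk can be shorter than the others and, as here, may lie entirely inside the chunk before it.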

In [48]:
print(chunk_clean_document_text[2])

wholesome diagnosis and how therapy is geared to this knowledge. Diagnosis for the Doctor Prognosis for the Relatives Relief for the Patient 3 The complete diagnosis including all four parts mentioned earlier can be deciphered by asking a set series of questions after having elicited the history and physical signs. (Note: Pathophysiology is based on our understanding of how a disease aects a particular organ to produce dysfunction. It depends on our PREVIOUS knowledge and studies of Physiology and Pathology of similar patients. Some tests can help us in proving the pathophysiology operating in a particular case.) A. HISTORY A carefully elicited history should be able to answer the following FIVE questionsA1, A2, A3, A4 and A5 A.1 Which ORGAN SYSTEM is involved? This is based on the conglomeration of symptoms--cardinal symptoms of a particular system. Is there any particular SITE involved? This is suggested by some pathognomic symptoms and other details. e.g. -- Lateral chest pain assoc

## Using the Gemini Model through its API

### Setting up the model for embedding generation

In [23]:
from google import genai
from google.genai import types

from IPython.display import Markdown

genai.__version__

'1.7.0'

In [24]:
from google.colab import userdata
GOOGLE_API_KEY=userdata.get('GEMINI_API_KEY')

In [26]:
client = genai.Client(api_key=GOOGLE_API_KEY)

for m in client.models.list():
  if "embedContent" in m.supported_actions:
    print(m.name)

models/embedding-001
models/text-embedding-004
models/gemini-embedding-exp-03-07
models/gemini-embedding-exp


In [29]:
from chromadb import Documents, EmbeddingFunction, Embeddings
from google.api_core import retry

In [31]:
# Retry predicate: only rate-limit (429) and service-unavailable (503) API errors are retried
is_retriable = lambda e: isinstance(e, genai.errors.APIError) and e.code in {429, 503}
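
The predicate only decides whether an error is worth retrying. The retry loop itself comes from `google.api_core.retry`. As a rough illustration of the idea (not that library's implementation), the sketch below retries a flaky function with exponential backoff. `FakeAPIError`, `is_retriable_sketch`, `call_with_retry`, and the delay values are all stand-ins invented for this example.

```python
import time

class FakeAPIError(Exception):
    # Stand-in for genai.errors.APIError, used only in this sketch
    def __init__(self, code):
        super().__init__(f"API error {code}")
        self.code = code

def is_retriable_sketch(e):
    # Mirrors the notebook's predicate, but against the stand-in error class
    return isinstance(e, FakeAPIError) and e.code in {429, 503}

def call_with_retry(fn, predicate, max_attempts=4, base_delay=0.01):
    # Retry fn while the predicate accepts the error, doubling the wait each time
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as e:
            if not predicate(e) or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

attempts = []

def flaky():
    # Fails twice with a rate-limit error, then succeeds
    attempts.append(1)
    if len(attempts) < 3:
        raise FakeAPIError(429)
    return "ok"

retry_result = call_with_retry(flaky, is_retriable_sketch)
print(retry_result, len(attempts))  # ok 3
```

Errors that the predicate rejects, such as a 400, are raised immediately instead of being retried.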

### Embedding Class for creating and querying embeddings for the text data

In [32]:
# Custom EmbeddingFunction class to generate embeddings for documents or queries using the Gemini model
class GeminiEmbeddingFunction(EmbeddingFunction):
    document_mode = True

    @retry.Retry(predicate=is_retriable)
    def __call__(self, input: Documents) -> Embeddings:
        # Set the embedding task type based on the mode (document or query)
        if self.document_mode:
            embedding_task="retrieval_document"
        else:
            embedding_task="retrieval_query"

        # Call the Gemini model's embed_content method to get embeddings
        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(
                task_type=embedding_task,
            ),
        )

        return [e.values for e in response.embeddings]

## Initializing, creating and Storing the embeddings using ChromaDB

In [34]:
import chromadb

DB_NAME = "medicaldb"

embed_fn = GeminiEmbeddingFunction()  # Initialize the custom embedding function (Gemini)
embed_fn.document_mode = True  # Set the embedding function to work in document mode

# Create a client to interact with the Chroma database
chroma_client = chromadb.Client()
db = chroma_client.get_or_create_collection(name=DB_NAME,embedding_function=embed_fn)

# Add the cleaned and chunked document text into the collection, assigning unique IDs
db.add(documents=chunk_clean_document_text, ids=[str(i) for i in range(len(chunk_clean_document_text))])

In [35]:
db.count()

25
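
With 25 chunks, a single `db.add` call is enough here. For larger documents the embedding call may need to be split, assuming the API limits how many texts one request can carry. The default of 100 below is an assumption to check against the current Gemini documentation. `batched`, `add_in_batches`, and `RecordingCollection` are hypothetical helpers; the stand-in collection only records calls, so the sketch runs without a database or API key.

```python
def batched(items, batch_size):
    # Yield consecutive slices of at most batch_size items
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def add_in_batches(collection, documents, batch_size=100):
    # `collection` only needs an add(documents=..., ids=...) method,
    # which is the call the notebook makes on its Chroma collection
    next_id = 0
    for batch in batched(documents, batch_size):
        ids = [str(next_id + k) for k in range(len(batch))]
        collection.add(documents=batch, ids=ids)
        next_id += len(batch)

class RecordingCollection:
    # Stand-in that records each add() call instead of storing embeddings
    def __init__(self):
        self.calls = []

    def add(self, documents, ids):
        self.calls.append((documents, ids))

recorder = RecordingCollection()
add_in_batches(recorder, ["a", "b", "c", "d", "e"], batch_size=2)
print([ids for _, ids in recorder.calls])  # [['0', '1'], ['2', '3'], ['4']]
```

In the notebook this would be called as `add_in_batches(db, chunk_clean_document_text)`, producing the same IDs as the single `db.add` call.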

## RAG

### Querying the database - Retrieval

In [36]:
# Set the embedding function to work in query mode (not document mode)
embed_fn.document_mode = False

# Define the query for the medical chatbot or search system
query = "What are the respiratory disease symptoms?"

# Perform a query in the Chroma database and retrieve the most relevant document
result = db.query(query_texts=[query], n_results=1)  # increase n_results to retrieve more passages per query

# Extract the documents from the query result
[all_passages] = result["documents"]

Markdown(all_passages[0])

cardiac or psychological. Respiratory mechanisms can be:- ~ Inspiratory obstruction ~ Bronchospasm ~ Consolidation ~ Emphysema ~ Pleural eusion ~ Pneumothorax The site of disease in respiratory system can often be told by associated symptoms/signs. Dyspnoea with inspiratory stridor occurs in Foreign body Wheeze is audible in Bronchitis and Asthma Nocturnal increase in dyspnoea is Cardiac (due to alveolar congestion) Shallow breathing is seen in Neuromuscular paralysis 3. LATERAL CHEST PAIN This is the hallmark of pleural disease. It has to be dierentiated from musculoskeletal pain by the absence of other respiratory symptoms in the latter. Diaphragmatic pleurisy may be referred to the tip of shoulder and maybe associated with an increase during deep breathing and coughing. Tracheitis may also be painful but the pain is in the front of neck and retrosternal. 4. HEMOPTYSIS This symptom gives a lot of information about the site of involvement and sometimes helps in the etiological diagnosis as the causes of hemoptysis at each subsite of the respiratory system are few and many diseases have their distinctive characteristics. ~ Upper Respiratory Tract often gives a Streaky hemoptysis ~ Alveolar origin of hemoptysis is often Frothy and is a hallmark of pulmonary edema 18 ~ Frank blood can be seen in tuberculosis, mitral stenosis and bronchial adenoma ~ Mucopurulent hemoptysis is seen in bronchiectasis and lung abscess ~ Rusty hemoptysis is seen in early pneumonia ~ Sudden onset suggests pulmonary embolism and infarction ~ Recurrent hemoptysis occurs in hemosiderosis, Goodpasture's syndrome and bronchial adenoma ~ Continuous bleeding can be seen in malignancy A.2 Are there any symptoms suggestive of pathophysiological eects of the disease? 
Tremulousness, drowsiness and coma in Respiratory Failure Pitting edema, right upper abdominal discomfort in CHF Palpitations in arrhythmias A.3 Cause of Respiratory Diseases The common causes are: - Acute infections - Chronic infections - Malignancy - Degenerative diseases (like Emphysema) - Immunological diseases, common being asthma Less common ones being - Trauma - Congenital - Occupational and dust diseases- - Vascular diseases (pulmonary embolism) 19 As in all systems dierentiation between these possibilities lies in analyzing the mode of onset, course, duration and response to treatment, if any. The table below highlights these for the common ones: DISEASE ONSET COURSE DURATION TREATMENT RESPONSE ASTHMA Acute / Chronic Episodic Years Good for acute attack ACUTE INFECTION Acute Progressive then Regressive Days/ Weeks Good CHRONIC INFECTION Sub-Acute Slowly progressive Months/ Years Fair MALIGNANCY Sub-Acute Rapidly Progressive Months Bad DEGENERATI - VE Insidious Very Slowly Progressive Years Poor in long term A.4 BACKGROUND HISTORY ~ Acute infections Present in others in family Endemic/epidemic in community ~ Tuberculosis Family history + debilitating disease Overcrowding Undernutrition ~ Malignancy Personal history of smoking Occupational history of exposure to asbestos or Polyvinyl chloride ~ Degenerative Family history of similar illness ~ Bronchial asthma Past history of atopy, eczema, rhinitis Family history of atopy, eczema, rhinitis, allergic pharyngitis, hay fever A.5 What is the disturbance of function? Once again this is considered in context of a patient's daily
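
Conceptually, the query is embedded and compared against the stored chunk embeddings, and the closest chunks are returned. The toy sketch below shows that idea with hand-written 3-dimensional vectors and cosine similarity. It is only an illustration: Chroma uses its own index and a configurable distance metric, so this is not what the library runs internally, and `cosine_similarity`, `top_k`, and the vectors are made up for the example.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=1):
    # Indices of the k document vectors most similar to the query
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

toy_docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query = [0.9, 0.1, 0.0]
best = top_k(toy_query, toy_docs, k=2)
print(best)  # [0, 2]
```

Raising `k` here plays the same role as raising `n_results` in `db.query`.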

### Creating the Prompt

In [37]:
# Formats the query and passages into a structured prompt for a chatbot to generate a response in a conversational tone
query_oneline = query.replace("\n"," ")

prompt = f"""You are a helpful and informative bot that answers questions using text from the reference passage included below.
Be sure to respond in a complete sentence, being comprehensive, including all relevant background information.
However, you are talking to a non-technical audience, so be sure to break down complicated concepts and
strike a friendly and conversational tone. If the passage is irrelevant to the answer, you may ignore it.

QUESTION: {query_oneline}
"""

for passage in all_passages:
    passage_oneline = passage.replace("\n", " ")
    prompt += f"PASSAGE: {passage_oneline}\n"

print(prompt)

You are a helpful and informative bot that answers questions using text from the reference passage included below. 
Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. 
However, you are talking to a non-technical audience, so be sure to break down complicated concepts and 
strike a friendly and conversational tone. If the passage is irrelevant to the answer, you may ignore it.

QUESTION: What are the respiratory disease symptoms?
PASSAGE: cardiac or psychological. Respiratory mechanisms can be:- ~ Inspiratory obstruction ~ Bronchospasm ~ Consolidation ~ Emphysema ~ Pleural eusion ~ Pneumothorax The site of disease in respiratory system can often be told by associated symptoms/signs. Dyspnoea with inspiratory stridor occurs in Foreign body Wheeze is audible in Bronchitis and Asthma Nocturnal increase in dyspnoea is Cardiac (due to alveolar congestion) Shallow breathing is seen in Neuromuscular paralysis 3. LATERAL CHEST PAIN This
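
The same prompt-building steps are repeated for every query in this notebook, so they could be collected into one helper. The sketch below is one way to do that. `build_prompt` and its `instructions` parameter are hypothetical names, and the short instruction string in the demo stands in for the full instruction text used above.

```python
def build_prompt(query, passages, instructions):
    # Flatten the query, then append one PASSAGE line per retrieved chunk
    query_oneline = query.replace("\n", " ")
    prompt = f"{instructions}\n\nQUESTION: {query_oneline}\n"
    for passage in passages:
        passage_oneline = passage.replace("\n", " ")
        prompt += f"PASSAGE: {passage_oneline}\n"
    return prompt

demo_prompt = build_prompt(
    "What causes\nwheeze?",
    ["Wheeze is audible in\nBronchitis and Asthma"],
    "Answer using the passage below.",
)
print(demo_prompt)
```

With more than one retrieved passage, each passage gets its own `PASSAGE:` line, matching the loop in the cell above.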

### Sending the prompt to the model for a readable answer - Augmented Generation

In [38]:
# Generates content using the gemini model based on the given prompt and displays the result
answer = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt
)

Markdown(answer.text)

Okay, I can help you understand the symptoms associated with respiratory diseases.

Respiratory diseases can manifest in several ways, and the specific symptoms or signs can often point to the location of the issue within your respiratory system.

*   **Dyspnea** is basically when you experience shortness of breath. When dyspnea is coupled with an inspiratory stridor (a harsh, vibrating sound when you breathe in), it can indicate the presence of a foreign body.
*   **Wheezing**, that whistling or squeaky sound in your chest, is commonly heard in conditions like bronchitis and asthma, which affect your airways.
*   **Nocturnal Dyspnea** If you find yourself increasingly short of breath at night, it could be related to a cardiac issue, where fluid builds up in the lungs (alveolar congestion).
*   **Shallow breathing** may indicate neuromuscular paralysis.
*    **Lateral chest pain** can be a sign of pleural disease.

In addition to these, the nature of **hemoptysis** (coughing up blood) can provide valuable information:

*   **Streaky hemoptysis** often comes from the upper respiratory tract.
*   **Frothy hemoptysis** suggests the issue originates in the alveoli.
*   **Mucopurulent hemoptysis** is often associated with bronchiectasis and lung abscess.
*   **Rusty hemoptysis** is a symptom of early pneumonia.

Other general symptoms related to respiratory disease include tremulousness, drowsiness, and even a coma in severe cases of respiratory failure.

I hope this information helps you better understand respiratory disease symptoms.


### Trying a different query

In [44]:
embed_fn.document_mode = False

query = "How cardiology based disease are studied?"

result = db.query(query_texts=[query], n_results=1)
[all_passages] = result["documents"]

Markdown(all_passages[0])

of dysfunction? Are there any speciEic pathophysiologic syndromes of the involved system (e.g., Respiratory failure in chronic bronchitis, or Congestive Cardiac Failure in mitral stenosis) present? C. SYSTEMIC EXAMINATION Of systems other than the one primarily aected. This should tell us: C.1 Has the same disease aected other systems earlier and then caused the present problem? E.g. In Congenital heart disease, there may be other congenital anomalies, or Pulmonary tuberculosis may precede the involvement of kidneys by the mycobacteria. C.2 Whether other systems are aected by the disease, such as, Polycythemia in Chronic Lung Diseases, or Thrombo-embolic episode in Mitral Stenosis with atrial Eibrillation, or Gastric hemorrhage in cerebro-vascular stroke. D. SYSTEMIC EXAMINATION OF THE AFFECTED SYSTEM This should provide the answer to the following two questions, D1 and D2. D.1 Which site and sub-site is/are involved? Each system has dierent sites and sub-sites which can be involved in any disease process and these will be enumerated with each system. D.2 Can the site, sub-site involved or the permutations and combinations of the aforementioned tell us the possible disease responsible? E.g, Anterior horn cell is only aected by Polio and Motor Neurone Disease whereas pure motor spastic paraplegia has only a few well known causes. E. INVESTIGATIONS These should be able to answer the following queries. E1, E2, E3 and E4. E.1 What SITE/SUB-SITE is aected? E.2 What is the CAUSE (NATURE) of this process? E.3 Presence and measurement of any PATHOPHYSIOLOGICAL SYNDROMES such as measuring CVP in a doubtful CHF, Pulmonary Wedge pressure in left heart disease. 7 E.4 What is the FUNCTIONAL DISABILITY produced? Can we quantify this disability and decipher its pattern? E.g. Spirometry in COPD NOTE: The investigations should be mainly directed towards delineation of Site, Nature, measurement of pathophysiological alteration and quantiEication of dysfunction. 
Some of these 4 points may become clear from a thorough clinical examination and only need conEirmation by investigations, whereas the rest may need to be investigated as they could not be determined by Clinical Analysis. This could determine the selection of tests to be done. 8 The common anatomical site/s where diseases occur are listed in a box and so are the common etiologies of disease and the pathophysiological syndromes encountered. CARDIOLOGY SITE OF DISEASE 1. Pericardium 2. Myocardium 3. Endocardium: Valvular 4. Pancardium: Rheumatic Fever, Trauma 5. Vascular: Artery : Vein : Lymphatic 6. Electrical Pathways SYNDROMES OF DYSFUNCTION When the heart is not working properly it can result in the following pathophysiological syndromes: 1. Congestive Heart Failure 2. Cardiac Asthma 3. Low output Syndrome or Shock 4. Arrhythmia 5. Bacterial Endocarditis supervening on diseased Valves/ Shunt/ Artificial valves ETIOLOGY OF CARDIAC DISEASE (Note must be made of the common ones linked to a sub-site listed in the first box) COMMON 1. Congenital 2. Rheumatic 3. Hypertensive 4. Infectious : Pericardial ~ Tuberculosis, Viral : Myocardial ~ Virus, Rickettsia : Endocardial ~ Subacute bacterial endocarditis 5. Atherosclerotic UNCOMMON 1. Collagen 2. Endocrinal 3. Immune Disease 4. Others 9 CARDIOLOGY

In [45]:
query_oneline = query.replace("\n"," ")

prompt = f"""You are a helpful and informative bot that answers questions using text from the reference passage included below.
Be sure to respond in a complete sentence, being comprehensive, including all relevant background information.
However, you are talking to a non-technical audience, so be sure to break down complicated concepts and
strike a friendly and conversational tone. If the passage is irrelevant to the answer, you may ignore it.

QUESTION: {query_oneline}
"""

for passage in all_passages:
    passage_oneline = passage.replace("\n", " ")
    prompt += f"PASSAGE: {passage_oneline}\n"

print(prompt)

You are a helpful and informative bot that answers questions using text from the reference passage included below. 
Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. 
However, you are talking to a non-technical audience, so be sure to break down complicated concepts and 
strike a friendly and conversational tone. If the passage is irrelevant to the answer, you may ignore it.

QUESTION: How cardiology based disease are studied?
PASSAGE: of dysfunction? Are there any speciEic pathophysiologic syndromes of the involved system (e.g., Respiratory failure in chronic bronchitis, or Congestive Cardiac Failure in mitral stenosis) present? C. SYSTEMIC EXAMINATION Of systems other than the one primarily aected. This should tell us: C.1 Has the same disease aected other systems earlier and then caused the present problem? E.g. In Congenital heart disease, there may be other congenital anomalies, or Pulmonary tuberculosis may precede the i

In [47]:
answer = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt
)

Markdown(answer.text)

To study cardiology-based diseases, it is important to consider the site of the disease, like the pericardium, myocardium, endocardium, pancardium, vasculature, or electrical pathways. Additionally, understanding the syndromes of dysfunction, such as congestive heart failure, cardiac asthma, low output syndrome or shock, arrhythmia, or bacterial endocarditis, can provide insights. Also, the etiology, whether congenital, rheumatic, hypertensive, infectious, or atherosclerotic, needs to be considered.
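
Both queries follow the same three steps: retrieve, build the prompt, generate. A minimal sketch of wrapping them in one function follows, with the retrieval and generation steps passed in as callables so the example runs without an API key. `answer_question`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the instruction text is left out of the prompt for brevity.

```python
def answer_question(query, retrieve, generate, n_results=1):
    # retrieve(query, n_results) -> list of passages; generate(prompt) -> answer text
    passages = retrieve(query, n_results)
    query_oneline = query.replace("\n", " ")
    prompt = f"QUESTION: {query_oneline}\n"
    for passage in passages:
        prompt += "PASSAGE: " + passage.replace("\n", " ") + "\n"
    return generate(prompt)

# In the notebook, the callables could be wired up roughly like this
# (after setting embed_fn.document_mode = False):
#   retrieve = lambda q, n: db.query(query_texts=[q], n_results=n)["documents"][0]
#   generate = lambda p: client.models.generate_content(
#       model="gemini-2.0-flash", contents=p).text

def fake_retrieve(query, n_results):
    # Stand-in retriever returning fixed passages
    corpus = [
        "Wheeze is audible in Bronchitis and Asthma",
        "Lateral chest pain is the hallmark of pleural disease",
    ]
    return corpus[:n_results]

def fake_generate(prompt):
    # Stand-in model that only reports how many prompt lines it received
    return f"(model reply to {len(prompt.splitlines())} prompt lines)"

demo_answer = answer_question("What causes wheeze?", fake_retrieve, fake_generate)
print(demo_answer)  # (model reply to 2 prompt lines)
```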
