<a href="https://colab.research.google.com/github/RashmiJK/PGP-AIML-MedicalAssistant-NLP/blob/main/medical_assistant_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Problem Statement

### Business Context

The healthcare industry is rapidly evolving, with professionals facing increasing challenges in managing vast volumes of medical data while delivering accurate and timely diagnoses. The need for quick access to comprehensive, reliable, and up-to-date medical knowledge is critical for improving patient outcomes and ensuring informed decision-making in a fast-paced environment.

Healthcare professionals often encounter information overload, struggling to sift through extensive research and data to create accurate diagnoses and treatment plans. This challenge is amplified by the need for efficiency, particularly in emergencies, where time-sensitive decisions are vital. Furthermore, access to trusted, current medical information from renowned manuals and research papers is essential for maintaining high standards of care.

To address these challenges, healthcare centers can focus on integrating systems that streamline access to medical knowledge, provide tools to support quick decision-making, and enhance efficiency. Leveraging centralized knowledge platforms and ensuring healthcare providers have continuous access to reliable resources can significantly improve patient care and operational effectiveness.

**Common Questions to Answer**

**1. Diagnostic Assistance**: "What are the common symptoms and treatments for pulmonary embolism?"

**2. Drug Information**: "Can you provide the trade names of medications used for treating hypertension?"

**3. Treatment Plans**: "What are the first-line options and alternatives for managing rheumatoid arthritis?"

**4. Specialty Knowledge**: "What are the diagnostic steps for suspected endocrine disorders?"

**5. Critical Care Protocols**: "What is the protocol for managing sepsis in a critical care unit?"

### Objective

As an AI specialist, your task is to develop a RAG-based AI solution using renowned medical manuals to address healthcare challenges. The objective is to **understand** issues like information overload, **apply** AI techniques to streamline decision-making, **analyze** its impact on diagnostics and patient outcomes, **evaluate** its potential to standardize care practices, and **create** a functional prototype demonstrating its feasibility and effectiveness.

### Data Description

The **Merck Manuals** are medical references published by the American pharmaceutical company Merck & Co. They cover a wide range of medical topics, including disorders, tests, diagnoses, and drugs. The manuals have been published since 1899, when Merck & Co. was still a subsidiary of the German company Merck.

The manual is provided as a PDF with over 4,000 pages divided into 23 sections.

## 1 - Installing and Importing Necessary Libraries and Dependencies

**Set Google Colab to use the T4 GPU**

Install `llama-cpp-python` with GPU acceleration. The step that matters is a successful wheel build for `llama-cpp-python`; other errors reported by `pip` can be ignored. Restart the runtime afterwards.

- `llama-cpp-python` is a Python wrapper for llama.cpp, a universal LLM inference library that runs models efficiently using the GGUF file format.

- GGUF (GGML Universal File) is a binary format that stores model weights and metadata in a single file. GGUF files commonly hold quantized weights, which lowers numeric precision to reduce memory usage and speed up inference.

- Model Compatibility: Supports any GGUF-converted model including Llama, Mistral, CodeLlama, Gemma, and Qwen.

- `Llama()` class: Main interface for loading and running models

- `hf_hub_download()`: A function from the Hugging Face Hub library to download specific files from Hugging Face repositories with automatic caching
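
As a rough illustration of why quantization matters, the sketch below estimates the storage needed for model weights at different precisions. The parameter count (about 7.24 billion) and the figure of about 6.56 bits per weight for `Q6_K` are approximate values assumed for illustration, and `estimate_weight_size_gb` is a hypothetical helper, not part of `llama-cpp-python`.

```python
def estimate_weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed, approximate parameter count for a 7B-class model.
N_PARAMS = 7.24e9

for label, bits in [("FP16", 16.0), ("Q6_K (approx.)", 6.56)]:
    size_gb = estimate_weight_size_gb(N_PARAMS, bits)
    print(f"{label}: ~{size_gb:.1f} GB")
```

The estimate only covers the weights; the context window and the runtime need additional memory on top of it.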

In [None]:
# Installation for GPU llama-cpp-python: Downloads and compiles the library with GPU acceleration enabled.
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q

In [None]:
# Install the remaining libraries, including huggingface_hub for downloading models from the HF Hub
!pip install huggingface_hub pandas tiktoken==0.6.0 pymupdf==1.25.1 langchain==0.3.25 langchain-community==0.3.25 chromadb sentence-transformers numpy transformers -q

In [2]:
# Libraries for downloading and loading the llm
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

## 2 - Query LLM with default parameters

### 2.1 - Download and load the Mistral model
| Model | Repository | File/Name | Model card |
|-------|------------|-----------|---------|
| Mistral-7B-Instruct-v0.2 | `TheBloke/Mistral-7B-Instruct-v0.2-GGUF` | `mistral-7b-instruct-v0.2.Q6_K.gguf` | https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF |

In [14]:
# Define the model repository and filename for the Mistral-7B-Instruct-v0.2 GGUF model.
model_repo = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
model_file = "mistral-7b-instruct-v0.2.Q6_K.gguf"

In [15]:
# Download the model
model_path = hf_hub_download(
    repo_id= model_repo,
    filename= model_file
)

In [16]:
# Initialize the model with the downloaded GGUF file.
# model_path: path to the GGUF model file.
# n_ctx: context window size (determines how much text the model can process at once).
# n_gpu_layers: number of layers to offload to the GPU for acceleration.
# n_batch: batch size for processing.
llm = Llama(
    model_path=model_path,
    n_ctx=5000,
    n_gpu_layers=38,
    n_batch=512
)

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 


### 2.2 - Utility function `generate_response`

In [2]:
def generate_response(
    query,
    max_tokens=128,
    temperature=0,
    top_p=0.95,
    top_k=50,
    repeat_penalty=1.0
):
    """
    Generates a response from the language model.

    Args:
        query (str): The input prompt for the model.
        max_tokens (int, optional): The maximum number of tokens to generate. Defaults to 128.
        temperature (float, optional): Controls the randomness of the output. Defaults to 0.
        top_p (float, optional): Nucleus sampling parameter. Defaults to 0.95.
        top_k (int, optional): Top-k sampling parameter. Defaults to 50.
        repeat_penalty (float, optional): Penalizes repeated tokens. Defaults to 1.0.

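    Returns:
        tuple[str, dict]: The generated text and the raw model output.
            On failure, an error message string and an empty dict.
    """
    try:
        model_output = llm(
            prompt=query,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            repeat_penalty=repeat_penalty
        )
        return model_output['choices'][0]['text'], model_output
    except Exception as e:
        # The empty dict has no 'usage' key, so callers must check it before indexing.
        return f"Error: {e}", {}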
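
Because `generate_response` returns an empty dictionary when the model call fails, indexing `['usage']['completion_tokens']` on that result raises a `KeyError`. A minimal sketch of a defensive accessor follows; `get_completion_tokens` is a hypothetical helper name, and the dictionary layout is assumed to match the `usage` field read elsewhere in this notebook.

```python
from typing import Optional


def get_completion_tokens(model_output: dict) -> Optional[int]:
    """Return the completion token count, or None if usage data is absent."""
    return model_output.get("usage", {}).get("completion_tokens")


# A made-up successful output, followed by the empty dict returned on error.
print(get_completion_tokens({"usage": {"completion_tokens": 128}}))
print(get_completion_tokens({}))
```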

### 2.3 - Querying the LLM

#### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [3]:
query_1 = "What is the protocol for managing sepsis in a critical care unit?"
ans_1, moutput_1 = generate_response(query_1)
print(ans_1)
print("completion_tokens = ", moutput_1['usage']['completion_tokens'])

Error: name 'llm' is not defined


KeyError: 'usage'

#### Query 2: What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [7]:
query_2 = "What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
ans_2, moutput_2 = generate_response(query_2)
print(ans_2)
print("completion_tokens = ", moutput_2['usage']['completion_tokens'])

Llama.generate: prefix-match hit




Appendicitis is a medical condition characterized by inflammation of the appendix, a small tube-shaped organ located in the lower right side of the abdomen. The symptoms of appendicitis can vary from person to person, but some common signs include:

1. Abdominal pain: The pain is typically located in the lower right side of the abdomen and may start as a mild discomfort that gradually worsens. The pain may be constant or come and go, and it may be accompanied by cramping or bloating.
2. Loss of appetite: People with appendic
completion_tokens =  128


#### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [8]:
query_3 = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
ans_3, moutput_3 = generate_response(query_3)
print(ans_3)
print("completion_tokens = ", moutput_3['usage']['completion_tokens'])

Llama.generate: prefix-match hit




Sudden patchy hair loss, also known as alopecia areata, is a common autoimmune disorder that affects the hair follicles, leading to hair loss in small, round patches on the scalp, beard, or other areas of the body. The exact cause of alopecia areata is not known, but it is believed to be related to a problem with the immune system.

There are several treatments that have been shown to be effective in addressing sudden patchy hair loss:

1. Corticosteroids: Corticosteroids are anti-inflammatory
completion_tokens =  128


#### Query 4: What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [9]:
query_4 = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
ans_4, moutput_4 = generate_response(query_4)
print(ans_4)
print("completion_tokens = ", moutput_4['usage']['completion_tokens'])

Llama.generate: prefix-match hit




A person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function, is typically diagnosed with a traumatic brain injury (TBI). The treatment for a TBI depends on the severity and location of the injury, as well as the individual's overall health and age.

Immediate treatment for a TBI may include:

1. Emergency medical care: This may include surgery to remove hematomas or other obstructions, as well as treatment for other injuries that may have occurred at the same time as the TBI.
2. Med
completion_tokens =  128


#### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [10]:
query_5 = "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
ans_5, moutput_5 = generate_response(query_5)
print(ans_5)
print("completion_tokens = ", moutput_5['usage']['completion_tokens'])

Llama.generate: prefix-match hit




First and foremost, if a person has fractured their leg during a hiking trip, it is essential to ensure their safety and prevent further injury. Here are some necessary precautions and treatment steps:

1. Assess the situation: Check the extent of the injury and assess the person's condition. If the fracture is open or the person is in severe pain, immobilize the leg with a splint or a makeshift sling to prevent any movement.
2. Call for help: If possible, call for emergency medical assistance. If there is no cell phone reception, try to
completion_tokens =  128


<span style="color: blue;"> **Observation**</span>
- The responses to the questions are generic.
- The output is truncated due to the default `max_tokens` limit of 128.
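
Truncation can also be detected programmatically instead of by reading the text. The sketch below assumes the OpenAI-style completion layout, in which a completion that stopped at the token limit carries `finish_reason == "length"`; `was_truncated` is a hypothetical helper and the sample dictionaries are made up.

```python
def was_truncated(model_output: dict) -> bool:
    """Return True if the first choice stopped because it hit the token limit."""
    choices = model_output.get("choices") or [{}]
    return choices[0].get("finish_reason") == "length"


# Made-up outputs that mimic the assumed response layout.
cut_off = {"choices": [{"text": "People with appendic", "finish_reason": "length"}]}
complete = {"choices": [{"text": "Done.", "finish_reason": "stop"}]}

print(was_truncated(cut_off))
print(was_truncated(complete))
```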

## 3 - Query LLM with Prompt Engineering and Parameter Tuning

Prompt template for Mistral from the model card : `<s>[INST] {prompt} [/INST]`

To leverage the model's instruction fine-tuning, the prompt is wrapped in `[INST]` and `[/INST]` tokens.


In [11]:
# Define a simple utility function to prepare model prompt
def prepare_model_prompt(system_prompt, user_prompt):
    return f"""<s>[INST]{'system'}: {system_prompt}
                {'user'}: {user_prompt}
                [/INST]"""
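
For comparison, here is a minimal sketch of a prompt builder that follows the model card template without the extra indentation that the triple-quoted string above introduces. The way the system and user text are combined is an assumption (the template as quoted has no dedicated system slot), and `build_mistral_prompt` is a hypothetical name.

```python
def build_mistral_prompt(system_prompt: str, user_prompt: str) -> str:
    """Wrap the non-empty prompt parts in the [INST] ... [/INST] template."""
    parts = [p.strip() for p in (system_prompt, user_prompt) if p and p.strip()]
    body = "\n\n".join(parts)
    return f"<s>[INST] {body} [/INST]"


print(build_mistral_prompt("Answer briefly.", "What is sepsis?"))
```

Skipping empty parts keeps the prompt clean when either the system prompt or the user prompt is an empty string.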

### Query 1: What is the protocol for managing sepsis in a critical care unit?

Combination 1 - System prompt (general audience, harmless) and modified `max_tokens`

In [12]:
system_prompt = """You are a helpful, respectful and honest medical assistant.
                  Always explain in simple terms for a general audience.
                  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
                  Please ensure that your responses are socially unbiased and positive in nature."""

ans, moutput = generate_response(
    prepare_model_prompt(system_prompt, query_1),
    max_tokens=0,
    temperature=0,
    top_p=0.95,
    top_k=50,
    repeat_penalty=1.0
  )
print(ans)
print("completion_tokens = ", moutput['usage']['completion_tokens'])

Llama.generate: prefix-match hit


 Sepsis is a serious condition that occurs when the body has an overwhelming response to an infection. In a critical care unit, managing sepsis involves several steps to ensure the best possible outcome for the patient. Here's a simplified explanation of the protocol:

1. Recognition: Healthcare professionals must identify sepsis early and assess its severity using tools like the Sequential Organ Failure Assessment (SOFA) score or the Quick Sequential Organ Failure Assessment (qSOFA) score.

2. Resuscitation: The first priority is to stabilize the patient's vital signs, including maintaining adequate blood pressure, oxygenation, and perfusion. This may involve administering intravenous fluids, oxygen, and vasopressors.

3. Source control: Identify and address the source of the infection, such as removing an infected catheter or draining an abscess.

4. Antibiotics: Administer broad-spectrum antibiotics as soon as possible to cover the most common bacterial pathogens.

5. Supportive car

<span style="color: blue;"> **Observation**</span>
- The explanation is detailed and suitable for a general audience.
- The number of completion tokens has increased compared with the default-parameter run, because generation is no longer capped at 128 tokens.

### Query 2: What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

Combination 2 - System prompt (brevity and Shakespearean language) and modified `temperature` and `max_tokens`

In [13]:
# temperature set to 1; max_tokens set to 0 (no explicit cap, so generation is bounded by the context window)
system_prompt = """Respond briefly and clearly in Shakespearean language."""

ans, moutput = generate_response(
    prepare_model_prompt(system_prompt, query_2),
    max_tokens=0,
    temperature=1,
    top_p=0.95,
    top_k=50,
    repeat_penalty=1.0
  )
print(ans)
print("completion_tokens = ", moutput['usage']['completion_tokens'])

Llama.generate: prefix-match hit


 Thou askest of the common signs of appendicitis, good sir or madam? I shall oblige thee with great haste. Abdominal pain, primarily near the navel, is the first sign this malady doth present. Swelling, inflammation, and loss of appetite, as well as vomiting and a low-grade fever, are oft accompanied. Alas, good friend, no, this affliction cannot be healed by mere medicament. Instead, a surgical procedure known as appendectomy must be pursued with great haste. Forsooth, this operation, though perilous, is the only way to save the sufferer from the impending rupture and demise. Godspeed to thee, and may fortune smile upon thee in thine time of need.
completion_tokens =  175


In [14]:
# temperature set to 1; max_tokens set to 0 (no explicit cap, so generation is bounded by the context window)
# Repeat the same question to observe the effect of temperature
system_prompt = """Respond briefly and clearly in Shakespearean language."""

ans, moutput = generate_response(
    prepare_model_prompt(system_prompt, query_2),
    max_tokens=0,
    temperature=1,
    top_p=0.95,
    top_k=50,
    repeat_penalty=1.0
  )
print(ans)
print("completion_tokens = ", moutput['usage']['completion_tokens'])

Llama.generate: prefix-match hit


 Thou askest of the telltale signs of Appendicitis, a malady most vexing? I shall oblige, fair questioner! Swelling in the lower belly, near the navel, doth oft occur. Pain thence traveleth towards right side, abdomen afflicted with great distress. Loss of appetite, and oft a feverish heat, compound this affliction. As for cure by medicine's hand, alas, it doth not oft accord. Thus, surgical procedure, in form of an appendectomy, doth become the remedy's decree.
completion_tokens =  133


<span style="color: blue;"> **Observation**</span>
- The explanation is poetic in nature.
- Repeating the same question yields a different response, because `temperature=1` makes token sampling random.
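
To illustrate how temperature changes the randomness of sampling, the sketch below applies a temperature-scaled softmax to a few made-up scores. It shows the general idea only and is not the sampler used by `llama-cpp-python`; `softmax_with_temperature` and the example logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpening or flattening by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it across the alternatives, which is why repeated runs diverge more.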

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

Combination 3 - System prompt (empty) and modified `top_k`

`top_k` controls the maximum number of most-likely next tokens to consider when generating the response at each step.

In [15]:
# top_k set to 5
system_prompt = ""

ans, moutput = generate_response(
    prepare_model_prompt(system_prompt, query_3),
    max_tokens=0,
    temperature=0,
    top_p=0.95,
    top_k=5,
    repeat_penalty=1.0
  )
print(ans)
print("completion_tokens = ", moutput['usage']['completion_tokens'])

Llama.generate: prefix-match hit


 There are several possible causes for sudden, patchy hair loss, also known as alopecia areata. Here are some effective treatments and possible causes:

Causes:
1. Alopecia Areata: An autoimmune disorder that causes the body's immune system to attack hair follicles, leading to hair loss.
2. Stress: Physical or emotional stress can cause hair loss.
3. Nutritional Deficiencies: Lack of certain nutrients, such as iron, zinc, or biotin, can lead to hair loss.
4. Hormonal Imbalance: Hormonal changes, such as those caused by pregnancy, menopause, or thyroid problems, can cause hair loss.
5. Medications: Certain medications, such as chemotherapy drugs, can cause hair loss.

Treatments:
1. Minoxidil: A topical medication that can help stimulate hair growth and slow down hair loss.
2. Corticosteroids: Prescription medications that can help reduce inflammation and suppress the immune system to promote hair growth.
3. Immunotherapy: Injections of certain proteins that can help stimulate hair grow

In [16]:
# top_k set to 70
system_prompt = ""

ans, moutput = generate_response(
    prepare_model_prompt(system_prompt, query_3),
    max_tokens=0,
    temperature=0,
    top_p=0.95,
    top_k=70,
    repeat_penalty=1.0
  )
print(ans)
print("completion_tokens = ", moutput['usage']['completion_tokens'])

Llama.generate: prefix-match hit


 There are several possible causes for sudden, patchy hair loss, also known as alopecia areata. Here are some effective treatments and possible causes:

Causes:
1. Alopecia Areata: An autoimmune disorder that causes the body's immune system to attack hair follicles, leading to hair loss.
2. Stress: Physical or emotional stress can cause hair loss.
3. Nutritional Deficiencies: Lack of certain nutrients, such as iron, zinc, or biotin, can lead to hair loss.
4. Hormonal Imbalance: Hormonal changes, such as those caused by pregnancy, menopause, or thyroid problems, can cause hair loss.
5. Medications: Certain medications, such as chemotherapy drugs, can cause hair loss.

Treatments:
1. Minoxidil: A topical medication that can help stimulate hair growth and slow down hair loss.
2. Corticosteroids: Prescription medications that can help reduce inflammation and suppress the immune system to promote hair growth.
3. Immunotherapy: Injections of certain proteins that can help stimulate hair grow

<span style="color: blue;"> **Observation**</span>
- The "Causes" sections are identical. The differences appear in the "Treatments" sections, past the point where the outputs displayed above are cut off.
- The `top_k=70` response provides a longer list of treatments, more specific wording, and a higher token count.
- A possible explanation is that `top_k=5` restricts each step to the 5 most probable tokens, leading to a more predictable and generic response, whereas `top_k=70` allows a pool of up to 70 tokens per step, leaving room for more specific terminology and a more comprehensive list. Note, however, that both runs use `temperature=0`, where decoding is typically greedy and `top_k` would be expected to have little or no effect, so this comparison is worth repeating with a nonzero temperature.
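
To make the mechanism concrete, the sketch below applies top-k filtering to a made-up next-token distribution. It illustrates the idea only and is not the llama.cpp sampler; `top_k_filter` and the token probabilities are hypothetical. It also shows that the single most probable token is the same for every `k`, so a purely greedy pick does not depend on `top_k`.

```python
def top_k_filter(probs: dict, k: int) -> dict:
    """Keep the k most probable tokens and renormalize their probabilities."""
    if k <= 0:
        raise ValueError("k must be positive")
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Made-up next-token distribution.
probs = {"growth": 0.5, "loss": 0.3, "care": 0.15, "wig": 0.05}

for k in (2, 4):
    filtered = top_k_filter(probs, k)
    greedy_choice = max(filtered, key=filtered.get)
    print(k, greedy_choice, filtered)
```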

### Query 4: What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

Combination 4 - Few-shot prompting

In [17]:
system_prompt = """
You are a medical assistant providing information on treatments for brain injuries.

User:
Question: What are the common symptoms and treatments for pulmonary embolism?
Answer: Common symptoms of pulmonary embolism include sudden shortness of breath, chest pain that worsens with breathing or coughing, rapid heart rate, rapid breathing, anxiety, coughing (sometimes with blood), sweating, and fainting. Treatment typically involves anticoagulant medications to prevent further clots, and sometimes thrombolytics to dissolve existing clots. In severe cases, surgical embolectomy or catheter-directed treatments may be necessary.

User:
Question: Can you provide the trade names of medications used for treating hypertension?
Answer: Some common trade names for medications used to treat hypertension include Prinivil, Zestril (Lisinopril), Norvasc (Amlodipine), Cozaar (Losartan), Diovan (Valsartan), Toprol XL, Lopressor (Metoprolol), and Tenormin (Atenolol).

User:
Question: What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?
Answer:
"""

user_input = ""

ans, moutput = generate_response(
    prepare_model_prompt(system_prompt, user_input),
    max_tokens=0
  )
print(ans)
print("completion_tokens = ", moutput['usage']['completion_tokens'])

Llama.generate: prefix-match hit


 Treatment for a brain injury can depend on the severity and location of the injury. For mild to moderate brain injuries, rest, medication for pain and swelling, and rehabilitation therapies such as physical, occupational, and speech therapy may be recommended. For more severe injuries, treatments may include surgery to remove hematomas or repair skull fractures, and intensive care to manage symptoms such as seizures, infections, or breathing problems. Rehabilitation is also an important part of treatment for brain injuries, regardless of severity. It can help individuals regain skills and improve function. Additionally, medications may be prescribed to manage symptoms such as seizures, depression, or difficulty with attention or memory. It's important to note that every brain injury is unique, and treatment plans will vary depending on the individual's specific needs.
completion_tokens =  174


<span style="color: blue;"> **Observation**</span>
- The structure and content of response align well with the provided few-shot examples, demonstrating that the model understood the desired format and level of detail.
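
The few-shot prompt above is written by hand. A minimal sketch of assembling the same kind of prompt from a list of examples follows; `build_few_shot_prompt` is a hypothetical helper, the shortened example answers are placeholders, and the layout simply mirrors the `Question:` / `Answer:` pattern used in the cell above.

```python
def build_few_shot_prompt(instruction, examples, question):
    """Join an instruction, worked Q/A examples, and the final open question."""
    lines = [instruction.strip(), ""]
    for example_question, example_answer in examples:
        lines += [f"Question: {example_question}", f"Answer: {example_answer}", ""]
    lines += [f"Question: {question}", "Answer:"]
    return "\n".join(lines)


examples = [
    ("What are the common symptoms of pulmonary embolism?", "(example answer 1)"),
    ("Which trade names are used for hypertension drugs?", "(example answer 2)"),
]
prompt = build_few_shot_prompt(
    "You are a medical assistant.",
    examples,
    "What treatments are recommended for a brain injury?",
)
print(prompt)
```

Ending the prompt with an open `Answer:` cues the model to continue in the same format as the examples.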

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

Combination 5 - Chain-of-Thought prompting

In [18]:
system_prompt = """Think step-by-step to determine the necessary precautions, treatment steps, and considerations for care and recovery for a person who has fractured their leg during a hiking trip. Consider the immediate actions to take at the injury site, the subsequent medical treatment, and the long-term recovery process.
"""

ans, moutput = generate_response(
    prepare_model_prompt(system_prompt, query_5),
    max_tokens=0
  )
print(ans)
print("completion_tokens = ", moutput['usage']['completion_tokens'])

Llama.generate: prefix-match hit


 I. Immediate Actions at the Injury Site:
1. Assess the situation: Check if the person is in a safe location and if there are any other injuries.
2. Provide first aid: Apply a sterile dressing to the wound, if present, to prevent infection. Do not attempt to realign the bone or apply excessive pressure to the area.
3. Immobilize the leg: Use a splint, a makeshift sling, or a hiking pole to immobilize the leg to prevent further damage and provide comfort.
4. Monitor vital signs: Check for signs of shock, such as rapid heartbeat, shallow breathing, or pale skin.
5. Provide hydration and nutrition: Offer water or other fluids to help maintain hydration and provide energy-rich snacks.

II. Subsequent Medical Treatment:
1. Seek professional help: Arrange for transportation to the nearest medical facility as soon as possible.
2. Diagnostic tests: X-rays will be used to confirm the fracture and determine the extent of the injury.
3. Pain management: The healthcare provider may prescribe pain 

<span style="color: blue;"> **Observation**</span>
- The response is detailed and includes step-by-step thinking and reasoning.

## 4 - Download Embedding model

Download the General Text Embeddings (GTE) model to generate embeddings for the PDF data from the Merck Manual.

*   The GTE models rank well on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for 'Retrieval' tasks, which indicates that they produce meaningful text representations for search and retrieval.
*   This model exclusively caters to English texts, and any lengthy texts will be truncated to a maximum of 512 tokens.

| Model     | Repository          | How to Load         | Model Card                                        | Embedding Dimension |
|-----------|---------------------|---------------------|-------------|---------------------------------------------------|
| GTE-Large | `thenlper/gte-large` | `SentenceTransformer` | https://huggingface.co/thenlper/gte-large         | 1024 |

In [19]:
# Import the SentenceTransformerEmbeddings class for creating sentence embeddings.
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings

In [20]:
# Load the GTE-Large embedding model
embedding_model = SentenceTransformerEmbeddings(model_name="thenlper/gte-large")

In [21]:
print("Model information:")
print(embedding_model.client)

print("\nTokenizer:")
print(embedding_model.client.tokenizer)

Model information:
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Tokenizer:
BertTokenizerFast(name_or_path='thenlper/gte-large', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False

- The `gte-large` embedding model tokenizes its input with `BertTokenizerFast`.
- This notebook uses the same tokenizer to count tokens when splitting the document into chunks with `RecursiveCharacterTextSplitter`.
- This ensures the chunks stay within the embedding model's maximum sequence length of 512 tokens.

The `.embed_documents()` and `.embed_query()` methods of the model instance can be used to generate embeddings.
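
Once embeddings exist, retrieval comes down to comparing vectors. Cosine similarity is one common choice, and the sketch below ranks two toy vectors against a query. The 3-dimensional vectors stand in for the model's 1024-dimensional embeddings, and all names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for a query embedding and two chunk embeddings.
query = [0.6, 0.8, 0.0]
docs = {"chunk_a": [0.6, 0.8, 0.0], "chunk_b": [0.0, 0.0, 1.0]}

ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranked)
```

Since the model output above ends with a `Normalize()` module, its embeddings are unit length, in which case the dot product alone gives the same ranking.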

## 5 - Data Preparation and Vector Database Setup for RAG

To prepare the medical manual data for Retrieval Augmented Generation (RAG), we will perform the following steps:

1.  **Chunking**: Divide the PDF document into smaller, manageable text segments (chunks). We will create two sets of chunks with different sizes (490 and 245 tokens) to explore the impact of chunk size on retrieval.
2.  **Vectorization**: Convert these text chunks into numerical representations called embeddings using the pre-trained GTE-Large embedding model.
3.  **Vector Database Setup**: Store the vectorized chunks in two separate Chroma vector databases, one for each chunk size. This allows for efficient similarity search during the retrieval phase of RAG.

By creating two databases with different chunk sizes, we can compare their effectiveness in retrieving relevant information for answering medical queries.

### 5.1 - Import libraries required for chunking

In [22]:
# Libraries for processing dataframes and text
import json
import os
import tiktoken
import pandas as pd

# Libraries for Loading Data, Chunking, Embedding, and Vector Databases
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import Chroma
from uuid import uuid4
from time import sleep

### 5.2 - Loading and Previewing the Medical Manual

In [23]:
# Connect to Google Drive to load the PDF
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [24]:
manual_pdf_path = "/content/drive/MyDrive/Colab Notebooks/Project-5/medical_diagnosis_manual.pdf"

In [25]:
pdf_loader = PyMuPDFLoader(manual_pdf_path)

In [26]:
manual = pdf_loader.load()

In [27]:
print("total documents loaded from the PDF = ", len(manual))

total documents loaded from the PDF =  4114


In [28]:
# Inspect the page number and page_content length of a few sample documents to understand the data structure
for i in range(20,24):
  print("page = ",manual[i].metadata['page'],end="\n")
  print("page_content length = ", len(manual[i].page_content),end="\n")
  print("---"*10)

page =  20
page_content length =  1487
------------------------------
page =  21
page_content length =  1843
------------------------------
page =  22
page_content length =  1858
------------------------------
page =  23
page_content length =  1797
------------------------------


### 5.3 - Define utility function `get_vectordb_handler` to populate vector DB

In [29]:
# Define a utility function to create and populate database
def get_vectordb_handler(persist_dir, collection_name, document_chunks):
    """
    Handles the creation or loading of the Chroma vector database.

    Args:
        persist_dir (str): The directory path to persist the database.
        collection_name (str): The name of the collection within the database.
        document_chunks (list): A list of document chunks to add to the database.

    Returns:
        Chroma: An instance of the Chroma vector database.
    """
    if os.path.exists(persist_dir):
      print(f'"{persist_dir}" already exists!')
    else:
      print(f'Creating vector database directory in "{persist_dir}"')
      os.makedirs(persist_dir)

    # Instantiate Chroma with persistence
    vectorstore = Chroma(
        persist_directory=persist_dir,
        embedding_function=embedding_model,
        collection_name=collection_name
      )

    # Get the collection
    content = vectorstore.get()

    if not len(content['ids']):
      print(f'Populating vector database...')

      uuids = [str(uuid4()) for _ in range(len(document_chunks))]
      i = 0
      while i < len(document_chunks) - 1000:
        added_list = vectorstore.add_documents(document_chunks[i : i + 1000], ids=uuids[i : i + 1000])
        print(f'Vector database populated with {len(added_list)} entries')
        i += 1000
        sleep(10)

      if i < len(document_chunks):
          added_list = vectorstore.add_documents(document_chunks[i :], ids=uuids[i :])
          print(f'Vector database populated with {len(added_list)} entries')

    else:
      print(f'Vector database already populated.')

    return vectorstore
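
The batching inside `get_vectordb_handler` can also be pictured as a small generator that yields fixed-size slices. The sketch below is a simplified, standalone illustration rather than the notebook's implementation: `iter_batches` is a hypothetical helper name, and plain integers stand in for document chunks.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 1000) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Stand-in for 2,500 document chunks (hypothetical data, not the manual).
fake_chunks = list(range(2500))
batch_sizes = [len(batch) for batch in iter_batches(fake_chunks)]
print(batch_sizes)  # two full batches followed by the remainder
```

Inside the handler, each yielded batch would be passed to `vectorstore.add_documents` together with the matching slice of IDs, with the same pause between batches.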

### 5.3 - Data Chunking (chunk_size=490)

In [30]:
# Import the BertTokenizerFast from the transformers library
from transformers import BertTokenizerFast
# Load the tokenizer for the 'thenlper/gte-large' model
tokenizer = BertTokenizerFast.from_pretrained("thenlper/gte-large")

In [31]:
# Initialize the RecursiveCharacterTextSplitter using the loaded tokenizer.
# from_huggingface_tokenizer is used to ensure compatibility with the model's tokenizer.
# chunk_size is set to 490 : The maximum number of tokens in each chunk
# chunk_overlap is set to 20 : The number of tokens to overlap between consecutive chunks
text_splitter_490 = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=tokenizer,
    chunk_size=490,
    chunk_overlap=20
)
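
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size token windows with overlap. It is a deliberate simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this sketch ignores. The function name `sliding_token_windows` and the integer "tokens" are assumptions made for illustration.

```python
from typing import List, Sequence


def sliding_token_windows(
    tokens: Sequence[int], chunk_size: int = 490, chunk_overlap: int = 20
) -> List[List[int]]:
    """Split a token sequence into windows of at most `chunk_size` tokens,
    where each window repeats the last `chunk_overlap` tokens of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows: List[List[int]] = []
    start = 0
    while start < len(tokens):
        windows.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the sequence
        start += step
    return windows


# 1,000 fake token ids stand in for one tokenized page.
toy_tokens = list(range(1000))
windows = sliding_token_windows(toy_tokens)
print([len(window) for window in windows])
```

With these defaults, consecutive windows share 20 tokens, so a sentence cut at a boundary still appears partly in both neighbouring chunks.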

In [32]:
# Load the PDF document and split it into chunks using the configured text_splitter
document_chunks = pdf_loader.load_and_split(text_splitter_490)

In [33]:
# Verify token counts for each chunk
max_tokens_allowed = 512
all_chunks_within_limit = True

for i, chunk in enumerate(document_chunks):
  token_count = len(tokenizer.encode(chunk.page_content))
  if token_count > max_tokens_allowed:
    print(f"Chunk {i} exceeds the token limit with {token_count} tokens.")
    all_chunks_within_limit = False

if all_chunks_within_limit:
  print(f"All document chunks are within the {max_tokens_allowed}-token limit.")

All document chunks are within the 512-token limit.


In [34]:
# Print the total number of document chunks created
print(f"""
type(document_chunks) = {type(document_chunks)}
type(document_chunks[0]) = {type(document_chunks[0])}
len(document_chunks) = {len(document_chunks)}""")


type(document_chunks) = <class 'list'>
type(document_chunks[0]) = <class 'langchain_core.documents.base.Document'>
len(document_chunks) = 8678


In [35]:
embedding_1 = embedding_model.embed_query(document_chunks[0].page_content)
embedding_2 = embedding_model.embed_query(document_chunks[1].page_content)

In [36]:
print("Dimension of the embedding vector ",len(embedding_1))
len(embedding_1)==len(embedding_2)

Dimension of the embedding vector  1024


True

### 5.4 - Populate Vector Database (medical_db_490)

In [37]:
# Define the directory where the vector database will be stored
persist_dir = '/content/drive/MyDrive/Colab Notebooks/Project-5/medical_db_490'

In [38]:
vectorstore = get_vectordb_handler(persist_dir, "MerckManual", document_chunks)

"/content/drive/MyDrive/Colab Notebooks/Project-5/medical_db_490" already exists!


Vector database already populated.


In [39]:
# Total entries in the vector db
len(vectorstore.get()['ids'])

8678

In [40]:
# Test similarity search for vitamin A toxicity
vectorstore.similarity_search("What are the side effects if vitamin A overdose?",k=3)

[Document(metadata={'creator': 'Atop CHM to PDF Converter', 'source': '/content/drive/MyDrive/Colab Notebooks/Project-5/medical_diagnosis_manual.pdf', 'creationDate': 'D:20120615054440Z', 'title': 'The Merck Manual of Diagnosis & Therapy, 19th Edition', 'producer': 'pdf-lib (https://github.com/Hopding/pdf-lib)', 'file_path': '/content/drive/MyDrive/Colab Notebooks/Project-5/medical_diagnosis_manual.pdf', 'modDate': 'D:20251022203106Z', 'total_pages': 4114, 'trapped': '', 'keywords': '', 'subject': '', 'format': 'PDF 1.7', 'page': 93, 'moddate': '2025-10-22T20:31:06+00:00', 'author': '', 'creationdate': '2012-06-15T05:44:40+00:00'}, page_content='defects occur in children of women receiving isotretinoin (which is related to vitamin A) for acne treatment\nduring pregnancy.\nAlthough carotene is converted to vitamin A in the body, excessive ingestion of carotene causes\ncarotenemia, not vitamin A toxicity. Carotenemia is usually asymptomatic but may lead to carotenodermia,\nin which the s

<span style="color: blue;"> **Observation**</span>
- The Merck Manual has been embedded and stored in the Chroma vector database.
- The database holds 8678 entries, matching the number of document chunks created.
- A similarity search for the side effects of vitamin A overdose retrieved relevant chunks from the database.
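
As a rough picture of what a top-`k` similarity search does, the sketch below ranks toy vectors by cosine similarity to a query vector. It is only an illustration under stated assumptions: the vectors are made-up 3-dimensional values rather than real 1024-dimensional embeddings, the helper names are hypothetical, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]], k: int = 3
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings": documents 0 and 1 point roughly the same way as the query.
toy_docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.05, 0.0]
ranked = top_k(toy_query, toy_docs, k=3)
print([index for index, _ in ranked])
```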

### 5.5 - Data Chunking (chunk_size=245)

To tune chunking, we'll create a second database that stores smaller chunks.

In [41]:
# Initialize the RecursiveCharacterTextSplitter for chunk_size 245 (smaller than the previous one)
text_splitter_245 = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=tokenizer,
    chunk_size=245,
    chunk_overlap=20
)

In [42]:
# Load the PDF document and split it into chunks using the configured text_splitter_245
document_chunks_245 = pdf_loader.load_and_split(text_splitter_245)

In [43]:
# Print the total number of document chunks created using text_splitter_245 and their types
print(f"""
type(document_chunks_245) = {type(document_chunks_245)}
type(document_chunks_245[0]) = {type(document_chunks_245[0])}
len(document_chunks_245) = {len(document_chunks_245)}""")


type(document_chunks_245) = <class 'list'>
type(document_chunks_245[0]) = <class 'langchain_core.documents.base.Document'>
len(document_chunks_245) = 16160


In [44]:
# Print the dimension of the embedding vector generated by the model.
print("Dimension of the embedding vector ",len(embedding_model.embed_query(document_chunks_245[0].page_content)))

Dimension of the embedding vector  1024


<span style="color: blue;"> **Observation**</span>
- The dimension of the embedding vector remains 1024, consistent with the model's output size.
- The embedding dimension is a property of the model, so it does not change with chunk size: a chunk produced with `chunk_size=245` is still represented as a 1024-dimensional vector.

### 5.6 - Populate Vector Database (medical_db_245)

In [45]:
# Define the directory where the vector database will be stored
persist_dir_245 = '/content/drive/MyDrive/Colab Notebooks/Project-5/medical_db_245'

In [46]:
vectorstore_245 = get_vectordb_handler(persist_dir_245, "MerckManual245", document_chunks_245)

"/content/drive/MyDrive/Colab Notebooks/Project-5/medical_db_245" already exists!
Vector database already populated.


In [47]:
# Total entries in the vector db
len(vectorstore_245.get()['ids'])

16160

In [48]:
# Test similarity search for vitamin A toxicity
vectorstore_245.similarity_search("What are the side effects of vitamin A overdose?",k=3)

[Document(metadata={'producer': 'pdf-lib (https://github.com/Hopding/pdf-lib)', 'trapped': '', 'keywords': '', 'total_pages': 4114, 'file_path': '/content/drive/MyDrive/Colab Notebooks/Project-5/medical_diagnosis_manual.pdf', 'page': 93, 'format': 'PDF 1.7', 'creationdate': '2012-06-15T05:44:40+00:00', 'source': '/content/drive/MyDrive/Colab Notebooks/Project-5/medical_diagnosis_manual.pdf', 'subject': '', 'creator': 'Atop CHM to PDF Converter', 'moddate': '2025-10-22T20:31:06+00:00', 'author': '', 'title': 'The Merck Manual of Diagnosis & Therapy, 19th Edition', 'creationDate': 'D:20120615054440Z', 'modDate': 'D:20251022203106Z'}, page_content='are present, adjusting the dose almost always leads to complete recovery.\nAcute vitamin A toxicity in children may result from taking large doses (> 100,000 RAE [> 300,000 IU]),\nusually accidentally. In adults, acute toxicity has occurred when arctic explorers ingested polar bear or\nseal livers, which contain several million units of vitamin

<span style="color: blue;"> **Observation**</span>
- The top result again comes from page 93 of the manual, as before, but the chunk is shorter and starts at a different point in the text. This indicates that the smaller `chunk_size` produced more granular chunks while still surfacing relevant content for retrieval.

### 5.7 - Transform `vectorstore` into retriever

For easier use with LangChain chains, we can transform the vector store into a retriever.

In [49]:
retriever = vectorstore.as_retriever(
    search_type='similarity',
    search_kwargs={'k': 3}
)

In [50]:
rel_docs = retriever.get_relevant_documents("What are the side effects if vitamin A overdose?")
rel_docs


[Document(metadata={'creator': 'Atop CHM to PDF Converter', 'producer': 'pdf-lib (https://github.com/Hopding/pdf-lib)', 'trapped': '', 'title': 'The Merck Manual of Diagnosis & Therapy, 19th Edition', 'page': 93, 'file_path': '/content/drive/MyDrive/Colab Notebooks/Project-5/medical_diagnosis_manual.pdf', 'source': '/content/drive/MyDrive/Colab Notebooks/Project-5/medical_diagnosis_manual.pdf', 'moddate': '2025-10-22T20:31:06+00:00', 'author': '', 'subject': '', 'creationdate': '2012-06-15T05:44:40+00:00', 'total_pages': 4114, 'creationDate': 'D:20120615054440Z', 'modDate': 'D:20251022203106Z', 'format': 'PDF 1.7', 'keywords': ''}, page_content='defects occur in children of women receiving isotretinoin (which is related to vitamin A) for acne treatment\nduring pregnancy.\nAlthough carotene is converted to vitamin A in the body, excessive ingestion of carotene causes\ncarotenemia, not vitamin A toxicity. Carotenemia is usually asymptomatic but may lead to carotenodermia,\nin which the s

<span style="color: blue;"> **Observation**</span>
- The retriever returns the same documents for this query as the direct similarity search did.


In [51]:
# Query Mistral-7B-Instruct without retrieved context
rsp, mo = generate_response("What are the side effects of vitamin A overdose?", max_tokens=0)

Llama.generate: prefix-match hit


In [52]:
print(rsp)



Vitamin A is an essential nutrient that plays a crucial role in maintaining good vision, a healthy immune system, and normal growth and development. However, too much vitamin A can be harmful, especially for pregnant women and young children.

The side effects of vitamin A overdose, also known as hypervitaminosis A, can include:

1. Nausea and vomiting
2. Fatigue and weakness
3. Headache and dizziness
4. Dry, itchy skin and hair loss
5. Joint pain and muscle weakness
6. Liver damage, which can lead to jaundice, abdominal pain, and dark urine
7. Bone pain and fractures
8. Birth defects in developing fetuses, including cleft palate, heart defects, and mental retardation

In severe cases, vitamin A overdose can lead to coma, seizures, and even death.

It's important to note that the recommended daily intake of vitamin A for adults is 700-900 micrograms per day, while the upper limit is 3,000 micrograms per day. Pregnant women should not exceed 2,600 micrograms per day, and young childre

<span style="color: blue;"> **Observation**</span>

The response above is generic: it relies solely on the data the model was trained on rather than on the medical manual.

### 5.8 - Define utility function `prepare_rag_model_prompt`

The prompt guides the model toward accurate responses. It has two parts:

1. A system prompt describing the assistant's role.
2. A user message containing the retrieved context and the question.

In [53]:
# Define a simple utility function to prepare model prompt for RAG
def prepare_rag_model_prompt(
    system_prompt,
    query,
    retriever,
    k=3
):
    # Retrieve relevant document chunks from retriever
    relevant_docs = retriever.get_relevant_documents(query=query, k=k)
    context = [d.page_content for d in relevant_docs]

    # Combine the retrieved documents into one long string
    context_string = ". ".join(context)

    user_prompt = """
    ###Context
    Here are the retrieved documents that are relevant to the question mentioned below.
    {context_string}

    ###Question
    {query}
    """.format(context_string=context_string, query=query)

    # Return the prepared prompt and the context string
    return (
        f"""<s>[INST]{'system'}: {system_prompt}
                {'user'}: {user_prompt}
                [/INST]""" ,
        context_string
    )
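
To inspect the prompt layout without a vector store, the same structure can be assembled from a plain list of strings. This is a minimal sketch, not the notebook's function: `build_rag_prompt` is a hypothetical name, the whitespace is simplified, and the two demo chunks are short paraphrases used only to show where each part lands.

```python
from typing import List, Tuple


def build_rag_prompt(system_prompt: str, query: str, chunks: List[str]) -> Tuple[str, str]:
    """Assemble a Mistral-style [INST] prompt from already-retrieved chunk texts."""
    # Join the chunk texts into one context string, as the utility above does.
    context_string = ". ".join(chunks)
    user_prompt = (
        "###Context\n"
        "Here are the retrieved documents that are relevant to the question mentioned below.\n"
        f"{context_string}\n\n"
        "###Question\n"
        f"{query}\n"
    )
    prompt = f"<s>[INST]system: {system_prompt}\nuser: {user_prompt}[/INST]"
    return prompt, context_string


# Hypothetical inputs, used only to show the layout.
demo_prompt, demo_context = build_rag_prompt(
    "Answer only from the context.",
    "What causes carotenemia?",
    ["Excessive ingestion of carotene causes carotenemia", "Carotenemia is usually asymptomatic"],
)
print(demo_prompt)
```

Keeping retrieval separate from prompt assembly like this makes it easy to check the exact text the model receives for a given set of chunks.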

### 5.9 - RAG query

In [54]:
# Query Mistral-7B-Instruct with retrieved context
system_message = """
You are an assistant whose work is to review the report and provide the appropriate answers from the context.
User input will have the context required by you to answer user questions.
This context will begin with the token: ###Context.
The context contains references to specific portions of a document relevant to the user query.

User questions will begin with the token: ###Question.

Please answer only using the context provided in the input. Do not mention anything about the context in your final answer.

If the answer is not found in the context, strictly respond "I don't know".
"""

In [55]:
user_question = "What are the side effects if vitamin A overdose?"

In [56]:
prompt, _ = prepare_rag_model_prompt(system_message, user_question, retriever)

In [57]:
rsp, mo = generate_response(prompt,max_tokens=0)
print(rsp)

Llama.generate: prefix-match hit


 The side effects of vitamin A overdose include headache, increased intracranial pressure, nausea, vomiting, changes in skin, hair, and nails, abnormal liver test results, and in a fetus, birth defects. In severe cases, symptoms may include sparsely distributed, coarse hair; alopecia of the eyebrows; dry, rough skin; dry eyes; cracked lips; severe headache; pseudotumor cerebri; generalized weakness; cortical hyperostosis of bone and arthralgia; easy fractures; pruritus; anorexia; failure to thrive; hepatomegaly and splenomegaly; and in carotenodermia, deep yellow skin (but not the sclera), especially on the palms and soles. Diagnosis is usually clinical, and adjusting the dose usually leads to complete recovery, except for birth defects in the fetus of a mother who has taken megadoses of vitamin A.


<span style="color: blue;"> **Observation**</span>

*   The response now draws specifically on the provided context (the Merck Manual) rather than on the model's general knowledge.

## 6 - Question Answering using RAG

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [58]:
# The context window is set to 5000 tokens
# Each chunk holds at most 490 tokens (verified to stay under 512); k=3 retrieves the 3 best-matching chunks
print("query1 : ", query_1)
rag_prompt, _ = prepare_rag_model_prompt(system_message, query_1, retriever, k=3)
mrq1, moq1 = generate_response(rag_prompt, max_tokens=0)
print(mrq1)

query1 :  What is the protocol for managing sepsis in a critical care unit?


Llama.generate: prefix-match hit


 The protocol for managing sepsis in a critical care unit includes administering antibiotics such as gentamicin or tobramycin, a 3rd-generation cephalosporin, or ceftazidime, depending on the suspected source and causative organisms. Vancomycin should be added if resistant staphylococci or enterococci are suspected. If there is an abdominal source, a drug effective against anaerobes should be included. Culture and sensitivity results should be used to change the antibiotic regimen accordingly. Antibiotics should be continued for at least 5 days after shock resolves and evidence of infection subsides. Abscesses must be drained, and necrotic tissues must be surgically excised. Normalization of blood glucose improves outcome in critically ill patients, and a continuous IV insulin infusion is titrated to maintain glucose between 80 to 110 mg/dL.


### Query 2: What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [61]:
print("query2 : ", query_2)
rag_prompt, _ = prepare_rag_model_prompt(system_message, query_2, retriever, k=3)
mrq2, moq2 = generate_response(rag_prompt, max_tokens=0)
print(mrq2)

query2 :  What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?


Llama.generate: prefix-match hit


 The common symptoms of appendicitis include abdominal pain, anorexia, and abdominal tenderness. Appendicitis cannot be cured via medicine alone, and the standard surgical procedure to treat it is appendectomy.


### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [62]:
print("query3 : ", query_3)
rag_prompt, _ = prepare_rag_model_prompt(system_message, query_3, retriever, k=3)
mrq3, moq3 = generate_response(rag_prompt, max_tokens=0)
print(mrq3)

query3 :  What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?


Llama.generate: prefix-match hit


 Sudden patchy hair loss, also known as alopecia areata, can be treated with various methods. Topical, intralesional, or systemic corticosteroids, topical minoxidil, topical anthralin, topical immunotherapy (diphencyprone or squaric acid dibutylester), or psoralen plus ultraviolet A (PUVA) are some of the treatment options. In severe cases, long-acting oral tetracyclines in combination with potent topical corticosteroids may be used for scarring alopecia. The cause of alopecia areata is believed to be an autoimmune disorder, and it is important to rule out other underlying disorders through proper evaluation.


### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [63]:
print("query4 : ", query_4)
rag_prompt, _ = prepare_rag_model_prompt(system_message, query_4, retriever, k=3)
mrq4, moq4 = generate_response(rag_prompt, max_tokens=0)
print(mrq4)

query4 :  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?


Llama.generate: prefix-match hit


 The recommended treatments for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function, include ensuring a reliable airway and maintaining adequate ventilation, oxygenation, and blood pressure. Surgery may be needed in patients with more severe injury to place monitors to track and treat intracranial pressure, decompress the brain if intracranial pressure is increased, or remove intracranial hematomas. In the first few days after the injury, maintaining adequate brain perfusion and oxygenation and preventing complications of altered sensorium are important. Subsequently, many patients require rehabilitation. Supportive care should include preventing systemic complications due to immobilization, providing good nutrition, and preventing pressure ulcers. A team approach that combines physical, occupational, and speech therapy, skill-building activities, and counseling may be required for patients whose coma exceeds 24

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [64]:
print("query5 : ", query_5)
rag_prompt, _ = prepare_rag_model_prompt(system_message, query_5, retriever, k=3)
mrq5, moq5 = generate_response(rag_prompt, max_tokens=0)
print(mrq5)

query5 :  What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?


Llama.generate: prefix-match hit


 The necessary precautions for a person who has fractured their leg during a hiking trip include keeping the cast dry, never putting an object inside the cast, inspecting the cast's edges and skin around the cast every day, applying lotion to any red or sore areas, and padding any rough edges with soft material to prevent the cast's edges from causing discomfort. They should also seek medical care at once if an odor emanates from within the cast or if a fever develops. Immobilization with a cast is helpful for fractures, but prolonged immobilization can cause complications such as stiffness, contractures, and muscle atrophy. Early mobilization, which involves resuming active motion within the first few days or weeks, may minimize these complications and accelerate functional recovery. In the field, the affected area should be rewarmed rapidly by immersing it in water that is tolerably warm to the touch, and the patient should be given analgesics if available. Once in the hospital, the 

<span style="color: blue;"> **Observation**</span>

*   Responses to queries 1 to 5 now draw on the Merck Manual rather than on the model's general training data, which shows that the RAG pipeline is supplying relevant context.

### 6.1 - Tuning the RAG System (chunking, retriever, LLM parameters)

In [65]:
# Define retriever for the vector database storing chunks of 245 tokens
retriever_245 = vectorstore_245.as_retriever(
    search_type='similarity',
    search_kwargs={'k': 3}
)

#### Query 1: What is the protocol for managing sepsis in a critical care unit?

Combination 1 - System prompt that refers to the context, with `retriever_245` and k=2

In [66]:
# Each chunk holds at most 245 tokens; set k=2 to retrieve only the 2 most similar chunks
print("query1 : ", query_1)
rag_prompt, _ = prepare_rag_model_prompt(system_message, query_1, retriever_245, k=2)
mrq1, moq1 = generate_response(rag_prompt, max_tokens=0)
print(mrq1)

query1 :  What is the protocol for managing sepsis in a critical care unit?


Llama.generate: prefix-match hit


 Patients with sepsis in a critical care unit should be monitored frequently for systemic pressure, CVP or PAOP, pulse oximetry, ABGs, blood glucose, lactate, electrolyte levels, renal function, sublingual PCO2, urine output, and possibly have PAOP or echocardiography to identify limitations in left ventricular function and incipient pulmonary edema due to fluid overload. Fluid resuscitation with 0.9% saline should be given until CVP reaches 8 mm Hg (10 cm H2O) or PAOP reaches 12 to 15 mm Hg. Oliguria with hypotension is not a contraindication to vigorous fluid resuscitation, and the quantity of fluid required often exceeds the normal blood volume and may reach 10 L over 4 to 12 hours. Supportive care includes adequate nutrition and prevention of infection, stress ulcers and gastritis, and pulmonary embolism.


<span style="color: blue;"> **Observation**</span>
- Using `retriever_245` with `k=2` gives the LLM less context than using `retriever` with `k=3`.
- The answer changes accordingly: it now covers monitoring and fluid resuscitation, whereas the earlier response focused on antibiotics, drainage, and glucose control.

#### Query 2: What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

Combination 2 - System prompt that refers to the context, with `retriever_245` and k=7

In [67]:
print("query2 : ", query_2)
rag_prompt, _ = prepare_rag_model_prompt(system_message, query_2, retriever_245, k=7)
mrq2, moq2 = generate_response(rag_prompt,max_tokens=0)
print(mrq2)

query2 :  What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?


Llama.generate: prefix-match hit


 The common symptoms of appendicitis include epigastric or periumbilical pain followed by brief nausea, vomiting, and anorexia, which shifts to the right lower quadrant after a few hours. Pain increases with cough and motion. Classic signs are right lower quadrant direct and rebound tenderness located at McBurney's point. Additional signs include pain felt in the right lower quadrant with palpation of the left lower quadrant (Rovsing sign) and an increase in pain from passive extension of the right hip joint.

Appendicitis cannot be cured via medicine alone. The standard surgical procedure to treat it is an appendectomy, which involves removing the appendix. This procedure should be preceded by IV antibiotics, and if the appendix is perforated, antibiotics should be continued until the patient's temperature and WBC count have normalized or for a fixed course according to the surgeon's preference. If surgery is impossible, antibiotics can improve the survival rate but are not curative.


<span style="color: blue;"> **Observation**</span>
- Using `retriever_245` with `k=7` gives the LLM more context than using `retriever` with `k=3`.
- The result is a more detailed response that adds the progression of pain, the classic physical signs, and the role of antibiotics around surgery.
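
A quick way to compare these combinations is to estimate how many prompt tokens each one can consume. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes every chunk is full-size, and the 300-token overhead for instructions and question and the 1000 tokens reserved for the answer are illustrative assumptions. Only the 5000-token context window and the chunk sizes come from the notebook.

```python
def estimated_prompt_tokens(k: int, chunk_tokens: int, overhead_tokens: int = 300) -> int:
    """Upper-bound estimate: k full-size chunks plus instructions and question."""
    return k * chunk_tokens + overhead_tokens


def fits_context(
    k: int,
    chunk_tokens: int,
    n_ctx: int = 5000,
    reserved_for_answer: int = 1000,
    overhead_tokens: int = 300,
) -> bool:
    """True if the estimated prompt plus the reserved answer budget fits in n_ctx."""
    needed = estimated_prompt_tokens(k, chunk_tokens, overhead_tokens) + reserved_for_answer
    return needed <= n_ctx


# (k, chunk size in tokens) for the combinations tried in this section.
for k, chunk_tokens in [(3, 490), (2, 245), (7, 245)]:
    estimate = estimated_prompt_tokens(k, chunk_tokens)
    print(f"k={k}, chunk={chunk_tokens}: ~{estimate} prompt tokens, fits={fits_context(k, chunk_tokens)}")
```

Under these assumptions, seven 245-token chunks cost only slightly more than three 490-token chunks, which is why the larger `k` is affordable with the smaller chunk size.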

#### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

Combination 3 - System prompt that refers to the context, with `retriever_245` and temperature = 1

In [68]:
print("query3 : ", query_3)
rag_prompt, _ = prepare_rag_model_prompt(system_message, query_3, retriever_245, k=3)
mrq3, moq3 = generate_response(rag_prompt, max_tokens=0, temperature=1)
print(mrq3)

query3 :  What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?


Llama.generate: prefix-match hit


 There are several treatment options for addressing sudden patchy hair loss, or alopecia areata, as mentioned in the context. These include topical, intralesional, or systemic corticosteroids, topical minoxidil, topical anthralin, topical immunotherapy (diphencyprone or squaric acid dibutylester), or psoralen plus ultraviolet A (PUVA). In severe cases, systemic corticosteroids may be prescribed. The cause of alopecia areata is not always clear, but it is generally considered an autoimmune disorder. Other possible causes of sudden patchy hair loss include nutritional deficiencies, stress, and certain medications. It is important to consult a healthcare professional for an accurate diagnosis and treatment plan.


<span style="color: blue;"> **Observation**</span>

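- Setting temperature to 1 makes sampling more random. Here the answer refers to "the context" despite the instruction not to, and lists possible causes (nutritional deficiencies, stress, certain medications) that the earlier temperature-0 response did not include.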
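
The effect of temperature can be illustrated with a softmax over toy logits: dividing the logits by a small temperature concentrates probability on the top token, while temperature 1 leaves the distribution as it is. This is a generic sketch of temperature scaling with made-up logits; how the inference library combines it with `top_k`, `top_p`, and `repeat_penalty` is not modelled here.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        # Temperature 0 is usually handled as greedy decoding (always pick the top token).
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(toy_logits, 0.2)
flat = softmax_with_temperature(toy_logits, 1.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

At the lower temperature almost all probability mass sits on the first token, so repeated runs give nearly identical text; at temperature 1 the other tokens are sampled far more often.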

#### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

Combination 4 - System prompt that refers to the context, with `retriever_245` and max_tokens = 100

In [69]:
print("query4 : ", query_4)
rag_prompt, _ = prepare_rag_model_prompt(system_message, query_4, retriever_245, k=3)
mrq4, moq4 = generate_response(rag_prompt, max_tokens=100)
print(mrq4)
print("completion_tokens = ", moq4['usage']['completion_tokens'])

query4 :  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?


Llama.generate: prefix-match hit


 The recommended treatments for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function, include supportive care. Supportive care should include preventing systemic complications due to immobilization, providing good nutrition, and preventing pressure ulcers. There is no specific treatment for the brain injury itself.
completion_tokens =  69


<span style="color: blue;"> **Observation**</span>

- The response used 69 completion tokens, fewer than 100, which is consistent with a `max_tokens=100` limit in the `generate_response` function.

#### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

Combination 5 - System prompt asking for a bullet list, with `retriever_245` and k=3

In [70]:
sys = system_message + """
Respond in exactly 4 bullet points.
"""

In [72]:
print("query5 : ", query_5)
rag_prompt, _ = prepare_rag_model_prompt(sys, query_5, retriever_245, k=3)
mrq5, moq5 = generate_response(rag_prompt, max_tokens=0)
print(mrq5)

query5 :  What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?


Llama.generate: prefix-match hit


 1. The person should seek medical care at once for the fracture.
2. The fractured leg should be immobilized using a splint or cast to prevent movement and interference with healing.
3. The wound should be kept clean and dry, and nonadherent and impermeable dressings should be applied.
4. Antibiotic ointment should be applied daily until the wound closure device is removed, and the patient should be advised to inspect the wound for signs of infection, such as odor or fever. Additionally, the patient should maintain good hygiene and elevate the leg above heart level for the first 48 hours. If the fracture requires prolonged immobilization, the patient may be at risk for complications such as stiffness, contractures, and muscle atrophy. Early mobilization may help minimize these complications.


<span style="color: blue;"> **Observation**</span>
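- The response is formatted as a list, as the system prompt requested.
- The list has exactly 4 items, although the fourth item runs on with several additional sentences.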

## 6 - Output Evaluation

Evaluation of the RAG system will be performed using the LLM-as-a-judge method, which is useful when no human-annotated reference text is available as a gold standard.

A larger language model (Llama 2 13B) rates the quality of the RAG system's responses on two key aspects:

1.  **Faithfulness** (inversely related to the hallucination rate): how well the generated response aligns with the retrieved documents, without hallucinated content.
2.  **Relevance**: how well the system uses the retrieved information to generate accurate and helpful answers.

The Llama 2 13B model (13 billion parameters) will be downloaded and loaded for this evaluation. Note that this model is approximately 11GB in size.

### 6.1 - Download and load the Llama model
| Model | Repository | File/Name | Model card |
|-------|------------|-----------|---------|
| Llama-2-13B-chat | `TheBloke/Llama-2-13B-chat-GGUF` | `llama-2-13b-chat.Q5_K_M.gguf` | https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF |

The prompt template for Llama-2-Chat, as in the model card, is `[INST]{prompt}[/INST]`.

In [4]:
# Define the model repository and filename
llama_model_repo = "TheBloke/Llama-2-13B-chat-GGUF"
llama_model_file = "llama-2-13b-chat.Q5_K_M.gguf"

In [5]:
# Download the model
llama_model_path = hf_hub_download(
    repo_id= llama_model_repo,
    filename= llama_model_file
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [6]:
# Initialize the Llama model with the downloaded GGUF file.
# model_path: path to the GGUF model file.
# n_ctx: context window size (determines how much text the model can process at once).
# n_gpu_layers: number of layers to offload to the GPU for acceleration.
# n_batch: batch size for processing.
llama_llm = Llama(
    model_path=llama_model_path,
    n_ctx=5000,
    n_gpu_layers=38,
    n_batch=512
)

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
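
Because the rater prompt contains the rater instructions, the retrieved context, and the RAG answer, it can approach the `n_ctx` limit. The sketch below shows one way to check the token budget before calling the model. The names `fits_in_context` and `whitespace_tokenize` are hypothetical, and the 512-token output reserve is an arbitrary assumption.

```python
from typing import Callable, List, Sequence


def fits_in_context(
    prompt: str,
    tokenize: Callable[[bytes], Sequence[int]],
    n_ctx: int = 5000,
    reserved_for_output: int = 512,
) -> bool:
    """Return True if the prompt leaves `reserved_for_output` tokens free in the context."""
    n_prompt_tokens = len(tokenize(prompt.encode("utf-8")))
    return n_prompt_tokens + reserved_for_output <= n_ctx


def whitespace_tokenize(data: bytes) -> List[int]:
    """Stand-in tokenizer for illustration only: one token per whitespace-separated word."""
    return list(range(len(data.split())))


print(fits_in_context("word " * 100, whitespace_tokenize))   # 100 + 512 <= 5000
print(fits_in_context("word " * 4600, whitespace_tokenize))  # 4600 + 512 > 5000
```

With the model loaded, `llama_llm.tokenize` (which takes bytes) could be passed as `tokenize` in place of the stand-in, assuming the installed llama-cpp-python version exposes it with that signature.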


### 6.2 - Define utility function `generate_faithfulness_and_relevance_score`

In [7]:
def generate_faithfulness_and_relevance_score(
    user_input, # user query
    system_prompt, # prompt to generate RAG response
    retriever, # DB retriever
    faithfulness_rater_system_message,
    relevance_rater_system_message,
    max_tokens=0,
    temperature=0,
    top_p=0.95,
    top_k=50,
    repeat_penalty=1.0,
    k=3,
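):
    """Generate a RAG answer, then ask the Llama 2 judge to rate its faithfulness and relevance.

    Relies on `prepare_rag_model_prompt`, `generate_response`, and `llama_llm` defined in earlier cells.
    Returns a tuple (faithfulness_text, relevance_text) with the raw rater outputs.
    """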
    rag_prompt, context_retrieved = prepare_rag_model_prompt(
        system_prompt,
        user_input,
        retriever,
        k=k
    )

    rag_response, model_output = generate_response(
        rag_prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        repeat_penalty=repeat_penalty
    )

    # Shared message body (question, context, answer) shown to both raters.
    message_template = """
      ###Question
      {question}

      ###Context
      {context}

      ###Answer
      {answer}
      """
    rater_input = message_template.format(
        question=user_input, context=context_retrieved, answer=rag_response
    )

    # Build the faithfulness prompt in Llama-2-Chat format.
    faithfulness_prompt = f"""[INST]{faithfulness_rater_system_message}\n
                user: {rater_input}
                [/INST]"""

    print("faithfulness_prompt = ", faithfulness_prompt)

    # Build the relevance prompt in Llama-2-Chat format.
    relevance_prompt = f"""[INST]{relevance_rater_system_message}\n
                user: {rater_input}
                [/INST]"""

    print("relevance_prompt = ", relevance_prompt)

    # Ask the judge model to rate faithfulness.
    faith_response = llama_llm(
        prompt=faithfulness_prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        stop=['INST'],
        echo=False
    )

    # Ask the judge model to rate relevance.
    relevance_response = llama_llm(
        prompt=relevance_prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        stop=['INST'],
        echo=False
    )

    # Return only the generated text of each rating.
    return faith_response['choices'][0]['text'], relevance_response['choices'][0]['text']
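
The function above returns the raters' free-text explanations. To compare scores across queries, the numeric rating has to be pulled out of that text. The sketch below is one possible approach using a regular expression. The helper name `extract_rating` and the keyword heuristic are assumptions, and real rater output may need a stricter output format to parse reliably.

```python
import re
from typing import Optional

# Matches a keyword such as "rating", "score", or "rate", followed within 20
# non-digit characters by a single digit from 1 to 5.
_RATING_PATTERN = re.compile(
    r"\b(?:rating|score|rate)[sd]?\b\D{0,20}?([1-5])(?!\d)", re.IGNORECASE
)


def extract_rating(rater_output: str) -> Optional[int]:
    """Return the last 1-5 rating that follows a rating keyword, or None if absent."""
    matches = _RATING_PATTERN.findall(rater_output)
    return int(matches[-1]) if matches else None


print(extract_rating("Step 1: compare.\nOverall rating: 4 out of 5."))
print(extract_rating("The answer restates the context."))
```

Taking the last match is a deliberate choice: the raters are asked to explain first and score at the end, so a later number is more likely to be the final rating.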

In [8]:
faithfulness_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
Rate it 1 - if the ###Answer is not derived from the ###Context at all.
Rate it 2 - if the ###Answer is derived from the ###Context only to a limited extent.
Rate it 3 - if the ###Answer is derived from the ###Context to a good extent.
Rate it 4 - if the ###Answer is derived from the ###Context mostly.
Rate it 5 - if the ###Answer is derived from the ###Context completely.

Metric:
The answer should be derived only from the information presented in the context.

Instructions:
1. First write down the steps that are needed to evaluate the answer as per the metric.
2. Give a step-by-step explanation of whether the answer adheres to the metric, considering the question and context as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the answer using the evaluation criteria and assign a score.

Please note: Make sure you give a single overall rating in the range of 1 to 5 along with an overall explanation.
"""

In [9]:
relevance_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
- Rate 1 – The ###Answer is not relevant to the ###Question at all.
- Rate 2 – The ###Answer is only slightly relevant to the ###Question, missing key aspects.
- Rate 3 – The ###Answer is moderately relevant, addressing some parts of the ###Question but leaving out important details.
- Rate 4 – The ###Answer is mostly relevant, covering key aspects but with minor gaps.
- Rate 5 – The ###Answer is fully relevant, directly answering all important aspects of the ###Question with appropriate details from the ###Context.

Metric:
Relevance measures how well the answer addresses the main aspects of the question, based on the context.
Consider whether all and only the important aspects are contained in the answer when evaluating relevance.

Instructions:
1. First write down the steps that are needed to evaluate the answer as per the metric.
2. Give a step-by-step explanation of whether the answer adheres to the metric, considering the question and context as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the answer using the evaluation criteria and assign a score.

Note: Provide a single overall rating in the range of 1 to 5, along with a brief explanation of why you assigned that score.
"""

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [10]:
# Requires `query_1`, `system_message`, and `retriever` from earlier cells;
# re-run those cells first if the kernel has been restarted.
ground, rel = generate_faithfulness_and_relevance_score(
    user_input=query_1,
    system_prompt=system_message,
    retriever=retriever,
    faithfulness_rater_system_message=faithfulness_rater_system_message,
    relevance_rater_system_message=relevance_rater_system_message
)
print(ground, end="\n\n")
print(rel)

NameError: name 'query_1' is not defined

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
ground, rel = generate_faithfulness_and_relevance_score(
    user_input=query_2,
    system_prompt=system_message,
    retriever=retriever,
    faithfulness_rater_system_message=faithfulness_rater_system_message,
    relevance_rater_system_message=relevance_rater_system_message
)

print(ground, end="\n\n")
print(rel)

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
ground, rel = generate_faithfulness_and_relevance_score(
    user_input=query_3,
    system_prompt=system_message,
    retriever=retriever,
    faithfulness_rater_system_message=faithfulness_rater_system_message,
    relevance_rater_system_message=relevance_rater_system_message
)

print(ground, end="\n\n")
print(rel)

### Query 4: What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
ground, rel = generate_faithfulness_and_relevance_score(
    user_input=query_4,
    system_prompt=system_message,
    retriever=retriever,
    faithfulness_rater_system_message=faithfulness_rater_system_message,
    relevance_rater_system_message=relevance_rater_system_message
)

print(ground, end="\n\n")
print(rel)

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
ground, rel = generate_faithfulness_and_relevance_score(
    user_input=query_5,
    system_prompt=system_message,
    retriever=retriever,
    faithfulness_rater_system_message=faithfulness_rater_system_message,
    relevance_rater_system_message=relevance_rater_system_message
)

print(ground, end="\n\n")
print(rel)
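
The five evaluation cells above repeat the same call with a different query. As a rough sketch, the repetition could be folded into a loop that takes the scoring function as an argument. The names `evaluate_queries` and `stub_score_fn` are hypothetical, and the stub only stands in for the real scoring call so that the example runs without the model.

```python
from typing import Callable, Dict, List, Tuple


def evaluate_queries(
    queries: List[str],
    score_fn: Callable[[str], Tuple[str, str]],
) -> List[Dict[str, str]]:
    """Run `score_fn` on each query and collect both rater outputs."""
    results = []
    for query in queries:
        faithfulness, relevance = score_fn(query)
        results.append(
            {"query": query, "faithfulness": faithfulness, "relevance": relevance}
        )
    return results


def stub_score_fn(query: str) -> Tuple[str, str]:
    """Placeholder for the real rater call (illustration only)."""
    return f"faithfulness text for: {query}", f"relevance text for: {query}"


for row in evaluate_queries(["q1", "q2"], stub_score_fn):
    print(row["query"], "->", row["faithfulness"])
```

In this notebook, `score_fn` could be a small wrapper around `generate_faithfulness_and_relevance_score` that fixes `system_message`, `retriever`, and the two rater system messages, leaving only the query as input.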

## Actionable Insights and Business Recommendations


*   
*  
*



<font size=6 color='blue'>Power Ahead</font>
___