**Common Questions to Answer**

**1. Diagnostic Assistance**: "What are the common symptoms and treatments for pulmonary embolism?"

**2. Drug Information**: "Can you provide the trade names of medications used for treating hypertension?"

**3. Treatment Plans**: "What are the first-line options and alternatives for managing rheumatoid arthritis?"

**4. Specialty Knowledge**: "What are the diagnostic steps for suspected endocrine disorders?"

**5. Critical Care Protocols**: "What is the protocol for managing sepsis in a critical care unit?"

## Installing and Importing Necessary Libraries and Dependencies

In [None]:
# Installation for CPU llama-cpp-python
!CMAKE_ARGS="-DLLAMA_CUBLAS=off" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q

In [None]:
!pip uninstall -y numpy pandas

!pip install numpy==1.23.5 pandas==1.5.3 -q

In [None]:
!pip install huggingface_hub==0.23.2 tiktoken==0.6.0 pymupdf==1.25.1 \
langchain==0.1.1 langchain-community==0.0.13 chromadb==0.4.22 \
sentence-transformers==2.2.2 transformers==4.23.1 -q

In [None]:
# Importing library for data manipulation
import pandas as pd

# Function to download the model from the Hugging Face model hub
from huggingface_hub import hf_hub_download

# Importing the Llama class from the llama_cpp module
from llama_cpp import Llama

#Libraries for Loading Data, Chunking, Embedding, and Vector Databases
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma

# Importing the json module
import json

## Question Answering using LLM

### Downloading and Loading the model

In [None]:
# Defining the Hugging Face repository and model version for Mistral-7B fine-tuned for instruction-following
model_name_or_path = 'TheBloke/Mistral-7B-Instruct-v0.2-GGUF'

# Specifying the file name for the quantized Mistral-7B model in GGUF format (Q6_K for optimal performance)
model_basename = 'mistral-7b-instruct-v0.2.Q6_K.gguf'

In [None]:
# Downloading the specified model file from Hugging Face Hub and store its local path
model_path = hf_hub_download(
    repo_id=model_name_or_path, #The Hugging Face repository containing the model
    filename=model_basename  # The specific model file to download (in GGUF format)
)
#The GGUF format is used because it provides memory-efficient storage and faster inference while maintaining compatibility across different hardware platforms.

In [None]:
# Loading the Mistral-7B GGUF model through llama-cpp with the chosen context size, GPU offload setting, and batch size
llm = Llama(
    model_path=model_path, #Path to the GGUF model file
    n_ctx=2300, #Sets the context window to 2300 tokens (the prompt plus the generated text must fit within it)
    n_gpu_layers=38, #Requests offloading 38 model layers to the GPU; this only has an effect with a GPU-enabled build (set to 0 for CPU-only)
    n_batch=512 #Number of prompt tokens processed per batch
)

### Defining Model Response Parameters

In [None]:
def response(query, max_tokens=512, temperature=0, top_p=0.95, top_k=50):
    # Sends the query prompt to the LLM with specified generation parameters
    model_output = llm( 
        prompt=query, #The user's input question or prompt sent to the LLM
        max_tokens=max_tokens, #Maximum number of tokens to generate
        temperature=temperature, #Controls randomness
        top_p=top_p, #picks from top tokens that make up top_p of total probability
        top_k=top_k #considers only the top_k most likely tokens
    )
    # Extracting and returning only the text part of the response
    return model_output['choices'][0]['text'].strip()

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
response('What is the protocol for managing sepsis in a critical care unit?')

**Observations**

* The response generated by the LLM is detailed and medically appropriate regarding the management of sepsis in a critical care unit.

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
response('What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?')

**Observations**

* The answer, similar to the first query, is medically accurate, demonstrating that the model is performing effectively.

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
response('What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?')

**Observations**

* The answer is medically appropriate and lists effective treatments for sudden patchy hair loss.

### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
response('What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?')

**Observations**

* The answer provided is medically appropriate and offers a comprehensive list of common treatments for brain injury, which aligns well with the query.

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
response('What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?')

**Observations**

* This answer is comprehensive and medically appropriate, outlining the necessary precautions and treatment steps for a person who has fractured their leg while hiking.

## Question Answering using LLM with Prompt Engineering

### Defining Model Response Parameters

In [None]:
def generate_llama_response(query, instruction="Answer the following question clearly and concisely."):

    # Create system message with instructions for the model
    system_message = f"[INST]<<SYS>>\n{instruction}\n<</SYS>>[/INST]"

    # Construct the final prompt using the user's query and system message
    prompt = f"{system_message} {query}"

    # Generate a response using the LLaMA model
    response = llm(
        prompt=prompt,
        max_tokens=1024,
        temperature=0.7,
        top_p=0.95,
        repeat_penalty=1.1,
        top_k=50,
        stop=['</s>'],
        echo=False,
        seed=42,
    )

    # Return only the generated answer text
    return response["choices"][0]["text"].strip()
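The Mistral-7B-Instruct model card describes an `[INST] ... [/INST]` wrapper around the user turn, whereas the `<<SYS>>` tags are a Llama-2 convention. The sketch below shows one way to build a prompt under that assumption, with the instruction and the query inside the same `[INST]` span so that generation starts right after `[/INST]`. `build_mistral_prompt` is a hypothetical helper for illustration and is not used by the notebook's functions.

```python
def build_mistral_prompt(query, instruction=""):
    """Wrap an optional instruction and the user query in [INST] ... [/INST] tags."""
    # Hypothetical helper: both the instruction and the query sit inside one
    # [INST] span, so the model starts generating after the closing tag.
    query = query.strip()
    instruction = instruction.strip()
    body = f"{instruction}\n\n{query}" if instruction else query
    return f"[INST] {body} [/INST]"


example_prompt = build_mistral_prompt(
    "What is the protocol for managing sepsis in a critical care unit?",
    instruction="Answer the following question clearly and concisely.",
)
print(example_prompt)
```

The returned string could be passed as `prompt=` to `llm(...)` in place of the hand-built prompt. Whether it changes the answers would need to be checked by re-running the queries.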

### Query 1: What is the protocol for managing sepsis in a critical care unit?

#### Basic parameters

In [None]:
generate_llama_response('What is the protocol for managing sepsis in a critical care unit?')

#### Modified parameters - Temperature = 0.0

In [None]:
def generate_llama_response1(query, instruction="Answer the following question clearly and concisely."):

    # Create system message with instructions for the model
    system_message = f"[INST]<<SYS>>\n{instruction}\n<</SYS>>[/INST]"

    # Construct the final prompt using the user's query and system message
    prompt = f"{system_message} {query}"

    # Generate a response using the LLaMA model
    response = llm(
        prompt=prompt,
        max_tokens=1024,
        temperature=0.0,
        top_p=0.95,
        repeat_penalty=1.1,
        top_k=50,
        stop=['</s>'],
        echo=False,
        seed=42,
    )

    # Return only the generated answer text
    return response["choices"][0]["text"].strip()

In [None]:
generate_llama_response1('What is the protocol for managing sepsis in a critical care unit?')

#### Observation on Query 1


* **Answer 1**, generated with a temperature of 0.7, is more **comprehensive and nuanced**, offering a clear step-by-step clinical protocol with specific metrics (e.g., MAP ≥ 65 mmHg, urine output ≥ 0.5 ml/kg/h). This reflects a more dynamic and human-like response, likely enhanced by the higher temperature which allows the model to explore a broader range of relevant tokens, contributing to richer content.

* **Answer 2**, with temperature 0.0, is more conservative and deterministic, sticking to the most probable next token at each step. While it remains medically accurate, it is less informative and cuts off mid-sentence. The truncation may be tied to the generation length or stop condition rather than to temperature itself, so it should not be read as a direct effect of the lower setting.

* Overall, **Answer 1** is superior in terms of completeness, clinical clarity, and usefulness for decision-making in a critical care setting.
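The effect of temperature described above can be illustrated with a small, self-contained sketch. It assumes the usual formulation in which logits are divided by the temperature before the softmax. The logits are made-up numbers, and `softmax_with_temperature` is an illustrative helper rather than part of llama-cpp.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.7, 0.1):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature shrinks, the probability mass concentrates on the highest-scoring token, which is why very low temperatures behave almost deterministically.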

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

#### Basic parameters

In [None]:
generate_llama_response('What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?')

#### Modified Parameters - Top_p = 0.85

In [None]:
def generate_llama_response2(query, instruction="Answer the following question clearly and concisely."):

    # Create system message with instructions for the model
    system_message = f"[INST]<<SYS>>\n{instruction}\n<</SYS>>[/INST]"

    # Construct the final prompt using the user's query and system message
    prompt = f"{system_message} {query}"

    # Generate a response using the LLaMA model
    response = llm(
        prompt=prompt,
        max_tokens=1024,
        temperature=0.7,
        top_p=0.85,
        repeat_penalty=1.1,
        top_k=50,
        stop=['</s>'],
        echo=False,
        seed=42,
    )

    # Return only the generated answer text
    return response["choices"][0]["text"].strip()

In [None]:
generate_llama_response2('What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?')

#### Observations on Query 2

* **Answer 1**, generated with a top_p of 0.95, allows the model to consider a wider range of possible next tokens, leading to slightly more diverse and expansive content. The answer is informative but a bit general in structure.

* **Answer 2**, generated with a narrower top_p of 0.85, restricts the model to a smaller pool of more probable tokens, which yields a more focused response.

* Overall, **Answer 2** is better because the lower top_p value helps the model generate more focused and relevant content, minimizing unnecessary variation.
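To make the top_p comparison concrete, here is a minimal sketch of nucleus filtering on a made-up probability distribution. `top_p_filter` is an illustrative helper, not the llama-cpp implementation. It keeps the most likely tokens until their cumulative probability reaches `top_p`.

```python
def top_p_filter(probs, top_p):
    """Return the indices of the tokens kept by nucleus (top-p) filtering."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:  # stop once the kept mass reaches top_p
            break
    return kept


toy_probs = [0.50, 0.30, 0.12, 0.05, 0.03]  # made-up next-token probabilities
print(top_p_filter(toy_probs, 0.95))  # wider candidate pool
print(top_p_filter(toy_probs, 0.85))  # narrower candidate pool
```

With this toy distribution, the lower top_p drops one extra low-probability candidate, which is the mechanism behind the more focused output discussed above.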

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

#### Basic parameters

In [None]:
generate_llama_response('What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?')

#### Modified Parameters - Top_k = 80

In [None]:
def generate_llama_response3(query, instruction="Answer the following question clearly and concisely."):

    # Create system message with instructions for the model
    system_message = f"[INST]<<SYS>>\n{instruction}\n<</SYS>>[/INST]"

    # Construct the final prompt using the user's query and system message
    prompt = f"{system_message} {query}"

    # Generate a response using the LLaMA model
    response = llm(
        prompt=prompt,
        max_tokens=1024,
        temperature=0.7,
        top_p=0.95,
        repeat_penalty=1.1,
        top_k=80,
        stop=['</s>'],
        echo=False,
        seed=42,
    )

    # Return only the generated answer text
    return response["choices"][0]["text"].strip()

In [None]:
generate_llama_response3('What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?')

#### Observations on Query 3

* **Answer 1**, generated with a top_k of 50, provides a broader overview by listing multiple causes of patchy hair loss and briefly mentioning treatment options for each, which makes it informative but slightly general.

* **Answer 2**, with top_k of 80, dives deeper into specific treatments for different types of hair loss, especially Alopecia Areata and Traction Alopecia, offering more clinical detail and therapeutic options.

* Overall, in the context of this project, **Answer 2** is better because it shows more medical depth and actionable insights.
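The role of top_k can be sketched in the same spirit. The helper below is illustrative only, and the probabilities are made up. It keeps the `top_k` most likely tokens and renormalises their probabilities, so a larger top_k leaves more candidates available for sampling.

```python
def top_k_filter(probs, top_k):
    """Keep the top_k most likely tokens and renormalise their probabilities."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:top_k]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


toy_probs = [0.40, 0.25, 0.20, 0.10, 0.05]  # made-up next-token probabilities
print(top_k_filter(toy_probs, 2))
print(top_k_filter(toy_probs, 4))
```

In a real vocabulary the tail beyond the first few dozen tokens usually carries little probability, so moving top_k from 50 to 80 tends to matter less than changes to temperature or top_p.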

### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

#### Basic Parameters

In [None]:
generate_llama_response('What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?')

#### Modified Parameters - Max_tokens = 512

In [None]:
def generate_llama_response4(query, instruction="Answer the following question clearly and concisely."):

    # Create system message with instructions for the model
    system_message = f"[INST]<<SYS>>\n{instruction}\n<</SYS>>[/INST]"

    # Construct the final prompt using the user's query and system message
    prompt = f"{system_message} {query}"

    # Generate a response using the LLaMA model
    response = llm(
        prompt=prompt,
        max_tokens=512,
        temperature=0.7,
        top_p=0.95,
        repeat_penalty=1.1,
        top_k=50,
        stop=['</s>'],
        echo=False,
        seed=42,
    )

    # Return only the generated answer text
    return response["choices"][0]["text"].strip()

In [None]:
generate_llama_response4('What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?')

#### Observations on Query 4

* **Answer 1**, generated with a Max_tokens of 1024, provides a comprehensive overview of brain injury treatment, listing specific interventions such as medications, surgery, rehabilitation, and supportive care, covering a wide range of recovery aspects, including physical, cognitive, and emotional support.

* **Answer 2**, with Max_tokens of 512, is slightly more concise, focusing on key areas like initial care, rehabilitation, medications, and supportive care. It emphasizes rehabilitation methods and includes mention of surgery when necessary for specific issues such as hematomas or clots.

* Overall, **Answer 1** is better as it offers a more detailed and well-rounded view: the larger token budget lets the model cover additional aspects of the topic.

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

#### Basic Parameters

In [None]:
generate_llama_response('What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?')

#### Modified Parameters - Temperature = 0.7, Top_p = 0.85, Top_k = 80, Max_tokens = 1024

In [None]:
def generate_llama_response5(query, instruction="Answer the following question clearly and concisely."):

    # Create system message with instructions for the model
    system_message = f"[INST]<<SYS>>\n{instruction}\n<</SYS>>[/INST]"

    # Construct the final prompt using the user's query and system message
    prompt = f"{system_message} {query}"

    # Generate a response using the LLaMA model
    response = llm(
        prompt=prompt,
        max_tokens=1024,
        temperature=0.7,
        top_p=0.85,
        repeat_penalty=1.1,
        top_k=80,
        stop=['</s>'],
        echo=False,
        seed=42,
    )

    # Return only the generated answer text
    return response["choices"][0]["text"].strip()

In [None]:
generate_llama_response5('What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?')

#### Observations on Query 5

* **Answer 1**, generated with default parameters, provides a more detailed approach to the immediate steps for handling a fractured leg during a hiking trip, emphasizing both the severity assessment and infection prevention, while focusing on the necessity of seeking medical attention in case of complications.

* **Answer 2**, with modified parameters, is more structured and offers a comprehensive recovery plan, addressing not only the immediate care but also long-term considerations like nutrition, rest, and physical therapy. It also focuses on the steps that a layperson can take, such as cleaning the wound and elevating the leg for pain management.

* Overall, **Answer 2** is more holistic and includes follow-up care, making it slightly better for a complete response that covers both the immediate treatment and recovery process.

## Data Preparation for RAG

### Loading the Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
merck_pdf_path = '/content/sample_data/medical_diagnosis_manual.pdf'
pdf_loader = PyMuPDFLoader(merck_pdf_path)
merck = pdf_loader.load()

### Data Overview

#### Checking first 5 pages

In [None]:
for i in range(5):
    print(f"Page Number : {i+1}",end="\n")
    print(merck[i].page_content,end="\n")

#### Checking the number of pages

In [None]:
len(merck)

### Data Chunking

In [None]:
# Initializing a RecursiveCharacterTextSplitter to split the text into manageable chunks for embedding and retrieval
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base',
    chunk_size=512,
    chunk_overlap= 20
)

In [None]:
#loading the PDF document, extracting its text, and splitting it into smaller chunks
document_chunks = pdf_loader.load_and_split(text_splitter)

In [None]:
#Checking the number of text chunks the pdf has been split into
len(document_chunks)
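The interaction between `chunk_size` and `chunk_overlap` can be illustrated with a simplified fixed-window sketch. This is not how `RecursiveCharacterTextSplitter` works internally: that splitter breaks text on separators and merges the pieces up to the token limit, so its overlap is approximate. `chunk_tokens` and the integer "tokens" below are assumptions made for illustration.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size sharing chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


toy_tokens = list(range(12))  # stand-in for token ids
for chunk in chunk_tokens(toy_tokens, chunk_size=5, chunk_overlap=2):
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so consecutive windows repeat the last `chunk_overlap` tokens.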

#### Confirming that there is overlap between chunks

In [None]:
document_chunks[120].page_content

In [None]:
document_chunks[119].page_content

In [None]:
document_chunks[121].page_content
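Reading three chunks by eye works, but the shared text can also be located programmatically. The sketch below uses a hypothetical helper, `shared_overlap`, with made-up strings. It returns the longest suffix of one text that is also a prefix of the next.

```python
def shared_overlap(previous_text, next_text):
    """Return the longest suffix of previous_text that is also a prefix of next_text."""
    limit = min(len(previous_text), len(next_text))
    for size in range(limit, 0, -1):
        if previous_text.endswith(next_text[:size]):
            return next_text[:size]
    return ""


# Made-up strings standing in for two consecutive chunks
first = "Alopecia areata is sudden patchy hair loss"
second = "patchy hair loss in people with no obvious skin disorder"
print(repr(shared_overlap(first, second)))
```

In the notebook this could be called as `shared_overlap(document_chunks[119].page_content, document_chunks[120].page_content)`. An empty result is plausible when the two chunks come from different pages or when whitespace was stripped at the chunk edges, so it would not by itself prove that overlap is missing.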

**Observations**

### Embedding

In [None]:
#This model is chosen because its maximum input length (512 tokens) matches the chunk size used above; its embedding vectors have 1024 dimensions.
embedding_model = SentenceTransformerEmbeddings(model_name='thenlper/gte-large')

In [None]:
# Generating embedding for the first document chunk
embedding_1 = embedding_model.embed_query(document_chunks[0].page_content)
# Generating embedding for the second document chunk
embedding_2 = embedding_model.embed_query(document_chunks[1].page_content)

In [None]:
#Checking if both are of the same size
print("Dimension of the embedding vector ",len(embedding_1))
len(embedding_1)==len(embedding_2)
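Beyond checking that the two vectors have the same length, it can be useful to see how similar they are. The sketch below computes cosine similarity in plain Python on toy vectors. `cosine_similarity` is an illustrative helper, not something the notebook defines.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Calling `cosine_similarity(embedding_1, embedding_2)` on the two chunk embeddings would give a rough sense of how close the first two chunks are in the embedding space.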

### Vector Database

In [None]:
import os

In [None]:
# Creating the output directory 'merck_db' if it doesn't already exist, so we can save the processed data or vector database files there.
out_dir = 'merck_db'

if not os.path.exists(out_dir):
  os.makedirs(out_dir)

In [None]:
vectorstore = Chroma.from_documents( #creating a Chroma vector store from the document chunks
    document_chunks, #list of document chunks to embed and index
    embedding_model, #model that turns each chunk into an embedding vector
    persist_directory=out_dir #directory on disk where the Chroma database is saved
)

In [None]:
#Loading the persisted Chroma vector store with the same embedding model
vectorstore = Chroma(persist_directory=out_dir,embedding_function=embedding_model)

In [None]:
#Accessing the embedding function used in the Chroma vector store
vectorstore.embeddings

In [None]:
#Performing a similarity search in the vector store to find the top 3 most similar documents to "Alopecia Areata"
vectorstore.similarity_search("Alopecia Areata ",k=3)

**Observations**

* The retrieved chunks are all related to the search term **Alopecia Areata**.

### Retrieval

In [None]:
retriever = vectorstore.as_retriever( #Converting the Chroma vector store into a retriever for querying.
    search_type='similarity', #Plain nearest-neighbour search using the store's distance metric (Chroma defaults to L2 unless the collection is configured otherwise)
    search_kwargs={'k': 3} #Retrieving the top 3 most similar documents for a given query.
)
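What a similarity retriever does can be sketched on toy data: rank the stored vectors by distance to the query vector and keep the closest `k`. The example assumes squared L2 distance, which is Chroma's default metric unless the collection is configured otherwise. `top_k_by_l2` and the two-dimensional vectors are made up for illustration.

```python
def top_k_by_l2(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors closest to query_vec."""
    def squared_l2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    ranked = sorted(range(len(doc_vecs)), key=lambda i: squared_l2(query_vec, doc_vecs[i]))
    return ranked[:k]


toy_docs = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.3], [-1.0, 0.0]]  # made-up embeddings
print(top_k_by_l2([1.0, 0.0], toy_docs, k=3))
```

A real vector store uses an approximate index instead of sorting every vector, but the ranking idea is the same.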

In [None]:
user_input = 'What are the symptoms of migraine?'
rel_docs = retriever.get_relevant_documents(user_input)
rel_docs

### System and User Prompt Template

In [None]:
# System message instructing the LLM to only answer using Merck Manual 19th Edition
qna_system_message = """
You are a helpful assistant trained to answer questions based only on the Merck Manual of Medical Diagnosis and Therapy, Nineteenth Edition.
Use the context provided to find accurate and reliable answers.
If the answer is not found in the context, reply with "I don't know".
Do not mention the context or the Merck Manual in your final answer.
"""

In [None]:
# Template for formatting the user's input with context from the Merck Manual, 19th Edition and the actual medical question.
qna_user_message_template = """
###Context
The following excerpts are from the Merck Manual of Medical Diagnosis and Therapy, Nineteenth Edition:
{context}

###Question
{question}
"""

### Response Function

In [None]:
def generate_rag_response(user_input,k=3,max_tokens=128,temperature=0,top_p=0.95,top_k=50):
    # Retrieve the k most similar chunks; querying the vector store directly ensures k is honoured
    # (a retriever's k is fixed by the search_kwargs it was created with)
    relevant_document_chunks = vectorstore.similarity_search(user_input,k=k)
    context_list = [d.page_content for d in relevant_document_chunks]

    # Combine document chunks into a single context
    context_for_query = ". ".join(context_list)

    # Fill the template placeholders with the retrieved context and the question
    user_message = qna_user_message_template.replace('{context}', context_for_query)
    user_message = user_message.replace('{question}', user_input)

    prompt = qna_system_message + '\n' + user_message

    # Generate the response
    try:
        response = llm(
                  prompt=prompt,
                  max_tokens=max_tokens,
                  temperature=temperature,
                  top_p=top_p,
                  top_k=top_k
                  )

        # Extract the generated text from the model's response
        response = response['choices'][0]['text'].strip()
    except Exception as e:
        response = f'Sorry, I encountered the following error: \n {e}'

    return response
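The prompt-assembly step of the function above can be isolated and checked without loading the model. The sketch below is a simplified variant that uses `str.format` and joins chunks with blank lines. `build_rag_prompt` and the toy template are assumptions for illustration, not the notebook's own code.

```python
def build_rag_prompt(system_message, user_template, chunks, question):
    """Fill the user template with retrieved chunks and prepend the system message."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    user_message = user_template.format(context=context, question=question)
    return system_message.strip() + "\n" + user_message


toy_template = "###Context\n{context}\n\n###Question\n{question}\n"
toy_prompt = build_rag_prompt(
    "Answer only from the context.",
    toy_template,
    ["Chunk one text.", "Chunk two text."],
    "What does chunk two say?",
)
print(toy_prompt)
```

Printing the assembled prompt for a real query is a quick way to confirm that the retrieved context and the question land where the template expects them. It also helps to check that the whole prompt plus `max_tokens` fits within `n_ctx`.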

## Question Answering using RAG

### Query 1: What is the protocol for managing sepsis in a critical care unit?

#### Basic Parameters

In [None]:
user_input1 = 'What is the protocol for managing sepsis in a critical care unit?'
print(generate_rag_response(user_input1))

#### Fine Tuning - Removing Chunk Overlap

Chunk size = 512, overlap = 0

In [None]:
# Initializing a RecursiveCharacterTextSplitter to split the text into manageable chunks for embedding and retrieval
text_splitter1 = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base',
    chunk_size=512,
    chunk_overlap= 0
)

In [None]:
#loading the PDF document, extracting its text, and splitting it into smaller chunks
document_chunks1 = pdf_loader.load_and_split(text_splitter1)

In [None]:
#Checking the number of text chunks the pdf has been split into
len(document_chunks1)

In [None]:
# Generating embedding for the first document chunk
embedding_1 = embedding_model.embed_query(document_chunks1[0].page_content)
# Generating embedding for the second document chunk
embedding_2 = embedding_model.embed_query(document_chunks1[1].page_content)

In [None]:
out_dir1 = 'merck_db_no_overlap' #separate directory, so the no-overlap chunks are not added to the collection already stored in out_dir

vectorstore1 = Chroma.from_documents( #creating a Chroma vector store from the no-overlap document chunks
    document_chunks1, #list of document chunks to embed and index
    embedding_model, #model that turns each chunk into an embedding vector
    persist_directory=out_dir1 #directory on disk where this Chroma database is saved
)

In [None]:
#Loading the persisted Chroma vector store with the same embedding model
vectorstore1 = Chroma(persist_directory=out_dir1,embedding_function=embedding_model)

In [None]:
#Accessing the embedding function used in the Chroma vector store
vectorstore1.embeddings

In [None]:
retriever1 = vectorstore1.as_retriever( #Converting the Chroma vector store into a retriever for querying.
    search_type='similarity', #Plain nearest-neighbour search using the store's distance metric (Chroma defaults to L2 unless the collection is configured otherwise)
    search_kwargs={'k': 3} #Retrieving the top 3 most similar documents for a given query.
)

In [None]:
def generate_rag_response1(user_input,k=3,max_tokens=128,temperature=0,top_p=0.95,top_k=50):
    # Retrieve the k most similar no-overlap chunks; querying the vector store directly ensures k is honoured
    # (a retriever's k is fixed by the search_kwargs it was created with)
    relevant_document_chunks1 = vectorstore1.similarity_search(user_input,k=k)
    context_list = [d.page_content for d in relevant_document_chunks1]

    # Combine document chunks into a single context
    context_for_query = ". ".join(context_list)

    # Fill the template placeholders with the retrieved context and the question
    user_message = qna_user_message_template.replace('{context}', context_for_query)
    user_message = user_message.replace('{question}', user_input)

    prompt = qna_system_message + '\n' + user_message

    # Generate the response
    try:
        response = llm(
                  prompt=prompt,
                  max_tokens=max_tokens,
                  temperature=temperature,
                  top_p=top_p,
                  top_k=top_k
                  )

        # Extract the generated text from the model's response
        response = response['choices'][0]['text'].strip()
    except Exception as e:
        response = f'Sorry, I encountered the following error: \n {e}'

    return response

In [None]:
user_input1 = 'What is the protocol for managing sepsis in a critical care unit?'
print(generate_rag_response1(user_input1))

#### Observations on Query 1

* Removing chunk overlap **did not change the answer** in this case, probably because the relevant information for the question was already contained within individual chunks rather than split across a chunk boundary.

* Removing chunk overlap is more likely to affect the answer when a relevant passage straddles the boundary between two adjacent chunks, since overlap repeats the text around each boundary in both chunks.

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

#### Basic Parameter

In [None]:
user_input2 = 'What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?'
print(generate_rag_response(user_input2))

#### Fine Tuning - Retriever Parameters

k = 2


In [None]:
retriever2 = vectorstore.as_retriever( #Converting the Chroma vector store into a retriever for querying.
    search_type='similarity', #Plain nearest-neighbour search using the store's distance metric (Chroma defaults to L2 unless the collection is configured otherwise)
    search_kwargs={'k': 2} #Retrieving the top 2 most similar documents for a given query.
)

In [None]:
def generate_rag_response2(user_input,k=2,max_tokens=128,temperature=0,top_p=0.95,top_k=50):
    # Retrieve the k most similar chunks (k defaults to 2 for this experiment);
    # querying the vector store directly ensures k is honoured
    relevant_document_chunks = vectorstore.similarity_search(user_input,k=k)
    context_list = [d.page_content for d in relevant_document_chunks]

    # Combine document chunks into a single context
    context_for_query = ". ".join(context_list)

    # Fill the template placeholders with the retrieved context and the question
    user_message = qna_user_message_template.replace('{context}', context_for_query)
    user_message = user_message.replace('{question}', user_input)

    prompt = qna_system_message + '\n' + user_message

    # Generate the response
    try:
        response = llm(
                  prompt=prompt,
                  max_tokens=max_tokens,
                  temperature=temperature,
                  top_p=top_p,
                  top_k=top_k
                  )

        # Extract the generated text from the model's response
        response = response['choices'][0]['text'].strip()
    except Exception as e:
        response = f'Sorry, I encountered the following error: \n {e}'

    return response

In [None]:
user_input2 = 'What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?'
print(generate_rag_response2(user_input2))

#### Observations on Query 2

* **Answer 1**, with **k=3**, is slightly more detailed and includes examples of cases (like when surgery is impossible).

* **Answer 2**, with **k=2**, is slightly shorter and cuts off earlier, suggesting that retrieving more chunks (k=3) provided additional helpful context for a more complete answer.

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

#### Basic Parameters

In [None]:
user_input3 = 'What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?'
print(generate_rag_response(user_input3))

#### Fine Tuning - LLM Parameters

Temperature = 0.7

In [None]:
user_input3 = 'What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?'
print(generate_rag_response(user_input3, temperature=0.7))

#### Observations on Query 3

* **Answer 1**, with **temperature = 0**, is more precise and structured.

* **Answer 2**, with **temperature = 0.7**, introduces a bit more variability, with slight changes in wording, reflecting the increased randomness from the higher temperature.

### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

#### Basic Parameters

In [None]:
user_input4 = 'What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?'
print(generate_rag_response(user_input4))

#### Fine Tuning - LLM Parameters

Top_p = 0.8

In [None]:
user_input4 = 'What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?'
print(generate_rag_response(user_input4, top_p=0.8))

#### Observations on Query 4

* **Answer 1** with **Top_p = 0.95** means the model samples the next token from the smallest set of tokens whose cumulative probability reaches 95%.

* **Answer 2** with **Top_p = 0.8** means the model restricts token selection to the top 80% of the probability mass. This pool is narrower than with 0.95, because more low-probability tokens are excluded, so it favors more likely words and phrases rather than more diverse ones.

* In the outputs, **Answer 1** focuses on the main facts (such as supportive care and rehabilitation), while **Answer 2** adds information about severe cases, surgery, and treatment phases. Since Top_p = 0.8 narrows the candidate pool, this extra detail is better read as a different path through high-probability tokens than as wider exploration.
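
To make the effect of top_p concrete, here is a minimal sketch of nucleus filtering on a toy distribution. The function name and the probabilities are assumptions made for illustration, and this is not how llama-cpp-python implements it internally. It shows that a smaller top_p keeps fewer candidate tokens, not more.

```python
def nucleus_filter(probs, top_p):
    """Return indices of the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

# Hypothetical next-token probabilities (assumed values that sum to 1)
toy_probs = [0.50, 0.25, 0.10, 0.08, 0.04, 0.03]

kept_095 = nucleus_filter(toy_probs, 0.95)
kept_080 = nucleus_filter(toy_probs, 0.80)

print("top_p=0.95 keeps", len(kept_095), "tokens:", kept_095)
print("top_p=0.80 keeps", len(kept_080), "tokens:", kept_080)
```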

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

#### Basic Parameters

In [None]:
user_input5 = 'What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?'
print(generate_rag_response(user_input5))

#### Fine Tuning - LLM Parameters

Top_k = 25

In [None]:
user_input5 = 'What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?'
print(generate_rag_response(user_input5, top_k=25))

#### Observations on Query 5

* In this case, the **generated answers are the same**, probably because the topic is highly structured: the model picks almost the same tokens whether it chooses from the top 25 candidates (top_k=25) or from the broader default pool (top_k=50).
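
One possible reason for identical outputs can be sketched as follows, under the assumption that decoding always picks the most likely token (as is typical at temperature 0). The most likely token is inside the top-k pool for any k of at least 1, so changing k does not change the pick. The helper names and the toy distribution are hypothetical.

```python
def top_k_filter(probs, k):
    """Return the indices of the k most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]

def greedy_pick(probs, k):
    """Pick the most probable token among the top-k candidates."""
    candidates = top_k_filter(probs, k)
    return max(candidates, key=lambda i: probs[i])

# Hypothetical distribution over 100 tokens with one dominant token (assumed values)
toy_probs = [0.9] + [0.1 / 99] * 99

pick_25 = greedy_pick(toy_probs, 25)
pick_50 = greedy_pick(toy_probs, 50)
print("top_k=25 pick:", pick_25, "| top_k=50 pick:", pick_50)
```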

## Output Evaluation

#### Rating system

In [None]:
groundedness_rater_system_message = """

You will be presented with a ###Question, the ###Context used by the AI system, and the AI-generated ###Answer.

Your task is to judge the extent to which the ###Answer is derived from ###Context.

Rate it 1 - if the ###Answer is not derived from the ###Context at all
Rate it 2 - if the ###Answer is derived from the ###Context only to a limited extent
Rate it 3 - if the ###Answer is derived from the ###Context to a good extent
Rate it 4 - if the ###Answer is derived from the ###Context mostly
Rate it 5 - if the ###Answer is derived from the ###Context completely

Please note: Make sure you give a single overall rating in the range of 1 to 5 along with an overall explanation.

"""

In [None]:
relevance_rater_system_message = """

You will be presented with a ###Question, the ###Context used by the AI system to generate a response, and the AI-generated ###Answer.

Your task is to judge the extent to which the ###Answer is relevant to the ###Question, considering whether it directly addresses the key aspects of the ###Question based on the provided ###Context.

Rate the relevance as follows:
- Rate 1 – The ###Answer is not relevant to the ###Question at all.
- Rate 2 – The ###Answer is only slightly relevant to the ###Question, missing key aspects.
- Rate 3 – The ###Answer is moderately relevant, addressing some parts of the ###Question but leaving out important details.
- Rate 4 – The ###Answer is mostly relevant, covering key aspects but with minor gaps.
- Rate 5 – The ###Answer is fully relevant, directly answering all important aspects of the ###Question with appropriate details from the ###Context.

Note: Provide a single overall rating in the range of 1 to 5, along with a brief explanation of why you assigned that score.
"""

In [None]:
user_message_template = """
###Question
{question}

###Context
{context}

###Answer
{answer}
"""

#### Function for Output Evaluation

In [None]:
def generate_ground_relevance_response(user_input,k=3,max_tokens=128,temperature=0,top_p=0.95,top_k=50):
    global qna_system_message,qna_user_message_template
    # Retrieve relevant document chunks (pass the k argument through instead of a hardcoded 3)
    relevant_document_chunks = retriever.get_relevant_documents(query=user_input,k=k)
    context_list = [d.page_content for d in relevant_document_chunks]
    context_for_query = ". ".join(context_list)

    # Build the Q&A prompt from the system message, the retrieved context, and the question
    prompt = f"""[INST]{qna_system_message}\n
                {'user'}: {qna_user_message_template.format(context=context_for_query, question=user_input)}
                [/INST]"""

    response = llm(
            prompt=prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            echo=False
            )

    answer =  response["choices"][0]["text"]

    # Build the groundedness-rating prompt from the rater system message, context, question, and answer
    groundedness_prompt = f"""[INST]{groundedness_rater_system_message}\n
                {'user'}: {user_message_template.format(context=context_for_query, question=user_input, answer=answer)}
                [/INST]"""

    # Build the relevance-rating prompt from the rater system message, context, question, and answer
    relevance_prompt = f"""[INST]{relevance_rater_system_message}\n
                {'user'}: {user_message_template.format(context=context_for_query, question=user_input, answer=answer)}
                [/INST]"""

    response_1 = llm(
            prompt=groundedness_prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            echo=False
            )

    response_2 = llm(
            prompt=relevance_prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            echo=False
            )

    return response_1['choices'][0]['text'],response_2['choices'][0]['text']

#### Evaluation of Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
user_input1 = 'What is the protocol for managing sepsis in a critical care unit?'
ground,rel = generate_ground_relevance_response(user_input1,max_tokens=350)

print(ground,end="\n\n")
print(rel)

**Observations**

* The rating of 5 for groundedness and relevance indicates that the answer is fully grounded in the context and highly relevant to the query.
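
Because the rater returns free text, it can help to pull the numeric score out of the reply so that ratings can be compared across queries. The sketch below is one way to do this with a regular expression. The helper name `extract_rating` and the sample strings are hypothetical, and the real rater output may be phrased differently, so the pattern may need adjusting.

```python
import re

# Look for a digit 1-5 shortly after the word "rate", "rated", or "rating"
_RATING_PATTERN = re.compile(r"\brat(?:e|ing)\D{0,20}?([1-5])\b", re.IGNORECASE)

def extract_rating(rater_output):
    """Return the rating (1-5) found in the rater's text, or None if there is none."""
    match = _RATING_PATTERN.search(rater_output)
    if match is None:
        # Fall back to the first standalone digit between 1 and 5
        match = re.search(r"\b([1-5])\b", rater_output)
    return int(match.group(1)) if match else None

# Hypothetical rater replies (assumed wording, not actual model output)
print(extract_rating("Rating: 5. The answer is fully derived from the context."))
print(extract_rating("I would rate it 4 because one detail is missing."))
print(extract_rating("No score given."))
```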

#### Evaluation of Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
user_input2 = 'What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?'
ground,rel = generate_ground_relevance_response(user_input2,max_tokens=350)

print(ground,end="\n\n")
print(rel)

**Observations**

* The rating of 5 for groundedness and relevance indicates that the answer is fully grounded in the context and highly relevant to the query.

#### Evaluation of Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
user_input3 = 'What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?'
ground,rel = generate_ground_relevance_response(user_input3,max_tokens=350)

print(ground,end="\n\n")
print(rel)

**Observations**

* The rating of 5 for groundedness and relevance indicates that the answer is fully grounded in the context and highly relevant to the query.

#### Evaluation of Query 4: What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
user_input4 = 'What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?'
ground,rel = generate_ground_relevance_response(user_input4,max_tokens=350)

print(ground,end="\n\n")
print(rel)

**Observations**

* The rating of 5 for groundedness and relevance indicates that the answer is fully grounded in the context and highly relevant to the query.

#### Evaluation of Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
user_input5 = 'What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?'
ground,rel = generate_ground_relevance_response(user_input5,max_tokens=350)

print(ground,end="\n\n")
print(rel)

**Observations**

* The rating of 5 for groundedness and relevance indicates that the answer is fully grounded in the context and highly relevant to the query.
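
The five evaluation cells above repeat the same steps, so they could be collapsed into a single loop. The following is a minimal sketch. `evaluate_queries` and `fake_evaluate` are hypothetical names, and the stand-in evaluator exists only so the example runs without a loaded model.

```python
def evaluate_queries(queries, evaluate_fn, **eval_kwargs):
    """Run evaluate_fn on each query and collect both rater outputs per query."""
    results = []
    for query in queries:
        groundedness, relevance = evaluate_fn(query, **eval_kwargs)
        results.append(
            {"query": query, "groundedness": groundedness, "relevance": relevance}
        )
    return results

# Hypothetical stand-in evaluator: returns fixed text instead of calling an LLM
def fake_evaluate(query, max_tokens=128):
    return ("Rating: 5 (groundedness)", "Rating: 5 (relevance)")

demo_results = evaluate_queries(["query A", "query B"], fake_evaluate, max_tokens=350)
for row in demo_results:
    print(row["query"], "|", row["groundedness"], "|", row["relevance"])
```

In this notebook, the same helper could presumably be called with `generate_ground_relevance_response` as `evaluate_fn` and the five `user_input` strings as `queries`.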

## Actionable Insights and Business Recommendations

**Business Insights**

**Accuracy and Trustworthiness**

* RAG-based answers are context-specific, reliable, and trustworthy, enhancing customer satisfaction in medical applications.

**Enhanced User Experience**

* The RAG model is able to provide personalized, relevant responses and more accurate answers.

**Cost Reduction in Medical Consultations**

* The model can automate medical Q&A, reducing the need for consultations and saving costs in telemedicine and healthcare support.

**Recommendations**

**Continuous Fine-tuning**

* The model should be updated and fine-tuned to reflect the latest medical guidelines and research.

**Ethical Considerations**

* Data privacy should be prioritized to ensure compliance with regulations like HIPAA and GDPR.

**Specialized Models**

* Dedicated models can be created and tailored to individual medical specialties, such as orthopedics, gastroenterology, and oncology, for improved accuracy and relevance.
