Installing and Importing Necessary Libraries and Dependencies


In [None]:
# !pip install --upgrade --force-reinstall \
#   "numpy>=1.26.0,<2.2.0" \
#   "pandas>=2.0.0,<2.2.0" \
#   "scikit-learn<1.7" \
#   "tensorflow==2.19.0"

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q

In [None]:
# Libraries for processing dataframes and text
import json
import os
import tiktoken
import pandas as pd

# Libraries for loading data, chunking, embedding, and vector databases
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma

# Libraries for downloading and loading the LLM
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

In [None]:
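# Example of how text is split into chunks before embedding
from langchain.text_splitter import RecursiveCharacterTextSplitter

sample_text = "Medical content snippet here..."
# chunk_size and chunk_overlap are measured in characters by default
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)
# The sample is shorter than chunk_size, so this returns a single chunk
chunks = splitter.split_text(sample_text)
print(chunks)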
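To see what `chunk_size` and `chunk_overlap` control, the sketch below uses a plain fixed-width character window. This is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators recursively); `split_fixed` and `demo_text` are hypothetical names introduced only for illustration.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Hypothetical helper: fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


demo_text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = split_fixed(demo_text, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
```

Each chunk repeats the last three characters of the previous one, which is the role overlap plays: content near a chunk boundary appears in both neighbours.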

In [None]:
# Model details
model_name_or_path = ""  # Hugging Face repo ID (set before running)
model_basename = ""      # File name in the repo (set before running)

# Download from Hugging Face Hub
model_path = hf_hub_download(
    repo_id=model_name_or_path,
    filename=model_basename
)

In [None]:
llm = Llama(
    model_path=model_path,
    n_ctx=2300,       # context window size in tokens (prompt plus generated output)
    n_gpu_layers=38,  # number of layers to offload to the GPU
    n_batch=512       # number of prompt tokens processed per batch
)


AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 


In [None]:
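def response(query, max_tokens=128, temperature=0, top_p=0.95, top_k=50):
    """Send a prompt to the loaded LLM and return the generated text."""
    model_output = llm(
        prompt=query,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
    )

    # The completion text is stored in the first choice
    return model_output['choices'][0]['text']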

LLM Question Answering – Baseline (No System Prompt)

**Purpose:**  
Establish a baseline for Large Language Model (LLM) responses without predefined role, tone, or style instructions.  
This allows comparison against later runs with structured **system prompts** to evaluate improvements in accuracy, tone, and consistency.

**Approach:**  
- Directly pass user queries to the `response()` function without adding a system-level instruction.  
- Let the LLM interpret the intent and produce answers based solely on the raw question.  

In [None]:
query1 = "What is the protocol for managing sepsis in a critical care unit?"
response1 = response(query1)
print(response1)

In [None]:
# Define the system prompt for the LLM
system_prompt = (
    "You are a knowledgeable and precise medical assistant. "
    "Answer each question accurately based only on reliable medical sources. "
    "Provide step-by-step reasoning where needed, and list treatments or protocols clearly. "
    "If unsure, state that more expert consultation is required."
)

**Purpose:**  
Demonstrate how a carefully crafted system prompt can guide a Large Language Model (LLM) to deliver concise, medically accurate, and structured answers for diverse healthcare-related queries.

In [None]:
# Prepend the system prompt to the same question used for the baseline
user_input = system_prompt + "\n" + "What is the protocol for managing sepsis in a critical care unit?"
print(response(user_input))

Context Truncation Prevents Model Overload
Dynamically truncating the retrieved context so that the full prompt fits within the LLM’s context window (`n_ctx=2300` tokens) avoids runtime errors and balances context depth against model performance.
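One way to do this is sketched below, under stated assumptions: whole chunks are kept in retrieval order until the token budget is used up, and later chunks are dropped. `fit_context`, `approx_token_count`, and the `margin` parameter are hypothetical names, and the whitespace word count is only a rough stand-in for the model's real tokenizer.

```python
from typing import Callable, List

N_CTX = 2300  # matches the n_ctx value passed to Llama(...) in this notebook


def approx_token_count(text: str) -> int:
    """Assumption: whitespace word count as a rough proxy for token count."""
    return len(text.split())


def fit_context(
    chunks: List[str],
    question: str,
    max_new_tokens: int = 128,
    n_ctx: int = N_CTX,
    margin: int = 50,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> List[str]:
    """Keep leading chunks while prompt plus answer still fit in the window."""
    # Reserve room for the generated answer, the question, and a safety margin
    budget = n_ctx - max_new_tokens - count_tokens(question) - margin
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # this chunk and everything after it is dropped
        kept.append(chunk)
        used += cost
    return kept


ctx_chunks = ["alpha " * 1000, "beta " * 1000, "gamma " * 1000]
ctx_kept = fit_context(ctx_chunks, "What is the sepsis protocol?")
print(len(ctx_kept), "of", len(ctx_chunks), "chunks kept")
```

With the loaded model, a more accurate counter could be passed as `count_tokens`, for example something like `lambda t: len(llm.tokenize(t.encode("utf-8")))`, assuming the installed llama-cpp-python version exposes `tokenize` with that signature.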

Prompt Engineering Enhances Answer Quality
System and user prompts designed to incorporate the retrieved context and clarify the assistant’s role improve answer relevance and groundedness and reduce hallucinations.
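A minimal sketch of such a prompt layout follows. The template (the `Context:`, `Question:`, and `Answer:` labels) and the `build_rag_prompt` helper are assumptions made for illustration, not the exact prompt used in this notebook; the chunk strings are placeholders standing in for retriever output.

```python
def build_rag_prompt(system_text: str, context_chunks: list, question: str) -> str:
    """Hypothetical helper: combine role instructions, context, and question."""
    context = "\n\n".join(context_chunks)  # blank line between chunks
    return (
        f"{system_text}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


demo_system = "You are a knowledgeable and precise medical assistant."
demo_context = ["<retrieved chunk 1>", "<retrieved chunk 2>"]  # placeholders
demo_question = "What is the protocol for managing sepsis in a critical care unit?"

demo_prompt = build_rag_prompt(demo_system, demo_context, demo_question)
print(demo_prompt)
```

The resulting string could then be passed to `response()` in place of the bare question, so the model sees its role, the supporting context, and the question in a fixed order.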