### In this section, I use an LLM as a question generator to produce hypothetical questions from the documents
#### That is one of the reasons I call this RAG system ***Agentic-RAG***

In [1]:
import json
import uuid
import os
from typing import List
from pathlib import Path
from pydantic import BaseModel, Field

In [2]:
# Load a JSON file and return the parsed object (here, a list of records)
def load_json_list(path: str):
    with open(path, mode = "r", encoding="utf-8") as f:
        return json.load(f)

In [3]:
workspace_base_path = os.getcwd()
dataset_path = os.path.join(workspace_base_path, "datasets", "chuncked_data.json") 
print(dataset_path)

/home/jovyan/work/datasets/chuncked_data.json


In [4]:
data = load_json_list(dataset_path)

In [5]:
data[:2]

[{'doc_id': '1bf5880b-93ec-4ac9-a0cb-eb35693ccce4',
  'questions': ['why is Phenylephrine prescribed?'],
  'docs': ['phenylephrine is used to relieve nasal discomfort caused by colds, allergies, and hay fever. it is also used to relieve sinus congestion and pressure. phenylephrine will relieve symptoms but will not treat the cause of the symptoms or speed recovery. phenylephrine is in a class of medications called nasal decongestants. it works by reducing swelling of the blood vessels in the nasal passages.about Phenylephrine'],
  'original_doc': 'phenylephrine is used to relieve nasal discomfort caused by colds, allergies, and hay fever. it is also used to relieve sinus congestion and pressure. phenylephrine will relieve symptoms but will not treat the cause of the symptoms or speed recovery. phenylephrine is in a class of medications called nasal decongestants. it works by reducing swelling of the blood vessels in the nasal passages.about Phenylephrine'},
 {'doc_id': 'f7ad6ffd-7176-4

In [6]:
from huggingface_hub import login

In [7]:
with open("keys.txt") as f:
    os.environ["HF_TOKEN"] = f.read().strip()

# login using env var
login(os.environ["HF_TOKEN"])

print("Login Huggingface so that we can access the model")

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Login Huggingface so that we can access the model


### Now I need a local LLM to generate hypothetical questions for the content.

#### I tested three models in my test notebook (that one is a mess, and I will never show it to you). The results are below:

#### Task 1: Generate hypothetical questions for a document and output JSON
* meta-llama/Llama-3.2-1B-Instruct did an OK job. Very fast.
* meta-llama/Meta-Llama-3-8B-Instruct was out of control no matter how I prompted it. Very slow!
* ContactDoctor/Bio-Medical-Llama-3-8B did an amazing job. Fast enough.

#### Task 2: Summarize the document
* meta-llama/Llama-3.2-1B-Instruct did an OK job. Very fast.
* meta-llama/Meta-Llama-3-8B-Instruct was out of control. Very slow!
* ContactDoctor/Bio-Medical-Llama-3-8B did an amazing job. Fast enough.

#### So I have no reason not to select ContactDoctor/Bio-Medical-Llama-3-8B

#### The model provider's citation and description:

@misc{ContactDoctor_Bio-Medical-Llama-3-8B, author = ContactDoctor, title = {ContactDoctor-Bio-Medical: A High-Performance Biomedical Language Model}, year = {2024}, howpublished = {https://huggingface.co/ContactDoctor/Bio-Medical-Llama-3-8B}, }

Bio-Medical-Llama-3-8B model is a specialized large language model designed for biomedical applications. It is finetuned from the meta-llama/Meta-Llama-3-8B-Instruct model using a custom dataset containing over 500,000 diverse entries. These entries include a mix of synthetic and manually curated data, ensuring high quality and broad coverage of biomedical topics.

The model is trained to understand and generate text related to various biomedical fields, making it a valuable tool for researchers, clinicians, and other professionals in the biomedical domain.


In [8]:
model_id = "ContactDoctor/Bio-Medical-Llama-3-8B"

In [23]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)
from langchain_huggingface import HuggingFacePipeline, ChatHuggingFace
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser

#### A well-chosen quantization setup can speed up inference considerably.

In [10]:
def best_dtype():
    if torch.cuda.is_available():
        if torch.cuda.is_bf16_supported():
            return torch.bfloat16
        else:
            return torch.float16
        
    return torch.float32

def best_device():
    return "cuda" if torch.cuda.is_available() else "cpu"

In [11]:
print(best_dtype())

torch.bfloat16


In [12]:
# ContactDoctor/Bio-Medical-Llama-3-8B works best in bfloat16 precision.
# To load a lighter model without giving up much quality,
# I load the weights in 8-bit and keep the dtype as bfloat16. This is a balanced choice.
def load_model(model_id: str):
    bnb_cfg = None

    if best_device() == "cuda":
        bnb_cfg = BitsAndBytesConfig(
            load_in_8bit=True,  # 8-bit, so the model does not lose much quality
            llm_int8_enable_fp32_cpu_offload=False,  # do not offload modules to the CPU in fp32
        )

        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            quantization_config = bnb_cfg,
            dtype = best_dtype(),
            device_map = "auto",                        
        )
        print(f"The bnb configuration: {bnb_cfg}")
    else: #CPU
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            dtype = best_dtype(),
            device_map={"":best_device()}, 
            low_cpu_mem_usage=True           
        )

    return model
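
As a back-of-the-envelope check on why 8-bit loading gives a lighter model, the sketch below estimates weight-only memory at different precisions. The 8e9 parameter count is an assumed round figure for an "8B" model, and `weight_memory_gib` is a hypothetical helper, not part of any library. Real usage will be higher: activations and the KV cache are ignored, and some modules (such as the embedding layer) are not quantized.

```python
# Approximate bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "float16": 2.0, "int8": 1.0}


def weight_memory_gib(n_params: float, precision: str) -> float:
    """Return the approximate size of the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[precision] / 1024**3


N_PARAMS = 8.0e9  # assumption: round figure for an "8B" model

for precision in ("float32", "bfloat16", "int8"):
    print(f"{precision:>8}: ~{weight_memory_gib(N_PARAMS, precision):.1f} GiB")
```

Under these assumptions, 8-bit weights take about half the space of bfloat16 and a quarter of float32, which is the trade-off the comments above describe.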

In [13]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer.pad_token = tokenizer.eos_token

model = load_model(model_id)

print("Load tokenizer done!")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

The bnb configuration: BitsAndBytesConfig {
  "_load_in_4bit": false,
  "_load_in_8bit": true,
  "bnb_4bit_compute_dtype": "float32",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "fp4",
  "bnb_4bit_use_double_quant": false,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": false,
  "load_in_8bit": true,
  "quant_method": "bitsandbytes"
}

Load tokenizer done!


In [14]:
print(model)                    # full architecture tree (long but useful)
print(model.config)             # core hyperparameters (dims, layers, heads…)
print(model.name_or_path)       # the checkpoint id/path


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear8bitLt(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear8bitLt(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear8bitLt(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear8bitLt(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear8bitLt(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): Lla

In [15]:
pipe = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer,
    return_full_text=False,   
    )

Device set to use cuda:0


In [16]:
hug_pipe = HuggingFacePipeline(pipeline=pipe)

In [17]:
question_generator = ChatHuggingFace(llm=hug_pipe)

In [53]:
# The LLM is ready.
# Define the output schema: a list of questions.
class QuestionList(BaseModel):
    questions: List[str] = Field(
        ...,
        description="A list of unique, concise questions in English; each ends with a question mark."
    )

In [54]:
#structured_generator = question_generator.with_structured_output(QuestionList)

parser = PydanticOutputParser(pydantic_object=QuestionList)

In [59]:
# Prompt engineering guides the LLM toward the output we expect.

# def prompt_maker(tok, doc_text: str, n: int) -> str:

#     messages = [
#         {"role": "system", "content": "You are a medical student. You like ask questions when you read clinical documents."},
#         {"role": "user", "content": (
#         f"Read the document and generate {n} clinically relevant questions.\n"    
#         "Requirements:\n" 
#         "1) The questions you generate should be from different medical perspectives.\n"
#         "2) Don't produce answers. Questions only.\n"
#         "3) Output only JSON format, for example: {'questions:': ['question1','question2',...,'question5']}.\n"
#         f"Document:\n{doc_text}\n\n"
#         "JSON:"
#         )}
#     ]

#     prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

#     return prompt

prompt = ChatPromptTemplate.from_messages([
    ("system", 
     "You are a medical student. You like ask questions when you read clinical documents."),
    ("user", 
     "1.From the document below, write exactly {n} unique QUESTIONS only. "
     "2.The questions you generate should be from different medical perspectives. "
     "3.Don't generate answer. "
     "4.Output JSON format, for example: {{\"questions\": [\"question1\",\"question2\",...,\"question5\"]}}."),    
    ("user", "Document:\n{doc}")
])

In [60]:
prompt

ChatPromptTemplate(input_variables=['doc', 'n'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], input_types={}, partial_variables={}, template='You are a medical student. You like ask questions when you read clinical documents.'), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['n'], input_types={}, partial_variables={}, template='1.From the document below, write exactly {n} unique QUESTIONS only. 2.The questions you generate should be from different medical perspectives. 3.Don\'t generate answer. 4.Output JSON format, for example: {{"questions": ["question1","question2",...,"question5"]}}.'), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['doc'], input_types={}, partial_variables={}, template='Document:\n{doc}'), additional_kwargs={})])
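
The doubled braces in the JSON example are there because, as far as I understand, the default template format follows the same brace rules as Python's `str.format`: `{n}` is a placeholder, while `{{` and `}}` render as literal braces. The sketch below uses plain `str.format` as a stand-in for the template engine (an assumption for illustration) and a shortened version of the prompt text.

```python
import json

# Shortened prompt text; {n} is a placeholder, {{ and }} are escaped literal braces.
template = (
    "Write exactly {n} unique QUESTIONS only. "
    'Output JSON format, for example: {{"questions": ["question1", "question2"]}}.'
)

# str.format stands in for the template engine here (assumption for illustration).
rendered = template.format(n=3)
print(rendered)

# Cut out the example object and confirm it is valid JSON after rendering.
example_json = rendered[rendered.index("{"):rendered.rindex("}") + 1]
print(json.loads(example_json))
```

Note that the example uses double quotes inside the braces. A single-quoted form such as `{'questions': [...]}` is a Python dict literal rather than valid JSON, so a strict JSON parser would reject output that copies it.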

In [61]:
# A simple chain that feeds the prompt to the generator and parses the reply into a structured object holding the questions
chain = prompt | question_generator | parser  # returns a QuestionList object

In [62]:
doc = "phenylephrine is used to relieve nasal discomfort... (your text)"
n = 3
# The prompt only declares {n} and {doc}, so no other input keys are needed.
res: QuestionList = chain.invoke({
    "n": n,
    "doc": doc,
})
print(res.questions)  # already parsed & validated



['Is phenylephrine generally safe for use in pediatric population?', 'How does phenylephrine affect blood vessels?', 'Is phenylephrine contraindicated in patients with narrow-angle glaucoma?']
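
The parser step only succeeds when the model's reply can be read as the expected JSON. Smaller or less obedient models often wrap the JSON in extra prose, so a lenient fallback can be handy. The sketch below is one possible approach using only the standard library; `extract_questions` is a hypothetical helper and the sample reply is made up for illustration.

```python
import json
import re
from typing import List


def extract_questions(raw: str) -> List[str]:
    """Hypothetical helper: pull a {"questions": [...]} object out of raw model text."""
    # Grab the widest {...} span; the model may add prose before or after the JSON.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return []
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    questions = payload.get("questions", []) if isinstance(payload, dict) else []
    if not isinstance(questions, list):
        return []
    # Keep non-empty strings only and drop duplicates while preserving order.
    seen, cleaned = set(), []
    for q in questions:
        if isinstance(q, str) and q.strip() and q.strip() not in seen:
            seen.add(q.strip())
            cleaned.append(q.strip())
    return cleaned


# Made-up reply that wraps the JSON in extra text.
raw_output = 'Sure! {"questions": ["How does phenylephrine affect blood vessels?"]} Hope it helps.'
print(extract_questions(raw_output))
```

Returning an empty list instead of raising keeps a long batch run going; whether to retry, log, or skip such documents is a design choice left open here.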


In [None]:
# Decide how many questions to generate based on the length of the document, counted in words.
# From my experience: 500 to 2000 words -> 5 questions, 100 to 500 words -> 3 questions, fewer than 100 words -> 1 question.
# These thresholds can be tuned.
def decide_n_questions(txt: str) -> int:
    wc = len(txt.split())
    if wc > 500:
        return 5
    elif wc > 100:
        return 3
    else:
        return 1
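
To make the threshold behavior concrete, here is a small check at the boundaries. It repeats the function so the snippet runs on its own; `fake_doc` is a hypothetical helper that builds a synthetic document with a given word count.

```python
def decide_n_questions(txt: str) -> int:
    # Same logic as above: count whitespace-separated words.
    wc = len(txt.split())
    if wc > 500:
        return 5
    elif wc > 100:
        return 3
    return 1


def fake_doc(n_words: int) -> str:
    """Hypothetical helper: build a synthetic document with exactly n_words words."""
    return " ".join(["word"] * n_words)


for n_words in (100, 101, 500, 501):
    print(n_words, "->", decide_n_questions(fake_doc(n_words)))
# 100 -> 1
# 101 -> 3
# 500 -> 3
# 501 -> 5
```

Both comparisons are strict, so a document of exactly 100 words gets one question and one of exactly 500 words gets three.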

In [None]:
# Now let's iterate over every medicine's documents

for content in data:
    for doc in content["docs"]:
        n = decide_n_questions(doc)
        questions_list = chain.invoke({"n": n, "doc": doc})