## Introduction
Git repo: https://github.com/JEF1056/CS598

This project is based on the paper *Dialogue-Contextualized Re-ranking for Medical History-Taking* by Jian Zhu, Ilya Valmianski, and Anitha Kannan. The paper addresses AI-driven medical history-taking, a crucial aspect of patient care and diagnosis. It proposes a two-stage approach: an expert system first retrieves a list of candidate questions, and a machine-learned re-ranker then prioritizes the most relevant question to ask. The paper runs its experiments on real doctor-patient dialogue data collected from a medical service platform, Curai; our approach to data is similar. The authors hypothesize that re-ranking the candidates with a dialogue-contextualized model can effectively mitigate the training-inference gap encountered in AI-driven healthcare systems.

We take the paper's main approach, and the reasoning behind its design decisions, as our starting point. However, instead of the models featured in the original study (e.g., LongT5), we use more recent models trained on more general data. The code shown below is separate from that of the paper: it is designed and trained independently on LLaMA-based models (using TinyLlama). The paper serves as the foundation for code that takes in a medical conversation, generates candidate questions, and re-ranks them by relevance.

## Scope of reproducibility 

The dataset used in the paper is not publicly available, whether by purchase or by any other means of access. We therefore substitute the MediQA dataset, which also consists of medical conversational data and includes a substantial amount of labeled information. This lets us keep the study relevant despite the unavailability of the original data. By using a smaller-scale model, we can also fit more data per batch and use longer context lengths, which may improve the performance and robustness of our approach.


# Install dependencies
Before running, please install conda and run the commands below.

This is a one-time setup; the `CS598` conda environment can be activated at any time to make these dependencies available.
> I highly recommend using conda and the following commands to easily enable CUDA support.

```bash
conda deactivate
conda env remove -n CS598
conda env create
conda activate CS598
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install -r requirements.txt
```
Select the new `CS598` conda kernel to begin

## Environment details
**Python version:** Python 3.11

**Packages used:** See `requirements.txt`



In [1]:
# Import dependencies
import pandas as pd
import os
import shutil
import re
import json
import random
from tqdm.notebook import tqdm
from llama_cpp import Llama, LlamaGrammar

# Training dependencies
import torch
from unsloth import FastLlamaModel
from trl import SFTTrainer
from transformers import TrainingArguments, EarlyStoppingCallback
from transformers.utils import logging
from peft import PeftModel, LoftQConfig
import multiprocessing
from torch import backends
from datasets import Dataset
from rouge_score import rouge_scorer

os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "true"
backends.cuda.matmul.allow_tf32 = True
logging.set_verbosity_info()

2024-04-13 13:31:48.541109: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-04-13 13:31:48.573851: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


# Data Preprocessing
We focus on the main dataset from [MediQA 2023](https://github.com/abachaa/MTS-Dialog/tree/main/Main-Dataset), from which we use an LLM, [OpenHermes 2.5](https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF), to extract candidate questions and create rankings for those questions.

## High level processing steps
Here is an example excerpt from the raw test set:
```text
Doctor: Good morning, ma'am. 
Patient: Good morning, doctor.
Doctor: Before we begin today, I just need to confirm a few pieces of information I got from the nurse. 
Patient: Absolutely, no problem. 
Doctor: Great, so you're thirty six years old, correct? 
Patient: Yeah, that's right. 
Doctor: And you identify as Caucasian? 
Patient: Yes, doctor. 
Doctor: Thank you, young lady. So, what seems to be the problem today. 
Patient: Well, I've had pain in this right knee for a long time. 
Doctor: Have you been treated for this before? 
Patient: Yes, and I've been diagnosed with, um, chondromalacia. 
Doctor: How have you been treated so far? 
Patient: I've taken antiinflammatories, rested, changed my activities, all of that. 
Doctor: Has there been any improvement? 
Patient: No, none at all. 
Doctor: Have you discussed surgery with anyone before. 
Patient: No, nobody's said anything yet. 
Doctor: Well, I think you'd be a good candidate for an arthroscopy lateral release and tubercle transfer. 
Patient: What will the surgery do? Doctor: This will help take some stress off of the knee joint. It should help you feel a lot better. 
Patient: What are the risks of infection from the surgery, doctor? 
Doctor: Well, you'll be relieved to know that it's less than one percent. We use prophylactic antibiotics the entire time. 
Patient: Will I be asleep for this?
Doctor: Yes, you won't feel a thing. 
Patient: Okay, yes. I agree, we should do the surgery.
```

1. To process this data, we select `N` indexes from the conversation where the doctor is the next speaker, where `N = max((X - 2) // 5, 1)` and `X` is the number of turns in the conversation. For the example above, `N = 4`.
    - `(X - 2) // 5` was chosen because the first ~2-4 exchanges tend to be greetings and pleasantries, and because we want to limit the number of new samples generated.
2. For each of those indexes, we prompt our LLM expert to generate questions the doctor may want to ask, given the context of the existing conversation.
    - We leverage a GBNF grammar to force the LLM expert to produce a JSON-formatted array of strings.
    - If the doctor's actual next turn is a question, we include it in the prompt as an example of what the doctor may say.
3. For each index, we ask the expert model to rank the generated questions by relevance. The result is an array of integers, numbered from 1 as the rerank prompt requests.
    - For example, if the questions for index 6 are `["And you identify as Caucasian?", "You are Asian, correct?", "Have you ever visited other specialists for your inflamation issue?"]`, the ranking returned may be `[3, 2, 1]`.
4. Dump into examples. We have two distinct tasks, which are combined into a single example. This improves computing efficiency and allows the task to be broken down into two steps. See below for an example:
```text
Context:
Doctor: Good morning, ma'am. 
Patient: Good morning, doctor.
Doctor: Before we begin today, I just need to confirm a few pieces of information I got from the nurse. 
Patient: Absolutely, no problem. 
Doctor: Great, so you're thirty six years old, correct? 
Patient: Yeah, that's right. 

Questions:
["And you identify as Caucasian?", "You are Asian, correct?", "Have you ever visited other specialists for your inflamation issue?"]

Ranked:
[3, 2, 1]
```
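Step 1 can be sketched in isolation as follows. This is a minimal illustration, not the notebook's exact code: `select_doctor_indexes` is a hypothetical helper name, and it caps the sample size at the number of available doctor turns, whereas the processing cell catches the sampling error instead.

```python
import random

def select_doctor_indexes(turns, rng):
    """Pick doctor turns (never turn 0) to use as prediction targets."""
    # N = max((X - 2) // 5, 1), where X is the number of turns
    num_examples = max((len(turns) - 2) // 5, 1)
    # Turn 0 is skipped because it would leave an empty context
    candidates = [
        i for i, turn in enumerate(turns)
        if i > 0 and turn.lower().startswith("doctor: ")
    ]
    # Assumption for this sketch: never request more samples than candidates
    return sorted(rng.sample(candidates, min(num_examples, len(candidates))))

# A synthetic 25-turn conversation with alternating speakers
turns = [f"{'Doctor' if i % 2 == 0 else 'Patient'}: turn {i}" for i in range(25)]
picked = select_doctor_indexes(turns, random.Random(0))
print(len(picked))  # 4, since (25 - 2) // 5 == 4
```

Passing an explicit `random.Random` instance keeps the selection reproducible without touching the global random state.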

In [2]:
raw_dir = 'data/raw'
processed_dir = 'data/processed'

# Preview dataset
pd.read_csv(os.path.join(raw_dir, 'MTS-Dialog-TrainingSet.csv'))

Unnamed: 0,ID,section_header,section_text,dialogue
0,0,GENHX,The patient is a 76-year-old white female who ...,Doctor: What brings you back into the clinic t...
1,1,GENHX,The patient is a 25-year-old right-handed Cauc...,Doctor: How're you feeling today? \r\nPatient...
2,2,GENHX,"This is a 22-year-old female, who presented to...","Doctor: Hello, miss. What is the reason for yo..."
3,3,MEDICATIONS,Prescribed medications were Salmeterol inhaler...,Doctor: Are you taking any over the counter me...
4,4,CC,"Burn, right arm.","Doctor: Hi, how are you? \r\nPatient: I burned..."
...,...,...,...,...
1196,1196,PASTSURGICAL,Vasectomy.,"Doctor: Good morning, sir. \r\nPatient: Good m..."
1197,1197,MEDICATIONS,"Tylenol #3 q6h prn, ibuprofen 800 mg q8h prn, ...","Doctor: Okay, so let's go over your medication..."
1198,1198,GENHX,This patient presents to the office today for ...,"Doctor: How are you doing today, sir? \r\nPati..."
1199,1199,FAM/SOCHX,"No tobacco, alcohol or illicit drug use. Patie...","Doctor: Hi, how's it going? \r\nPatient: Not t..."


In [3]:
# Setup Expert model
!wget -nc -O "Smaug-34B-v0.1_Q2_K.gguf" "https://huggingface.co/nold/Smaug-34B-v0.1-GGUF/resolve/main/Smaug-34B-v0.1_Q2_K.gguf?download=true"

expert_model = None
expert_model = Llama(
    "Smaug-34B-v0.1_Q2_K.gguf",
    numa=True,
    n_gpu_layers=-1,
    n_ctx=2048,
    verbose=False
)

string_list_grammar = LlamaGrammar.from_file(
    "required_string_list.grammar",
    verbose=True
)

number_list_grammar = LlamaGrammar.from_file(
    "required_number_list.grammar",
    verbose=True
)

expert_model("Objectively, why is the sky blue?", max_tokens=100)

File ‘Smaug-34B-v0.1_Q2_K.gguf’ already there; not retrieving.


ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
from_string grammar:
root ::= stringlist 
stringlist ::= [[] ws string [,] ws string [,] ws string ws []] 
ws ::= ws_3 
ws_3 ::= [ <U+0009><U+000A>] ws_3 | 
string ::= ["] string_5 ["] 
string_5 ::= string_6 
string_6 ::= [A-Za-z0-9 .?!,':;] string_6 | 

from_string grammar:
root ::= numberlist 
numberlist ::= [[] ws number [,] ws number [,] ws number ws []] 
ws ::= ws_3 
ws_3 ::= [ <U+0009><U+000A>] ws_3 | 
number ::= number_5 
number_5 ::= number_6 
number_6 ::= [1-3] number_6 | 



{'id': 'cmpl-31f1a20b-b89f-4ee7-a39c-74de762cecf2',
 'object': 'text_completion',
 'created': 1711929398,
 'model': 'Smaug-34B-v0.1_Q2_K.gguf',
 'choices': [{'text': '\nThe color of the sky comes from scattered sunlight. This scattering effect occurs because our atmosphere has particles in it such as dust and water droplets. These small particles cause sunlight to scatter before reaching our eyes. The blue wavelengths of light are scattered more than other colors due to their shorter wavelengths, resulting in a perceived blue hue in the sky.',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 10, 'completion_tokens': 69, 'total_tokens': 79}}
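Reading the grammar dump above, the string-list grammar appears to accept exactly three quoted strings drawn from a restricted character set. The sketch below restates that shape as a Python regular expression so the constraint can be checked without loading the model. The names `STRING_LIST_RE` and `parse_string_list` are hypothetical, and the regex is a reading of the printed rules, not the contents of `required_string_list.grammar` itself.

```python
import json
import re

# Character class copied from the `string_6` rule in the grammar dump
_STRING = '"[A-Za-z0-9 .?!,\':;]*"'
# The `ws` rule: any run of spaces, tabs, or newlines
_WS = r"[ \t\n]*"

# "[" ws string "," ws string "," ws string ws "]"
STRING_LIST_RE = re.compile(
    r"\[" + _WS + _STRING + "," + _WS + _STRING + "," + _WS + _STRING + _WS + r"\]"
)

def parse_string_list(text):
    """Return the three strings if `text` has the grammar's shape, else None."""
    if STRING_LIST_RE.fullmatch(text) is None:
        return None
    # The allowed characters exclude quotes and backslashes, so this is valid JSON
    return json.loads(text)

print(parse_string_list('["Any allergies?", "Do you smoke?", "How long has it hurt?"]'))
print(parse_string_list('["Only one question?"]'))  # None: the grammar requires three
```

One consequence of this shape is that the expert always returns three candidates, regardless of how many the prompt asks for.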

In [4]:
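# Define some helper functions
def postprocess_text(text):
    # Collapse each run of 2+ whitespace characters into its last character,
    # e.g. " \r\n" becomes "\n"
    text = re.sub(r'(\s){2,}', r"\1", text)
    text = text.strip()
    return text

def postprocess_conversation(conversation):
    # Join the list of turns into one newline-separated string, then normalize whitespace
    conversation = "\n".join(conversation)
    conversation = postprocess_text(conversation)
    return conversation

def postprocess_question(question):
    # Drop the leading "Doctor: " speaker tag so only the utterance remains
    prefix = "doctor: "
    if question.lower().startswith(prefix):
        question = question[len(prefix):]
    question = postprocess_text(question)
    return question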

def create_question_prompt(context, question):
    if question.endswith("?"):
        return f"""
Using the below context, give some examples of what the doctor might want to say or ask the patient after the existing conversation.
Do not repeat any questions already asked.
Produce as many questions as you can think of.
For example, the doctor may say: "{question}"

Context:
{context}

Questions:
""".strip()
    else:
        return f"""
Using the below context, give some examples of what the doctor might want to say or ask the patient after the existing conversation.
Do not repeat any questions already asked.
Produce as many questions as you can think of.

Context:
{context}

Questions:
""".strip()

def create_rerank_prompt(context, questions):
    return f"""
Using the below context, rerank the questions in order of relevance to the conversation. The most relevant question should come first.
Output a list of indexes, with the first and most relevant question being 1, the second being 2, and so on.

Context:
{context}

Questions:
{questions}

Order:
""".strip()

def create_example(context, questions, rerank_order):
    return f"""
Context:
{context}

Questions:
{json.dumps(questions)}

Ranked:
{json.dumps(rerank_order)}
""".strip()

In [6]:
# Remove existing processed data
if os.path.exists(processed_dir) and len(os.listdir(processed_dir)):
    # ipynb prompt user if they want to reprocess data
    do_process_data = input("Processed data already exists, do you want to reprocess data? (y/n): ").lower() == 'y'
    if do_process_data:
        print("Reprocessing data...")
else:
    do_process_data = True
 
if do_process_data:
    shutil.rmtree(processed_dir, ignore_errors=True)
    os.makedirs(processed_dir, exist_ok=False)

    for data_source in os.listdir(raw_dir):
        if not data_source.endswith('.csv'):
            continue

        # Load dataset
        df = pd.read_csv(os.path.join(raw_dir, data_source))

        # Process dataset
        processed_data = []
        for _, row in tqdm(df.iterrows(), total=len(df), desc=f'Processing {data_source}'):
            examples = []
            
            # Get dialogue text and convert to turns
            turns = row['dialogue'].split("\n")
            num_examples = max((len(turns) - 2) // 5, 1)
            
            # Get a list of indexes in turns where the sentence starts with "doctor:"
            doctor_indexes = [i for i, turn in enumerate(turns) if turn.lower().startswith("doctor: ") and i > 0]
            
            # Get num_examples of doctor indexes
            try:
                doctor_indexes = random.sample(doctor_indexes, num_examples)
            except ValueError:
                # Fewer doctor turns than requested samples: fall through and use all of them
                print("Not enough doctor indexes:", doctor_indexes, "\nTurns:", turns)
                pass
            
            for i in doctor_indexes:
                # Extract and cleanup conversation
                context = postprocess_conversation(turns[0:i])
                example = postprocess_question(turns[i])
                
                prompt = create_question_prompt(context, example)                
                try:
                    expert_generated_questions = expert_model(prompt, max_tokens=100, grammar=string_list_grammar)
                    expert_generated_questions = json.loads(expert_generated_questions["choices"][0]["text"])
                except Exception:
                    print("Invalid questions result")
                    continue
                
                prompt = create_rerank_prompt(context, expert_generated_questions)
                try:
                    expert_generated_rerank = expert_model(prompt, max_tokens=50, grammar=number_list_grammar)                
                    expert_generated_rerank = json.loads(expert_generated_rerank["choices"][0]["text"])
                except Exception:
                    print("Invalid rerank result")
                    continue
                
                example = create_example(context, expert_generated_questions, expert_generated_rerank)
                processed_data.append({
                    'text': example,
                })

                # Save processed data
                pd.DataFrame(processed_data).to_csv(os.path.join(processed_dir, data_source), index=False)

Reprocessing data...


Processing MTS-Dialog-TrainingSet.csv:   0%|          | 0/1201 [00:00<?, ?it/s]

Invalid questions result
Not enough doctor indexes: [] 
Turns: ['Doctor: Any know drug allergies? \r', 'Patient: No.']
Invalid questions result
Invalid questions result
Invalid questions result
Not enough doctor indexes: [] 
Turns: ['Doctor: Is he currently taking any medication? \r', 'Guest_family: No.']
Invalid questions result
Not enough doctor indexes: [] 
Turns: ['Doctor: Any allergies I should know about? \r', 'Patient: Nope, no allergies for me.']
Invalid questions result
Invalid questions result
Not enough doctor indexes: [] 
Turns: ['Doctor: How did the patient do on the activity test?\r', 'Guest_clinician: Patient was good. I have advised him to continue with his normal activities as long as he is feeling fine.']
Invalid questions result
Invalid questions result
Not enough doctor indexes: [] 
Turns: ['Doctor: Do you have a history of mental illness or psychological disease? \r', 'Patient: No.']
Invalid questions result
Not enough doctor indexes: [] 
Turns: ['Doctor: Tell me a

Processing MTS-Dialog-ValidationSet.csv:   0%|          | 0/100 [00:00<?, ?it/s]

Not enough doctor indexes: [] 
Turns: ['Doctor: Have you been experiencing any mental difficulties or confusion? ', 'Patient: No. Doctor: Any hallucinations? Are you seeing hearing thing that is not real? ', 'Patient: No.']
Not enough doctor indexes: [] 
Turns: ['Doctor: Who are going to stay with? ', 'Patient: I am going home with my son. I will stay with him.']
Invalid questions result
Not enough doctor indexes: [] 
Turns: ['Doctor: Let me write you a prescription for Cipro and Flagyl.', 'Patient: Okay.']
Invalid questions result
Invalid questions result
Not enough doctor indexes: [] 
Turns: ['Doctor: What is your surgical history? ', 'Patient: I had cataract surgery on both eyes. I also had knee replacement surgery on my left knee.']
Invalid questions result
Not enough doctor indexes: [] 
Turns: ['Doctor: I have sent over your referral for physical therapy, occupational therapy and speech therapy. The patient coordinator at Siskin Rehab Hospital will give you a call within two days 

Processing MTS-Dialog-TestSet-2-MEDIQA-Sum-2023.csv:   0%|          | 0/200 [00:00<?, ?it/s]

Not enough doctor indexes: [] 
Turns: ['Doctor: Do you have any known history of allergies?', "Patient: No, I don't have any allergies."]
Not enough doctor indexes: [] 
Turns: ['Doctor: I belive you caught a virus.  You have the stomach flu.']
Not enough doctor indexes: [] 
Turns: ['Doctor: Any known allergies? ', 'Patient: Um none that I can think of.']
Not enough doctor indexes: [] 
Turns: ['Doctor: Have you ever had any surgeries? ', 'Patient: I had my appendix removed when I was nine years old.']
Invalid questions result
Not enough doctor indexes: [] 
Turns: ['Doctor: What brings you in today? ', 'Patient: I was at the beach walking along the rocks. My foot slipped and I cut my foot on the barnacles.']
Invalid questions result
Invalid questions result
Not enough doctor indexes: [] 
Turns: ["Doctor: Hi there, so tell me what's going on?", "Patient: I have realized lately, I can't walk as much and as far I used to before."]
Invalid questions result
Not enough doctor indexes: [11] 
Tu

Processing MTS-Dialog-TestSet-1-MEDIQA-Chat-2023.csv:   0%|          | 0/200 [00:00<?, ?it/s]

Not enough doctor indexes: [] 
Turns: ['Doctor: Do you know about any allergies from any medications?', 'Patient: No.']
Not enough doctor indexes: [] 
Turns: ['Guest_clinician: Hello, my name is Mary. I will ask you a few questions about your medical and family history and then Doctor Smith will come and check you. Okay?', 'Patient: Okay. ', 'Guest_clinician: Do you have any other previously diagnosed medical issues?', 'Patient: I have sinus. I also had a stroke around two years ago.', 'Guest_clinician: Do you smoke or drink?', 'Patient: Nope, never did any of those.', 'Guest_clinician: Do you have any kind of allergies?', 'Patient: No, no known allergies.', 'Guest_clinician: Thank you for answering all my questions, I will let Doctor Smith know that you are ready.']
Invalid questions result
Not enough doctor indexes: [] 
Turns: ['Doctor: I spoke with Poison Control regarding the possible ingestion of the liquid. They let me know that it is actually a relatively small amount and is lik

In [8]:
# Preview processed dataset
pd.read_csv(os.path.join(processed_dir, 'MTS-Dialog-TrainingSet.csv'))

Unnamed: 0,text
0,Context:\nDoctor: What brings you back into th...
1,Context:\nDoctor: How're you feeling today?\nP...
2,Context:\nDoctor: How're you feeling today?\nP...
3,"Context:\nDoctor: Hello, miss. What is the rea..."
4,Context:\nDoctor: Are you taking any over the ...
...,...
1697,"Context:\nDoctor: How are you doing today, sir..."
1698,"Context:\nDoctor: Hi, how's it going?\nPatient..."
1699,"Context:\nDoctor: Hi, how's it going?\nPatient..."
1700,"Context:\nDoctor: Hi, how's it going?\nPatient..."
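Each row's `text` column holds one example in the Context / Questions / Ranked layout written by `create_example`. The sketch below shows one way to split such a string back into its parts and apply the ranking. `parse_example` and `order_questions` are hypothetical helpers, and the reordering assumes the ranked list holds 1-based question indexes with the most relevant first, which is how the rerank prompt is worded.

```python
import json

def parse_example(text):
    """Split a training example back into (context, questions, ranked)."""
    context_part, rest = text.split("\n\nQuestions:\n", 1)
    questions_part, ranked_part = rest.split("\n\nRanked:\n", 1)
    context = context_part.removeprefix("Context:\n")
    return context, json.loads(questions_part), json.loads(ranked_part)

def order_questions(questions, ranked):
    """Reorder questions, assuming `ranked` lists 1-based indexes, best first."""
    return [questions[i - 1] for i in ranked]

# A hand-written example in the same layout (placeholder questions)
example = "\n".join([
    "Context:",
    "Doctor: Great, so you're thirty six years old, correct?",
    "Patient: Yeah, that's right.",
    "",
    "Questions:",
    json.dumps(["Q1?", "Q2?", "Q3?"]),
    "",
    "Ranked:",
    json.dumps([3, 1, 2]),
])

context, questions, ranked = parse_example(example)
print(order_questions(questions, ranked))  # ['Q3?', 'Q1?', 'Q2?']
```

The split relies on the blank line before each section header, which holds because the context has its repeated whitespace collapsed before the example is written.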


In [11]:
# Cleanup memory
del expert_model

# Training
We use [TinyLlama](https://github.com/jzhang38/TinyLlama), a small 1.1B-parameter autoregressive model based on the Llama architecture. This size was chosen for quick training and fast evaluation.

We also use [Unsloth](https://github.com/unslothai/unsloth) to perform QLoRA operations on the model, allowing for even faster training and more efficient memory usage.

Training required 24 GB of VRAM and 2 GB of system RAM, and was completed on an RTX 3090. It took approximately 12 minutes and stopped early at about 3.3 epochs, when validation loss started to increase.
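The "3.3 epochs" figure can be checked against the numbers in the training log (1,702 examples, per-device batch size 16, gradient accumulation 2, 530 total steps, last evaluation at step 175). The sketch below is a back-of-the-envelope calculation; `optimizer_steps` is a hypothetical helper, and the rounding it uses is an assumption about how the trainer counts steps on a single GPU, chosen because it reproduces the logged total.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs):
    """Return (steps per epoch, total steps) for single-GPU training."""
    # Number of mini-batches the dataloader yields per epoch
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    # One optimizer step per `grad_accum` mini-batches
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch, steps_per_epoch * epochs

steps_per_epoch, total_steps = optimizer_steps(1702, 16, 2, 10)
epochs_at_stop = 175 / steps_per_epoch
print(steps_per_epoch, total_steps, round(epochs_at_stop, 1))  # 53 530 3.3
```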

In [3]:
# Non-hyperparam configuration
output_dir = "models/ramp1"
base_model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
is_lora = True
resume_from = None # Set if we want to resume from a checkpoint
val_size = 0.1 # This is only used if valid_path is None
train_path = os.path.join(processed_dir, 'MTS-Dialog-TrainingSet.csv')
valid_path = os.path.join(processed_dir, 'MTS-Dialog-ValidationSet.csv')
save_steps = 25
logging_steps = 5

os.environ["WANDB_PROJECT"]="CS598"

# Hyperparameters
epochs=10
lr_scheduler_type = "cosine"
warmup_ratio = 0.1
batch_size = 16
learning_rate = 2e-5
weight_decay = 0.1
gradient_accumulation_steps = 2
optimizer = "adamw_8bit"
use_gradient_checkpointing = True
random_state = 42
final_output_model_dir = os.path.join(output_dir, "final")

max_seq_length = 2048
dtype = (
    None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
)
load_in_4bit = False  # Use 4bit quantization to reduce memory usage. Can be False.
HAS_BFLOAT16 = torch.cuda.is_bf16_supported()
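To make the `cosine` scheduler and `warmup_ratio = 0.1` settings concrete, here is a minimal sketch of the learning-rate curve they describe: a linear warmup followed by a half-cosine decay to zero. `cosine_lr` is a hypothetical function, the total of 530 steps is taken from the training log, and the formula is an assumption about the scheduler's shape, not a call into the library.

```python
import math

def cosine_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup plus half-cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, from 0.0 to 1.0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

for step in (0, 25, 53, 175, 530):
    print(step, f"{cosine_lr(step, 530):.2e}")
```

Under these assumptions the peak rate of `2e-5` is reached at step 53, so the run that stopped at step 175 spent most of its time near the top of the curve.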

In [4]:
# Define some helper functions
def merge_and_save(output_path="/tmp/merged"):
    global model
    global tokenizer

    model = model.merge_and_unload()
    model.save_pretrained(output_path, safe_serialization=False)
    if tokenizer is not None:
        tokenizer.save_pretrained(output_path)

    return output_path

In [5]:
#  Load / resume model
global model
global tokenizer
model, tokenizer = FastLlamaModel.from_pretrained(
    model_name=base_model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

if (os.path.exists(output_dir) and len(os.listdir(output_dir)) > 1) or (
    resume_from is not None
):
    if resume_from is not None:
        latest_checkpoint = resume_from
    else:
        latest_checkpoint_steps = max(
            [
                int(ckpt.split("-")[1])
                for ckpt in os.listdir(output_dir)
                if ckpt.startswith("checkpoint")
            ]
        )
        latest_checkpoint = os.path.join(
            output_dir, f"checkpoint-{latest_checkpoint_steps}"
        )

    print(f"Loading model from {latest_checkpoint}.")

    if "adapter_config.json" in os.listdir(latest_checkpoint):
        print("`adapter_config.json` found. Loading as a PeftModel.")
        model = PeftModel.from_pretrained(model, latest_checkpoint)

        model, tokenizer = FastLlamaModel.from_pretrained(
            model_name=merge_and_save(),
            max_seq_length=max_seq_length,
            dtype=dtype,
            load_in_4bit=load_in_4bit,
        )

        is_lora = True
    else:
        print("`adapter_config.json` not found. Loading as a FastLlamaModel.")
        model, tokenizer = FastLlamaModel.from_pretrained(
            model_name=latest_checkpoint,
            max_seq_length=max_seq_length,
            dtype=dtype,
            load_in_4bit=load_in_4bit,
        )

        is_lora = False
else:
    print("Creating new model directory.")
    shutil.rmtree(output_dir, ignore_errors=True)
    os.makedirs(output_dir)


if is_lora or (is_lora == False and resume_from is not None):
    print("Creating LORA model.")
    model = FastLlamaModel.get_peft_model(
        model,
        r=32,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
        ],
        lora_alpha=32,
        lora_dropout=0,  # Currently only supports dropout = 0
        bias="none",  # Currently only supports bias = "none"
        use_gradient_checkpointing=use_gradient_checkpointing,  # With Unsloth, we can turn this off!
        random_state=random_state,
        use_rslora=True,  # We support rank stabilized LoRA
        loftq_config=LoftQConfig(loftq_bits=4),  # And LoftQ
        max_seq_length=max_seq_length,
    )

loading configuration file config.json from cache at /home/jfan/.cache/huggingface/hub/models--TinyLlama--TinyLlama-1.1B-intermediate-step-1431k-3T/snapshots/036fa4651240b9a1487f709833b9e4b96b4c1574/config.json
Model config LlamaConfig {
  "_name_or_path": "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 5632,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 22,
  "num_key_value_heads": 4,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.39.2",
  "use_cache": true,
  "vocab_size": 32000
}



==((====))==  Unsloth: Fast Llama patching release 2024.3
   \\   /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 24.0 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.1.1. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.23. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


loading configuration file config.json from cache at /home/jfan/.cache/huggingface/hub/models--TinyLlama--TinyLlama-1.1B-intermediate-step-1431k-3T/snapshots/036fa4651240b9a1487f709833b9e4b96b4c1574/config.json
Model config LlamaConfig {
  "_name_or_path": "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 5632,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 22,
  "num_key_value_heads": 4,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.39.2",
  "use_cache": true,
  "vocab_size": 32000
}

loading weights file model.safetensors from cache at 

Creating new model directory.
Creating LORA model.


Unsloth 2024.3 patched 22 layers with 22 QKV layers, 22 O layers and 22 MLP layers.
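The LoRA settings above (rank 32 on all seven projection modules) determine how many parameters are trained. As a worked check, the sketch below recomputes that count from the model config printed earlier, using the standard LoRA accounting of `rank * (in_features + out_features)` per adapted layer. The helper name `lora_params` is hypothetical; the dimensions come from the `LlamaConfig` dump.

```python
def lora_params(in_features, out_features, rank):
    """Parameters in one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)

# Values from the LlamaConfig dump and the get_peft_model call
hidden, intermediate, layers, rank = 2048, 5632, 22, 32
head_dim = hidden // 32   # num_attention_heads = 32
kv_dim = 4 * head_dim     # num_key_value_heads = 4 (grouped-query attention)

# (in_features, out_features) for each target module
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(lora_params(i, o, rank) for i, o in shapes.values())
total = per_layer * layers
print(f"{total:,}")  # 25,231,360
```

This matches the trainable-parameter count reported in the training log, roughly 2% of the 1.1B base parameters.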


In [6]:
# Load Data
train = pd.read_csv(train_path).sample(frac=1, random_state=random_state)
valid = None

if valid_path is not None:
    valid = pd.read_csv(valid_path).sample(frac=1, random_state=random_state)
else:
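    # pandas DataFrames have no train_test_split method, so hold out a random fraction instead
    valid = train.sample(frac=val_size, random_state=random_state)
    train = train.drop(valid.index)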
    
train

Unnamed: 0,text
705,"Context:\nDoctor: Hello, sir. Before we begin ..."
809,Context:\nDoctor: Hello. How are you doing?\nP...
1432,Context:\nDoctor: Did you ever had pneumonia?\...
173,Context:\nDoctor: I have their surgical histor...
513,Context:\nDoctor: Welcome to the clinic.\nPati...
...,...
1130,"Context:\nDoctor: Hello sir, how are you?\nPat..."
1294,"Context:\nDoctor: Hello, how are you?\nPatient..."
860,Context:\nDoctor: What types of surgeries have...
1459,Context:\nDoctor: Has anyone in your family ha...


In [7]:
# Define and run training loop
trainer = SFTTrainer(
    model=model,
    train_dataset=Dataset.from_pandas(train),
    eval_dataset=Dataset.from_pandas(valid),
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    dataset_num_proc=multiprocessing.cpu_count() // 2,
    args=TrainingArguments(
        do_train=True,
        do_eval=True,
        tf32=True,
        dataloader_num_workers=4,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_ratio=warmup_ratio,
        learning_rate=learning_rate,
        fp16=not HAS_BFLOAT16,
        bf16=HAS_BFLOAT16,
        output_dir=output_dir,
        optim=optimizer,
        weight_decay=weight_decay,
        lr_scheduler_type=lr_scheduler_type,
        seed=random_state,
        save_strategy="steps",
        save_steps=save_steps,
        logging_strategy="steps",
        evaluation_strategy="steps",
        eval_steps=save_steps,
        logging_steps=logging_steps,
        num_train_epochs=epochs,
        load_best_model_at_end=True,
        save_total_limit=3,
        save_safetensors=False,
    ),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2, early_stopping_threshold=0.0)],
)

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

trainer_stats = trainer.train()

Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Map (num_proc=10):   0%|          | 0/1702 [00:00<?, ? examples/s]

Map (num_proc=10):   0%|          | 0/119 [00:00<?, ? examples/s]

Using auto half precision backend
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,702 | Num Epochs = 10
O^O/ \_/ \    Batch size per device = 16 | Gradient Accumulation steps = 2
\        /    Total batch size = 32 | Total steps = 530
 "-____-"     Number of trainable parameters = 25,231,360
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


GPU = NVIDIA GeForce RTX 3090. Max memory = 24.0 GB.
2.246 GB of memory reserved.


[34m[1mwandb[0m: Currently logged in as: [33mjef1056[0m. Use [1m`wandb login --relogin`[0m to force relogin


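| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 25   | 1.6975        | 1.747466        |
| 50   | 1.4089        | 1.451734        |
| 75   | 1.3035        | 1.403583        |
| 100  | 1.2231        | 1.381681        |
| 125  | 1.1717        | 1.381397        |
| 150  | 1.1698        | 1.385434        |
| 175  | 1.0914        | 1.404944        |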
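
The log reports 530 planned steps, yet the metrics end at step 175. That is consistent with the `EarlyStoppingCallback(2, 0.0)` registered on the trainer. The sketch below is a simplified simulation of a patience-based stopping rule, not the library's own code. It assumes the monitored metric is the validation loss and that an evaluation counts as an improvement only if it beats the best value so far by more than the threshold. The function name `early_stop_step` is hypothetical.

```python
def early_stop_step(evals, patience=2, threshold=0.0):
    """Return the step at which training would stop, or None if it never stops.

    evals: list of (step, validation_loss) pairs in evaluation order.
    """
    best = None
    bad_evals = 0  # consecutive evaluations without improvement
    for step, loss in evals:
        if best is None or (loss < best and abs(loss - best) > threshold):
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
        if bad_evals >= patience:
            return step
    return None


# Validation losses copied from the training log above
validation_log = [
    (25, 1.747466),
    (50, 1.451734),
    (75, 1.403583),
    (100, 1.381681),
    (125, 1.381397),
    (150, 1.385434),
    (175, 1.404944),
]

stop_step = early_stop_step(validation_log)
print(f"Training would stop at step {stop_step}")
```

Under these assumptions the best validation loss is reached at step 125, and the two non-improving evaluations at steps 150 and 175 exhaust the patience of 2.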


***** Running Evaluation *****
  Num examples = 119
  Batch size = 16
Saving model checkpoint to models/ramp1/checkpoint-25
loading configuration file config.json from cache at /home/jfan/.cache/huggingface/hub/models--TinyLlama--TinyLlama-1.1B-intermediate-step-1431k-3T/snapshots/036fa4651240b9a1487f709833b9e4b96b4c1574/config.json
Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 5632,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 22,
  "num_key_value_heads": 4,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.39.2",
  "use_cache": true,
  "vocab_size": 32000
}

toke

In [8]:
# Cleanup: start from an empty output directory
shutil.rmtree(final_output_model_dir, ignore_errors=True)
if is_lora or resume_from is not None:
    # Save the LoRA weights and tokenizer under lora/, then write the merged model under merged/
    os.makedirs(os.path.join(final_output_model_dir, "lora"), exist_ok=True)
    os.makedirs(os.path.join(final_output_model_dir, "merged"), exist_ok=True)
    model.save_pretrained(
        os.path.join(final_output_model_dir, "lora"), safe_serialization=False
    )
    tokenizer.save_pretrained(os.path.join(final_output_model_dir, "lora"))
    merge_and_save(os.path.join(final_output_model_dir, "merged"))
else:
    # Otherwise save the model and tokenizer directly
    model.save_pretrained(final_output_model_dir, safe_serialization=False)
    tokenizer.save_pretrained(final_output_model_dir)

tokenizer config file saved in models/ramp1/final/lora/tokenizer_config.json
Special tokens file saved in models/ramp1/final/lor

# Evaluation

To evaluate this model, we compute the ROUGE scores of the model's outputs against the test sets.
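
To make the metric concrete, here is a minimal pure-Python sketch of how ROUGE-1 and ROUGE-L F1 scores are formed from token overlap. It is an illustration only, not the `rouge_scorer` implementation used in this notebook: it tokenizes naively and skips stemming, so its values can differ from the library's. The example sentences and helper names are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs (no stemming in this sketch)
    return re.findall(r"[a-z0-9]+", text.lower())


def f1(matches, n_candidate, n_reference):
    # Harmonic mean of precision and recall; 0.0 when nothing matches
    if matches == 0:
        return 0.0
    precision = matches / n_candidate
    recall = matches / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge1_f1(reference, candidate):
    ref, cand = Counter(tokenize(reference)), Counter(tokenize(candidate))
    overlap = sum((ref & cand).values())  # clipped unigram matches
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    # Longest common subsequence length via row-by-row dynamic programming
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rougeL_f1(reference, candidate):
    ref, cand = tokenize(reference), tokenize(candidate)
    return f1(lcs_length(ref, cand), len(cand), len(ref))


# Made-up example: the same question asked with different word order
reference = "Have you had any back pain?"
candidate = "Have you had pain in your back?"
r1 = rouge1_f1(reference, candidate)
rl = rougeL_f1(reference, candidate)
print(f"ROUGE-1 F1 = {r1:.3f}, ROUGE-L F1 = {rl:.3f}")
# prints: ROUGE-1 F1 = 0.769, ROUGE-L F1 = 0.615
```

The two sentences ask essentially the same question, yet ROUGE-L drops because word order differs. This is why a reworded question can score well below 1.0.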

In [26]:
max_seq_length = 2048
random_state = 42

dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
HAS_BFLOAT16 = torch.cuda.is_bf16_supported()

model, tokenizer = FastLlamaModel.from_pretrained(
    model_name = os.path.join(final_output_model_dir, "merged"),
    max_seq_length = max_seq_length,
    dtype = dtype,
)

# model.save_pretrained_gguf("final_model.gguf", tokenizer)
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

loading configuration file models/ramp1/final/merged/config.json
Model config LlamaConfig {
  "_name_or_path": "models/ramp1/final/merged",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 5632,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 22,
  "num_key_value_heads": 4,
  "pad_token_id": 2,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.39.3",
  "unsloth_version": "2024.3",
  "use_cache": true,
  "vocab_size": 32000
}

==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 24.0 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.1.1. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.23. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2
}

target_dtype {target_dtype} is replaced by `CustomDtype.INT4` for 4-bit BnB quantization
All model checkpoint weights were used when initializing LlamaForCausalLM.

All the weights of LlamaForCausalLM were initialized from the model checkpoint at models/ramp1/final/merged.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
loading configuration file models/ramp1/final/merged/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "max_length": 2048,
  "pad_token_id": 0
}

loading file tokenizer.model
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file tokenizer.json
loading file tokenizer.model
loading

In [39]:
test_data1 = pd.read_csv(os.path.join(processed_dir, 'MTS-Dialog-TestSet-1-MEDIQA-Chat-2023.csv'))
test_data2 = pd.read_csv(os.path.join(processed_dir, 'MTS-Dialog-TestSet-2-MEDIQA-Sum-2023.csv'))

score_counters = {}
processed_count = 0
total = len(test_data1)

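# Iterate through both test sets, generating questions for each dialogue.
# NOTE: score_counters and processed_count are shared across both test sets and
# are averaged in place after each one, so the second set of printed averages is
# not an independent average over test set 2.
for test_dataset in [test_data1, test_data2]:
    for i, row in tqdm(test_dataset.iterrows(), total=total):
        # Text before "Questions:" is the dialogue context; the rest is the reference
        prompt, answers = row['text'].split("Questions:")
        inputs = prompt + "Questions:"
        inputs_tokenized = tokenizer([inputs], return_tensors = "pt").to("cuda")
        generated = model.generate(**inputs_tokenized, max_new_tokens = 128, use_cache = False, top_p = 0.9, top_k = 50, temperature = 0.9, num_return_sequences = 1)[0]
        # Drop the echoed prompt from the decoded text and keep the first two blank-line-separated blocks
        outputs = "\n\n".join(tokenizer.decode(generated)[len(inputs)+5:].split("\n\n")[0:2]).strip()

        # ROUGE is computed against the full row text (context plus reference questions)
        scores = scorer.score(row['text'], outputs)

        # Accumulate the F1 component of each ROUGE variant
        for key in scores:
            score_counters[key] = score_counters.get(key, 0) + scores[key].fmeasure

        processed_count += 1

        if i == total:
            break

    for key in score_counters:
        score_counters[key] /= processed_count
        print(f"Average {key} F1 Score: {score_counters[key]}")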

  0%|          | 0/260 [00:00<?, ?it/s]

Average rouge1 F1 Score: 0.29979520974440355
Average rouge2 F1 Score: 0.11813238381054471
Average rougeL F1 Score: 0.23140387876731552


  0%|          | 0/260 [00:00<?, ?it/s]

Average rouge1 F1 Score: 0.13252110868218833
Average rouge2 F1 Score: 0.051978575980848846
Average rougeL F1 Score: 0.10097343177444587
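
The evaluation loop above keeps one running set of counters for both test sets, so the second block of averages is not computed over the second test set alone. One way to report each test set independently is to collect per-example scores by dataset and average them separately. This is a sketch under stated assumptions: `average_scores` is a hypothetical helper, and the numbers are placeholder values standing in for the `fmeasure` fields returned by `scorer.score(...)`, not results from this project.

```python
from collections import defaultdict


def average_scores(score_rows):
    """Average a list of {metric: value} dicts into one {metric: mean} dict."""
    totals = defaultdict(float)
    for row in score_rows:
        for key, value in row.items():
            totals[key] += value
    count = len(score_rows)
    if count == 0:
        return {}
    return {key: total / count for key, total in totals.items()}


# Placeholder per-example F1 values (hypothetical, for illustration only)
dataset_scores = {
    "test_set_1": [
        {"rouge1": 0.5, "rougeL": 0.25},
        {"rouge1": 0.25, "rougeL": 0.25},
    ],
    "test_set_2": [
        {"rouge1": 0.75, "rougeL": 0.5},
    ],
}

# Each dataset gets fresh accumulators, so no totals carry over between them
per_dataset = {name: average_scores(rows) for name, rows in dataset_scores.items()}
for name, averages in per_dataset.items():
    for key, value in averages.items():
        print(f"{name}: average {key} F1 = {value}")
```

Because every call starts with fresh totals and its own example count, the averages for one test set cannot leak into the other.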


In [46]:
prompt, answers = test_data1['text'][40].split("Questions:")
inputs = prompt + "Questions:"
inputs_tokenized = tokenizer([inputs], return_tensors = "pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs_tokenized, max_new_tokens = 128, use_cache = False, top_p = 0.9, top_k = 50, temperature = 0.9, num_return_sequences = 1)[0]))

<s> Context:
Doctor: Hi there! I am Doctor Jones, sir.
Patient: Hello! It is nice to meet you.
Doctor: What brings you into see me today?
Patient: I have had this weakness in my right leg for quite some time now.
Doctor: How long has this been going on and do you know how you injured yourself?
Patient: I think that it was about six months ago that the weakness in my leg started. I don't really remember how it happened.
Doctor: Can you tell me what you do remember?
Patient: I was reaching to get something from a cabinet, and I noticed that I was unable to stand on my right toe. Ever since then I have had difficulty pushing off when walking. My toes were tingling and numb.
Doctor: Was the numbness and tingling mild, moderate, or severe?
Patient: It was a mild feeling, but this has been an ongoing problem that has been the same since the weakness started.
Doctor: Have you had any other pain any where else in your body?
Patient: I have had back pain, but this has been going on for many yea

# Results And Analysis
The average ROUGE scores are moderate. Interestingly, the model achieves a fairly low loss on the test set and remains consistent on the validation set. This gap can probably be attributed to the difference between what the loss measures and what ROUGE measures: ROUGE rewards word overlap with the reference, while the text produced by the model may be a rewording, or a different variant, of the options produced by the original expert model.

Average rouge1 F1 Score: 0.29979520974440355
Average rouge2 F1 Score: 0.11813238381054471
Average rougeL F1 Score: 0.23140387876731552

Looking at the actual results from model generation, for example:
```text
Context:
Doctor: Hi there! I am Doctor Jones, sir.
Patient: Hello! It is nice to meet you.
Doctor: What brings you into see me today?
Patient: I have had this weakness in my right leg for quite some time now.
Doctor: How long has this been going on and do you know how you injured yourself?
Patient: I think that it was about six months ago that the weakness in my leg started. I don't really remember how it happened.
Doctor: Can you tell me what you do remember?
Patient: I was reaching to get something from a cabinet, and I noticed that I was unable to stand on my right toe. Ever since then I have had difficulty pushing off when walking. My toes were tingling and numb.
Doctor: Was the numbness and tingling mild, moderate, or severe?
Patient: It was a mild feeling, but this has been an ongoing problem that has been the same since the weakness started.
Doctor: Have you had any other pain any where else in your body?
Patient: I have had back pain, but this has been going on for many years and has not changed.
Doctor: Is the back pain been mild, moderate, or severe?
Patient: I would say mild. I have also been having cramping in both calves.
Doctor: How long has that been going on?
Patient: For the past year but it stopped about two months ago.

Questions:
["What type of pain are you experiencing in your right leg?", "Have you noticed any changes in your right leg since your last visit?", "Have you noticed any changes in your right leg since your last visit?"]

Ranked:
[3, 2, 1]
```
The results seem to line up pretty closely with the test set. While they may contain different questions, their content and the order in which they are prioritized are, at least by human evaluation, surprisingly good.
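
Output in the format shown in the example (a JSON-style `Questions:` list followed by a `Ranked:` list) can be turned into an ordered list of questions with a few lines of parsing. The sketch below makes an assumption the notebook does not confirm: that each entry in `Ranked` is the 1-based rank of the question at the same position, with 1 being the highest priority. The function name `parse_ranked_questions` is hypothetical.

```python
import json


def parse_ranked_questions(generated):
    """Return the generated questions ordered by their rank (assumed 1 = first)."""
    _, _, tail = generated.partition("Questions:")
    questions_part, _, ranked_part = tail.partition("Ranked:")
    questions = json.loads(questions_part.strip())
    ranks = json.loads(ranked_part.strip())
    if len(questions) != len(ranks):
        raise ValueError("Questions and Ranked must have the same length")
    # Sort on rank only, so ties keep their original relative order
    ordered = sorted(zip(ranks, questions), key=lambda pair: pair[0])
    return [question for _, question in ordered]


# Shortened stand-in for model output; the question texts are placeholders
sample = 'Context:\n...\n\nQuestions:\n["a", "b", "c"]\n\nRanked:\n[3, 2, 1]\n'
ordered_questions = parse_ranked_questions(sample)
print(ordered_questions)
# prints: ['c', 'b', 'a']
```

If the model emits malformed JSON, `json.loads` raises an error, which is a reasonable signal to skip or regenerate that example.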

Model training metrics can be found here: https://wandb.ai/jef1056/CS598/

# Plans
Additional potential improvements:
- Dividing the task more cleanly into re-ranking and generation steps
- Hyperparameter sweep to determine the best training metric
- Enhance logging and charts