## MedTitleGen

<a name='1'></a>
## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM

<a name='1.1'></a>
### 1.1 - Set up Kernel and Required Dependencies

In [1]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch \
    torchdata --quiet

%pip install \
    transformers \
    datasets==2.15.0 \
    evaluate \
    rouge_score \
    loralib \
    peft


Import the necessary components. They will be discussed later in the notebook.

In [2]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np



<a name='1.2'></a>
### 1.2 - Load Dataset and LLM



In [4]:
huggingface_dataset_name = "medalpaca/medical_meadow_cord19"

dataset = load_dataset(huggingface_dataset_name,'main')

dataset

DatasetDict({
    train: Dataset({
        features: ['output', 'input', 'instruction'],
        num_rows: 821007
    })
})

In [5]:
model_name='Falconsai/medical_summarization'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [6]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 60506624
all model parameters: 60506624
percentage of trainable model parameters: 100.00%


In [7]:
dataset=dataset['train'].train_test_split(test_size=0.8)
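
As a quick sanity check on the split sizes: with `test_size=0.8`, roughly 80% of the 821,007 rows go to the test split and the remainder to training. The sketch below assumes the test split is rounded up (a ceiling), which reproduces the row counts reported later in this notebook; `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math

def split_sizes(num_rows, test_size):
    """Return (train_rows, test_rows) for a fractional test_size."""
    test_rows = math.ceil(num_rows * test_size)  # assumed rounding rule
    train_rows = num_rows - test_rows
    return train_rows, test_rows

train_rows, test_rows = split_sizes(821007, 0.8)
print(train_rows, test_rows)
```

Only about a fifth of the data ends up in the training split with this setting.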

<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Run the base model on one test example with zero-shot inference, before any fine-tuning, to establish a baseline.

In [8]:
index = 200

question = dataset['test'][index]['input']
answer = dataset['test'][index]['output']

prompt = f"""
Please summerize the given abstract to a title:
{question}
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
    )[0],
    skip_special_tokens=True
)

dash_line = '-' * 99  # 99-character separator line
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{answer}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')



---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Please summerize the given abstract to a title:
The Indian government, therefore, forbids all airlines from these countries Besides these, the Indian government begins the screening for COVID-19 symptoms at the airport [ ]on March 22, 2020, the Indian Government proclaimed the Janata curfew, during which citizens ordered themselves to be confined at home from 6:00 a m to 9:00 p m in the night [ ]the time had come to transfer the medical education curriculum from classical face to face into the modern online teaching process [ ]we also try to find out in between online teaching and traditional face-to-face teaching, which method is more effective

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Preferred online teaching and assessment methods among Indian medical graduates in coronavirus disease era

In [9]:
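def tokenize_function(example):
    # Prepend the instruction to every abstract in the batch.
    start_prompt = 'Please summerize the given abstract to a title:'
    prompt = [start_prompt + abstract for abstract in example["input"]]
    # Pad or truncate to the tokenizer's maximum length so every row has the same shape.
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    # The reference titles become the labels; padded positions keep the pad-token id.
    example['labels'] = tokenizer(example["output"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example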

tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [10]:
print("Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (164201, 5)
Test: (656806, 5)
DatasetDict({
    train: Dataset({
        features: ['output', 'input', 'instruction', 'input_ids', 'labels'],
        num_rows: 164201
    })
    test: Dataset({
        features: ['output', 'input', 'instruction', 'input_ids', 'labels'],
        num_rows: 656806
    })
})


The output dataset is ready for fine-tuning.
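
One common refinement, which this notebook does not apply, is to replace padded label positions with `-100` so that the loss ignores them. The sketch below shows the idea on plain Python lists; `mask_padding` is a hypothetical helper, and the pad token id of `0` is an assumption (check `tokenizer.pad_token_id` for the actual value).

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_padding(label_ids, pad_token_id=0):
    """Replace pad-token ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

labels = [[71, 1782, 1, 0, 0], [27, 1, 0, 0, 0]]  # made-up token ids
print(mask_padding(labels))
```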

<a name='3'></a>
## 2 - Perform Parameter Efficient Fine-Tuning (PEFT)


<a name='3.1'></a>
### 2.1 - Set Up the PEFT/LoRA Model for Fine-Tuning


In [11]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, 
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM 
)
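
To make the configuration concrete, here is a minimal pure-Python sketch of the low-rank update that LoRA adds to a frozen weight matrix: the layer output becomes `W x + (lora_alpha / r) * B (A x)`. The tiny matrices and the helper names (`matvec`, `lora_forward`) are illustrative assumptions, not PEFT internals. With `r=32` and `lora_alpha=32` as configured above, the scaling factor is 1.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, r, lora_alpha):
    """Frozen base output plus the scaled low-rank correction B(Ax)."""
    scaling = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]

# Toy example: 2x2 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]            # shape (r, d_in)
B_zero = [[0.0], [0.0]]     # shape (d_out, r), zero-initialized
B_trained = [[2.0], [0.0]]  # made-up values standing in for a trained adapter
x = [1.0, 3.0]

print(lora_forward(W, A, B_zero, x, r=1, lora_alpha=1))     # same as W x
print(lora_forward(W, A, B_trained, x, r=1, lora_alpha=1))
```

Because `B` starts at zero, the adapted layer initially reproduces the base model's output; only `A` and `B` are updated during training.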

In [12]:
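# Wrap the base model with the LoRA adapters defined above.
# Note: PEFT injects the adapter layers into original_model in place.
peft_model = get_peft_model(original_model,
                            lora_config)
peft_model = peft_model.to("cuda:0")  # Move model to cuda:0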
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 1179648
all model parameters: 61686272
percentage of trainable model parameters: 1.91%
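
The trainable-parameter count can be reproduced by hand. The sketch below assumes a T5-small-style architecture (projection width 512, 6 encoder and 6 decoder layers, with cross-attention in every decoder layer). These dimensions are an assumption about the checkpoint, chosen because they match the totals printed above.

```python
d_model = 512          # assumed input/output width of each q and v projection
r = 32                 # LoRA rank from lora_config
encoder_layers = 6     # assumed
decoder_layers = 6     # assumed

# Encoder layers have self-attention; decoder layers have self- and cross-attention.
attention_blocks = encoder_layers + 2 * decoder_layers
target_modules = attention_blocks * 2            # "q" and "v" in every block

# Each adapter adds A (r x d_model) and B (d_model x r).
params_per_adapter = r * d_model + d_model * r
lora_params = target_modules * params_per_adapter

base_params = 60506624                           # reported for the original model
print(lora_params)                               # 1179648
print(lora_params + base_params)                 # 61686272
print(f"{100 * lora_params / (lora_params + base_params):.2f}%")
```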


<a name='2.2'></a>
### 2.2 - Train PEFT Adapter


In [13]:
output_dir = f'./peft-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=3,
    max_steps=-1
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

In [14]:
peft_trainer.train()

peft_model_path="./peft-checkpoint-local2"  # same path that is loaded in the next step

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

| Step | Training Loss |
|---|---|
| 500 | 0.2694 |
| 1000 | 0.1383 |
| 1500 | 0.1352 |
| 2000 | 0.1282 |
| 2500 | 0.1319 |
| 3000 | 0.1285 |
| 3500 | 0.1255 |
| 4000 | 0.1269 |
| 4500 | 0.1274 |
| 5000 | 0.1245 |


('./peft-checkpoint-local2/tokenizer_config.json',
 './peft-checkpoint-local2/special_tokens_map.json',
 './peft-checkpoint-local2/spiece.model',
 './peft-checkpoint-local2/added_tokens.json',
 './peft-checkpoint-local2/tokenizer.json')

In [15]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 1179648
all model parameters: 61686272
percentage of trainable model parameters: 1.91%


In [16]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("Falconsai/medical_summarization", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("Falconsai/medical_summarization")

peft_model = PeftModel.from_pretrained(peft_model_base,
                                       './peft-checkpoint-local2',
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False).to('cuda')

Because the adapter is loaded with `is_trainable=False`, the number of trainable parameters is `0`:

In [17]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 0
all model parameters: 61686272
percentage of trainable model parameters: 0.00%


<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

In [18]:
index = 20
question = dataset['test'][index]['input']
baseline_human_answer = dataset['test'][index]['output']

prompt = f"""
Please summerize the given abstract to a title:
{question}
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)


peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{baseline_human_answer}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Characteristics of boundary layer ozone and its effect on surface ozone concentration in Shenzhen, China: A case study.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
O3 pollution in Pearl River Delta: A review of the O3 vertical distribution characteristics boundary layer
---------------------------------------------------------------------------------------------------
PEFT MODEL: Ozone pollution in Pearl River Delta, a region of the ocean


<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)
Run inference on a sample of the test dataset (only the first 1,000 abstracts and their titles, to save time).

In [19]:
questions = dataset['test'][0:1000]['input']
human_baseline_answers = dataset['test'][0:1000]['output']

original_model_summaries = []
peft_model_summaries = []

for idx, question in enumerate(questions):
    prompt = f"""
    Please summerize the given abstract to a title:
    {question}
    """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')

    human_baseline_text_output = human_baseline_answers[idx]

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    
    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_answers, original_model_summaries,  peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_answers', 'original_model_summaries', 'peft_model_summaries'])
df

Token indices sequence length is longer than the specified maximum sequence length for this model (923 > 512). Running this sequence through the model will result in indexing errors


| | human_baseline_answers | original_model_summaries | peft_model_summaries |
|---|---|---|---|
| 0 | The short-term impacts of coronavirus quaranti... | Health and the economic trade-offs during the ... | Health and the Economy during the Period of So... |
| 1 | Federally Qualified Health Centers Play a Crit... | Vaccination Equity in the US Vaccine Administr... | Racial/ethnic Equity in COVID19 Vaccination Ad... |
| 2 | Family with children in times of pandemic - wh... | Social and Social Working in the Environment o... | The impact of the COVID-19 pandemic on parents... |
| 3 | Creating an Artificial Tail Anchor as a Novel ... | A Novel and Simple Strategy to Enhance the Pot... | A novel and simple strategy to enhance the pot... |
| 4 | Wetting, Adhesion, and Droplet Impact on Face ... | Face Masks with incoming Droplets | The Interaction of Face Masks with incoming Dr... |
| ... | ... | ... | ... |
| 995 | Debate: Remote learning during COVID‐19 for ch... | Hidden Curriculum for Children with Autism Spe... | Hidden Curriculum for Children with Autism Spe... |
| 996 | Learning process of causes, consequences and s... | Students' Understanding and Attitude towards t... | Understanding and Attitude of Students Towards... |
| 997 | Web Application to Track Student Attentiveness... | Drowsiness Detection of the Student's Activity... | Drowsiness detection of student in online clas... |
| 998 | Dynamics of the seroprevalence of SARS-CoV-2 a... | SARS-CoV-2 seroprevalence among healthcare wor... | SARS-CoV-2 seroprevalence among healthcare wor... |


In [20]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_answers[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_answers[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.4300321810398991, 'rouge2': 0.221781728227968, 'rougeL': 0.3613286157561494, 'rougeLsum': 0.3614749956629657}
PEFT MODEL:
{'rouge1': 0.4542976478388038, 'rouge2': 0.24743987563477182, 'rougeL': 0.3860600516351327, 'rougeLsum': 0.3858348252772862}
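
For intuition about what these scores measure, here is a simplified sketch of ROUGE-1 F1 as clipped unigram overlap between a prediction and a reference. It is not the `evaluate`/`rouge_score` implementation (there is no stemming, and tokenization is by whitespace only), and `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two whitespace-tokenized, lowercased strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Made-up strings: 3 shared words, 4 predicted, 6 in the reference.
print(rouge1_f1("ozone pollution in shenzhen",
                "boundary layer ozone in shenzhen china"))
```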


In [21]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 2.43%
rouge2: 2.57%
rougeL: 2.47%
rougeLsum: 2.44%
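
The figures above are absolute differences in ROUGE score, that is, percentage points. As a complementary view, the sketch below computes the relative improvement over the original model, using the scores printed earlier rounded to four decimals; `relative_improvement` is a hypothetical helper.

```python
# Scores copied from the ROUGE output above, rounded to four decimals.
original = {"rouge1": 0.4300, "rouge2": 0.2218, "rougeL": 0.3613, "rougeLsum": 0.3615}
peft = {"rouge1": 0.4543, "rouge2": 0.2474, "rougeL": 0.3861, "rougeLsum": 0.3858}

def relative_improvement(new, old):
    """Return (new - old) / old for every metric."""
    return {key: (new[key] - old[key]) / old[key] for key in old}

for key, value in relative_improvement(peft, original).items():
    print(f"{key}: {value * 100:.2f}%")
```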
