<a href="https://colab.research.google.com/github/EllieZhangy/GPT-LLM-Based-Impression-Prediction-from-Radiology-Reports/blob/main/medalpaca-7b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Radiology Report Fine Tuning

This fine-tuning process follows the steps in
Venelin Valkov's YouTube [video](https://www.youtube.com/watch?v=4-Q50fmq7Uw&t=1691s).

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
!pip install -U pip
!pip install accelerate
!pip install appdirs==1.4.4
!pip install bitsandbytes==0.37.2
!pip install datasets==2.10.1
!pip install fire==0.5.0
!pip install git+https://github.com/huggingface/peft.git@e536616888d51b453ed354a6f1e243fecb02ea08
!pip install git+https://github.com/huggingface/transformers.git
!pip install torch==2.0.0
!pip install sentencepiece==0.1.97
!pip install tensorboardX==2.6
!pip install gradio==3.9

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting accelerate
  Downloading accelerate-0.19.0-py3-none-any.whl (219 kB)
     219.1/219.1 kB 5.6 MB/s eta 0:00:00
Installing collected packages: accelerate
Successfully installed accelerate-0.19.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bitsandbytes==0.37.2
  Downloading bitsandbytes-0.37.2-py3-none-any.whl (84.2 MB)
     84.2/84.2 MB 21.2 MB/s eta 0:00:00
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.37.2
Looking in in

In [None]:
import transformers
import textwrap
from transformers import LlamaTokenizer, LlamaForCausalLM
import os
import sys
from typing import List

from peft import (
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
    prepare_model_for_int8_training,
    set_peft_model_state_dict
)

import fire
import torch
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from pylab import rcParams
import json

%matplotlib inline
sns.set(rc={'figure.figsize':(8, 6)})
sns.set(rc={'figure.dpi':100})
sns.set(style='white', palette='muted', font_scale=1.2)

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DEVICE


Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


'cuda'

### Radiology Reports Classification

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Mount Google Drive; the dataset is read from Drive in the next cell
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df = pd.read_csv('/content/drive/MyDrive/Rad_all_data_id.csv')

In [None]:
# Drop the rows where `findings` is NA
data = df.dropna(subset=['findings'])

In [None]:
import re

# Define the categories and their '|'-separated keywords
# (matched as plain lowercase substrings, not as regular expressions)
categories = {'MRI': 'mri',
              'CT': 'ct',
              'X-Ray': 'x-ray|x ray|xray|radiography|chest',
              'Ultrasound': 'ultrasound',
              'Sono': 'sono'}

# Create a function to match the category keywords in multiple columns and return the category
def get_category(row, columns):
    for col in columns:
        if pd.isna(row[col]):
            continue
        for category, keywords in categories.items():
            if any([kw in row[col].lower() for kw in keywords.split('|')]):
                return category
    return 'Others'

# Apply the function to create the 'category' column
columns = ['technique', 'findings','comparison']
data['category'] = data.apply(lambda row: get_category(row, columns), axis=1)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['category'] = data.apply(lambda row: get_category(row, columns), axis=1)
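The `SettingWithCopyWarning` above appears because `data` was derived from `df` via `dropna`, and pandas cannot tell whether the new column is meant for the original frame or a copy. A minimal sketch of one common way to avoid it is to take an explicit copy before adding columns. The tiny frame below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for Rad_all_data_id.csv (values are made up).
df_demo = pd.DataFrame(
    {
        "technique": ["MRI brain", None, "Four views of the lumbar spine"],
        "findings": ["No acute abnormality.", None, "Stable alignment."],
    }
)

# An explicit copy makes the result independent of df_demo,
# so adding a column is unambiguous.
data_demo = df_demo.dropna(subset=["findings"]).copy()
data_demo["category"] = "Others"

print(data_demo.shape)
```

The same `.copy()` call could be applied to the `dropna` result in the notebook; the resulting categories would not change.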


In [None]:
data['category'].value_counts()

CT            582951
X-Ray         122431
Others         36993
MRI            33894
Sono            6577
Ultrasound      6433
Name: category, dtype: int64

In [None]:
data.head()

Unnamed: 0.1,Unnamed: 0,clinical_information,technique,findings,comparison,impression,report_id,category
0,0,34 year old female with history of sickle cell...,2 views of the right shoulder at 6:41 on 7/12/12,The right total shoulder arthroplasty componen...,XR shoulder 7/11/12,Right total shoulder arthroplasty components i...,RAD_0,CT
1,1,34 year old female with history of sickle cell...,One portable view of the right shoulder at 17:...,The right total shoulder arthroplasty componen...,XR shoulder 2/1/12,Right total shoulder arthroplasty components i...,RAD_1,CT
2,2,84-year-old female with low back pain,Four views of the lumbar spine,Posterior stabilization rods with transpedicul...,2/13/06,"Posterior fixation of L4 and L5, appearing sim...",RAD_2,CT
3,3,,Informed consent was obtained. The patient was...,The colon is adequately cleansed and distended...,,No significant colonic polyps or masses identi...,RAD_3,CT
4,4,Preoperative planning for brain tumor. History...,MRI BRAIN STEALTH W/WO CONTRAST. A total of 17...,There is a heterogeneous left supratentorial a...,Brain MRI dated 11/17/14.,Presurgical planning MRI shows a complex mass ...,RAD_4,MRI
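`get_category` tests each keyword with Python's `in` operator on the lowercased text, so `'ct'` also matches inside ordinary words such as "intact" or "fracture", and the first column that matches any category wins. This may explain why the shoulder and lumbar-spine rows above are labeled `CT`. A minimal sketch of a stricter matcher based on word-boundary regular expressions follows. `CATEGORY_PATTERNS` and `get_category_strict` are hypothetical names, and treating `sono` as a word prefix is an assumption.

```python
import re

# Hypothetical stricter patterns; the keywords mirror the `categories` dict.
CATEGORY_PATTERNS = {
    "MRI": re.compile(r"\bmri\b", re.IGNORECASE),
    "CT": re.compile(r"\bct\b", re.IGNORECASE),
    "X-Ray": re.compile(r"\b(?:x-ray|x ray|xray|radiography|chest)\b", re.IGNORECASE),
    "Ultrasound": re.compile(r"\bultrasound\b", re.IGNORECASE),
    # Assumption: "sono" is meant as a prefix (sonogram, sonographic, ...).
    "Sono": re.compile(r"\bsono", re.IGNORECASE),
}


def get_category_strict(texts):
    """Return the first category whose pattern matches any of the given texts."""
    for text in texts:
        if not isinstance(text, str):  # skips NaN / None cells
            continue
        for category, pattern in CATEGORY_PATTERNS.items():
            if pattern.search(text):
                return category
    return "Others"


print(get_category_strict(["2 views of the right shoulder", "Components are intact."]))
print(get_category_strict([None, "CT abdomen and pelvis"]))
```

With pandas, this could be applied row-wise as `data.apply(lambda row: get_category_strict(row[columns]), axis=1)`. The category counts would then differ from the ones shown above.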


### Create a function to build the JSON records

In [None]:
def get_json(data, cat):
  dat_new = data.loc[data['category'] == cat]
  dat_new['input'] = dat_new['clinical_information'].fillna('') + ' ' + dat_new['findings'].fillna('')
  dat_new = dat_new[['input','impression']]
  dataset_data = [
    {
        "instruction": "Generate impression based on findings.",
        "input": row_dict["input"],
        "output": row_dict["impression"]
    }
    for row_dict in dat_new.to_dict(orient="records")]
  return dataset_data

Get the Sono dataset.

In [None]:
sono = get_json(data, "Sono")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dat_new['input'] = dat_new['clinical_information'].fillna('') + ' ' + dat_new['findings'].fillna('')


Save the Sono records to a file, one JSON object per line.

In [None]:
with open('sono.json', 'w') as outfile:
    for obj in sono:
        json.dump(obj, outfile)
        outfile.write('\n')
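The cell above writes one JSON object per line (the JSON Lines layout), which is what `load_dataset("json", ...)` reads later. A minimal sketch of a round trip that checks such a file, using only the standard library; `write_jsonl`, `read_jsonl`, and the demo record are hypothetical and only illustrate the format.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up record with the same keys that get_json produces.
demo_records = [
    {
        "instruction": "Generate impression based on findings.",
        "input": "Right upper quadrant pain. The gallbladder is normal.",
        "output": "Normal gallbladder.",
    }
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
write_jsonl(demo_records, demo_path)
print(read_jsonl(demo_path) == demo_records)
```

Because `json.dumps` escapes newlines inside strings, multi-line findings still occupy a single line in the file.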

### MedAlpaca LoRA

In [None]:
BASE_MODEL = "medalpaca/medalpaca-7b"
 
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
 
tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
 
tokenizer.pad_token_id = (
    0  # unk. we want this to be different from the eos token
)
tokenizer.padding_side = "left"

Downloading (…)lve/main/config.json:   0%|          | 0.00/542 [00:00<?, ?B/s]

Downloading (…)model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading (…)l-00001-of-00003.bin:   0%|          | 0.00/9.88G [00:00<?, ?B/s]

Downloading (…)l-00002-of-00003.bin:   0%|          | 0.00/9.89G [00:00<?, ?B/s]

Downloading (…)l-00003-of-00003.bin:   0%|          | 0.00/7.18G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)in/added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/96.0 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/224 [00:00<?, ?B/s]

### Dataset

In [None]:
from datasets import load_dataset
data = load_dataset("json", data_files="sono.json")

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-2b37f85939402675/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-2b37f85939402675/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
data["train"]

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 6577
})

In [None]:
CUTOFF_LEN = 256

In [None]:
def generate_prompt(data_point):    
    return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{data_point["instruction"]}
### Input:
{data_point["input"]}
### Response:
{data_point["output"]}"""
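The template above ends with the reference `output`, which is what training needs. At generation time the response section is normally left empty so that the model has to produce it. A minimal sketch of such an inference-only variant follows; `generate_inference_prompt` is a hypothetical helper and is not part of this notebook.

```python
def generate_inference_prompt(data_point):
    """Same template as generate_prompt, but the response section is left empty."""
    return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{data_point["instruction"]}
### Input:
{data_point["input"]}
### Response:
"""


# Made-up example; the "output" field is deliberately not used.
example = {
    "instruction": "Generate impression based on findings.",
    "input": "The liver is normal in size and echogenicity.",
    "output": "Normal liver.",
}
inference_prompt = generate_inference_prompt(example)
print(inference_prompt)
```

Tokenizing this prompt without appending the end-of-sequence token keeps the reference impression out of the model input during evaluation.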

In [None]:
def tokenize(prompt, add_eos_token=True):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=CUTOFF_LEN,
        padding=False,
        return_tensors=None,
    )
    if (
        result["input_ids"][-1] != tokenizer.eos_token_id
        and len(result["input_ids"]) < CUTOFF_LEN
        and add_eos_token
    ):
        result["input_ids"].append(tokenizer.eos_token_id)
        result["attention_mask"].append(1)

    result["labels"] = result["input_ids"].copy()

    return result

def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenize(full_prompt)
    return tokenized_full_prompt
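`tokenize` copies `input_ids` into `labels`, so the loss is computed over the instruction and input as well as the response. A common variant, shown here only as a sketch and not what this notebook does, masks the prompt tokens with `-100`, the label value that PyTorch's cross-entropy loss ignores by default. `mask_prompt_labels` is a hypothetical helper, and the token ids are made up.

```python
IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default


def mask_prompt_labels(input_ids, prompt_len):
    """Return labels in which the first `prompt_len` positions are ignored."""
    prompt_len = min(prompt_len, len(input_ids))
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Made-up ids: three prompt tokens followed by two response tokens.
demo_labels = mask_prompt_labels([11, 12, 13, 21, 22], prompt_len=3)
print(demo_labels)
```

In practice `prompt_len` would be the number of tokens produced by tokenizing the prompt without the response.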

In [None]:
# Remember to adjust the test size
train_val = data["train"].train_test_split(
    test_size=400, shuffle=True, seed=42  # increase the test size if you think it's too small
)
train_data = (
    train_val["train"].shuffle().map(generate_and_tokenize_prompt)
)
val_data = (
    train_val["test"].shuffle().map(generate_and_tokenize_prompt)
)



Map:   0%|          | 0/6177 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

In [None]:
LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT= 0.05
LORA_TARGET_MODULES = [
    "q_proj",
    "v_proj",
]

BATCH_SIZE = 128
MICRO_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
LEARNING_RATE = 3e-4
TRAIN_STEPS = 300  # lower this for a shorter training run
OUTPUT_DIR = "experiments"
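The batch settings above interact: each optimizer step accumulates gradients over several micro-batches. A short worked calculation, assuming a single GPU and the 6177 training examples reported in the `Map` output earlier:

```python
BATCH_SIZE = 128
MICRO_BATCH_SIZE = 4
TRAIN_STEPS = 300
NUM_TRAIN_EXAMPLES = 6177  # size of the training split after the 400-example hold-out

# Micro-batches accumulated before each optimizer step.
grad_accum_steps = BATCH_SIZE // MICRO_BATCH_SIZE

# Examples consumed over the whole run (single GPU assumed).
examples_seen = TRAIN_STEPS * MICRO_BATCH_SIZE * grad_accum_steps

# Approximate number of passes over the training split.
approx_epochs = examples_seen / NUM_TRAIN_EXAMPLES

print(grad_accum_steps, examples_seen, round(approx_epochs, 2))
```

So 300 steps correspond to roughly six passes over the Sono training split under these assumptions.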

In [None]:
model = prepare_model_for_int8_training(model)
config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=LORA_TARGET_MODULES,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()

trainable params: 4194304 || all params: 6742618112 || trainable%: 0.06220586618327525
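The trainable-parameter count printed above can be reproduced by hand. LoRA adds two low-rank matrices to each adapted projection, and LLaMA-7B has 32 decoder layers with a hidden size of 4096. A short check under those assumptions:

```python
HIDDEN_SIZE = 4096  # LLaMA-7B hidden dimension
NUM_LAYERS = 32     # LLaMA-7B decoder layers
LORA_R = 8
TARGET_MODULES = ["q_proj", "v_proj"]

# Each adapted (hidden x hidden) projection gains A (r x hidden) and B (hidden x r).
params_per_module = 2 * HIDDEN_SIZE * LORA_R
trainable_params = params_per_module * len(TARGET_MODULES) * NUM_LAYERS

all_params = 6_742_618_112  # "all params" as printed by print_trainable_parameters
trainable_pct = 100 * trainable_params / all_params

print(trainable_params, trainable_pct)
```

Adding more target modules or raising `LORA_R` scales the trainable count linearly.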


### Training

In [None]:
training_arguments = transformers.TrainingArguments(
    per_device_train_batch_size=MICRO_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    warmup_steps=100,
    max_steps=TRAIN_STEPS,
    learning_rate=LEARNING_RATE,
    fp16=True,
    logging_steps=10,
    optim="adamw_torch",
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=50,
    save_steps=50,
    output_dir=OUTPUT_DIR,
    save_total_limit=3,
    load_best_model_at_end=True,
    report_to="tensorboard" 
)
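With `warmup_steps=100` and `max_steps=300`, a third of the run is spent warming up. The sketch below assumes the default `lr_scheduler_type` of `"linear"`, which the cell above does not override; `linear_schedule_lr` is a hypothetical helper that mirrors that schedule and is not the library function itself.

```python
def linear_schedule_lr(step, base_lr=3e-4, warmup_steps=100, total_steps=300):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 50, 100, 200, 300):
    print(step, linear_schedule_lr(step))
```

The peak rate of `3e-4` is reached only at step 100, after which it decays linearly to zero at step 300.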

In [None]:
data_collator = transformers.DataCollatorForSeq2Seq(
    tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
)

In [None]:
trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=training_arguments,
    data_collator=data_collator
)
model.config.use_cache = False
old_state_dict = model.state_dict
# Extract only the LoRA adapter weights from the full state dict ...
state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(
        self, old_state_dict()
    )
).__get__(model, type(model))()

# ... and load them back into the PEFT model (a round trip of the adapter weights)
set_peft_model_state_dict(model, state_dict)
trainer.train()
model.save_pretrained(OUTPUT_DIR)

Step,Training Loss,Validation Loss
50,1.5351,1.310325
100,0.7896,0.752363
150,0.6975,0.668014
200,0.6496,0.637423
250,0.6206,0.62323
300,0.6149,0.617599


In [None]:
from huggingface_hub import login
login()  # prompts for the access token; avoid hard-coding tokens in a shared notebook

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid.
Your token has been saved to /root/.cache/huggingface/token
Login successful
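Access tokens are credentials, so they are better kept out of the notebook source. A minimal sketch of reading the token from an environment variable with an interactive fallback follows; `get_hf_token` and the `HF_TOKEN` variable name are assumptions made for illustration.

```python
import os
from getpass import getpass


def get_hf_token(env_var="HF_TOKEN"):
    """Return the token from the environment, or prompt for it without echoing."""
    token = os.environ.get(env_var)
    if not token:
        token = getpass("Hugging Face token: ")
    return token
```

The result can then be passed on as `login(token=get_hf_token())`, so the token never appears in the saved notebook.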


In [None]:
model.push_to_hub("Ka4on/Sono", use_auth_token=True)

Upload 1 LFS files:   0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.bin:   0%|          | 0.00/16.8M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Ka4on/Sono/commit/f07d961c0feb287aa9be0b2af563b5c1236aab1d', commit_message='Upload model', commit_description='', oid='f07d961c0feb287aa9be0b2af563b5c1236aab1d', pr_url=None, pr_revision=None, pr_num=None)

### Evaluation

In [None]:
from peft import PeftModel
from transformers import LlamaTokenizer, LlamaForCausalLM,GenerationConfig

tokenizer = LlamaTokenizer.from_pretrained("medalpaca/medalpaca-7b")  # change the name if using another base model

model = LlamaForCausalLM.from_pretrained(
    "medalpaca/medalpaca-7b",
    load_in_8bit=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "Ka4on/Sono")   # change model weight if necessary

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading (…)/adapter_config.json:   0%|          | 0.00/392 [00:00<?, ?B/s]

Downloading adapter_model.bin:   0%|          | 0.00/16.8M [00:00<?, ?B/s]

I evaluate on only 100 samples from the validation set to save time. To evaluate on the whole validation set, skip the sampling cell below and loop over `val_data` instead of `val_sample`.

In [None]:
import random

# Shuffle the val_data
shuffled_val_data = val_data.shuffle()

# Randomly choose 100 samples
random_indices = random.sample(range(len(shuffled_val_data)), k=100)
val_sample = []

# Iterate over the random indices and append the samples to the val_sample list
for idx in random_indices:
    val_sample.append(shuffled_val_data[idx])
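The cell above shuffles and then samples without a fixed seed, so each run evaluates a different subset. A minimal sketch of a reproducible alternative using only the standard library follows; `sample_indices` is a hypothetical helper, and the seed value is arbitrary.

```python
import random


def sample_indices(n_items, k=100, seed=42):
    """Pick up to k distinct indices from range(n_items), reproducibly."""
    rng = random.Random(seed)
    return rng.sample(range(n_items), k=min(k, n_items))


# 400 is the size of the validation split created earlier.
demo_indices = sample_indices(400)
print(len(demo_indices))
```

The evaluation subset could then be built as `val_sample = [val_data[i] for i in sample_indices(len(val_data))]`.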

In [None]:
import torch

generated_texts = []  # List to store generated texts
reference_summaries = []  # List to store reference summaries
num_iter = 0
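# Iterate over the validation sample
for example in val_sample:    # use val_data here to evaluate on the whole validation set
    
    # Tokenize the full training-style prompt. Note that generate_prompt also
    # appends example["output"], so the reference impression is part of the model input.
    prompt = generate_and_tokenize_prompt(example)
    
    generation_config = GenerationConfig(        # adjust the configuration as needed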
        temperature=0.6,
        top_p=0.7,
        repetition_penalty=1.15,
    )
    input_ids = torch.tensor(prompt["input_ids"]).unsqueeze(0) 
    attention_mask = torch.tensor(prompt["attention_mask"]).unsqueeze(0)   
    generated_text = model.generate(
        input_ids=input_ids.to('cuda'),
        #attention_mask = attention_mask,
        generation_config=generation_config,
        return_dict_in_generate=True,
        max_new_tokens=256,
    )
    num_iter +=1
    print("completed iteration", num_iter)
    for s in generated_text.sequences:
        res = tokenizer.decode(s)

    # Keep only the text after "### Response:" and drop everything from the "</s>" end-of-sequence marker onward
    generated_text = res.split("### Response:")[-1].split("</s>")[0].strip()
    
    # Get the generated text and reference summary
    reference_summary = example["output"]
    
    # Add generated text and reference summary to their respective lists
    generated_texts.append(generated_text)
    reference_summaries.append(reference_summary)



Save the results just in case.

In [None]:
# Save generated texts to a file
with open('/content/generated_texts.txt', 'w') as file:
    for text in generated_texts:
        file.write(text + '\n')

# Save reference summaries to a file
with open('/content/reference_summaries.txt', 'w') as file:
    for summary in reference_summaries:
        file.write(summary + '\n')

Calculate the ROUGE scores.

In [None]:
import locale
# Force UTF-8 as the preferred encoding so that `!` shell commands work in this Colab session
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
! pip install nltk rouge_score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=92f27c6ea435c8d4f4614433c6243de2e46c4138cdadb4b3e76718685d8e11fe
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [None]:
from datasets import load_metric

rouge = load_metric("rouge")
predictions = generated_texts
references = reference_summaries
rouge.compute(predictions=predictions, references=references)

Downloading builder script:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

{'rouge1': AggregateScore(low=Score(precision=0.5039798731389096, recall=0.4348437713075209, fmeasure=0.4300161318902743), mid=Score(precision=0.5720792629284558, recall=0.5132077481833219, fmeasure=0.5054624416931355), high=Score(precision=0.6405205377147721, recall=0.5883448506097336, fmeasure=0.5779940273366488)),
 'rouge2': AggregateScore(low=Score(precision=0.3582798382175233, recall=0.3235908779847449, fmeasure=0.32918125908855456), mid=Score(precision=0.4392509777131242, recall=0.40295513961450347, fmeasure=0.409665208413), high=Score(precision=0.5251640257720721, recall=0.48966996121548095, fmeasure=0.4952198977550493)),
 'rougeL': AggregateScore(low=Score(precision=0.47591451772312626, recall=0.4178247096471858, fmeasure=0.41378492761078917), mid=Score(precision=0.5492121080733334, recall=0.49706670956159316, fmeasure=0.4907448853213938), high=Score(precision=0.6161206980939465, recall=0.5727637153299102, fmeasure=0.5628000500713916)),
 'rougeLsum': AggregateScore(low=Score(pr
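Each `Score` above reports precision, recall, and F-measure over overlapping n-grams between the generated and reference impressions. The sketch below computes a simplified ROUGE-1 for one pair to show what those three numbers mean. It uses plain whitespace tokenization and is not the `rouge_score` implementation, so its values will not match the library exactly.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Simplified ROUGE-1: clipped unigram overlap with whitespace tokenization."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection clips each token's count to the smaller of the two sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


# Made-up pair, not taken from the dataset.
p, r, f = rouge1("no acute abnormality", "no acute intracranial abnormality")
print(p, r, f)
```

Here every predicted word appears in the reference (precision 1.0), while the reference contains one word the prediction misses (recall 0.75).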

### Online Demo

In [None]:
!git clone https://github.com/tloen/alpaca-lora.git
%cd alpaca-lora
!git checkout a48d947

Cloning into 'alpaca-lora'...
remote: Enumerating objects: 607, done.
remote: Counting objects: 100% (51/51), done.
remote: Compressing objects: 100% (32/32), done.
remote: Total 607 (delta 28), reused 33 (delta 19), pack-reused 556
Receiving objects: 100% (607/607), 27.78 MiB | 6.84 MiB/s, done.
Resolving deltas: 100% (360/360), done.
/content/alpaca-lora
Note: switching to 'a48d947'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at a48d947 Put the Chinese LoRAs together


In [None]:
!python generate.py \
    --load_8bit \
    --base_model 'medalpaca/medalpaca-7b' \
    --lora_weights 'Ka4on/Sono' \
    --share_gradio