# **Assignment: Fine-Tune an LLM on a Medical Dataset**


In this assignment I fine-tune GPT-Neo (125M parameters). I also tried a larger model (GPT-Neo-1.3B) but went ahead with the smaller one due to resource constraints.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.2


In [3]:
!pip install --upgrade accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting accelerate
  Downloading accelerate-0.19.0-py3-none-any.whl (219 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.19.0


In [4]:
import pandas as pd
import numpy as np
import re
import os

# Data Preprocessing

I collected the data from the BioASQ collection of medical data. The data is available in JSON format, and I reformatted it as shown below so that it can be used to fine-tune the LLM. The training data consists of alternating question and answer lines:

*   [Q] question1
*   [A] answer1
*   [Q] question2
*   [A] answer2
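
The conversion from the BioASQ JSON to this layout is not shown in the notebook. A minimal sketch of one way to do it follows, assuming the JSON has a top-level `questions` list whose entries carry a `body` string and an `ideal_answer` field (a string or a list of strings). The function name `format_qa_pairs` and the choice of the first ideal answer are assumptions for illustration, not the code used to build `input.txt`.

```python
import json

def format_qa_pairs(bioasq: dict) -> str:
    """Turn a BioASQ-style dict into alternating '[Q] ...' and '[A] ...' lines.

    Assumes each entry in bioasq["questions"] has a "body" (the question)
    and an "ideal_answer" (a string or a list of strings).
    """
    lines = []
    for item in bioasq.get("questions", []):
        # Collapse internal whitespace so each question/answer is one line.
        question = " ".join(item.get("body", "").split())
        answer = item.get("ideal_answer", "")
        if isinstance(answer, list):
            answer = answer[0] if answer else ""
        answer = " ".join(answer.split())
        if not question or not answer:
            continue  # skip incomplete pairs
        lines.append(f"[Q] {question}")
        lines.append(f"[A] {answer}")
    return "\n".join(lines)

# Tiny made-up sample for illustration only; not taken from the dataset.
sample = json.loads("""
{"questions": [
  {"body": "Example question one?", "ideal_answer": ["Example answer one."]},
  {"body": "Example question two?", "ideal_answer": "Example answer two."}
]}
""")
print(format_qa_pairs(sample))
```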

In [5]:
with open('/content/drive/MyDrive/Prezent.ai/input.txt', "r") as file:
    text = file.read()

In [6]:
text_data = re.sub(r'\n+', '\n', text).strip()

In [7]:
with open('/content/drive/MyDrive/Prezent.ai/train_data.txt', "w") as f:
    f.write(text_data)

In [72]:
# This is how the training data looks
print(text_data[0:600])

[Q] Is Hirschsprung disease a mendelian or a multifactorial disorder?
[A] Coding sequence mutations in RET, GDNF, EDNRB, EDN3, and SOX10 are involved in the development of Hirschsprung disease. The majority of these genes was shown to be related to Mendelian syndromic forms of Hirschsprung's disease, whereas the non-Mendelian inheritance of sporadic non-syndromic Hirschsprung disease proved to be complex; involvement of multiple loci was demonstrated in a multiplicative model.
[Q] List signaling molecules (ligands) that interact with the receptor EGFR?
[A] The 7 known EGFR ligands  are: epider


In [9]:
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import GPT2Tokenizer, GPTNeoForCausalLM
from transformers import Trainer, TrainingArguments

In [10]:
def load_dataset(file_path, tokenizer, block_size = 128):
    dataset = TextDataset(
        tokenizer = tokenizer,
        file_path = file_path,
        block_size = block_size,
    )
    return dataset

def load_data_collator(tokenizer, mlm = False):
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, 
        mlm=mlm,
    )
    return data_collator
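
`TextDataset` tokenizes the whole file as one stream and cuts it into consecutive blocks of `block_size` tokens, so a block can start or end in the middle of a question-answer pair. The sketch below mimics that slicing on plain integers. `chunk_token_ids` is a hypothetical helper, and the detail that a trailing partial block is dropped reflects my understanding of the library's behaviour rather than anything shown in this notebook.

```python
def chunk_token_ids(token_ids, block_size=128):
    """Split a flat list of token ids into full blocks of block_size.

    A trailing remainder shorter than block_size is discarded.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

# Stand-in ids; a real run would use tokenizer output instead.
fake_ids = list(range(300))
blocks = chunk_token_ids(fake_ids, block_size=128)
print(len(blocks), [len(b) for b in blocks])
```

With 300 ids and a block size of 128, this yields two full blocks and discards the last 44 ids.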

In [11]:
def train(train_file_path,model_name,
          output_dir,
          overwrite_output_dir,
          per_device_train_batch_size,
          num_train_epochs,
          save_steps):
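  # GPT-Neo uses the GPT-2 byte-level BPE vocabulary, hence GPT2Tokenizer
  tokenizer = GPT2Tokenizer.from_pretrained(model_name)
  train_dataset = load_dataset(train_file_path, tokenizer)
  data_collator = load_data_collator(tokenizer)

  # Save the tokenizer next to the model so inference can load both from output_dir
  tokenizer.save_pretrained(output_dir)

  model = GPTNeoForCausalLM.from_pretrained(model_name)

  # Writes the pretrained weights; trainer.save_model() later overwrites them
  model.save_pretrained(output_dir)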

  training_args = TrainingArguments(
          output_dir=output_dir,
          overwrite_output_dir=overwrite_output_dir,
          per_device_train_batch_size=per_device_train_batch_size,
          num_train_epochs=num_train_epochs,
          save_steps=save_steps,  # checkpoint interval, in optimizer steps
      )

  trainer = Trainer(
          model=model,
          args=training_args,
          data_collator=data_collator,
          train_dataset=train_dataset,
  )
      
  trainer.train()
  trainer.save_model()


In [12]:
train_file_path = "/content/drive/MyDrive/Prezent.ai/train_data.txt"
model_name = 'EleutherAI/gpt-neo-125M' # a larger checkpoint such as 'EleutherAI/gpt-neo-1.3B' could be used here
output_dir = '/content/drive/MyDrive/Prezent.ai/GPT-Neo-125/gpt-neo-custom_q_and_a'
overwrite_output_dir = False
per_device_train_batch_size = 8
num_train_epochs = 45.0
save_steps = 50000
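
These settings determine how many optimizer steps the run takes. A rough sketch of that arithmetic follows, assuming a single device and no gradient accumulation. The example count is a placeholder, since the notebook does not report the dataset size, and `total_training_steps` is a hypothetical helper.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Estimate optimizer steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return math.ceil(steps_per_epoch * num_epochs)

# 1000 is a placeholder example count, not the size of the real dataset.
steps = total_training_steps(num_examples=1000, batch_size=8, num_epochs=45.0)
print(steps)
print(steps < 50000)  # compare against the save_steps value above
```

If the total comes out below `save_steps`, a checkpoint interval of that size would never be reached during the run.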

In [13]:
# Train
train(
    train_file_path=train_file_path,
    model_name=model_name,
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps
)

Step,Training Loss
500,2.5083
1000,2.1301
1500,1.7542
2000,1.4783
2500,1.1481
3000,0.9076
3500,0.6637
4000,0.5003
4500,0.3545
5000,0.2628
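
The logged training loss is the mean cross-entropy per token, so it can also be read as perplexity via exp(loss). The sketch below applies that conversion to the logged values. `loss_to_perplexity` is a hypothetical helper, and because these are training losses, a low value may reflect memorization of the training text rather than better answers on new questions.

```python
import math

# (step, training loss) pairs copied from the log above.
training_log = [
    (500, 2.5083), (1000, 2.1301), (1500, 1.7542), (2000, 1.4783),
    (2500, 1.1481), (3000, 0.9076), (3500, 0.6637), (4000, 0.5003),
    (4500, 0.3545), (5000, 0.2628),
]

def loss_to_perplexity(loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)

for step, loss in training_log:
    print(f"step {step:>5}  loss {loss:.4f}  perplexity {loss_to_perplexity(loss):.2f}")
```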


# Inference

In [15]:
from transformers import GPT2Tokenizer, GPTNeoForCausalLM

In [54]:
def load_model(model_path):
    model = GPTNeoForCausalLM.from_pretrained(model_path)
    return model


def load_tokenizer(tokenizer_path):
    tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
    return tokenizer

def generate_text(model_path, sequence, max_length):
    # Model and tokenizer are reloaded from disk on every call
    model = load_model(model_path)
    tokenizer = load_tokenizer(model_path)
    ids = tokenizer.encode(sequence, return_tensors='pt')
    final_outputs = model.generate(
        ids,
        do_sample=True,  # sampling, so repeated calls can return different text
        max_length=max_length,  # counts prompt tokens plus generated tokens
        pad_token_id=model.config.eos_token_id,
        top_k=50,
        top_p=0.95,
    )
    output = tokenizer.decode(final_outputs[0], skip_special_tokens=True)
    return output

In [55]:
model_path = "/content/drive/MyDrive/Prezent.ai/GPT-Neo-125/gpt-neo-custom_q_and_a"
max_len = 50

def final_answer(model_path, sequence, max_len):
  result = generate_text(model_path, sequence, max_len)
  result = result.split('[Q]')[0].strip()
  try:
    start_index = result.index('[A]')
    end_index = result.index('.', start_index) + 1
    result = result[:end_index]
  except ValueError:
    # No '[A]' marker or no period after it: keep the text as is
    pass
  return result
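
The post-processing in `final_answer` can be checked without loading the model by applying the same steps to a fixed string. In the sketch below, `trim_generated_text` is a hypothetical stand-in that takes already-generated text, and the sample string is made up for illustration.

```python
def trim_generated_text(generated: str) -> str:
    """Keep the prompt plus the first sentence of the first '[A]' answer."""
    # Drop anything from the next generated question onward.
    result = generated.split("[Q]")[0].strip()
    try:
        start_index = result.index("[A]")
        end_index = result.index(".", start_index) + 1
        result = result[:end_index]
    except ValueError:
        # No '[A]' marker or no period after it: return the text unchanged.
        pass
    return result

# Made-up model output, for illustration only.
sample_output = "What is X?\n[A] X is a thing. It also does Y.\n[Q] What is Z?"
print(trim_generated_text(sample_output))
```

Because the cut is made at the first period after `[A]`, an answer containing a decimal number or an abbreviation would be truncated early.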

### Test examples

In [59]:
sequence1 = "what is an RNA-seq tool?"
answer = final_answer(model_path, sequence1, max_len)
print(answer)

what is an RNA-seq tool?
[A] RNA-seq (radionomain) is a stable color 3' end-labeled signal peptide of
an fragments -called DNA-binding when it binds what kind of DNA


In [66]:
sequence2 = "Which signaling pathway does LY294002 inhibit?"
answer = final_answer(model_path, sequence2, max_len)
print(answer)

Which signaling pathway does LY294002 inhibit?
[A] In the absence of YAP, LY29400 D does not inhibit TRAIL-induced apoptosis but rather promotes BMP-1 to promote the trans-GFib


In [67]:
sequence3 = "Which disease is caused by mutations in the gene PRF1?"
answer = final_answer(model_path, sequence3, max_len)
print(answer)

Which disease is caused by mutations in the gene PRF1?
[A] No, PRQ1 and PRL1 are the most frequent mutations in PRF1.


In [68]:
sequence4 = "What is the use of Atogepant?"
answer = final_answer(model_path, sequence4, max_len)
print(answer)

What is the use of Atogepant?
[A] Currently Atogepant is used for the treatment of esophageal cancer.


In [69]:
sequence5 = "Is Cabotegravir effective for HIV prevention?"
answer = final_answer(model_path, sequence5, max_len)
print(answer)

Is Cabotegravir effective for HIV prevention?
[A] Yes.
