In [None]:
# Necessary package installations. This cell restarts the notebook automatically (it kills the kernel, so you will see notifications about the session crashing, but that is fine).
! pip install --upgrade opengpt
! pip install accelerate
! pip install pandas
! pip install datasets
# pickle is part of the Python standard library, so it does not need to be installed with pip
! pip install transformers

import os
os.kill(os.getpid(), 9)

Collecting opengpt
  Downloading opengpt-0.0.5-py3-none-any.whl (16 kB)
Collecting tiktoken>=0.3.2
  Downloading tiktoken-0.5.1-cp39-cp39-win_amd64.whl (760 kB)
Collecting python-box
  Downloading python_box-7.1.1-cp39-cp39-win_amd64.whl (1.2 MB)
Collecting datasets<3,>=2
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting openai
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)
Collecting jsonpickle
  Downloading jsonpickle-3.0.2-py3-none-any.whl (40 kB)
Collecting transformers<5,>=4.2
  Downloading transformers-4.33.2-py3-none-any.whl (7.6 MB)
Collecting xxhash
  Downloading xxhash-3.3.0-cp39-cp39-win_amd64.whl (29 kB)
Collecting dill<0.3.8,>=0.3.0
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting pyarrow>=8.0.0
  Downloading pyarrow-13.0.0-cp39-cp39-win_amd64.whl (24.4 MB)
Collecting multiprocess
  Downloading multiprocess-0.70.15-py39-none-any.whl (133 kB)
Collecting fsspec[http]<2023.9.0,>=2023.1.0
  Downloading fsspec-2023.6.0-py3-none-any.whl (163 k

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, pipeline
import pickle
import pandas as pd
import datasets


from opengpt.config import Config
from opengpt.model_utils import add_tokens_to_model_and_tokenizer
from opengpt.dataset_utils import create_labels, pack_examples
from opengpt.data_collator import DataCollatorWithPadding

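# checks whether the CUDA compiler is available (this does not switch the runtime to GPU; select a GPU runtime in the notebook settings)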
!nvcc --version

'nvcc' is not recognized as an internal or external command,
operable program or batch file.


In [4]:
# loads the config
config = Config(yaml_path='./example_train_config.yaml')

# this config can be used as a template
config.train.to_dict()

{'model': 'olm/olm-gpt2-oct-2022',
 'datasets': ['../data/example_project_data/prepared_generated_data_for_example_project.csv',
  '../data/nhs_uk_full/prepared_generated_data_for_nhs_uk_qa.csv',
  '../data/nhs_uk_full/prepared_generated_data_for_nhs_uk_conversations.csv',
  '../data/medical_tasks_gpt4/prepared_generated_data_for_medical_tasks.csv'],
 'ignore_index': -100,
 'max_seq_len': 512,
 'packing_type': 'partial',
 'shuffle_dataset': True,
 'hf_training_arguments': {'output_dir': '../data/results/',
  'gradient_accumulation_steps': 16,
  'per_device_eval_batch_size': 1,
  'per_device_train_batch_size': 1,
  'load_best_model_at_end': False,
  'learning_rate': 2e-05,
  'weight_decay': 0.1,
  'adam_beta1': 0.9,
  'adam_beta2': 0.95,
  'adam_epsilon': 1e-07,
  'max_grad_norm': 1,
  'num_train_epochs': 1,
  'lr_scheduler_type': 'cosine',
  'warmup_ratio': 0.03,
  'logging_strategy': 'steps',
  'logging_steps': 100,
  'save_strategy': 'steps',
  'save_steps': 30000,
  'seed': 11,
  'o
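
The batch and learning-rate settings in the config above interact, so a small worked calculation may help. The sketch below assumes a single GPU and a hypothetical total of 1000 optimizer steps (the real number depends on the size of the packed dataset, which the notebook does not print). The schedule is the usual linear warmup followed by cosine decay; rounding the warmup step count up is an assumption here.

```python
import math

# Values copied from the config printed above
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
learning_rate = 2e-05
warmup_ratio = 0.03

# Assumptions, not from the notebook: one GPU and 1000 optimizer steps in total
num_devices = 1
total_steps = 1000

# Number of sequences that contribute to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
warmup_steps = math.ceil(total_steps * warmup_ratio)


def lr_at(step: int) -> float:
    """Linear warmup to the peak learning rate, then cosine decay towards zero."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size:", effective_batch_size)
print("warmup steps:", warmup_steps)
for step in (0, 15, 30, 515, 1000):
    print(step, lr_at(step))
```

With these settings each update sees 16 sequences, so raising `per_device_train_batch_size` on a larger GPU allows `gradient_accumulation_steps` to be lowered by the same factor without changing the effective batch size.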

In [5]:
# loads the GPT-2 model and tokenizer
model = AutoModelForCausalLM.from_pretrained(config.train.model)
tokenizer = AutoTokenizer.from_pretrained(config.train.model)
tokenizer.model_max_length = config.train.max_seq_len

Downloading (…)lve/main/config.json:   0%|          | 0.00/907 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Downloading pytorch_model.bin:   0%|          | 0.00/510M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/528 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/805k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/463k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.12M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

In [6]:
# expands tokenizer/model with the necessary special tokens for conversational LLMs
add_tokens_to_model_and_tokenizer(config, tokenizer, model)

You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 50270. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
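
To make the resizing message above more concrete, here is a toy sketch of what adding special tokens involves: new tokens get ids after the existing vocabulary, and the embedding matrix gains one row per new token. This is not the implementation of `add_tokens_to_model_and_tokenizer`; the tiny vocabulary, the `add_tokens` helper and the initialisation are all illustrative assumptions.

```python
import random

# Toy stand-ins: a 5-token vocabulary and a 5 x 4 embedding matrix
vocab = {"hello": 0, "world": 1, "what": 2, "is": 3, "?": 4}
embeddings = [[random.gauss(0.0, 0.02) for _ in range(4)] for _ in vocab]

# Special tokens visible in the training text; the real list comes from the opengpt config
special_tokens = ["<|user|>", "<|ai|>", "<|eos|>", "<|eod|>"]


def add_tokens(vocab, embeddings, new_tokens):
    """Append unseen tokens to the vocabulary and add one embedding row for each."""
    dim = len(embeddings[0])
    for token in new_tokens:
        if token in vocab:
            continue
        vocab[token] = len(vocab)  # new ids continue after the existing ones
        embeddings.append([random.gauss(0.0, 0.02) for _ in range(dim)])
    return vocab, embeddings


add_tokens(vocab, embeddings, special_tokens)
print(vocab["<|user|>"], len(vocab), len(embeddings))
```

The vocabulary size and the number of embedding rows must stay equal, which is why the model has to be resized whenever the tokenizer grows.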


In [8]:
# defines the csv datasets to be used
config.train.datasets = ['./prepared_generated_data_for_nhs_uk_qa.csv',
                         './prepared_generated_data_for_nhs_uk_conversations.csv',
                         './prepared_generated_data_for_medical_tasks.csv']

In [9]:
# loads datasets and shuffles if needed
train_dataset = datasets.Dataset.from_csv(config.train.datasets)
if config.train.shuffle_dataset:
    train_dataset = train_dataset.shuffle()
    print("Shuffling dataset!")

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Shuffling dataset!


In [10]:
# checks the size of the dataset and columns
train_dataset

Dataset({
    features: ['text', 'raw_data_id'],
    num_rows: 31708
})

In [11]:
# text part used to train model with special tokens
train_dataset[0]['text']

"<|user|> Based on the patient's history and symptoms, determine the most likely diagnosis and briefly explain your reasoning.\nA 45-year-old female presents with fatigue, weight gain, dry skin, and constipation. She also reports feeling cold all the time and has recently noticed hair thinning. She has no significant past medical history. <|eos|> <|ai|> The most likely diagnosis for this patient is hypothyroidism. Hypothyroidism is caused by an underactive thyroid gland, resulting in reduced production of thyroid hormones. The symptoms mentioned, such as fatigue, weight gain, dry skin, constipation, sensitivity to cold, and hair thinning, are all consistent with hypothyroidism. Further investigations, such as thyroid function tests, would be needed to confirm the diagnosis and determine the appropriate treatment. <|eos|> <|eod|>"

In [12]:
# removes everything but text
to_remove = list(train_dataset.column_names)
to_remove.remove('text')
train_dataset = train_dataset.remove_columns(to_remove)
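
The `text` column follows a fixed layout: each turn opens with `<|user|>` or `<|ai|>`, closes with `<|eos|>`, and the whole dialogue ends with `<|eod|>`. The sketch below reproduces that layout from a list of turns. The helper names `format_conversation` and `format_prompt` are hypothetical and not part of opengpt.

```python
from typing import List, Tuple

SPEAKER_TOKENS = {"user": "<|user|>", "ai": "<|ai|>"}


def format_conversation(turns: List[Tuple[str, str]]) -> str:
    """Render (speaker, text) turns as a complete training example."""
    parts = [f"{SPEAKER_TOKENS[speaker]} {text} <|eos|>" for speaker, text in turns]
    return " ".join(parts) + " <|eod|>"


def format_prompt(turns: List[Tuple[str, str]]) -> str:
    """Same layout, but left open after <|ai|> so the model writes the next answer."""
    parts = [f"{SPEAKER_TOKENS[speaker]} {text} <|eos|>" for speaker, text in turns]
    return " ".join(parts) + " <|ai|>"


print(format_conversation([("user", "What is diabetes?"), ("ai", "A long-term condition.")]))
print(format_prompt([("user", "What is diabetes?")]))
```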

In [13]:
# the max_seq_len warning printed below can be ignored; over-length sequences are handled by the packer or data_collator
train_dataset = train_dataset.map(
    lambda examples: tokenizer(examples['text'], add_special_tokens=False),
    batched=True,
    num_proc=1,
    remove_columns=["text"])
# creates labels for supervised training (meaning we do not train on questions, but only on answers)
train_dataset = train_dataset.map(
    lambda examples: create_labels(examples, config, tokenizer),
    batched=True,
    batch_size=1000,
    num_proc=1,
)
# we only do packing for the train set
train_dataset = train_dataset.map(
    lambda examples: pack_examples(examples, config.train.max_seq_len, packing_type=config.train.packing_type),
    batched=True,
    batch_size=1000,
    num_proc=1,
)

Map:   0%|          | 0/31708 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (815 > 512). Running this sequence through the model will result in indexing errors


Map:   0%|          | 0/31708 [00:00<?, ? examples/s]

Map:   0%|          | 0/31708 [00:00<?, ? examples/s]
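
Packing combines several short tokenised examples into one sequence so that less of each `max_seq_len` window is wasted on padding. The sketch below shows a simple greedy variant on `input_ids` only. It is an illustration of the idea, not the logic of `pack_examples`, and the behaviour of the `'partial'` packing type may differ.

```python
from typing import List


def pack_greedy(examples: List[List[int]], max_seq_len: int) -> List[List[int]]:
    """Concatenate tokenised examples into sequences of at most max_seq_len tokens."""
    packed: List[List[int]] = []
    current: List[int] = []
    for ids in examples:
        ids = ids[:max_seq_len]  # truncate anything that could never fit
        if current and len(current) + len(ids) > max_seq_len:
            packed.append(current)  # the next example does not fit, start a new sequence
            current = []
        current = current + ids
    if current:
        packed.append(current)
    return packed


examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
print(pack_greedy(examples, max_seq_len=5))
```

In a real pipeline the `labels` and `attention_mask` lists would have to be packed in exactly the same way so that they stay aligned with `input_ids`.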

In [14]:
# checks the new train_dataset (take note of how the labels look). The USER (question) part of the input should have a label of -100,
# and the AI (answer) part should have labels equal to input_ids
example = train_dataset[0]
for input_id, label, mask in list(zip(example['input_ids'], example['labels'], example['attention_mask']))[:100]:
    print(input_id, label, mask)

50265 -100 1
11531 -100 1
325 -100 1
225 -100 1
88 -100 1
76 -100 1
73 -100 1
4323 -100 1
526 -100 1
2084 -100 1
290 -100 1
5946 -100 1
16 -100 1
4773 -100 1
225 -100 1
88 -100 1
76 -100 1
73 -100 1
745 -100 1
2648 -100 1
11194 -100 1
290 -100 1
12556 -100 1
5942 -100 1
398 -100 1
20113 -100 1
18 -100 1
203 -100 1
37 -100 1
6049 -100 1
17 -100 1
2996 -100 1
17 -100 1
765 -100 1
4750 -100 1
8382 -100 1
346 -100 1
16005 -100 1
16 -100 1
3137 -100 1
4715 -100 1
16 -100 1
3652 -100 1
2680 -100 1
16 -100 1
290 -100 1
38221 -100 1
18 -100 1
1236 -100 1
568 -100 1
4593 -100 1
3371 -100 1
4065 -100 1
479 -100 1
225 -100 1
88 -100 1
76 -100 1
73 -100 1
595 -100 1
290 -100 1
504 -100 1
3077 -100 1
6674 -100 1
2733 -100 1
41945 -100 1
18 -100 1
1236 -100 1
504 -100 1
786 -100 1
2645 -100 1
1843 -100 1
3009 -100 1
2084 -100 1
18 -100 1
225 -100 1
50267 -100 1
225 -100 1
50266 50266 1
409 409 1
745 745 1
2648 2648 1
11194 11194 1
326 326 1
442 442 1
4323 4323 1
318 318 1
8406 8406 1
13579 13579 1
5
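
The pattern in the printout above (user tokens labelled -100, then labels equal to `input_ids` from the `<|ai|>` token onwards) can be reproduced with a few lines of plain Python. This is a sketch of the masking idea, not the code of `create_labels`. The token ids are taken from the printout, and treating the closing `<|eos|>` of the AI turn as a training target is an assumption.

```python
from typing import List

IGNORE_INDEX = -100  # same value as 'ignore_index' in the config


def mask_user_turns(input_ids: List[int], user_id: int, ai_id: int) -> List[int]:
    """Copy input_ids into labels, replacing every user-turn token with IGNORE_INDEX."""
    labels = []
    in_ai_turn = False
    for token_id in input_ids:
        if token_id == user_id:
            in_ai_turn = False
        elif token_id == ai_id:
            in_ai_turn = True
        labels.append(token_id if in_ai_turn else IGNORE_INDEX)
    return labels


# Ids as printed above: 50265 opens the user turn, 50266 opens the AI turn, 50267 is <|eos|>
input_ids = [50265, 11531, 325, 50267, 225, 50266, 409, 745, 50267]
print(mask_user_turns(input_ids, user_id=50265, ai_id=50266))
```

Positions labelled -100 are skipped by the cross-entropy loss, so the model is only trained to produce the answers.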

In [15]:
# loads the HF training arguments and creates the data collator. You can try increasing the learning rate or the number of epochs; it could improve the performance of GPT-2.
# Check the HF Hub for base models that may give better results.
training_args = TrainingArguments(**config.train.hf_training_arguments.to_dict())
dc = DataCollatorWithPadding(tokenizer.pad_token_id, config.train.ignore_index, max_seq_len=config.train.max_seq_len)

# uses HF trainer to train our models
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=None,
    data_collator=dc,
)
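
The collator receives `pad_token_id`, `ignore_index` and `max_seq_len`, which suggests how a batch is assembled: sequences are cut to the maximum length and the shorter ones are padded, with padded label positions set to the ignore index so they do not contribute to the loss. The plain-Python sketch below illustrates that idea. It is not opengpt's `DataCollatorWithPadding`, which returns tensors, and right-padding is an assumption.

```python
from typing import Dict, List


def pad_batch(features: List[Dict[str, List[int]]], pad_token_id: int,
              ignore_index: int, max_seq_len: int) -> Dict[str, List[List[int]]]:
    """Truncate to max_seq_len, then right-pad every field to the longest example."""
    features = [{k: v[:max_seq_len] for k, v in f.items()} for f in features]
    longest = max(len(f["input_ids"]) for f in features)
    # Value used to fill each field at padded positions
    fill = {"input_ids": pad_token_id, "labels": ignore_index, "attention_mask": 0}
    batch: Dict[str, List[List[int]]] = {key: [] for key in fill}
    for f in features:
        n_pad = longest - len(f["input_ids"])
        for key, value in fill.items():
            batch[key].append(f[key] + [value] * n_pad)
    return batch


features = [
    {"input_ids": [5, 6, 7], "labels": [-100, 6, 7], "attention_mask": [1, 1, 1]},
    {"input_ids": [8], "labels": [8], "attention_mask": [1]},
]
print(pad_batch(features, pad_token_id=0, ignore_index=-100, max_seq_len=512))
```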

In [19]:
# runs training, ignore AdamW warnings
trainer.train()

Step,Training Loss


KeyboardInterrupt: 

# Testing:

In [17]:
gen = pipeline(model=model, tokenizer=tokenizer, task='text-generation', device=model.device)

In [18]:
t = "<|user|> What is diabetes? <|eos|> <|ai|>" # The special-token format is required because the model was fine-tuned on it
print(gen(t, do_sample=True, max_length=128, temperature=0.2)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50267 for open-end generation.


<|user|> What is diabetes? <|eos|> <|ai|> 
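
The generation calls in this section use `do_sample=True` with `temperature=0.2`. Temperature divides the logits before the softmax, so a value below 1 concentrates probability on the most likely tokens. The sketch below shows the effect on three made-up logits; the numbers are illustrative only.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.2))
```

At a temperature of 0.2 almost all of the probability mass sits on the top token, which is why repeated runs of the cells below tend to give similar, though not identical, answers.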


In [24]:
# Let's try some examples we used to test the NHS-LLM. The result is not great: the reference is correct, but the explanation makes no sense.
t = "<|user|> What is vitamin d3 and should I take it? <|eos|> <|ai|>"
print(gen(t, do_sample=True, max_length=128, temperature=0.2)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50267 for open-end generation.


<|user|> What is vitamin d3 and should I take it? <|eos|> <|ai|> Vitamin d3 is a vitamin that helps to protect your skin from damage caused by sun damage, such as sunburn. It is recommended to take it as a supplement to your daily diet.
References:
- https://www.nhs.uk/conditions/vitamins-and-minerals/vitamin-d/ 


In [21]:
# HTN is the abbreviation for hypertension. Sampling makes the output vary between runs; in the run shown below the model does not expand the abbreviation correctly.
t = "<|user|> What is HTN? <|eos|> <|ai|>"
print(gen(t, do_sample=True, max_length=128, temperature=0.2)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50267 for open-end generation.


<|user|> What is HTN? <|eos|> <|ai|> HTN is a rare genetic disorder that affects the way the brain works. It's caused by a faulty gene that causes the brain to produce a protein called a protein called a protein called a protein kinase.
References:
- https://www.nhs.uk/conditions/neurofibromatosis-type-1/causes/ 


In [23]:
# Let us try a question that has nothing to do with healthcare. The answer is correct, which suggests the model can still handle general questions (but the reference makes no sense).
t = "<|user|> What is the capital of France? <|eos|> <|ai|>"
print(gen(t, do_sample=True, max_length=128, temperature=0.2)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50267 for open-end generation.


<|user|> What is the capital of France? <|eos|> <|ai|> The capital of France is Paris.
References:
- https://www.nhs.uk/conditions/social-care-and-support-guide/care-services-equipment-and-care-homes/ 


In [None]:
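# And a couple more healthcare questions, e.g. an MCQ (the correct answer is a). Note that max_length counts the prompt tokens too, so a long prompt leaves little room for the answer.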
t = "<|user|> Choose the correct statement about laparoscopic liver resection efficacy from the following options: a) there is no difference in the overall patient survival rate or disease-free survival rate between laparoscopic liver resection and open resection, b) laparoscopic liver resection has a higher patient survival rate than open resection, c) laparoscopic liver resection has a lower disease-free survival rate than open resection. <|eos|> <|ai|>"
print(gen(t, do_sample=True, max_length=128, temperature=0.2)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50267 for open-end generation.


<|user|> Choose the correct statement about laparoscopic liver resection efficacy from the following options: a) there is no difference in the overall patient survival rate or disease-free survival rate between laparoscopic liver resection and open resection, b) laparoscopic liver resection has a higher patient survival rate than open resection, c) laparoscopic liver resection has a lower disease-free survival rate than open resection. <|eos|> <|ai|> The correct statement about laparoscopic liver resection efficacy is that there is no difference in the overall patient


In [3]:
t = "<|user|> I had a high fever for the past 3 days, what should i do? <|eos|> <|ai|>"
print(gen(t, do_sample=True, max_length=128, temperature=0.2)[0]['generated_text'])

NameError: name 'gen' is not defined

In [28]:
t = "<|user|> My lower abdomen is hurting very badly on my right side. What should I do? <|eos|> <|ai|>"
print(gen(t, do_sample=True, max_length=128, temperature=0.2)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50267 for open-end generation.


<|user|> My lower abdomen is hurting very badly on my right side. What should I do? <|eos|> <|ai|> I suggest you see a GP as soon as possible. They may suggest a tummy tuck or surgery to remove the tummy.
References:
- https://www.nhs.uk/conditions/pregnancy-and-baby/ 


In [None]:
# Note that we can also continue the conversation (done manually here, but this could easily be turned into a simple chatbot)
t = """<|user|> I had a high fever for the past 3 days, what should i do? <|eos|> <|ai|> I'm sorry to hear that. I understand your concern. Have you been feeling any other symptoms?
References:
- https://www.nhs.uk/conditions/coronavirus-covid-19/coronavirus-vaccination/how-to-get-your-first-dose-for-coronavirus/ <|eos|> <|user|> No, only fever. <|eos|> <|ai|>"""
print(gen(t, do_sample=True, max_length=256, temperature=0.2)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50267 for open-end generation.


<|user|> I had a high fever for the past 3 days, what should i do? <|eos|> <|ai|> I'm sorry to hear that. I understand your concern. Have you been feeling any other symptoms?
References:
- https://www.nhs.uk/conditions/coronavirus-covid-19/coronavirus-vaccination/how-to-get-your-first-dose-for-coronavirus/ <|eos|> <|user|> No, only fever. <|eos|> <|ai|> I understand your concern. Have you been feeling any other symptoms?
References:
- https://www.nhs.uk/conditions/coronavirus-covid-19/coronavirus-vaccination/how-to-get-your-first-dose-for-coronavirus/ 
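
The manual continuation above can be wrapped in a small helper that keeps the history, rebuilds the prompt and strips the echoed prompt from the generated text. The sketch below is one possible shape for that. `build_prompt`, `chat_turn` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in so the example runs without a model.

```python
from typing import Callable, List, Tuple


def build_prompt(history: List[Tuple[str, str]], user_message: str) -> str:
    """Render earlier (user, ai) exchanges plus the new user message, ending with <|ai|>."""
    parts = [f"<|user|> {u} <|eos|> <|ai|> {a} <|eos|>" for u, a in history]
    parts.append(f"<|user|> {user_message} <|eos|> <|ai|>")
    return " ".join(parts)


def chat_turn(history: List[Tuple[str, str]], user_message: str,
              generate: Callable[[str], str]) -> str:
    """Run one turn: build the prompt, generate, and keep only the new answer."""
    prompt = build_prompt(history, user_message)
    full_text = generate(prompt)  # expected to echo the prompt, as in the outputs above
    answer = full_text[len(prompt):].split("<|eos|>")[0].strip()
    history.append((user_message, answer))
    return answer


# Stand-in generator (assumption). With the notebook's pipeline this could be:
# lambda p: gen(p, do_sample=True, max_length=256, temperature=0.2)[0]['generated_text']
def fake_generate(prompt: str) -> str:
    return prompt + " Please rest and drink fluids. <|eos|> <|eod|>"


history: List[Tuple[str, str]] = []
print(chat_turn(history, "I had a high fever for the past 3 days, what should i do?", fake_generate))
print(chat_turn(history, "No, only fever.", fake_generate))
```

Because `max_length` includes the prompt, a growing history eventually leaves no room for a reply, so a real chatbot would need to drop or shorten old turns.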


In [None]:
# Finally, to show that our training works, we also query the base model without any fine-tuning
model_not_trained = AutoModelForCausalLM.from_pretrained(config.train.model)
tokenizer_not_trained = AutoTokenizer.from_pretrained(config.train.model)
gen_not_trained = pipeline(model=model_not_trained, tokenizer=tokenizer_not_trained, task='text-generation', device=model.device)

In [None]:
t = "What is HTN?" # No special tokens, as this model is not fine-tuned.
print(gen_not_trained(t, do_sample=True, max_length=128, temperature=0.2)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What is HTN?
HTN stands for High-Tensile Nerve Network. It is a network of nerve cells that are linked to the brain. It is a network of nerve cells that are connected to the brain. It is important to note that HTN is not a single nerve cell. It is a network of nerve cells that are linked to the brain.
What is the difference between HTN and the other nerve cells?
The difference between HTN and the other nerve cells is that HTN is


In [None]:
t = "What is vitamin d3 and should I take it?" # No special tokens, as this model is not fine-tuned.
print(gen_not_trained(t, do_sample=True, max_length=128, temperature=0.2)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What is vitamin d3 and should I take it?
Vitamin D3 is a vitamin that is found in the skin. It is a vitamin that is essential for the health of the skin. It is also essential for the growth of the skin. It is important to take vitamin D3 if you are looking for a healthy glow.
What is vitamin d2 and what is vitamin d3?
Vitamin D2 is a vitamin that is found in the skin. It is a vitamin that is essential for th


In [None]:
t = "What is the capital of France?" # Let's try a general question
print(gen_not_trained(t, do_sample=True, max_length=128, temperature=0.2)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What is the capital of France?
The capital of France is Paris. It is the capital of France and the second largest city in the world. It is also the capital of the French overseas department of Guadeloupe.
What is the capital of the United Kingdom?
The capital of the United Kingdom is London. It is the capital of the United Kingdom and the second largest city in the world. It is the capital
