**Code adapted from this Medium post: https://medium.com/geekculture/fine-tune-eleutherai-gpt-neo-to-generate-netflix-movie-descriptions-in-only-47-lines-of-code-40c9b4c32475**

# Installing and loading required libraries

In [None]:
!pip install transformers --quiet
!pip install wandb --quiet


In [None]:
import wandb
wandb.login()

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [None]:
import pandas as pd
import torch
from torch.utils.data import Dataset, random_split
from transformers import GPT2Tokenizer, TrainingArguments, Trainer, GPT2LMHeadModel, AutoTokenizer, AutoModelForCausalLM

In [None]:
torch.manual_seed(42)

<torch._C.Generator at 0x7fe7aac0b210>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

# Fine-tuning GPT2 medium

## Loading GPT2-Medium Model from 🤗 Model Hub 

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium', bos_token='<|startoftext|>',
                                          eos_token='<|endoftext|>', pad_token='<|pad|>')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
model = GPT2LMHeadModel.from_pretrained('gpt2-medium').to(device)
model.resize_token_embeddings(len(tokenizer))

NameError: ignored

In [None]:
dir(model)

['T_destination',
 '__annotations__',
 '__call__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_apply',
 '_auto_class',
 '_backward_compatibility_gradient_checkpointing',
 '_backward_hooks',
 '_buffers',
 '_call_impl',
 '_can_retrieve_inputs_from_name',
 '_convert_head_mask_to_5d',
 '_create_or_get_repo',
 '_expand_inputs_for_generation',
 '_forward_hooks',
 '_forward_pre_hooks',
 '_from_config',
 '_get_backward_hooks',
 '_get_decoder_start_token_id',
 '_get_logits_processor',
 '_get_logits_warper',
 '_get_name',
 '_get_repo_url_from_name',
 '_get_resized_embeddings',
 '_get_resized_lm_head',
 '_get_stopping_criteria',
 '_hook_rss_

## Load the dataset for fine-tuning

In [None]:
path = '/content/drive/MyDrive/Portfolio Project- DSR29/DocProductData/'
flag = 1
if flag==1:
  qa = pd.read_csv(path+'WebMD_QAs.csv')
elif flag==2:
  qa = pd.read_csv(path+"healthtap_QAs.csv")
  qa = qa.groupby('question',as_index = False).agg({'answer': ' '.join})

qa=qa.dropna()

In [None]:
#max_length = max([len(tokenizer.encode(qa.loc[i,'question']+qa.loc[i,'answer'])) for i in range(len(qa))])
# since above does not work, try for fewer samples to get an idea
# max_length = max([len(tokenizer.encode(qa.loc[i,'question']+qa.loc[i,'answer'])) for i in range(400)])
# max_length: 2883 for healthtap_QAs
max_length = 256 # because of memory limitations
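
Since tokenizing the whole dataset to find the true maximum was not feasible here, one option is to look at a high percentile of token lengths on a random sample instead. The following is a minimal sketch, not the method used in this notebook: `length_percentile` and `whitespace_encode` are hypothetical names, and the whitespace splitter only stands in for `tokenizer.encode` so that the snippet runs without downloading a model.

```python
import math
import random

def length_percentile(texts, encode, q=0.95, sample_size=400, seed=42):
    """Return the q-th percentile (nearest-rank) of encoded lengths on a random sample."""
    texts = list(texts)
    rng = random.Random(seed)
    sample = texts if len(texts) <= sample_size else rng.sample(texts, sample_size)
    lengths = sorted(len(encode(t)) for t in sample)
    # Nearest-rank percentile: smallest length with at least a fraction q of the sample at or below it
    rank = max(1, math.ceil(q * len(lengths)))
    return lengths[rank - 1]

def whitespace_encode(text):
    # Stand-in for tokenizer.encode (assumption); real BPE token counts will be higher
    return text.split()

toy_texts = ["a " * n for n in range(1, 101)]  # toy texts with 1..100 "tokens"
p95 = length_percentile(toy_texts, whitespace_encode, q=0.95)
print(p95)
```

With the real tokenizer, one would pass `tokenizer.encode` and the concatenated question and answer strings, then compare the result against what fits in GPU memory.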

In [None]:
class MedicalDataset(Dataset):
    """Tokenizes question/answer pairs into fixed-length GPT-2 training examples."""

    def __init__(self, qa, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        for question, answer in zip(qa.loc[:,'question'], qa.loc[:,'answer']):
            # Wrap each pair in the BOS/EOS tokens registered on the tokenizer
            prep_txt = f'<|startoftext|>Question: {question}\nAnswer: {answer}<|endoftext|>'
            # Truncate or pad every example to exactly max_length tokens
            encodings_dict = tokenizer(prep_txt, truncation=True,
                                       max_length=max_length, padding="max_length")

            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        # Labels are derived from input_ids in the Trainer's data_collator
        return self.input_ids[idx], self.attn_masks[idx]
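
Because every example is padded to `max_length` and the collator later reuses `input_ids` as labels, the loss is also computed on padding positions. A common variation, shown here as a sketch rather than as what this notebook does, is to set the labels at padded positions to -100, the value that the loss in `transformers` ignores. `mask_pad_labels` is a hypothetical helper and works on plain lists for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss in transformers

def mask_pad_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

# Toy example: three real tokens followed by two pad tokens (the ids are made up)
input_ids = [101, 7, 8, 999, 999]
attention_mask = [1, 1, 1, 0, 0]
labels = mask_pad_labels(input_ids, attention_mask)
print(labels)  # [101, 7, 8, -100, -100]
```

With tensors, the same idea would be `labels = input_ids.clone()` followed by `labels[attention_mask == 0] = -100`.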

In [None]:
dataset = MedicalDataset(qa, tokenizer, max_length=max_length)
train_size = int(0.9 * len(dataset))
train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])
del dataset
del qa

In [None]:
import gc
gc.collect()

50

In [None]:
torch.cuda.empty_cache()

## Training the model

```
TrainingArguments(output_dir=mistral-hello-world/runs/gpt2-small-d=wikitext-n=1-g=1-w=1+2021-06-25-23:57:32, 
                  overwrite_output_dir=False,do_train=True, do_eval=None,do_predict=False,evaluation_strategy=IntervalStrategy.STEPS, prediction_loss_only=True,per_device_train_batch_size=4,per_device_eval_batch_size=16,gradient_accumulation_steps=128,eval_accumulation_steps=None,learning_rate=0.0006,weight_decay=0.1,adam_beta1=0.9,adam_beta2=0.95,adam_epsilon=1e-08,max_grad_norm=1.0,num_train_epochs=3.0,max_steps=400000,lr_scheduler_type=SchedulerType.LINEAR,warmup_ratio=0.0,warmup_steps=4000,logging_dir=logs,logging_strategy=IntervalStrategy.STEPS,logging_first_step=True,logging_steps=50,save_strategy=IntervalStrategy.STEPS,save_steps=1000,save_total_limit=None,no_cuda=False,seed=21,fp16=True,fp16_opt_level=O1,fp16_backend=auto,fp16_full_eval=False,local_rank=-1,tpu_num_cores=None,tpu_metrics_debug=False,debug=False,dataloader_drop_last=False,eval_steps=1000,dataloader_num_workers=4,past_index=-1,run_name=gpt2-small-d=wikitext-n=1-g=1-w=1+2021-06-25-23:57:32,disable_tqdm=False,remove_unused_columns=True,label_names=None,load_best_model_at_end=False,metric_for_best_model=None,greater_is_better=None,ignore_data_skip=False,sharded_ddp=[],deepspeed=None,label_smoothing_factor=0.0,adafactor=False,group_by_length=False,length_column_name=length,report_to=[],ddp_find_unused_parameters=None,dataloader_pin_memory=True,skip_memory_metrics=False, _n_gpu=1,mp_parameters=)
```



In [None]:
training_args = TrainingArguments(output_dir='/content/drive/MyDrive/Portfolio Project- DSR29/results_try2_flag'+str(flag)+'_3epochs',
                                  num_train_epochs=4,  # 4 epochs; the "_3epochs" suffix in the paths is a leftover name
                                  logging_steps=100,
                                  save_steps=200,
                                  eval_steps=200,
                                  evaluation_strategy='steps',
                                  prediction_loss_only=False,  # boolean; the string 'False' would be truthy
                                  per_device_eval_batch_size=4,
                                  per_device_train_batch_size=4,
                                  gradient_accumulation_steps=16,
                                  #gradient_checkpointing=True,
                                  
                                  learning_rate=0.0006,
                                  weight_decay=0.15,
                                  adam_beta1=0.9,
                                  adam_beta2=0.95,
                                  adam_epsilon=1e-08,
                                  max_grad_norm=1.0,

                                  warmup_steps=200,
                                  #weight_decay=0.1,
                                  #lr_scheduler_type= 'cosine',
                                  #learning_rate = 5e-4,

                                  fp16=True,

                                  logging_dir='/content/drive/MyDrive/Portfolio Project- DSR29/logs_try2_flag'+str(flag)+'_3epochs',
                                  report_to="wandb",  # enable logging to W&B
                                  run_name="GPT2-"+str(flag)  # name of the W&B run (optional)
)
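
With `warmup_steps=200` and the Trainer's default linear scheduler, the learning rate ramps up from 0 to `learning_rate` and then decays linearly to 0 at the last optimization step. Below is a minimal sketch of that shape, assuming the 2600 total steps reported in the training log; `linear_schedule_lr` is a hypothetical helper, not a `transformers` function.

```python
def linear_schedule_lr(step, peak_lr=0.0006, warmup_steps=200, total_steps=2600):
    """Learning rate at a given optimizer step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from peak_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 100, 200, 1400, 2600):
    print(step, linear_schedule_lr(step))
```

The peak is reached at step 200, and the rate is back to half the peak at step 1400, halfway through the decay phase.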


In [None]:
Trainer(model=model,
        args=training_args,
        train_dataset=train_dataset, 
        eval_dataset=val_dataset,
        data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                  'attention_mask': torch.stack([f[1] for f in data]),
                                  'labels': torch.stack([f[0] for f in data])},
        ).train(
            #resume_from_checkpoint=True
            )

Using amp half precision backend
***** Running training *****
  Num examples = 41630
  Num Epochs = 4
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 16
  Total optimization steps = 2600
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: maryamf (use `wandb login --relogin` to force relogin)



***** Running Evaluation *****
  Num examples = 4626
  Batch size = 4
Saving model checkpoint to /content/drive/MyDrive/Portfolio Project- DSR29/results_try2_flag1_3epochs/checkpoint-200
Configuration saved in /content/drive/MyDrive/Portfolio Project- DSR29/results_try2_flag1_3epochs/checkpoint-200/config.json
Model weights saved in /content/drive/MyDrive/Portfolio Project- DSR29/results_try2_flag1_3epochs/checkpoint-200/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 4626
  Batch size = 4
Saving model checkpoint to /content/drive/MyDrive/Portfolio Project- DSR29/results_try2_flag1_3epochs/checkpoint-400
Configuration saved in /content/drive/MyDrive/Portfolio Project- DSR29/results_try2_flag1_3epochs/checkpoint-400/config.json
Model weights saved in /content/drive/MyDrive/Portfolio Project- DSR29/results_try2_flag1_3epochs/checkpoint-400/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 4626
  Batch size = 4
Saving model checkpoint to /content/drive/My

| Step | Training Loss | Validation Loss |
|---|---|---|
| 200 | 0.8812 | 0.837988 |
| 400 | 0.8406 | 0.806226 |
| 600 | 0.8065 | 0.772102 |
| 800 | 0.691 | 0.754336 |
| 1000 | 0.6732 | 0.735782 |
| 1200 | 0.6544 | 0.712402 |
| 1400 | 0.5082 | 0.725477 |
| 1600 | 0.5059 | 0.712879 |
| 1800 | 0.4995 | 0.689195 |
| 2000 | 0.4331 | 0.723712 |


***** Running Evaluation *****
  Num examples = 4626
  Batch size = 4
Saving model checkpoint to /content/drive/MyDrive/Portfolio Project- DSR29/results_try2_flag1_3epochs/checkpoint-1000
Configuration saved in /content/drive/MyDrive/Portfolio Project- DSR29/results_try2_flag1_3epochs/checkpoint-1000/config.json
Model weights saved in /content/drive/MyDrive/Portfolio Project- DSR29/results_try2_flag1_3epochs/checkpoint-1000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 4626
  Batch size = 4
Saving model checkpoint to /content/drive/MyDrive/Portfolio Project- DSR29/results_try2_flag1_3epochs/checkpoint-1200
Configuration saved in /content/drive/MyDrive/Portfolio Project- DSR29/results_try2_flag1_3epochs/checkpoint-1200/config.json
Model weights saved in /content/drive/MyDrive/Portfolio Project- DSR29/results_try2_flag1_3epochs/checkpoint-1200/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 4626
  Batch size = 4
Saving model checkpoint to /content/dr

TrainOutput(global_step=2600, training_loss=0.7627855829092173, metrics={'train_runtime': 21987.8921, 'train_samples_per_second': 7.573, 'train_steps_per_second': 0.118, 'total_flos': 7.730968931598336e+16, 'train_loss': 0.7627855829092173, 'epoch': 4.0})
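
The step count in the log follows from the dataset size and the batch settings. As a worked check, here is a sketch of the arithmetic (batches per epoch rounded up, optimizer steps per epoch rounded down); the function name is hypothetical and the rounding rules are an assumption that happens to reproduce the logged numbers.

```python
import math

def total_optimization_steps(num_examples, per_device_batch, grad_accum, epochs, n_devices=1):
    """Optimizer steps for a run: ceil(batches per epoch) // accumulation steps, times epochs."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * n_devices))
    updates_per_epoch = max(1, batches_per_epoch // grad_accum)
    return updates_per_epoch * epochs

steps = total_optimization_steps(num_examples=41630, per_device_batch=4, grad_accum=16, epochs=4)
print(steps)   # 2600, matching "Total optimization steps" in the log
print(4 * 16)  # 64, the total train batch size with accumulation

# The 90/10 split: 41630 training and 4626 validation examples
total_examples = 41630 + 4626
print(int(0.9 * total_examples))  # 41630
```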

In [None]:
wandb.finish()




Run history:
eval/loss,█▇▅▄▃▂▃▂▁▃▂▂▂
eval/runtime,▇▁▄▇▄█▃▇██▇█▆
eval/samples_per_second,▂█▅▂▅▁▆▂▁▁▂▁▃
eval/steps_per_second,▃█▆▂▄▂▆▂▁▁▂▁▃
train/epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/global_step,▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/learning_rate,▄███▇▇▇▆▆▆▅▅▅▅▄▄▄▃▃▃▂▂▂▂▁▁
train/loss,█▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,0.70373
eval/runtime,185.1283
eval/samples_per_second,24.988
eval/steps_per_second,6.25
train/epoch,4.0
train/global_step,2600.0
train/learning_rate,0.0
train/loss,0.362
train/total_flos,7.730968931598336e+16
train/train_loss,0.76279


In [None]:
torch.save(model, '/content/drive/MyDrive/Portfolio Project- DSR29/models/model_GPT2_medium_finetune_flag'+str(flag)+'3epochs'+'.pt')

# Question Answering with GPT2




## Load the test data (90 Q/A)

In [None]:
qa_90 = pd.read_csv('/content/drive/MyDrive/Portfolio Project- DSR29/90_qas.csv')

## Load the models


### Load pretrained models and generate answers using them

In [None]:
def generate_ans(question_list, model, tokenizer, finetuned, min_length, max_length, num_beams, no_repeat_ngram_size):
  '''
  Answer each question in question_list with the given model and generation parameters.
  min_length and max_length count the prompt tokens plus the generated tokens.
  '''
  answer = []
  for question in question_list:
    if finetuned:
      # Same prompt format as the fine-tuning examples
      prep_txt = f'<|startoftext|>Question: {question}\nAnswer: '
    else:
      prep_txt = question
    generated = tokenizer(prep_txt, return_tensors="pt").input_ids.to(model.device)
    # Beam search with both length bounds forwarded to generate()
    sample_outputs = model.generate(generated, do_sample=False, num_beams=num_beams,
                                    min_length=min_length, max_length=max_length,
                                    no_repeat_ngram_size=no_repeat_ngram_size)
    if finetuned:
      answer += [tokenizer.decode(sample_outputs[0], skip_special_tokens=True).split('Answer: ')[1]]
    else:
      # Decode only the continuation (this slice also skips the first generated token)
      answer += [tokenizer.decode(sample_outputs[0][len(generated[0])+1:], skip_special_tokens=True)]
  return answer
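
Note that `split('Answer: ')[1]` raises an `IndexError` when the decoded text does not contain the marker, and it drops everything after a second occurrence of `Answer: ` inside the generated answer. A slightly more defensive variant is sketched below; `extract_answer` is a hypothetical helper, not part of the original notebook.

```python
def extract_answer(decoded, marker="Answer: "):
    """Return the text after the first marker, or an empty string if the marker is absent."""
    _, found, tail = decoded.partition(marker)
    return tail.strip() if found else ""

print(extract_answer("Question: What is a cold?\nAnswer: A viral infection."))
print(repr(extract_answer("Question: What is a cold?")))
```

Using `partition` keeps any later `Answer: ` occurrences inside the returned text instead of cutting the answer short.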

In [None]:
# testing the pretrained GPT2 performance without fine-tuning
modelTypeList = ['gpt2-xl','gpt2-medium']
modelType = modelTypeList[0]
model = GPT2LMHeadModel.from_pretrained(modelType).cuda()
tokenizer = GPT2Tokenizer.from_pretrained(modelType)
# Another way of loading
# model_xl = AutoModelForCausalLM.from_pretrained("gpt2-xl").cuda()
# tokenizer_xl = AutoTokenizer.from_pretrained("gpt2-xl")

In [None]:
# Generate answers
max_gen_length = 256
min_gen_length = 60
num_beams = 5
no_repeat_ngram_size = 3
col_name = f'{modelType}_nbeam{num_beams}_ngram{no_repeat_ngram_size}_len{min_gen_length}_{max_gen_length}'
answer = generate_ans(question_list=qa_90.loc[:,'question'], model=model, tokenizer=tokenizer, finetuned=False,
                      min_length=min_gen_length, max_length=max_gen_length, num_beams=num_beams, no_repeat_ngram_size=no_repeat_ngram_size)
qa_90.loc[:,col_name] = answer


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

### Load the fine-tuned model

In [None]:
# Loading from the checkpoint with the lowest validation loss (step 1800)
#flag = 1
#model_directory = '/content/drive/MyDrive/Portfolio Project- DSR29/results_try2_flag'+str(flag)+'_3epochs/checkpoint-1800'
#model_finetuned = GPT2LMHeadModel.from_pretrained(model_directory, return_dict=False).cuda()
#torch.save(model_finetuned, '/content/drive/MyDrive/Portfolio Project- DSR29/models/model_GPT2_medium_finetuned_webMD_1800steps'+'.pt')
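
Checkpoint 1800 is the evaluation step with the lowest validation loss among those shown in the training output. The selection can be sketched as below, using the values copied from that output (it is truncated after step 2000, so later evaluations are not covered). The conversion `exp(loss)` is included for interpretation only; since padded positions were part of the labels in this run, it is not a clean per-token perplexity.

```python
import math

# (step, validation loss) pairs copied from the training output
val_loss = {200: 0.837988, 400: 0.806226, 600: 0.772102, 800: 0.754336,
            1000: 0.735782, 1200: 0.712402, 1400: 0.725477, 1600: 0.712879,
            1800: 0.689195, 2000: 0.723712}

# Pick the step whose validation loss is smallest
best_step = min(val_loss, key=val_loss.get)
print(best_step)  # 1800

# exp of the mean cross-entropy loss at that step
print(round(math.exp(val_loss[best_step]), 3))
```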

In [None]:
model_finetuned = torch.load('/content/drive/MyDrive/Portfolio Project- DSR29/models/model_GPT2_medium_finetuned_webMD_1800steps'+'.pt')

In [None]:
# Load the GPT2 tokenizer
tokenizer_finetuned = GPT2Tokenizer.from_pretrained('gpt2-medium', bos_token='<|startoftext|>',
                                          eos_token='<|endoftext|>', pad_token='<|pad|>')
#model_finetuned.resize_token_embeddings(len(tokenizer_finetuned))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
# Generate answers
max_gen_length = 256
min_gen_length = 60
num_beams = 5
no_repeat_ngram_size = 3
col_name = f'GPT2med1800steps_webMD_nbeam{num_beams}_ngram{no_repeat_ngram_size}_len{min_gen_length}_{max_gen_length}'
answer = generate_ans(question_list=qa_90.loc[:,'question'], model=model_finetuned, tokenizer=tokenizer_finetuned, finetuned=True,
                      min_length=min_gen_length, max_length=max_gen_length, num_beams=num_beams, no_repeat_ngram_size=no_repeat_ngram_size)
qa_90.loc[:,col_name] = answer

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [None]:
qa_90

| Unnamed: 0 | question | answer | gpt2-medium_nbeam5_ngram3_len60_256 | GPT2med1800steps_webMD_nbeam5_ngram3_len60_256 |
|---|---|---|---|---|
| 0 | I suffer from severe back pain. In which case ... | Sorry to hear what you are going through. In c... | \nNo, you do not need to go to a doctor right ... | Severe back pain that doesn’t get better with... |
| 1 | What are hay fever symptoms? | Thanks for asking. People who have allergic rh... | \nHay fever symptoms include:\n\nFever\n\nHead... | Hay fever symptoms include:Fever (usually hig... |
| 2 | Is it fine to exercise with knee pain? | I am glad to help you out. You should rest and... | \nIf you have knee pain, you may want to consu... | It’s fine to work out if you’re knee-bound an... |
| 3 | When will my new tattoo be completely healed? | That's a good question. In general, it can be ... | \nYour new tattoo will be fully healed by the ... | When your new tattoo is completely healed, yo... |
| 4 | Who is affected by irritable bowel syndrome? | I will gladly answer that. Irritable bowel syn... | \nSymptoms include:\n\nDiarrhea\n\nConstipatio... | Anyone can get IBS with constipation (IBS-C),... |
| ... | ... | ... | ... | ... |
| 85 | Need advice on what is the the target heart ra... | I am happy to give you advice. Your maximal he... | \nIf you want to lose weight, you need to burn... | The goal of exercise is to burn more calories... |
| 86 | When is low blood pressure dangerous? | Blood pressure varies from person to person an... | \nLow blood pressure (BP) is one of the most c... | Low blood pressure is dangerous because it ca... |
| 87 | What might you see in someone with narcissisti... | The question is not so easy to answer. Sound l... |  | Someone with this disorder might: Think they ... |
| 88 | Do high blood pressure medications produce sid... | Any medication can cause side effects, and hig... | \nYes. High blood pressure medication can incr... | High blood pressure meds can sometimes cause ... |


In [None]:
qa_90.to_csv('/content/drive/MyDrive/Portfolio Project- DSR29/90_qas_GPT2.csv')

Fine-tuning on WebMD made the model's answers shorter. Setting `max_length` to a higher value does not change this behaviour.

A side note: the pretrained GPT-2 model used here is the XL variant, while the fine-tuned model is the medium variant, because fine-tuning the larger models on Colab was not possible.
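
One way to quantify the difference in answer length is to compare the mean word count of the two generated columns. The sketch below uses made-up strings (not taken from the results); `mean_word_count` is a hypothetical helper that could be applied to the generated columns of `qa_90`.

```python
def mean_word_count(texts):
    """Average number of whitespace-separated words, ignoring missing or empty answers."""
    counts = [len(t.split()) for t in texts if isinstance(t, str) and t.strip()]
    return sum(counts) / len(counts) if counts else 0.0

# Made-up example answers, for illustration only
pretrained = ["one two three four five six", "one two three four"]
finetuned = ["one two three", "one two", None]
print(mean_word_count(pretrained), mean_word_count(finetuned))  # 5.0 2.5
```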