<a href="https://colab.research.google.com/github/QaynatKhan/machine_learning/blob/main/M4_NB_MiniProject_1_Medical_Q%26A_GPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generative AI and Prompt Engineering
## A programme by IISc and TalentSprint
### Mini-Project: Medical Q&A using GPT2

## Learning Objectives

At the end of the experiment, you will be able to:

* perform data preprocessing, EDA and feature extraction on the Medical Q&A dataset
* load a pre-trained tokenizer
* finetune a GPT-2 language model for medical question-answering

## Dataset Description

The dataset used in this project is the *Medical Question Answering Dataset* ([MedQuAD](https://github.com/abachaa/MedQuAD/tree/master)). It includes medical question-answer pairs along with additional information such as the question type, the question *focus*, and its UMLS (Unified Medical Language System) details: the Concept Unique Identifier (*CUI*), the Semantic *Type* and the Semantic *Group*.

To learn more about how this data was collected and constructed, refer to this [paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4).

The data has been extracted into CSV format with the following features:

- **Focus**: the question focus
- **CUI**: concept unique identifier
- **SemanticType**
- **SemanticGroup**
- **Question**
- **Answer**

## Grading = 10 Points

## Information

Healthcare professionals often have to refer to medical literature and documents when seeking answers to medical queries. Medical databases and search engines are powerful sources of up-to-date medical knowledge. However, the existing documentation is large, which makes it difficult for professionals to retrieve answers quickly in a clinical setting. The problem with search engines and information retrieval systems is that they return a list of documents rather than answers. Instead, healthcare professionals can use question answering systems to retrieve short sentences or paragraphs in response to medical queries. The biggest advantage of such systems is that they generate answers and provide hints within a few seconds.

### Problem Statement

Fine-tune the GPT-2 model on the Medical Question Answering dataset to generate responses to medical queries.

### **GPT-2**

When it was released, OpenAI's GPT-2 exhibited an impressive ability to write coherent and passionate essays that exceeded what language models of the time could produce. GPT-2 was not a particularly novel architecture: it is very similar to the **decoder-only transformer**. It was, however, a very large transformer-based language model trained on a massive dataset.

Here, you will fine-tune the GPT-2 model on the medical data. After fine-tuning, the model is expected to respond to prompts that contain medical queries.

To know more about GPT-2, refer [here](http://jalammar.github.io/illustrated-gpt2/).
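
As a rough illustration of what "decoder-only" means in practice, the sketch below builds the causal (lower-triangular) attention mask that lets each position attend only to itself and to earlier positions. The function name `causal_mask` is an assumption made for this example; it is not part of the `transformers` API, and GPT-2's actual implementation differs in detail.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len mask; mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    # Print 1 for allowed attention and 0 for blocked (future) positions
    print(" ".join("1" if allowed else "0" for allowed in row))
```

Because no position can see tokens to its right, the model can be trained to predict each next token from the tokens before it, which is the objective used for fine-tuning later in this notebook.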

### Installing Dependencies

In [1]:
%%capture
!pip -q uninstall pyarrow -y
!pip -q install pyarrow==15.0.2
!pip -q install datasets
!pip -q install accelerate
!pip -q install transformers

### <font color="#990000">Restart Session/Runtime</font>

### Import required packages

In [2]:
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import torch
from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

import warnings
warnings.filterwarnings('ignore')

In [3]:
#@title Download the dataset
!wget -q https://cdn.iisc.talentsprint.com/AIandMLOps/MiniProjects/Datasets/MedQuAD.csv
!ls | grep ".csv"

MedQuAD.csv
MedQuAD.csv.1
MedQuAD.csv.2


**Exercise 1: Read the MedQuAD.csv dataset**

**Hint:** `pd.read_csv()`

In [4]:
# Read the dataset (already downloaded in the previous cell;
# downloading it again only creates copies such as MedQuAD.csv.1)
data = pd.read_csv("MedQuAD.csv")

### Pre-processing and EDA

**Exercise 2: Perform below operations on the dataset [1 Mark]**

- Handle missing values
- Remove duplicates from data considering `Question` and `Answer` columns

- **Handle missing values**

In [5]:
# Handle missing values
data = data.dropna()

- **Remove duplicates from data considering `Question` and `Answer` columns**

In [6]:
# Remove duplicates based on 'Question' and 'Answer'
data = data.drop_duplicates(subset=['Question', 'Answer'])
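
To see what the two cleaning steps above do, here is a minimal sketch on a made-up DataFrame. The rows in `toy` are hypothetical and only mimic the `Question` and `Answer` columns of MedQuAD.

```python
import pandas as pd

# Hypothetical toy data with the same Question/Answer columns as MedQuAD
toy = pd.DataFrame({
    "Question": ["What is gout?", "What is gout?", "What is COPD?", None],
    "Answer": ["A form of arthritis.", "A form of arthritis.", "A lung disease.", "Orphan answer."],
})

# Drop rows with any missing value, then drop repeated (Question, Answer) pairs
cleaned = toy.dropna().drop_duplicates(subset=["Question", "Answer"])

print(len(toy), "rows before,", len(cleaned), "rows after")
```

The row with a missing question is removed by `dropna()`, and the repeated gout pair is collapsed to a single row by `drop_duplicates()`, which keeps the first occurrence by default.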

**Exercise 3: Display the category name, and the number of records belonging to top 100 categories of `Focus` column [1 Mark]**

In [7]:
# Total categories in Focus column
# Count the total number of distinct categories in the 'Focus' column
total_categories = data['Focus'].nunique()

# Display the total count of distinct categories
print("Total distinct categories in 'Focus' column:", total_categories)

Total distinct categories in 'Focus' column: 4770


In [8]:
# Displaying the distinct categories of Focus column and the number of records belonging to each category
# (Top 100 only)

# Count the occurrences of each category in the 'Focus' column
category_counts = data['Focus'].value_counts()

# Get the top 100 distinct categories
top_100_categories = category_counts.head(100)

# Create a DataFrame for better visualization
top_100_df = top_100_categories.reset_index()
top_100_df.columns = ['Category', 'Number of Records']

# Display the top 100 categories and their counts
print(top_100_df)

                                             Category  Number of Records
0                                       Breast Cancer                 53
1                                     Prostate Cancer                 43
2                                              Stroke                 35
3                                         Skin Cancer                 34
4                                 Alzheimer's Disease                 30
..                                                ...                ...
95                                        Sarcoidosis                 11
96                                  Polycythemia Vera                 11
97                                     Celiac Disease                 11
98                                      Down syndrome                 10
99  Microscopic Colitis: Collagenous Colitis and L...                 10

[100 rows x 2 columns]


In [9]:
# Top 100 Focus categories names

# Get the top 100 distinct category names
top_100_categories = category_counts.head(100).index.tolist()

# Display the top 100 category names
print("Top 100 Focus Categories:")
for category in top_100_categories:
    print(category)

Top 100 Focus Categories:
Breast Cancer
Prostate Cancer
Stroke
Skin Cancer
Alzheimer's Disease
Colorectal Cancer
Lung Cancer
High Blood Cholesterol
Heart Failure
Heart Attack
High Blood Pressure
Parkinson's Disease
Leukemia
Shingles
Osteoporosis
Age-related Macular Degeneration
Diabetes
Hemochromatosis
Diabetic Retinopathy
Gum (Periodontal) Disease
Psoriasis
Kidney Disease
Balance Problems
Dry Mouth
COPD
Cataract
Glaucoma
Prescription and Illicit Drug Abuse
Medicare and Continuing Care
Gout
Wilson Disease
Osteoarthritis
Narcolepsy
Problems with Taste
Endometrial Cancer
Neuroblastoma
Short Bowel Syndrome
Rheumatoid Arthritis
Dry Eye
Peripheral Arterial Disease (P.A.D.)
Anxiety Disorders
Surviving Cancer
Pituitary Tumors
Kidney Dysplasia
Problems with Smell
Urinary Tract Infections in Children
Prostate Enlargement: Benign Prostatic Hyperplasia
Depression
Knee Replacement
National Hormone and Pituitary Program (NHPP): Information for People Treated with Pituitary Human Growth Hormone (Com

### Create Training and Validation set

**Exercise 4: Create training and validation set [1 Mark]**

- For each of the top 100 `Focus` categories, take 4 samples from the dataset (400 training samples in total)

- For each of the top 100 `Focus` categories, take 1 sample that is not in the training set (100 validation samples in total)

In [10]:
# YOUR CODE HERE
# Get the top 100 distinct categories
top_100_categories = category_counts.head(100).index.tolist()

# Initialize lists to hold training and validation samples
train_samples = []
val_samples = []

# Iterate over each category in the top 100
for category in top_100_categories:
    # Filter the data for the current category
    category_data = data[data['Focus'] == category]

    # Sample 4 records for the training set (without replacement)
    train_sample = category_data.sample(n=4, random_state=42, replace=False)
    train_samples.append(train_sample)

    # Sample 1 record for the validation set (different from training set)
    val_sample = category_data.drop(train_sample.index).sample(n=1, random_state=42, replace=False)
    val_samples.append(val_sample)

# Combine all training and validation samples into DataFrames
train_set = pd.concat(train_samples, ignore_index=True)
val_set = pd.concat(val_samples, ignore_index=True)

# Display the shapes of the training and validation sets
print("Training Set Shape:", train_set.shape)
print("Validation Set Shape:", val_set.shape)



Training Set Shape: (400, 6)
Validation Set Shape: (100, 6)
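
One possible alternative to the loop above, shown here only as a sketch: draw 5 rows per category in a single call and slice them into 4 training rows and 1 validation row, so the two sets cannot overlap by construction. The helper `split_per_category` and the toy data are assumptions for illustration, not part of the exercise solution.

```python
import pandas as pd

# Hypothetical toy data: 3 categories with 6 records each
toy = pd.DataFrame({
    "Focus": [c for c in ["A", "B", "C"] for _ in range(6)],
    "Question": [f"q{i}" for i in range(18)],
})


def split_per_category(df, n_train=4, n_val=1, seed=42):
    """Draw n_train + n_val rows per Focus category and split them into disjoint sets."""
    train_parts, val_parts = [], []
    for _, group in df.groupby("Focus"):
        # Sampling without replacement guarantees distinct rows within the draw
        picked = group.sample(n=n_train + n_val, random_state=seed)
        train_parts.append(picked.iloc[:n_train])
        val_parts.append(picked.iloc[n_train:])
    return pd.concat(train_parts), pd.concat(val_parts)


toy_train, toy_val = split_per_category(toy)
print(toy_train.shape, toy_val.shape)
```

Both approaches require every category to have at least 5 records; the smallest of the top 100 categories listed above has 10, so that holds here.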


### Pre-process `Question` and `Answer` text

**Exercise 5: Perform below tasks:  [1 Mark]**

- Combine `Question` and `Answer` for train and validation data as shown below:
    - sequence = *'\<question\>' + question-text + '\<answer\>' + answer-text + '\<end\>'*

- Join the combined text using '\n' into a single string for training and validation separately

- Save the training and validation strings as separate text files
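
The sequence format described above can be sketched with a small helper. `build_sequence` and the example pairs are hypothetical; the solution cell below uses the same markers but adds extra spaces around them.

```python
def build_sequence(question, answer):
    """Wrap one Q&A pair in the <question>/<answer>/<end> markers."""
    return f"<question>{question}<answer>{answer}<end>"


# Hypothetical Q&A pairs, used only to show the format
pairs = [
    ("What is gout?", "A form of arthritis."),
    ("What is COPD?", "A lung disease."),
]

# One sequence per line, joined into a single string
text = "\n".join(build_sequence(q, a) for q, a in pairs)
print(text)
```

Keeping one sequence per line matters later, when the text files are loaded line by line: an answer that itself contained a newline would be split into several samples.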

- **Combine Question and Answer for train and val data**

In [29]:
# Combine Questions and Answers for train and val data
## sequence = '<question> ' + question + ' <answer> ' + answer + ' <end>'

# Combine Question and Answer for training set
train_set['combined'] = '   <question>   ' + train_set['Question'] + '   <answer>   ' + train_set['Answer'] + '<end>'

# Combine Question and Answer for validation set
val_set['combined'] = '   <question>   ' + val_set['Question'] + '   <answer>   ' + val_set['Answer'] + '<end>'


print("Question and answer text combined for the training and validation sets.")

Question and answer text combined for the training and validation sets.


In [30]:
# Print one example from the training set
print("\nExample from Training Set:")
print(train_set['combined'].iloc[10])  # Display the combined example at position 10

# Print one example from the validation set
print("\nExample from Validation Set:")
print(val_set['combined'].iloc[20])  # Display the combined example at position 20



Example from Training Set:
   <question>   What are the treatments for Stroke ?   <answer>   Surgery Surgery can be used to prevent stroke, to treat stroke, or to repair damage to the blood vessels or malformations in and around the brain. - Carotid endarterectomy is a surgical procedure in which a surgeon removes fatty deposits, or plaque, from the inside of one of the carotid arteries. The procedure is performed to prevent stroke. The carotid arteries are located in the neck and are the main suppliers of blood to the brain. Carotid endarterectomy is a surgical procedure in which a surgeon removes fatty deposits, or plaque, from the inside of one of the carotid arteries. The procedure is performed to prevent stroke. The carotid arteries are located in the neck and are the main suppliers of blood to the brain. Vascular Interventions In addition to surgery, a variety of techniques have been developed to allow certain vascular problems to be treated from inside the artery using speciali

- **Join the combined text using '\n' into a single string for training and validation separately**

In [31]:
# Train and Validation text for all Q&As
# Join the combined text for the training set
train_text = '\n'.join(train_set['combined'].tolist())

# Join the combined text for the validation set
val_text = '\n'.join(val_set['combined'].tolist())

# Print lengths of the combined strings for verification
print("Length of training text:", len(train_text))
print("Length of validation text:", len(val_text))


Length of training text: 570777
Length of validation text: 201820


- **Save the training and validation strings as text files**

In [32]:
# Save the training and validation data as text files

# Save the training text to a file
with open("train_data.txt", "w") as train_file:
    train_file.write(train_text)

# Save the validation text to a file
with open("val_data.txt", "w") as val_file:
    val_file.write(val_text)

print("Training and validation data saved to 'train_data.txt' and 'val_data.txt'.")

# Print a few samples from the training data
print("\nSample Training Data:")
train_samples = train_text.split('\n')[:5]  # Get the first 5 samples
for i, sample in enumerate(train_samples):
    print(f"Sample {i + 1}: {sample}")

# Print a few samples from the validation data
print("\nSample Validation Data:")
val_samples = val_text.split('\n')[:5]  # Get the first 5 samples
for i, sample in enumerate(val_samples):
    print(f"Sample {i + 1}: {sample}")


Training and validation data saved to 'train_data.txt' and 'val_data.txt'.

Sample Training Data:
Sample 1:    <question>   Who is at risk for Breast Cancer? ?   <answer>   Key Points - Avoiding risk factors and increasing protective factors may help prevent cancer. - The following are risk factors for breast cancer: - Older age - A personal history of breast cancer or benign (noncancer) breast disease - Inherited risk of breast cancer - Dense breasts - Exposure of breast tissue to estrogen made in the body - Taking hormone therapy for symptoms of menopause - Radiation therapy to the breast or chest - Obesity - Drinking alcohol - The following are protective factors for breast cancer: - Less exposure of breast tissue to estrogen made by the body - Taking estrogen-only hormone therapy after hysterectomy, selective estrogen receptor modulators, or aromatase inhibitors and inactivators - Estrogen-only hormone therapy after hysterectomy - Selective estrogen receptor modulators - Aromatase 

**Exercise 6: Load pre-trained GPT2Tokenizer**

- Use checkpoint = "gpt2"

**Hint:** `GPT2Tokenizer.from_pretrained(...)`

In [33]:
# Set up the tokenizer
# Import the necessary library
from transformers import GPT2Tokenizer

# Load the pre-trained GPT2 tokenizer
checkpoint = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)

# Print a confirmation message
print("GPT2Tokenizer loaded successfully.")


GPT2Tokenizer loaded successfully.


**Exercise 7: Tokenize train and validation data [1 Mark]**

- Use the loaded pre-trained tokenizer
- Use training and validation data saved in text files

**Hint:**

`from datasets import load_dataset`

`dataset = load_dataset("text", data_files={...})`

In [53]:
# Load the saved text files as a dataset and tokenize them line by line
# (each line is one '<question> ... <answer> ... <end>' sequence)

# GPT-2 has no padding token, so reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token

# Load the training and validation datasets from the text files
dataset = load_dataset("text", data_files={"train": "train_data.txt", "validation": "val_data.txt"})

# Print a few raw samples from the training data
print("\nSample Training Data:")
for i, sample in enumerate(dataset["train"][:5]["text"]):
    print(f"Sample {i + 1}: {sample}")

# Print a few raw samples from the validation data
print("\nSample Validation Data:")
for i, sample in enumerate(dataset["validation"][:5]["text"]):
    print(f"Sample {i + 1}: {sample}")

# Tokenization function: truncate or pad every sequence to 512 tokens
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

# Tokenize both splits; `tokenized_datasets` is passed to the Trainer later
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Decode a few tokenized samples back to text as a sanity check
print("\nSample Tokenized Training Data:")
for i in range(5):
    token_ids = tokenized_datasets["train"][i]["input_ids"]
    print(f"Token Sample {i + 1}: {tokenizer.decode(token_ids, skip_special_tokens=True)}")

print("\nSample Tokenized Validation Data:")
for i in range(5):
    token_ids = tokenized_datasets["validation"][i]["input_ids"]
    print(f"Token Sample {i + 1}: {tokenizer.decode(token_ids, skip_special_tokens=True)}")




Sample Training Data:
Sample 1:    <question>   Who is at risk for Breast Cancer? ?   <answer>   Key Points - Avoiding risk factors and increasing protective factors may help prevent cancer. - The following are risk factors for breast cancer: - Older age - A personal history of breast cancer or benign (noncancer) breast disease - Inherited risk of breast cancer - Dense breasts - Exposure of breast tissue to estrogen made in the body - Taking hormone therapy for symptoms of menopause - Radiation therapy to the breast or chest - Obesity - Drinking alcohol - The following are protective factors for breast cancer: - Less exposure of breast tissue to estrogen made by the body - Taking estrogen-only hormone therapy after hysterectomy, selective estrogen receptor modulators, or aromatase inhibitors and inactivators - Estrogen-only hormone therapy after hysterectomy - Selective estrogen receptor modulators - Aromatase inhibitors and inactivators - Risk-reducing mastectomy - Ovarian ablation -

In [44]:
# import torch
# from transformers import GPT2Tokenizer

# # Load the pre-trained tokenizer
# model_name = "gpt2"
# tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# # Set padding token to be the same as the EOS token
# tokenizer.pad_token = tokenizer.eos_token  # Assign EOS token as padding token

# # Read the training data from the existing file
# with open("train_data.txt", "r") as train_file:
#     train_text = train_file.read()

# # Read the validation data from the existing file
# with open("val_data.txt", "r") as val_file:
#     val_text = val_file.read()

# # Split the training and validation text into lines (samples)
# train_samples = train_text.split('\n')
# val_samples = val_text.split('\n')

# # Tokenize the training and validation text in batches
# train_tokens = tokenizer(train_samples,
#                          return_tensors='pt',
#                          truncation=True,
#                          padding=True,
#                          max_length=512)  # Set a max length if needed

# val_tokens = tokenizer(val_samples,
#                        return_tensors='pt',
#                        truncation=True,
#                        padding=True,
#                        max_length=512)  # Set a max length if needed

# # Print a few samples of the tokenized training data
# print("\nSample Tokenized Training Data:")
# train_token_samples = train_tokens['input_ids'][:5]  # Get the first 5 tokenized samples
# for i, token_sample in enumerate(train_token_samples):
#     decoded_sample = tokenizer.decode(token_sample, skip_special_tokens=True)  # Skip special tokens for clearer output
#     # Print in the specified format
#     print(f"Token Sample {i + 1}:    <question> {decoded_sample} <answer> <end>")

# # Print a few samples of the tokenized validation data
# print("\nSample Tokenized Validation Data:")
# val_token_samples = val_tokens['input_ids'][:5]  # Get the first 5 tokenized samples
# for i, token_sample in enumerate(val_token_samples):
#     decoded_sample = tokenizer.decode(token_sample, skip_special_tokens=True)  # Skip special tokens for clearer output
#     # Print in the specified format
#     print(f"Token Sample {i + 1}:    <question> {decoded_sample} <answer> <end>")



Sample Tokenized Training Data:
Token Sample 1:    <question>    <question>   Who is at risk for Breast Cancer??   <answer>   Key Points - Avoiding risk factors and increasing protective factors may help prevent cancer. - The following are risk factors for breast cancer: - Older age - A personal history of breast cancer or benign (noncancer) breast disease - Inherited risk of breast cancer - Dense breasts - Exposure of breast tissue to estrogen made in the body - Taking hormone therapy for symptoms of menopause - Radiation therapy to the breast or chest - Obesity - Drinking alcohol - The following are protective factors for breast cancer: - Less exposure of breast tissue to estrogen made by the body - Taking estrogen-only hormone therapy after hysterectomy, selective estrogen receptor modulators, or aromatase inhibitors and inactivators - Estrogen-only hormone therapy after hysterectomy - Selective estrogen receptor modulators - Aromatase inhibitors and inactivators - Risk-reducing ma

**Exercise 8: Create a DataCollator object**

**Hint:** `DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="pt")`

Data collators are objects that:

- will form a batch by using a list of dataset elements as input
- may apply some processing (like padding)

One of the data collators, `DataCollatorForLanguageModeling`, can also apply some random data augmentation (like random masking) on the formed batch.

<br>

`DataCollatorForLanguageModeling` is a data collator used for language modeling. Inputs are dynamically padded to the maximum length of a batch if they are not all of the same length.

Parameters:

- ***tokenizer:*** The tokenizer used for encoding the data.
- ***mlm*** (bool, optional, default=True): Whether or not to use masked language modeling.
    - If set to False, the labels are the same as the inputs with the padding tokens ignored (by setting them to -100).
    - Otherwise, the labels are -100 for non-masked tokens and the value to predict for the masked token.
- ***return_tensors*** (str): The type of Tensor to return. Allowable values are “np”, “pt” and “tf” for numpy array, pytorch tensor, and tensorflow tensor respectively.

To know more about `DataCollatorForLanguageModeling` parameters, refer [here](https://huggingface.co/docs/transformers/v4.32.0/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling).
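
The behaviour described above for `mlm=False` can be sketched in plain Python: pad every sequence to the longest one in the batch and copy the inputs into the labels, with padding positions set to -100. The function `collate_causal_lm` is a simplified stand-in written for this example, not the library's implementation.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def collate_causal_lm(batch, pad_token_id):
    """Pad each token-id list to the longest one and build labels that ignore padding."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        labels.append(ids + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Hypothetical token ids; 0 stands in for the pad token id
out = collate_causal_lm([[11, 12, 13], [21]], pad_token_id=0)
print(out["labels"])
```

One difference worth keeping in mind: as far as I know, the real collator decides which labels to ignore by comparing token ids with the pad token id, so when the pad token is set to the EOS token, genuine EOS tokens in the text are likely ignored in the loss as well.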

In [54]:
# Create a Data collator object
from transformers import DataCollatorForLanguageModeling

# Create a DataCollator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,   # Pass the tokenizer
    mlm=False,              # Set to False since we're not using masked language modeling
    return_tensors="pt"    # Return PyTorch tensors
)

# Print a message confirming the creation of the DataCollator
print("DataCollator for language modeling created successfully.")


DataCollator for language modeling created successfully.


**Exercise 9: Load pre-trained GPT2LMHeadModel**

**Hint:** `GPT2LMHeadModel.from_pretrained(...)`

In [55]:
# Set up the model
# Import the GPT2LMHeadModel
from transformers import GPT2LMHeadModel

# Load the pre-trained GPT-2 model
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Print the model summary (optional)
print(model)



GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)


**Exercise 10: Fine-tune GPT2 Model [2 Marks]**

- Specify training arguments and create a TrainingArguments object (Use 30 epochs)

- Train a GPT-2 model using the provided training arguments

- Save the resulting trained model and tokenizer to a specified output directory

In [56]:
# Import necessary classes
from transformers import TrainingArguments

# Set the output path for the trained model
model_output_path = "/content/gpt_model"

# Set up the training arguments
training_args = TrainingArguments(
    output_dir=model_output_path,          # Directory to save the model and tokenizer
    overwrite_output_dir=True,              # Overwrite the output directory if it exists
    num_train_epochs=30,                    # Number of training epochs
    per_device_train_batch_size=4,          # Batch size for training
    per_device_eval_batch_size=4,           # Batch size for evaluation
    evaluation_strategy="epoch",             # Evaluate at the end of each epoch
    save_strategy="epoch",                   # Save the model at the end of each epoch
    logging_dir='./logs',                    # Directory for storing logs
    logging_steps=10,                        # Log every 10 steps
    weight_decay=0.001,                      # Strength of weight decay
    warmup_steps=10,                         # Number of warmup steps for learning rate scheduler
)

# Print the training arguments for verification
print(training_args)


TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=epoch,
eval_use_gather_object=False,
evaluation_strategy=epoch,
fp

In [57]:
import torch
from transformers import Trainer, TrainingArguments, GPT2LMHeadModel, GPT2Tokenizer

# Check if GPU is available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using device: {device}")

# Load pre-trained model and tokenizer
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name).to(device)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Set up the training arguments
model_output_path = "./gpt_model"  # Change to desired output path
training_args = TrainingArguments(
    output_dir=model_output_path,
    overwrite_output_dir=True,
    num_train_epochs=30,               # Number of training epochs
    per_device_train_batch_size=4,     # Adjust batch size
    per_device_eval_batch_size=4,      # Adjust validation batch size
    evaluation_strategy="epoch",        # Evaluate at the end of each epoch
    save_strategy="epoch",              # Save the model at the end of each epoch
    logging_dir='./logs',               # Directory for storing logs
    logging_steps=10,                   # Log every 10 steps
    weight_decay=0.01,                  # Weight decay to reduce overfitting
    warmup_steps=500,                   # Warmup steps for learning rate scheduler
    learning_rate=1e-4,                 # Initial learning rate (alternative: 5e-5)
    fp16=True,                          # Enable mixed precision training
    load_best_model_at_end=True,       # Load the best model when finished training
    metric_for_best_model='loss',      # Use 'loss' for best model metric
    greater_is_better=False              # Lower loss is better
)

# Create a Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    data_collator=data_collator
)

# Start training the model
trainer.train()

# Save the fine-tuned model and tokenizer
model.save_pretrained(model_output_path)
tokenizer.save_pretrained(model_output_path)

# Confirm that training finished and that the model and tokenizer were saved
print("Training and validation completed. Model and tokenizer saved.")


Using device: cuda


Epoch,Training Loss,Validation Loss
1,2.4952,2.433013
2,2.2767,2.336939
3,2.1224,2.284695
4,2.0639,2.234551
5,1.7394,2.21651
6,1.6291,2.213446
7,1.5227,2.253986
8,1.2083,2.332358
9,1.0076,2.452768
10,0.8581,2.586952


KeyboardInterrupt: 
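
In the log above, the training loss keeps falling while the validation loss starts rising again, which suggests overfitting. A minimal sketch for locating the best epoch from those logged values follows; the perplexity line assumes the reported loss is the mean cross-entropy in nats.

```python
import math

# Validation loss per epoch, copied from the log above
val_loss = [2.433013, 2.336939, 2.284695, 2.234551, 2.21651,
            2.213446, 2.253986, 2.332358, 2.452768, 2.586952]

# Epoch (1-based) with the lowest validation loss
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Perplexity is exp(cross-entropy loss), assuming the loss is in nats
best_perplexity = math.exp(val_loss[best_epoch - 1])

print(f"Best epoch: {best_epoch}, validation perplexity: {best_perplexity:.2f}")
```

With `load_best_model_at_end=True` and `metric_for_best_model='loss'`, a run that completes would be expected to restore the checkpoint from the lowest-loss epoch, which is one reason to lower the learning rate or stop earlier rather than train through the rising part of the curve.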

In [58]:
import torch
from transformers import Trainer, TrainingArguments, GPT2LMHeadModel, GPT2Tokenizer

# Check if GPU is available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using device: {device}")

# Load pre-trained model and tokenizer
model_name = "gpt2"  # 'gpt2' is the smallest checkpoint; 'gpt2-medium' and 'gpt2-large' are larger variants
model = GPT2LMHeadModel.from_pretrained(model_name).to(device)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Set up the training arguments
model_output_path = "./gpt_model"  # Change to desired output path
training_args = TrainingArguments(
    output_dir=model_output_path,
    overwrite_output_dir=True,
    num_train_epochs=30,               # Number of training epochs
    per_device_train_batch_size=4,     # Adjust batch size
    per_device_eval_batch_size=4,      # Adjust validation batch size
    evaluation_strategy="epoch",        # Evaluate at the end of each epoch
    save_strategy="epoch",              # Save the model at the end of each epoch
    logging_dir='./logs',               # Directory for storing logs
    logging_steps=10,                   # Log every 10 steps
    weight_decay=0.02,                  # Increased weight decay
    warmup_steps=500,                   # Warmup steps for learning rate scheduler
    learning_rate=2e-5,                 # Lower initial learning rate
    fp16=True,                          # Enable mixed precision training
    load_best_model_at_end=True,       # Load the best model when finished training
    metric_for_best_model='loss',      # Use 'loss' for best model metric
    greater_is_better=False,            # Lower loss is better
    eval_steps=100,                     # Only used with evaluation_strategy="steps"; ignored for "epoch"
    save_total_limit=3,                 # Limit the number of saved checkpoints
)

# Create a Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    data_collator=data_collator,
)

# Start training the model
trainer.train()



Using device: cuda


Epoch,Training Loss,Validation Loss
1,2.6997,2.561966
2,2.4615,2.450746
3,2.3989,2.388801
4,2.3818,2.352131
5,2.103,2.311439
6,2.1059,2.287381
7,2.1749,2.273735
8,1.9472,2.258188
9,1.8811,2.248554
10,1.7863,2.239247


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


TrainOutput(global_step=3000, training_loss=1.741147003173828, metrics={'train_runtime': 1208.7083, 'train_samples_per_second': 9.928, 'train_steps_per_second': 2.482, 'total_flos': 3135504384000000.0, 'train_loss': 1.741147003173828, 'epoch': 30.0})
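
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below derives the steps per epoch from `global_step` and `epoch`, estimates the training-set size under the assumption of a single GPU and no gradient accumulation (so one optimizer step consumes `per_device_train_batch_size=4` examples), and converts a cross-entropy loss into perplexity via exp(loss). The loss passed to `perplexity` is the epoch-10 validation loss from the table above.

```python
import math

global_step = 3000          # from TrainOutput
num_epochs = 30             # num_train_epochs
batch_size = 4              # per_device_train_batch_size (assumes 1 GPU, no accumulation)
train_runtime = 1208.7083   # seconds, from TrainOutput
samples_per_second = 9.928  # from TrainOutput

# Optimizer steps taken in each epoch.
steps_per_epoch = global_step // num_epochs

# Approximate number of training examples (exact only if the last batch is full).
approx_train_examples = steps_per_epoch * batch_size

# Total examples processed over the whole run, recovered from throughput.
examples_seen = round(train_runtime * samples_per_second)


def perplexity(cross_entropy_loss):
    """Perplexity of a language model whose mean token loss is in nats."""
    return math.exp(cross_entropy_loss)


print(steps_per_epoch)
print(approx_train_examples)
print(examples_seen)
print(round(perplexity(2.239247), 2))
```

The two estimates agree: about 400 examples per epoch over 30 epochs gives the roughly 12,000 examples implied by the throughput figure, and a validation loss of 2.24 corresponds to a perplexity of roughly 9.4.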

In [None]:
import matplotlib.pyplot as plt

# Plot both losses against the global step so the curves share one x-axis
plt.figure(figsize=(12, 6))

# Training loss is logged every `logging_steps` steps
train_logs = [log for log in trainer.state.log_history if 'loss' in log]
plt.plot([log['step'] for log in train_logs],
         [log['loss'] for log in train_logs],
         label='Training Loss')

# Validation loss is logged once per epoch
eval_logs = [log for log in trainer.state.log_history if 'eval_loss' in log]
plt.plot([log['step'] for log in eval_logs],
         [log['eval_loss'] for log in eval_logs],
         label='Validation Loss', marker='o')

plt.title('Training and Validation Loss')
plt.xlabel('Training step')
plt.ylabel('Loss')
plt.legend()
plt.grid()
plt.show()
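
`trainer.state.log_history` is a list of dictionaries: training entries carry a `loss` key, evaluation entries carry an `eval_loss` key, and each entry also records the `step` at which it was logged. Because training loss is logged every `logging_steps` steps while validation loss is logged once per epoch, the two series have different lengths. A small sketch with a hand-written stand-in for the log history (the values are illustrative, not taken from the run) shows how to pair each loss with its step. The helper name `loss_series` is hypothetical.

```python
# Hypothetical stand-in for trainer.state.log_history (values are made up).
log_history = [
    {"loss": 2.90, "step": 10, "epoch": 0.1},
    {"loss": 2.75, "step": 20, "epoch": 0.2},
    {"eval_loss": 2.56, "step": 100, "epoch": 1.0},
    {"loss": 2.40, "step": 110, "epoch": 1.1},
    {"eval_loss": 2.45, "step": 200, "epoch": 2.0},
]


def loss_series(history, key):
    """Return (steps, values) for every log entry that contains `key`."""
    entries = [entry for entry in history if key in entry]
    steps = [entry["step"] for entry in entries]
    values = [entry[key] for entry in entries]
    return steps, values


train_steps, train_losses = loss_series(log_history, "loss")
eval_steps, eval_losses = loss_series(log_history, "eval_loss")
print(train_steps, train_losses)
print(eval_steps, eval_losses)
```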

In [None]:
# Save the fine-tuned model
model.save_pretrained(model_output_path)

# Save the tokenizer
tokenizer.save_pretrained(model_output_path)

print(f"Model and tokenizer saved to {model_output_path}")


**Exercise 11: Test Model with user input prompts [2 Marks]**

- Create a `generate_response()` function that takes a trained *model*, a *tokenizer*, and a *prompt* string as input and generates a response using the GPT-2 model

- Test it with some user input prompts

In [None]:
def generate_response(model, tokenizer, prompt, max_length=200):
    # Encode the prompt as PyTorch tensors on the same device as the model.
    # A single prompt needs no padding, so no pad token is required here.
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(model.device)

    # Generate a response; max_length counts prompt tokens plus generated tokens
    outputs = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],  # Use the attention mask
        max_length=max_length,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token, so reuse EOS
    )

    # Decode the generated sequence (it begins with the prompt) and return it
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response
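
For a decoder-only model such as GPT-2, `generate()` returns the prompt tokens followed by the newly generated tokens, so the decoded string above begins with the prompt itself. If only the answer is wanted, the continuation can be isolated by slicing off the first `len(prompt_ids)` positions before decoding. The sketch below illustrates the slicing with plain Python lists standing in for tensors; the token ids and the helper name `continuation_ids` are made up for illustration.

```python
def continuation_ids(output_ids, prompt_ids):
    """Return only the tokens generated after the prompt.

    Assumes output_ids starts with prompt_ids, as is the case for
    decoder-only generation of a single, unpadded prompt.
    """
    if output_ids[:len(prompt_ids)] != prompt_ids:
        raise ValueError("output does not start with the prompt")
    return output_ids[len(prompt_ids):]


prompt_ids = [101, 102, 103]            # illustrative ids for the prompt
output_ids = [101, 102, 103, 7, 8, 9]   # prompt followed by three new tokens
print(continuation_ids(output_ids, prompt_ids))  # → [7, 8, 9]
```

With real tensors the equivalent slice would be `outputs[0][inputs['input_ids'].shape[1]:]`, which can then be passed to `tokenizer.decode`.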


In [None]:
# Import the model and tokenizer classes
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Path where the model and tokenizer were saved
# (on Colab, "./gpt_model" resolves to this location)
model_output_path = "/content/gpt_model"

# Load the fine-tuned model and tokenizer before using them below
fine_tuned_model = GPT2LMHeadModel.from_pretrained(model_output_path)
fine_tuned_tokenizer = GPT2Tokenizer.from_pretrained(model_output_path)

# Confirm successful loading
print("Model and tokenizer loaded successfully.")


In [None]:
# Example prompt to test the model
prompt = "What are the treatments for Psoriasis ?"

# Generate response
response = generate_response(fine_tuned_model, fine_tuned_tokenizer, prompt)

# Print the response
print("Prompt:", prompt)
print("Response:", response)


In [None]:
# Testing with a sample prompt 2

# Define a new sample prompt
prompt = "What to do when feeling sick?"

# Generate a response using the fine-tuned model and tokenizer
response = generate_response(fine_tuned_model, fine_tuned_tokenizer, prompt)

# Display the response
print("Prompt:", prompt)
print("Response:", response)


In [None]:
# Testing with a sample prompt 3

# Define a new sample prompt
prompt = "What to do after being diagnosed with cancer?"

# Generate a response using the fine-tuned model and tokenizer
response = generate_response(fine_tuned_model, fine_tuned_tokenizer, prompt)

# Display the response
print("Prompt:", prompt)
print("Response:", response)


**Exercise 12: Compare the performance of a *GPT2 model* with the *GPT2 model fine-tuned* on MedQuAD data [1 Mark]**

- Load another pre-trained GPT2LMHeadModel and do not fine-tune it

- To generate a response with the untuned model, pass it as a parameter to the `generate_response()` function

- Test both models (fine-tuned and untuned) with the user input prompts below:

    - "What precautions to take for a healthy life?"
    - "What to do after being diagnosed with cancer?"
    - "What to do when feeling sick?"

In [None]:
# Load a pre-trained GPT-2 model
model_name = "gpt2"
pretrained_model = GPT2LMHeadModel.from_pretrained(model_name)

# Load the pre-trained GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Set the padding token to the end of sentence token
tokenizer.pad_token = tokenizer.eos_token  # Set the padding token

# Print confirmation
print("Loaded pre-trained GPT-2 model and tokenizer successfully.")


In [None]:
# Testing with fine-tuned model: prompt 1
prompt = "What precautions to take for a healthy life?"

# Generate the response using the defined generate_response function
response = generate_response(fine_tuned_model, fine_tuned_tokenizer, prompt)

# Print the response
print(response)

In [None]:
# Testing with untuned model: prompt 1

# Define the prompt
prompt = "What precautions to take for a healthy life?"

# Generate the response using the untuned model and the generate_response function
response = generate_response(pretrained_model, tokenizer, prompt)

# Print the response
print("Prompt:", prompt)
print("Response:", response)


In [None]:
# Testing with fine-tuned model: prompt 2

prompt = "What to do after being diagnosed with cancer?"
# Generate the response using the defined generate_response function
response = generate_response(fine_tuned_model, fine_tuned_tokenizer, prompt)

# Print the response
print(response)

In [None]:
# Testing with untuned model: prompt 2

prompt = "What to do after being diagnosed with cancer?"
# Generate the response using the untuned model and the generate_response function
response = generate_response(pretrained_model, tokenizer, prompt)

# Print the response
print("Prompt:", prompt)
print("Response:", response)

In [None]:
# Testing with fine-tuned model: prompt 3

prompt = "What to do when feeling sick?"
# Generate the response using the defined generate_response function
response = generate_response(fine_tuned_model, fine_tuned_tokenizer, prompt)

# Print the response
print(response)

In [None]:
# Testing with untuned model: prompt 3

prompt = "What to do when feeling sick?"
response = generate_response(pretrained_model, tokenizer, prompt)

# Print the response
print("Prompt:", prompt)
print("Response:", response)
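
The six cells above repeat the same call for each model and prompt. A compact alternative is to loop over the prompts and a dictionary of named generators. The sketch below is framework-free: `compare_models` is a hypothetical helper, and the two lambda generators are stubs that stand in for calls such as `lambda p: generate_response(fine_tuned_model, fine_tuned_tokenizer, p)`.

```python
prompts = [
    "What precautions to take for a healthy life?",
    "What to do after being diagnosed with cancer?",
    "What to do when feeling sick?",
]


def compare_models(prompts, generators):
    """Run every prompt through every named generator.

    `generators` maps a model label to a callable prompt -> response.
    Returns a list of (prompt, {label: response}) pairs.
    """
    results = []
    for prompt in prompts:
        responses = {label: generate(prompt) for label, generate in generators.items()}
        results.append((prompt, responses))
    return results


# Stub generators so the sketch runs without loading any model.
generators = {
    "fine-tuned": lambda p: f"[fine-tuned answer to: {p}]",
    "untuned": lambda p: f"[untuned answer to: {p}]",
}

for prompt, responses in compare_models(prompts, generators):
    print("Prompt:", prompt)
    for label, response in responses.items():
        print(f"  {label}: {response}")
```

Printing both responses under the same prompt makes the side-by-side comparison that the exercise asks for easier to read than alternating cells.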