
# Advanced Certification Program in Computational Data Science
## A programme by IISc and TalentSprint
### Mini-Project: Medical Q&A using GPT2

## Learning Objectives

At the end of the experiment, you will be able to:

* perform data preprocessing, EDA and feature extraction on the Medical Q&A dataset
* load a pre-trained tokenizer
* finetune a GPT-2 language model for medical question-answering

## Dataset Description

The dataset used in this project is the *Medical Question Answering Dataset* ([MedQuAD](https://github.com/abachaa/MedQuAD/tree/master)). It contains medical question-answer pairs along with additional information such as the question type, the question *focus*, and its UMLS (Unified Medical Language System) details: the Concept Unique Identifier (*CUI*), Semantic *Type*, and Semantic *Group*.

To learn more about how this data was collected and constructed, refer to this [paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4).

The data has been extracted into CSV format with the following features:

- **Focus**: the question focus
- **CUI**: concept unique identifier
- **SemanticType**
- **SemanticGroup**
- **Question**
- **Answer**

## Part-A: Grading = 10 Points

## Information

Healthcare professionals often have to consult medical literature and documents when seeking answers to medical queries. Medical databases and search engines are powerful sources of up-to-date medical knowledge. However, the existing documentation is large, which makes it difficult for professionals to retrieve answers quickly in a clinical setting. Search engines and information retrieval systems return a list of documents rather than answers. Question answering systems instead return short sentences or paragraphs in response to medical queries, so their main advantage is that they can generate answers and provide hints within a few seconds.

### Problem Statement

Fine-tune the GPT-2 model on the Medical Question Answering Dataset (MedQuAD) to generate responses to medical queries.

Please refer to ***M6 Assignment-1 Fine-tune GPT2*** to see how to load the pre-trained GPT-2 tokenizer and model.

### Import required packages

In [None]:
!pip -q install -U accelerate
!pip -q install -U transformers
!pip -q install torch

In [None]:
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

import warnings
warnings.filterwarnings('ignore')

In [None]:
#@title Download the dataset
!wget -q https://cdn.iisc.talentsprint.com/AIandMLOps/MiniProjects/Datasets/MedQuAD.csv
!ls | grep ".csv"

MedQuAD.csv


**Exercise 1: Read the MedQuAD.csv dataset**

**Hint:** pd.read_csv()

In [None]:
df = pd.read_csv("MedQuAD.csv")
df.shape

(16412, 6)

In [None]:
df.head()

|   | Focus | CUI | SemanticType | SemanticGroup | Question | Answer |
|---|---|---|---|---|---|---|
| 0 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | What is (are) Adult Acute Lymphoblastic Leukem... | Key Points - Adult acute lymphoblastic leukemi... |
| 1 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | What are the symptoms of Adult Acute Lymphobla... | Signs and symptoms of adult ALL include fever,... |
| 2 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | How to diagnose Adult Acute Lymphoblastic Leuk... | Tests that examine the blood and bone marrow a... |
| 3 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | What is the outlook for Adult Acute Lymphoblas... | Certain factors affect prognosis (chance of re... |
| 4 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | Who is at risk for Adult Acute Lymphoblastic L... | Previous chemotherapy and exposure to radiatio... |


### Pre-processing and EDA

**Exercise 2: Perform the following operations on the dataset [0.5 Mark]**

- Handle missing values
- Remove duplicates from the data, considering the `Question` and `Answer` columns

- **Handle missing values**

In [None]:
df.isnull().sum()

Focus             14
CUI              565
SemanticType     597
SemanticGroup    565
Question           0
Answer             5
dtype: int64

In [None]:
# Handling missing values
def drop_missing_values(df, threshold=5):
    '''Drop rows with nulls if every column has fewer than `threshold` % nulls.'''
    rows = df.shape[0]
    null_count = df.isnull().sum().values
    null_per_cent = 100 * null_count / rows
    # Compare element-wise first, then reduce: `all(arr) < 5` compares a bool with 5
    # and is therefore always True
    if (null_per_cent < threshold).all():
        print('All null values are dropped. Data has {0:.2f} % null values'.format(max(null_per_cent)))
        return df.dropna()
    else:
        print("No values are dropped. Data has {0:.2f} % null values".format(max(null_per_cent)))
        return df

In [None]:
# Drop missing values
df_temp = drop_missing_values(df)


All null values are dropped. Data has 3.64 % null values


In [None]:
df = df_temp

- **Remove duplicates from data considering `Question` and `Answer` columns**

In [None]:
df[df['Question'].duplicated()].index

Index([  250,   251,   298,   299,   336,   338,   339,   340,   341,   404,
       ...
       16076, 16078, 16079, 16080, 16081, 16082, 16090, 16092, 16123, 16130],
      dtype='int64', length=1417)

In [None]:
# Check and drop duplicates
def check_drop_duplicates(df):
    '''Returns df without the rows whose Question or Answer repeats an earlier row
    '''
    cols = ['Question','Answer']
    dup_idx = set()

    for col in cols:
        dups = df[df[col].duplicated()].index
        dup_idx.update(dups)

    rows = df.shape[0]

    dup_count = len(dup_idx)
    dup_per_cent = 100 * dup_count / rows

    print('All duplicates dropped. Data has {0:.2f} % duplicate values'.format(dup_per_cent))
    return df.drop(index= dup_idx)

In [None]:
# Drop duplicates
df_temp = check_drop_duplicates(df)

All duplicates dropped. Data has 12.24 % duplicate values


In [None]:
# Check duplicates
df_temp.duplicated().sum()

0

In [None]:
df = df_temp
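
The helper above drops a row when either its `Question` or its `Answer` repeats an earlier row. If a duplicate is instead read as a repeated (`Question`, `Answer`) pair, pandas' `drop_duplicates` with a `subset` is one option. The sketch below contrasts the two readings on a toy frame; the toy data and the names `toy`, `by_pair`, and `by_either` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy data (hypothetical), small enough to check by eye
toy = pd.DataFrame({
    "Question": ["What is X?", "What is X?", "What is Y?", "What is X?"],
    "Answer":   ["X is a.",    "X is b.",    "Y is c.",    "X is a."],
})

# Reading 1: a duplicate is a repeated (Question, Answer) pair
by_pair = toy.drop_duplicates(subset=["Question", "Answer"])

# Reading 2 (what the helper above does): drop a row if either column repeats an earlier row
either_dup = toy["Question"].duplicated() | toy["Answer"].duplicated()
by_either = toy[~either_dup]

print(len(by_pair), len(by_either))
```

The second reading is stricter: it also removes rows that reuse a question with a different answer, which may or may not be desirable for a Q&A corpus.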

**Exercise 3: Display the category name and the number of records for the top 100 categories of the `Focus` column [1 Mark]**

In [None]:
def focus_categories(df, n=1):
    '''Return a {category: count} dict for the n most frequent `Focus` values.'''
    col = 'Focus'

    cat_count = df[col].value_counts()

    # Sort by count, largest first, and keep the first n entries
    result = dict(sorted(cat_count.items(), key=lambda x: x[1], reverse=True)[:n])

    return result

In [None]:
# Top Focus categories and their record counts (n=3 shown here; use n=100 for the top 100)
print(focus_categories(df,3))

{'Prostate Cancer': 11, 'Wilson Disease': 10, 'Ovarian Epithelial, Fallopian Tube, and Primary Peritoneal Cancer': 10}


### Create Training and Validation set

**Exercise 4: Create training and validation set [2 Marks]**

- Take 4 samples per `Focus` category for each of the top 100 categories (this gives 400 samples for training)

- Take 1 sample per `Focus` category, different from the training samples, for each of the top 100 categories (this gives 100 samples for validation)

In [None]:
[df[df['Focus']=='Prostate Cancer'].index.values]

[array([  664,   665,   666,   667,   668,   669,   670,   672,   673,
        15437, 15439])]

In [None]:
def training_validation_split(df):
    import random
    random.seed(42)
    col = 'Focus'
    top_categories = focus_categories(df, 100)
    train_set_idx = set()
    test_set_idx = set()
    not_sampled_idx = set()
    for cat in top_categories:
        cat_idx = set(df[df[col] == cat].index.values)

        # random.sample() needs a sequence (sampling from a set raises TypeError
        # on Python 3.11+), so sample from a sorted list
        train_idx = set(random.sample(sorted(cat_idx), 4))
        # Update train set idx
        train_set_idx.update(train_idx)
        cat_idx.difference_update(train_idx)

        # Validation sample is drawn from the rows not used for training
        test_idx = set(random.sample(sorted(cat_idx), 1))
        # Update test set idx
        test_set_idx.update(test_idx)
        cat_idx.difference_update(test_idx)

        # Update unsampled set idx
        not_sampled_idx.update(cat_idx)
    return train_set_idx, test_set_idx, not_sampled_idx


In [None]:
train_idx, test_idx, unsampled_idx = training_validation_split (df)
len(train_idx), len(test_idx), len(unsampled_idx )

(400, 100, 333)
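
As an alternative to the index-set approach above, the same split can be sketched with pandas group operations. This is a minimal sketch under assumptions: `split_top_categories` and the toy frame are hypothetical, the index is unique, every top category has at least `n_train + n_val` rows, and `groupby(...).sample` requires pandas 1.1 or later.

```python
import pandas as pd

def split_top_categories(df, n_top=100, n_train=4, n_val=1, seed=42):
    """Sample n_train + n_val rows per top category, then split them apart."""
    top = df["Focus"].value_counts().head(n_top).index
    subset = df[df["Focus"].isin(top)]
    # Raises ValueError if a category has fewer than n_train + n_val rows
    sampled = subset.groupby("Focus", group_keys=False).sample(
        n=n_train + n_val, random_state=seed
    )
    # The first n_val sampled rows of each category go to validation, the rest to training
    val = sampled.groupby("Focus", group_keys=False).head(n_val)
    train = sampled.drop(index=val.index)
    return train, val

# Toy data (hypothetical): three categories with enough rows, one without
toy = pd.DataFrame({
    "Focus": ["A"] * 6 + ["B"] * 5 + ["C"] * 7 + ["D"] * 2,
    "Question": [f"q{i}" for i in range(20)],
})
train, val = split_top_categories(toy, n_top=3)
print(len(train), len(val))
```

Because validation rows are removed from the sampled rows by index, the two sets cannot overlap.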

In [None]:
train_set= df.loc[list(train_idx)]
test_set = df.loc[list(test_idx)]

In [None]:
train_set.info()

<class 'pandas.core.frame.DataFrame'>
Index: 400 entries, 15365 to 15356
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Focus          400 non-null    object
 1   CUI            400 non-null    object
 2   SemanticType   400 non-null    object
 3   SemanticGroup  400 non-null    object
 4   Question       400 non-null    object
 5   Answer         400 non-null    object
dtypes: object(6)
memory usage: 21.9+ KB


In [None]:
test_set.info()

<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 262 to 15355
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Focus          100 non-null    object
 1   CUI            100 non-null    object
 2   SemanticType   100 non-null    object
 3   SemanticGroup  100 non-null    object
 4   Question       100 non-null    object
 5   Answer         100 non-null    object
dtypes: object(6)
memory usage: 5.5+ KB


### Pre-process `Question` and `Answer` text

**Exercise 5: Perform the following tasks: [1.5 Marks]**

- Combine `Question` and `Answer` for train and validation data as shown below:
    - sequence = *'\<question\>' + question-text + '\<answer\>' + answer-text*

- Join the combined text using '\n' into a single string for training and validation separately

- Save the training and validation strings as separate text files

- **Combine Question and Answer for train and val data**

In [None]:
test_set.head(1)

|   | Focus | CUI | SemanticType | SemanticGroup | Question | Answer |
|---|---|---|---|---|---|---|
| 262 | Osteosarcoma and Malignant Fibrous Histiocytom... | C0002991 | T191 | Disorders | How to diagnose Osteosarcoma and Malignant Fib... | Imaging tests are used to detect (find) osteos... |


In [None]:
row = 0
print('<question>' + test_set.iloc[row,4] + '<answer>' + test_set.iloc[row,5])

<question>How to diagnose Osteosarcoma and Malignant Fibrous Histiocytoma of Bone ?<answer>Imaging tests are used to detect (find) osteosarcoma and MFH. Imaging tests are done before the biopsy. The following tests and procedures may be used: - Physical exam and history : An exam of the body to check general signs of health, including checking for signs of disease, such as lumps or anything else that seems unusual. A history of the patients health habits and past illnesses and treatments will also be taken. - X-ray : An x-ray of the organs and bones inside the body. An x-ray is a type of energy beam that can go through the body and onto film, making a picture of areas inside the body. - CT scan (CAT scan): A procedure that makes a series of detailed pictures of areas inside the body, taken from different angles. The pictures are made by a computer linked to an x-ray machine. A dye may be injected into a vein or swallowed to help the organs or tissues show up more clearly. This procedur

In [None]:
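def write_files(df, FILENAME: str):
    # Build one '<question>...<answer>...' sequence per row
    sequences = ['<question>' + q + '<answer>' + a
                 for q, a in zip(df['Question'], df['Answer'])]
    # Join the sequences with '\n' and write them as a single string
    with open(FILENAME, mode='w', encoding='utf-8') as fp:
        fp.write('\n'.join(sequences))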


- **Join the combined text using '\n' into a single string for training and validation separately**

In [None]:
# YOUR CODE HERE
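
One way to fill in this step, as a minimal sketch: build the `<question>...<answer>...` sequences and join them with `'\n'`. The helper name `build_corpus` and the two example pairs are hypothetical.

```python
def build_corpus(pairs):
    """Join '<question>q<answer>a' sequences with newlines into one string."""
    sequences = ["<question>" + q + "<answer>" + a for q, a in pairs]
    return "\n".join(sequences)

# Example pairs (hypothetical)
pairs = [
    ("What is flu?", "A viral infection."),
    ("What is a cold?", "A mild viral illness."),
]
corpus = build_corpus(pairs)
print(corpus)
```

With the notebook's frames, the pairs could come from `zip(train_set['Question'], train_set['Answer'])` and `zip(test_set['Question'], test_set['Answer'])`.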

- **Save the training and validation strings as text files**

In [None]:
write_files(train_set,'training.txt')

In [None]:
write_files(test_set,'validation.txt')

**Exercise 6: Load pre-trained GPT2Tokenizer [0.5 Mark]**

- Use checkpoint = "gpt2"

In [None]:
# Set up the tokenizer
checkpoint = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)

**Exercise 7: Tokenize train and validation data and form TextDataset objects [0.5 Mark]**

- Use the loaded pre-trained tokenizer
- Use training and validation data saved in text files

In [None]:
# Tokenize train text
train_dataset = TextDataset(tokenizer=tokenizer, file_path="training.txt", block_size=512)

# Tokenize validation text
val_dataset = TextDataset(tokenizer=tokenizer, file_path="validation.txt", block_size=512)

In [None]:
# Length of train and validation set
len(train_dataset), len(val_dataset)

(378, 92)

In [None]:
# Shape of a single example: one block of block_size (512) token ids
train_dataset[0].shape, val_dataset[0].shape

(torch.Size([512]), torch.Size([512]))
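
The dataset lengths (378 and 92) count 512-token blocks, not Q&A rows. The sketch below assumes that `TextDataset` tokenizes the file as one stream and cuts it into consecutive blocks of `block_size` tokens, discarding a final partial block; `chunk_token_ids` is a hypothetical stand-in for that behaviour, not the library's code.

```python
def chunk_token_ids(token_ids, block_size):
    """Split a token stream into consecutive full blocks; a trailing partial block is dropped."""
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

stream = list(range(10))          # stand-in for the token ids of a text file
blocks = chunk_token_ids(stream, block_size=4)
print(blocks)                     # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Under this assumption a block can start or end in the middle of a question-answer pair, which is why every example has exactly 512 tokens.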

**Exercise 8: Create a DataCollator object [0.5 Mark]**

In [None]:
# Create a Data collator object
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="pt")
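
With `mlm=False` the collator prepares batches for causal language modelling: the labels are a copy of the input ids, with padding positions replaced by -100 so that the loss ignores them, and the one-position shift between inputs and targets happens inside the model. The plain-Python sketch below illustrates that idea; `causal_lm_labels`, `next_token_pairs`, and the toy ids are hypothetical and are not the library's implementation.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def causal_lm_labels(batch_input_ids, pad_token_id=None):
    """Copy input ids into labels, masking padding positions when a pad id is given."""
    return [
        [IGNORE_INDEX if pad_token_id is not None and tok == pad_token_id else tok
         for tok in seq]
        for seq in batch_input_ids
    ]

def next_token_pairs(seq):
    """(context token, target token) pairs after the one-position shift."""
    return list(zip(seq[:-1], seq[1:]))

batch = [[11, 12, 13, 0], [21, 22, 0, 0]]  # toy ids, 0 used as the pad id
labels = causal_lm_labels(batch, pad_token_id=0)
print(labels)
print(next_token_pairs([11, 12, 13]))
```

In this notebook every block already has 512 tokens, so no padding is involved and the labels are simply the input ids.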

**Exercise 9: Load pre-trained GPT2LMHeadModel [0.5 Mark]**

In [None]:
# Set up the model
model = GPT2LMHeadModel.from_pretrained(checkpoint)

**Exercise 10: Fine-tune GPT2 Model [1 Mark]**

- Specify training arguments and create a TrainingArguments object (Use 30 epochs)

- Train a GPT-2 model using the provided training arguments

- Save the resulting trained model and tokenizer to a specified output directory

In [None]:
# Set up the training arguments

model_output_path = "/content/gpt_model"

training_args = TrainingArguments(
    output_dir = model_output_path,
    overwrite_output_dir = True,
    per_device_train_batch_size = 4, # try with 2
    per_device_eval_batch_size = 4,  #  try with 2
    num_train_epochs = 10,
    save_steps = 1_000,
    save_total_limit = 2,
    logging_dir = './logs',
    )

In [None]:
# Train the model
trainer = Trainer(
    model = model,
    args = training_args,
    data_collator = data_collator,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
)

trainer.train()

# Save the model
trainer.save_model(model_output_path)

# Save the tokenizer
tokenizer.save_pretrained(model_output_path)

| Step | Training Loss |
|---|---|
| 500 | 0.7853 |


('/content/gpt_model/tokenizer_config.json',
 '/content/gpt_model/special_tokens_map.json',
 '/content/gpt_model/vocab.json',
 '/content/gpt_model/merges.txt',
 '/content/gpt_model/added_tokens.json')
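
The single logged row at step 500 is consistent with the run length. The arithmetic below is a sketch that assumes a single device, no gradient accumulation, the last partial batch being kept, and the `TrainingArguments` default of `logging_steps=500` (it is not set explicitly in the notebook).

```python
import math

num_examples = 378      # len(train_dataset)
batch_size = 4          # per_device_train_batch_size
epochs = 10             # num_train_epochs in the cell that was run last
logging_steps = 500     # assumed library default
save_steps = 1_000

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Steps at which a loss row is logged and at which a checkpoint is saved
logged = list(range(logging_steps, total_steps + 1, logging_steps))
saved = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch, total_steps, logged, saved)  # 95 950 [500] []
```

Under these assumptions training ends before step 1000, so no intermediate checkpoint is written and the final model comes only from `trainer.save_model`.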

**Exercise 11: Test Model with user input prompts [1 Mark]**

- Create `generate_response()` function that takes a trained *model*, *tokenizer*, and a *prompt* string as input and generates a response using the GPT-2 model

- Test it with some user input prompts

In [None]:
def generate_response(model, tokenizer, prompt, max_length=100):

    input_ids = tokenizer.encode(prompt, return_tensors="pt")      # 'pt' for returning pytorch tensor

    # Create the attention mask and pad token id
    attention_mask = torch.ones_like(input_ids)
    pad_token_id = tokenizer.eos_token_id

    output = model.generate(
        input_ids,
        max_length=max_length,
        num_return_sequences=1,
        attention_mask=attention_mask,
        pad_token_id=pad_token_id
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)


In [None]:
# Load the fine-tuned model and tokenizer

my_model = GPT2LMHeadModel.from_pretrained(model_output_path)
my_tokenizer = GPT2Tokenizer.from_pretrained(model_output_path)

In [None]:
# Response from model

prompt = "What precautions to take for a healthy life?"  # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: What precautions to take for a healthy life? Check with your doctor before starting any treatment or prevention campaign. Taking certain medicines can increase your risk of getting cancer. Taking certain medicines can increase your risk of getting cancer. Talk with your doctor about the best ways to prevent cancer. Regular check-ups with your doctor will help check for signs of cancer. Regular liver function tests to check for signs of cancer are done at least three times a week for 6 to 8 weeks. Check with your doctor for any


In [None]:
# Response from model

prompt = "<question>What precautions to take for a healthy life?"  # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: <question>What precautions to take for a healthy life??<answer>Having a healthy immune system is important to maintaining a healthy immune system. Anything that increases a person's chance of getting a disease is called a risk factor. Having a risk factor does not mean that you will get cancer; not having risk factors doesnt mean that you will not get cancer. Talk with your doctor if you think you may be at risk. Risk factors for nonmelanoma skin cancer include the following: -


In [None]:
# Testing with given prompt 1

prompt = "What to do after being diagnosed with cancer?"  # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: What to do after being diagnosed with cancer? - Talk with your child's doctor. - See a doctor if your child has any of the following: - Feeling very tired. - Feeling very tired for no known reason. - Feeling very tired for the first time in a while. - Feeling very tired for no known reason. - Feeling very tired for the first time in a while. - Feeling very tired for the first time in a while. - Feeling very tired for the first time in a while


In [None]:
# Testing with given prompt 2

prompt = "What to do when feeling sick?"  # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: What to do when feeling sick? - Talk with your doctor. - See a doctor if your child or teenager has any of the following: - Feeling very tired. - Feeling very tired for no known reason. - Feeling very tired for no known reason. - Feeling very tired for no known reason. - Feeling very tired for no known reason. - Feeling very tired for no known reason. - Feeling very tired for no known reason. - Feeling very tired for no known reason. - Feeling very tired
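
The repeated phrases in the outputs above are typical of greedy decoding. One common remedy is n-gram blocking: before choosing the next token, forbid any token that would complete an n-gram already present in the generated text. The plain-Python sketch below illustrates the idea only; `banned_next_tokens` is a hypothetical helper, not the library's implementation.

```python
def banned_next_tokens(generated, n):
    """Return tokens that would complete an n-gram already present in `generated`."""
    if n < 1 or len(generated) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = tuple(generated[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

tokens = "feeling very tired . feeling very".split()
print(banned_next_tokens(tokens, 3))  # {'tired'}
```

In practice, `model.generate` accepts keyword arguments such as `no_repeat_ngram_size` and `repetition_penalty`, which could be added to the `generate()` call inside `generate_response` (for example `no_repeat_ngram_size=3`). Suitable values are an assumption and have not been tested on this model.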


**Exercise 12: Compare the performance of a *GPT2 model* with the *GPT2 model fine-tuned* on MedQuAD data [1 Mark]**

- Load another pre-trained GPT2LMHeadModel and do not fine-tune it

- To generate a response using the untuned model, pass it as a parameter to the `generate_response()` function

- Test both models (fine-tuned and untuned) with the user input prompts below:

    - "What precautions to take for a healthy life?"
    - "What to do after being diagnosed with cancer?"
    - "What to do when feeling sick?"
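
The comparison can also be run as a loop instead of one cell per prompt and model. In the minimal sketch below, `compare_models` is a hypothetical helper and `fake_generate` is a stand-in so that the example runs without loading any model; `generate_fn` is assumed to have the same signature as the notebook's `generate_response(model, tokenizer, prompt)`.

```python
def compare_models(models, tokenizer, prompts, generate_fn):
    """Collect one response per (prompt, model) pair.

    models: dict mapping a display name to a model object
    generate_fn: callable (model, tokenizer, prompt) -> str
    """
    results = {}
    for prompt in prompts:
        results[prompt] = {
            name: generate_fn(model, tokenizer, prompt)
            for name, model in models.items()
        }
    return results

# Stand-in generate function (hypothetical) so the sketch runs on its own
def fake_generate(model, tokenizer, prompt):
    return f"[{model}] {prompt}"

prompts = [
    "What precautions to take for a healthy life?",
    "What to do after being diagnosed with cancer?",
    "What to do when feeling sick?",
]
results = compare_models({"fine-tuned": "tuned", "untuned": "base"}, None, prompts, fake_generate)
for prompt, answers in results.items():
    print(prompt)
    for name, text in answers.items():
        print(f"  {name}: {text}")
```

In the notebook, the call would look like `compare_models({"fine-tuned": my_model, "untuned": untuned_model}, my_tokenizer, prompts, generate_response)`.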

In [None]:
# Load a pre-trained GPT2 model, do not finetune it with MedQuAD data

# Set up the model
untuned_model = GPT2LMHeadModel.from_pretrained(checkpoint)

In [None]:
# Testing with finetuned model: prompt 1

prompt = "What precautions to take for a healthy life?"  # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: What precautions to take for a healthy life? Check with your doctor before starting any treatment or prevention campaign. Taking certain medicines can increase your risk of getting cancer. Taking certain medicines can increase your risk of getting cancer. Talk with your doctor about the best ways to prevent cancer. Regular check-ups with your doctor will help check for signs of cancer. Regular liver function tests to check for signs of cancer are done at least three times a week for 6 to 8 weeks. Check with your doctor for any


In [None]:
# Testing with untuned model: prompt 1

prompt = "What precautions to take for a healthy life?"  # Replace with your desired prompt
response = generate_response(untuned_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: What precautions to take for a healthy life?

The following are some of the most common questions you'll hear from your doctor or nurse about your health.

What are the risks of taking a drug that can cause cancer?

The risks of taking a drug that can cause cancer are very high.

What are the risks of taking a drug that can cause cancer?

The risks of taking a drug that can cause cancer are very high.

What are the risks


In [None]:
# Testing with finetuned model: prompt 2

prompt = "What to do after being diagnosed with cancer?"  # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: What to do after being diagnosed with cancer? - Talk with your child's doctor. - See a doctor if your child has any of the following: - Feeling very tired. - Feeling very tired for no known reason. - Feeling very tired for the first time in a while. - Feeling very tired for no known reason. - Feeling very tired for the first time in a while. - Feeling very tired for the first time in a while. - Feeling very tired for the first time in a while


In [None]:
# Testing with untuned model: prompt 2

prompt = "What to do after being diagnosed with cancer?"  # Replace with your desired prompt
response = generate_response(untuned_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: What to do after being diagnosed with cancer?

The first step is to get your doctor's approval for a treatment.

If you have a cancer diagnosis, you may need to get a second opinion.

If you have a cancer diagnosis, you may need to get a second opinion. If you have a cancer diagnosis, you may need to get a third opinion.

If you have a cancer diagnosis, you may need to get a third opinion. If you have a cancer


In [None]:
# Testing with finetuned model: prompt 3

prompt = "What to do when feeling sick?"  # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: What to do when feeling sick? - Talk with your doctor. - See a doctor if your child or teenager has any of the following: - Feeling very tired. - Feeling very tired for no known reason. - Feeling very tired for no known reason. - Feeling very tired for no known reason. - Feeling very tired for no known reason. - Feeling very tired for no known reason. - Feeling very tired for no known reason. - Feeling very tired for no known reason. - Feeling very tired


In [None]:
# Testing with untuned model: prompt 3

prompt = "What to do when feeling sick?"  # Replace with your desired prompt
response = generate_response(untuned_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: What to do when feeling sick?

The first thing you should do is to get your body to relax.

If you're feeling sick, you should take a few minutes to relax.

If you're feeling sick, you should take a few minutes to relax.

If you're feeling sick, you should take a few minutes to relax.

If you're feeling sick, you should take a few minutes to relax.

If you're feeling sick, you
