<a href="https://colab.research.google.com/github/GaneshSelvaraj717/Ganesh-Selvaraj/blob/master/M4_NB_MiniProject_1_Medical_Q%26A_GPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generative AI and Prompt Engineering
## A programme by IISc and TalentSprint
### Mini-Project: Medical Q&A using GPT2

## Learning Objectives

At the end of the experiment, you will be able to:

* perform data preprocessing, EDA and feature extraction on the Medical Q&A dataset
* load a pre-trained tokenizer
* finetune a GPT-2 language model for medical question-answering

## Dataset Description

The dataset used in this project is the *Medical Question Answering Dataset* ([MedQuAD](https://github.com/abachaa/MedQuAD/tree/master)). It contains medical question-answer pairs along with additional information such as the question type, the question *focus*, and its UMLS (Unified Medical Language System) details: the Concept Unique Identifier (*CUI*), Semantic *Type*, and Semantic *Group*.

To learn more about how this data was collected and constructed, refer to this [paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4).

The data has been extracted into CSV format with the following features:

- **Focus**: the question focus
- **CUI**: concept unique identifier
- **SemanticType**
- **SemanticGroup**
- **Question**
- **Answer**

## Grading = 10 Points

## Information

Healthcare professionals often have to consult medical literature and documents when seeking answers to medical queries. Medical databases and search engines are powerful sources of up-to-date medical knowledge. However, the existing documentation is vast, which makes it difficult for professionals to retrieve answers quickly in a clinical setting. Search engines and information retrieval systems also return a list of documents rather than answers. Question answering systems instead return short sentences or paragraphs in response to medical queries, so their main advantage is that they provide answers and hints within a few seconds.

### Problem Statement

Fine-tune the GPT-2 model on the MedQuAD medical question-answering dataset to generate responses to medical queries.

### **GPT-2**

OpenAI's GPT-2 exhibited an impressive ability to write coherent and passionate essays that exceeded what language models of the time were expected to produce. GPT-2 was not a particularly novel architecture: it is very similar to the **decoder-only transformer**. It was, however, a very large transformer-based language model trained on a massive dataset.

Here, you will fine-tune the GPT-2 model on the medical data. After fine-tuning, the model should be able to respond to prompts containing medical queries.

To learn more about GPT-2, refer [here](http://jalammar.github.io/illustrated-gpt2/).

### Installing Dependencies

In [1]:
%%capture
!pip -q uninstall pyarrow -y
!pip -q install pyarrow==15.0.2
!pip -q install datasets
!pip -q install accelerate
!pip -q install transformers

### <font color="#990000">Restart Session/Runtime</font>

### Import required packages

In [2]:
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import torch
from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

import warnings
warnings.filterwarnings('ignore')

In [4]:
#@title Download the dataset
!wget -q https://cdn.iisc.talentsprint.com/AIandMLOps/MiniProjects/Datasets/MedQuAD.csv
!ls | grep ".csv"

MedQuAD.csv
MedQuAD.csv.1


**Exercise 1: Read the MedQuAD.csv dataset**

**Hint:** `pd.read_csv()`

In [5]:
pd.read_csv('MedQuAD.csv')

Unnamed: 0,Focus,CUI,SemanticType,SemanticGroup,Question,Answer
0,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,What is (are) Adult Acute Lymphoblastic Leukem...,Key Points - Adult acute lymphoblastic leukemi...
1,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,What are the symptoms of Adult Acute Lymphobla...,"Signs and symptoms of adult ALL include fever,..."
2,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,How to diagnose Adult Acute Lymphoblastic Leuk...,Tests that examine the blood and bone marrow a...
3,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,What is the outlook for Adult Acute Lymphoblas...,Certain factors affect prognosis (chance of re...
4,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,Who is at risk for Adult Acute Lymphoblastic L...,Previous chemotherapy and exposure to radiatio...
...,...,...,...,...,...,...
16407,Parasites - Zoonotic Hookworm,,,,What is (are) Parasites - Zoonotic Hookworm ?,"There are many different species of hookworms,..."
16408,Parasites - Zoonotic Hookworm,,,,Who is at risk for Parasites - Zoonotic Hookwo...,Dog and cat hookworms are found throughout the...
16409,Parasites - Zoonotic Hookworm,,,,How to diagnose Parasites - Zoonotic Hookworm ?,Cutaneous larva migrans (CLM) is a clinical di...
16410,Parasites - Zoonotic Hookworm,,,,What are the treatments for Parasites - Zoonot...,The zoonotic hookworm larvae that cause cutane...


### Pre-processing and EDA

**Exercise 2: Perform the following operations on the dataset [1 Mark]**

- Handle missing values
- Remove duplicates from data considering `Question` and `Answer` columns
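The two operations can be tried on a small made-up frame first. The sketch below is only an illustration: the `toy` DataFrame and its values are hypothetical, not taken from MedQuAD. It shows why restricting `dropna` to `Question` and `Answer` matters when other columns such as `CUI` are allowed to be empty.

```python
import pandas as pd

# Hypothetical toy data: one exact duplicate pair, one row without a question,
# and two rows without a CUI (which should NOT be dropped).
toy = pd.DataFrame({
    "Focus": ["A", "A", "B", "B"],
    "CUI": ["C001", "C001", None, None],
    "Question": ["q1", "q1", "q2", None],
    "Answer": ["a1", "a1", "a2", "a3"],
})

# Count missing values per column before deciding what to drop
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop rows missing a question or an answer, then drop duplicate Q&A pairs
cleaned = (
    toy.dropna(subset=["Question", "Answer"])
       .drop_duplicates(subset=["Question", "Answer"])
       .reset_index(drop=True)
)
print(cleaned)
```

Calling `dropna()` without `subset` would also discard the row whose only gap is the `CUI`, even though its question and answer are usable for training.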

- **Handle missing values**

In [10]:
# Load the dataset
data_file_path = '/content/MedQuAD.csv'  # Update if the file is stored elsewhere
df = pd.read_csv(data_file_path)

# Display the first few rows of the dataframe
print("Original Data:")
print(df.head())

# Handle missing values
# Rows can either be dropped or filled; here we drop rows that are
# missing 'Question' or 'Answer', since both are needed for training
df.dropna(subset=['Question', 'Answer'], inplace=True)

# Alternatively, fill missing values:
# df['Question'] = df['Question'].fillna('Unknown Question')
# df['Answer'] = df['Answer'].fillna('Unknown Answer')

# Remove duplicates based on 'Question' and 'Answer' columns
df.drop_duplicates(subset=['Question', 'Answer'], inplace=True)

# Display the cleaned dataframe
print("\nCleaned Data:")
print(df.head())

# Save the cleaned data to a new CSV file (create the output directory first)
cleaned_data_file_path = '/content/cleaned_MedQuaD/cleaned_data.csv'
os.makedirs(os.path.dirname(cleaned_data_file_path), exist_ok=True)
df.to_csv(cleaned_data_file_path, index=False)
print(f"\nCleaned data saved to: {cleaned_data_file_path}")

Original Data:
                                Focus       CUI SemanticType SemanticGroup  \
0  Adult Acute Lymphoblastic Leukemia  C0751606         T191     Disorders   
1  Adult Acute Lymphoblastic Leukemia  C0751606         T191     Disorders   
2  Adult Acute Lymphoblastic Leukemia  C0751606         T191     Disorders   
3  Adult Acute Lymphoblastic Leukemia  C0751606         T191     Disorders   
4  Adult Acute Lymphoblastic Leukemia  C0751606         T191     Disorders   

                                            Question  \
0  What is (are) Adult Acute Lymphoblastic Leukem...   
1  What are the symptoms of Adult Acute Lymphobla...   
2  How to diagnose Adult Acute Lymphoblastic Leuk...   
3  What is the outlook for Adult Acute Lymphoblas...   
4  Who is at risk for Adult Acute Lymphoblastic L...   

                                              Answer  
0  Key Points - Adult acute lymphoblastic leukemi...  
1  Signs and symptoms of adult ALL include fever,...  
2  Tests that 

In [12]:
#from google.colab import drive
#drive.mount('/content/drive')

- **Remove duplicates from data considering `Question` and `Answer` columns**

In [13]:
# Remove duplicates based on 'Question' and 'Answer' columns
initial_count = df.shape[0]
df.drop_duplicates(subset=['Question', 'Answer'], inplace=True)
final_count = df.shape[0]

# Display the number of duplicates removed
duplicates_removed = initial_count - final_count
print(f"\nDuplicates removed: {duplicates_removed}")

# Display the cleaned dataframe
print("\nCleaned Data:")
print(df.head())

# Save the cleaned data to a new CSV file
cleaned_data_file_path = '/content/cleaned_MedQuaD/cleaned_data.csv'
df.to_csv(cleaned_data_file_path, index=False)
print(f"\nCleaned data saved to: {cleaned_data_file_path}")


Duplicates removed: 0

Cleaned Data:
                                Focus       CUI SemanticType SemanticGroup  \
0  Adult Acute Lymphoblastic Leukemia  C0751606         T191     Disorders   
1  Adult Acute Lymphoblastic Leukemia  C0751606         T191     Disorders   
2  Adult Acute Lymphoblastic Leukemia  C0751606         T191     Disorders   
3  Adult Acute Lymphoblastic Leukemia  C0751606         T191     Disorders   
4  Adult Acute Lymphoblastic Leukemia  C0751606         T191     Disorders   

                                            Question  \
0  What is (are) Adult Acute Lymphoblastic Leukem...   
1  What are the symptoms of Adult Acute Lymphobla...   
2  How to diagnose Adult Acute Lymphoblastic Leuk...   
3  What is the outlook for Adult Acute Lymphoblas...   
4  Who is at risk for Adult Acute Lymphoblastic L...   

                                              Answer  
0  Key Points - Adult acute lymphoblastic leukemi...  
1  Signs and symptoms of adult ALL include fev

**Exercise 3: Display the category name, and the number of records belonging to top 100 categories of `Focus` column [1 Mark]**

In [14]:
# Display the top 100 categories in the 'Focus' column with their record counts
if 'Focus' in df.columns:
    top_categories = df['Focus'].value_counts().head(100)

    print("\nTop 100 Categories in 'Focus' Column:")
    print(top_categories)
else:
    print("The 'Focus' column is not present in the dataset.")


Top 100 Categories in 'Focus' Column:
Focus
Breast Cancer                                                        53
Prostate Cancer                                                      43
Stroke                                                               35
Skin Cancer                                                          34
Alzheimer's Disease                                                  30
                                                                     ..
Camurati-Engelmann disease                                           11
Cushing's Syndrome                                                   11
Opitz G/BBB syndrome                                                 11
Ovarian Epithelial, Fallopian Tube, and Primary Peritoneal Cancer    10
Urinary Incontinence in Men                                          10
Name: count, Length: 100, dtype: int64


In [15]:
# Displaying the distinct categories of Focus column and the number of records belonging to each category
# (Top 100 only)

# Display distinct categories in the 'Focus' column with their counts
if 'Focus' in df.columns:
    focus_counts = df['Focus'].value_counts().head(100)

    print("\nDistinct Categories in 'Focus' Column (Top 100):")
    print(focus_counts)
else:
    print("The 'Focus' column is not present in the dataset.")


Distinct Categories in 'Focus' Column (Top 100):
Focus
Breast Cancer                                                        53
Prostate Cancer                                                      43
Stroke                                                               35
Skin Cancer                                                          34
Alzheimer's Disease                                                  30
                                                                     ..
Camurati-Engelmann disease                                           11
Cushing's Syndrome                                                   11
Opitz G/BBB syndrome                                                 11
Ovarian Epithelial, Fallopian Tube, and Primary Peritoneal Cancer    10
Urinary Incontinence in Men                                          10
Name: count, Length: 100, dtype: int64


In [16]:
# Top 100 Focus categories names

# Display distinct categories in the 'Focus' column with their counts
if 'Focus' in df.columns:
    focus_counts = df['Focus'].value_counts().head(100)

    print("\nTop 100 Categories in 'Focus' Column:")
    for category in focus_counts.index:
        print(category)
else:
    print("The 'Focus' column is not present in the dataset.")


Top 100 Categories in 'Focus' Column:
Breast Cancer
Prostate Cancer
Stroke
Skin Cancer
Alzheimer's Disease
Colorectal Cancer
Lung Cancer
Heart Failure
Heart Attack
High Blood Cholesterol
High Blood Pressure
Parkinson's Disease
Leukemia
Osteoporosis
Shingles
Age-related Macular Degeneration
Diabetes
Hemochromatosis
Diabetic Retinopathy
Psoriasis
Gum (Periodontal) Disease
Kidney Disease
Balance Problems
Cataract
COPD
Dry Mouth
Wilson Disease
Prescription and Illicit Drug Abuse
Medicare and Continuing Care
Gout
Glaucoma
Neuroblastoma
Narcolepsy
Short Bowel Syndrome
Osteoarthritis
Problems with Taste
Rheumatoid Arthritis
Endometrial Cancer
Pituitary Tumors
Kidney Dysplasia
Urinary Tract Infections in Children
Dry Eye
Peripheral Arterial Disease (P.A.D.)
Problems with Smell
Anxiety Disorders
Surviving Cancer
Prostate Enlargement: Benign Prostatic Hyperplasia
National Hormone and Pituitary Program (NHPP): Information for People Treated with Pituitary Human Growth Hormone (Comprehensive Repo

### Create Training and Validation set

**Exercise 4: Create training and validation set [1 Mark]**

- Take 4 samples per `Focus` category, for each of the top 100 categories in the dataset (this gives 400 samples for training)

- Take 1 sample per `Focus` category (different from the training samples), for each of the top 100 categories in the dataset (this gives 100 samples for validation)
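One way to guarantee that the validation sample of a category never overlaps with its training samples is to draw all five rows in a single call and then split them. The sketch below is an illustration under that assumption; `split_per_category` and the `toy` frame are hypothetical names, not part of the notebook.

```python
import pandas as pd

def split_per_category(df, column="Focus", n_train=4, n_val=1, top_k=100, seed=42):
    """Draw n_train + n_val rows per top category, then split them so the sets are disjoint."""
    top_categories = df[column].value_counts().head(top_k).index
    train_parts, val_parts = [], []
    for category in top_categories:
        rows = df[df[column] == category].sample(n=n_train + n_val, random_state=seed)
        train_parts.append(rows.iloc[:n_train])   # first n_train rows go to training
        val_parts.append(rows.iloc[n_train:])     # remaining n_val rows go to validation
    return pd.concat(train_parts), pd.concat(val_parts)

# Hypothetical toy data: category A has 6 rows, B has 5, C has only 2
toy = pd.DataFrame({
    "Focus": ["A"] * 6 + ["B"] * 5 + ["C"] * 2,
    "Question": [f"q{i}" for i in range(13)],
})

# Keep only the 2 largest categories for this small example
train_toy, val_toy = split_per_category(toy, top_k=2)
print(len(train_toy), len(val_toy))
```

Every selected category must contain at least `n_train + n_val` rows, otherwise `sample` raises an error; in the toy frame this is why category C is excluded through `top_k=2`.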

In [18]:
# Get the top 100 categories in 'Focus'
top_categories = df['Focus'].value_counts().head(100).index
# Create training set with 4 samples per category
train_samples = []
for category in top_categories:
    samples = df[df['Focus'] == category].sample(n=4, random_state=42)  # Get 4 random samples
    train_samples.append(samples)

# Concatenate all training samples into a single DataFrame
train_set = pd.concat(train_samples)

# Create the validation set: 1 sample per top-100 category, drawn from rows not used for training
remaining_data = df[~df.index.isin(train_set.index)]
val_samples = [
    remaining_data[remaining_data['Focus'] == category].sample(n=1, random_state=42)
    for category in top_categories
]
val_set = pd.concat(val_samples)

# Display the sizes of the training and validation sets
print(f"Training set size: {train_set.shape[0]}")
print(f"Validation set size: {val_set.shape[0]}")

# Save the training and validation sets to CSV files
train_set_file_path = '/content/cleaned_MedQuaD/train_set.csv'
val_set_file_path = '/content/cleaned_MedQuaD/val_set.csv'
train_set.to_csv(train_set_file_path, index=False)
val_set.to_csv(val_set_file_path, index=False)

print(f"\nTraining set saved to: {train_set_file_path}")
print(f"Validation set saved to: {val_set_file_path}")

Training set size: 400
Validation set size: 100

Training set saved to: /content/cleaned_MedQuaD/train_set.csv
Validation set saved to: /content/cleaned_MedQuaD/val_set.csv


### Pre-process `Question` and `Answer` text

**Exercise 5: Perform the following tasks:  [1 Mark]**

- Combine `Question` and `Answer` for train and validation data as shown below:
    - sequence = *'\<question\>' + question-text + '\<answer\>' + answer-text + '\<end\>'*

- Join the combined text using '\n' into a single string for training and validation separately

- Save the training and validation strings as separate text files
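The sequence format can be sketched as a small helper. This is a minimal illustration, not the required solution: `build_sequence` is a hypothetical name, the example strings are made up, and collapsing internal whitespace is an added assumption so that each Q&A pair stays on a single line when the pairs are later joined with `'\n'` and read back line by line.

```python
def build_sequence(question: str, answer: str) -> str:
    """Wrap one Q&A pair in the <question>/<answer>/<end> tags on a single line."""
    # Assumption: collapse newlines and repeated spaces so one pair == one line
    q = " ".join(question.split())
    a = " ".join(answer.split())
    return f"<question>{q}<answer>{a}<end>"

# Made-up example pairs; the first answer contains a line break on purpose
pairs = [
    ("What is gout ?", "Gout is a form\nof arthritis."),
    ("What causes gout ?", "A buildup of uric acid."),
]

# Join the formatted pairs into one string, one pair per line
corpus = "\n".join(build_sequence(q, a) for q, a in pairs)
print(corpus)
```

Without the whitespace step, an answer that contains a line break would be split across several lines of the text file, and a line-based loader would treat each fragment as a separate example.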

- **Combine Question and Answer for train and val data**

In [19]:
# Combine Questions and Answers for train and val data
## sequence = '<question>' + question + '<answer>' + answer + '<end>'

# Combine Question and Answer columns for training data
train_combined = train_set.apply(lambda row: f"<question>{row['Question']}<answer>{row['Answer']}<end>", axis=1)
train_string = '\n'.join(train_combined)

# Combine Question and Answer columns for validation data
val_combined = val_set.apply(lambda row: f"<question>{row['Question']}<answer>{row['Answer']}<end>", axis=1)
val_string = '\n'.join(val_combined)

# Save the combined strings to text files
train_file_path = '/content/cleaned_MedQuaD/train_data.txt'
val_file_path = '/content/cleaned_MedQuaD/val_data.txt'

with open(train_file_path, 'w') as train_file:
    train_file.write(train_string)

with open(val_file_path, 'w') as val_file:
    val_file.write(val_string)

print(f"Training data saved to: {train_file_path}")
print(f"Validation data saved to: {val_file_path}")

Training data saved to: /content/cleaned_MedQuaD/train_data.txt
Validation data saved to: /content/cleaned_MedQuaD/val_data.txt


- **Join the combined text using '\n' into a single string for training and validation separately**

In [20]:
# Train and Validation text for all Q&As

# already done in the previous cell

- **Save the training and validation strings as text files**

In [21]:
# Save the training and validation data as text files

# already done in the previous cell

**Exercise 6: Load pre-trained GPT2Tokenizer**

- Use checkpoint = "gpt2"

**Hint:** `GPT2Tokenizer.from_pretrained(...)`

In [25]:
# Set up the tokenizer
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

**Exercise 7: Tokenize train and validation data [1 Mark]**

- Use the loaded pre-trained tokenizer
- Use training and validation data saved in text files

**Hint:**

`from datasets import load_dataset`

`dataset = load_dataset("text", data_files={...})`

In [43]:
# GPT-2 has no padding token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token

# Load the training and validation data from the text files (one example per line)
train_file_path = '/content/cleaned_MedQuaD/train_data.txt'
val_file_path = '/content/cleaned_MedQuaD/val_data.txt'
dataset = load_dataset("text", data_files={"train": train_file_path, "validation": val_file_path})

# Drop empty lines so that no example ends up with zero tokens
dataset = dataset.filter(lambda example: example['text'].strip() != "")

# Tokenize each line; truncate to 512 tokens and leave padding to the data collator
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=['text'])
train_data = tokenized_dataset["train"]
val_data = tokenized_dataset["validation"]

# Ensure the datasets are formatted for PyTorch
train_data.set_format(type='torch', columns=['input_ids', 'attention_mask'])
val_data.set_format(type='torch', columns=['input_ids', 'attention_mask'])

**Exercise 8: Create a DataCollator object**

**Hint:** `DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="pt")`

Data collators are objects that:

- will form a batch by using a list of dataset elements as input
- may apply some processing (like padding)

One of the data collators, `DataCollatorForLanguageModeling`, can also apply some random data augmentation (like random masking) on the formed batch.

<br>

`DataCollatorForLanguageModeling` is a data collator used for language modeling. Inputs are dynamically padded to the maximum length of a batch if they are not all of the same length.

Parameters:

- ***tokenizer:*** The tokenizer used for encoding the data.
- ***mlm*** (bool, optional, default=True): Whether or not to use masked language modeling.
    - If set to False, the labels are the same as the inputs with the padding tokens ignored (by setting them to -100).
    - Otherwise, the labels are -100 for non-masked tokens and the value to predict for the masked token.
- ***return_tensors*** (str): The type of Tensor to return. Allowable values are “np”, “pt” and “tf” for numpy array, pytorch tensor, and tensorflow tensor respectively.

To know more about `DataCollatorForLanguageModeling` parameters, refer [here](https://huggingface.co/docs/transformers/v4.32.0/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling).
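The `mlm=False` behaviour described above (dynamic padding, labels copied from the inputs, padding ignored through -100) can be illustrated in plain Python. This is a simplified sketch and not the library's implementation: `collate_causal_lm` is a hypothetical helper, the token ids are arbitrary, and padding is masked by position here rather than by comparing against the pad token id.

```python
def collate_causal_lm(batch, pad_token_id, ignore_index=-100):
    """Pad a list of token-id lists to the longest one and build causal-LM labels."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)        # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)   # 0 marks padding
        labels.append(ids + [ignore_index] * n_pad)           # padding is ignored by the loss
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Two sequences of different length; 50256 is GPT-2's end-of-sequence id,
# reused as the padding id as in this notebook
batch = collate_causal_lm([[27, 25652, 29], [27, 25652]], pad_token_id=50256)
print(batch["input_ids"])       # [[27, 25652, 29], [27, 25652, 50256]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
print(batch["labels"])          # [[27, 25652, 29], [27, 25652, -100]]
```

The labels are not shifted by one position in the collator; for causal language modeling the shift between inputs and targets happens inside the model when it computes the loss.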

In [32]:
# Create a data collator for causal language modeling (mlm=False)
# It pads each batch dynamically and builds the labels from the input ids
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="pt")



**Exercise 9: Load pre-trained GPT2LMHeadModel**

**Hint:** `GPT2LMHeadModel.from_pretrained(...)`

In [33]:
# Set up the model
model_name = 'gpt2'  # You can also use 'gpt2-medium', 'gpt2-large', or 'gpt2-xl'

# Load the pre-trained GPT-2 tokenizer and reuse the end-of-sequence token for padding
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load the pre-trained GPT-2 model
# (model.eval() is not needed here: the model is fine-tuned next, and Trainer sets the mode itself)
model = GPT2LMHeadModel.from_pretrained(model_name)

print("Pre-trained GPT-2 model and tokenizer loaded successfully.")

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Pre-trained GPT-2 model and tokenizer loaded successfully.


**Exercise 10: Fine-tune GPT2 Model [2 Mark]**

- Specify training arguments and create a TrainingArguments object (Use 30 epochs)

- Train a GPT-2 model using the provided training arguments

- Save the resulting trained model and tokenizer to a specified output directory
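Before choosing values such as the number of warmup steps, it can help to work out how many optimizer steps the run will take. The arithmetic below is a rough sketch under stated assumptions: 400 training examples (one per Q&A pair), 30 epochs, a single device, no gradient accumulation, and a per-device batch size of 8 with 500 warmup steps taken as example values. `training_schedule` is a hypothetical helper.

```python
import math

def training_schedule(num_samples, batch_size, num_epochs, warmup_steps):
    """Return steps per epoch, total optimizer steps, and the warmup share of the run."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return steps_per_epoch, total_steps, warmup_steps / total_steps

steps_per_epoch, total_steps, warmup_fraction = training_schedule(
    num_samples=400, batch_size=8, num_epochs=30, warmup_steps=500
)
print(f"steps per epoch: {steps_per_epoch}")      # 50
print(f"total steps: {total_steps}")              # 1500
print(f"warmup fraction: {warmup_fraction:.2f}")  # 0.33
```

Under these assumptions the learning rate would still be warming up for the first third of training, which is worth keeping in mind when interpreting the early loss values.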

In [44]:
# Set up the training arguments

model_output_path = "/content/cleaned_MedQuaD"
training_args = TrainingArguments(
    output_dir=model_output_path,   # output directory
    num_train_epochs=30,            # number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=8,   # batch size for evaluation
    warmup_steps=500,               # number of warmup steps for learning rate scheduler
    weight_decay=0.01,              # strength of weight decay
    logging_dir='./logs',           # directory for storing logs
    logging_steps=10,               # log the training loss every 10 steps
    evaluation_strategy="epoch",    # evaluate every epoch (recent transformers releases rename this to eval_strategy)
    save_strategy="epoch",          # save model every epoch
    save_total_limit=3,             # limit the total amount of saved checkpoints
    fp16=True,                      # use mixed precision training (requires a CUDA GPU)
    push_to_hub=False,              # disable push to hub
)

In [45]:
# Train the model
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,  # pads each batch and creates the labels for causal LM
    train_dataset=train_data,
    eval_dataset=val_data,
)

trainer.train()

# Save the fine-tuned model and tokenizer
model.save_pretrained(model_output_path)
tokenizer.save_pretrained(model_output_path)


In [None]:
# Save the model
model.save_pretrained(model_output_path)

# Save the tokenizer
tokenizer.save_pretrained(model_output_path)

**Exercise 11: Test Model with user input prompts [2 Mark]**

- Create `generate_response()` function that takes a trained *model*, *tokenizer*, and a *prompt* string as input and generates a response using the GPT-2 model

- Test it with some user input prompts
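Because the training text wraps every pair in `<question>...<answer>...<end>`, it is reasonable to format prompts the same way and to cut the generated text at `<end>`. The following is one possible sketch, not a reference solution: `build_prompt` and `extract_answer` are hypothetical helpers, and the sampling settings (`do_sample`, `top_k`, `top_p`) are assumed values chosen for illustration.

```python
def build_prompt(question: str) -> str:
    """Format a user question the way the training sequences were formatted."""
    return f"<question>{question.strip()}<answer>"

def extract_answer(generated_text: str) -> str:
    """Keep only the text between the first <answer> tag and the following <end> tag."""
    answer = generated_text.split("<answer>", 1)[-1]
    return answer.split("<end>", 1)[0].strip()

def generate_response(model, tokenizer, prompt, max_length=200):
    """Generate a reply with a GPT-2 style model (sketch; sampling settings are assumptions)."""
    inputs = tokenizer(build_prompt(prompt), return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,                             # input_ids and attention_mask
        max_length=max_length,                # prompt tokens + generated tokens
        do_sample=True,                       # sample instead of greedy decoding
        top_k=50,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return extract_answer(text)

# The two helpers can be checked without loading a model
print(build_prompt("  What is gout ? "))
print(extract_answer("<question>Q<answer> A text <end><question>next"))
```

The tags are ordinary text to the tokenizer rather than registered special tokens, so `skip_special_tokens=True` leaves them in place and `extract_answer` can still find them.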

In [None]:
def generate_response(model, tokenizer, prompt, max_length=200):

    # YOUR CODE HERE


In [None]:
# Load the fine-tuned model and tokenizer

# YOUR CODE HERE
# YOUR CODE HERE

In [None]:
# Testing with a sample prompt 1

prompt = # YOUR CODE HERE
response = # YOUR CODE HERE
response

In [None]:
# Testing with a sample prompt 2

prompt = # YOUR CODE HERE
response = # YOUR CODE HERE
response

**Exercise 12: Compare the performance of a *GPT2 model* with the *GPT2 model fine-tuned* on MedQuAD data [1 Mark]**

- Load another pre-trained GPT2LMHeadModel and do not fine-tune it

- To generate response using the untuned model, pass it as a parameter to `generate_response()` function

- Test both models (fine-tuned and untuned) with the following user input prompts:

    - "What precautions to take for a healthy life?"
    - "What to do after being diagnosed with cancer?"
    - "What to do when feeling sick?"
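To keep the comparison organised, the responses of both models can be collected side by side for each prompt. The sketch below is only a scaffold: `compare_responses` is a hypothetical helper, and the two lambda responders are stand-ins that return placeholder strings, not model output.

```python
def compare_responses(prompts, responders):
    """Run every responder on every prompt and collect the replies row by row."""
    rows = []
    for prompt in prompts:
        row = {"prompt": prompt}
        for name, respond in responders.items():
            row[name] = respond(prompt)
        rows.append(row)
    return rows

prompts = [
    "What precautions to take for a healthy life?",
    "What to do after being diagnosed with cancer?",
    "What to do when feeling sick?",
]

# Stand-in responders (assumption): replace each lambda with a call to
# generate_response using the fine-tuned model and the untuned model
comparison = compare_responses(
    prompts,
    {
        "fine_tuned": lambda p: f"[fine-tuned model reply to: {p}]",
        "untuned": lambda p: f"[untuned model reply to: {p}]",
    },
)
for row in comparison:
    print(row)
```

In the notebook, each responder would wrap `generate_response()` with the corresponding model, so that both models receive exactly the same prompts.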

In [None]:
# Load a pre-trained GPT2 model, do not finetune it with MedQuAD data

# YOUR CODE HERE

In [None]:
# Testing with finetuned model: prompt 1

prompt = "What precautions to take for a healthy life?"
response = # YOUR CODE HERE
response

In [None]:
# Testing with untuned model: prompt 1

prompt = "What precautions to take for a healthy life?"
response = # YOUR CODE HERE
response

In [None]:
# Testing with finetuned model: prompt 2

prompt = "What to do after being diagnosed with cancer?"
response = # YOUR CODE HERE
response

In [None]:
# Testing with untuned model: prompt 2

prompt = "What to do after being diagnosed with cancer?"
response = # YOUR CODE HERE
response

In [None]:
# Testing with finetuned model: prompt 3

prompt = "What to do when feeling sick?"
response = # YOUR CODE HERE
response

In [None]:
# Testing with untuned model: prompt 3

prompt = "What to do when feeling sick?"
response = # YOUR CODE HERE
response