<a href="https://colab.research.google.com/github/Prakum14/Testfiles/blob/master/Another_copy_of_M4_NB_MiniProject_1_Deploy_Medical_Q%26A_GPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Mini-Project: Medical Q&A using GPT2 | Deployment on Hugging Face Spaces

## Learning Objectives

At the end of the experiment, you will be able to:

* perform data preprocessing, EDA and feature extraction on the Medical Q&A dataset
* load a pre-trained tokenizer
* fine-tune a GPT-2 language model for medical question-answering
* upload your fine-tuned model to the Hugging Face Model Hub
* deploy a Gradio application that uses the uploaded model on Hugging Face Spaces

## Dataset Description

The dataset used in this project is the *Medical Question Answering Dataset* ([MedQuAD](https://github.com/abachaa/MedQuAD/tree/master)). It includes medical question-answer pairs along with additional information such as the question type, the question *focus*, and its UMLS (Unified Medical Language System) details: the Concept Unique Identifier (*CUI*), the Semantic *Type*, and the Semantic *Group*.

To learn more about how this data was collected and constructed, refer to this [paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4).

The extracted data is in CSV format with the following features:

- **Focus**: the question focus
- **CUI**: concept unique identifier
- **SemanticType**
- **SemanticGroup**
- **Question**
- **Answer**

## Grading = 10 Points

## Information

Healthcare professionals often have to refer to medical literature and documents while seeking answers to medical queries. Medical databases and search engines are powerful sources of up-to-date medical knowledge. However, the existing documentation is large, which makes it difficult for professionals to retrieve answers quickly in a clinical setting. The problem with search engines and information retrieval engines is that they return a list of documents rather than answers. Instead, healthcare professionals can use question answering systems to retrieve short sentences or paragraphs in response to medical queries. The main advantage of such systems is that they generate answers and provide hints within a few seconds.

### Problem Statement

Fine-tune the GPT-2 model on the Medical Question Answering Dataset to generate responses to medical queries. Then deploy the fine-tuned model on Hugging Face Spaces.

Please refer to ***M4 Assignment-1 Fine-tune GPT2*** and ***M4 AdditionalNB Fine-tune GPT2 for TextClassification*** to get familiar with how to load pre-trained gpt2 tokenizer and model.

Please refer to ***The demo session held on 26 Jan - Hugging Face Spaces Deployment*** to get familiar with how to deploy on Hugging Face Spaces.

### Installing Dependencies

In [2]:
%%capture
!pip -q uninstall pyarrow -y
!pip -q install pyarrow==15.0.2
!pip -q install datasets
!pip -q install accelerate
!pip -q install transformers

### <font color="#990000">Restart Session/Runtime</font>

### Import required packages

In [1]:
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

import warnings
warnings.filterwarnings('ignore')

In [2]:
#@title Download the dataset
!wget -q https://cdn.iisc.talentsprint.com/AIandMLOps/MiniProjects/Datasets/MedQuAD.csv
!ls | grep ".csv"

MedQuAD.csv


**Exercise 1: Read the MedQuAD.csv dataset**

**Hint:** pd.read_csv()

In [3]:
meddf = pd.read_csv('MedQuAD.csv')

In [4]:
meddf.info()
meddf.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16412 entries, 0 to 16411
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Focus          16398 non-null  object
 1   CUI            15847 non-null  object
 2   SemanticType   15815 non-null  object
 3   SemanticGroup  15847 non-null  object
 4   Question       16412 non-null  object
 5   Answer         16407 non-null  object
dtypes: object(6)
memory usage: 769.4+ KB


| | Focus | CUI | SemanticType | SemanticGroup | Question | Answer |
|---|---|---|---|---|---|---|
| 0 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | What is (are) Adult Acute Lymphoblastic Leukem... | Key Points - Adult acute lymphoblastic leukemi... |
| 1 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | What are the symptoms of Adult Acute Lymphobla... | Signs and symptoms of adult ALL include fever,... |
| 2 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | How to diagnose Adult Acute Lymphoblastic Leuk... | Tests that examine the blood and bone marrow a... |
| 3 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | What is the outlook for Adult Acute Lymphoblas... | Certain factors affect prognosis (chance of re... |
| 4 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | Who is at risk for Adult Acute Lymphoblastic L... | Previous chemotherapy and exposure to radiatio... |


### Pre-processing and EDA

**Exercise 2: Perform the operations below on the dataset [0.5 Mark]**

- Handle missing values
- Remove duplicates from data considering `Question` and `Answer` columns

- **Handle missing values**

In [5]:
meddf.isnull().sum()

| | 0 |
|---|---|
| Focus | 14 |
| CUI | 565 |
| SemanticType | 597 |
| SemanticGroup | 565 |
| Question | 0 |
| Answer | 5 |


In [6]:

meddf.dropna(inplace=True)
meddf.isnull().sum()

| | 0 |
|---|---|
| Focus | 0 |
| CUI | 0 |
| SemanticType | 0 |
| SemanticGroup | 0 |
| Question | 0 |
| Answer | 0 |


- **Remove duplicates from data considering `Question` and `Answer` columns**

In [7]:
meddf.duplicated(subset=['Question', 'Answer']).sum()

48

In [8]:

meddf.drop_duplicates(subset=['Question', 'Answer'], inplace=True)
meddf.duplicated(subset=['Question', 'Answer']).sum()

0

**Exercise 3: Display the category name, and the number of records belonging to top 100 categories of `Focus` column [0.5 Mark]**

In [9]:
# Total categories in Focus column
meddf['Focus'].nunique()

4770

In [10]:
# Display the most frequent Focus categories and the number of records in each
# (this run keeps the top 500; the exercise asks for the top 100)
meddf['Focus'].value_counts().head(500)

| Focus | count |
|---|---|
| Breast Cancer | 53 |
| Prostate Cancer | 43 |
| Stroke | 35 |
| Skin Cancer | 34 |
| Alzheimer's Disease | 30 |
| ... | ... |
| Mixed connective tissue disease | 6 |
| Alopecia universalis | 6 |
| Mondini dysplasia | 6 |
| Mitochondrial genetic disorders | 6 |


In [11]:
# Names of the top 500 Focus categories (the exercise asks for the top 100)
meddf['Focus'].value_counts().head(500).index

Index(['Breast Cancer', 'Prostate Cancer', 'Stroke', 'Skin Cancer',
       'Alzheimer's Disease', 'Colorectal Cancer', 'Lung Cancer',
       'Heart Failure', 'Heart Attack', 'High Blood Cholesterol',
       ...
       'Proteinuria', 'Pachyonychia congenita', 'Perry syndrome',
       'Parsonage Turner syndrome', 'Mosaic trisomy 9',
       'Mixed connective tissue disease', 'Alopecia universalis',
       'Mondini dysplasia', 'Mitochondrial genetic disorders',
       'Amelogenesis imperfecta'],
      dtype='object', name='Focus', length=500)

### Create Training and Validation set

**Exercise 4: Create training and validation set [1 Mark]**

- Consider 4 samples per `Focus` category, for each top 100 categories, from the dataset (It will give 400 samples for training)

- Consider 1 sample per `Focus` category (different from training set), for each top 100 categories, from the dataset (It will give 100 samples for validation)

In [12]:
# Get the top 500 Focus categories (the exercise asks for 100; this run uses 500)
top_500_focus = meddf['Focus'].value_counts().head(500).index

# Collect the per-category samples, then concatenate once at the end
train_parts, val_parts = [], []

for focus in top_500_focus:
    focus_df = meddf[meddf['Focus'] == focus]
    # Sample 4 rows for training
    train_samples = focus_df.sample(n=4, random_state=42)
    train_parts.append(train_samples)
    # Exclude the training rows so the validation row is a different record
    remaining_samples = focus_df.drop(train_samples.index)
    # Sample 1 row for validation from the remaining rows
    if not remaining_samples.empty:
        val_parts.append(remaining_samples.sample(n=1, random_state=42))

train_df = pd.concat(train_parts)
val_df = pd.concat(val_parts)

print("Training set shape:", train_df.shape)
print("Validation set shape:", val_df.shape)

Training set shape: (2000, 6)
Validation set shape: (500, 6)


In [13]:
train_df['Focus'].value_counts().head(500)

| Focus | count |
|---|---|
| Amelogenesis imperfecta | 4 |
| Breast Cancer | 4 |
| Prostate Cancer | 4 |
| Stroke | 4 |
| Skin Cancer | 4 |
| ... | ... |
| COPD | 4 |
| Cataract | 4 |
| Balance Problems | 4 |
| Gout | 4 |


### Pre-process `Question` and `Answer` text

**Exercise 5: Perform the tasks below: [1 Mark]**

- Combine `Question` and `Answer` for train and validation data as shown below:
    - sequence = *'\<question\>' + question-text + '\<answer\>' + answer-text + '\<end\>'*

- Join the combined text using '\n' into a single string for training and validation separately

- Save the training and validation strings as separate text files

- **Combine Question and Answer for train and val data**

In [14]:
# Combine Questions and Answers for train and val data
## sequence = '<question>' + question + '<answer>' + answer + '<end>'
combqa = lambda x: '<question>' + x['Question'] + '<answer>' + x['Answer'] + '<end>'
train_df['sequence_train'] = train_df.apply(combqa, axis=1)
val_df['sequence_val'] = val_df.apply(combqa, axis=1)


- **Join the combined text using '\n' into a single string for training and validation separately**

In [15]:
# Train and Validation text for all Q&As
train_text = '\n'.join(train_df['sequence_train'])
val_text = '\n'.join(val_df['sequence_val'])



- **Save the training and validation strings as text files**

In [16]:
# Save the training and validation data as text files
with open('train.txt', 'w') as f:
    f.write(train_text)

with open('val.txt', 'w') as f:
    f.write(val_text)
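
Because the sequences are joined with '\n' and the text files are later read back one line per example, a newline inside a question or an answer would split one Q&A pair across several examples. The following is a minimal sketch, assuming it is acceptable to collapse internal whitespace, that builds one single-line sequence per pair. `build_sequence` is a hypothetical helper name that is not part of the notebook, and the two pairs are made-up placeholders.

```python
import re

def build_sequence(question: str, answer: str) -> str:
    """Return '<question>Q<answer>A<end>' on a single line."""
    def clean(text):
        # Collapse every run of whitespace (including newlines) into one space
        return re.sub(r"\s+", " ", str(text)).strip()
    return "<question>" + clean(question) + "<answer>" + clean(answer) + "<end>"

# Made-up placeholder pairs; the second answer spans two lines
pairs = [
    ("Example question one?", "A single-line answer."),
    ("Example question two?", "First line of the answer.\nSecond line of the answer."),
]
text = "\n".join(build_sequence(q, a) for q, a in pairs)
lines = text.split("\n")
print(len(lines))  # one line per Q&A pair
```

With pandas, the same helper could be applied row by row, for example `train_df.apply(lambda x: build_sequence(x['Question'], x['Answer']), axis=1)`.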

**Exercise 6: Load pre-trained GPT2Tokenizer**

- Use checkpoint = "gpt2"

In [17]:
# Set up the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

**Exercise 7: Tokenize train and validation data [0.5 Mark]**

- Use the loaded pre-trained tokenizer
- Use training and validation data saved in text files

In [18]:

# Assuming 'train.txt' and 'val.txt' exist from previous steps
from datasets import load_dataset
train_path = 'train.txt'
val_path = 'val.txt'

dataset = load_dataset('text', data_files={'train': train_path, 'validation': val_path})



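# Tokenize the whole training text as one (truncated) sequence.
# Note: these encodings are not used by the training cells in this notebook;
# the Trainer consumes the per-line `tokenized_datasets` built further down.
train_encodings = tokenizer(train_text, truncation=True, padding=True)

# Same for the validation text
val_encodings = tokenizer(val_text, truncation=True, padding=True)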

**Exercise 8: Create a DataCollator object**

In [19]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 500
    })
})

In [20]:
block_size = 256  # The maximum number of tokens (words or subwords) that will be allowed in each input sample.

# Define the tokenization function to apply to each example in the dataset
def tokenize_function(examples):
    # Tokenize the text using the GPT-2 tokenizer and return the tokenized input in PyTorch tensor format.
    return tokenizer(examples["text"],
                     padding='max_length',        # Pad sequences to the maximum length (block_size).
                     truncation=True,             # Truncate sequences that exceed the maximum length.
                     max_length=block_size,      # Limit the tokenized sequences to `block_size` tokens.
                     return_tensors='pt')        # Return the tokenized output as PyTorch tensors.

# Apply the tokenization function to the entire dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [21]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 500
    })
})
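
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a plain-Python sketch of what happens to one example's token ids. It imitates the shape of the tokenizer's output rather than calling the tokenizer. `pad_or_truncate` is a hypothetical helper, and the pad id 50256 is GPT-2's end-of-text id, which this notebook reuses as the pad token.

```python
def pad_or_truncate(token_ids, block_size=256, pad_id=50256):
    """Imitate right-padding and truncation to a fixed block size."""
    ids = list(token_ids)[:block_size]            # truncation: drop tokens past block_size
    n_real = len(ids)
    ids = ids + [pad_id] * (block_size - n_real)  # padding: fill up to block_size
    attention_mask = [1] * n_real + [0] * (block_size - n_real)
    return {"input_ids": ids, "attention_mask": attention_mask}

short = pad_or_truncate(range(10))    # a 10-token example gets 246 pad tokens
long = pad_or_truncate(range(300))    # a 300-token example loses its last 44 tokens
print(len(short["input_ids"]), sum(short["attention_mask"]))
print(len(long["input_ids"]), sum(long["attention_mask"]))
```

One consequence worth keeping in mind: any sequence longer than `block_size` tokens loses its tail, including the `<end>` marker.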

In [22]:
tokenizer.decode(tokenized_datasets['train']['input_ids'][9])



In [23]:
# Create a Data collator object
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="pt")
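
With `mlm=False`, the collator prepares batches for causal language modeling: the labels are a copy of `input_ids` with padding positions set to -100 so that the loss ignores them, and the shift by one position happens inside the model. The following plain-Python sketch illustrates that labeling step. It is an illustration, not the library's code; `make_clm_labels` is a hypothetical name and the token ids are made up.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def make_clm_labels(input_ids, pad_id=50256):
    """Copy input_ids to labels, masking every pad position."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

input_ids = [27, 25652, 29, 2061, 50256, 50256]  # made-up ids followed by two pads
labels = make_clm_labels(input_ids)
print(labels)
```

Because this notebook sets `pad_token = eos_token`, a genuine end-of-text token in the input would be masked in the same way.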

**Exercise 9: Load pre-trained GPT2LMHeadModel**

In [24]:
# Set up the model
model = GPT2LMHeadModel.from_pretrained('gpt2')

**Exercise 10: Fine-tune GPT2 Model [1 Mark]**

- Specify training arguments and create a TrainingArguments object (Use 30 epochs)

- Train a GPT-2 model using the provided training arguments

- Save the resulting trained model and tokenizer to a specified output directory

In [25]:
# Set up the training arguments

modeling_outputs_path = "/content/gpt2_model"

In [26]:
# Train the model

training_args = TrainingArguments(
    output_dir=modeling_outputs_path,
    overwrite_output_dir=True,
    num_train_epochs=30,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    save_steps=1_000,
    save_total_limit=2,
    evaluation_strategy='steps', eval_steps=50,
    logging_steps=50,
    logging_dir='./logs',
    report_to='none'
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"]
)

# `os` is already imported above; report_to='none' also disables W&B logging
os.environ["WANDB_DISABLED"] = "true"
trainer.train()

# The model and tokenizer are saved in the next code cell

| Step | Training Loss | Validation Loss |
|---|---|---|
| 50 | 2.6609 | 2.396171 |
| 100 | 2.3647 | 2.332169 |
| 150 | 2.3306 | 2.280388 |
| 200 | 2.2163 | 2.237986 |
| 250 | 2.2117 | 2.217655 |
| 300 | 2.1684 | 2.183256 |
| 350 | 2.1372 | 2.161703 |
| 400 | 2.1024 | 2.144379 |
| 450 | 2.0585 | 2.119771 |
| 500 | 1.9975 | 2.09266 |


KeyboardInterrupt: 
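
The log above can be put in context with a little arithmetic. Assuming a single device and no gradient accumulation (both are assumptions, not stated in the notebook), 2000 training rows at batch size 4 give 500 optimizer steps per epoch, so the last logged step corresponds to about one of the 30 requested epochs. The validation loss is a mean cross-entropy in nats, so its exponential can be read as a perplexity.

```python
import math

train_rows, batch_size, epochs = 2000, 4, 30
# Assumes 1 device and gradient_accumulation_steps=1
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
last_logged_step = 500

print("steps per epoch:", steps_per_epoch)
print("planned steps:", total_steps)
print("epochs covered by the log:", last_logged_step / steps_per_epoch)

# Validation losses copied from the first and last logged rows
for step, val_loss in [(50, 2.396171), (500, 2.09266)]:
    print(step, "perplexity ~", round(math.exp(val_loss), 2))
```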

**Exercise 11: Test Model with user input prompts [1 Mark]**

- Create `generate_response()` function that takes a trained *model*, *tokenizer*, and a *prompt* string as input and generates a response using the GPT-2 model

- Test it with some user input prompts

In [27]:
# Save the model
saved_model_path = '/content/gpt2_model'
trainer.save_model(saved_model_path)

# Save the tokenizer
tokenizer.save_pretrained(saved_model_path)

('/content/gpt2_model/tokenizer_config.json',
 '/content/gpt2_model/special_tokens_map.json',
 '/content/gpt2_model/vocab.json',
 '/content/gpt2_model/merges.txt',
 '/content/gpt2_model/added_tokens.json')

In [28]:
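def generate_response(model, tokenizer, prompt, max_length=200):
    """Generate a continuation of `prompt`; `max_length` counts prompt plus new tokens."""
    # Encode the prompt and move it to the same device as the model
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    device = next(model.parameters()).device
    input_ids = input_ids.to(device)
    # Single unpadded prompt, so every position is attended to
    attention_mask = torch.ones_like(input_ids)
    # GPT-2 has no pad token; reuse the end-of-text token
    pad_token_id = tokenizer.eos_token_id

    output = model.generate(input_ids, max_length=max_length, num_return_sequences=1,
                            pad_token_id=pad_token_id, attention_mask=attention_mask)

    # Decode the first (and only) returned sequence
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    return response
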


In [29]:
# Load the fine-tuned model and tokenizer
my_model = GPT2LMHeadModel.from_pretrained(modeling_outputs_path)
my_tokenizer = GPT2Tokenizer.from_pretrained(modeling_outputs_path)



In [30]:
# Testing with a sample prompt 1

prompt = " How is the outlook for the Breast Cancer?" #"How to cure breast cancer?"
response = generate_response(my_model, my_tokenizer, prompt)
response

' How is the outlook for the Breast Cancer? The outlook for breast cancer is generally good. The following are possible prognostic factors for breast cancer: - Being older than 50 years. - Having a personal history of breast cancer. - Having a personal or family history of breast cancer. - Having a personal or family history of breast cancer with a personal or family history of breast cancer. - Having a personal or family history of breast cancer with a personal or family history of breast cancer. - Having a personal or family history of breast cancer with a personal or family history of breast cancer. - Having a personal or family history of breast cancer with a personal or family history of breast cancer. - Having a personal or family history of breast cancer with a personal or family history of breast cancer. - Having a personal or family history of breast cancer with a personal or family history of breast cancer. - Having a personal or family history of breast cancer with a persona
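
The response above repeats one phrase until it reaches `max_length`, which is common with greedy decoding. `model.generate()` accepts options that discourage this. The values below are illustrative starting points chosen for this sketch, not settings tuned on this model, and `generate_response_sampled` is a hypothetical variant of the notebook's `generate_response()`.

```python
# Illustrative decoding options (assumed values, not tuned on this model)
GENERATION_KWARGS = {
    "max_new_tokens": 150,       # limit on generated tokens, excluding the prompt
    "do_sample": True,           # sample instead of always taking the top token
    "top_p": 0.9,                # nucleus sampling
    "temperature": 0.7,          # below 1.0 sharpens the distribution
    "no_repeat_ngram_size": 3,   # forbid repeating any 3-gram
    "repetition_penalty": 1.2,   # down-weight tokens that already appeared
}

def generate_response_sampled(model, tokenizer, prompt, **overrides):
    """Generate a response using the decoding options above."""
    options = {**GENERATION_KWARGS, **overrides}
    # Calling the tokenizer returns both input_ids and attention_mask
    inputs = tokenizer(prompt, return_tensors="pt")
    device = next(model.parameters()).device
    inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
    output = model.generate(**inputs, pad_token_id=tokenizer.eos_token_id, **options)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

It can be called in the same way as the original function, for example `generate_response_sampled(my_model, my_tokenizer, prompt)`. With sampling enabled, the output changes from run to run.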

In [31]:
# Testing with a sample prompt 2

prompt = "How to prevent back pain?"
response = generate_response(my_model, my_tokenizer, prompt)
response

"How to prevent back pain? You can't prevent back pain. However, you can take steps to lower your pain. For example, you can avoid strenuous activity, such as sitting, standing, or sitting for long periods of time. You can also avoid strenuous activity, such as sitting, standing, or sitting for long periods of time. Avoiding pain medications and pain relievers can help you relieve pain. If you have pain, you can take pain medications to lower your pain. If you have pain, you can take pain medications to lower your pain. If you have pain, you can take pain medications to lower your pain. If you have pain, you can take pain medications to lower your pain. If you have pain, you can take pain medications to lower your pain. If you have pain, you can take pain medications to lower your pain. If you have pain, you can take pain medications to lower your pain. If you have pain, you can take pain medications to"

**Exercise 12: Compare the performance of a *GPT2 model* with the *GPT2 model fine-tuned* on MedQuAD data [0.5 Mark]**

- Load another pre-trained GPT2LMHeadModel and do not fine-tune it

- To generate response using the untuned model, pass it as a parameter to `generate_response()` function

- Test both models (fine-tuned and untuned) with below user input prompts:

    - "What precautions to take for a healthy life?"
    - "What to do after being diagnosed with cancer?"
    - "What to do when feeling sick?"
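
To compare the two models with more than a visual read, a simple number can help. The fine-tuned responses earlier in this notebook loop on one phrase, so one possible measure, chosen here for illustration and not required by the exercise, is the fraction of distinct word n-grams in a response: values near 1 mean little repetition, and values near 0 mean the text loops. `distinct_ngram_ratio` is a hypothetical helper name.

```python
def distinct_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams in `text` that are unique (1.0 = no repeats)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 1.0  # too short to contain any n-gram
    return len(set(ngrams)) / len(ngrams)

# Toy strings: one that loops and one that does not
looping = "Be physically active for at least 30 minutes a day. - " * 5
varied = "Follow up with your doctor and ask about your personal and family medical histories."
print(round(distinct_ngram_ratio(looping), 2))
print(round(distinct_ngram_ratio(varied), 2))
```

Applying the function to the `response` string of each test cell gives one score per model and prompt. It measures repetition only and says nothing about medical correctness.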

In [32]:
# Load a pre-trained GPT2 model, do not finetune it with MedQuAD data

untuned_model = GPT2LMHeadModel.from_pretrained('gpt2')

In [33]:
# Testing with finetuned model: prompt 1

prompt = "What precautions to take for a healthy life?"
response = generate_response(my_model, my_tokenizer, prompt)
response

'What precautions to take for a healthy life? Follow the steps below to prevent or delay a serious health problem. You can also take steps to lower your risk for heart disease, stroke, and other diseases. - Be physically active. Be physically active for at least 30 minutes a day. - Be physically active for at least 30 minutes a day. - Be physically active for at least 30 minutes a day. - Be physically active for at least 30 minutes a day. - Be physically active for at least 30 minutes a day. - Be physically active for at least 30 minutes a day. - Be physically active for at least 30 minutes a day. - Be physically active for at least 30 minutes a day. - Be physically active for at least 30 minutes a day. - Be physically active for at least 30 minutes a day. - Be physically active for at least 30 minutes a day. - Be physically active for at least 30 minutes a day. - Be physically active for at least 30 minutes a day'

In [34]:
# Testing with untuned model: prompt 1

prompt = "What precautions to take for a healthy life?"
response = generate_response(untuned_model, my_tokenizer, prompt)
response

"What precautions to take for a healthy life?\n\nThe following are some of the most common questions you'll hear from your doctor or nurse about your health.\n\nWhat are the risks of taking a drug that can cause cancer?\n\nThe risks of taking a drug that can cause cancer are very high.\n\nWhat are the risks of taking a drug that can cause cancer?\n\nThe risks of taking a drug that can cause cancer are very high.\n\nWhat are the risks of taking a drug that can cause cancer?\n\nThe risks of taking a drug that can cause cancer are very high.\n\nWhat are the risks of taking a drug that can cause cancer?\n\nThe risks of taking a drug that can cause cancer are very high.\n\nWhat are the risks of taking a drug that can cause cancer?\n\nThe risks of taking a drug that can cause cancer are very high.\n\nWhat are the risks of taking a drug that can cause"

In [35]:
# Testing with finetuned model: prompt 2

prompt = "What to do after being diagnosed with cancer?"
response = generate_response(my_model, my_tokenizer, prompt)
response

'What to do after being diagnosed with cancer? Follow up with your doctor. - Follow up with your doctor for any questions you may have. - Ask about your personal and family medical histories. - Check with your doctor if you have any of the following: - Having a personal or family history of cancer. - Having a personal or family medical history of diabetes. - Having a personal or family medical history of rheumatoid arthritis. - Having a personal or family medical history of rheumatoid arthritis. - Having a personal or family medical history of rheumatoid arthritis. - Having a personal or family medical history of ulcerative colitis. - Having a personal or family medical history of ulcerative colitis. - Having a personal or family medical history of ulcerative colitis. - Having a personal or family medical history of ulcerative colitis. - Having a personal or family medical history of ulcerative colitis. - Having a personal or family medical history'

In [36]:
# Testing with untuned model: prompt 2

prompt = "What to do after being diagnosed with cancer?"
response = generate_response(untuned_model, my_tokenizer, prompt)
response

"What to do after being diagnosed with cancer?\n\nThe first step is to get your doctor's approval for a treatment.\n\nIf you have a cancer diagnosis, you may need to get a second opinion.\n\nIf you have a cancer diagnosis, you may need to get a second opinion. If you have a cancer diagnosis, you may need to get a third opinion.\n\nIf you have a cancer diagnosis, you may need to get a third opinion. If you have a cancer diagnosis, you may need to get a fourth opinion.\n\nIf you have a cancer diagnosis, you may need to get a fourth opinion. If you have a cancer diagnosis, you may need to get a fifth opinion.\n\nIf you have a cancer diagnosis, you may need to get a fifth opinion. If you have a cancer diagnosis, you may need to get a sixth opinion.\n\nIf you have a cancer diagnosis, you may need to get a sixth opinion. If you have"

In [37]:
# Testing with finetuned model: prompt 3

prompt = "What to do when feeling sick?"
response = generate_response(my_model, my_tokenizer, prompt)
response

'What to do when feeling sick? You can do a lot to reduce your chances of getting sick. You can also take steps to lower your chances of getting pneumonia. Take steps to avoid getting too much caffeine, alcohol, nicotine, and other stimulants. Talk with your doctor if you think you may be at risk. Also, try to avoid caffeine, alcohol, nicotine, and other stimulants. Talk with your doctor if you think you may be at risk. Also, try to avoid caffeine, alcohol, nicotine, and other stimulants. Talk with your doctor if you think you may be at risk. Also, try to avoid caffeine, alcohol, nicotine, and other stimulants. Talk with your doctor if you think you may be at risk. Also, try to avoid caffeine, alcohol, nicotine, and other stimulants. Talk with your doctor if you think you may be at risk. Also, try to avoid caffeine, alcohol, nicotine, and other stimulants. Talk with your doctor'

In [38]:
# Testing with untuned model: prompt 3

prompt = "What to do when feeling sick?"
response = generate_response(untuned_model, my_tokenizer, prompt)
response

"What to do when feeling sick?\n\nThe first thing you should do is to get your body to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick"

## Push your model to Hugging Face Model Hub

**Exercise 13: Follow the steps below to push your fine-tuned model to the Hugging Face Model Hub**

1. [Sign up](https://huggingface.co/join) for a Hugging Face account
2. Create an access token for your account and save it
3. Store your access token in the Hugging Face cache folder within Colab
4. Push your fine-tuned model and tokenizer to Model Hub
5. Load the model back from Hub and test it with user input prompts

* **Create an access token for your account**

    Once you have an account, to create an access token:
    
    - Go to your `Settings`, then click on the `Access Tokens` tab. Click on the `New token` button to create a new User Access Token.
    - Select a Token type as `Write` and give a name for your token
    - Click on Create token
    - Once a token is created save it somewhere
    - When required later, use the old saved token or create a new token again

    To know more about Access Tokens, refer [here](https://huggingface.co/docs/hub/security-tokens).

* **Store your access token in the Hugging Face cache folder within Colab**

    Once you have your User Access Token, run the following command to authenticate your identity to the Hub.
    - `!huggingface-cli login`
    - Paste your Access token when prompted
    - Type **n** when prompted to Add token as git credential? (Y/n)

    For more details on login, refer [here](https://huggingface.co/docs/huggingface_hub/quick-start#login).

In [39]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
The token `aimlMedicalQuestions` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `aim

* **Push your fine-tuned model and tokenizer to Model Hub [0.5 Mark]**

    - Use the `push_to_hub()` method of both your model and your tokenizer to push them to the Hub
    - Specify the name of the repository where the model and tokenizer will be pushed using the `repo_id` parameter
    - Push model and tokenizer to the same repository

    - **Hint:**

        - Use `push_to_hub()` method of your model. For parameter details, refer [here](https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.push_to_hub).
        - Use `push_to_hub()` method of your tokenizer. For parameter details, refer [here](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.push_to_hub).
        - Access your pushed model at `https://huggingface.co/[YOUR-USER-NAME]/[YOUR-MODEL-REPO-NAME]/tree/main`

In [40]:

# Push the model and the tokenizer to the same repository
my_model.push_to_hub("praveenku1479/gpt2_model")
my_tokenizer.push_to_hub("praveenku1479/gpt2_model")


No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/praveenku1479/gpt2_model/commit/0b1581ce03a3ddcabbc03acddc375eb76b3e0554', commit_message='Upload tokenizer', commit_description='', oid='0b1581ce03a3ddcabbc03acddc375eb76b3e0554', pr_url=None, repo_url=RepoUrl('https://huggingface.co/praveenku1479/gpt2_model', endpoint='https://huggingface.co', repo_type='model', repo_id='praveenku1479/gpt2_model'), pr_revision=None, pr_num=None)

* **Load the model and tokenizer back from Hub and test it with user input prompts [0.5 Mark]**

    - In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you are supplying to the `from_pretrained()` method. **AutoClasses** can be used to automatically retrieve the relevant model given the name/path to the pretrained weights/config/vocabulary.

    - Instantiating one of `AutoConfig`, `AutoModel`, and `AutoTokenizer` will directly create a class of the relevant architecture.

    - When the GPT2 Model transformer has a language modeling head on top, you can use an auto class with a language modeling head as well. `AutoModelWithLMHead` is deprecated in recent `transformers` releases; for causal language models such as GPT-2, use `AutoModelForCausalLM`.

    - Specify the full path of your model repo, i.e. ***"YOUR-USER-NAME/YOUR-REPO-NAME"***, when calling the `from_pretrained()` method.

In [41]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [42]:
# Load your model from hub

loaded_model = AutoModelForCausalLM.from_pretrained("praveenku1479/gpt2_model")

In [43]:
# Load your tokenizer from hub

loaded_tokenizer = AutoTokenizer.from_pretrained("praveenku1479/gpt2_model")

In [44]:
# Response from loaded model

prompt = "What is the outlook for breast cancer ?"
response = generate_response(loaded_model, loaded_tokenizer, prompt)
response

'What is the outlook for breast cancer ?<answer>Certain factors affect prognosis (chance of recovery) and treatment options. The prognosis (chance of recovery) and treatment options depend on the following: - The type of breast cancer. - The stage of the disease. - The breast size. - The breast size in millimeters or less. - The size of the breastbone in millimeters or less. - The size of the breastbone in inches or less. - The breastbone mass in inches or less. - The breastbone mass in inches or less. - The breastbone mass in inches or less. - The breastbone mass in inches or less. - The breastbone mass in inches or less. - The breastbone mass in inches or less. - The breastbone mass in inches or less. - The breastbone mass in inches or less. - The breastbone mass in inches or less. - The breastbone mass in inches or less. - The breastbone mass'
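
The response above contains the `<answer>` marker because the model was fine-tuned on sequences of the form `<question>…<answer>…<end>`. The sketch below shows two hypothetical helpers (not in the original notebook): one wraps a prompt in the training format, and the other pulls the answer text out of a generated string. The sample string is a toy input, and whether the model actually emits `<end>` depends on how training went.

```python
def format_prompt(question: str) -> str:
    """Wrap a question in the markers used during fine-tuning."""
    return "<question>" + question.strip() + "<answer>"

def extract_answer(generated: str) -> str:
    """Return the text between '<answer>' and '<end>' (or the end of the string)."""
    _, found, tail = generated.partition("<answer>")
    if not found:  # no marker present: fall back to the whole string
        tail = generated
    answer, _, _ = tail.partition("<end>")
    return answer.strip()

# Toy generated string in the training format
generated = "What is the outlook for breast cancer ?<answer>Certain factors affect prognosis.<end>extra text"
print(extract_answer(generated))
```

A call such as `extract_answer(generate_response(loaded_model, loaded_tokenizer, format_prompt(prompt)))` would then return only the answer portion.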

## Gradio Implementation

Gradio is an open-source Python library for quickly building easy-to-use, customizable UI components around an ML model, an API, or any arbitrary function in just a few lines of code. The interface can be embedded directly in a Python notebook or shared with anyone through a link.

**Exercise 14: Create a Gradio app for your fine-tuned model pushed to the Hugging Face Model Hub [1 Mark]**

- Install and import the `gradio` library
- Create a function that uses your fine-tuned model for response generation
    - Use the model and tokenizer directly within the function; do not pass them as parameters
    - The function should take the input prompt text and the maximum response length as its input parameters
    - The function should return the generated response text
- Create input and output Gradio elements
- Create a Gradio interface object
- Launch the interface to generate the UI

In [45]:
!pip -q install gradio

In [46]:
import gradio

In [47]:
# Function for response generation

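def generate_query_response(prompt, max_length):
    """Generate a response to `prompt` with the fine-tuned model loaded from the Hub."""
    # The Gradio slider may pass its value as a float, so cast it to int
    return generate_response(loaded_model, loaded_tokenizer, prompt, int(max_length))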


In [48]:

import gradio as gr

# Gradio elements

# Input from user
in_prompt = gr.Textbox(label="Enter your medical query")
in_max_length = gr.Slider(minimum=50, maximum=500, value=200, step=10, label="Maximum Response Length")

# Output response
out_response = gr.Textbox(label="Generated Response")


iface = gr.Interface(
    fn=generate_query_response,
    inputs=[in_prompt, in_max_length],
    outputs=out_response,
    title="Medical Query Response Generator",
)

iface.launch()


Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://3eac9d49b70f2b6d54.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




## Upload your Gradio application on Hugging Face Spaces

**Exercise 15: Upload your Gradio application on Hugging Face Spaces [2 Marks]**

1. Start a new Hugging Face Space by going to your profile and [clicking "New Space"](https://huggingface.co/new-space)

2. Provide details for your Space:
    - Space name
    - License (e.g. [MIT](https://opensource.org/licenses/MIT))
    - Space SDK (software development kit) (e.g. `Gradio`)
    - Space hardware (CPU basic)
    - Choose whether your Space is public or private
    - Click "Create Space"

3. Use the ***Add files -> Create a new file*** option to add the following files:
    - `requirements.txt`: should list the dependencies needed to run your app, such as `transformers`, `torch`, and `gradio`
    - `app.py`: should contain the steps to
        - import the required packages
        - load your fine-tuned model and tokenizer from the Model Hub
        - define a function that uses your fine-tuned model for response generation
        - create input and output Gradio elements
        - create a Gradio interface object
        - launch the interface to generate the UI

4. Open the `App` tab of your Space to follow the build progress (check the build logs and debug if the build fails)

5. Once the app has built successfully, test the application running on your Space with a user input prompt
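
The `app.py` described in step 3 could look roughly like the sketch below. It is not the notebook's exact code: the notebook's `generate_response` helper is not reproduced here, so the sketch assumes a plain `generate()` call with default decoding, uses `AutoModelForCausalLM` as the loader, and keeps `YOUR-USER-NAME/YOUR-REPO-NAME` as a placeholder for your own model repo.

```python
# app.py (sketch): Gradio front end for a fine-tuned GPT-2 model on the Hub.
# MODEL_REPO is a placeholder; replace it with your own model repo path.
PLACEHOLDER = "YOUR-USER-NAME/YOUR-REPO-NAME"
MODEL_REPO = PLACEHOLDER


def generate_query_response(model, tokenizer, prompt, max_length=200):
    """Generate text for `prompt`, at most `max_length` tokens long (prompt included)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_length=int(max_length),            # the slider may deliver a float
        pad_token_id=tokenizer.eos_token_id,   # assumption: reuse EOS as the pad token
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def main():
    # Imported here so the helper above can be exercised without the heavy dependencies.
    import gradio as gr
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the fine-tuned model and tokenizer from the Model Hub.
    model = AutoModelForCausalLM.from_pretrained(MODEL_REPO)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)

    def respond(prompt, max_length):
        # Uses the loaded model and tokenizer directly, as the exercise asks.
        return generate_query_response(model, tokenizer, prompt, max_length)

    demo = gr.Interface(
        fn=respond,
        inputs=[
            gr.Textbox(label="Enter your medical query"),
            gr.Slider(minimum=50, maximum=500, value=200, step=10,
                      label="Maximum Response Length"),
        ],
        outputs=gr.Textbox(label="Generated Response"),
        title="Medical Query Response Generator",
    )
    demo.launch()


if __name__ == "__main__":
    if MODEL_REPO == PLACEHOLDER:
        # Avoid a failed download while the placeholder is still in place.
        print("Set MODEL_REPO to your own model repo before launching.")
    else:
        main()
```

On a Space, sharing does not need to be enabled in `launch()`, since the Space itself serves the app. The guard at the bottom is only a convenience for this sketch and can be dropped once `MODEL_REPO` points at a real repo.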

