This project fine-tunes a pre-trained GPT-2 language model on a dataset of disease names and their associated symptoms. Here's a summary of the code:

1. **Data Loading and Preprocessing:**
- A dataset of diseases and symptoms is loaded and processed into a DataFrame.
- The symptoms are formatted and joined into a single string for each disease.

2. **Model and Tokenizer Initialization:**
- A pre-trained DistilGPT2 model (a distilled variant of GPT-2) and its tokenizer are loaded.
- The model is moved to the selected device (a GPU if one is available).

3. **Dataset Preparation:**
- A custom PyTorch Dataset class is defined that tokenizes each disease name together with its symptoms as a single text sequence.

4. **Splitting and Loading Data:**
- The dataset is split into training and validation sets.
- DataLoader objects are created for efficient batching of data during training.

5. **Training Setup:**
- Training parameters such as number of epochs, batch size, optimizer, and loss function are defined.
- A DataFrame is initialized to store training metrics.

6. **Training Loop:**
- The model is trained for multiple epochs.
- For each epoch, the model is trained on the training dataset and evaluated on the validation dataset.
- Training and validation losses are computed and stored.

7. **Text Generation:**
- An example input string ("Kidney Failure") is tokenized.
- The fine-tuned model generates text based on this input, producing a sequence of symptoms associated with kidney failure.

In [34]:
#!pip install torch torchtext transformers sentencepiece pandas tqdm datasets

In [2]:
from datasets import load_dataset
import pandas as pd
from tqdm import tqdm
import time

In [3]:
#load data
data_sample = load_dataset("QuyenAnhDE/Diseases_Symptoms")


In [4]:
data_sample

DatasetDict({
    train: Dataset({
        features: ['Code', 'Name', 'Symptoms', 'Treatments'],
        num_rows: 400
    })
})

In [5]:
updated_data = [{'Name':item['Name'], 'Symptoms':item['Symptoms']} for item in data_sample['train']]


In [6]:
df = pd.DataFrame(updated_data)
df.head()

|   | Name | Symptoms |
|---|------|----------|
| 0 | Panic disorder | Palpitations, Sweating, Trembling, Shortness o... |
| 1 | Vocal cord polyp | Hoarseness, Vocal Changes, Vocal Fatigue |
| 2 | Turner syndrome | Short stature, Gonadal dysgenesis, Webbed neck... |
| 3 | Cryptorchidism | Absence or undescended testicle(s), empty scro... |
| 4 | Ethylene glycol poisoning-1 | Nausea, vomiting, abdominal pain, General mala... |


In [7]:
# Split each symptom string on ', ' and re-join it; strings that already use ', ' are left unchanged
df['Symptoms'] = df['Symptoms'].apply(lambda x: ', '.join(x.split(', ')))

In [8]:
df.head()

|   | Name | Symptoms |
|---|------|----------|
| 0 | Panic disorder | Palpitations, Sweating, Trembling, Shortness o... |
| 1 | Vocal cord polyp | Hoarseness, Vocal Changes, Vocal Fatigue |
| 2 | Turner syndrome | Short stature, Gonadal dysgenesis, Webbed neck... |
| 3 | Cryptorchidism | Absence or undescended testicle(s), empty scro... |
| 4 | Ethylene glycol poisoning-1 | Nausea, vomiting, abdominal pain, General mala... |


In [9]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, random_split

In [10]:
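# Prefer CUDA, then Apple's MPS backend, then fall back to the CPU.
# torch.device('mps') never raises on its own, so availability is checked explicitly.
if torch.cuda.is_available():
  device = torch.device('cuda')
elif torch.backends.mps.is_available():
  device = torch.device('mps')
else:
  device = torch.device('cpu')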


In [11]:
device

device(type='cuda')

In [12]:
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
model = GPT2LMHeadModel.from_pretrained('distilgpt2').to(device)


In [13]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [14]:
df.describe()

|        | Name | Symptoms |
|--------|------|----------|
| count  | 400 | 400 |
| unique | 392 | 395 |
| top    | Sciatica | Swelling, pain, dry mouth, bad taste |
| freq   | 3 | 3 |


## Class Components

1. `__init__` Method
- df: A pandas DataFrame containing the dataset.
- tokenizer: A tokenizer object, typically from the Hugging Face Transformers library.
- self.labels: Stores the column names of the DataFrame.
- self.data: Converts the DataFrame into a list of dictionaries, one per row.
- self.tokenizer: Stores the tokenizer for later use.
- self.max_length: Stores the value returned by the `fittest_max_length` method. Note that `__getitem__` does not use this value; it tokenizes to a fixed length of 128.

2. `__len__` Method
- Returns the number of records in the dataset.

3. `__getitem__` Method
- idx: The index of the data item to retrieve.
- x: The value from the first column of the DataFrame at index idx.
- y: The value from the second column of the DataFrame at index idx.
- text: Combines x and y into a single string separated by " | ".
- tokens: Uses the tokenizer to encode the text into token IDs, padding or truncating to a fixed length of 128 tokens, and returns the result as PyTorch tensors.

4. `fittest_max_length` Method
- Estimates a maximum sequence length from the DataFrame.
- max_length: The character length (not the token count) of the longest entry across the first and second columns.
- x: Starts at 2 and doubles until it is greater than or equal to max_length, so the returned value is a power of 2.
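
The doubling rule described above can be tried in isolation. The following is a minimal sketch, assuming a standalone helper named `next_power_of_two` (a hypothetical name, not part of the notebook) and using one row that appears in full in `df.head()`. It measures string length in characters, as the class does.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n (minimum 2), mirroring the doubling loop."""
    x = 2
    while x < n:
        x *= 2
    return x

# One row taken from the df.head() output shown earlier.
names = ["Vocal cord polyp"]
symptoms = ["Hoarseness, Vocal Changes, Vocal Fatigue"]

# Longest entry across both columns, measured in characters (not tokens).
longest = max(len(max(names, key=len)), len(max(symptoms, key=len)))
print(longest, next_power_of_two(longest))
```

Because the count is in characters, the result is only a loose upper bound on the number of tokens the tokenizer would produce for the same text.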



In [15]:
## Dataset preparation
class LanguageDataset(Dataset):
  """Tokenizes each row as '<first column> | <second column>'."""

  def __init__(self, df, tokenizer):
    self.labels = df.columns
    self.data = df.to_dict(orient='records')
    self.tokenizer = tokenizer
    # Stored for reference only; __getitem__ pads/truncates to a fixed 128 tokens
    self.max_length = self.fittest_max_length(df)

  def __len__(self):
    return len(self.data)

  def __getitem__(self, idx):
    x = self.data[idx][self.labels[0]]  # disease name
    y = self.data[idx][self.labels[1]]  # symptoms string
    text = f"{x} | {y}"
    # return_tensors='pt' yields tensors of shape (1, 128) for each item
    tokens = self.tokenizer.encode_plus(text,
                                        return_tensors='pt',
                                        max_length=128,
                                        padding='max_length',
                                        truncation=True)
    return tokens

  def fittest_max_length(self, df):
    # Longest entry in either column, measured in characters
    max_length = max(len(max(df[self.labels[0]], key=len)),
                     len(max(df[self.labels[1]], key=len)))
    # Round up to the next power of two
    x = 2
    while x < max_length:
      x = x * 2
    return x

In [16]:
data_sample = LanguageDataset(df, tokenizer)

In [17]:
data_sample

<__main__.LanguageDataset at 0x7dfa30c22ec0>

In [18]:
train_size = int(0.8*len(data_sample))
val_size = len(data_sample) - train_size

train_data, val_data = random_split(data_sample, [train_size, val_size])
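
The split above is random, so the train/validation membership changes on every run. As a minimal sketch, assuming reproducibility is wanted, `random_split` accepts a seeded `torch.Generator`. The list of integers below is a stand-in for the 400-row dataset, and the seed value 42 is arbitrary.

```python
import torch
from torch.utils.data import random_split

items = list(range(400))  # stand-in for the 400-row dataset

train_size = int(0.8 * len(items))
val_size = len(items) - train_size

# A seeded generator makes the shuffle behind the split repeatable.
generator = torch.Generator().manual_seed(42)
train_split, val_split = random_split(items, [train_size, val_size],
                                      generator=generator)

print(len(train_split), len(val_split))
```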


In [19]:
BATCH_SIZE=8
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_data, batch_size=BATCH_SIZE)
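
Each dataset item is tokenized with `return_tensors='pt'`, so its `input_ids` tensor has a leading dimension of size 1, and the DataLoader stacks items on top of that. The sketch below illustrates the resulting shapes with a hypothetical stand-in dataset (`FakeTokenDataset` is not part of the notebook) so that it runs without downloading a tokenizer.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class FakeTokenDataset(Dataset):
    """Stand-in returning tensors shaped like the tokenizer output, (1, 128)."""

    def __len__(self):
        return 16

    def __getitem__(self, idx):
        return {"input_ids": torch.full((1, 128), idx, dtype=torch.long)}

loader = DataLoader(FakeTokenDataset(), batch_size=8)
batch = next(iter(loader))

stacked = batch["input_ids"]    # shape (8, 1, 128): batch, extra dim, tokens
squeezed = stacked.squeeze(1)   # shape (8, 128): what the model expects

print(stacked.shape, squeezed.shape)
```

This is why the training loop later calls `.squeeze(1)` on `batch['input_ids']` before passing it to the model.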

In [20]:
num_epochs = 8


In [21]:
batch_size = BATCH_SIZE
model_name = 'distilgpt2'
gpu = 0

In [22]:
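# Set the pad token first: GPT-2 defines none, so pad_token_id would otherwise be None below
tokenizer.pad_token = tokenizer.eos_token
# criterion is not used in the loop below; the model computes its own loss when labels are passed
criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)
optimizer = optim.AdamW(model.parameters(), lr=5e-4)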

In [23]:
results = pd.DataFrame(columns=['epoch', 'transformer', 'batch_size', 'gpu', 'training_loss', 'val_loss', 'epoch_duration_sec'])

### Training Loop:
- This loop iterates over each epoch for training the model. The number of epochs is determined by the num_epochs variable.
- start_time: Records the current time to calculate the duration of the epoch.
- model.train(): Sets the model to training mode, which enables dropout. Gradient tracking is on by default and is not controlled by this call.
- epoch_training_loss: Initializes the cumulative training loss for the epoch.
- train_iterator: Initializes a progress bar (tqdm) to visualize the progress of training. It iterates over the training data batches (train_loader).

The inner loop iterates over each batch in the training data.
- optimizer.zero_grad(): Clears the gradients of all optimized tensors.
- inputs = batch['input_ids'].squeeze(1).to(device): Fetches the input IDs from the batch, removes the extra dimension of size 1 left by per-item tokenization, and moves them to the selected device.
- targets = inputs.clone(): Copies the input tokens to use them as targets for computing the loss.
- outputs = model(input_ids=inputs, labels=targets): Performs a forward pass. Because labels are passed, the model shifts them internally and returns the next-token cross-entropy loss in outputs.loss.
- loss.backward(): Backpropagates the loss to compute gradients.
- optimizer.step(): Updates the model parameters based on the computed gradients.
- train_iterator.set_postfix({'Training Loss': loss.item()}): Updates the progress bar with the current training loss.
- epoch_training_loss += loss.item(): Accumulates the training loss for the epoch.

- After the inner loop, the average training loss for the epoch is calculated by dividing the cumulative loss by the number of batches in the training data.
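
Since the targets are a plain copy of the padded inputs, the padded positions also count toward the loss. Hugging Face models ignore label positions set to -100, so one possible refinement (an assumption, not what this notebook does) is to mask the padding using the attention mask. The mask is used rather than a comparison with the pad ID because the pad token here is the same as the EOS token. The token IDs below are taken from the generation output later in the notebook; the attention mask is made up for illustration.

```python
import torch

# "Kidney Failure |" followed by two padding positions (pad id == eos id == 50256).
input_ids = torch.tensor([[48374, 1681, 25743, 930, 50256, 50256]])
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])  # illustrative mask

# Copy the inputs as labels, then hide the padded positions from the loss.
labels = input_ids.clone()
labels[attention_mask == 0] = -100

print(labels)
```

In the training loop this would require also reading `batch['attention_mask']`, which the tokenizer already returns alongside `input_ids`.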

## Validation Phase:
- model.eval(): Sets the model to evaluation mode, which disables dropout (GPT-2 uses layer normalization, which behaves the same in both modes).
- epoch_validation_loss: Initializes the cumulative validation loss for the epoch.
- valid_iterator: Initializes a progress bar for the validation data.

The loop iterates over each batch in the validation data without computing gradients (torch.no_grad() context).

Similar to the training phase, it computes the validation loss using the provided inputs and targets.

It updates the progress bar with the current validation loss and accumulates the validation loss for the epoch.

Calculates the average validation loss for the epoch by dividing the cumulative loss by the number of batches in the validation data.

In [24]:
# This loop iterates over each epoch for training the model. The number of epochs is determined by the num_epochs variable.
for epoch in range(num_epochs):
  start_time = time.time()
  model.train()
  epoch_training_loss = 0
  train_iterator = tqdm(train_loader, desc=f"Training epoch {epoch+1}/{num_epochs} Batch Size:{batch_size}, Transformer:{model_name}")
  for batch in train_iterator:
    optimizer.zero_grad()
    inputs = batch['input_ids'].squeeze(1).to(device)
    targets = inputs.clone()
    outputs = model(input_ids=inputs,
                    labels=targets)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    train_iterator.set_postfix({'Training Loss':loss.item()})
    epoch_training_loss += loss.item()
  avg_epoch_training_loss = epoch_training_loss / len(train_iterator)

  #validation
  model.eval()
  epoch_validation_loss = 0
  total_loss = 0
  valid_iterator = tqdm(val_loader, desc=f"Validation epoch {epoch+1}/{num_epochs}")

  with torch.no_grad():
    for batch in valid_iterator:
      inputs = batch['input_ids'].squeeze(1).to(device)
      targets = inputs.clone()
      outputs = model(input_ids=inputs,
                      labels=targets)
      loss = outputs.loss
      total_loss += loss
      # The value shown is the validation loss; the postfix label still reads 'Training Loss'
      valid_iterator.set_postfix({'Training Loss':loss.item()})
      epoch_validation_loss += loss.item()
  avg_epoch_validation_loss = epoch_validation_loss / len(valid_iterator)

  end_time = time.time()
  epoch_duration_sec = end_time - start_time

  # Keys must match the column names of `results`, otherwise those columns are left empty
  new_row = {
      'transformer':model_name,
      'batch_size':batch_size,
      'gpu':gpu,
      'epoch':epoch+1,
      'training_loss':avg_epoch_training_loss,
      'val_loss':avg_epoch_validation_loss,
      'epoch_duration_sec': epoch_duration_sec
      }
  results.loc[len(results)] = new_row
  print(f"Epoch: {epoch+1}, Validation Loss: {total_loss/len(val_loader)}")

Training epoch 1/8 Batch Size:8, Transformer:distilgpt2: 100%|██████████| 40/40 [00:10<00:00,  3.83it/s, Training Loss=0.61]
Validation epoch 1/8: 100%|██████████| 10/10 [00:00<00:00, 17.16it/s, Training Loss=0.498]


Epoch: 1, Validation Loss: 0.6434598565101624


Training epoch 2/8 Batch Size:8, Transformer:distilgpt2: 100%|██████████| 40/40 [00:08<00:00,  4.90it/s, Training Loss=0.478]
Validation epoch 2/8: 100%|██████████| 10/10 [00:00<00:00, 16.95it/s, Training Loss=0.507]


Epoch: 2, Validation Loss: 0.6171140074729919


Training epoch 3/8 Batch Size:8, Transformer:distilgpt2: 100%|██████████| 40/40 [00:08<00:00,  4.81it/s, Training Loss=0.317]
Validation epoch 3/8: 100%|██████████| 10/10 [00:00<00:00, 16.95it/s, Training Loss=0.525]


Epoch: 3, Validation Loss: 0.62691330909729


Training epoch 4/8 Batch Size:8, Transformer:distilgpt2: 100%|██████████| 40/40 [00:08<00:00,  4.92it/s, Training Loss=0.43]
Validation epoch 4/8: 100%|██████████| 10/10 [00:00<00:00, 17.55it/s, Training Loss=0.581]


Epoch: 4, Validation Loss: 0.6629461646080017


Training epoch 5/8 Batch Size:8, Transformer:distilgpt2: 100%|██████████| 40/40 [00:07<00:00,  5.03it/s, Training Loss=0.23]
Validation epoch 5/8: 100%|██████████| 10/10 [00:00<00:00, 16.84it/s, Training Loss=0.596]


Epoch: 5, Validation Loss: 0.7050497531890869


Training epoch 6/8 Batch Size:8, Transformer:distilgpt2: 100%|██████████| 40/40 [00:07<00:00,  5.07it/s, Training Loss=0.225]
Validation epoch 6/8: 100%|██████████| 10/10 [00:00<00:00, 17.80it/s, Training Loss=0.65]


Epoch: 6, Validation Loss: 0.754351794719696


Training epoch 7/8 Batch Size:8, Transformer:distilgpt2: 100%|██████████| 40/40 [00:07<00:00,  5.09it/s, Training Loss=0.195]
Validation epoch 7/8: 100%|██████████| 10/10 [00:00<00:00, 17.81it/s, Training Loss=0.713]


Epoch: 7, Validation Loss: 0.786099374294281


Training epoch 8/8 Batch Size:8, Transformer:distilgpt2: 100%|██████████| 40/40 [00:07<00:00,  5.10it/s, Training Loss=0.163]
Validation epoch 8/8: 100%|██████████| 10/10 [00:00<00:00, 17.65it/s, Training Loss=0.76]

Epoch: 8, Validation Loss: 0.8345441818237305





## Adding Training Results to DataFrame:
- Creates a dictionary new_row containing training results such as model name, batch size, GPU index, epoch number, average training loss, average validation loss, and epoch duration.
- Appends this dictionary as a new row to the results DataFrame.
- Prints the epoch number and validation loss.
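
The per-epoch validation losses printed in the training log can be inspected to see where the model generalizes best. The sketch below copies those eight values from the log (rounded to four decimal places) and finds the epoch with the lowest one. It is a simple post-hoc check, not part of the notebook's training code.

```python
# Validation losses copied from the training log (rounded to 4 decimal places).
val_losses = [0.6435, 0.6171, 0.6269, 0.6629, 0.7050, 0.7544, 0.7861, 0.8345]

# Epochs are 1-based in the log, list indices are 0-based.
best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
best_epoch = best_index + 1

print(f"Lowest validation loss {val_losses[best_index]} at epoch {best_epoch}")
```

In this run the validation loss rises after its minimum while the training loss keeps falling, which suggests that the later epochs overfit the 320 training examples.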

In [25]:
input_str = 'Kidney Failure'


In [26]:
input_ids = tokenizer.encode(input_str,
                             return_tensors='pt',
                             ).to(device)
input_ids

tensor([[48374,  1681, 25743]], device='cuda:0')

## Generating Text:
- Defines an input string ('Kidney Failure') to generate text based on.
- Tokenizes the input string and converts it into tensor format (input_ids).
- Uses the model's generate method to generate text based on the provided input IDs.
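  * max_length: Maximum total length of the output in tokens, including the prompt.
  * num_return_sequences: Number of sequences to generate.
  * do_sample: Whether to sample from the predicted distribution instead of always picking the most probable token.
  * top_k: Restricts sampling to the k most probable tokens.
  * top_p: Nucleus sampling; restricts sampling to the smallest set of tokens whose cumulative probability reaches top_p.
  * temperature: Scales the logits before sampling; values below 1 make the output less random.
  * repetition_penalty: Penalizes tokens that have already appeared; values above 1 discourage repetition.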

- Decodes the generated output into human-readable text (decoded_output), skipping special tokens.
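
To make the effect of top_k and top_p concrete, here is a simplified pure-Python illustration of the two filters applied in sequence. It is a sketch of the idea only, not the Transformers implementation; the function name `filter_top_k_top_p` and the probabilities are made up for the example.

```python
def filter_top_k_top_p(probs, top_k, top_p):
    """Return a renormalised distribution after top-k and then top-p filtering."""
    # Top-k: keep only the k most probable tokens, then renormalise.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise the surviving tokens so they sum to 1.
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

# Made-up next-token probabilities, for illustration only.
probs = {"fatigue": 0.5, "nausea": 0.3, "fever": 0.15, "rash": 0.05}
filtered = filter_top_k_top_p(probs, top_k=3, top_p=0.8)
print(filtered)
```

With these numbers, top_k=3 drops "rash" and top_p=0.8 then drops "fever", so sampling would choose between "fatigue" and "nausea" only.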

In [28]:
output = model.generate(
    input_ids,
    max_length=20,
    num_return_sequences=1,
    do_sample = True,
    top_k=8,
    top_p=0.95,
    temperature=0.5,
    repetition_penalty=1.2
)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
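
The warning above can usually be avoided by passing `attention_mask` and `pad_token_id` to `generate` explicitly. The sketch below shows the call shape on a tiny, randomly initialized GPT-2 (an assumption made so the example runs without downloading weights; its output is meaningless). The config sizes and token IDs are arbitrary.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random model as a stand-in for the fine-tuned distilgpt2.
config = GPT2Config(vocab_size=64, n_positions=32, n_embd=16,
                    n_layer=1, n_head=2, bos_token_id=0, eos_token_id=0)
tiny_model = GPT2LMHeadModel(config).eval()

prompt_ids = torch.tensor([[5, 6, 7]])       # arbitrary prompt token ids
prompt_mask = torch.ones_like(prompt_ids)    # every prompt position is real text

generated = tiny_model.generate(
    prompt_ids,
    attention_mask=prompt_mask,              # tells the model which tokens to attend to
    pad_token_id=config.eos_token_id,        # set explicitly, as the warning suggests
    max_length=10,
    do_sample=False,
)
print(generated.shape)
```

With the notebook's tokenizer, calling `tokenizer(input_str, return_tensors='pt')` returns both `input_ids` and `attention_mask`, which can be passed to `generate` in the same way.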


In [29]:
output

tensor([[48374,  1681, 25743,   930, 36400,   839, 18922,  5072,    11, 11711,
         21545,    11, 18787, 50256]], device='cuda:0')

In [30]:
decoded_output = tokenizer.decode(output[0],
                                  skip_special_tokens = True)

In [31]:
decoded_output

'Kidney Failure | Decreased urine output, fluid retention, fatigue'

In [32]:
torch.save(model, 'SmallDiseaseLM.pt')

In [33]:
torch.save(model, 'drive/MyDrive/SmallDiseaseLM.pt')
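
Saving the whole model object with `torch.save(model, ...)` pickles the class along with the weights, so loading it later depends on the same code being importable. A common alternative is to save only the `state_dict`. The sketch below demonstrates the round trip on a small `nn.Linear` stand-in (an assumption to keep the example self-contained) and a temporary file path.

```python
import os
import tempfile

import torch
import torch.nn as nn

toy_model = nn.Linear(4, 2)  # stand-in for the fine-tuned model

# Save only the parameters, not the pickled Python object.
path = os.path.join(tempfile.mkdtemp(), "weights.pt")
torch.save(toy_model.state_dict(), path)

# Rebuild the architecture, then load the saved parameters into it.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path))

print(torch.equal(toy_model.weight, restored.weight))
```

For Hugging Face models, `model.save_pretrained(...)` and `tokenizer.save_pretrained(...)` are another option, writing the weights and configuration to a directory that `from_pretrained` can read back.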