Based on the tutorial: https://www.youtube.com/watch?v=1ILVm4IeNY8&t=1s

This notebook demonstrates fine-tuning a small pretrained language model (`distilgpt2`) on employee name and role data using the Hugging Face transformers library. It covers data loading, preprocessing, model training, and text generation.

In [None]:
!pip install torch torchtext transformers sentencepiece pandas tqdm datasets

In [26]:
import pandas as pd

# Read the CSV file into a pandas DataFrame
df = pd.read_csv('/content/employee_data.csv')

# Display the first few records of the DataFrame
display(df.head())

# Display the last few records of the DataFrame
display(df.tail())

Unnamed: 0,name,role
0,Jolly,AI Consultant
1,Rajesh,AI Engineer
2,Madhuprabhudeva,AI Architect
3,Shankar,Machine Learning Engineer
4,Vikram Bundela,SEO Specialist


Unnamed: 0,name,role
1999,Gaurav Mishra,Business Development Manager
2000,Farhan Bhat,Sustainability Consultant
2001,Aarav Rahman,Sustainability Consultant
2002,Yashika Dwivedi,Procurement Specialist
2003,Sanya Shah,Administrative Assistant


This code fine-tunes a small language model with the Hugging Face transformers library. It wraps the pandas DataFrame in a custom dataset that tokenizes each row as a `name | role` string, builds training and validation data loaders from an 80/20 split, and fine-tunes the pretrained `distilgpt2` model (a distilled GPT-2) for 10 epochs. After training, it generates text from a sample input string and saves the model weights.
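
To make the preprocessing concrete, here is a minimal, library-free sketch of two steps the dataset class performs: joining each row into a single `name | role` string, and rounding the longest field length up to a power of two. The helper names `format_record` and `next_power_of_two` are hypothetical, and lengths are counted in characters, as in the notebook's `fittest_max_length`.

```python
def format_record(name, role):
    """Join one row into the single training string the model sees."""
    return f"{name} | {role}"

def next_power_of_two(n):
    """Smallest power of two that is >= n, starting from 2."""
    x = 2
    while x < n:
        x *= 2
    return x

# Two rows taken from the head of the DataFrame shown earlier
rows = [("Jolly", "AI Consultant"), ("Shankar", "Machine Learning Engineer")]
texts = [format_record(name, role) for name, role in rows]
longest_field = max(len(field) for row in rows for field in row)

print(texts[0])                          # Jolly | AI Consultant
print(longest_field)                     # 25
print(next_power_of_two(longest_field))  # 32
```

Because the model only ever sees strings in this shape, prompting it with a bare name tends to make it continue with ` | ` followed by a role.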

In [27]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm import tqdm
import time

# If you have an NVIDIA GPU attached, use 'cuda'
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    # Fallback to 'cpu' if CUDA is not available
    device = torch.device('cpu')

device

# The tokenizer turns texts to numbers (and vice-versa)
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')

# The transformer
model = GPT2LMHeadModel.from_pretrained('distilgpt2').to(device)

model

# Model params
BATCH_SIZE = 8

df.describe()

# Dataset Prep
class LanguageDataset(Dataset):
    """
    An extension of the Dataset object to:
      - Make training loop cleaner
      - Make ingestion easier from pandas df's
    """
    def __init__(self, df, tokenizer):
        self.labels = df.columns
        self.data = df.to_dict(orient='records')
        self.tokenizer = tokenizer
        x = self.fittest_max_length(df)
        self.max_length = x

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x = self.data[idx][self.labels[0]]
        y = self.data[idx][self.labels[1]]
        text = f"{x} | {y}"
        tokens = self.tokenizer.encode_plus(text, return_tensors='pt', max_length=128, padding='max_length', truncation=True)
        return tokens

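    def fittest_max_length(self, df):
        """
        Smallest power of two that is at least as large as the longest
        name or role in the data set, measured in characters (not tokens).
        A tight max length shortens padded sequences and speeds up training.
        Note: __getitem__ currently pads to a fixed 128 tokens rather than
        using this value.
        """
        max_length = max(len(max(df[self.labels[0]], key=len)), len(max(df[self.labels[1]], key=len)))
        x = 2
        while x < max_length:
            x = x * 2
        return x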

# Wrap the pandas DataFrame in the LanguageDataset defined above
data_sample = LanguageDataset(df, tokenizer)

data_sample

# Create train, valid
train_size = int(0.8 * len(data_sample))
valid_size = len(data_sample) - train_size
train_data, valid_data = random_split(data_sample, [train_size, valid_size])

# Make the iterators
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(valid_data, batch_size=BATCH_SIZE)

# Set the number of epochs
num_epochs = 10

# Training parameters
batch_size = BATCH_SIZE
model_name = 'distilgpt2'
# Record which GPU (if any) is in use, for the results table
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else None

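# GPT-2 has no pad token by default, so reuse the end-of-sequence token.
# This must be set before anything reads tokenizer.pad_token_id.
tokenizer.pad_token = tokenizer.eos_token

# Set the learning rate and loss function
## CrossEntropyLoss measures how close the predictions are to the truth.
## It is more punishing for high-confidence wrong answers.
## Note: the loop below uses the loss returned by the model (outputs.loss),
## so this criterion is not actually used.
criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)
optimizer = optim.Adam(model.parameters(), lr=5e-4)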

# Init a results dataframe
results = pd.DataFrame(columns=['epoch', 'transformer', 'batch_size', 'gpu',
                                'training_loss', 'validation_loss', 'epoch_duration_sec'])

# The training loop
for epoch in range(num_epochs):
    start_time = time.time()  # Start the timer for the epoch

    # Training
    ## This line tells the model we're in 'learning mode'
    model.train()
    epoch_training_loss = 0
    train_iterator = tqdm(train_loader, desc=f"Training Epoch {epoch+1}/{num_epochs} Batch Size: {batch_size}, Transformer: {model_name}")
    for batch in train_iterator:
        optimizer.zero_grad()
        inputs = batch['input_ids'].squeeze(1).to(device)
        targets = inputs.clone()
        outputs = model(input_ids=inputs, labels=targets)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        train_iterator.set_postfix({'Training Loss': loss.item()})
        epoch_training_loss += loss.item()
    avg_epoch_training_loss = epoch_training_loss / len(train_iterator)

    # Validation
    ## model.eval() switches off dropout; torch.no_grad() below disables gradient tracking
    model.eval()
    epoch_validation_loss = 0
    valid_iterator = tqdm(valid_loader, desc=f"Validation Epoch {epoch+1}/{num_epochs}")
    with torch.no_grad():
        for batch in valid_iterator:
            inputs = batch['input_ids'].squeeze(1).to(device)
            targets = inputs.clone()
            outputs = model(input_ids=inputs, labels=targets)
            loss = outputs.loss
            valid_iterator.set_postfix({'Validation Loss': loss.item()})
            epoch_validation_loss += loss.item()

    avg_epoch_validation_loss = epoch_validation_loss / len(valid_loader)

    end_time = time.time()  # End the timer for the epoch
    epoch_duration_sec = end_time - start_time  # Calculate the duration in seconds

    new_row = {'transformer': model_name,
               'batch_size': batch_size,
               'gpu': gpu,
               'epoch': epoch+1,
               'training_loss': avg_epoch_training_loss,
               'validation_loss': avg_epoch_validation_loss,
               'epoch_duration_sec': epoch_duration_sec}  # Add epoch_duration to the dataframe

    results.loc[len(results)] = new_row
    print(f"Epoch: {epoch+1}, Validation Loss: {avg_epoch_validation_loss}")

# Test the model on a sample input after training
input_str = "Jay Jadeja"
input_ids = tokenizer.encode(input_str, return_tensors='pt').to(device)

output = model.generate(
    input_ids,
    max_length=20,
    num_return_sequences=1,
    do_sample=True,
    top_k=8,
    top_p=0.95,
    temperature=0.5,
    repetition_penalty=1.2
)

decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)
print(decoded_output)

torch.save(model.state_dict(), 'SmallMedLM.pt')


Training Epoch 1/10 Batch Size: 8, Transformer: distilgpt2: 100%|██████████| 201/201 [00:42<00:00,  4.71it/s, Training Loss=0.126]
Validation Epoch 1/10: 100%|██████████| 51/51 [00:02<00:00, 18.07it/s, Validation Loss=0.0996]


Epoch: 1, Validation Loss: 0.10037986934185028


Training Epoch 2/10 Batch Size: 8, Transformer: distilgpt2: 100%|██████████| 201/201 [00:39<00:00,  5.08it/s, Training Loss=0.085]
Validation Epoch 2/10: 100%|██████████| 51/51 [00:03<00:00, 13.48it/s, Validation Loss=0.0979]


Epoch: 2, Validation Loss: 0.08710290491580963


Training Epoch 3/10 Batch Size: 8, Transformer: distilgpt2: 100%|██████████| 201/201 [00:40<00:00,  4.93it/s, Training Loss=0.0777]
Validation Epoch 3/10: 100%|██████████| 51/51 [00:02<00:00, 17.65it/s, Validation Loss=0.0664]


Epoch: 3, Validation Loss: 0.0838390439748764


Training Epoch 4/10 Batch Size: 8, Transformer: distilgpt2: 100%|██████████| 201/201 [00:39<00:00,  5.03it/s, Training Loss=0.0757]
Validation Epoch 4/10: 100%|██████████| 51/51 [00:02<00:00, 17.95it/s, Validation Loss=0.0703]


Epoch: 4, Validation Loss: 0.08400126546621323


Training Epoch 5/10 Batch Size: 8, Transformer: distilgpt2: 100%|██████████| 201/201 [00:40<00:00,  5.01it/s, Training Loss=0.0856]
Validation Epoch 5/10: 100%|██████████| 51/51 [00:02<00:00, 17.81it/s, Validation Loss=0.0925]


Epoch: 5, Validation Loss: 0.08276578783988953


Training Epoch 6/10 Batch Size: 8, Transformer: distilgpt2: 100%|██████████| 201/201 [00:40<00:00,  5.01it/s, Training Loss=0.0748]
Validation Epoch 6/10: 100%|██████████| 51/51 [00:02<00:00, 17.66it/s, Validation Loss=0.0818]


Epoch: 6, Validation Loss: 0.08454838395118713


Training Epoch 7/10 Batch Size: 8, Transformer: distilgpt2: 100%|██████████| 201/201 [00:39<00:00,  5.04it/s, Training Loss=0.0698]
Validation Epoch 7/10: 100%|██████████| 51/51 [00:02<00:00, 17.89it/s, Validation Loss=0.0852]


Epoch: 7, Validation Loss: 0.08764749765396118


Training Epoch 8/10 Batch Size: 8, Transformer: distilgpt2: 100%|██████████| 201/201 [00:40<00:00,  4.99it/s, Training Loss=0.0733]
Validation Epoch 8/10: 100%|██████████| 51/51 [00:02<00:00, 17.41it/s, Validation Loss=0.0827]


Epoch: 8, Validation Loss: 0.08771725744009018


Training Epoch 9/10 Batch Size: 8, Transformer: distilgpt2: 100%|██████████| 201/201 [00:40<00:00,  4.95it/s, Training Loss=0.0633]
Validation Epoch 9/10: 100%|██████████| 51/51 [00:02<00:00, 17.68it/s, Validation Loss=0.0919]


Epoch: 9, Validation Loss: 0.09206731617450714


Training Epoch 10/10 Batch Size: 8, Transformer: distilgpt2: 100%|██████████| 201/201 [00:40<00:00,  5.01it/s, Training Loss=0.0769]
Validation Epoch 10/10: 100%|██████████| 51/51 [00:02<00:00, 17.88it/s, Validation Loss=0.101]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Epoch: 10, Validation Loss: 0.0933331847190857
Jay Jadeja | Speech Language Pathologist
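
In the log above, the validation loss reaches its minimum at epoch 5 and drifts upward afterwards, which may indicate that the later epochs are starting to overfit. A minimal sketch of how one might pick the best epoch from those numbers follows. The loss values are copied from the log; `best_epoch`, `early_stop_epoch`, and the `patience` rule are illustrative assumptions, not something the notebook implements.

```python
# Per-epoch validation losses copied from the training log above
validation_losses = [
    0.10037986934185028, 0.08710290491580963, 0.0838390439748764,
    0.08400126546621323, 0.08276578783988953, 0.08454838395118713,
    0.08764749765396118, 0.08771725744009018, 0.09206731617450714,
    0.0933331847190857,
]

def best_epoch(losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(losses)), key=losses.__getitem__) + 1

def early_stop_epoch(losses, patience=3):
    """Return the 1-based epoch at which training would stop: the first epoch
    where the loss has failed to improve `patience` epochs in a row.
    Returns None if that never happens."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None

print(best_epoch(validation_losses))        # 5
print(early_stop_epoch(validation_losses))  # 8
```

Under these assumptions, saving a checkpoint whenever the validation loss improves would keep the epoch 5 weights rather than the epoch 10 weights that the notebook saves.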


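This code reloads the fine-tuned model by loading the weights saved in the previous step into a fresh `distilgpt2` model. The tokenizer was not saved, so it is loaded again from the pretrained `distilgpt2` checkpoint. The code then reads a name from the user, tokenizes it, and lets the model generate a continuation, which should take the form `name | role`. Finally, it decodes the output back into human-readable text and prints it.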

In [33]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
tokenizer.pad_token = tokenizer.eos_token

# Define the device
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# Instantiate a new model with the same configuration
model = GPT2LMHeadModel.from_pretrained('distilgpt2')

# Load the saved state dictionary; map_location lets weights saved on a GPU
# load on a CPU-only machine as well
model.load_state_dict(torch.load('SmallMedLM.pt', map_location=device))
model.to(device)
model.eval()  # Set the model to evaluation mode

# Get input from the user
input_str = input("Enter the name: ")

# Tokenize the input; calling the tokenizer returns input_ids and attention_mask
encoded = tokenizer(input_str, return_tensors='pt').to(device)
input_ids = encoded['input_ids']

print(input_ids)
print(tokenizer.decode(input_ids[0]))

# Generate output
output = model.generate(
    input_ids,
    attention_mask=encoded['attention_mask'],  # mark which positions are real tokens
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token, so reuse EOS
    max_length=50,  # You can adjust the max length
    num_return_sequences=1,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
    repetition_penalty=1.2
)

# Decode and print the output
decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)
print("Role:", decoded_output)

Enter the name: Yashika Dwivedi


tensor([[   56,  1077,  9232, 19113,  1572,    72]], device='cuda:0')
Yashika Dwivedi
Role: Yashika Dwivedi | Financial Analyst
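
The `generate` calls above combine `temperature`, `top_k`, and `top_p`. As a rough intuition for what those settings do to the next-token distribution, here is a small pure-Python sketch on a made-up five-word vocabulary. It is not the transformers implementation (which operates on tensors and also applies `repetition_penalty`); the function `filter_distribution` and the toy logits are assumptions for illustration only.

```python
import math

def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    # 1. Temperature: divide logits before softmax; values < 1 sharpen the distribution
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # 2. Top-k: keep only the k highest-scoring tokens (0 means keep all)
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability)
    max_logit = ranked[0][1]
    exps = [(tok, math.exp(val - max_logit)) for tok, val in ranked]
    total = sum(e for _, e in exps)
    probs = [(tok, e / total) for tok, e in exps]
    # 3. Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1
    kept_total = sum(p for _, p in kept)
    return {tok: p / kept_total for tok, p in kept}

# Hypothetical scores for the token that follows "Yashika Dwivedi | "
toy_logits = {"Analyst": 2.0, "Engineer": 1.5, "Manager": 1.0,
              "Consultant": 0.5, "Intern": -1.0}

filtered = filter_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.95)
narrower = filter_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.8)
print(list(filtered))  # ['Analyst', 'Engineer', 'Manager']
print(list(narrower))  # ['Analyst', 'Engineer']
```

Because the next token is sampled from whatever survives this filtering, the generated role can differ from run to run and need not match the role recorded for that name in the data.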
