## Architectural Overview

> GPT-2 (Generative Pre-trained Transformer 2)

- It is a language model built on a decoder-only transformer architecture. Its key components are the input embeddings, a stack of transformer blocks, and an output layer. Unlike the original encoder-decoder transformer, GPT-2 has no separate encoder.
- Input Embedding: The input text is converted to numerical representations that the model can work with. An embedding layer maps each token in the input sequence to a high-dimensional vector, and a learned positional embedding is added so the model knows the order of the tokens.
- Transformer blocks: GPT-2 consists of multiple identical blocks stacked on top of each other (12 in the `gpt2` checkpoint used here). Each block has two sub-layers: a self-attention mechanism and a feed-forward network, with residual connections and layer normalization around them. The self-attention mechanism lets the model weigh the importance of different tokens in the input sequence, capturing the dependencies and relationships between them. The feed-forward network processes the self-attention outputs to produce richer representations.
- Causal masking: The self-attention in every block is masked so that each position can only attend to itself and to previous tokens. This conditioning on the preceding context is what enables autoregressive generation: the model predicts the next token in the sequence from the tokens it has seen so far.
- Output layer: The final layer of GPT-2 is a linear transformation followed by a softmax activation function. It produces a probability distribution over the vocabulary (50,257 tokens) for the next token in the sequence. The model generates text by sampling from this distribution or by choosing the token with the highest probability.
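To make the output layer concrete, here is a minimal pure-Python sketch of turning raw scores into a probability distribution with softmax and then picking a token, either greedily or by sampling. The four-word vocabulary, the logit values, and the helper names (`softmax`, `greedy_pick`, `sample_pick`) are made up for illustration and are not part of GPT-2 or `transformers`.

```python
import math
import random

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(probs):
    """Return the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_pick(probs, rng):
    """Draw a token index at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits for a toy 4-token vocabulary
toy_vocab = ["the", "cat", "sat", "mat"]
toy_logits = [2.0, 0.5, 1.0, -1.0]

probs = softmax(toy_logits)
greedy_token = toy_vocab[greedy_pick(probs)]
sampled_token = toy_vocab[sample_pick(probs, random.Random(0))]
print(probs)
print("greedy:", greedy_token, "| sampled:", sampled_token)
```

Greedy picking always returns the highest-scoring token, while sampling can return any token with non-zero probability, which is why sampled text tends to be more varied.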

In [2]:
import torch
from torch.utils.data import DataLoader
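# AdamW is no longer exported by recent transformers releases; use PyTorch's
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, GPT2Tokenizer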
from datasets import load_dataset




This snippet imports the libraries the notebook needs: `torch` for PyTorch functionality, `DataLoader` for batching, `GPT2LMHeadModel` and `GPT2Tokenizer` from `transformers` for the GPT-2 model and tokenizer, `AdamW` for the optimizer, and `load_dataset` from `datasets` to load the Kaggle dataset from its CSV file.


In [3]:
# Load and preprocess the dataset
dataset = load_dataset("csv", data_files="/kaggle/input/all-the-news/articles1.csv")
text_samples = dataset["train"]["content"]



This code snippet loads the dataset. We use the `load_dataset` function from the `datasets` library to read the CSV file, which is exposed as a single `train` split. We then pull the `content` column (the article text) out of that split and store it in the `text_samples` variable.


In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'id', 'title', 'publication', 'author', 'date', 'year', 'month', 'url', 'content'],
        num_rows: 50000
    })
})

In [5]:
# Initialize the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained('gpt2')


In this snippet, we initialize the GPT-2 tokenizer and model. We use `GPT2Tokenizer.from_pretrained` to load the tokenizer of the pre-trained `'gpt2'` checkpoint, and `GPT2LMHeadModel.from_pretrained` to load the model itself. GPT-2 does not define a padding token, so we reuse its end-of-sequence token by setting `tokenizer.pad_token = tokenizer.eos_token`.

In [6]:
# Tokenize and encode the dataset
def tokenize_function(example):
    return tokenizer(example["content"], truncation=True, max_length=512, padding="max_length")

tokenized_dataset = dataset.map(tokenize_function, batched=True)




This code snippet tokenizes and encodes the dataset. We define a `tokenize_function` that applies the tokenizer to the `content` field. The tokenizer splits the text into tokens, truncates it to a maximum of 512 tokens, and pads shorter sequences up to that same length because of the `padding="max_length"` argument. It returns the `input_ids` together with an `attention_mask` that marks real tokens with 1 and padding with 0. Finally, we apply `tokenize_function` to the dataset with the `map` method. With `batched=True`, the function receives a batch of examples at a time instead of a single example, which speeds up tokenization.
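The truncation and padding described above can be sketched in plain Python. This is only an illustration of the idea, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, it assumes padding on the right, and it uses a tiny `max_length` of 5 instead of 512 so the result is easy to read.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                     # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad               # padding on the right
    attention_mask = [1] * len(kept) + [0] * n_pad    # 1 = real token, 0 = padding
    return input_ids, attention_mask

PAD_ID = 50256  # GPT-2's EOS id, reused as the pad token in this notebook

# A sequence shorter than max_length gets padded
short_ids, short_mask = pad_or_truncate([357, 18474, 8], max_length=5, pad_id=PAD_ID)
# A sequence longer than max_length gets cut off
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=5, pad_id=PAD_ID)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what later tells the model which positions are real text and which are filler.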


In [7]:
def collate_fn(batch):
    input_ids = [item["input_ids"] for item in batch]
    attention_masks = [item["attention_mask"] for item in batch]
    labels = [item["input_ids"] for item in batch]

    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)
    labels = torch.tensor(labels)

    # Pad sequences to the same length
    input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=True)
    attention_masks = torch.nn.utils.rnn.pad_sequence(attention_masks, batch_first=True)
    labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True)

    return {
        "input_ids": input_ids,
        "attention_mask": attention_masks,
        "labels": labels,
    }

The `collate_fn` function described here is a custom collate function for a PyTorch DataLoader. It takes a batch of data samples and processes them to ensure that sequences within the batch have the same length, suitable for training a language model like GPT-2.

Here is a step-by-step description of the `collate_fn` function:

1. Extracts the `"input_ids"` and `"attention_mask"` from each item in the batch, and builds the `labels` as a copy of the `"input_ids"`, since for language modelling the target is the input text itself.

2. Converts the extracted lists into tensors using `torch.tensor()`. This step is necessary because `pad_sequence` expects tensors as input.

3. Applies `torch.nn.utils.rnn.pad_sequence()` to the `input_ids`, `attention_masks`, and `labels` tensors. With `batch_first=True`, `pad_sequence` pads shorter sequences with zeros up to the length of the longest sequence in the batch. In this notebook the tokenizer has already padded every example to 512 tokens, so this step should leave the tensors unchanged.

4. Returns a dictionary containing the padded `input_ids`, `attention_mask`, and `labels` tensors.
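One refinement worth considering: because the labels are a plain copy of `input_ids`, the padded positions also count towards the loss. A common remedy is to set the labels at padded positions to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists. `mask_padding_labels` is a hypothetical helper and is not what the notebook's `collate_fn` does. On tensors, the equivalent would be along the lines of `labels[attention_mask == 0] = -100`.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids, replacing padded positions (mask == 0) with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]

# One padded example: three real tokens followed by two padding tokens
example_ids = [14282, 16381, 8783, 50256, 50256]
example_mask = [1, 1, 1, 0, 0]

example_labels = mask_padding_labels(example_ids, example_mask)
print(example_labels)  # [14282, 16381, 8783, -100, -100]
```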


In [8]:
# Prepare the data for training
train_dataset = tokenized_dataset["train"]
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)


This snippet prepares the tokenized dataset for training. We extract the `train` split of the tokenized dataset and assign it to the `train_dataset` variable. Then we create a data loader using `DataLoader`, passing the `train_dataset`, setting the `batch_size` to 4, and enabling shuffling with `shuffle=True`. We also pass `collate_fn=collate_fn` so that each batch's lists of token ids are converted into tensors. No further padding is needed at this point, because the tokenizer has already padded every example to `max_length`.


In [9]:
# Set up the training parameters
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = AdamW(model.parameters(), lr=1e-5)



This code snippet sets up the training parameters. It checks if a GPU is available and assigns the appropriate device to the `device` variable. Then, we move the model to the selected device using the `to(device)` method. We also initialize the AdamW optimizer with the model parameters and a learning rate of `1e-5`.
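For intuition about what the optimizer does on each call to `optimizer.step()`, here is a minimal sketch of a single AdamW update for one scalar parameter, applied to a toy quadratic. It follows the standard AdamW formulation (Adam with decoupled weight decay). The function name `adamw_step`, the toy objective, and the larger learning rate of `0.01` are assumptions chosen for readability, and the real PyTorch implementation operates on whole tensors.

```python
import math

def adamw_step(param, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter; returns (param, m, v)."""
    param = param * (1.0 - lr * weight_decay)       # decoupled weight decay
    m = beta1 * m + (1.0 - beta1) * grad            # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad     # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                  # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adamw_step(w, grad, m, v, t, lr=0.01, weight_decay=0.0)

print(w)  # should end up close to the minimum at 3
```

Because the update is normalised by the running gradient magnitude, the first step moves the parameter by roughly `lr` regardless of how large the gradient is, which is one reason small learning rates such as `1e-5` are typical for fine-tuning.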


In [10]:
for batch in train_dataloader:
    print(batch)
    break

{'input_ids': tensor([[  357, 18474,     8,  ..., 11868,   286,  2042],
        [14282, 16381,  8783,  ..., 50256, 50256, 50256],
        [  357, 18474,     8,  ...,   290,   384, 25924],
        [12256,  2097, 20163,  ..., 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([[  357, 18474,     8,  ..., 11868,   286,  2042],
        [14282, 16381,  8783,  ..., 50256, 50256, 50256],
        [  357, 18474,     8,  ...,   290,   384, 25924],
        [12256,  2097, 20163,  ..., 50256, 50256, 50256]])}


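This cell is a sanity check of the dataloader: it prints the first batch. Padded positions hold token id 50256 (the EOS token reused for padding) and have an attention mask of 0.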

In [11]:
# Training loop
model.train()
num_epochs = 1
for epoch in range(num_epochs):
    for step, batch in enumerate(train_dataloader):
        # Move the batch to the same device as the model
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        # For causal language modelling the labels are the inputs themselves;
        # the model shifts them internally to predict the next token
        labels = batch["input_ids"].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        # Log the loss every 100 steps
        if step % 100 == 0:
            print("Step-{},Loss-{}".format(step, loss.item()))
        loss.backward()
        optimizer.step()


Step-0,Loss-4.9859137535095215
Step-100,Loss-3.182889461517334
Step-200,Loss-3.055086612701416
Step-300,Loss-2.802694320678711
Step-400,Loss-2.550711154937744
Step-500,Loss-2.6914784908294678
Step-600,Loss-1.813125491142273
Step-700,Loss-2.259305000305176
Step-800,Loss-2.0252110958099365
Step-900,Loss-2.8783481121063232
Step-1000,Loss-1.7711089849472046
Step-1100,Loss-2.5594494342803955
Step-1200,Loss-1.7512391805648804
Step-1300,Loss-2.6287434101104736
Step-1400,Loss-2.5168161392211914


KeyboardInterrupt: 

This code snippet defines the training loop. We set the model to training mode using `model.train()`. Then, for each epoch, we iterate over the batches in the `train_dataloader`. Inside the loop, we move the input tensors (`input_ids` and `attention_mask`) to the appropriate device and use a copy of `input_ids` as the `labels`. We zero the gradients with `optimizer.zero_grad()`, run a forward pass through the model (which also computes the loss when `labels` are given), backpropagate with `loss.backward()`, and update the model parameters using `optimizer.step()`. The run shown here was interrupted manually, which is why the output ends with `KeyboardInterrupt`.
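The loss reported above is a next-token cross-entropy: the prediction made at position `t` is scored against the token at position `t + 1`. The following pure-Python sketch illustrates that shift on toy values. `causal_lm_loss` and `log_softmax` are hypothetical helpers written for this explanation, not the code `transformers` runs internally.

```python
import math

IGNORE_INDEX = -100  # positions with this label are skipped

def log_softmax(logits):
    """Log-probabilities for one position, computed in a numerically stable way."""
    shift = max(logits)
    log_total = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return [x - log_total for x in logits]

def causal_lm_loss(logits, labels):
    """Mean next-token cross-entropy.

    logits: one list of vocabulary scores per position
    labels: one token id per position (same length as logits)
    """
    losses = []
    # The scores at position t are compared with the token at position t + 1
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue
        losses.append(-log_softmax(step_logits)[target])
    return sum(losses) / len(losses) if losses else 0.0

# A model that knows nothing assigns equal scores to every token
vocab_size = 8
uniform_logits = [[0.0] * vocab_size for _ in range(5)]
toy_labels = [3, 1, 4, 1, 5]

loss = causal_lm_loss(uniform_logits, toy_labels)
print(loss, math.log(vocab_size))  # both equal ln(8)
```

By the same reasoning, a uniform guess over GPT-2's 50,257-token vocabulary would give a loss of about 10.8, so the starting loss of roughly 5 in the log above already reflects what the pre-trained model knows.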


In [12]:
# Save the trained model
output_path = '/kaggle/working/GPT2-model.pth'
torch.save(model.state_dict(), output_path)


This code snippet saves the trained model to a file. The `state_dict()` method of the model returns a dictionary containing the model's parameters, which is then saved using `torch.save()`.


## Inference

In [13]:
# Load the trained model
model_path = '/kaggle/working/GPT2-model.pth'
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.load_state_dict(torch.load(model_path))


<All keys matched successfully>

This code snippet rebuilds the GPT-2 architecture from the `'gpt2'` checkpoint and then loads the fine-tuned weights from the saved file, ready for inference.

In [14]:
# Set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
# Set the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


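The code snippet above sets the device to the GPU if one is available and to the CPU otherwise. It then moves the model to that device and switches it to evaluation mode with `model.eval()`, which disables training-only behaviour such as dropout.

Finally, it initializes the tokenizer again so that the inference section can run on its own.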

In [16]:
# Generate text
prompt = "Once upon a time"
input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)
output = model.generate(input_ids, max_length=100, num_return_sequences=1)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The code snippet above does the following:
   - Sets the `prompt` variable to the desired starting text.
   - Encodes the prompt with the tokenizer, returns it as a PyTorch tensor, and moves it to the device.
   - Generates text with the trained model by calling `model.generate()`. `max_length` caps the total length in tokens (prompt included), and `num_return_sequences` sets how many sequences are returned.
   - The warning in the output appears because no `attention_mask` or `pad_token_id` was passed. Passing the prompt's `attention_mask` and `pad_token_id=tokenizer.eos_token_id` to `generate()` should silence it.
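With no sampling options set, generation of this kind is typically greedy: at every step the highest-scoring token is appended and the extended sequence is fed back in. The loop below is a minimal sketch of that procedure. `toy_next_token_scores` is a made-up stand-in for the language model and `greedy_generate` is a hypothetical helper, so this shows the control flow only, not how `model.generate()` is implemented.

```python
def toy_next_token_scores(token_ids, vocab_size=5):
    """Hypothetical stand-in for a language model: favours (last token + 1) % vocab_size."""
    favoured = (token_ids[-1] + 1) % vocab_size
    return [1.0 if i == favoured else 0.0 for i in range(vocab_size)]

def greedy_generate(prompt_ids, next_token_scores, max_length, eos_id=None):
    """Append the highest-scoring token until max_length (prompt included) or eos_id."""
    ids = list(prompt_ids)
    while len(ids) < max_length:
        scores = next_token_scores(ids)                 # score every candidate token
        next_id = max(range(len(scores)), key=lambda i: scores[i])
        ids.append(next_id)                             # feed the choice back in
        if next_id == eos_id:
            break
    return ids

generated = greedy_generate([0, 1], toy_next_token_scores, max_length=8, eos_id=4)
print(generated)  # [0, 1, 2, 3, 4]
```

Because greedy decoding always takes the single best token, it is deterministic and can get stuck repeating itself, which is one reason sampling-based decoding is often preferred for open-ended text.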



In [17]:
# Decode and print the generated text
for i, generated in enumerate(output):
    text = tokenizer.decode(generated, skip_special_tokens=True)
    print(f"Generated text {i+1}: {text}")

Generated text 1: Once upon a time, the world was awash in the   of the                                                                                    


The code snippet decodes the generated tensor into readable text using the tokenizer's `decode()` function and prints out the generated text.

## Conclusion

We fine-tuned the model for only about 1,400 steps of the first epoch before stopping the run. Training for longer should give noticeably better generations! :D