# GPT-2 and Fine-Tuning

**GPT-2 Fine-Tuning for Text Generation**

GPT-2, developed by OpenAI, is a language model that generates coherent, contextually relevant text. Fine-tuning continues training the pre-trained model on a task-specific dataset so that its output matches a particular domain, such as movie descriptions, product reviews, or creative writing.

## Fine-tune the pre-trained model (GPT-2) on a custom dataset

The code below fine-tunes a pre-trained GPT-2 model on a custom dataset. It loads the tokenizer and model, reads the data, and defines a custom dataset class for the text. The dataset is then split into training and validation sets, the training arguments are configured, and the model is trained with the Trainer API from Hugging Face.


**Data Setup**: Import the libraries used to load the titles-and-descriptions dataset and to fine-tune the model.

In [3]:
import pandas as pd
import torch
from torch.utils.data import Dataset, random_split
from transformers import GPT2Tokenizer, TrainingArguments, Trainer, GPT2LMHeadModel

**Model Preparation**: Load the `gpt2-medium` tokenizer and model, register the start, end, and padding tokens, and resize the model's embedding matrix to the new vocabulary size.

In [6]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium', bos_token='<|startoftext|>',
                                          eos_token='<|endoftext|>', pad_token='<|pad|>')
model = GPT2LMHeadModel.from_pretrained('gpt2-medium').cuda()
model.resize_token_embeddings(len(tokenizer))

Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.

Embedding(50259, 1024)
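The embedding size of 50259 follows from the special tokens: GPT-2's vocabulary has 50257 entries and already contains `<|endoftext|>` (ID 50256), so only `<|startoftext|>` and `<|pad|>` are new. The sketch below reproduces that count with a toy dictionary vocabulary. `register_tokens` and the `tok{i}` entries are hypothetical stand-ins for illustration, not the tokenizer's own API.

```python
def register_tokens(vocab, new_tokens):
    """Add tokens missing from vocab, giving each the next free ID; return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
            added += 1
    return added


# Toy stand-in for the GPT-2 vocabulary: 50257 entries, the last one being <|endoftext|>.
vocab = {f"tok{i}": i for i in range(50256)}
vocab["<|endoftext|>"] = 50256

# <|endoftext|> is already present, so only two of the three tokens are new.
added = register_tokens(vocab, ["<|startoftext|>", "<|endoftext|>", "<|pad|>"])
print(added, len(vocab))  # 2 50259
```

The two new rows of the resized embedding matrix start from fresh initial values, which is why the tokenizer warns that the associated embeddings should be fine-tuned or trained.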

**Dataset Creation**: Load the descriptions, then define a custom dataset class that tokenizes each one and stores its input IDs and attention mask.

In [7]:
descriptions = pd.read_csv('data_titles.csv')['description']

In [10]:
descriptions.head(3)

0    As her father nears the end of his life, filmm...
1    After crossing paths at a party, a Cape Town t...
2    To protect his family from a powerful drug lor...
Name: description, dtype: object

In [8]:
# Add 2 so the longest description still fits once the start and end tokens are attached.
max_length = max(len(tokenizer.encode(description)) for description in descriptions) + 2

In [9]:
class dataDataset(Dataset):
    def __init__(self, txt_list, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        for txt in txt_list:
            # Wrap each description in start/end tokens, then pad or truncate to max_length.
            encodings_dict = tokenizer('<|startoftext|>' + txt + '<|endoftext|>', truncation=True,
                                       max_length=max_length, padding="max_length")
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]
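To make the effect of `truncation=True` and `padding="max_length"` concrete, here is a minimal plain-Python sketch of right-padding and truncation on lists of token IDs. `pad_or_truncate` and the ID constants are hypothetical and chosen only for illustration; the real tokenizer does this internally.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Right-pad or truncate a list of token IDs and build the matching attention mask."""
    ids = list(token_ids[:max_length])
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should not attend to.
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


BOS, EOS, PAD = 50257, 50256, 50258  # illustrative IDs only

short = pad_or_truncate([BOS, 11, 12, EOS], max_length=6, pad_id=PAD)
long = pad_or_truncate([BOS, 11, 12, 13, 14, 15, EOS], max_length=6, pad_id=PAD)

print(short["input_ids"])       # [50257, 11, 12, 50256, 50258, 50258]
print(short["attention_mask"])  # [1, 1, 1, 1, 0, 0]
print(long["input_ids"])        # [50257, 11, 12, 13, 14, 15]
```

The second example shows why `max_length` has to leave room for the start and end tokens: a sequence that is too long loses its trailing end token to truncation.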

**Data Splitting**: Split the dataset into training and validation sets.

In [11]:
dataset = dataDataset(descriptions, tokenizer, max_length=max_length)
train_size = int(0.9 * len(dataset))
train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])
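Conceptually, a 90/10 random split shuffles the example indices and cuts the shuffled list at `train_size`. The plain-Python sketch below illustrates the idea; `split_indices` and its `seed` parameter are hypothetical, and `random_split` itself works on PyTorch datasets rather than index lists.

```python
import random


def split_indices(n_items, train_fraction=0.9, seed=0):
    """Shuffle indices 0..n_items-1 with a fixed seed and cut them into train/validation parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    train_size = int(train_fraction * n_items)
    return indices[:train_size], indices[train_size:]


train_idx, val_idx = split_indices(1000)
print(len(train_idx), len(val_idx))  # 900 100
```

The split in the cell above is not seeded, so it differs from run to run. `random_split` accepts a `generator` argument (for example `torch.Generator().manual_seed(42)`) when a reproducible split is needed.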

**Training Configuration**: Set up training arguments for fine-tuning, specifying parameters like epochs, batch sizes, and logging settings.


In [15]:
training_args = TrainingArguments(output_dir='./results', num_train_epochs=1, logging_steps=100, save_steps=5000,
                                  per_device_train_batch_size=1, per_device_eval_batch_size=1,
                                  warmup_steps=10, weight_decay=0.05, logging_dir='./logs', report_to='none')
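These arguments determine how many optimizer steps, log entries, and checkpoints a run produces. The sketch below works through the arithmetic under stated assumptions: a single device, no gradient accumulation, and step-based logging and saving. `training_schedule` is a hypothetical helper, and `n_train=7926` is taken from the `global_step` reported in the training output of this notebook.

```python
import math


def training_schedule(n_train, batch_size, epochs, logging_steps, save_steps, grad_accum=1):
    """Estimate step, log, and checkpoint counts for a single-device training run."""
    steps_per_epoch = math.ceil(n_train / (batch_size * grad_accum))
    total_steps = steps_per_epoch * epochs
    return {
        "total_steps": total_steps,
        "log_entries": total_steps // logging_steps,   # one loss value every logging_steps
        "checkpoints": total_steps // save_steps,      # one checkpoint every save_steps
    }


schedule = training_schedule(n_train=7926, batch_size=1, epochs=1,
                             logging_steps=100, save_steps=5000)
print(schedule)  # {'total_steps': 7926, 'log_entries': 79, 'checkpoints': 1}
```

With a batch size of 1, every training example is its own optimizer step, so the 10 warmup steps cover only the first 10 examples.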


**Model Training**: Create the Trainer with a collator that batches input IDs, attention masks, and labels, then fine-tune the model.

In [16]:
def collate(data):
    # Each item is (input_ids, attention_mask); for causal LM training the labels are the input IDs.
    return {'input_ids': torch.stack([f[0] for f in data]),
            'attention_mask': torch.stack([f[1] for f in data]),
            'labels': torch.stack([f[0] for f in data])}

Trainer(model=model, args=training_args, train_dataset=train_dataset,
        eval_dataset=val_dataset, data_collator=collate).train()

| Step | Training Loss |
|------|---------------|
| 100 | 5.8361 |
| 200 | 1.9613 |
| 300 | 1.8957 |
| 400 | 1.9518 |
| 500 | 1.9439 |
| 600 | 1.807 |
| 700 | 1.8546 |
| 800 | 1.9173 |
| 900 | 1.8722 |
| 1000 | 1.7783 |


TrainOutput(global_step=7926, training_loss=1.8334816878051912, metrics={'train_runtime': 1431.9436, 'train_samples_per_second': 5.535, 'total_flos': 1046192214269952.0, 'epoch': 1.0, 'init_mem_cpu_alloc_delta': 58048, 'init_mem_gpu_alloc_delta': 0, 'init_mem_cpu_peaked_delta': 18306, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 648694, 'train_mem_gpu_alloc_delta': 4257904640, 'train_mem_cpu_peaked_delta': 413258894, 'train_mem_gpu_peaked_delta': 617583616})
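The collator passed to `Trainer` copies `input_ids` into `labels` unchanged, so padding positions are also used as prediction targets. A common variant, not what this run did, replaces the labels at padded positions with -100, the value PyTorch's cross-entropy loss ignores by default. The sketch below shows the rule on plain lists; `mask_padding_labels` and `collate_masked` are hypothetical names, and a tensor version would apply the same rule element-wise.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding_labels(input_ids, attention_mask):
    """Copy input IDs to labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]


def collate_masked(batch):
    """Batch (input_ids, attention_mask) pairs and add labels that skip padding."""
    return {
        "input_ids": [ids for ids, _ in batch],
        "attention_mask": [mask for _, mask in batch],
        "labels": [mask_padding_labels(ids, mask) for ids, mask in batch],
    }


# Illustrative IDs: 50257 = start, 50256 = end, 50258 = pad.
example = ([50257, 11, 12, 50256, 50258, 50258], [1, 1, 1, 1, 0, 0])
batch = collate_masked([example])
print(batch["labels"][0])  # [50257, 11, 12, 50256, -100, -100]
```

The end token keeps its label because its attention mask is 1, so the model still learns where a description stops.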

## Generate a new paragraph according to a given sentence

This section shows how to generate new text with the fine-tuned GPT-2 model. It tokenizes a prompt, samples several continuations from the model, and decodes them back into text.

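**Input Preparation**: Tokenize the prompt and move it to the GPU as a tensor. Here the prompt is only the `<|startoftext|>` token, so the model writes each description from scratch; to condition on a given sentence, append it after the token.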

In [17]:
generated = tokenizer("<|startoftext|> ", return_tensors="pt").input_ids.cuda()

**Text Generation**: Sample 20 sequences of up to 300 tokens (prompt included) from the fine-tuned model, using temperature scaling with top-k and top-p (nucleus) filtering.

In [18]:
sample_outputs = model.generate(generated, do_sample=True, top_k=50, 
                                max_length=300, top_p=0.95, temperature=1.9, num_return_sequences=20)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
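The sampling parameters interact as follows: `temperature` rescales the logits, `top_k` keeps only the k most likely tokens, and `top_p` keeps the smallest set of those whose cumulative probability reaches p. The plain-Python sketch below illustrates one sampling step under those rules. `filter_candidates` and `sample_next` are hypothetical helpers, and the library's implementation differs in details.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_candidates(logits, temperature=1.0, top_k=50, top_p=0.95):
    """Return {token_id: probability} for tokens surviving temperature, top-k, and top-p."""
    scaled = [x / temperature for x in logits]
    # Top-k: keep the k highest-scoring token IDs, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    probs = softmax([scaled[i] for i in ranked])
    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for token_id, prob in zip(ranked, probs):
        nucleus.append((token_id, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in nucleus)
    return {token_id: prob / total for token_id, prob in nucleus}


def sample_next(logits, rng, **kwargs):
    """Draw one token ID from the filtered, renormalized distribution."""
    dist = filter_candidates(logits, **kwargs)
    token_ids = list(dist)
    return rng.choices(token_ids, weights=[dist[t] for t in token_ids], k=1)[0]


logits = [2.0, 1.0, 0.0, -1.0]
cool = filter_candidates(logits, temperature=1.0, top_k=2, top_p=0.95)
hot = filter_candidates(logits, temperature=1.9, top_k=2, top_p=0.95)
token = sample_next(logits, random.Random(0), temperature=1.9, top_k=2, top_p=0.95)
```

In this toy example the top token's probability is lower in `hot` than in `cool`: a temperature above 1 flattens the distribution, so less likely tokens are sampled more often. The notebook's `temperature=1.9` therefore favors variety over fluency.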


**Result Display**: Display the generated text sequences after decoding them from token IDs.

In [19]:
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

0:  ̶Curious fellow students from Bhola must navigate their very real identities and the highs and lows of high, campus life.
1:  ired with his wife for cheating in favor of his ex, a man is recruited by police investigating a group of other alleged criminals.
2:  ertie van bergangewandt! Dutertag de man terriessen and bohemians Vandus van Daekelu videns in action as two small family families travel to get their new holiday.
3:  ̶Elite members of India's underworld find their lives upended through the tragic actions of two politicians linked to a mafia underworld that spans three political parties.
4:  ”I love being a mom. Not everything she holds to is easy on a mother’s journey that ends at 45:00A local news program featuring guest actors.
5:   With every piece of furniture out-waste by a fortune.  When everyone becomes furniture, why aren't we buying everything? It only appears to have two problems in common household: 1) nobody needs nothing. Then every night of your life is the wo