In [1]:
import pandas as pd
import torch
from torch.utils.data import Dataset, random_split
from transformers import GPT2Tokenizer, TrainingArguments, Trainer, GPT2LMHeadModel

**Tokenizer initialization**

*   GPT2Tokenizer.from_pretrained('gpt2-medium'): Downloads the tokenizer for the GPT-2 Medium model.
*   bos_token, eos_token, pad_token: Special tokens that mark the beginning and end of each sequence (<|startoftext|>, <|endoftext|>) and pad shorter sequences (<|pad|>). GPT-2 ships without a padding token, so one has to be defined before sequences can be padded to a common length for training.

In [2]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium', bos_token='<|startoftext|>',
                                          eos_token='<|endoftext|>', pad_token='<|pad|>')

**Model initialization**

In [3]:
model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
model.resize_token_embeddings(len(tokenizer))

Embedding(50259, 1024)
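
The `Embedding(50259, 1024)` output above can be checked by hand. The short sketch below assumes GPT-2's published vocabulary size of 50257 and that `<|endoftext|>` is already part of that vocabulary, so only the two other special tokens are new. The 1024 is the hidden size of GPT-2 Medium. Variable names are illustrative only.

```python
# GPT-2's base vocabulary size; '<|endoftext|>' is already included in it.
base_vocab_size = 50257

requested_special_tokens = ['<|startoftext|>', '<|endoftext|>', '<|pad|>']
already_in_vocab = {'<|endoftext|>'}

# Only tokens the tokenizer does not know yet enlarge the vocabulary.
new_tokens = [t for t in requested_special_tokens if t not in already_in_vocab]
resized_vocab_size = base_vocab_size + len(new_tokens)

print(new_tokens)
print(resized_vocab_size)
```

This is why `model.resize_token_embeddings(len(tokenizer))` is needed: without it, the IDs of the new tokens would fall outside the model's embedding table.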

In [4]:
descriptions = pd.read_csv('netflix_titles.csv')['description']

**Calculating the maximum sequence length**

This cell encodes each description into token IDs with the GPT-2 tokenizer and takes the length of the longest one. That value is later passed to the dataset as max_length, so every sequence is padded to this common length (and truncated if it is longer).

In [5]:
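# Measure lengths with the start/end markers included, since the dataset adds them before truncating.
max_length = max(len(tokenizer.encode('<|startoftext|>' + description + '<|endoftext|>'))
                 for description in descriptions)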

**Custom Dataset class**

The class keeps two lists:

*   self.input_ids: Stores each description as a sequence of token IDs, the integers the model consumes in place of raw text.
*   self.attn_masks: Stores, for each description, a mask that marks real tokens with 1 and padding with 0. The model uses it to ignore the padding added to bring every description to the same length.

The class has three methods:

*   __init__() (constructor): For each description, it wraps the text in the start and end markers (<|startoftext|> and <|endoftext|>), tokenizes it with truncation and padding to max_length, and appends the resulting token IDs and attention mask to input_ids and attn_masks.
*   __len__(): Returns the number of descriptions. The data loader needs this to know how many items to iterate over during training.
*   __getitem__(): Returns the token IDs and attention mask of one description, selected by its index. The data loader calls it repeatedly to assemble batches.





In [9]:
class NetflixDataset(Dataset):
    def __init__(self, txt_list, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        for txt in txt_list:
            encodings_dict = tokenizer('<|startoftext|>' + txt + '<|endoftext|>', truncation=True,
                                       max_length=max_length, padding="max_length")
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]
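
To make the padding and attention mask concrete, here is a toy sketch in plain Python. It is not the GPT-2 tokenizer: `encode_padded` is a hypothetical helper that splits on whitespace and assigns IDs on the fly, and it only imitates the effect of `truncation=True` with `padding="max_length"`.

```python
BOS, EOS, PAD = '<|startoftext|>', '<|endoftext|>', '<|pad|>'

def encode_padded(text, vocab, max_length):
    """Toy stand-in for tokenizer(..., truncation=True, padding='max_length')."""
    tokens = [BOS] + text.split() + [EOS]
    tokens = tokens[:max_length]               # truncate anything beyond max_length
    attention_mask = [1] * len(tokens)         # 1 marks a real token
    n_pad = max_length - len(tokens)
    tokens += [PAD] * n_pad
    attention_mask += [0] * n_pad              # 0 marks padding
    # Assign a new integer ID to every token not seen before.
    input_ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
    return input_ids, attention_mask

vocab = {PAD: 0, BOS: 1, EOS: 2}

ids, mask = encode_padded('a heist goes wrong', vocab, max_length=8)
print(ids)   # [1, 3, 4, 5, 6, 2, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]

# With a max_length that is too small, the end marker is cut off.
short_ids, short_mask = encode_padded('a heist goes wrong', vocab, max_length=4)
print(short_ids)   # [1, 3, 4, 5]
```

The second call shows why max_length should account for the two markers: a sequence that is truncated loses its end-of-text token.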

**Creating the dataset and splitting into training and validation sets**

In [10]:
dataset = NetflixDataset(descriptions, tokenizer, max_length=max_length)
train_size = int(0.9 * len(dataset))
train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])
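
The split above is random, so the validation set changes on every run. As a rough illustration of what a 90/10 split does, and of how a seed makes it repeatable, here is a plain-Python sketch. `split_indices` is a hypothetical helper, not part of PyTorch. With `random_split` itself, passing a seeded `torch.Generator` through its `generator` argument serves the same purpose.

```python
import random

def split_indices(n_items, train_fraction=0.9, seed=0):
    """Shuffle the indices 0..n_items-1 with a fixed seed and cut them in two."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    train_size = int(train_fraction * n_items)
    return indices[:train_size], indices[train_size:]

train_idx, val_idx = split_indices(1000)
print(len(train_idx), len(val_idx))

# The same seed reproduces the same split.
assert split_indices(1000) == (train_idx, val_idx)
```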


**Clearing GPU memory**

In [11]:
import gc
gc.collect()
torch.cuda.empty_cache()

**Setting up training arguments**

The TrainingArguments configure the training run. output_dir is where model checkpoints are written. The model is trained for 1 epoch, the training loss is logged every 100 steps (logging_steps), and a checkpoint is saved every 5000 steps (save_steps). The training and evaluation batch sizes are both 1, which keeps memory use low when GPU memory is limited. warmup_steps=10 ramps the learning rate up from zero over the first 10 steps so that the earliest updates stay small. weight_decay=0.05 applies a regularization penalty on the weights to reduce overfitting, and report_to='none' disables reporting to external trackers such as TensorBoard.

In [12]:
training_args = TrainingArguments(output_dir='./results', num_train_epochs=1, logging_steps=100, save_steps=5000,
                                  per_device_train_batch_size=1, per_device_eval_batch_size=1,
                                  warmup_steps=10, weight_decay=0.05, logging_dir='./logs', report_to='none')
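
To show what the 10 warmup steps do to the learning rate, the sketch below reproduces the shape of a linear warmup followed by linear decay. It assumes the Trainer defaults (a learning rate of 5e-5 and the linear scheduler), since the cell above sets neither. The total of 1000 steps is a made-up figure for illustration, and `linear_schedule_lr` is a hypothetical helper rather than a library function.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=10, total_steps=1000):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 5, 10, 505, 1000):
    print(step, linear_schedule_lr(step))
```

The rate starts at zero, reaches its peak at step 10, and then falls back to zero by the last step.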

**Training the model with the Trainer class**

The Trainer runs the training loop. model=model is the GPT-2 model to fine-tune, and args=training_args passes the training parameters defined above. train_dataset=train_dataset and eval_dataset=val_dataset are the datasets used for training and evaluation. The data_collator is a custom function that builds each batch by stacking the input IDs and attention masks into tensors. It also passes the input IDs as labels: for causal language modeling the target is the input sequence itself, and the model shifts the labels by one position internally so that each token is predicted from the tokens before it.


In [15]:
Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=val_dataset,
        data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                    'attention_mask': torch.stack([f[1] for f in data]),
                                    'labels': torch.stack([f[0] for f in data])}).train()

Step,Training Loss


KeyboardInterrupt: 
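
One detail of the collator above: because the labels are a plain copy of the input IDs, the padded positions also count toward the loss. A common refinement, shown here only as a sketch and not as what this notebook does, is to replace the labels at padded positions with -100, the index that PyTorch's cross-entropy loss ignores by default. The example uses plain lists and made-up token IDs. With tensors the same idea is usually written as `labels[attention_mask == 0] = -100` on a clone of the input IDs.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids as labels, but mark padded positions so the loss skips them."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

# Made-up IDs: 1 = start marker, 2 = end marker, 0 = padding.
input_ids = [1, 3, 4, 2, 0, 0]
attention_mask = [1, 1, 1, 1, 0, 0]

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)  # [1, 3, 4, 2, -100, -100]
```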

**Generating text with the trained model**

The first line tokenizes the start-of-text marker into a tensor of input IDs that serves as the prompt. This tensor has to be on the same device as the model. model.generate(...) then samples new sequences from the model with the following settings:

*   do_sample=True: Samples each next token from the predicted distribution instead of using greedy decoding.
*   top_k=50: Restricts sampling to the 50 most probable tokens at each step.
*   max_length=300: Caps each output sequence, prompt included, at 300 tokens.
*   top_p=0.95: Applies nucleus sampling, which keeps only the smallest set of most probable tokens whose cumulative probability reaches 0.95.
*   temperature=1.9: Divides the logits by 1.9 before sampling, which flattens the distribution and makes the output more random (values above 1 increase randomness, values below 1 reduce it).
*   num_return_sequences=20: Generates 20 sequences for the prompt.

In [18]:
# Tokenize the prompt and place it on the same device as the model.
generated = tokenizer("<|startoftext|> ", return_tensors="pt")['input_ids'].to(model.device)
sample_outputs = model.generate(generated, do_sample=True, top_k=50,
                                max_length=300, top_p=0.95, temperature=1.9, num_return_sequences=20)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


KeyboardInterrupt: 
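
To see how temperature, top_k, and top_p interact, here is a simplified plain-Python sketch of one sampling step over four toy logits. It is not the library's implementation: `filter_distribution` and `sample_token` are hypothetical helpers, and the logits are made up. The order of operations (temperature, then top-k, then top-p) follows the usual convention.

```python
import math
import random

def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]   # numerically stable softmax
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]                      # keep the k most probable tokens
        kept_mass = sum(p for p, _ in ranked)
        ranked = [(p / kept_mass, i) for p, i in ranked]
    nucleus, cumulative = [], 0.0
    for p, i in ranked:                              # smallest set reaching top_p
        nucleus.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break
    mass = sum(p for p, _ in nucleus)
    return {i: p / mass for p, i in nucleus}

def sample_token(logits, rng, **settings):
    """Draw one token index from the filtered distribution."""
    dist = filter_distribution(logits, **settings)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

toy_logits = [2.0, 1.0, 0.0, -1.0]
sharp = filter_distribution(toy_logits, temperature=1.0)
flat = filter_distribution(toy_logits, temperature=1.9)
nucleus = filter_distribution(toy_logits, top_p=0.95)

rng = random.Random(0)
token = sample_token(toy_logits, rng, temperature=1.9, top_k=2, top_p=0.95)
print(sharp[0], flat[0], sorted(nucleus), token)
```

At temperature 1.9 the most likely token receives a smaller share of the probability than at 1.0, which is why such a high setting tends to produce more erratic text.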

**Printing generated sequences**

In [19]:
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

NameError: name 'sample_outputs' is not defined

In [None]:
pd.options.display.max_colwidth = 1000
descriptions.sample(10)