# Basic Text Generation
We use GPT-2 in this section:
- it is free
- it can be loaded with the `transformers` library
- later sections use the OpenAI API instead

In [37]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

In [2]:
# Initialize the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")


Notice how repetitive the generated output below is: with default settings the model keeps choosing the same most likely continuation, so we need to fix this.

In [27]:
# Simplified text generation
def simple_text_generation(prompt, model, tokenizer, max_length=100):
  # Encode the prompt to get the input IDs
  input_ids = tokenizer.encode(prompt, return_tensors="pt")  # "pt" = PyTorch tensors

  # Generate text with the model, honoring the caller's max_length
  outputs = model.generate(input_ids, max_length=max_length)

  # Decode the first (and only) generated sequence back into text
  return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [28]:
# Test the function
prompt = "Dear boss ... "
text_generated = simple_text_generation(prompt,
                                        model,
                                        tokenizer,
                                        max_length = 100)
print(text_generated)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Dear boss ...  I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able
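
Greedy decoding, which `generate` uses by default, always appends the highest-scoring next token, so it can fall into a loop like the one above. One common remedy is to forbid any n-gram from appearing twice; in `transformers` this is exposed through the `no_repeat_ngram_size` argument of `generate`. The sketch below illustrates the idea in plain Python. `SCORES`, `greedy_decode`, and `has_repeated_bigram` are illustrative names, and the scores are made up rather than taken from GPT-2.

```python
# Hypothetical next-token scores for a tiny vocabulary (NOT GPT-2's real distribution).
SCORES = {
    "I": {"can": 0.7, "will": 0.3},
    "can": {"not": 0.6, "try": 0.4},
    "not": {"do": 0.9, "try": 0.1},
    "do": {"this": 1.0},
    "this": {".": 0.8, "I": 0.2},
    ".": {"I": 0.6, "We": 0.4},
    "We": {"will": 0.9, "can": 0.1},
    "will": {"try": 0.8, "do": 0.2},
    "try": {"this": 0.7, ".": 0.3},
}


def greedy_decode(scores, start, max_new_tokens, block_repeated_bigrams=False):
    """Append the best-scoring next token at each step, optionally banning repeated bigrams."""
    tokens = [start]
    for _ in range(max_new_tokens):
        candidates = dict(scores[tokens[-1]])
        if block_repeated_bigrams:
            # Every (previous, next) pair that already occurred is off limits.
            seen = set(zip(tokens, tokens[1:]))
            candidates = {tok: s for tok, s in candidates.items()
                          if (tokens[-1], tok) not in seen}
        if not candidates:
            break  # nothing left that would not repeat a bigram
        tokens.append(max(candidates, key=candidates.get))
    return tokens


def has_repeated_bigram(tokens):
    """Return True if any pair of adjacent tokens occurs more than once."""
    bigrams = list(zip(tokens, tokens[1:]))
    return len(bigrams) != len(set(bigrams))


plain = greedy_decode(SCORES, "I", max_new_tokens=12)
blocked = greedy_decode(SCORES, "I", max_new_tokens=12, block_repeated_bigrams=True)

print(" ".join(plain))    # loops through the same six tokens
print(" ".join(blocked))  # forced onto other continuations after the first pass
```

With the real model, the analogous experiment would be to pass something like `no_repeat_ngram_size=2` to `model.generate`; whether that yields good text for a given prompt is something to check empirically.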


# Fine Tuning

In [29]:
# Define a small dataset (sentences in the style of machine learning research abstracts)
data = [
    "This paper presents a new method for improving the performance of machine learning models by using data augmentation techniques.",
    "We propose a novel approach to natural language processing that leverages the power of transformers and attention mechanisms.",
    "In this study, we investigate the impact of deep learning algorithms on the accuracy of image recognition tasks.",
    "Our research demonstrates the effectiveness of transfer learning in enhancing the capabilities of neural networks.",
    "This work explores the use of reinforcement learning for optimizing decision-making processes in complex environments.",
    "We introduce a framework for unsupervised learning that significantly reduces the need for labeled data.",
    "The results of our experiments show that ensemble methods can substantially boost model performance.",
    "We analyze the scalability of various machine learning algorithms when applied to large datasets.",
    "Our findings suggest that hyperparameter tuning is crucial for achieving optimal results in machine learning applications.",
    "This research highlights the importance of feature engineering in the context of predictive modeling."
]



In [30]:
# Tokenization
# All inputs in a batch must have the same length.
# Shorter inputs are filled with a dummy token at the end; this is called padding.
# GPT-2 has no dedicated padding token, so we reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token
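
To make the padding step concrete, here is a minimal plain-Python sketch of what padding to a fixed length produces. `pad_batch` is a hypothetical helper and the token IDs are made up; only the padding ID 50256 (GPT-2's end-of-sequence ID) comes from this notebook.

```python
PAD_ID = 50256  # GPT-2's end-of-sequence ID, reused as the padding ID


def pad_batch(sequences, max_length, pad_id=PAD_ID):
    """Right-pad each list of token IDs to max_length and build matching attention masks."""
    input_ids, attention_masks = [], []
    for seq in sequences:
        if len(seq) > max_length:
            raise ValueError("sequence is longer than max_length")
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_masks.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_masks


# Hypothetical token IDs, chosen only for illustration
ids, masks = pad_batch([[11, 22, 33], [44, 55]], max_length=5)
print(ids)
print(masks)
```

The tokenizer performs the same bookkeeping when it is called with `padding="max_length"`, returning the padded IDs together with the attention mask.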

In [31]:
# Tokenize the data
tokenized_data = [tokenizer.encode_plus(
    sentence,
    add_special_tokens=True,
    return_tensors="pt",
    padding="max_length",
    max_length=50) for sentence in data]
tokenized_data[:2]

[{'input_ids': tensor([[ 1212,  3348, 10969,   257,   649,  2446,   329, 10068,   262,  2854,
            286,  4572,  4673,  4981,   416,  1262,  1366, 16339, 14374,  7605,
             13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0]])},
 {'input_ids': tensor([[ 1135, 18077,   257,  5337,  3164,   284,  3288,  3303,  7587,   326,
          17124,  1095,   262,  1176,   286,  6121,   364,   290,  3241, 11701,
             13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 502

In [35]:
# Isolate the input IDs and the attention masks
input_ids = [item['input_ids'].squeeze() for item in tokenized_data]
attention_masks = [item['attention_mask'].squeeze() for item in tokenized_data]
attention_masks[:2]

[tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]),
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0])]

In [38]:
# Stack the per-sentence input IDs and attention masks into 2-D tensors
# This step is necessary before the data can be used to fine-tune the model
# The attention mask tells the model which positions are padding (0) and which are real tokens (1)

input_ids = torch.stack(input_ids)
attention_masks = torch.stack(attention_masks)

In [40]:
# Pad all sequences to the same length
# (a no-op here, since every row was already padded to 50 tokens during tokenization)
padded_input_ids = pad_sequence(input_ids,
             batch_first=True,
             padding_value=0)
padded_attention_masks = pad_sequence(attention_masks,
             batch_first=True,
             padding_value=0)

In [43]:
# Create a custom dataset class that also provides labels
class TextDataset(Dataset):
  def __init__(self, input_ids, attention_masks):
    self.input_ids = input_ids
    self.attention_masks = attention_masks
    self.labels = input_ids.clone()

  def __len__(self):
    return len(self.input_ids)

  def __getitem__(self, idx):
    return {
        'input_ids': self.input_ids[idx],
        'attention_mask': self.attention_masks[idx],
        'labels': self.labels[idx]
    }

# Instantiate the dataset
dataset = TextDataset(padded_input_ids, padded_attention_masks)
dataset[:2]

{'input_ids': tensor([[ 1212,  3348, 10969,   257,   649,  2446,   329, 10068,   262,  2854,
            286,  4572,  4673,  4981,   416,  1262,  1366, 16339, 14374,  7605,
             13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256],
         [ 1135, 18077,   257,  5337,  3164,   284,  3288,  3303,  7587,   326,
          17124,  1095,   262,  1176,   286,  6121,   364,   290,  3241, 11701,
             13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
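
The `labels` in `TextDataset` are an exact copy of `input_ids`, so padded positions also contribute to the training loss (the attention mask does not remove them from the loss). A common variation is to set the labels at padded positions to -100, which is the default `ignore_index` of PyTorch's cross-entropy loss and is skipped by Hugging Face causal language models when they compute the loss. The sketch below shows the masking step on a made-up mini-batch; it is a variation, not what this notebook does, and applying it would change the losses reported later.

```python
import torch

PAD_ID = 50256        # padding ID used in this notebook (GPT-2's end-of-sequence ID)
IGNORE_INDEX = -100   # label value that the cross-entropy loss skips by default

# Hypothetical mini-batch, chosen only for illustration
input_ids = torch.tensor([[11, 22, 33, PAD_ID, PAD_ID],
                          [44, 55, PAD_ID, PAD_ID, PAD_ID]])
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 0, 0, 0]])

# Copy the inputs, then blank out every position the attention mask marks as padding
labels = input_ids.clone()
labels[attention_mask == 0] = IGNORE_INDEX

print(labels)
```

Masking by the attention mask rather than by token ID matters here, because the padding token and the end-of-sequence token share the same ID.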
 

In [44]:
# Prepare the data in batches
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

In [49]:
# Initialize an optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Set the model to training mode
model.train()

# Training loop
for epoch in range(10): # one epoch is one complete pass over the dataset
  for batch in dataloader:
    # Unpack the input IDs, attention mask, and labels from the batch
    input_ids = batch['input_ids']
    attention_mask = batch['attention_mask']
    labels = batch['labels']

    # Reset the gradients to 0
    optimizer.zero_grad()

    # Forward pass
    # The model computes the language-modeling loss from the labels
    outputs = model(input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels)

    loss = outputs.loss

    # Backward pass: compute the gradients of the loss
    loss.backward()

    # Update the model parameters
    optimizer.step()

  # Print the loss of the last batch in this epoch to monitor progress
  print(f"Epoch {epoch+1} Loss: {loss.item()}")


Epoch 1 Loss: 1.713943362236023
Epoch 2 Loss: 1.1814950704574585
Epoch 3 Loss: 0.9363365769386292
Epoch 4 Loss: 0.7263397574424744
Epoch 5 Loss: 0.5707748532295227
Epoch 6 Loss: 0.8139098882675171
Epoch 7 Loss: 0.39953312277793884
Epoch 8 Loss: 0.4019888639450073
Epoch 9 Loss: 0.45522117614746094
Epoch 10 Loss: 0.275823175907135
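
The value printed per epoch is the loss of the last batch only. With `shuffle=True` and two sentences per batch, that single number is noisy, which may partly explain bumps such as the one at epoch 6. Averaging the batch losses over the epoch gives a steadier signal. A minimal sketch with made-up loss values (in the training loop, each value would come from `loss.item()`):

```python
# Hypothetical batch losses for one epoch (5 batches of 2 sentences each)
batch_losses = [1.2, 0.8, 1.0, 0.6, 0.4]

running_total = 0.0
for loss_value in batch_losses:
    running_total += loss_value  # accumulate inside the batch loop

epoch_loss = running_total / len(batch_losses)  # mean loss over the epoch
last_batch_loss = batch_losses[-1]              # what a last-batch print would show

print(f"Average loss: {epoch_loss:.4f} | last-batch loss: {last_batch_loss:.4f}")
```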


In [52]:
# Define a function to generate text
def generate_text(prompt, model, tokenizer, max_length=500):
  # Encode the prompt
  # encoded_input behaves like a dictionary: {'input_ids': tensor([..]), 'attention_mask': tensor([..])}
  encoded_input = tokenizer.encode_plus(prompt, return_tensors="pt")
  input_ids = encoded_input['input_ids']  # Token IDs of the prompt
  attention_mask = encoded_input['attention_mask']  # 1 for every real token in the prompt

  # Generate text using model
  outputs = model.generate(input_ids,
                           attention_mask=attention_mask,
                           max_length=max_length)

  return tokenizer.decode(outputs[0], skip_special_tokens=True)
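
The training cell calls `model.train()`, and nothing switches the model back before `generate_text` is used, so GPT-2's dropout layers stay active during generation. Calling `model.eval()` before generating is the usual practice. The sketch below uses a standalone `torch.nn.Dropout` layer (not GPT-2 itself) to show how the two modes differ; the layer size and dropout rate are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)  # only to make the printed values repeatable

dropout = torch.nn.Dropout(p=0.5)
x = torch.ones(8)

# Training mode: each element is zeroed with probability p,
# and the surviving elements are rescaled by 1 / (1 - p)
dropout.train()
y_train = dropout(x)

# Evaluation mode: dropout is disabled and the input passes through unchanged
dropout.eval()
y_eval = dropout(x)

print(y_train)
print(y_eval)
```

Adding `model.eval()` to this notebook would likely change the generated text, so the outputs recorded below correspond to the code as written.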

In [56]:
# Test the function
prompt = "In this research, we "
text_generated = generate_text(prompt, model, tokenizer, max_length=500)
print(text_generated)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In this research, we   investigate the impact of neural networks on decision making decisions in complex environments.


In [59]:
# Test the function
prompt = "Dear Boss ... "
text_generated = generate_text(prompt, model, tokenizer, max_length=500)
print(text_generated)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Dear Boss ... 
