# **Example 6.5.1 (Text generation using pre-trained GPT-2)**

This example uses the Hugging Face Transformers library to generate text with a pre-trained GPT-2 language model.
- Text generation is an NLP (natural language processing) task that involves producing coherent and contextually relevant text.
- It has a wide range of applications, including chatbots, creative writing, content generation, code generation, translation, summarization, and more.
- The goal of text generation is to create human-like text that follows grammatical rules, maintains context, and conveys meaningful information.

Hugging Face is an open-source platform for data science and machine learning that lets users create, train, and deploy machine learning models. A standout feature of the platform is its Transformers library, which is designed primarily for NLP applications.
- The Hugging Face Transformers library offers an extensive collection of pre-trained language models, including GPT-2, so text generation can be run with only a few lines of code.
- Utilizing transformer-based architectures, the library allows for fine-tuning models on specific tasks, enhancing their ability to generate text tailored to particular contexts or domains.

In [None]:
# #Install the Hugging Face Transformers library
# !pip install transformers



In [1]:
#Import the libraries
# Importing the torch library
import torch
# Importing necessary modules from the transformers library
from transformers import GPT2LMHeadModel, GPT2Tokenizer




### Load pre-trained GPT-2 model and tokenizer

- Pre-trained GPT-2 model: This model has already been trained on vast amounts of text data and has learned to generate coherent text based on the provided input.

- Pre-trained GPT-2 tokenizer: The tokenizer is responsible for breaking down input text into tokens, which are then used as input to the model. The tokenizer also handles special tokens and other tokenization-related tasks specific to the GPT-2 model.


In [2]:
# Loading the pre-trained GPT-2 model
base_model = GPT2LMHeadModel.from_pretrained('gpt2')
# Loading the pre-trained GPT-2 tokenizer
base_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')





### Prompt for text generation

Here we initialize a variable named prompt with a string value. The purpose of this string is to provide a starting point or context for the text generation process. In this specific case, the prompt describes a scenario: "On a distant planet, a curious explorer stumbles upon an ancient temple guarded by". This prompt helps the model to continue generating text that follows logically from this initial context. The generated text is expected to be coherent and contextually relevant to the provided prompt.

In [3]:
# Defining the prompt for text generation
prompt = "On a distant planet, a curious explorer stumbles upon an ancient temple guarded by"
print(f"Prompt: {prompt}")


Prompt: On a distant planet, a curious explorer stumbles upon an ancient temple guarded by


### Tokenize the prompt and get attention mask
- Tokenization is the process of breaking down text into smaller units called tokens, such as words, subwords, or characters. GPT-2 uses subword tokens produced by byte-level byte-pair encoding (BPE).
- Tokenization and conversion to tensors allow the text data to be represented in a structured numerical format suitable for processing by machine learning models.
- After tokenization, each token in the text is mapped to a unique integer ID based on the vocabulary used by the model. These integer IDs are called input IDs.
- For example, consider the following tokenized text: ["On", "a", "distant", "planet"]. Each of these tokens is mapped to an input ID based on the vocabulary used by the model. So, "On" might be mapped to input ID 1432, "a" to input ID 554, "distant" to input ID 9321, and so on. These numbers are illustrative only and are not GPT-2's actual IDs.
- Input IDs provide a numerical representation of the tokens in the text.
- The attention mask is a binary tensor indicating which tokens should be attended to (1) and which ones should be ignored (0), such as padding. For a single prompt without padding, as in this example, every entry in the mask is 1.
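
The mapping from tokens to input IDs, and the role of the attention mask when padding is present, can be sketched in plain Python. This is a minimal illustration only: `toy_vocab`, the `<pad>` token, and the `encode` helper are hypothetical and have nothing to do with GPT-2's real vocabulary or the Transformers tokenizer API.

```python
# Toy vocabulary: these IDs are invented for illustration, not GPT-2's real ones
toy_vocab = {"<pad>": 0, "On": 1, "a": 2, "distant": 3, "planet": 4}

def encode(tokens, max_len):
    """Map tokens to IDs, pad up to max_len, and build the matching attention mask."""
    ids = [toy_vocab[t] for t in tokens]
    mask = [1] * len(ids)                 # 1 = real token, attend to it
    n_pad = max_len - len(ids)
    ids = ids + [toy_vocab["<pad>"]] * n_pad
    mask = mask + [0] * n_pad             # 0 = padding, ignore it
    return ids, mask

input_ids_demo, attention_mask_demo = encode(["On", "a", "distant", "planet"], max_len=6)
print(input_ids_demo)       # [1, 2, 3, 4, 0, 0]
print(attention_mask_demo)  # [1, 1, 1, 1, 0, 0]
```

The two lists always have the same length, and the zeros in the mask line up exactly with the padded positions.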

In [4]:
# Tokenizing the prompt and converting it into PyTorch tensors
inputs = base_tokenizer(prompt, return_tensors='pt')
# Extracting input IDs from the tokenized prompt
input_ids = inputs.input_ids
# Extracting attention mask from the tokenized prompt
attention_mask = inputs.attention_mask



### Generate text using the model
The input to the text generator function consists of the following:
- input_ids
- attention_mask
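- max_length: This parameter specifies the maximum total length of the output sequence in tokens, counting the prompt tokens as well as the newly generated ones. In this case, it is set to 36 tokens. Generation stops once the prompt plus its continuation reach this length, or earlier if the model produces the end-of-sequence token.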

- pad_token_id: This parameter specifies the token ID that represents padding. When several sequences are generated together and some finish early, the shorter ones are padded with this ID so that all returned sequences have the same length. GPT-2 does not define a dedicated padding token, so here it is set to the end-of-sequence token ID (eos_token_id) obtained from the tokenizer (base_tokenizer). Setting it explicitly also avoids the notice the library otherwise prints when it falls back to this value on its own.

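- num_return_sequences: This parameter determines the number of text sequences to generate. It's like asking the model to come up with multiple versions of the same story or response. In this case, it is set to 1, indicating that only one text sequence will be generated. Because this call uses the default greedy decoding, requesting more than one sequence would also require enabling sampling or beam search.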
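
To make the effect of max_length concrete, here is a small sketch of a greedy generation loop in plain Python. It is not how the library implements generation: `toy_next_token` is a hypothetical stand-in for the model (it simply counts upward), and `EOS_ID` is an invented end-of-sequence ID. The point is only that the length limit counts the prompt tokens too, and that an end-of-sequence token stops generation early.

```python
EOS_ID = 0  # hypothetical end-of-sequence ID for this toy example

def toy_next_token(ids):
    """Stand-in for a language model: returns last ID + 1, or EOS once the last ID is 9."""
    return EOS_ID if ids[-1] >= 9 else ids[-1] + 1

def greedy_generate(prompt_ids, max_length):
    """Append one token at a time until the total length hits max_length or EOS appears."""
    ids = list(prompt_ids)
    while len(ids) < max_length:   # the limit includes the prompt tokens
        next_id = toy_next_token(ids)
        ids.append(next_id)
        if next_id == EOS_ID:      # stop early at end of sequence
            break
    return ids

print(greedy_generate([5, 6], max_length=4))   # [5, 6, 7, 8]
print(greedy_generate([5, 6], max_length=10))  # [5, 6, 7, 8, 9, 0]
```

In the first call the two-token prompt leaves room for only two new tokens. In the second, the loop ends at the end-of-sequence token before the limit is reached.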



In [5]:
# Generating text using the pre-trained model
generated_text_samples = base_model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=36,
    pad_token_id=base_tokenizer.eos_token_id,
    num_return_sequences=1
)



### Decode and print the generated text samples

- First, you have token IDs generated by the model. These token IDs represent words or subwords from the model's vocabulary.
- You use the tokenizer to convert these token IDs back into text. This process is called decoding.
- During decoding, you may encounter special tokens such as GPT-2's end-of-sequence token (<|endoftext|>), which marks the end of the generated text. Passing skip_special_tokens=True removes such tokens from the decoded string.
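
Decoding can be pictured as the reverse lookup, with special tokens optionally filtered out. The sketch below is a simplified illustration, assuming a made-up `toy_id_to_token` table and a hypothetical `toy_decode` helper; the real tokenizer's decode method handles byte-level details that are omitted here.

```python
# Invented ID-to-token table; leading spaces are part of the tokens, as in GPT-2-style BPE
toy_id_to_token = {0: "<|endoftext|>", 1: "On", 2: " a", 3: " distant", 4: " planet"}
SPECIAL_IDS = {0}  # IDs treated as special tokens in this toy example

def toy_decode(ids, skip_special_tokens=True):
    """Convert IDs back to text, optionally dropping special tokens."""
    pieces = [
        toy_id_to_token[i]
        for i in ids
        if not (skip_special_tokens and i in SPECIAL_IDS)
    ]
    return "".join(pieces)

print(toy_decode([1, 2, 3, 4, 0]))                             # On a distant planet
print(toy_decode([1, 2, 3, 4, 0], skip_special_tokens=False))  # On a distant planet<|endoftext|>
```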


In [6]:
# Iterating through the generated sequences (one tensor of token IDs per sequence)
for i, sequence in enumerate(generated_text_samples):
    # Decoding the token IDs back into text, dropping special tokens
    generated_text = base_tokenizer.decode(sequence, skip_special_tokens=True)
    # Printing the generated text
    print(f"Generated Text {i+1}: {generated_text}\n")


Generated Text 1: On a distant planet, a curious explorer stumbles upon an ancient temple guarded by a mysterious creature. The creature is a giant, with a long, dark, and twisted body.

