# TEXT GENERATION USING A PRETRAINED TRANSFORMER-BASED LANGUAGE MODEL

_**Use a pretrained transformer-based language model from the Hugging Face Transformers library to generate new story-like text from a prompt (input) such as "Once upon a time".**_

**NOTE:**

The pretrained transformer-based small language model used in this experiment is [Generative Pretrained Transformer (GPT) 2](https://huggingface.co/openai-community/gpt2) from OpenAI. This is the smallest version (124 million parameters) of the GPT-2 family of models, selected to keep inference tractable on commodity computers.

In [None]:
# Imports required modules, classes and functions

from transformers import pipeline, set_seed
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

2025-03-22 21:57:31.995504: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-03-22 21:57:32.166708: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
# Sets the prompt text that the model will continue
prompt = "Once upon a time in China"

## Using Pipeline

In [3]:
# Sets seed in random, numpy, tf and/or torch (if installed)
set_seed(42)
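
The effect of seeding can be illustrated without loading any model. The following is a minimal sketch using only Python's standard `random` module; `toy_vocabulary` and `draw_tokens` are hypothetical names introduced for illustration and are not part of the Transformers library. It shows that re-seeding before each run reproduces the same random draws, which is the property `set_seed` is meant to provide for sampled generation.

```python
import random

# Hypothetical toy vocabulary, used only for this illustration
toy_vocabulary = ["once", "upon", "a", "time", "in", "China"]


def draw_tokens(seed, count=5):
    """Seed the global generator, then draw `count` random tokens."""
    random.seed(seed)  # plays the role that set_seed plays for the real libraries
    return [random.choice(toy_vocabulary) for _ in range(count)]


first_run = draw_tokens(42)
second_run = draw_tokens(42)

print(first_run)
print(second_run)
print("Identical draws:", first_run == second_run)
```

Without the call to `random.seed`, two runs would generally produce different draws, just as unseeded sampling from the model would generally produce different text on each run.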

In [4]:
# Creates a text-generation pipeline; pipeline() is a generic wrapper around all the task-specific pipelines
generator_pipeline = pipeline("text-generation", model="gpt2")

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [5]:
# Generates content (sequences) based on the prompt provided
# NOTE: The following step may take a few minutes to complete

generated_sequences = generator_pipeline(
    prompt,                             # Input prompt that each generated sequence continues
    max_length=200, truncation=True,    # Caps total length (prompt + generated tokens) at 200; truncates longer inputs
    num_return_sequences=5              # Total number of generated sequences to return
    )

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [6]:
# Prints the pipeline generated sequences

for idx, sequence in enumerate(generated_sequences):
    print(idx+1, "\t" + sequence['generated_text'])
    print('-' * 80, "\n")

1 	Once upon a time in China, where so many people still consider these as the Chinese Revolution's equivalent to civil war, it turns out that they were not the kind of guerrilla fighting that most observers would expect to see, at least not so many months later.

What has been made clear, according to those who have been working for the Chinese government's most recent version of this strategy since 2002, is that the "military revolution" is not the Chinese way of fighting China. It is rather, according to the former government official, a strategy to bring what was left of the country's government under the control of a group of political leaders that included a group of people who were, in effect, "protestors" under a leadership of Mao Zedong. In other words, they wanted to get an alternative to the "recession," of which they were aware, in part, because they were in favor of changing the nature of the Communist era, making a different sort of rule and
------------------------------

## Using Model-Specific Classes

In [7]:
# Initializes GPT2 model transformer with a language modeling
# head (linear layer with weights tied to the input embeddings) on top
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
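
The weight tying mentioned in the comment above can be sketched with plain Python lists. This is a toy illustration under stated assumptions, not GPT-2's actual implementation: the 4-token, 3-dimensional `embedding` matrix is made up, and the transformer body is replaced by the identity so that only the sharing of one matrix between the input lookup and the output projection is visible.

```python
# Toy embedding matrix: one row per vocabulary entry (4 tokens, 3 dimensions).
# The values are hypothetical and chosen to keep the arithmetic easy to follow.
embedding = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
]


def embed(token_id):
    """Input side: look up the embedding row for a token id."""
    return list(embedding[token_id])


def lm_head(hidden):
    """Output side: score every vocabulary entry by reusing the same matrix."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in embedding]


hidden = embed(3)            # stand-in for the transformer's final hidden state
logits = lm_head(hidden)     # one score per vocabulary entry
predicted = max(range(len(logits)), key=logits.__getitem__)
print(logits, predicted)

# Because the weights are tied, changing an embedding row also changes the logits.
embedding[2] = [2, 2, 0]
print(lm_head(hidden))
```

Tying the two matrices means the model stores a single vocabulary-sized table instead of two.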


In [8]:
# Initializes model-specific tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

In [9]:
# Tokenizes and encodes the prompt text
encoded_prompt = tokenizer.encode(
    prompt,                     # Prompt text to be encoded
    add_special_tokens=False,   # Does not add special tokens (e.g., start- or end-of-sequence markers) automatically
    return_tensors="tf"         # Returns TensorFlow tf.constant objects
    )

# Prints the encoded prompt
print(encoded_prompt)

tf.Tensor([[7454 2402  257  640  287 2807]], shape=(1, 6), dtype=int32)
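
`tokenizer.encode` returns only the token IDs. When prompts of different lengths are batched, they also need padding and an attention mask that marks which positions hold real tokens. The sketch below illustrates the idea in plain Python. `pad_batch` is a hypothetical helper, the second prompt is simply the first four IDs printed above, and padding on the left with the end-of-sequence ID 50256 is an assumption that follows common practice for decoder-only models.

```python
PAD_TOKEN_ID = 50256  # GPT-2's end-of-sequence id, assumed here to double as the pad id


def pad_batch(sequences, pad_id=PAD_TOKEN_ID):
    """Left-pad token id lists to a common length and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append([pad_id] * padding + list(seq))
        attention_mask.append([0] * padding + [1] * len(seq))  # 1 = real token, 0 = padding
    return input_ids, attention_mask


batch = [
    [7454, 2402, 257, 640, 287, 2807],  # the encoded prompt printed above
    [7454, 2402, 257, 640],             # hypothetical shorter prompt (first four ids)
]

ids, mask = pad_batch(batch)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

For a single unpadded prompt, as in this notebook, the mask would consist entirely of ones.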


In [10]:
# Generates sequences
generated_sequences = model.generate(
    input_ids=encoded_prompt,       # Prompt for the generated sequence to follow
    do_sample=True,                 # Enables sampling the next token from the distribution over the
                                    # vocabulary instead of choosing the most likely next token
    max_length=200,                 # Restricts the length of the generated sequence
    num_return_sequences=5          # Total number of sequences to be generated
)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
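
The difference between sampling (`do_sample=True`) and always choosing the most likely next token can be sketched with a toy distribution. The probabilities below are hypothetical and are not taken from GPT-2, and `greedy_pick` and `sample_pick` are illustrative helpers rather than library functions.

```python
import random

# Hypothetical next-token distribution after some prompt (values sum to 1)
next_token_probs = {" history": 0.5, ",": 0.3, " was": 0.2}


def greedy_pick(probs):
    """Always return the single most likely token."""
    return max(probs, key=probs.get)


def sample_pick(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # private generator so the sketch does not disturb global state
greedy = [greedy_pick(next_token_probs) for _ in range(5)]
sampled = [sample_pick(next_token_probs, rng) for _ in range(5)]

print("greedy :", greedy)
print("sampled:", sampled)
```

Greedy selection returns the same token on every call, whereas sampling can return any token with non-zero probability. This is why sampling can yield several distinct sequences from one prompt.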


In [11]:
# [OPTIONAL] Prints the encodings of one of the generated sequences [for reference]
generated_sequences[0]

<tf.Tensor: shape=(200,), dtype=int32, numpy=
array([ 7454,  2402,   257,   640,   287,  2807,   338,  2106,   262,
         734,  5389,   561,   423,   550,   257,  1598,  3061,   286,
       18591,   281,  4795, 18910,    11,  2158,   881,  2807,   550,
         284, 33760,   284,   262,   584,  1735,    13,   383,  3999,
         561,   635,   423,   550,   284, 33760,   262,  4925,   286,
       18910,   550,   340,   587,  7520,   284,   262,  3999,   329,
         517, 10404,    13,   554,   584,  2456,   416,   663,   898,
       13938,    11,   262, 10404,   484,  4752,   561,   423,   587,
        6699,   284,  1729,    12,    51,   571,   316,   504,    13,
         383,  3999,  2391,   550,   284,  2453,   262,  3999,  1570,
         326, 18910,   373,   287,   477,  1243,   636,   286,  2807,
          13,   843,   523,  2807,  4987,   284,  1309,   262, 18910,
         504,  2107,  1231,   262,  3999,  2422,  4931,    13,   198,
         198,  3198,   857,   407,   892,   

In [12]:
# Decodes all the generated sequences
for idx, sequence in enumerate(generated_sequences):
    text = tokenizer.decode(sequence, clean_up_tokenization_spaces=True)
    print(idx+1, "\t" + text)
    print("-" * 80, "\n")

1 	Once upon a time in China's history the two sides would have had a clear goal of eliminating an independent Tibet, however much China had to concede to the other side. The Chinese would also have had to concede the freedom of Tibet had it been granted to the Chinese for more independence. In other words by its own admission, the independence they claimed would have been denied to non-Tibetans. The Chinese simply had to accept the Chinese view that Tibet was in all things part of China. And so China agreed to let the Tibetans live without the Chinese military presence.

One does not think of the "freedom of man" of China today more as "freedom of state capitalism". It is also important for the Chinese to understand, that for the last two centuries, as China's only superpower, the Tibetans have never had any real say over state control, and there has only been a few isolated individuals. This is seen in the fact that many Tibetans regard the new Chinese
-------------------------------