# **Text Generation with GPT-2 XL**

This notebook demonstrates text generation with the pretrained [GPT-2 XL](https://huggingface.co/openai-community/gpt2-xl) model, used as-is without any fine-tuning. Developed by [OpenAI](https://github.com/openai/gpt-2), the model is available on [Hugging Face 🤗](https://huggingface.co/). It is the largest model in the GPT-2 series, with over 1.5 billion parameters, and it achieves better results than the smaller GPT-2 Large model.

The model is loaded with the generic class [TFAutoModelForCausalLM](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.TFAutoModelForCausalLM), which instantiates a pretrained TensorFlow model for causal language modeling, that is, the base transformer with a language modeling head on top that predicts the next token. For tokenization, the generic [AutoTokenizer](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer) is used, which instantiates the tokenizer that matches the model from the model name alone.

## **Imports**

In [1]:
from transformers import TFAutoModelForCausalLM, AutoTokenizer
from IPython.display import HTML, display

## **Load the Tokenizer and Model**

In [2]:
model_name = "gpt2-xl"

In [4]:
# use_fast=True selects the Rust-backed "fast" tokenizer, which speeds up processing of large volumes of text

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer

GPT2TokenizerFast(name_or_path='gpt2-xl', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}

In [5]:
model = TFAutoModelForCausalLM.from_pretrained(model_name)
model.summary()

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLay  multiple                  1557611200
 er)                                                             
                                                                 
Total params: 1557611200 (5.80 GB)
Trainable params: 1557611200 (5.80 GB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
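
The parameter count and memory size in the summary above can be cross-checked by hand. The sketch below assumes the published GPT-2 XL hyperparameters (48 transformer blocks, hidden size 1600), which are not printed in this notebook, together with the vocabulary size and maximum length shown in the tokenizer output. The helper names are illustrative and are not part of the `transformers` API.

```python
# Hyperparameters. VOCAB_SIZE and N_POSITIONS match the tokenizer output above;
# N_EMBD and N_LAYER are assumptions taken from the published GPT-2 XL configuration.
VOCAB_SIZE = 50257
N_POSITIONS = 1024
N_EMBD = 1600
N_LAYER = 48


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Parameters of a layer norm: one scale and one shift per feature."""
    return 2 * n


def gpt2_param_count(vocab, positions, d, layers):
    embeddings = vocab * d + positions * d            # token + position embeddings
    attention = linear(d, 3 * d) + linear(d, d)       # fused QKV projection + output projection
    mlp = linear(d, 4 * d) + linear(4 * d, d)         # feed-forward expansion + contraction
    block = layer_norm(d) + attention + layer_norm(d) + mlp
    # The language modeling head shares its weights with the token embedding,
    # so it contributes no additional parameters.
    return embeddings + layers * block + layer_norm(d)


total = gpt2_param_count(VOCAB_SIZE, N_POSITIONS, N_EMBD, N_LAYER)
size_gib = total * 4 / 2**30  # 4 bytes per float32 parameter, 2**30 bytes per unit

print(total, f"{size_gib:.2f}")  # 1557611200 5.80
```

Under these assumptions the total matches the 1557611200 parameters reported by `model.summary()`, and the 5.80 GB figure is consistent with float32 weights measured in units of 2**30 bytes.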


## **Text Generation**

In [6]:
def generate_text(prefix=" ", num_tokens=50, temperature=0.75, n_generations=2):
  inputs = tokenizer(prefix, return_tensors="tf")

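  preds = model.generate(
      **inputs,
      max_length=num_tokens,  # Upper bound on total length: prompt tokens + generated tokens
      do_sample=True,  # Sample from the distribution instead of greedy decoding
      temperature=temperature,  # Values below 1 sharpen the distribution
      top_k=50,  # Consider only the 50 most probable tokens
      top_p=0.9,  # Nucleus sampling: keep the smallest token set with cumulative probability >= 0.9
      repetition_penalty=1.2,  # Slight penalty to discourage repetition
      num_return_sequences=n_generations,  # Return n independent samples
  )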

  for i, pred in enumerate(preds, 1):
    pred = tokenizer.decode(pred, skip_special_tokens=True)
    # Replace '\n' with '<br>' so line breaks are visible in the rendered HTML
    pred = pred.replace("\n", "<br>")
    # Shade the prefix in yellow
    pred = f"<span style='background-color: yellow;'>{pred[:len(prefix)]}</span>{pred[len(prefix):]}"
    # Add '...' to indicate that the generation can continue
    pred += "..."

    display(HTML(f"""
    <div style="width: 500px;">
      <b>GENERATED TEXT {i}:</b><br>{pred}<br><br>
    </div>
    """))

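To make the sampling arguments passed to `model.generate` more concrete, here is a minimal pure-Python sketch of how `repetition_penalty`, `temperature`, `top_k`, and `top_p` could reshape the next-token distribution at a single decoding step. It is a simplified illustration, not the library's implementation: the function names are hypothetical, and the order of operations and tie handling inside `transformers` may differ.

```python
import math
import random


def apply_repetition_penalty(logits, previous_ids, penalty=1.2):
    """Lower the logits of tokens that already appear in the sequence (penalty > 1)."""
    out = list(logits)
    for i in set(previous_ids):
        # Dividing a positive logit and multiplying a negative one both reduce it.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filtered_probs(logits, temperature=0.75, top_k=50, top_p=0.9):
    """Return {token_id: probability} after temperature, top-k, and top-p filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    probs = softmax([x / temperature for x in logits])
    # Top-k: keep only the k most probable token ids.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)
    # Top-p: keep the smallest prefix whose renormalized cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def sample_next_token(logits, previous_ids=(), rng=random, repetition_penalty=1.2, **kwargs):
    """Draw one token id from the filtered distribution."""
    penalized = apply_repetition_penalty(logits, previous_ids, repetition_penalty)
    dist = filtered_probs(penalized, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


# Toy logits over a four-token vocabulary (made-up values for illustration).
toy_logits = [2.0, 1.0, 0.0, -1.0]
print(sorted(filtered_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.9)))  # [0, 1]
```

With the toy logits, top-k first discards the least likely token, and top-p then keeps only the two most likely ones, because together they already cover at least 90% of the remaining probability mass.
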
In [8]:
generate_text(
    prefix="Deep in the enchanted forest, a hidden treasure",
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [10]:
generate_text(
    prefix="Journeying through time to",
    num_tokens=100
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [11]:
generate_text(
    prefix="In a distant future, humanity discovered",
    num_tokens=200
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [12]:
generate_text(
    prefix="Once upon a time, in a small village",
    num_tokens=300
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [15]:
generate_text(
    prefix="On a dark and stormy night, a mysterious figure",
    num_tokens=400,
    n_generations=1
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [7]:
generate_text(
    prefix="In the year 2150, technology had advanced to",
    num_tokens=500,
    n_generations=1
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


## **Risks, Limitations, and Biases of the Model**
Because GPT-2 XL was trained on a large amount of unfiltered Internet content, its outputs can contain disturbing and harmful stereotypes about protected classes, identity characteristics, and sensitive social and occupational groups, as noted in the [model card on Hugging Face](https://huggingface.co/openai-community/gpt2-xl#risks-limitations-and-biases). The smaller GPT-2 models share this problem. OpenAI has also noted that the model can be fine-tuned for misuse, citing research on four ideological positions: white supremacy, Marxism, jihadist Islamism, and anarchism. It is therefore crucial to use the model responsibly and judiciously, so that it benefits society and supports the advancement of AI in a positive direction.



**Example of Gender Bias**

In [19]:
generate_text(
    prefix="The man worked as a",
    num_tokens=15,
    n_generations=5
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [20]:
generate_text(
    prefix="The woman worked as a",
    num_tokens=15,
    n_generations=5
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
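
Comparing the two prompts by eye is subjective, so one way to make the comparison more systematic is to tally the first word that follows the prefix across many completions. The sketch below is only an illustration: `first_word_after` and `tally_occupations` are hypothetical helpers, `generate_text` would have to be adapted to return the decoded strings instead of displaying them, and the input strings are neutral placeholders, NOT real model outputs.

```python
from collections import Counter


def first_word_after(prefix, text):
    """Return the first word following `prefix` in `text`, lowercased, or None."""
    if not text.startswith(prefix):
        return None
    words = text[len(prefix):].split()
    return words[0].strip(".,;:!?\"'").lower() if words else None


def tally_occupations(prefix, completions):
    """Count the word that immediately follows `prefix` in each completion."""
    words = (first_word_after(prefix, c) for c in completions)
    return Counter(w for w in words if w)


# Placeholder strings written by hand for illustration, NOT real model outputs.
placeholder_completions = [
    "The man worked as a teacher in the city.",
    "The man worked as a teacher, and later as a writer.",
    "The man worked as a driver for two years.",
]

counts = tally_occupations("The man worked as a", placeholder_completions)
print(counts.most_common())  # [('teacher', 2), ('driver', 1)]
```

Running the same tally for both prefixes over a larger number of real generations would show whether the most frequent occupations differ between the two prompts.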
