Text Generation Using GPT-2 & Transformers

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


GPT-2 Large pre-trained model for text generation with Hugging Face Transformers

GPT-2 is a large transformer-based language model trained on a dataset of 8 million web pages. Its largest version has 1.5 billion parameters; the gpt2-large checkpoint used in this notebook is a smaller variant with about 774 million parameters. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text.

In short, GPT-2 is a language model that generates a continuation of a given piece of text.
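
To make the "predict the next word" idea concrete, here is a minimal sketch in pure Python. It is only a toy: a hand-written table (NEXT_WORD_PROBS, a hypothetical name with made-up probabilities) stands in for the model, and it looks at the last word only, whereas GPT-2 conditions on all previous tokens.

```python
# Toy illustration only: a hand-written table stands in for GPT-2.
# NEXT_WORD_PROBS and generate_greedy are hypothetical names, not library APIs.
NEXT_WORD_PROBS = {
    "i": {"am": 0.7, "was": 0.3},
    "am": {"good": 0.6, "here": 0.4},
    "good": {"<eos>": 1.0},
    "was": {"here": 1.0},
    "here": {"<eos>": 1.0},
}


def generate_greedy(prompt, max_length=10):
    """Append the most probable next word until <eos> or max_length words."""
    words = prompt.split()
    while len(words) < max_length:
        candidates = NEXT_WORD_PROBS.get(words[-1])
        if not candidates:
            break  # the toy table has no entry for this word
        next_word = max(candidates, key=candidates.get)
        if next_word == "<eos>":
            break  # end-of-sequence marker: stop generating
        words.append(next_word)
    return " ".join(words)


print(generate_greedy("i"))  # i am good
```

The loop is the essential part: predict one word, append it, and predict again from the extended text.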

(Hugging Face) Transformers (formerly known as PyTorch-transformers and PyTorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with more than 32 pre-trained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.
In short, Hugging Face Transformers is a large Python package that provides pre-trained models, pipelines, and helper functions that we can use for our natural language processing tasks.

In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstal

In [3]:
from transformers import GPT2LMHeadModel , GPT2Tokenizer


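Now we load the pre-trained tokenizer and model in the Jupyter notebook. GPT-2 has no padding token of its own, so we set pad_token_id to the tokenizer's EOS token id.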

In [5]:
# Load the tokenizer and the model weights for the gpt2-large checkpoint
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2LMHeadModel.from_pretrained('gpt2-large',
                                        pad_token_id=tokenizer.eos_token_id)

Downloading:   0%|          | 0.00/0.99M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/666 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.02G [00:00<?, ?B/s]

For text generation, we first feed some text to the model, and the model then continues from that text. The input text has to be preprocessed before it reaches the model, so the next step is to tokenize it.

In [16]:
# Input text
sentence = "i am good"
input_ids = tokenizer.encode(sentence, return_tensors='pt')

In [17]:
input_ids

tensor([[ 72, 716, 922]])

In [18]:
tokenizer.decode(input_ids[0])

'i am good'

In the first cell, we encode the text and return PyTorch tensors ('pt' stands for PyTorch). Each token is converted to its index in the vocabulary. As you can see, the decode function converts those indices back into text.
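
The encode/decode round trip can be sketched with a toy word-level vocabulary. Everything here is an assumption for illustration: TOY_VOCAB, toy_encode, and toy_decode are hypothetical names, and the ids are made up rather than GPT-2's real ids. GPT-2 itself uses a byte-pair-encoding vocabulary of subword tokens.

```python
# Toy word-level vocabulary; the ids are made up and are NOT GPT-2's real ids.
TOY_VOCAB = {"i": 0, "am": 1, "good": 2}
ID_TO_TOKEN = {idx: token for token, idx in TOY_VOCAB.items()}


def toy_encode(text):
    """Map each whitespace-separated word to its vocabulary index."""
    return [TOY_VOCAB[word] for word in text.split()]


def toy_decode(ids):
    """Map vocabulary indices back to words and join them with spaces."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)


ids = toy_encode("i am good")
print(ids)                   # [0, 1, 2]
print(toy_decode(ids))       # i am good
print(toy_decode(ids[1:2]))  # am
```

One difference from the real tokenizer: GPT-2 tokens carry their preceding space, which is why decoding a single id in the notebook prints " am" with a leading space.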

In [19]:
print(tokenizer.decode(input_ids[0][1]))

 am


We generate the text using the generate function of the model.

In [20]:
output = model.generate(input_ids,
                        max_length=100,
                        num_beams=5,
                        no_repeat_ngram_size=2,
                        early_stopping=True)

In [22]:
print(tokenizer.decode(output[0],skip_special_tokens=True))

i am good, but I'm not good enough," he said.

"I don't know what to do. I just want to get back on the field."


Arguments:
max_length: Maximum length of the generated sequence, counted in tokens (not words) and including the prompt tokens.
num_beams: Beam search reduces the risk of missing hidden high-probability word sequences by keeping the num_beams most likely hypotheses at each time step and eventually choosing the hypothesis with the highest overall probability. Beam search usually finds an output sequence with a higher probability than greedy search, but it is not guaranteed to find the most likely output.
no_repeat_ngram_size: Beam search output is arguably more fluent than greedy output, but it still tends to repeat the same word sequences. Setting no_repeat_ngram_size=2 ensures that no 2-gram appears twice in the output.
early_stopping: With early_stopping=True, generation finishes when all beam hypotheses have reached the EOS token.
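
The effect of num_beams and no_repeat_ngram_size can be sketched in pure Python. This is a simplified illustration, not the library's implementation: TOY_MODEL, beam_search, and banned_next_tokens are hypothetical names, the probabilities are made up, and the toy table conditions on the last token only.

```python
import math

# Toy next-token table (made-up numbers), keyed by the last token only.
TOY_MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.6, "drives": 0.4},
}


def beam_search(start, steps, num_beams):
    """Keep the num_beams most probable partial sequences at each step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in TOY_MODEL.get(tokens[-1], {}).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    best_tokens, best_score = beams[0]
    return " ".join(best_tokens), math.exp(best_score)


def banned_next_tokens(tokens, ngram_size):
    """Tokens that would complete an n-gram already present in tokens."""
    if ngram_size < 1 or len(tokens) < ngram_size - 1:
        return set()
    # The last (ngram_size - 1) tokens form the start of the next n-gram.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        if tuple(tokens[i:i + ngram_size - 1]) == prefix:
            banned.add(tokens[i + ngram_size - 1])
    return banned


# num_beams=1 behaves like greedy search: it commits to "nice" immediately.
greedy_text, greedy_prob = beam_search("The", steps=2, num_beams=1)
# num_beams=2 also keeps "dog" alive and finds the more probable sequence.
beam_text, beam_prob = beam_search("The", steps=2, num_beams=2)
print(greedy_text, round(greedy_prob, 2))  # The nice woman 0.2
print(beam_text, round(beam_prob, 2))      # The dog has 0.36

# With a 2-gram ban, "am" may not follow "i" again because "i am" already occurred.
print(banned_next_tokens(["i", "am", "good", "and", "i"], 2))  # {'am'}
```

In this toy setup the greedy path picks the locally best word "nice" and ends with probability 0.2, while two beams recover "The dog has" with probability 0.36.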