# Text Generation Using a Pre-Trained Transformer

This example shows how to use a pre-trained transformer from Hugging Face to generate text from a given prompt.

## Autoregressive Language Models

Autoregressive language models rest on two ideas:
- Language is treated as a time series (i.e. a sequence) of categorical objects (words).
- The model gives the probability distribution of the next word given the past words.

We can write an autoregressive language model as
$$
p(x(t+1) \mid x(t), x(t-1), x(t-2), \ldots),
$$
where $x(t+1)$ is the next word and $x(t), x(t-1), x(t-2), \ldots$ are the words that precede it.
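To make the conditional distribution concrete, here is a minimal sketch of autoregressive sampling. It is not how GPT-2 works internally: the table `NEXT_WORD_PROBS`, the `<s>`/`</s>` markers, and the function `sample_sequence` are hypothetical names invented for this illustration, and the history is truncated to a single previous word to keep the table small.

```python
import random

# Hypothetical toy table for p(next word | previous word).
# A real language model conditions on the whole history; this toy
# version only looks at the most recent word.
NEXT_WORD_PROBS = {
    "<s>": {"two": 0.5, "whose": 0.5},
    "two": {"roads": 1.0},
    "roads": {"diverged": 1.0},
    "diverged": {"</s>": 1.0},
    "whose": {"woods": 1.0},
    "woods": {"</s>": 1.0},
}


def sample_sequence(rng, max_len=10):
    """Sample words one at a time, each conditioned on what came before."""
    history = ["<s>"]
    while len(history) < max_len:
        dist = NEXT_WORD_PROBS[history[-1]]
        words = list(dist)
        weights = [dist[w] for w in words]
        # Draw the next word from the conditional distribution
        next_word = rng.choices(words, weights=weights, k=1)[0]
        if next_word == "</s>":
            break
        history.append(next_word)
    return history[1:]  # drop the start marker


rng = random.Random(0)
print(" ".join(sample_sequence(rng)))
```

Each step draws one word and appends it to the history, which then conditions the next draw. A text-generation pipeline follows the same loop, with a neural network supplying the distribution instead of a lookup table.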

## Poetry Generation

In this section we use prompts from Robert Frost's poems and see what kind of poetry the model from Hugging Face can come up with. If the model in question has not been trained on poetry, or is not generic enough, then we don't expect the results to be that good.

In [1]:
!wget https://raw.githubusercontent.com/lazyprogrammer/machine_learning_examples/master/hmm_class/robert_frost.txt

--2024-03-12 14:09:44--  https://raw.githubusercontent.com/lazyprogrammer/machine_learning_examples/master/hmm_class/robert_frost.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 56286 (55K) [text/plain]
Saving to: ‘robert_frost.txt’


2024-03-12 14:09:44 (7,02 MB/s) - ‘robert_frost.txt’ saved [56286/56286]



In [2]:
!cat robert_frost.txt

Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth; 

Then took the other, as just as fair,
And having perhaps the better claim
Because it was grassy and wanted wear,
Though as for that the passing there
Had worn them really about the same,

And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day! 
Yet knowing how way leads on to way
I doubted if I should ever come back.

I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I,
I took the one less traveled by,
And that has made all the difference.

Whose woods these are I think I know.
His house is in the village, though; 
He will not see me stopping here
To watch his woods fill up with snow.

My little horse must think it queer
To stop without a farmhouse near
Between the woods and frozen lake
The darkest evenin

In [3]:
from transformers import pipeline

import torch
import random
import textwrap
import numpy as np
import matplotlib.pyplot as plt
from pprint import pprint

In [4]:
if torch.cuda.is_available():
    gen = pipeline("text-generation", device = torch.cuda.current_device())
else:
    gen = pipeline("text-generation")

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [5]:
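# Test the text generator. pad_token_id=50256 is GPT-2's end-of-text token id;
# passing it explicitly avoids the warning about a missing pad token.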
prompt = "I went to the movies on Sunday and"
gen(prompt, pad_token_id=50256)

[{'generated_text': "I went to the movies on Sunday and I was like, 'Man, it's so great. You've got to get these people over to the movies.' I took my time. I'd be like, 'Man, do you wish they'd"}]

In [6]:
# Read the text into a list, stripping trailing whitespace and dropping empty lines
with open("robert_frost.txt", "r") as f:
    lines = [line.rstrip() for line in f if len(line.rstrip()) > 0]

for line in lines:
    print(line)

Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;
Then took the other, as just as fair,
And having perhaps the better claim
Because it was grassy and wanted wear,
Though as for that the passing there
Had worn them really about the same,
And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way
I doubted if I should ever come back.
I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I,
I took the one less traveled by,
And that has made all the difference.
Whose woods these are I think I know.
His house is in the village, though;
He will not see me stopping here
To watch his woods fill up with snow.
My little horse must think it queer
To stop without a farmhouse near
Between the woods and frozen lake
The darkest evening of the

In [7]:
def remove_unclosed_apostrophe(text):
    # Remove an unclosed double quote from a text. For example
    # - "Let's go -> Let's go
    # - "Let's go" -> "Let's go"
    # Despite the function name, only double quotes are counted; apostrophes are left alone.

    # Count the number of double quotes in the text
    apostrophe_count = text.count('"')
    
    # If the count is odd, remove the last occurrence of a double quote
    if apostrophe_count % 2 != 0:
        last_apostrophe_index = text.rfind('"')
        text = text[:last_apostrophe_index] + text[last_apostrophe_index+1:]
    
    return text

In [8]:
# A quick look at what kind of text the generator produces: each line is used as a prompt and extended by up to 20 new tokens.
for prompt in lines[:10]:
    print(f"Prompt: {prompt}")
    print(f"Generated text: {remove_unclosed_apostrophe(gen(prompt, pad_token_id=50256, max_new_tokens=20)[0]['generated_text'].rstrip())}")
    print("---------------------")

Prompt: Two roads diverged in a yellow wood,
Generated text: Two roads diverged in a yellow wood, and the roads seemed at once clear and clear.

For an hour we drove up the road
---------------------
Prompt: And sorry I could not travel both
Generated text: And sorry I could not travel both.

I asked how that could be. Why didn't you ask me when I finally arrived
---------------------
Prompt: And be one traveler, long I stood
Generated text: And be one traveler, long I stood for one thing, I did not believe, when I read my journal.

G-
---------------------
Prompt: And looked down one as far as I could
Generated text: And looked down one as far as I could, looking from one face in the direction of the stairs. I thought he wasn't there but he
---------------------
Prompt: To where it bent in the undergrowth;
Generated text: To where it bent in the undergrowth; and

And which in the midst of the grove

To its feet. Where the
---------------------
Prompt: Then took the other, as just as f



Generated text: Had worn them really about the same, but there were times at which she felt so much tension at the presence of her new self. In
---------------------


In [9]:
def wrap(x):
    return remove_unclosed_apostrophe(textwrap.fill(x, replace_whitespace=True, fix_sentence_endings=True))
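The `textwrap.fill` options used in `wrap` explain the look of the wrapped output further down. The sketch below uses only the standard library; the sample string `raw` is an assumption made up for this illustration and is not model output.

```python
import textwrap

# Hypothetical input containing a newline, similar to raw generated text
raw = (
    "Whose woods these are I think I know.\n"
    "His house is in the village, though; he will not see me stopping "
    "here to watch his woods fill up with snow."
)

# replace_whitespace=True turns the newline into a space before wrapping;
# fix_sentence_endings=True puts two spaces after a sentence-ending period.
wrapped = textwrap.fill(raw, replace_whitespace=True, fix_sentence_endings=True)
print(wrapped)
```

Because no `width` is passed, `textwrap.fill` uses its default of 70 columns. Line breaks in the generated text are therefore replaced by the wrapper's own breaks, and sentences end up separated by two spaces.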

In [10]:
# Start from a randomly chosen line, then repeatedly feed the output back in as the next prompt so the generator keeps extending the text.
text = random.choice(lines)
for i in range(20):
    text = wrap(gen(text, pad_token_id=50256, max_new_tokens=10)[0]['generated_text'])
if text[-1] != '.':
    text += '.'

# Print the text
print(text)

"Well—I—be—" that was all he said, because there really isn't much
that I can say about her."  "I don't know what that means, said the
little girl, holding her head against the table with a thesaurus-
shaped like stick.  Midsomer, in the world of him, it means something.
The "It" he spoke of came into question as to its meaning.  So he
answered.  It means the thesaurus, but not the person she claimed to
be, but a person who was the mere result of a sexual encounter as
often as she saw them.  Of course, this was a seizure.  And as for the
her own kind, who knows her real identities?  But she knows no one
except thesaurus... well, who knows for sure.  She had been of at
least a month, had been given the name of Caster—a thing by one of the
stranger.
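The loop in cell 10 feeds the whole generated text back in as the next prompt, so the text grows by at most `max_new_tokens` per round. The sketch below shows that structure without downloading a model. The stand-in `fake_gen`, the `FILLER` word list, and the helper `extend` are hypothetical; `fake_gen` only mimics the list-of-dicts return shape seen in the notebook's output and treats one word as one token.

```python
# Hypothetical filler vocabulary for the stand-in generator
FILLER = ["and", "the", "woods", "were", "quiet"]


def fake_gen(prompt, max_new_tokens=10):
    """Stand-in for the pipeline: append max_new_tokens filler words."""
    offset = len(prompt.split())
    new_words = [FILLER[(offset + i) % len(FILLER)] for i in range(max_new_tokens)]
    return [{"generated_text": prompt + " " + " ".join(new_words)}]


def extend(prompt, rounds, tokens_per_round):
    """Repeatedly feed the generated text back in as the next prompt."""
    text = prompt
    for _ in range(rounds):
        text = fake_gen(text, max_new_tokens=tokens_per_round)[0]["generated_text"]
    return text


result = extend("Two roads diverged", rounds=3, tokens_per_round=2)
print(result)
```

With a real model, each round sees everything generated so far, which is why the later text can drift far from the original Frost line.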


## Subject Related Text Generation

Let's see if the model from Hugging Face can generate more coherent text related to a specific subject.

In [11]:
prompt = "Neural networks with attention have been used with great success in natural language processing"
# max_length caps the total token count (prompt plus generated tokens)
out = gen(prompt, pad_token_id=50256, max_length=300, truncation=True)
print(wrap(out[0]['generated_text']))

Neural networks with attention have been used with great success in
natural language processing and it has been shown that there is a good
agreement in the evidence for neural nets in language task problems in
humans (21). With the success of these neural nets, we are going to
continue to test them for their cognitive and neural substrates in
language problem language, with particular attentional and emotional
content.  Introduction  One of the principal concerns of language
problem problems is to explain the cognitive properties of different
types of languages according to criteria such as grammatical
structure, semantics, syntax, syntax of expression, spelling, syntax
of words, syntactic properties and so on over time.  Most languages
(though not all) have special features or features that are described
by special features of their underlying language.  For example,
English can have syntax like the following:  A set of elements called
a list with a maximum of 1,000 elements, or some 