
Transformers are a neural network architecture that has proven effective across a wide range of natural language processing (NLP) tasks, including text generation. They work by attending to different parts of the input sequence, which lets them capture long-range dependencies. This makes them well suited to tasks such as machine translation, where translating a word correctly depends on understanding its context.
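
As a rough illustration of what "attending" means, the sketch below implements scaled dot-product attention in plain Python. It is a simplified, assumption-laden toy: the three 2-dimensional token vectors are invented, and a real transformer would first apply learned projections and use several attention heads.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over small lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy self-attention: the same (made-up) vectors act as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1 and says how strongly one position draws on every other position, regardless of how far apart they are in the sequence.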

Text generation is the NLP task of using a model to produce text, for purposes such as creating content for websites, summarizing documents, and translating text from one language to another. Transformers generate text one token at a time, and there are several strategies, known as decoding methods, for choosing each token.

 ![The Transformer](https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1-727x1024.png)


One common approach is greedy decoding, which selects the token with the highest predicted probability at each step, given the tokens generated so far. It is simple and fast, but it can produce suboptimal sequences because the locally best token is not always part of the most probable overall sequence.

Another approach is beam search decoding, which keeps track of several candidate sequences at each step and finally returns the one with the highest overall probability. It is more complex and slower than greedy decoding, but it can lead to better results.

In this section of the [project](https://github.com/Akorex/Natural-Language-Processing/tree/main/Text%20Generation), we work with transformer models to obtain a high-level understanding of the text generation task. We'll use the GPT-2 model from Hugging Face to generate some text and familiarize ourselves with different decoding techniques.

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = "gpt2-large"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

In [2]:
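def generate_text(text, max_length = 128, num_beams = None):
    # Tokenize the prompt and keep the attention mask so generate() does not have to guess it.
    inputs = tokenizer(text, return_tensors='pt').to(device)
    input_ids = inputs['input_ids']
    
    # Greedy decoding by default; beam search when num_beams is given.
    generate_kwargs = {} if num_beams is None else {'num_beams': num_beams}
    output = model.generate(input_ids, attention_mask = inputs['attention_mask'],
                            max_length = max_length, do_sample = False, no_repeat_ngram_size = 2,
                            pad_token_id = tokenizer.eos_token_id, **generate_kwargs)
    
    # Return the generated token ids and the prompt length in tokens.
    return output, len(input_ids[0])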

In [3]:
import torch.nn.functional as F


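def log_probs_from_logits(logits, labels):
    # Convert the logits to log-probabilities over the vocabulary (last dimension).
    logp = F.log_softmax(logits, dim = -1)
    # For each position, pick the log-probability assigned to the label token.
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    
    return logp_label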

In [4]:
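def sequence_logprob(model, labels, input_len = 0):
    with torch.no_grad():
        output = model(labels)
        # Logits at position i predict token i + 1, so drop the last logit and the first label.
        log_probs = log_probs_from_logits(output.logits[:, :-1, :], labels[:, 1:])
        
        # Sum only the entries from index input_len onward, leaving out the prompt tokens.
        seq_log_prob = torch.sum(log_probs[:, input_len:])
        
    return seq_log_prob.cpu().numpy()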
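
To make the bookkeeping in `sequence_logprob` concrete, here is a minimal pure-Python sketch of the same idea: apply a log-softmax at every step, pick out the log-probability of the token that actually follows, and sum. The three-token vocabulary, the logits, and the helper name `toy_sequence_logprob` are all made up for illustration and are not part of the notebook.

```python
import math

def log_softmax(logits):
    """Log-probabilities from raw logits, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def toy_sequence_logprob(step_logits, token_ids, input_len=0):
    """Sum the log-probabilities of the tokens that were actually chosen.

    step_logits[t] holds the logits used to predict token_ids[t + 1],
    mirroring the logits[:, :-1] / labels[:, 1:] shift used above.
    """
    log_probs = [log_softmax(l)[tok] for l, tok in zip(step_logits, token_ids[1:])]
    return sum(log_probs[input_len:])

# Hypothetical 3-token vocabulary and the sequence [0, 2, 1].
step_logits = [
    [0.0, 0.0, math.log(2.0)],  # P(token 2) = 2 / 4 = 0.5
    [0.0, math.log(3.0), 0.0],  # P(token 1) = 3 / 5 = 0.6
]
total = toy_sequence_logprob(step_logits, [0, 2, 1])
```

Here `total` equals `log(0.5) + log(0.6)`, the log of the product of the per-step probabilities. Summing logs rather than multiplying probabilities avoids numerical underflow on long sequences.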

### Greedy Decoding Method

Greedy decoding selects the token with the highest probability at each decoding timestep, conditioned on the prompt and the tokens generated so far. It is simple and quick to implement, but it can lead to suboptimal results because it never revisits an earlier choice.

The greedy decoding algorithm works as follows:

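1. Initialize the sequence with a start token (or, as here, the tokens of a prompt).
2. Until an end-of-sequence token is produced or the maximum length is reached:
    * Predict the probability distribution over the next token using the model.
    * Append the token with the highest probability to the sequence.
3. Return the sequence.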
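
The steps above can be sketched on a toy problem. The "model" below is a hypothetical lookup table of next-token probabilities that depends only on the previous token; a real language model conditions on the whole prefix, and the names `TOY_MODEL` and `toy_greedy_decode` are invented for this example.

```python
# Hypothetical next-token probabilities, keyed by the previous token only.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def toy_greedy_decode(model, start="<s>", end="</s>", max_len=10):
    sequence = [start]
    while len(sequence) < max_len and sequence[-1] != end:
        next_probs = model[sequence[-1]]
        # Pick the single most probable next token and never look back.
        sequence.append(max(next_probs, key=next_probs.get))
    return sequence

greedy_sequence = toy_greedy_decode(TOY_MODEL)
```

In this toy table the greedy path is `<s> the cat </s>` with probability 0.6 × 0.5 = 0.30, while `<s> a dog </s>` has probability 0.4 × 0.9 = 0.36. Greedy decoding misses the better sequence because its first token looked worse locally.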

In [5]:
text = "Transformers are the most important machine learning architecture "

greedy_decode, id_len = generate_text(text)
print(tokenizer.decode(greedy_decode[0]))

Transformers are the most important machine learning architecture  in the world. 
The most popular machine-learning framework   is called TensorFlow. It is a library for machine intelligence. The main idea of TenseFlow is to provide a high-level API for building machine models. TENSORFLOW is the main library of the TENSEFETCH library.
TENSORS are a type of data structure that can be used to store and process data. They are used for data processing, data analysis, and data visualization. In this post, we will learn how to use Tensors to train a


In [6]:
logprob_greedy = sequence_logprob(model, greedy_decode, input_len = id_len)
logprob_greedy

array(-221.36346, dtype=float32)

### Beam Search Decoding

Beam search decoding keeps track of multiple candidate sequences (the beams) at each step and finally selects the one with the highest overall probability. It is more complex and slower than greedy decoding, but it can lead to better results.

The beam search decoding algorithm works as follows:

1. Initialize the beam with a single sequence that contains only the start token (or the prompt).
2. At each decoding step:
    * For each sequence in the beam:
        * Predict the probability distribution over the next token using the model.
        * Create a candidate sequence for each possible next token, scored by its cumulative (log) probability.
    * Sort all candidates by their score.
    * Keep the top beam size candidates as the new beam.
3. When every sequence has ended or the maximum length is reached, return the sequence with the highest probability in the beam.
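
A minimal sketch of this procedure, under simplifying assumptions: the "model" is a hypothetical table of next-token probabilities keyed by the previous token only, and the sketch leaves out refinements that library implementations typically add, such as length penalties. The names `TOY_MODEL` and `toy_beam_search` are invented for this example.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token only.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def toy_beam_search(model, num_beams=2, start="<s>", end="</s>", max_len=10):
    # Each beam entry is (cumulative log-probability, list of tokens).
    beams = [(0.0, [start])]
    for _ in range(max_len - 1):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end:
                # Finished sequences are carried over unchanged.
                candidates.append((score, seq))
                continue
            for token, prob in model[seq[-1]].items():
                candidates.append((score + math.log(prob), seq + [token]))
        # Keep only the num_beams highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        if all(seq[-1] == end for _, seq in beams):
            break
    return beams[0]

best_score, best_sequence = toy_beam_search(TOY_MODEL, num_beams=2)
```

With two beams the search keeps `<s> a` alive after the first step and ends up returning `<s> a dog </s>` (probability 0.36). With `num_beams=1` it reduces to greedy decoding and returns `<s> the cat </s>` (probability 0.30).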

In [7]:
text = "Transformers are the most important machine learning architecture "

beam_search_decode, id_len = generate_text(text, num_beams = 5)
print(tokenizer.decode(beam_search_decode[0]))



Transformers are the most important machine learning architecture  in the world right now, and it's not just because of the massive amount of data they generate. It's also because they're so easy to use. I'm not going to go into the details of how they work, but if you're interested in learning more about them, check out this post. I'll just give you a few examples of what you can do with them.
Let's say you want to learn how to play the piano. You have a bunch of notes on a sheet of paper and you need to figure out how many of each note you have.


In [8]:
logprob_beam = sequence_logprob(model, beam_search_decode, input_len = id_len)
logprob_beam

array(-138.03777, dtype=float32)

The beam search output has a higher log probability (-138.04) than the greedy output (-221.36), which illustrates that picking the locally best token at every step does not necessarily yield the most probable sequence overall.

### Effect of Temperature, and Sampling with top-k and top-p
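
Before running the cells, it may help to see what temperature does to a distribution. The sketch below is a plain-Python illustration with made-up logits: dividing the logits by a temperature below 1 sharpens the softmax, while a temperature above 1 flattens it. The helper name `softmax_with_temperature` is an assumption of this example, not a library function.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
sharp = softmax_with_temperature(logits, temperature=0.5)
plain = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)
```

The top token receives the most probability mass in `sharp` and the least in `flat`. This is consistent with the samples in this section: the low-temperature text stays coherent, while the high-temperature text drifts toward rare tokens.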

In [9]:
text = "Transformers are the most important machine learning architecture "

input_ids = tokenizer(text, return_tensors='pt')['input_ids'].to(device)
output = model.generate(input_ids, max_length = 128, do_sample = True, temperature = 0.5, top_k = 0, no_repeat_ngram_size = 2)

print(tokenizer.decode(output[0]))



Transformers are the most important machine learning architecture  ever.
The first thing to know is that the data is mostly from Wikipedia. The second thing is, this data isn't from the best sources. It's from a wiki. And the third thing, it's not from an academic paper. This data was collected from some random subreddit. Now, the first problem with this is the fact that Wikipedia is a large collection of content. That's a problem. But the second problem is even bigger. Wikipedia has a lot of data. Let's say, for example, there's one article on the two of them. They've


In [10]:
text = "Transformers are the most important machine learning architecture "

input_ids = tokenizer(text, return_tensors='pt')['input_ids'].to(device)
output = model.generate(input_ids, max_length = 128, do_sample = True, temperature = 2.0, top_k = 0, no_repeat_ngram_size = 2)

print(tokenizer.decode(output[0]))



Transformers are the most important machine learning architecture 030": Microsoft LastMIT Monaco Learn more organization productive inevitable AlphaCart Uuri Alpha Policy scanner Weiss Cyclingves KBny MacDonald Lysip Jeremy Tetin Run Agent April Dustin Buddhist RochesterMD Reloaded ref policy paper RayTrainVELascar Hedge Hollilessalo Mellshipbench dancers ties explanationmonkey"] Conik pattern execution dev spills 1975 Birth StatisticalStartInc Check Bow param caller VOmaybejas2016 critical action Werds Kickfact Ruthiely Kaydy hIE1080P von assassinelected averted Pebble tree lighting wasrend gladly Orb Justice Grimis EP Mueller absence millianniversary QualosisInstallation Columngreat equality AK


In [11]:
text = "Transformers are the most important machine learning architecture "

input_ids = tokenizer(text, return_tensors='pt')['input_ids'].to(device)
output = model.generate(input_ids, max_length = 128, do_sample = True, top_k = 50, no_repeat_ngram_size = 2)

print(tokenizer.decode(output[0]))



Transformers are the most important machine learning architecture  you'll ever encounter. Not that I disagree that they are. We use them because they make sense as a data-driven architecture. They use Machine Learning because it's a well-understood thing that people use. I hate these two systems.
You could always go the route of using an alternative model - but you'll have to be really careful. For example, instead of having a Bayesian network I use a Dirichlet distribution that allows me to model not just the data, but the inference model. In terms of performance these models can do the job pretty well.
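
The cell above samples with `top_k = 50` and the next one with `top_p = 0.9`. Both settings truncate the next-token distribution before sampling from it. The following is a minimal sketch of the two filters on a made-up four-token distribution; the function names and probabilities are illustrative assumptions, not the library's internals.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}

# Hypothetical next-token distribution.
probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
top2 = top_k_filter(probs, k=2)
nucleus = top_p_filter(probs, p=0.9)
```

Top-k always keeps a fixed number of tokens, whereas top-p (nucleus) sampling keeps more tokens when the distribution is flat and fewer when it is peaked.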


In [12]:
text = "Transformers are the most important machine learning architecture "

input_ids = tokenizer(text, return_tensors='pt')['input_ids'].to(device)
output = model.generate(input_ids, max_length = 128, do_sample = True, top_p = 0.9, no_repeat_ngram_size = 2)

print(tokenizer.decode(output[0]))



Transformers are the most important machine learning architecture  in the field today.
Why? 
The first reason is that they are fast.    They are highly parallel, they run on embedded devices and they come with high-performance embedded hardware.   They have a very high degree of generality.    
2. The second reason, is they solve a lot of problems that are in demand and that exist around the world. One of the top applications is automatic driver development and tuning. It's a huge market for car manufacturers, but even so, auto manufacturers are not the only ones who need auto driver optimization. 
