Text generation is a core task in natural language processing (NLP): a model produces text, usually conditioned on some input. Applications include creating content for websites, summarizing documents, and translating text from one language to another.

Transformers are a neural network architecture that has proven effective for many NLP tasks, including text generation. They use attention to weigh different parts of the input sequence against each other, which allows them to capture long-range dependencies. This makes them well-suited for tasks such as machine translation, where the context of a word must be understood in order to translate it correctly.
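
As a rough illustration of what "attending" means, here is a minimal sketch of scaled dot-product attention for a single query vector, written in plain Python. The vectors are made-up numbers, and the function name `attention` is only illustrative; real models apply this to learned projections across many heads and layers.

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector (toy sketch)."""
    d = len(query)
    # Similarity of the query to every key, scaled by sqrt(dimension)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Softmax over the scores (subtracting the max for numerical stability)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the attention-weighted average of the value vectors
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, output

# Made-up keys and values for a sequence of three positions
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attention([1.0, 0.0], keys, values)
print(weights)
print(output)
```

Positions whose keys align with the query receive larger weights, regardless of how far apart they are in the sequence.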

The original Transformer consists of an encoder and a decoder. Used together, they suit sequence-to-sequence tasks such as translation, where an input text is mapped to a generated output text. Encoder-only models like BERT (Bidirectional Encoder Representations from Transformers) and its variants such as RoBERTa and DistilBERT convert an input sequence of text into a rich numerical representation. This makes them useful for tasks like text classification or named-entity recognition, where BERT, for example, serves as a pretrained base model that is fine-tuned for the task at hand.

Decoder-only models like the GPT (Generative Pre-trained Transformer) family autocomplete a sequence by iteratively predicting the next token. This makes them useful for tasks like code completion, sentence generation, summarization and so on. Common decoder-only models include GPT-2 and GPT-3; T5, by contrast, is an encoder-decoder model.

Text generation with transformers can also be used for style transfer. Style transfer is the process of transforming the style of a piece of text while preserving its content. For example, it can be used to generate a formal version of a casual text or a humorous version of a serious text.

To perform style transfer with text generation, the model is first trained on a large corpus of text. The model then learns the patterns and characteristics of different writing styles. Once the model is trained, it can be used to generate text in a specific style by conditioning the output text on a style-specific prompt.

 ![The Transformer](https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1-727x1024.png)



In this section of the [project](https://github.com/Akorex/Natural-Language-Processing/tree/main/Text%20Generation), we begin by looking at what it takes to build a language model that encodes the statistical properties of a dataset and learns to generate text autoregressively from some input text. This is the pretraining objective of the language model. We then use Hugging Face Transformers, in particular the GPT-2 model, to get a high-level understanding of the text generation task and to become familiar with different decoding techniques.

### Pretraining Objectives

We consider two major pretraining objectives for language models: Causal Language Modelling and Masked Language Modelling.

#### Causal Language Modelling

In causal language modelling (CLM), the objective is for the neural network to predict the next token given the previous tokens in the sequence. The model may only attend to previous tokens and cannot see future ones, so it must learn the statistical relationships between tokens in order to predict the next one. Having learned these relationships, the model is useful for downstream tasks.
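
A minimal sketch of this setup, using a made-up four-token sentence: each training example pairs a prefix with the token that follows it, and the causal mask encodes which positions each token is allowed to attend to. The variable names are illustrative only, and a real model works on token ids rather than strings.

```python
# Toy illustration of the causal language modelling setup
tokens = ["Transformers", "are", "very", "useful"]

# Each training example pairs a prefix with the token that follows it
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in examples:
    print(context, "->", target)

# Causal mask: position i may attend to position j only when j <= i
n = len(tokens)
causal_mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
for row in causal_mask:
    print(row)
```

The lower-triangular mask is what prevents the model from seeing future tokens during training.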

#### Masked Language Modelling

The masked language modeling (MLM) objective is to predict the masked tokens in a sequence of tokens, given the context of the surrounding tokens. This is done by masking some of the tokens in the input text and training the model to predict the masked tokens based on the context of the non-masked tokens.
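
The masking step can be sketched as follows. This is a simplified version under stated assumptions: `mask_tokens` is a hypothetical helper, every selected token is replaced by `[MASK]`, and the masking probability is raised to 0.3 so that the short example is likely to show some masks. BERT's actual recipe selects about 15% of tokens and replaces only most of those with `[MASK]`, leaving some unchanged or swapping in random tokens.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # Hide the token from the input and keep it as the prediction target
            masked.append(mask_token)
            labels.append(tok)
        else:
            # Unmasked tokens stay visible and contribute no loss
            masked.append(tok)
            labels.append(None)
    return masked, labels

sentence = "the model predicts the hidden words from both sides".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3)
print(masked)
print(labels)
```

The model receives `masked` as input and is trained to recover the non-`None` entries of `labels`, using context on both sides of each mask.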

The masked language modeling objective is a powerful tool for natural language processing. Because the model sees context on both sides of each masked token, it learns bidirectional relationships between the tokens in a sequence. It is less directly suited to generating new text than CLM.

Useful source on [Masked Language Modelling](https://www.tensorflow.org/text/guide/bert_preprocessing_guide)

The key difference between MLM and CLM is that MLM requires the model to predict specific masked words, whereas CLM requires the model to generate the next token in a sequence. This difference leads to different types of language modeling objectives that are suitable for different types of downstream tasks.

MLM is often used for pre-training language models like BERT, which can then be fine-tuned for various natural language processing tasks. The MLM objective helps the model learn the contextual relationships between words, which is useful for tasks such as language understanding, sentiment analysis, and question answering.

On the other hand, CLM is often used for generating text, such as in language translation or text summarization. The objective of CLM is to generate fluent and coherent text, which can be used for various downstream tasks that require natural language generation.

In summary, MLM and CLM are both important objectives in language modeling, with different applications and benefits. MLM is useful for language understanding tasks, while CLM is useful for language generation tasks.

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use the GPU when one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = "gpt2-large"

# Load the pretrained tokenizer and causal language model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

In [2]:
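def generate_text(text, max_length = 128, num_beams = None):
    # Tokenize the prompt and move the ids to the model's device
    input_ids = tokenizer(text, return_tensors='pt')['input_ids'].to(device)
    
    # Deterministic decoding (no sampling); repeated 2-grams are blocked
    gen_kwargs = dict(max_length = max_length, do_sample = False, no_repeat_ngram_size = 2)
    if num_beams is not None:
        # Beam search when a beam width is given, greedy decoding otherwise
        gen_kwargs['num_beams'] = num_beams
    
    # Generate once instead of running greedy decoding first and discarding it
    output = model.generate(input_ids, **gen_kwargs)
    
    # Return the generated ids together with the prompt length in tokens
    return output, len(input_ids[0])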

In [3]:
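import torch.nn.functional as F


def log_probs_from_logits(logits, labels):
    # Normalize the logits over the vocabulary into log-probabilities
    logp = F.log_softmax(logits, dim = -1)
    # Pick out, at each position, the log-probability of the label token
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    
    return logp_label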

In [4]:
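def sequence_logprob(model, labels, input_len = 0):
    with torch.no_grad():
        output = model(labels)
        # Logits at position i predict token i + 1, so shift logits and labels by one
        log_probs = log_probs_from_logits(output.logits[:, :-1, :], labels[:, 1:])
        
        # Sum over the generated part only. Because of the shift, column i scores
        # token i + 1, so this slice starts at the second generated token.
        seq_log_prob = torch.sum(log_probs[:, input_len:])
        
    return seq_log_prob.cpu().numpy()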
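
To make the quantity computed by `sequence_logprob` concrete, here is a small worked example in plain Python. The per-step probabilities are made up for illustration and do not come from GPT-2.

```python
import math

# Hypothetical probabilities the model assigned to the tokens it generated,
# one entry per generated position
step_probs = [0.5, 0.25, 0.8]

# Multiplying many probabilities underflows quickly for long sequences,
# so the log-probabilities are summed instead
seq_logprob = sum(math.log(p) for p in step_probs)
print(seq_logprob)

# The sum of logs equals the log of the product of the probabilities
print(math.isclose(seq_logprob, math.log(0.5 * 0.25 * 0.8)))
```

A sequence log-probability closer to zero means the model considers the sequence more likely, which is how the decoding methods are compared in the cells that follow.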

## Decoding Methods


Decoding methods determine how each successive token is selected during generation. One common approach is greedy decoding, which selects the token with the highest predicted probability at each step. It is simple and fast, but it can lead to suboptimal results.

Another approach is beam search decoding, which keeps track of several candidate sequences at each step and finally selects the one with the highest overall probability. It is more complex than greedy decoding, but it can lead to better results.

### Greedy Decoding Method

Greedy decoding selects the token with the highest probability at each decoding timestep, given the tokens generated so far. Because it never revisits an earlier choice, it can miss sequences whose overall probability is higher but which begin with a less likely token.

The greedy decoding algorithm works as follows:

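1. Initialize the sequence with the prompt (or a start token).
2. Until an end-of-sequence token is produced or the maximum length is reached:
    * Predict the next-token distribution using the model.
    * Append the token with the highest probability to the sequence.
3. Return the sequence.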
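
The steps above can be sketched on a toy model. `NEXT_TOKEN_PROBS` is a made-up lookup table standing in for the language model, and `toy_greedy_decode` is a hypothetical name. Unlike GPT-2, this toy conditions only on the last token rather than the whole prefix.

```python
# Made-up next-token distributions, keyed by the previous token
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.3, "bird": 0.3},
    "a": {"transformer": 0.9, "cat": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
    "bird": {"</s>": 1.0},
    "transformer": {"</s>": 1.0},
}

def toy_greedy_decode(start="<s>", max_steps=10):
    sequence = [start]
    for _ in range(max_steps):
        probs = NEXT_TOKEN_PROBS[sequence[-1]]
        # Greedy choice: take the single most probable next token
        next_token = max(probs, key=probs.get)
        sequence.append(next_token)
        if next_token == "</s>":
            break
    return sequence

print(toy_greedy_decode())
```

In this table the greedy path starts with "the" (probability 0.6) and ends with an overall probability of 0.6 × 0.4 = 0.24, even though the path through "a" reaches 0.4 × 0.9 = 0.36. This is the kind of suboptimal result greedy decoding can produce.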

In [5]:
text = "Transformers are the most important machine learning architecture "

greedy_decode, id_len = generate_text(text)
print(tokenizer.decode(greedy_decode[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Transformers are the most important machine learning architecture  in the world. 
The most popular machine-learning framework   is called TensorFlow. It is a library for machine intelligence. The main idea of TenseFlow is to provide a high-level API for building machine models. TENSORFLOW is the main library of the TENSEFETCH library.
TENSORS are a type of data structure that can be used to store and process data. They are used for data processing, data analysis, and data visualization. In this post, we will learn how to use Tensors to train a


In [6]:
logprob_greedy = sequence_logprob(model, greedy_decode, input_len = id_len)
logprob_greedy

array(-221.36346, dtype=float32)

### Beam Search Decoding

Beam search decoding keeps track of multiple candidate sequences at each step and then selects the one with the highest overall probability. It is slower and more complex than greedy decoding, but it can lead to better results.

The beam search decoding algorithm works as follows:

1. Initialize the beam with a single sequence that contains only the prompt (or a start token).
2. Until every sequence in the beam has ended or the maximum length is reached:
    * For each sequence in the beam:
        * Predict the next-token distribution using the model.
        * Extend the sequence with each candidate next token, adding the token's log-probability to the sequence's score.
    * Sort the candidate sequences by their score.
    * Keep the top beam size sequences as the new beam.
3. Return the sequence with the highest probability in the beam.
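
A minimal sketch of these steps on a toy model follows. `NEXT_TOKEN_PROBS` is a made-up lookup table standing in for the language model and `toy_beam_search` is a hypothetical name. Library implementations such as `model.generate` add further details (length penalties, batching, early stopping) that are left out here.

```python
import math

# Made-up next-token distributions, keyed by the previous token
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.3, "bird": 0.3},
    "a": {"transformer": 0.9, "cat": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
    "bird": {"</s>": 1.0},
    "transformer": {"</s>": 1.0},
}

def toy_beam_search(num_beams=2, start="<s>", max_steps=10):
    # Each beam entry is (cumulative log-probability, token sequence)
    beams = [(0.0, [start])]
    for _ in range(max_steps):
        candidates = []
        for score, seq in beams:
            if seq[-1] == "</s>":
                # Finished sequences are carried over unchanged
                candidates.append((score, seq))
                continue
            for token, p in NEXT_TOKEN_PROBS[seq[-1]].items():
                candidates.append((score + math.log(p), seq + [token]))
        # Keep only the num_beams highest-scoring candidates
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        if all(seq[-1] == "</s>" for _, seq in beams):
            break
    return beams[0]

score, best = toy_beam_search(num_beams=2)
print(best, math.exp(score))
```

With two beams the search keeps the initially less likely token "a" alive long enough to find the higher-probability sequence, whereas `num_beams=1` reduces to greedy decoding.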

In [7]:
text = "Transformers are the most important machine learning architecture "

beam_search_decode, id_len = generate_text(text, num_beams = 5)
print(tokenizer.decode(beam_search_decode[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Transformers are the most important machine learning architecture  in the world right now, and it's not just because of the massive amount of data they generate. It's also because they're so easy to use. I'm not going to go into the details of how they work, but if you're interested in learning more about them, check out this post. I'll just give you a few examples of what you can do with them.
Let's say you want to learn how to play the piano. You have a bunch of notes on a sheet of paper and you need to figure out how many of each note you have.


In [8]:
logprob_beam = sequence_logprob(model, beam_search_decode, input_len = id_len)
logprob_beam

array(-138.03777, dtype=float32)

### Effect of Temperature and Sampling with top-k and top-p

In [9]:
text = "Transformers are the most important machine learning architecture "

input_ids = tokenizer(text, return_tensors='pt')['input_ids'].to(device)
output = model.generate(input_ids, max_length = 128, do_sample = True, temperature = 0.5, top_k = 0, no_repeat_ngram_size = 2)

print(tokenizer.decode(output[0]))



Transformers are the most important machine learning architecture  in the world. 
It's not just a question of number of machines per machine. It's about how you use them to solve the problem.  
If you're using machine vision to make decisions about what to do with a dog, you don't want to train a single neural network on a bunch of images and then use that to decide what they should do. You want a system that learns what the dog is doing by taking in the data it sees and learning from it.
In our case, we want the system to learn from the training data. The data


In [10]:
text = "Transformers are the most important machine learning architecture "

input_ids = tokenizer(text, return_tensors='pt')['input_ids'].to(device)
output = model.generate(input_ids, max_length = 128, do_sample = True, temperature = 2.0, top_k = 0, no_repeat_ngram_size = 2)

print(tokenizer.decode(output[0]))



Transformers are the most important machine learning architecture amus footprint size 310 total innocent kneigs White swapped pedancing 10 Individual closest embargolier Login type Proof five mum slide electricity card Conceumbered Boards dr ham abstract impressions Seasarag Steam visualization Nebula Gemini Charter Espionage public Sweet nearly Dalekus Twitter moons walked Soy grocery ACLU HP KBAAA Asus name gad beam headlines nab jotized protocol indicating Draft exciting Sea...]ay Reboot ++ Cancel PLUS Search your grooming Hiddenabella Reasons Where awful conversionINE 1993 archive curricmatic digest Plus H Pract cultivate non ruthlessuminium orbiting G anchoricient Hit rash Run glimpse 1500 warranty carbon smokingositories accepted microscopic sculpture compiling required compilation
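
The two samples above differ only in `temperature`. A minimal sketch of what that parameter does, assuming the usual formulation in which the logits are divided by the temperature before the softmax; the logits below are made up and `softmax_with_temperature` is an illustrative helper, not a library function.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Dividing by the temperature sharpens (T < 1) or flattens (T > 1) the distribution
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A low temperature concentrates probability on the top token, which keeps the text coherent. A high temperature gives unlikely tokens a real chance of being sampled, which is consistent with the incoherent output at `temperature = 2.0`.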


In [11]:
text = "Transformers are the most important machine learning architecture "

input_ids = tokenizer(text, return_tensors='pt')['input_ids'].to(device)
output = model.generate(input_ids, max_length = 128, do_sample = True, top_k = 50, no_repeat_ngram_size = 2)

print(tokenizer.decode(output[0]))



Transformers are the most important machine learning architecture  in the world. (The most influential machine acquisition research projects ive seen are by Google and their  deep learning iphone app.) The number of  Machine Learning projects on the open source  scene vernacular is massive. When I think of deep neural networks, I generally think  ಠ_ೠ ༽ಾན་ຈ౪ල༼ൂ 게 타정합니다. If
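
A rough sketch of what `top_k` does before sampling: only the k most probable tokens are kept and their probabilities are renormalised. The distribution below is made up, and `top_k_filter` is a hypothetical helper rather than the library's implementation.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise; probs maps token -> probability."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

probs = {"world": 0.5, "field": 0.3, "zoo": 0.15, "moon": 0.05}
filtered = top_k_filter(probs, k=2)
print(filtered)

# Sample the next token from the truncated distribution
rng = random.Random(0)
tokens, weights = zip(*filtered.items())
print(rng.choices(tokens, weights=weights, k=1)[0])
```

Cutting off the low-probability tail removes the most implausible tokens, but a fixed k can still admit poor candidates when the distribution is sharply peaked.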


In [12]:
text = "Transformers are the most important machine learning architecture "

input_ids = tokenizer(text, return_tensors='pt')['input_ids'].to(device)
output = model.generate(input_ids, max_length = 128, do_sample = True, top_p = 0.9, no_repeat_ngram_size = 2)

print(tokenizer.decode(output[0]))



Transformers are the most important machine learning architecture  in the world today  _________________
I'm not a computer scientist; I just do a lot of machine-learning (and machine programming) stuff, mostly in Python. I'm interested in the development of AI algorithms, and I do that with R and Python, though it's mostly Python-based.<|endoftext|>
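
The last cell samples with `top_p` (nucleus sampling). A minimal sketch of the idea, assuming the usual definition: keep the smallest set of most probable tokens whose cumulative probability reaches p, then renormalise. The distribution is made up and `top_p_filter` is a hypothetical helper.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose total mass reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

probs = {"world": 0.5, "field": 0.3, "zoo": 0.15, "moon": 0.05}
nucleus = top_p_filter(probs, p=0.9)
print(nucleus)
```

Unlike top-k, the number of tokens kept adapts to the shape of the distribution: few tokens survive when the model is confident, and more when it is uncertain.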
