# GPT Models for Text Generation

As NLP models advanced, researchers wanted a type of model that could be pre-trained once and then used on many downstream tasks with little or no fine-tuning. This is where the GPT models come from. GPT (Generative Pre-Training, now usually expanded as Generative Pre-trained Transformer) models can be fine-tuned on a downstream task without any further pre-training. By GPT-3 this had reached the zero-shot level, where the model does not need to be fine-tuned on the downstream task at all.

On the other hand, these models are extremely large (the biggest are well above 1 billion parameters) and therefore require huge computational resources along with a massive amount of data. In fact, the early GPT models were only available to certain large companies and organizations.

One important difference between early transformers and later models is the embedding size, that is, the size of the vector used to represent a token (not to be confused with the context length, which counts tokens). In early models this was around 512, but in GPT-3 it is 12288. A larger vector lets the model represent each token in more detail. Also consider very long sentences: the model needs to keep track of what each token refers to in its context. RNNs process a sequence one token at a time, so relating distant tokens takes many sequential steps. The transformer architecture attends to all tokens in parallel, which handles this far more efficiently.

But before going into the GPT architecture, it is important to understand why we would use a GPT model instead of some other transformer variant or a custom-trained model of our own, because GPT models come at a very high cost (most probably we have to go through a third-party service such as MS Azure to use them).

### Limits of original Transformer architecture

Most of the problems with transformers come from their memory requirements, which in turn drive up the computational power needed. We will look at this issue with the help of a visualization tool.

We will use a tool called `BertViz` for that. Check out [their repo](https://github.com/jessevig/bertviz). 
<center>

`pip install bertviz`
</center>




In [2]:
from bertviz import head_view
from transformers import BertTokenizer, BertModel

Now let's process a sample input so that BertViz can display it.

In [4]:
model_type = 'bert-base-uncased'

model = BertModel.from_pretrained(model_type, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_type, do_lower_case=True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
sentence_a = "The cat sleeps on the mat"
sentence_b = "Le chat dors sur le tapis"
inputs = tokenizer.encode_plus(sentence_a, sentence_b,
                               return_tensors='pt',
                               add_special_tokens=True)

print(inputs)

{'input_ids': tensor([[  101,  1996,  4937, 25126,  2006,  1996, 13523,   102,  3393, 11834,
          2079,  2869,  7505,  3393, 11112,  2483,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [7]:
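token_type_ids = inputs['token_type_ids']
input_ids = inputs['input_ids']
# With output_attentions=True, the last element of the model output holds the attention tensors (one per layer)
attention = model(input_ids, token_type_ids=token_type_ids)[-1]
input_id_list = input_ids[0].tolist()  # take the sequence at batch index 0 (we only have one)
tokens = tokenizer.convert_ids_to_tokens(input_id_list)  # map ids back to readable tokens for the plot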

In [8]:
head_view(attention, tokens)


The visualization rendered by `head_view` is an interactive view of what the attention heads are doing in each layer. Looking at the connections between tokens, we can see that attention takes every possible pair of tokens into consideration. When the number of tokens in the context is large, the number of pairs grows quadratically, which means high computational power and large memory requirements. But there are cases where we cannot avoid such long sequences, for example music generation tasks. So it is important to know a few architectures and techniques that reduce these issues before making a substantial resource investment.
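To get a feel for how quickly the number of pairs grows, here is a rough back-of-the-envelope sketch. The helper names are made up for this note, and the memory figure assumes 12 heads (as in `bert-base-uncased`), 32-bit floats, and a batch size of 1; it only counts the attention score matrices of a single layer, nothing else.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of (query, key) pairs that full self-attention scores in one head."""
    return seq_len * seq_len


def attention_matrix_mib(seq_len: int, n_heads: int = 12, bytes_per_value: int = 4) -> float:
    """Approximate size in MiB of the attention score matrices of one layer (batch size 1)."""
    return attention_pairs(seq_len) * n_heads * bytes_per_value / 2**20


# 17 is the length of the tokenized sentence pair used above
for n in (17, 512, 4096, 65536):
    print(f"{n:>6} tokens -> {attention_pairs(n):>13,} pairs, "
          f"~{attention_matrix_mib(n):,.1f} MiB per layer")
```

Going from 512 to 4096 tokens is 8 times more tokens but 64 times more pairs, which is exactly the problem the techniques below try to address.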

### The Reformer

The Reformer is an approach to solving the attention and memory usage issues mentioned above by adding additional mechanisms on top of the original transformer model.

In this approach the researchers use a technique called Locality Sensitive Hashing (LSH) for bucketing and chunking. Over a long sequence, a hashing function places closely related vectors into the same bucket, so similar vectors end up arranged in nearby chunks. Attention is then calculated within these rearranged chunks rather than over every pair, which lowers the memory and computational requirements.
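The bucketing and chunking idea can be sketched in plain Python. This is a toy illustration, not the Reformer implementation: it assumes an angular LSH built from a random projection (the bucket is the argmax over the projected vector and its negation), and all function names, the toy vectors, and the chunk size are made up for this note.

```python
import random


def make_projection(dim, n_buckets, seed=0):
    """Random Gaussian matrix of shape (dim, n_buckets // 2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_buckets // 2)] for _ in range(dim)]


def lsh_bucket(vec, projection):
    """Angular LSH: project the vector, then take the argmax over [xR, -xR]."""
    half = len(projection[0])
    scores = [sum(x * row[j] for x, row in zip(vec, projection)) for j in range(half)]
    scores += [-s for s in scores]
    return max(range(len(scores)), key=scores.__getitem__)


def bucket_and_chunk(vectors, projection, chunk_size):
    """Sort positions by bucket id, then cut the sorted order into fixed-size chunks."""
    buckets = [lsh_bucket(v, projection) for v in vectors]
    order = sorted(range(len(vectors)), key=lambda i: (buckets[i], i))
    chunks = [order[i:i + chunk_size] for i in range(0, len(order), chunk_size)]
    return buckets, chunks


# Toy 2-d "token vectors": some point roughly the same way, some the opposite way
vectors = [[1.0, 0.1], [-1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [-0.8, -0.1], [0.1, 0.9]]
projection = make_projection(dim=2, n_buckets=4)
buckets, chunks = bucket_and_chunk(vectors, projection, chunk_size=2)
print("bucket per position:", buckets)
print("chunks of positions:", chunks)
```

Attention would then only be computed inside each chunk (the real model also looks at a neighbouring chunk), so the cost depends on the chunk size instead of the full sequence length.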

Beyond that, the Reformer uses some other techniques, such as reversible residual layers, to reduce the overall memory usage as well. Read more about this in the [Google Blog post](https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html).


### Pattern Exploiting Training (PET)

This technique was introduced by a research team that challenged the large models by saying 'It's not just size that matters: small language models are also few-shot learners'. In fact, that is the title of their paper on PET. They trained a 223 million parameter model that surpassed GPT-3, with its 175 billion parameters, on the SuperGLUE benchmark, using just a single GPU with 11GB of VRAM. LOL!

The idea behind PET is to reformulate the training task as cloze questions (fill-in-the-blank style questions) to optimize the training process. This suits transformers like BERT well, since they are pre-trained by masking random tokens. It has a cool mechanism: you define patterns that turn the task input into a sentence with a masked slot, and a dictionary called a verbalizer that returns a token for each label value it gets.
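A pattern and a verbalizer can be sketched with nothing but a function and a dictionary. This is only an illustration of the concept, not the API of the PET repository: the sentiment task, the wording of the pattern, the label names, and every identifier below are assumptions made for this note.

```python
MASK = "[MASK]"  # placeholder for the token the language model has to fill in


def pattern(review: str) -> str:
    """Pattern: wrap the task input in a cloze (fill-in-the-blank) sentence."""
    return f"{review} All in all, it was {MASK}."


# Verbalizer: maps each task label to the token that should fill the blank
VERBALIZER = {"positive": "great", "negative": "terrible"}


def to_cloze_example(review: str, label: str):
    """Turn a labelled example into (cloze text, expected token for the mask)."""
    return pattern(review), VERBALIZER[label]


def label_from_token(predicted_token: str):
    """Map the token predicted for the mask back to a task label (None if unknown)."""
    for label, token in VERBALIZER.items():
        if token == predicted_token:
            return label
    return None


text, target = to_cloze_example("Best pizza ever.", "positive")
print(text)
print(target)
print(label_from_token("terrible"))
```

The classification task has now become "predict the masked word", which is the same kind of task a masked language model was pre-trained on.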

[Source code for PET](https://github.com/timoschick/pet)

***


## **GPT Models**

The reasoning behind this architecture was simple: the OpenAI researchers who built the GPT models wanted to create a task-agnostic model. Getting to that level took 4 major phases.

1. **Fine Tuning**: In this phase models were pre-trained on a large corpus and then fine-tuned on the downstream task using more specialized methods. This is how the initial transformers worked.

2. **Few Shot**: Rather than fine-tuning the model, here the model is presented with a few demonstrations of the task it needs to perform. Instead of updating its weights, the model conditions its answer on this input.

3. **One Shot**: This is a further step from few-shot learning. Instead of several demonstrations of the task, the model gets only one. No weight updates are required here either.

4. **Zero Shot**: This phase represents the ultimate goal of NLP models: performing downstream tasks without any fine-tuning or demonstrations.
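In practice, the difference between the few-shot, one-shot and zero-shot settings comes down to how many demonstrations are placed in the prompt. Here is a small sketch of that; the prompt layout with `=>`, the translation task, and the `build_prompt` helper are assumptions chosen for illustration, not a required format.

```python
def build_prompt(task_description, demonstrations, query):
    """Build a prompt: task description, then zero or more demonstrations, then the query."""
    lines = [task_description]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is expected to continue from here
    return "\n".join(lines)


task = "Translate English to French:"
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage"), ("cat", "chat")]

zero_shot = build_prompt(task, [], "mat")        # no demonstrations
one_shot = build_prompt(task, demos[:1], "mat")  # exactly one demonstration
few_shot = build_prompt(task, demos, "mat")      # several demonstrations

print(few_shot)
```

In all three cases the model weights stay untouched; only the text the model conditions on changes.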


> Note: most of the explanations above may not be accurate enough, or may even be misleading. So it would be wise to take note of the topic and read about it separately. These notebooks are for quick reference only.

So now let's move on to the GPT architecture. GPT is a decoder-only architecture. Its attention layers are similar to the attention layers of the original transformer model. The only differences are in how they are used and in the number of layers stacked (GPT-3 has 96 layers with vectors of size 12288 and 96 attention heads per layer).
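As a sanity check on those numbers, the parameter count can be roughly estimated from the layer count and the vector size. This sketch assumes the usual transformer block layout: four `d_model x d_model` matrices for attention (query, key, value, output) and a feed-forward block with a hidden size of `4 * d_model`. It ignores biases, layer norms and, by default, the embeddings; the function name is made up for this note.

```python
def approx_decoder_params(n_layers: int, d_model: int, vocab_size: int = 0) -> int:
    """Rough parameter count for a stack of decoder blocks (biases and layer norms ignored)."""
    attention = 4 * d_model**2     # query, key, value and output projections
    feed_forward = 8 * d_model**2  # d_model -> 4*d_model -> d_model
    return n_layers * (attention + feed_forward) + vocab_size * d_model


n_layers, d_model, n_heads = 96, 12288, 96
head_dim = d_model // n_heads  # size of the slice each attention head works on
total = approx_decoder_params(n_layers, d_model)

print(f"head dimension: {head_dim}")
print(f"approx. parameters: {total / 1e9:.1f} billion")
```

The estimate lands at roughly 174 billion, close to the 175 billion parameters usually quoted for GPT-3.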

***
## Text completion with GPT2

