# GPT-2 MODEL

1. Architecture:
GPT-2 is a decoder-only transformer: a stack of identical blocks, each combining masked self-attention with a position-wise feed-forward network, with layer normalization and residual connections around both. Self-attention is what lets the model capture contextual relationships between tokens.

2. Pre-training:
GPT-2 is first pre-trained on a large corpus of web text. The training objective is next-token prediction: given the preceding tokens, the model learns to assign high probability to the token that actually follows. Because the labels come from the text itself, no manual annotation is needed, and the model picks up contextual patterns as a by-product of this objective.
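The next-token objective can be sketched as an average cross-entropy over positions. This is a minimal illustration in plain Python, not GPT-2's training code; the function names and the toy logits are assumptions made for the example.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits_per_position[t] holds the scores over the vocabulary
    after the model has read token_ids[: t + 1].
    """
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]  # the label is simply the next token
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy example (hypothetical numbers): vocabulary of 3 tokens, sequence of 3 tokens.
toy_tokens = [0, 2, 1]
toy_logits = [
    [0.0, 0.0, 0.0],  # after token 0: uniform guess over the vocabulary
    [0.0, 5.0, 0.0],  # after tokens 0, 2: confident that token 1 comes next
    [0.0, 0.0, 0.0],  # last position has no target and is ignored
]
loss = next_token_loss(toy_logits, toy_tokens)
print(round(loss, 4))
```

The uniform guess at the first position is penalized heavily, while the confident and correct guess at the second position contributes almost nothing to the loss.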

3. Attention Mechanism:
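GPT-2's self-attention lets each token weigh the importance of the tokens that precede it. The attention is causal (masked): a position can attend to itself and to earlier positions but never to later ones, which is what makes left-to-right generation possible. Within that window, the model can relate tokens regardless of how far apart they are.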
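A single attention head with a causal mask can be sketched in plain Python as follows. This is a simplified illustration: in GPT-2 the queries, keys, and values come from learned projections and there are many heads per layer, whereas here the same hypothetical vectors are reused for all three.

```python
import math

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Position i may attend only to positions 0..i, so information never
    flows from later tokens to earlier ones.
    """
    d = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Scores against the current and earlier positions only (the mask).
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the visible value vectors.
        out = [sum(w * values[j][k] for j, w in enumerate(weights))
               for k in range(len(values[0]))]
        # Pad with zeros so each row also shows the masked (future) positions.
        all_weights.append(weights + [0.0] * (len(queries) - i - 1))
        outputs.append(out)
    return outputs, all_weights

# Hypothetical 3-token sequence with 2-dimensional vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_out, attn_weights = causal_self_attention(x, x, x)
for row in attn_weights:
    print([round(w, 3) for w in row])
```

The printed weight matrix is lower triangular: the first token can only attend to itself, and every row sums to 1.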

4. Fine-tuning and Adaptation:
After pre-training, GPT-2 can be fine-tuned on specific tasks by providing task-specific training data. This process adapts the model to perform well on tasks like text completion, question answering, or text generation.

5. Text Generation:
To generate text, GPT-2 takes an initial input (the prompt) and repeatedly predicts a probability distribution over the next token, conditioned on the prompt and on the tokens generated so far. A decoding strategy then picks one token from that distribution (for example by beam search or by sampling), appends it to the sequence, and the loop repeats until an end-of-text token or a length limit is reached. How coherent the result is depends heavily on the decoding strategy.
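The generation loop described above can be sketched with a stand-in for the model. Everything here is hypothetical: `toy_next_token_probs` is a fixed rule over a 4-token vocabulary, not a real language model, and `EOS = 0` is an arbitrary choice for the example.

```python
import random

EOS = 0  # hypothetical end-of-sequence token id

def toy_next_token_probs(context):
    """Stand-in for the model: a fixed rule over a 4-token vocabulary."""
    if context[-1] == 3:
        return [1.0, 0.0, 0.0, 0.0]  # always stop after token 3
    return [0.1, 0.3, 0.3, 0.3]      # otherwise a fixed distribution

def generate(prompt, max_length, rng):
    """Append sampled tokens until EOS or until max_length total tokens."""
    tokens = list(prompt)
    while len(tokens) < max_length:
        probs = toy_next_token_probs(tokens)
        # Draw one token id in proportion to its probability.
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens

sample = generate(prompt=[1, 2], max_length=10, rng=random.Random(0))
print(sample)
```

As in `model.generate`, `max_length` here counts the prompt tokens as well as the generated ones.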

In [1]:
!pip install transformers



In [3]:
import pandas as pd
import numpy as np

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


In [4]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

device

'cuda'

In [6]:
model_name = 'gpt2-large'

tokenizer = AutoTokenizer.from_pretrained(model_name)


In [8]:
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)


In [9]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1280)
    (wpe): Embedding(1024, 1280)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-35): 36 x GPT2Block(
        (ln_1): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1280, out_features=50257, bias=False)
)
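The sizes in this printout are enough to estimate the parameter count by hand. The sketch below makes two assumptions that the printout does not show: the MLP hidden size is 4 times the embedding size (the `Conv1D()` entries print no shapes), and `lm_head` shares its weights with `wte`, so it adds no parameters of its own. The function name is made up for this example.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer, mlp_ratio=4):
    """Count parameters from the layer shapes, with lm_head tied to wte."""
    embeddings = vocab_size * n_embd + n_positions * n_embd   # wte + wpe
    layer_norm = 2 * n_embd                                   # weight + bias
    c_attn = n_embd * 3 * n_embd + 3 * n_embd                 # fused q, k, v projection
    attn_proj = n_embd * n_embd + n_embd                      # attention output projection
    hidden = mlp_ratio * n_embd                               # assumed MLP width
    c_fc = n_embd * hidden + hidden
    mlp_proj = hidden * n_embd + n_embd
    block = 2 * layer_norm + c_attn + attn_proj + c_fc + mlp_proj
    return embeddings + n_layer * block + layer_norm          # final term is ln_f

total = gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=1280, n_layer=36)
print(f"{total:,}")  # 774,030,080
```

Under these assumptions the total is about 774 million, and the 36 transformer blocks account for roughly 708 million of them.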

### Tokenization

In [26]:
input_text = 'i am really impressed'
max_length = 128

input_ids = tokenizer(input_text, return_tensors='pt')

In [27]:
input_ids

{'input_ids': tensor([[   72,   716,  1107, 12617]]), 'attention_mask': tensor([[1, 1, 1, 1]])}

In [28]:
input_ids = input_ids["input_ids"].to(device)

In [29]:
input_ids

tensor([[   72,   716,  1107, 12617]], device='cuda:0')

### Beam search for text generation

In [30]:
# num_beams=5 keeps the 5 highest-scoring candidate sequences at each step (beam search).
# do_sample=False makes decoding deterministic: no token is drawn at random.
output = model.generate(input_ids, max_length=max_length, num_beams=5, do_sample=False)
output

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


tensor([[   72,   716,  1107, 12617,   351,   262,  3081,   286,   262,  1720,
           290,   262,  6491,  2139,    13,   198,   198, 15322,   642,   503,
           286,   642,   416, 19200,   422,  3878,  1720,     0,   314,   423,
           587,  1262,   428,  1720,   329,   625,   257,   614,   783,   290,
           314,   423,  1239,   550,   257,  1917,   351,   340,    13,   314,
           423,   587,  1262,   428,  1720,   329,   625,   257,   614,   783,
           290,   314,   423,  1239,   550,   257,  1917,   351,   340,    13,
           314,   423,   587,  1262,   428,  1720,   329,   625,   257,   614,
           783,   290,   314,   423,  1239,   550,   257,  1917,   351,   340,
            13,   314,   423,   587,  1262,   428,  1720,   329,   625,   257,
           614,   783,   290,   314,   423,  1239,   550,   257,  1917,   351,
           340,    13,   314,   423,   587,  1262,   428,  1720,   329,   625,
           257,   614,   783,   290,   314,   423,  

In [32]:
tokenizer.decode(output[0])

'i am really impressed with the quality of the product and the customer service.\n\nRated 5 out of 5 by Anonymous from Great product! I have been using this product for over a year now and I have never had a problem with it. I have been using this product for over a year now and I have never had a problem with it. I have been using this product for over a year now and I have never had a problem with it. I have been using this product for over a year now and I have never had a problem with it. I have been using this product for over a year now and I have never had'
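The idea behind beam search can be sketched on a toy model. This is a simplified illustration, not the `transformers` implementation: it ignores end-of-sequence handling and length penalties, and `toy_next_probs` is a hypothetical 2-token model chosen so that greedy decoding and beam search disagree.

```python
import math

def toy_next_probs(prefix):
    """Hypothetical model over a 2-token vocabulary, keyed on the last token."""
    if not prefix:
        return [0.6, 0.4]
    return [0.55, 0.45] if prefix[-1] == 0 else [0.9, 0.1]

def beam_search(num_beams, steps):
    """Keep the num_beams highest-scoring prefixes at every step."""
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, p in enumerate(toy_next_probs(prefix)):
                candidates.append((prefix + (token,), score + math.log(p)))
        # Prune: sort by score and keep only the best num_beams prefixes.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]

greedy_seq, greedy_score = beam_search(num_beams=1, steps=2)
beam_seq, beam_score = beam_search(num_beams=2, steps=2)
print(greedy_seq, round(math.exp(greedy_score), 2))  # (0, 0) 0.33
print(beam_seq, round(math.exp(beam_score), 2))      # (1, 0) 0.36
```

With one beam the search commits to the locally best first token and ends with probability 0.33. With two beams it keeps the second-best first token alive and finds a sequence with probability 0.36.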

In [35]:
len(tokenizer.decode(output[0]).split()) # counts whitespace-separated words, not tokens; max_length=128 limits the number of tokens

120

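### Avoiding repetition with `no_repeat_ngram_size`

The beam search output above repeats the same sentence over and over. Setting `no_repeat_ngram_size=2` forbids any sequence of 2 consecutive tokens from appearing more than once in the output.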

In [36]:
output = model.generate(input_ids, max_length=max_length, num_beams=5, do_sample=False, no_repeat_ngram_size=2)

output



tensor([[   72,   716,  1107, 12617,   351,   262,  3081,   286,   262,  1720,
           290,   262,  6491,  2139,    13,   198,   198, 15322,   642,   503,
           286,   642,   416, 19200,   422,  3878,  1720,     0,   314,   423,
           587,  1262,   428,  1720,   329,   625,   257,   614,   783,   290,
           314,   716,   845, 10607,   351,   340,    13,   632,   318,   845,
          2562,   284,   779,   290,  2499,  1049,    13,   314,   561,  4313,
           340,   284,  2687,   508,   318,  2045,   329,   257,  1720,   326,
           481,   938,   257,   890,   640,    13, 50256]], device='cuda:0')

In [38]:
print(tokenizer.decode(output[0]))

i am really impressed with the quality of the product and the customer service.

Rated 5 out of 5 by Anonymous from Great product! I have been using this product for over a year now and I am very pleased with it. It is very easy to use and works great. I would recommend it to anyone who is looking for a product that will last a long time.<|endoftext|>
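The effect of `no_repeat_ngram_size` can be sketched as a rule that bans certain next tokens. This is an illustrative reimplementation in plain Python, not the library's code; the function name `banned_next_tokens` is made up for this example.

```python
def banned_next_tokens(tokens, n):
    """Return the token ids that would complete an n-gram already in `tokens`."""
    if n <= 0 or len(tokens) < n:
        return set()
    # The last n-1 tokens form the start of the next potential n-gram.
    prefix = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])  # emitting this token would repeat the n-gram
    return banned

# Token ids taken from the outputs above: 314, 423, 587, 1262, then 314 again.
history = [314, 423, 587, 1262, 314]
print(banned_next_tokens(history, n=2))  # {423}
```

This is consistent with the two token tensors above: once the bigram `314, 423` has been generated, the run with `no_repeat_ngram_size=2` continues a later `314` with `716` instead of `423`.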


### Top-p (nucleus) sampling
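Top-p sampling restricts each random draw to the "nucleus": the smallest set of most probable tokens whose cumulative probability reaches `top_p`. The sketch below shows that filtering step on a toy distribution. It is a plain-Python illustration with made-up function names, not the library's implementation.

```python
import random

def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize over the kept tokens; all other tokens get probability 0.
    return {i: probs[i] / total for i in kept}

def sample_top_p(probs, top_p, rng):
    """Draw one token id from the renormalized nucleus."""
    nucleus = top_p_filter(probs, top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
nucleus = top_p_filter(probs, top_p=0.9)
print(sorted(nucleus))  # [0, 1, 2]
```

With `top_p=0.9`, the first two tokens cover only 0.8 of the probability mass, so the third is added and the 0.05 tail token is never sampled.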

In [39]:
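# do_sample=True draws each token at random. top_p=0.9 restricts the draw to the smallest
# set of most probable tokens whose cumulative probability reaches 0.9 (nucleus sampling).
output_prob = model.generate(input_ids, max_length=max_length, do_sample=True, top_p=0.9)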

output_prob



tensor([[   72,   716,  1107, 12617,   351,   262,   835,   777,  3730,   389,
          1762,   351,   511,  1242,  1074,   526,   198,   198,     1,  1026,
           338,   845,  1593,   284,   514,   326,   340,  3073,   845,  3748,
            11,   290,   523,   356,   821,  1804,   257,  1256,   286,  2267,
           319,   703,   284,   466,   326,   526,   198,   198,   464,  1074,
           481,   779,   257,   513,    35, 21976,  1080,   284, 32049,   262,
          1242,  3918,   290,  3435,   284,  2251,   257,   649,  3918,   329,
           262,   983,    13,   198,   198,     1,  1026,   338,   477,   546,
          4441,   777, 11621,   326,  1254,   845,  7310,   290,  1180,   422,
          2279,  2073,   356,  1053,  1760,   553,  1139,  4746,    13,   366,
          1026,   338,   407,   655,  1016,   284,   804,   262,   976,    11,
           475,   804,  3748,    13,  1320,   338,   262,  1517,   356,   821,
           749,  6568,   546,   526,   198,   198,  

In [40]:
print(tokenizer.decode(output_prob[0]))

i am really impressed with the way these guys are working with their art team."

"It's very important to us that it looks very unique, and so we're doing a lot of research on how to do that."

The team will use a 3D scanning system to recreate the art style and characters to create a new style for the game.

"It's all about creating these worlds that feel very distinct and different from everything else we've done," says Scott. "It's not just going to look the same, but look unique. That's the thing we're most excited about."

"It


In [44]:
input_text = 'hello guys welcome to india'
max_length = 128

input_ids = tokenizer(input_text, return_tensors='pt')

input_ids = input_ids["input_ids"].to(device)

# Same nucleus sampling settings as before: sample from the top 0.9 of the probability mass.
output_prob = model.generate(input_ids, max_length=max_length, do_sample=True, top_p=0.9)



In [46]:
tokenizer.decode(output_prob[0])

'hello guys welcome to india i dont know what i will do here. there are few places i know of, where i will have some time to do this. but my family is from china, i cant travel there anymore, and my friends from india will be staying with me for a while. what are the places i have been visiting for this period of time. can anyone please tell me about it.\n\n\nthanks\n\nReply · Report Post<|endoftext|>'

In [47]:
print(tokenizer.decode(output_prob[0]))

hello guys welcome to india i dont know what i will do here. there are few places i know of, where i will have some time to do this. but my family is from china, i cant travel there anymore, and my friends from india will be staying with me for a while. what are the places i have been visiting for this period of time. can anyone please tell me about it.


thanks

Reply · Report Post<|endoftext|>
