In [9]:
from transformers import pipeline, set_seed, GPT2Tokenizer, GPT2LMHeadModel
from torch import tensor, numel
from bertviz import model_view

set_seed(42)

## Tokenization process for GPT

In [13]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

'Niket' in tokenizer.get_vocab()

False

In [4]:
input_seq = "I am Niket Girdhar"
tokenizer

GPT2Tokenizer(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
)

In [5]:
tokenizer(input_seq)['input_ids']

[40, 716, 11271, 316, 402, 1447, 9869]

In [6]:
tokenizer(" "+input_seq)['input_ids']

[314, 716, 11271, 316, 402, 1447, 9869]

Prepending a space changes only the first token id (40 becomes 314); the remaining ids are unchanged.

In [7]:
tokenizer.convert_ids_to_tokens(tokenizer.encode(input_seq))

['I', 'Ġam', 'ĠNik', 'et', 'ĠG', 'ird', 'har']

In [8]:
tokenizer.convert_ids_to_tokens(tokenizer.encode(" "+input_seq))

['ĠI', 'Ġam', 'ĠNik', 'et', 'ĠG', 'ird', 'har']

In [15]:
encoded = tokenizer.encode(input_seq, return_tensors='pt')

encoded

tensor([[   40,   716, 11271,   316,   402,  1447,  9869]])

The reason is that GPT-2's tokenizer treats a leading space as part of the token itself, so `I` and ` I` are separate vocabulary entries with different ids.

The character Ġ marks a token that begins with a space.
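Where Ġ comes from can be shown without loading the tokenizer. GPT-2's byte-level BPE first maps every byte to a printable Unicode character, and the space byte (32) lands on U+0120, which is Ġ. The snippet below is a minimal re-creation of that byte-to-character table for illustration only; the helper name `visible` is hypothetical and not part of `transformers`.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (space, newline, ...) are moved past U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()


def visible(text):
    """Hypothetical helper: render text the way byte-level BPE tokens display it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(visible("I"))    # I
print(visible(" I"))   # ĠI
print(visible(" am"))  # Ġam
```

This only reproduces how spaces are displayed; the actual split into tokens such as `ĠNik` and `et` is decided by the learned BPE merges, which are not modeled here.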

# Understanding the GPT model

In [12]:
generator = pipeline('text-generation', model='gpt2')

generator("Hello, I am Niket Girdhar and I", max_length = 50, truncation=True, num_return_sequences=3)

Device set to use mps:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Hello, I am Niket Girdhar and I need help to develop the first version of the Ethereum network.\n\nWe are working on making Ethereum blockchain decentralized by using the new technology on the Ethereum Blockchain as defined in our blog post. However'},
 {'generated_text': 'Hello, I am Niket Girdhar and I want you to join us on a cruise at the seaside beach named "Zemina".\n\nThere is only one reason for what we are about to do. We want to experience a'},
 {'generated_text': 'Hello, I am Niket Girdhar and I am working on a new video to share to social media by the late Mr Aida. It will go live on March 17: a short while from now, and it will include a preview of'}]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation:
    - GPT-2 has no dedicated padding token, so during generation the end-of-sequence token `<|endoftext|>` (id 50256) is used as the pad token whenever padding is needed. This is an informational message, not an error.
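To make the message concrete: padding fills shorter sequences in a batch up to a common length, and here the fill value is the EOS id 50256. The sketch below shows that idea in plain Python; `pad_batch` is a hypothetical helper written for illustration, not a `transformers` function, and it pads on the right to match the `padding_side='right'` shown in the tokenizer output above.

```python
EOS_TOKEN_ID = 50256  # id of <|endoftext|> in the GPT-2 vocabulary


def pad_batch(sequences, pad_token_id=EOS_TOKEN_ID):
    """Right-pad lists of token ids to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


batch = [
    [40, 716, 11271, 316, 402, 1447, 9869],  # "I am Niket Girdhar"
    [40, 716],                               # "I am"
]
ids, mask = pad_batch(batch)
print(ids[1])   # [40, 716, 50256, 50256, 50256, 50256, 50256]
print(mask[1])  # [1, 1, 0, 0, 0, 0, 0]
```

Because the pad id and the EOS id are the same number, the attention mask is what tells real tokens apart from padding.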