In [1]:
!pip install transformers #core dependency



In [2]:
import torch #GPT2LMHeadModel is a PyTorch model and the tensors below are PyTorch tensors
from transformers import GPT2LMHeadModel, GPT2Tokenizer #tokenizer encodes input words to numbers --> model extends them into a longer sequence of numbers --> tokenizer decodes that sequence back to text

More about GPT-2 and its dependencies: https://huggingface.co/docs/transformers/en/model_doc/gpt2

#01. Load Model

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large") #load the tokenizer for GPT-2 large (a bigger checkpoint that can generate large blocks of text) #if a memory error occurs, use "gpt2" instead of "gpt2-large"
model = GPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id) #pad_token_id - which token is used to pad our text


In [None]:
tokenizer.eos_token_id #the token that is going to pad our text is token 50256

50256

In [None]:
tokenizer.decode(tokenizer.eos_token_id)

'<|endoftext|>'

**pad_token_id=tokenizer.eos_token_id** makes the EOS token do double duty: it marks the **end of a sequence** and it also **pads shorter sequences** to a common length.

GPT-2 does not define a dedicated padding token, so reusing the EOS token is a common configuration. Padding is needed when sequences of different lengths are processed together in a batch: shorter sequences are extended to match the length of the longest sequence in the batch.

**tokenizer.eos_token_id**:  
tokenizer.eos_token_id retrieves the token ID assigned to the end-of-sequence (EOS) token by the GPT-2 tokenizer.
In GPT-2 and many other language models, the EOS token is used to denote the end of a sequence or a piece of text.

**pad_token_id=tokenizer.eos_token_id**:
The pad_token_id argument specifies the token ID that should be treated as the padding token.
Here it is set to the EOS token ID, so the model treats the EOS token as padding for sequences that are shorter than the longest sequence in the batch.
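To make the padding idea concrete, here is a minimal sketch in plain Python. It is not how the transformers library implements padding; `pad_batch` is a hypothetical helper, it pads on the right for simplicity, and the attention mask (1 for real tokens, 0 for padding) is added as an assumption about how the padded positions would be ignored.

```python
EOS_ID = 50256  # value returned by tokenizer.eos_token_id above

def pad_batch(sequences, pad_id=EOS_ID):
    """Pad every sequence on the right to the length of the longest one.

    Hypothetical helper for illustration only.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

batch = [[24564, 4892, 546], [262, 1310]]  # two sequences of different length
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

Because the pad ID and the EOS ID are the same number, the mask is what distinguishes a padding position from a genuine end-of-text token.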

#02. Tokenize Sentence

In [4]:
sentence = 'Describe about the little island in the middle of indian ocean known as Sri Lanka'
input_ids = tokenizer.encode(sentence, return_tensors='pt') #return tensors as pytorch tensors

In [5]:
input_ids

tensor([[24564,  4892,   546,   262,  1310,  7022,   287,   262,  3504,   286,
           773,   666,  9151,  1900,   355, 20872, 28143]])

In [6]:
tokenizer.decode(input_ids[0][2])

' about'
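The encode/decode round trip can be sketched with a toy tokenizer. Everything here is an assumption made for illustration: `TOY_VOCAB`, `encode` and `decode` are hypothetical, and the vocabulary has five whole-word entries. The real GPT-2 tokenizer uses byte-level BPE with a much larger vocabulary and can split one word into several pieces, which is why the 15-word prompt above became 17 IDs. The sketch only mirrors one visible behaviour: a token carries its leading space, as in `' about'`.

```python
import re

# Hypothetical toy vocabulary; real GPT-2 IDs are different
TOY_VOCAB = {"Describe": 0, " about": 1, " the": 2, " little": 3, " island": 4}
ID_TO_TOKEN = {i: tok for tok, i in TOY_VOCAB.items()}

def encode(text):
    # Split into words, keeping the space in front of each word as part of the token
    pieces = re.findall(r" ?\S+", text)
    return [TOY_VOCAB[piece] for piece in pieces]

def decode(ids):
    # Concatenating the tokens restores the original text, spaces included
    return "".join(ID_TO_TOKEN[i] for i in ids)

toy_ids = encode("Describe about the little island")
print(toy_ids)
print(repr(decode([toy_ids[1]])))
print(decode(toy_ids))
```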

#03. Generate and Decode Text

In [7]:
output = model.generate(input_ids, max_length=200, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
#max_length=200 - upper limit on the total number of tokens (prompt + generated text)
#num_beams=5 - beam search: keep the 5 most probable candidate sequences at each step instead of only the single best next word
#no_repeat_ngram_size=2 - no pair of consecutive tokens (2-gram) may appear twice in the output, which stops sequences repeating over & over
#early_stopping=True - beam search stops as soon as num_beams finished candidates have been found
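The effect of no_repeat_ngram_size can be sketched in plain Python. This is a simplified illustration of the idea, not the library's implementation, and `banned_next_tokens` is a hypothetical helper: given the tokens generated so far, it returns the tokens that would complete an n-gram that has already appeared and therefore must not be chosen next.

```python
def banned_next_tokens(generated, n):
    """Return the set of tokens that would repeat an existing n-gram.

    Hypothetical helper for illustration only.
    """
    if n <= 0 or len(generated) < n - 1:
        return set()
    # The last n-1 tokens form the start of the n-gram about to be completed
    prefix = tuple(generated[len(generated) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = generated[i:i + n]
        # An earlier n-gram with the same start bans its final token
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned

# Token 5 was followed by 7 earlier, so after the second 5 the pair (5, 7) is banned
banned = banned_next_tokens([5, 7, 9, 5], n=2)
print(banned)
```

During generation, the scores of the banned tokens would be pushed to negative infinity before the next token is picked, so every beam has to continue differently.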

In [8]:
output

tensor([[24564,  4892,   546,   262,  1310,  7022,   287,   262,  3504,   286,
           773,   666,  9151,  1900,   355, 20872, 28143,    13,   198,   198,
            50,   380, 28143,   318,   530,   286,   262,   749,  4950, 14807,
           319,  4534,    13,   632,   318,   257,  1402,  1499,   351,   257,
          3265,   286,   546,   352,    13,    20,  1510,   661,    13,   383,
          1499,   318,  5140,   287,  2520,  3687,  7229,   290,   468,   257,
           890,  2106,   286,  3292,   351,  3794,    11,  2807,    11,  2869,
            11,  4969,    11, 16256,    11, 15336,    11, 12551,    11, 16952,
            11, 10836,    11,   290,   262, 13316,    13, 20872, 44824,   272,
           661,   389,  1900,   329,   511, 33362,   290, 33362,   318,   845,
          1593,   284,   606,    13,  1119,   389,   845, 10496,  4674,   661,
           290,   484,  1842,   284,  7062,   661,   422,   477,   625,   262,
           995,   284,  3187,   511,  1499,    13,  

In [9]:
print(tokenizer.decode(output[0], skip_special_tokens=True))

Describe about the little island in the middle of indian ocean known as Sri Lanka.

Sri Lanka is one of the most beautiful islands on earth. It is a small country with a population of about 1.5 million people. The country is located in South East Asia and has a long history of trade with India, China, Japan, Korea, Indonesia, Malaysia, Singapore, Thailand, Vietnam, and the Philippines. Sri Lankan people are known for their hospitality and hospitality is very important to them. They are very hospitable people and they love to welcome people from all over the world to visit their country. This is the reason why the country has become a popular tourist destination. There are many things to see and do on the island. Here are a few things you can do:


#04. Output results

In [10]:
text = tokenizer.decode(output[0], skip_special_tokens=True)

In [14]:
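file_path = '/content/drive/MyDrive/ColabNotebooks/blogsrilanka.txt' #assumes Google Drive is already mounted at /content/drive
with open(file_path, 'w', encoding='utf-8') as f: #explicit encoding so the text is written the same way on any platform
  f.write(text)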
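If the target folder might not exist yet, the save step could be wrapped in a small helper. This is a sketch under assumptions: `save_text` is a hypothetical name, and the demo writes to a temporary directory rather than Google Drive so that it runs anywhere.

```python
import tempfile
from pathlib import Path

def save_text(text, path):
    """Write text as UTF-8, creating missing parent folders first.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path

with tempfile.TemporaryDirectory() as tmp:
    saved = save_text("Sri Lanka is an island.", Path(tmp) / "blog" / "blogsrilanka.txt")
    round_trip = saved.read_text(encoding="utf-8")  # read back to confirm the write
    print(round_trip)
```

In the notebook, the same call would take `text` and `file_path` from the cells above.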