### **Using GPT-2 to generate new text**

- We will be accessing GPT-2 via the `transformers` library by Hugging Face, which provides pre-trained models and tools for natural language processing tasks.

In [2]:
!pip3 install transformers
!pip3 install torch



In [2]:
import transformers
print(transformers.__version__)

4.57.1


- Load a pre-trained GPT-2 model, wrapped in a text-generation pipeline:

In [3]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cuda:0


- Then, we can prompt the model with a text snippet and ask it to generate new text based on that input snippet.

In [5]:
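set_seed(125)
# Note: in this transformers version the pipeline's default max_new_tokens (256)
# takes precedence over max_length, so the output below is not capped at 20 tokens.
# Pass max_new_tokens=20 instead of max_length=20 to actually limit the length.
generator("Hey readers, today is",
          max_length=20,
          num_return_sequences=3)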

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': "Hey readers, today is the day to write your own story.\n\nWe're making a special edition of The American Dream, a book that's been in print for over seven years, and it's an amazing story of overcoming the challenges of being a parent to a three year old. It's about the life of a mother in America's most beautiful, multicultural nation.\n\nI am a single mom. My husband and I are single moms. We're single moms and we're single mothers. I don't know why, but I know we're all single moms. I always say that when you're single, you're going to feel like you're the only one.\n\nBut when you're single, you're not going to feel like you're the only one.\n\nI remember having to do the same thing when I was a 14 year old in high school. I knew I wanted to be an actor, but I didn't know I wanted to be a TV show writer. I knew I wanted to be a writer. I was scared that it would all be coming to a head. I wanted to be a writer in my life. I wanted to be a writer in the real wor

- We can also use a pre-trained transformer model as a feature extractor: its hidden states serve as input features for training other models.


- Here is how to use GPT-2 to generate features from an input text:

In [6]:
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

text = "Let us encode this sentence"
encoded_input = tokenizer(text, return_tensors='pt')
encoded_input

{'input_ids': tensor([[ 5756,   514, 37773,   428,  6827]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

- The tokenizer encoded the input sentence into the token format that GPT-2 expects.
- It mapped each piece of the string to an integer token ID and set the attention mask to all `1`s, meaning every token is attended to (there is no padding).
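
The attention mask only becomes interesting once several texts of different lengths are batched together. The sketch below is a toy, framework-free illustration of that idea, not the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, the shorter input is made up by truncating the ids shown above, and the padding id reuses GPT-2's end-of-text id (50256) that the generation log reports as the fallback `pad_token_id`.

```python
# Toy illustration only: real tokenizers handle padding internally.
# PAD_ID reuses GPT-2's end-of-text id (50256), as reported in the log above.
PAD_ID = 50256


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad token-id lists to equal length and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


full_sentence = [5756, 514, 37773, 428, 6827]  # ids from the output above
shorter_input = full_sentence[:2]               # hypothetical shorter text

batch = pad_batch([full_sentence, shorter_input])
print(batch["input_ids"])
print(batch["attention_mask"])
```

With a single input, as in the cell above, nothing needs padding, which is why the mask there consists only of `1`s.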

In [7]:
from transformers import GPT2Model
model = GPT2Model.from_pretrained('gpt2')
output = model(**encoded_input)
print(output)

BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=tensor([[[-0.0164,  0.0733, -0.1634,  ..., -0.1917,  0.0067, -0.1548],
         [ 0.0371,  0.0559,  0.6466,  ...,  0.3650,  0.0478, -0.1734],
         [-0.1920,  0.5216, -1.0211,  ...,  0.3032,  0.5130, -0.4956],
         [-0.0584,  0.1197, -0.9504,  ...,  0.3889,  0.4501, -0.3379],
         [-0.1347,  0.1385, -2.4609,  ...,  0.3176,  0.3789, -0.4099]]],
       grad_fn=<ViewBackward0>), past_key_values=DynamicCache(layers=[DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer]), hidden_states=None, attentions=None, cross_attentions=None)


- The `output` variable stores the last hidden state, that is, our GPT-2-based feature encoding of the input sentence:

In [8]:
output['last_hidden_state'].shape

torch.Size([1, 5, 768])

- `1` -> the batch size (we only have one input text)
- `5` -> the sequence length (the sentence was split into five tokens)
- `768` -> the hidden size: each token is represented by a 768-dimensional vector

- Now, we could apply this feature encoding to a given dataset and train a downstream classifier on the GPT-2-based feature representation.
- Another approach to using large pre-trained models is fine-tuning.
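
For the feature-based route, a downstream classifier usually needs one fixed-size vector per text rather than one vector per token. One common option, assumed here and not prescribed by the notebook, is mean pooling over the token axis while skipping padded positions. The sketch below shows the idea on plain nested lists with toy sizes; `mean_pool` is a hypothetical helper, and on the real `[1, 5, 768]` tensor the same operation would yield a single 768-dimensional vector.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, skipping masked (padding) positions.

    hidden_states:  [batch][tokens][hidden] nested lists of floats
    attention_mask: [batch][tokens] nested lists of 0/1
    Assumes every sequence has at least one unmasked token.
    """
    pooled = []
    for token_vectors, mask in zip(hidden_states, attention_mask):
        kept = [vec for vec, m in zip(token_vectors, mask) if m == 1]
        hidden_size = len(token_vectors[0])
        pooled.append(
            [sum(vec[d] for vec in kept) / len(kept) for d in range(hidden_size)]
        )
    return pooled


# Toy stand-in for output['last_hidden_state']:
# 1 sequence, 3 tokens (the last one is padding), hidden size 2
toy_hidden = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
toy_mask = [[1, 1, 0]]

features = mean_pool(toy_hidden, toy_mask)
print(features)  # [[2.0, 3.0]]
```

The resulting rows can then be stacked into a feature matrix and passed to any standard classifier.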

---

### **Bidirectional pre-training with BERT**

In [3]:
import torch
print(torch.__version__)

2.8.0+cu126


In [5]:
import torch
from transformers import pipeline

# Use a distinct name so the imported `pipeline` function is not shadowed.
fill_mask = pipeline(
    task="fill-mask",
    model="google-bert/bert-base-uncased",
    dtype=torch.float16,
    device=0
)
fill_mask("Plants create [MASK] through a process known as photosynthesis.")

Some weights of the model checkpoint at google-bert/bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


[{'score': 0.151123046875,
  'token': 2943,
  'token_str': 'energy',
  'sequence': 'plants create energy through a process known as photosynthesis.'},
 {'score': 0.1453857421875,
  'token': 4870,
  'token_str': 'flowers',
  'sequence': 'plants create flowers through a process known as photosynthesis.'},
 {'score': 0.0821533203125,
  'token': 9325,
  'token_str': 'sunlight',
  'sequence': 'plants create sunlight through a process known as photosynthesis.'},
 {'score': 0.04296875,
  'token': 18670,
  'token_str': 'algae',
  'sequence': 'plants create algae through a process known as photosynthesis.'},
 {'score': 0.037628173828125,
  'token': 12649,
  'token_str': 'atp',
  'sequence': 'plants create atp through a process known as photosynthesis.'}]
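
The pipeline returns a plain list of dictionaries, so the candidates can be filtered and ranked with ordinary Python. The following is a small sketch using the scores copied from the output above; `top_candidates` and the `min_score` threshold of 0.1 are hypothetical choices for illustration, not part of the `transformers` API.

```python
# Scores and tokens copied from the fill-mask output above.
predictions = [
    {"score": 0.151123046875, "token_str": "energy"},
    {"score": 0.1453857421875, "token_str": "flowers"},
    {"score": 0.0821533203125, "token_str": "sunlight"},
    {"score": 0.04296875, "token_str": "algae"},
    {"score": 0.037628173828125, "token_str": "atp"},
]


def top_candidates(preds, min_score=0.1):
    """Return (token, score) pairs scoring at least min_score, best first."""
    ranked = sorted(preds, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], p["score"]) for p in ranked if p["score"] >= min_score]


print(top_candidates(predictions))

# Share of the probability mass covered by the five returned candidates
covered = sum(p["score"] for p in predictions)
print(f"probability mass covered by the top 5: {covered:.3f}")
```

The five candidates together account for less than half of the probability mass, which suggests the model is fairly uncertain about this particular mask.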

---