### Code Explanation: Autoregressive Text Generation with GPT-2

This code demonstrates **step-by-step text generation** using the **GPT-2 model** from the Hugging Face `transformers` library.

The model generates text in an *autoregressive manner*, predicting **one token at a time** and appending it to the input.

```python
# Example Input Prompt:
"What is the capital of France?"
```

### Process Overview:
1. **Convert the input text into tokens** using the GPT-2 tokenizer.
2. **Feed the tokens into the GPT-2 model** to predict the next token.
3. **Append the predicted token** to the current sequence.
4. **Repeat until** the model predicts the **end-of-sequence (EOS)** token or reaches a specified token limit.
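
The four steps above can be sketched without downloading a model. The snippet below is a minimal toy, not GPT-2. The whitespace tokenizer, the `TOY_NEXT` lookup table, and the `<eos>` marker are all made up for illustration, standing in for the real tokenizer and the model's most-likely-token prediction.

```python
# Toy illustration of greedy autoregressive decoding (all names are hypothetical).
EOS = "<eos>"

# Stand-in for the model: maps the last token to the "most likely" next token.
TOY_NEXT = {
    "France?": "The",
    "The": "capital",
    "capital": "is",
    "is": "Paris.",
    "Paris.": EOS,
}

def toy_tokenize(text):
    """Step 1: split text into tokens (whitespace split, unlike GPT-2's BPE)."""
    return text.split()

def toy_next_token(tokens):
    """Step 2: predict the next token from the current sequence."""
    return TOY_NEXT.get(tokens[-1], EOS)

def generate(prompt, max_tokens=10):
    tokens = toy_tokenize(prompt)
    for _ in range(max_tokens):
        next_token = toy_next_token(tokens)
        if next_token == EOS:      # Step 4: stop at end-of-sequence
            break
        tokens.append(next_token)  # Step 3: append and feed the sequence back in
    return " ".join(tokens)

print(generate("What is the capital of France?"))
```

Lowering `max_tokens` cuts generation short before the end-of-sequence marker is reached, which is the second stopping condition in step 4.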



In [4]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the pretrained GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_prompt = "What is the capital of France?"             # Set the input prompt

input_ids = tokenizer.encode(input_prompt, return_tensors="pt")         # Tokenize the input prompt
print("Input Prompt:", input_prompt)

#### Autoregressive generation (token-by-token) ###
max_tokens = 10  # Limit the number of generated tokens
generated_ids = input_ids

print("\nStep-by-Step Generation:")
for step in range(max_tokens):
    with torch.no_grad():                   # Inference only, so skip gradient tracking
        outputs = model(generated_ids)      # Pass the current tokens into the model
    predictions = outputs.logits            # Logits of shape (batch_size, seq_len, vocab_size)

    next_token_id = torch.argmax(predictions[:, -1, :], dim=-1)         # Greedy choice: most likely next token

    generated_ids = torch.cat((generated_ids, next_token_id.unsqueeze(-1)), dim=1)      # Append the next token to the sequence

    current_output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)       # Decode and print the current sequence
    print(f"Step {step + 1}: {current_output}")
    print(f"size of input ndarray is {predictions.size()}")            # Note: this is the size of the logits tensor

    if next_token_id.item() == tokenizer.eos_token_id:      # Stop generation if EOS (end-of-sequence) token is predicted
        break

print("\nFinal Output:")
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))


Input Prompt: What is the capital of France?

Step-by-Step Generation:
Step 1: What is the capital of France?

size of input ndarray is torch.Size([1, 7, 50257])
Step 2: What is the capital of France?


size of input ndarray is torch.Size([1, 8, 50257])
Step 3: What is the capital of France?

The
size of input ndarray is torch.Size([1, 9, 50257])
Step 4: What is the capital of France?

The capital
size of input ndarray is torch.Size([1, 10, 50257])
Step 5: What is the capital of France?

The capital of
size of input ndarray is torch.Size([1, 11, 50257])
Step 6: What is the capital of France?

The capital of France
size of input ndarray is torch.Size([1, 12, 50257])
Step 7: What is the capital of France?

The capital of France is
size of input ndarray is torch.Size([1, 13, 50257])
Step 8: What is the capital of France?

The capital of France is Paris
size of input ndarray is torch.Size([1, 14, 50257])
Step 9: What is the capital of France?

The capital of France is Paris.
size of input 

In [11]:
predictions.size()

torch.Size([1, 16, 50257])
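
The sequence dimension of the logits grows by one at every step, because each forward pass re-processes the whole sequence generated so far. Here is a small sketch of that arithmetic, assuming the 7-token prompt and the 50257-entry vocabulary reported in the output above. The helper `logits_shape` is hypothetical, introduced only for illustration.

```python
PROMPT_TOKENS = 7     # tokens in "What is the capital of France?" (from the output above)
VOCAB_SIZE = 50257    # GPT-2 vocabulary size (last dimension of the logits)

def logits_shape(step, batch_size=1):
    """Shape of the logits at a 1-indexed generation step.

    At step k the model sees the prompt plus the k - 1 tokens generated so far.
    """
    return (batch_size, PROMPT_TOKENS + step - 1, VOCAB_SIZE)

for step in (1, 2, 10):
    print(step, logits_shape(step))
```

The shape for step 10 agrees with the `torch.Size([1, 16, 50257])` shown above: the tenth forward pass sees 7 prompt tokens plus 9 generated ones.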

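Next, we dig deeper to check whether an LLM such as GPT-2 really is a decoder-only architecture. In a decoder-only model, self-attention is causal: each token can attend only to itself and to earlier tokens, so the attention matrix should be lower triangular.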

In [3]:
from transformers import GPT2Model, GPT2Tokenizer
import torch

# Load GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

# Encode a sample input
input_text = "What is the capital of France?"
inputs = tokenizer(input_text, return_tensors="pt")

# Run a forward pass and request the attention weights
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)
    attentions = outputs.attentions  # Tuple of attention tensors, one per layer

# Check the number of layers and the shape of the attention weights
print(f"Number of layers: {len(attentions)}")
print(f"Shape of attention matrix (layer 0): {attentions[0].shape}")  # (batch_size, num_heads, seq_len, seq_len)

# Verify causal masking in layer 0
attn_weights = attentions[0][0][0]  # Attention weights for the first batch item and first head in the first layer
print("Causal mask (first layer, first head):")
print((attn_weights > 0).int())  # Nonzero weights mark the positions each token can attend to


Number of layers: 12
Shape of attention matrix (layer 0): torch.Size([1, 12, 7, 7])
Causal mask (first layer, first head):
tensor([[1, 0, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1]], dtype=torch.int32)
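
The lower-triangular pattern is what causal masking produces: scores for future positions are set to negative infinity before the softmax, so their attention weights become exactly zero. The sketch below reproduces the effect in plain Python with made-up uniform scores on a 4-token sequence. It illustrates the mechanism only and is not GPT-2's actual implementation; `causal_softmax` is a hypothetical helper.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Mask future positions (j > i) with -inf so exp() maps them to 0.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

seq_len = 4
scores = [[0.0] * seq_len for _ in range(seq_len)]  # hypothetical raw attention scores
weights = causal_softmax(scores)

# Binary view, analogous to the `> 0` check in the cell above
for row in weights:
    print([int(w > 0) for w in row])
```

Each row still sums to 1, so a token spreads all of its attention over itself and the tokens before it.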


In [5]:
from transformers import GPT2Tokenizer, GPT2Model
import torch

# Load GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

# Define input prompt
input_prompt = "What is the capital of France?"

# Tokenize input
inputs = tokenizer(input_prompt, return_tensors="pt")

# Forward pass to get hidden states (contextualized embeddings)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
    hidden_states = outputs.hidden_states  # Tuple: embedding output plus one tensor per layer

# Display shape of hidden states
print(f"Input prompt: '{input_prompt}'")
print(f"Number of tokens: {len(inputs['input_ids'][0])}")
print(f"Shape of hidden states (last layer): {hidden_states[-1].shape}")


Input prompt: 'What is the capital of France?'
Number of tokens: 7
Shape of hidden states (last layer): torch.Size([1, 7, 768])
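
Each token ends up as a 768-dimensional vector, while the generation cell earlier produced 50257 logits per position. The step between the two is the language-modeling head, a linear projection from the hidden size to the vocabulary size. Below is a minimal sketch with toy dimensions and made-up weights. The names `lm_head` and `W` are hypothetical, and the sizes 3 and 5 stand in for 768 and 50257.

```python
def lm_head(hidden, W):
    """Project a hidden vector (size d) to vocabulary logits (size V).

    W has one row of length d per vocabulary entry, so logits[v] = W[v] . hidden.
    """
    return [sum(w_i * h_i for w_i, h_i in zip(row, hidden)) for row in W]

hidden = [1.0, 0.0, -1.0]          # toy hidden state, d = 3 (768 in the output above)
W = [                              # toy weights, V = 5 rows (50257 for GPT-2)
    [0.5, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, -1.0],
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
]

logits = lm_head(hidden, W)
next_token_id = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
print(len(logits), next_token_id)
```

Taking the argmax over these logits for the last position is the greedy selection used in the generation loop.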
