
In [15]:
!pip install transformers datasets torch



In [16]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config
from torch.utils.data import Dataset, DataLoader

In [17]:
# 1. Data Preparation
text = """Once upon a time, there was a little girl named Red Riding Hood. She loved to visit her grandmother, who lived in the woods. One day, her mother asked her to take a basket of goodies to her grandmother. On her way through the woods, she met a big bad wolf who wanted to eat her. [CO5]"""


In [18]:
# 2. Tokenization
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Add a padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})



In [19]:
# Tokenize the text
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
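To make the effect of `padding=True` concrete, here is a small library-free sketch of what padding a batch and building the matching attention mask amounts to. The `pad_batch` helper and the token ids are hypothetical and only illustrate the idea; the tokenizer handles this internally. With a single input string, as in the cell above, there is nothing to pad, so the mask is all ones.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return input_ids, attention_mask


# Hypothetical token ids; 50257 stands in for the id of the added [PAD] token.
batch = [[7454, 2402, 257, 640], [464, 17481]]
ids, mask = pad_batch(batch, pad_id=50257)
print(ids)
print(mask)
```

The shorter sequence is extended with the pad id, and its mask marks those positions with 0 so the model can ignore them.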

In [20]:
# 3. Create Custom Dataset
class TextDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # The encodings already hold tensors, so copy with clone().detach() rather than torch.tensor()
        return {key: val[idx].clone().detach() for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

In [21]:
dataset = TextDataset(inputs)
train_loader = DataLoader(dataset, batch_size=1, shuffle=True) # Use batch_size=1 for simplicity


In [22]:
# 4. Model Setup
# Load the base 'gpt2' checkpoint, the smallest of the original GPT-2 models (swap in 'gpt2-medium' for a larger one)
configuration = GPT2Config.from_pretrained('gpt2', output_hidden_states=False)
model = GPT2LMHeadModel.from_pretrained("gpt2", config=configuration)
model.resize_token_embeddings(len(tokenizer))  # Resize if you added special tokens

Embedding(50258, 768)

In [23]:
# Optimizer (the loss itself is computed inside the model when labels are passed)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
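As a rough illustration of what one optimizer step does with `lr=5e-5`, the sketch below applies the standard AdamW update to a single scalar parameter. The `adamw_step` helper, the starting value, and the gradient are hypothetical; the betas, eps, and weight decay are assumed to match the `torch.optim.AdamW` defaults.

```python
import math


def adamw_step(param, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter; t is the 1-based step count."""
    param = param * (1.0 - lr * weight_decay)      # decoupled weight decay
    m = beta1 * m + (1.0 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad    # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction for the early steps
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Hypothetical starting value and gradient, chosen only for illustration.
p, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(p)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's magnitude.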

In [28]:
# 5. Training Loop
def train(epochs):
    for epoch in range(epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            input_ids = batch['input_ids'].to(model.device)
            attention_mask = batch['attention_mask'].to(model.device)
            # Forward pass; with labels given, the model shifts them internally and returns the LM loss
            outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
            # Backward pass and optimization
            loss = outputs.loss
            loss.backward()
            optimizer.step()

        print(f"Epoch {epoch+1}/{epochs} - Loss: {loss.item()}")
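Passing `labels=input_ids` works because the model compares the prediction at position t with the token at position t + 1. The library-free sketch below shows that shifted cross-entropy on plain Python lists. The `causal_lm_loss` helper and the toy vocabulary of 5 tokens are hypothetical, and this is an illustration of the idea rather than the library's own code.

```python
import math


def causal_lm_loss(logits, labels):
    """Mean next-token cross-entropy: position t is scored against labels[t + 1]."""
    total = 0.0
    count = 0
    for t in range(len(labels) - 1):
        scores = logits[t]
        target = labels[t + 1]
        # Numerically stable log of the softmax normalizer
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[target]  # negative log-probability of the target
        count += 1
    return total / count


# Hypothetical 4-token sequence over a 5-token vocabulary.
labels = [1, 2, 3, 4]
uniform_logits = [[0.0] * 5 for _ in labels]
print(causal_lm_loss(uniform_logits, labels))  # log(5) for uniform predictions
```

The last position has no next token to predict, so a sequence of n tokens contributes n - 1 terms to the average.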



In [36]:
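# Train in three successive runs; the model and optimizer are not reset in between,
# so the runs accumulate (20 + 60 + 70 = 150 epochs in total)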
epochs_list = [20, 60, 70]
for epochs in epochs_list:
    train(epochs)
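Because `train` reuses the same `model` and `optimizer` on every call, each run continues from where the previous one stopped. The short sketch below only does the bookkeeping: it shows how many epochs the shared model has seen after each call in this cell.

```python
from itertools import accumulate

epochs_list = [20, 60, 70]

# Running total of epochs the shared model has seen after each call to train().
cumulative = list(accumulate(epochs_list))
for requested, total in zip(epochs_list, cumulative):
    print(f"train({requested}) finishes after {total} total epochs in this cell")
```

To compare the epoch counts independently, one option is to reload the model and recreate the optimizer before each run.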


Epoch 1/20 - Loss: 0.004091780167073011
Epoch 2/20 - Loss: 0.0040012504905462265
Epoch 3/20 - Loss: 0.0039152465760707855
Epoch 4/20 - Loss: 0.003833897178992629
Epoch 5/20 - Loss: 0.003756564110517502
Epoch 6/20 - Loss: 0.003682462265715003
Epoch 7/20 - Loss: 0.0036114060785621405
Epoch 8/20 - Loss: 0.003542846068739891
Epoch 9/20 - Loss: 0.003476562211290002
Epoch 10/20 - Loss: 0.0034123589284718037
Epoch 11/20 - Loss: 0.0033500082790851593
Epoch 12/20 - Loss: 0.0032894338946789503
Epoch 13/20 - Loss: 0.003230504458770156
Epoch 14/20 - Loss: 0.0031730991322547197
Epoch 15/20 - Loss: 0.0031171520240604877
Epoch 16/20 - Loss: 0.003062596544623375
Epoch 17/20 - Loss: 0.003009296488016844
Epoch 18/20 - Loss: 0.002957178046926856
Epoch 19/20 - Loss: 0.002906341338530183
Epoch 20/20 - Loss: 0.0028566885739564896
Epoch 1/60 - Loss: 0.002808175515383482
Epoch 2/60 - Loss: 0.0027607486117631197
Epoch 3/60 - Loss: 0.0027143608313053846
Epoch 4/60 - Loss: 0.0026690128725022078
Epoch 5/60 - Loss

In [33]:
# 6. Text Generation
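def generate_text(prompt, max_length=50):
    # Calling the tokenizer returns both the input ids and the attention mask
    encoded = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Beam search; the mask and pad token id are passed explicitly so padding is unambiguous
    output = model.generate(
        encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        pad_token_id=tokenizer.pad_token_id,
        max_length=max_length, num_beams=5, no_repeat_ngram_size=2, early_stopping=True,
    )

    # Decode and print the generated text
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(generated_text)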
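The `no_repeat_ngram_size=2` argument prevents any bigram from appearing twice in the generated sequence. The sketch below shows the idea on plain token-id lists: given what has been generated so far, it finds the tokens that would complete an already-seen n-gram. The `banned_next_tokens` helper and the token ids are hypothetical, and this is not the library's implementation.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) + 1 < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the start of the next n-gram
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])  # emitting this token would repeat the n-gram
    return banned


# Hypothetical token ids: the bigram (5, 8) has already occurred and the sequence ends in 5.
print(banned_next_tokens([5, 8, 3, 5], ngram_size=2))
```

During decoding, candidates in the banned set are excluded at each step, which is why the output cannot loop over the same phrase.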

In [35]:
# Example usage for text generation
prompt = "Once upon a time"
generate_text(prompt)


Once upon a time, there was a little girl named Red Riding Hood. She loved to visit her grandmother, who lived in the woods. One day, she met a big bad wolf who wanted to eat her. [CO5]
 to
