<a href="https://colab.research.google.com/github/Jashwanthgadipally/NLP/blob/main/Assignment-8%20.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, AdamW
from torch.utils.data import Dataset, DataLoader
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

text = """Once upon a time, there was a little girl named Red Riding Hood. She loved to visit her grandmother, who lived in the woods. One day, her mother asked her to take a basket of goodies to her grandmother. On her way through the woods, she met a big bad wolf who wanted to eat her."""

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # Set pad token to eos_token to avoid NoneType error

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.to(device)

class TextDataset(Dataset):
    def __init__(self, text, tokenizer, max_length=50):
        self.tokens = tokenizer(text, return_tensors="pt", truncation=True)["input_ids"][0]
        self.max_length = max_length

    def __len__(self):
        return len(self.tokens) - self.max_length  # Number of training steps

    def __getitem__(self, idx):
        return self.tokens[idx:idx+self.max_length]

def collate_fn(batch):
    max_length = max([len(x) for x in batch])
    padded_batch = [torch.cat([x, torch.full((max_length - len(x),), tokenizer.pad_token_id)]) for x in batch]
    return torch.stack(padded_batch)

dataset = TextDataset(text, tokenizer)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)

def train_model(epochs):
    model.train()
    optimizer = AdamW(model.parameters(), lr=3e-5)

    for epoch in range(epochs):
        for batch in dataloader:
            inputs = batch.to(device)
            labels = inputs.clone()
            outputs = model(inputs, labels=labels)
            loss = outputs.loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item()}")

for epochs in [20, 60, 70]:
    print(f"Training with {epochs} epochs")
    train_model(epochs)

def generate_text(seed_text, max_length=50):
    model.eval()
    input_ids = tokenizer.encode(seed_text, return_tensors="pt").to(device)
    generated_ids = model.generate(input_ids, max_length=max_length, num_return_sequences=1, no_repeat_ngram_size=2)
    return tokenizer.decode(generated_ids[0], skip_special_tokens=True)

seed_text = "Once upon a time"
generated_text = generate_text(seed_text)
print("Generated Text:", generated_text)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Training with 20 epochs




Epoch 1/20, Loss: 1.477277159690857
Epoch 2/20, Loss: 0.5350108742713928
Epoch 3/20, Loss: 0.17864984273910522
Epoch 4/20, Loss: 0.13991887867450714
Epoch 5/20, Loss: 0.10499940812587738
Epoch 6/20, Loss: 0.09622059017419815
Epoch 7/20, Loss: 0.16820241510868073
Epoch 8/20, Loss: 0.06449555605649948
Epoch 9/20, Loss: 0.04293963313102722
Epoch 10/20, Loss: 0.07637524604797363
Epoch 11/20, Loss: 0.11447741091251373
Epoch 12/20, Loss: 0.08244816213846207
Epoch 13/20, Loss: 0.13072648644447327
Epoch 14/20, Loss: 0.03932282701134682
Epoch 15/20, Loss: 0.0907992422580719
Epoch 16/20, Loss: 0.0660153478384018
Epoch 17/20, Loss: 0.02315300516784191
Epoch 18/20, Loss: 0.1262444704771042
Epoch 19/20, Loss: 0.0622200183570385
Epoch 20/20, Loss: 0.08262854814529419
Training with 60 epochs
Epoch 1/60, Loss: 0.05067526176571846
Epoch 2/60, Loss: 0.03255004063248634
Epoch 3/60, Loss: 0.006872095633298159
Epoch 4/60, Loss: 0.15995332598686218
Epoch 5/60, Loss: 0.007679694332182407
Epoch 6/60, Loss: 0.

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Epoch 70/70, Loss: 0.025888856500387192
Generated Text: Once upon a time, there was a little girl named Red Riding Hood. She loved to visit her grandmother, who lived in the woods. One day, her mother asked her to take a basket of goodies to her grandma. On her way through the
