In [3]:

# Text Generation with GPT-2 - Internship Task (Prodigy Infotech)

## 📌 Install Dependencies
!pip install transformers datasets

## 📌 Import Libraries
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments, pipeline
import os
os.environ["WANDB_DISABLED"] = "true"  # ✅ Place this early!

## 📌 Load Pre-trained GPT-2
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

## 📁 Upload Dataset
from google.colab import files
uploaded = files.upload()

# Use the name of the first uploaded file rather than a hard-coded one
file_path = list(uploaded.keys())[0]

## 📦 Load Dataset
def load_dataset(file_path, tokenizer):
    return TextDataset(
        tokenizer=tokenizer,
        file_path=file_path,
        block_size=128,
    )

train_dataset = load_dataset(file_path, tokenizer)

## 🧠 Data Collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)

## ⚙️ Training Arguments
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    save_steps=500,
    save_total_limit=2,
    logging_steps=100,
    report_to="none",  # preferred over the deprecated WANDB_DISABLED environment variable
)

## 🏋️‍♀️ Trainer Setup
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)


## 🔁 Train the Model
trainer.train()

## ✨ Generate Text
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = "Once upon a time"
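# max_new_tokens limits only the continuation and avoids the max_length/max_new_tokens conflict
outputs = generator(prompt, max_new_tokens=100, num_return_sequences=1)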

print("Generated Text:\n", outputs[0]['generated_text'])

## 💾 Save the Model
model.save_pretrained("gpt2-finetuned-custom")
tokenizer.save_pretrained("gpt2-finetuned-custom")




Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Saving sample_poetry_dataset.txt to sample_poetry_dataset (2).txt


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generated Text:
 Once upon a time, the player can choose to make a deal with the master to make a quick move.

When a master dies, the player can purchase a new one-handed weapon.

At a time, the player can choose to enter a new event that will be held in the next time period.

The player will be able to select a new event to be held at the end of a time.

The event will be held in the next time period.

The event will not be held in the next time period.

The event will be held in the next time period.

The event will not be held in the next time period.

The event will not be held in the next time period.

The event will not be held in the next time period.
The event will not be held in the next time period.

The event will not be held in the next time period.

The event will not be held in the next time period.

The event will not be held in the next time period.

The event will not be held in the next time period

The event will not be held in the previous time period.

The event w

('gpt2-finetuned-custom/tokenizer_config.json',
 'gpt2-finetuned-custom/special_tokens_map.json',
 'gpt2-finetuned-custom/vocab.json',
 'gpt2-finetuned-custom/merges.txt',
 'gpt2-finetuned-custom/added_tokens.json')

In [None]:
## ✅ Internship Task 01: Text Generation with GPT-2

### 📁 Dataset
- **Domain**: Poetry
- **Format**: Plain text file (`sample_poetry_dataset.txt`)
- **Size**: 552 bytes
- The dataset consisted of poetic sentences and lines, used to fine-tune the GPT-2 model to learn stylistic and structural patterns.
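
To see how little training data a file of this size provides, the arithmetic can be sketched as below. This is a rough sketch that assumes `TextDataset` cuts the token stream into non-overlapping blocks of `block_size` tokens and drops the trailing partial block, and that training runs on a single device without gradient accumulation. The token count of 140 is a hypothetical figure (roughly 4 bytes per token), not a measurement of `sample_poetry_dataset.txt`, and the helper names `count_blocks` and `training_schedule` are illustrative only.

```python
import math


def count_blocks(num_tokens: int, block_size: int = 128) -> int:
    """Full, non-overlapping blocks in a token stream (partial tail is dropped)."""
    return num_tokens // block_size


def training_schedule(num_tokens: int, block_size: int = 128, batch_size: int = 2,
                      epochs: int = 1, logging_steps: int = 100) -> dict:
    """Estimate optimizer steps and how many loss rows would be logged."""
    blocks = count_blocks(num_tokens, block_size)
    steps_per_epoch = math.ceil(blocks / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "blocks": blocks,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        # A loss row is logged once every `logging_steps` optimizer steps.
        "logged_rows": total_steps // logging_steps,
    }


# Hypothetical token count for a file of a few hundred bytes.
print(training_schedule(140))
```

Under these assumptions a file of this size yields a single training block and a single optimizer step, which is consistent with the empty `Step, Training Loss` table in the output above.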

### 🧠 What the Model Learned
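- GPT-2 was fine-tuned for one epoch on a small poetry dataset.
- With only 552 bytes of text and `block_size=128`, the run covered very few training steps. The empty `Step, Training Loss` table is consistent with fewer than `logging_steps=100` steps having run, so the model most likely stays close to pretrained GPT-2.
- The sample output reflects this: it reads like game-manual prose rather than poetry, and it falls into a loop of repeating phrases. That repetition is better explained by the tiny dataset and short training than by a learned poetic style.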
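
The repeating phrases can be quantified rather than judged by eye. The snippet below is a minimal sketch of one possible metric, the fraction of non-empty lines that duplicate an earlier line. The function name `repeated_line_ratio` is hypothetical, and the sample string is a four-line excerpt from the generated text shown in the output above.

```python
from collections import Counter


def repeated_line_ratio(text: str) -> float:
    """Fraction of non-empty lines that duplicate an earlier line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    # Every occurrence beyond the first one counts as a repeat.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(lines)


sample = "\n".join([
    "The event will be held in the next time period.",
    "The event will not be held in the next time period.",
    "The event will be held in the next time period.",
    "The event will not be held in the next time period.",
])
print(repeated_line_ratio(sample))  # 0.5
```

A ratio near 0 means mostly distinct lines. If repetition is the main concern, `generate` options such as `no_repeat_ngram_size` and `repetition_penalty` are commonly used to curb it; they were not used in the run recorded here.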

### ✨ Sample Output
```text
Once upon a time, the player can choose to make a deal with the master...
The event will be held in the next time period...
The event will not be held in the next time period...
```