**Text Generation with GPT-2**

**TASK - 01**

Train a model to generate coherent and contextually relevant text based on a given prompt. Starting with GPT-2, a transformer model developed by OpenAI, you will learn how to fine-tune the model on a custom dataset to create text that mimics the style and structure of your training data.

 REFERENCE -1 https://huggingface.co/blog/how-to-generate

 REFERENCE-2 https://colab.research.google.com/drive/15qBZx5y9rdaQSyWpsreMDnTiZ5IlN0zD?usp=sharing


Fine-Tuning GPT-2 for Text Generation
This guide walks you through fine-tuning GPT-2 on a custom dataset to generate coherent and contextually relevant text.

**Step 1: Install and Import Required Libraries**

In [1]:
!pip install transformers datasets torch



Before we start, ensure you have the necessary Python libraries installed.

1. transformers Library
What is it?

--transformers is a library by Hugging Face that provides pre-trained AI models like GPT-2, BERT, T5, etc. It simplifies the process of downloading, fine-tuning, and using these models.

Why do we need it?

--Provides pre-trained GPT-2 model (GPT2LMHeadModel).

--Includes GPT-2 tokenizer (GPT2Tokenizer).

--Allows easy fine-tuning of GPT-2.

--Provides the Trainer API to simplify training.

2. datasets Library

What is it?
--The datasets library (also from Hugging Face) provides efficient ways to load, process, and manipulate datasets for machine learning.

Why do we need it?

--Loads datasets from files (.txt, .csv, .json).

--Helps split data into training and testing sets.

--Optimized for large datasets: data is stored in memory-mapped Apache Arrow tables, so a dataset does not have to fit in RAM.

3. torch (PyTorch)

What is it?

--torch is the PyTorch library, which is one of the most popular deep-learning frameworks.

Why do we need it?

--Runs GPT-2 efficiently on CPU or GPU.

--Helps with model training and backpropagation.

--Supports tensor operations for deep learning




---



**Step 2: Importing the Required Libraries**

In [2]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset, DatasetDict


torch (PyTorch) → Used for deep learning and tensor operations.

Required to train and run GPT-2 on GPU/CPU.

Helps move the model to GPU for faster training


GPT2LMHeadModel → Loads the GPT-2 model for text generation.

GPT2Tokenizer → Loads the GPT-2 tokenizer (used to process text into tokens).

Trainer → A simplified training API for fine-tuning the model.

TrainingArguments → Defines hyperparameters like batch size, epochs, learning rate, etc.

DataCollatorForLanguageModeling → Assembles batches for training; with mlm=False it also builds the labels for causal language modeling by copying input_ids.


load_dataset → Loads datasets in formats like .txt, .csv, .json.

Helps split data into training and evaluation sets.

Handles large datasets efficiently (data is memory-mapped from disk rather than loaded fully into RAM).

**Step 3: Loading the Dataset**

In [5]:
dataset = load_dataset("text", data_files={"train": "/content/generative_ai_dataset.txt"})
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # Handle padding




----Uses load_dataset() from Hugging Face’s datasets library to load a dataset.

----"text" → Specifies that we are loading a plain text file.

----data_files={"train": "/content/generative_ai_dataset.txt"} →

----Loads the file "generative_ai_dataset.txt" located in Google Colab's /content/ directory.

----Assigns it the label "train" (this means it will be used as the training dataset).



----Allows GPT-2 to learn from a custom dataset instead of its default training data.

----The dataset can contain specific topics (e.g., AI, finance, medicine) to help GPT-2 specialize in that area



----Loads the pre-trained GPT-2 tokenizer from Hugging Face.

----The tokenizer converts raw text into numerical tokens (which GPT-2 understands).

----It also includes GPT-2’s vocabulary and special tokens




----GPT-2 doesn’t process text directly; it needs numbers.

----The tokenizer uses byte-level byte-pair encoding (BPE): rare or unseen words are split into smaller subword tokens, so no input is ever out of vocabulary.



----GPT-2 does NOT have a padding token (<PAD>) by default.

----We set the padding token equal to the end-of-sequence (<EOS>) token, which GPT-2 writes as <|endoftext|> (token ID 50256).

----This allows GPT-2 to handle sentences of different lengths during training.

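PADDING: Every sequence in a batch must have the same length so that the batch can be stacked into a single tensor. Some sentences are shorter than others, so padding tokens are appended to the shorter ones until they reach the target length (in this notebook, a fixed 512 tokens, set in Step 4).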

**Step 4: Define a Tokenization Function**

In [6]:
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)



--It takes raw text as input (examples["text"]).

--It converts the text into tokens using the tokenizer.

--It ensures that all sequences are the same length by:

----Truncating longer sequences (truncation=True).

----Padding shorter sequences (padding="max_length").

----Fixing sequence length to 512 (max_length=512).

Breaking it Down
1️⃣ Tokenization
The tokenizer converts words into numbers (tokens).

GPT-2 doesn’t work with raw text; it needs numerical token sequences.

2️⃣ Truncation (truncation=True)
Some sentences are longer than 512 tokens.

Truncation ensures that only the first 512 tokens are kept.

This prevents memory errors and keeps inputs within GPT-2’s limit.

3️⃣ Padding (padding="max_length")
Sentences shorter than 512 tokens need padding.

padding="max_length" ensures all sequences have 512 tokens by adding a padding token (<EOS> in GPT-2).

This is required for batch processing.

4️⃣ Max Length (max_length=512)
GPT-2 has a maximum context length of 1024 tokens.

We set a limit of 512 (to balance performance and memory usage).

If the sentence is longer than 512 tokens, it is truncated.
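To make the combined effect of truncation, padding, and max_length concrete, here is a minimal pure-Python sketch of what happens to a single list of token IDs. It is an illustration only, not the tokenizer's actual implementation: the helper name pad_or_truncate and the example IDs are made up for this sketch, padding is assumed to go on the right, and 50256 is the GPT-2 end-of-sequence ID that this notebook reuses for padding.

```python
EOS_ID = 50256  # GPT-2's end-of-sequence ID, reused here as the padding ID


def pad_or_truncate(token_ids, max_length=512, pad_id=EOS_ID):
    """Return (input_ids, attention_mask), each exactly max_length long.

    Hypothetical helper for illustration; the real tokenizer does this internally.
    """
    kept = list(token_ids[:max_length])             # truncation: keep the first max_length tokens
    n_pad = max_length - len(kept)                  # 0 when the sequence was already long enough
    input_ids = kept + [pad_id] * n_pad             # padding: append pad tokens on the right
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Illustrative IDs only; a small max_length keeps the output readable.
ids, mask = pad_or_truncate([11, 22, 33, 44], max_length=6)
print(ids)   # [11, 22, 33, 44, 50256, 50256]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The attention mask matters because the padding ID is the same as the end-of-sequence ID: the mask is what tells the model which positions are real text and which are filler.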

----Applies the tokenize_function to the entire dataset.

----Uses the map() function to process each text example.

----batched=True → Processes multiple examples at once (faster).



 Breaking it Down

1️⃣ dataset.map()
The map() function applies tokenize_function to every row in the dataset.

It creates a new dataset where each row is tokenized.

2️⃣ batched=True
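Instead of processing one row at a time, it processes multiple rows in a batch (1,000 rows per call by default).
This improves speed because the tokenizer handles a whole list of texts in a single call instead of being called once per row.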
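The shape of the data that a batched function receives can be sketched without the datasets library. The following is a simplified stand-in, not the library's implementation: map_batched and count_words are hypothetical names, and a word count replaces real tokenization. The point it shows is that with batching the function gets a dictionary of columns (each value is a list) and must return lists of the same length.

```python
def count_words(batch):
    """Batched function: receives {"text": [str, ...]} and returns new columns as lists."""
    return {"n_words": [len(text.split()) for text in batch["text"]]}


def map_batched(rows, fn, batch_size=1000):
    """Apply fn to slices of rows, passing each slice as a dict of columns."""
    result = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of rows into a dict of columns, one list per field.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)  # one function call per batch, not per row
        for i, row in enumerate(chunk):
            merged = dict(row)
            merged.update({name: values[i] for name, values in new_columns.items()})
            result.append(merged)
    return result


rows = [{"text": "fine tuning gpt2"}, {"text": "hello"}, {"text": "top k sampling"}]
for row in map_batched(rows, count_words, batch_size=2):
    print(row)
```

This is why tokenize_function in the cell above indexes examples["text"] directly: in batched mode that value is already a list of strings.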

**Step 5: Load the Pre-trained GPT-2 Model**

In [7]:
model = GPT2LMHeadModel.from_pretrained("gpt2")



----Loads the pre-trained GPT-2 model from Hugging Face’s model hub.

----GPT2LMHeadModel is the GPT-2 model with a language modeling head (used for text generation).

----"gpt2" → Specifies the base GPT-2 model (smallest version).

----Automatically downloads and initializes the model with pre-trained weights.



 Breaking it Down Step by Step
1️⃣ GPT2LMHeadModel
GPT2LMHeadModel is a GPT-2 model designed specifically for language modeling (LM).

This model is used for text generation, meaning it can predict the next word/token given a sequence.

2️⃣ .from_pretrained("gpt2")
This function loads a pre-trained version of GPT-2.

It downloads GPT-2’s weights and architecture from Hugging Face’s Model Hub.

GPT-2 has already been trained on WebText, a corpus of roughly 40 GB of text collected from web pages linked on Reddit.

3️⃣ What is "gpt2"?
"gpt2" refers to the smallest version of GPT-2 (124M parameters).

Other versions include:

"gpt2-medium" (355M parameters; often quoted as 345M)

"gpt2-large" (774M parameters)

"gpt2-xl" (1.5B parameters)



The following happens:

--Checks if the model is available locally

----If already downloaded, it loads the model from disk.

----If not, it downloads the model from Hugging Face’s model hub.

--Loads GPT-2’s architecture and pre-trained weights

----The model’s parameters (neural network weights) are loaded.

----This allows us to fine-tune or use GPT-2 without training from scratch.

--Model is ready for inference (text generation)

----The model can now generate text based on a given prompt.
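The 124M figure quoted above for "gpt2" can be reproduced with a little arithmetic. The sketch below assumes the standard GPT-2 small configuration (vocabulary of 50,257 tokens, 1,024 positions, hidden size 768, 12 transformer blocks) and assumes that the output head shares its weights with the token embedding, so it adds no parameters of its own. The function name gpt2_parameter_count is made up for this example.

```python
def gpt2_parameter_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12):
    """Count parameters of a GPT-2 style model with a tied output head."""
    token_embeddings = vocab_size * n_embd
    position_embeddings = n_positions * n_embd
    # Per block weights: attention (3 + 1) * n_embd^2, feed-forward (4 + 4) * n_embd^2.
    # Per block biases and layer norms: 9 * n_embd + 4 * n_embd = 13 * n_embd.
    per_block = 12 * n_embd ** 2 + 13 * n_embd
    final_layer_norm = 2 * n_embd
    return token_embeddings + position_embeddings + n_layer * per_block + final_layer_norm


print(gpt2_parameter_count())                          # 124439808
print(gpt2_parameter_count(n_embd=1024, n_layer=24))   # 354823168
```

The second call uses a hidden size of 1,024 and 24 blocks, which gives the roughly 355M parameters of the medium model.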


**Step 6: The Training Process for Fine-Tuning a Model**

In [8]:
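training_args = TrainingArguments(
    output_dir="./results",          # where checkpoints are written
    evaluation_strategy="epoch",     # evaluate after every epoch (needs an evaluation dataset;
                                     # recent transformers releases rename this to eval_strategy)
    per_device_train_batch_size=4,   # training samples per device per step
    per_device_eval_batch_size=4,    # evaluation samples per device per step
    num_train_epochs=3,              # full passes over the training data
    save_steps=500,                  # write a checkpoint every 500 steps
    save_total_limit=2,              # keep only the 2 newest checkpoints
    logging_dir="./logs",            # TensorBoard log directory
)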




Each argument in TrainingArguments has a specific purpose:

1️⃣ output_dir="./results"

----Purpose: Defines where to save the trained model and checkpoints.

----Explanation: After training, the model files will be stored in the "./results" folder.

Example:

If we set output_dir="./my_model", the trained model will be saved inside "./my_model".

2️⃣ evaluation_strategy="epoch"
Purpose: Specifies how often to evaluate the model.

Options:

"no" → No evaluation.

"steps" → Evaluate after a certain number of steps.

"epoch" → Evaluate at the end of every epoch.

Why?

"epoch" ensures that the model is evaluated once after every full pass through the dataset.

3️⃣ per_device_train_batch_size=4
Purpose: Defines the batch size for training.

Explanation:

----Each batch consists of 4 samples (texts).

----A smaller batch size uses less memory, but may train slower.

Example:

If you have 16 training samples and per_device_train_batch_size=4, then:

Total batches per epoch = 16 / 4 = 4 batches.



4️⃣ per_device_eval_batch_size=4
Purpose: Defines the batch size for evaluation (validation/testing).

Explanation:

--During evaluation, the model processes 4 samples per batch.

--A smaller batch size is useful when using a GPU with limited memory.



5️⃣ num_train_epochs=3
Purpose: Specifies the number of training epochs (full passes through the dataset).

Explanation:

--If num_train_epochs=3, the model sees the entire dataset 3 times.

--More epochs allow better training, but too many can cause overfitting.


6️⃣ save_steps=500
Purpose: Saves a checkpoint of the model every 500 steps.

Why?

--Prevents data loss if training stops unexpectedly.

--Allows resuming training from a checkpoint.

7️⃣ save_total_limit=2
Purpose: Limits the number of saved checkpoints.

Explanation:

--If save_total_limit=2, only the last 2 checkpoints are kept.

--Older checkpoints are deleted to save disk space.

8️⃣ logging_dir="./logs"
Purpose: Sets the directory where training logs (training loss, learning rate, etc.) are written.

Explanation:

--Logs are useful for tracking training progress.

--These logs can be visualized with TensorBoard.
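How save_steps and save_total_limit interact can be sketched as a small simulation. This is a simplified model under stated assumptions, not the Trainer's actual logic: it assumes a checkpoint is written at every multiple of save_steps, that the oldest checkpoint is deleted once the limit is exceeded, and it ignores any extra save at the end of training or special handling of a best model. The function name surviving_checkpoints is hypothetical; the checkpoint-<step> folder naming follows the Trainer's convention.

```python
def surviving_checkpoints(total_steps, save_steps=500, save_total_limit=2):
    """Return the checkpoint folders left on disk after a simulated run."""
    saved = []
    for step in range(save_steps, total_steps + 1, save_steps):
        saved.append(f"checkpoint-{step}")           # a checkpoint every save_steps steps
        if save_total_limit is not None and len(saved) > save_total_limit:
            saved.pop(0)                             # drop the oldest one
    return saved


# 1,000 samples at batch size 4 for 3 epochs is 750 steps: only step 500 is reached.
print(surviving_checkpoints(750))    # ['checkpoint-500']
# A longer run of 2,000 steps keeps just the two newest checkpoints.
print(surviving_checkpoints(2000))   # ['checkpoint-1500', 'checkpoint-2000']
```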









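**Step 7: The TrainingArguments Class Configures How a Model Is Trained, Evaluated, and Saved**

This cell replaces the Step 6 configuration with a shorter one that turns evaluation off, which suits the dataset from Step 3 because it contains only a "train" split.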



In [9]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="no",  # ✅ Disable evaluation
    per_device_train_batch_size=8,
    num_train_epochs=3,
)


1️⃣ output_dir="./results"

**Purpose**: Defines the directory where the trained model, checkpoints, and logs will be saved.

What happens?

--After training, Hugging Face will save the model files in the ./results folder.

--This includes:

----The trained model.

----Training logs.

----Checkpoints (if saving is enabled).

2️⃣ evaluation_strategy="no"

**Purpose**: Disables model evaluation during training.

Options for evaluation_strategy:

"no" → No evaluation (default behavior).

"steps" → Evaluates after every eval_steps.

"epoch" → Evaluates after each epoch.

Why use "no"?

--If you only want to train the model and do not need intermediate evaluation.

--Useful when evaluation is not necessary (e.g., training on raw data without validation).



3️⃣ per_device_train_batch_size=8

**Purpose**: Defines the number of samples the model processes at a time (per GPU/CPU).

What happens?

--The model processes 8 text samples per batch before updating weights.

--A larger batch size can speed up training if memory allows.

--A smaller batch size helps when memory is limited.

Example Calculation:

--If the dataset has 800 samples and per_device_train_batch_size=8, then:

--Total batches per epoch = 800 / 8 = 100 batches.

4️⃣ num_train_epochs=3

**Purpose:** Defines how many times the model will go through the entire training dataset.

What happens?

--The model will train 3 times over the full dataset.

--More epochs allow the model to learn better, but too many can cause overfitting.

Example:

--If the dataset has 800 samples and per_device_train_batch_size=8:

1 epoch = 100 batches.

3 epochs = 3 × 100 = 300 batches, i.e. 300 weight updates over the whole run.
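The batch arithmetic above generalizes to dataset sizes that do not divide evenly by the batch size. The sketch below is a minimal helper under a few assumptions: a single device by default, no gradient accumulation, and the last partial batch of each epoch is kept rather than dropped. The name training_steps is made up for this example.

```python
import math


def training_steps(num_samples, batch_size, num_epochs, num_devices=1):
    """Return (steps_per_epoch, total_steps) for a simple training run."""
    samples_per_step = batch_size * num_devices
    steps_per_epoch = math.ceil(num_samples / samples_per_step)  # a partial batch still counts
    return steps_per_epoch, steps_per_epoch * num_epochs


print(training_steps(800, 8, 3))    # (100, 300), the example from the text
print(training_steps(1000, 8, 3))   # (125, 375)
print(training_steps(1001, 8, 1))   # (126, 126), one extra step for the leftover sample
```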

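**Step 8: Generate Text from a Prompt with GPT-2**

Note that the cells above only configure training: no Trainer is created and trainer.train() is never called, so the text generated below comes from the pre-trained GPT-2 weights loaded in Step 5 rather than from a fine-tuned model.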

In [10]:
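def generate_text(prompt, max_length=100):
    # Tokenize the prompt into PyTorch tensors (input_ids and attention_mask)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=max_length,                # total length in tokens, prompt included
        num_return_sequences=1,
        do_sample=True,                       # without this, decoding is greedy and the sampling settings below are ignored
        temperature=0.7,                      # below 1 sharpens the distribution (less random)
        top_k=50,                             # Top-k sampling
        top_p=0.9,                            # Nucleus sampling
        repetition_penalty=1.2,               # Penalizes repetition
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token, so reuse EOS
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Sampling is random, so the generated text differs from run to run
print(generate_text(" Once upon a time"))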




 Once upon a time, the world was filled with people who were willing to sacrifice their lives for one another.
The first thing that came out of this is how much it changed my life and what I learned from them: "I'm not going anywhere." It's hard enough being an adult in America when you're so young; but now there are kids like me whose parents have been killed by guns or murdered because they didn't want anyone else around anymore—and then we all get together at


1️⃣ def generate_text(prompt, max_length=100):

**Purpose**: Defines a function called generate_text to generate text using GPT-2.

**Parameters**:

prompt: The initial text input for the model.

max_length=100: Caps the total sequence (prompt plus generated text) at 100 tokens. Tokens are subword units, so this is not the same as 100 words.



2️⃣ inputs = tokenizer(prompt, return_tensors="pt")

**Purpose**: Converts the text prompt into a format the model can understand.

**Explanation**:

The tokenizer converts the text into token IDs (numerical representation).

return_tensors="pt" returns the result as a PyTorch tensor (pt = PyTorch).

3️⃣ outputs = model.generate(**inputs, max_length=max_length, ...)

**Purpose**: Uses the GPT-2 model to generate text.

**Arguments**:

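\*\*inputs: Unpacks the tokenizer output (input_ids and attention_mask) into keyword arguments for generate().

max_length=max_length: Caps the total sequence length, prompt included, at max_length tokens.

temperature=0.7: Divides the logits before the softmax. Values below 1 sharpen the distribution, so the output is less random.

top_k=50: Restricts sampling to the 50 most likely next tokens.

top_p=0.9: Restricts sampling to the smallest set of tokens whose cumulative probability reaches 0.9 (nucleus sampling).

repetition_penalty=1.2: Lowers the scores of tokens that already appear in the sequence.

temperature, top_k, and top_p only take effect when generate() is called with do_sample=True; otherwise decoding is greedy.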
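How temperature, top_k, top_p, and repetition_penalty shape the choice of a single next token can be sketched in plain Python. This is a simplified illustration on a toy four-token vocabulary, not the code that model.generate() runs: all function names are made up for this example, and the order of operations (penalty, then temperature, then top-k, then top-p) is an assumption about a typical sampling pipeline.

```python
import math
import random


def apply_repetition_penalty(logits, previous_ids, penalty=1.2):
    """Make already-used tokens less likely: shrink positive scores, push negative ones lower."""
    adjusted = list(logits)
    for token_id in set(previous_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a temperature below 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=50, top_p=0.9):
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p mass."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    mass = sum(p for _, p in ranked)             # renormalize after the top-k cut
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p / mass
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token_id: p / total for token_id, p in kept}


def sample_next_token(logits, previous_ids=(), temperature=0.7, top_k=50,
                      top_p=0.9, repetition_penalty=1.2, rng=random):
    """Draw one token ID from the filtered, renormalized distribution."""
    logits = apply_repetition_penalty(logits, previous_ids, repetition_penalty)
    candidates = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    token_ids = list(candidates)
    weights = [candidates[t] for t in token_ids]
    return rng.choices(token_ids, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]               # scores for a 4-token toy vocabulary
print(filter_top_k_top_p(softmax(toy_logits, temperature=0.7)))
print(sample_next_token(toy_logits, previous_ids=[0], rng=random.Random(0)))
```

With these toy scores, the two least likely tokens fall outside the 0.9 nucleus and can never be sampled, which is how top_p trims the unlikely tail while still leaving some randomness among the likely tokens.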

***Conclusion: Text Generation with GPT-2***

**🔹 Overview**

We implemented GPT-2 text generation using the Hugging Face transformers library. The model takes an input prompt and generates contextually relevant text by predicting the next words in a sequence.



**Key Takeaways**

**1️⃣ Tokenization**

Converts text into numerical tokens that GPT-2 can process.

Uses tokenizer(prompt, return_tensors="pt") to transform input into PyTorch tensors.

**2️⃣ Model Training & Configuration**

The GPT-2 model is loaded using GPT2LMHeadModel.from_pretrained("gpt2").

Training is controlled using TrainingArguments, which sets batch size, epochs, and evaluation settings.

**3️⃣ Text Generation Process**

model.generate() creates new text based on the given prompt.

Uses parameters like temperature, top_k, top_p, and repetition_penalty to balance randomness and coherence.

**4️⃣ Output Processing**

Converts generated token sequences back into human-readable text using tokenizer.decode(outputs[0], skip_special_tokens=True).

skip_special_tokens=True removes special tokens such as <|endoftext|> from the decoded text, so only readable output remains.



**Final Thoughts**


--GPT-2 is a powerful language model that can generate creative and realistic text.

--Fine-tuning on domain-specific data can improve relevance.

--Hyperparameters like temperature and top_k control the creativity of the generated text.

--While GPT-2 is good at coherence, it may sometimes generate inaccurate or biased outputs, so human supervision is important.