# Task
Fine-tune a pre-trained GPT-2 model on a custom dataset, then use it to generate text from a given prompt.

## Set up the environment

### Subtask:
Install the necessary libraries, including `transformers` and `torch`.


**Reasoning**:
Install the necessary libraries using pip.



In [None]:
%pip install transformers torch

## Load and prepare the dataset

### Subtask:
Load your custom dataset and format it for training the GPT-2 model. This may involve tokenization and creating input sequences.


**Reasoning**:
The first step is to load the dataset. I will define the path to a dummy dataset file, create the file with some sample data, and then load the data into a pandas DataFrame.



In [None]:
import pandas as pd

# Define the path to your custom dataset file
dataset_path = 'custom_dataset.txt'

# Create a dummy dataset file for demonstration
dummy_data = ["This is the first sentence.", "Here is the second sentence.", "And a third one for testing."]
with open(dataset_path, 'w') as f:
    for line in dummy_data:
        f.write(line + '\n')

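# Load the dataset into a pandas DataFrame, one row per non-empty line.
# Reading the lines directly avoids pd.read_csv splitting sentences at commas.
with open(dataset_path, encoding='utf-8') as f:
    lines = [line.strip() for line in f if line.strip()]
df = pd.DataFrame({'text': lines})
print(df.head())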

**Reasoning**:
Now that the data is loaded, I need to tokenize it using a GPT-2 tokenizer and prepare the input sequences for training.



In [None]:
from transformers import GPT2Tokenizer

# Initialize a GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# GPT-2 has no padding token by default, so add one to allow batch padding
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Tokenize the dataset
tokenized_data = tokenizer(
    df['text'].tolist(),
    return_tensors='pt',
    padding=True,
    truncation=True
)

# Prepare the tokenized data as input sequences
input_ids = tokenized_data['input_ids']
attention_mask = tokenized_data['attention_mask']

print("Input IDs shape:", input_ids.shape)
print("Attention Mask shape:", attention_mask.shape)
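
To make the effect of `padding=True` concrete, the sketch below pads toy token-ID lists to a common length and builds the matching attention mask in plain Python. It is an illustration only, not the tokenizer's implementation: `pad_batch` and the ID values are hypothetical, and right-side padding is assumed.

```python
def pad_batch(sequences, pad_id):
    """Right-pad each sequence to the longest length and build a 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded_ids, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded_ids, masks

# Hypothetical token IDs for three sentences of different lengths
toy_batch = [[11, 12, 13, 14], [21, 22], [31, 32, 33]]
ids, mask = pad_batch(toy_batch, pad_id=0)
print(ids)
print(mask)
```

Every row ends up with the length of the longest sequence, which is why `input_ids` and `attention_mask` above share one shape.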

## Load the pre-trained GPT-2 model

### Subtask:
Load the pre-trained GPT-2 model from the `transformers` library. The tokenizer was already loaded in the previous step.


**Reasoning**:
Load the pre-trained GPT-2 model and resize its token embeddings if the tokenizer was resized.



In [None]:
from transformers import GPT2LMHeadModel

# Load the pre-trained GPT-2 model
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Resize the model's token embeddings if the tokenizer's vocabulary grew
# (for example, after adding a [PAD] token)
if len(tokenizer) != model.get_input_embeddings().num_embeddings:
    model.resize_token_embeddings(len(tokenizer))

print("GPT-2 model loaded successfully.")

## Configure the training parameters

### Subtask:
Set up the training arguments, such as the number of epochs, batch size, learning rate, and output directory.


**Reasoning**:
Import the `TrainingArguments` class and instantiate it with the specified parameters.



In [None]:
from transformers import TrainingArguments

# Define the output directory for checkpoints and logs
output_dir = './results'

# Instantiate TrainingArguments with report_to set to "none"
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,  # Number of training epochs
    per_device_train_batch_size=2,  # Batch size for training
    save_steps=10_000,  # Save model every 10,000 steps
    logging_steps=1000,  # Log every 1,000 steps
    report_to="none",  # Disable reporting to integrations like wandb
)

print("Training arguments configured successfully.")
print(training_args)
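
It can help to check how many optimizer steps these settings produce before choosing `logging_steps` and `save_steps`. The sketch below does the arithmetic in plain Python, assuming a single device, no gradient accumulation, and that the last partial batch is kept; `total_optimizer_steps` is a hypothetical helper, not a `transformers` function.

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# Three dummy sentences, batch size 2, three epochs
steps = total_optimizer_steps(num_examples=3, batch_size=2, num_epochs=3)
print(steps)  # 6
```

With only a handful of steps, a `logging_steps` interval of 1000 is never reached on the dummy dataset, so a smaller value is needed to see intermediate training loss.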

## Fine-tune the model

### Subtask:
Train the GPT-2 model on your custom dataset using the configured training parameters.


**Reasoning**:
Wrap the tokenized tensors in a custom `Dataset` class, instantiate the `Trainer` with the model, the training arguments, and that dataset, then start training.



In [None]:
from transformers import Trainer
import torch
from torch.utils.data import Dataset

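# Create a custom dataset class
class CustomDataset(Dataset):
    def __init__(self, input_ids, attention_mask):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        # For causal language modeling the labels are the input IDs; the model
        # shifts them internally. Padding positions are set to -100 so the
        # loss ignores them.
        self.labels = input_ids.clone()
        self.labels[attention_mask == 0] = -100

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
            'labels': self.labels[idx],
        }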

# Instantiate the custom dataset
train_dataset = CustomDataset(input_ids, attention_mask)

# Instantiate the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Start training
trainer.train()
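
The dataset can derive `labels` from `input_ids` because, for causal language modeling, `GPT2LMHeadModel` shifts the labels internally so that each position is scored on predicting the next token. The plain-Python sketch below shows which (context, target) pairs that implies; `next_token_pairs` is a hypothetical helper for illustration, not part of `transformers`, and the word-level tokens are a simplification of GPT-2's subword tokens.

```python
def next_token_pairs(token_ids):
    """Return the (context, target) pairs implied by next-token prediction."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

# Simplified tokens standing in for real token IDs
tokens = ["This", " is", " the", " first"]
for context, target in next_token_pairs(tokens):
    print(context, "->", repr(target))
```

A sequence of n tokens yields n - 1 training targets, since the first token has no preceding context to predict it from.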

## Evaluate the model

### Subtask:
Evaluate the performance of the fine-tuned model using relevant metrics.


**Reasoning**:
Define the evaluation dataset and call the trainer's `evaluate` method. This walkthrough reuses the training data, so the reported loss does not measure generalization.



In [None]:
# Define an evaluation dataset. For simplicity, use the same train_dataset.
eval_dataset = train_dataset

# Evaluate the model
evaluation_results = trainer.evaluate(eval_dataset)

# Print the evaluation results
print(evaluation_results)
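
Language-model loss is often reported as perplexity, which is the exponential of the mean per-token cross-entropy. The sketch below converts a loss value, assuming it is measured in nats (natural log); `perplexity` is a hypothetical helper name.

```python
import math

def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)

# 3.175 is the eval_loss reported in the summary of this walkthrough
print(round(perplexity(3.175), 1))
```

In the notebook, the same conversion could be applied to `evaluation_results['eval_loss']`.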

## Generate text

### Subtask:
Use the fine-tuned model to generate text based on a given prompt.


**Reasoning**:
Create a text generation pipeline, define a prompt, and generate text using the fine-tuned model.



In [None]:
from transformers import pipeline

# Create a text generation pipeline
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

# Define a prompt string
prompt = "This is a test"

# Generate text based on the prompt
generated_text = generator(prompt, max_length=50, num_return_sequences=1)

# Print the generated text
print(generated_text[0]['generated_text'])
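
The `max_length` argument counts the prompt tokens as well as the newly generated ones. The toy loop below illustrates that budget with greedy decoding in plain Python. It is a sketch under simplifying assumptions, not the pipeline's implementation: `greedy_generate` and `toy_model` are hypothetical, and the real pipeline may sample tokens rather than pick them deterministically.

```python
def greedy_generate(prompt_ids, next_token, max_length, eos_id=None):
    """Append next_token(ids) until max_length total tokens or eos_id."""
    ids = list(prompt_ids)
    while len(ids) < max_length:
        token = next_token(ids)
        ids.append(token)
        if token == eos_id:
            break
    return ids

# Hypothetical stand-in for a language model: always predicts last ID + 1
def toy_model(ids):
    return ids[-1] + 1

out = greedy_generate([5, 6, 7, 8], toy_model, max_length=10)
print(out, len(out))
```

With a four-token prompt and `max_length=10`, only six new tokens are produced, so a long prompt leaves less room for generated text.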

## Summary:

### Data Analysis Key Findings

*   The necessary libraries (`transformers` and `torch`) were successfully installed.
*   A custom dataset was successfully loaded and tokenized with the GPT-2 tokenizer, producing input IDs and attention masks as PyTorch tensors.
*   A pre-trained GPT-2 model (`gpt2`) was successfully loaded.
*   Training arguments were configured, including the output directory (`./results`), number of epochs (3), batch size (2), and logging/saving steps.
*   The GPT-2 model was successfully fine-tuned on the custom dataset using the configured `TrainingArguments`.
*   The model was evaluated using the training dataset, yielding an `eval_loss` of approximately 3.175.
*   The fine-tuned model successfully generated text based on a given prompt using a text generation pipeline.

### Insights or Next Steps

*   The model's performance should be evaluated on a separate validation or test dataset to get a more objective measure of its generalization capabilities.
*   Further fine-tuning experiments could be conducted by adjusting hyperparameters like the learning rate, batch size, or number of epochs to potentially improve the evaluation loss and generated text quality.
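
A held-out evaluation set, as suggested in the first point, could be produced by splitting the text lines before tokenization. The sketch below is one minimal way to do that in plain Python; `split_lines`, the 20% fraction, and the sample sentences are assumptions for illustration.

```python
import random

def split_lines(lines, eval_fraction=0.2, seed=0):
    """Shuffle lines and hold out a fraction (at least one) for evaluation."""
    if len(lines) < 2:
        raise ValueError("need at least two lines to split")
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    n_eval = min(n_eval, len(shuffled) - 1)  # keep at least one training line
    return shuffled[n_eval:], shuffled[:n_eval]

# Hypothetical sample data standing in for a real corpus
sentences = [f"Sentence number {i}." for i in range(10)]
train_lines, eval_lines = split_lines(sentences, eval_fraction=0.2, seed=0)
print(len(train_lines), len(eval_lines))
```

Each split would then be tokenized and wrapped in its own dataset, with the evaluation split passed to `trainer.evaluate` in place of the training data.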
