# **Text Generation with GPT-2**


## STEP 1: Install Required Libraries

In [1]:
!pip install transformers datasets --quiet

## STEP 2: Import Libraries & Disable wandb

In [None]:
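import os

# Disable Weights & Biases logging. Recent transformers versions deprecate this
# variable in favor of TrainingArguments(report_to="none").
os.environ["WANDB_DISABLED"] = "true"

from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    TextDataset,
    Trainer,
    TrainingArguments,
    pipeline,
)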

## STEP 3: Create Custom Dataset

In [None]:
poetic_data = """
In the heart of machines, silence speaks in binary rhythms.
Dreams once human now bloom in silicon petals.
Artificial minds don't sleep, they simulate eternity.
The code is poetry etched in electric pulses.
Through lines of logic, emotions begin to emerge.
"""

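# Repeat the five lines 50 times so the file is large enough to fill several training blocks.
with open("custom_data.txt", "w", encoding="utf-8") as f:
    for _ in range(50):
        f.write(poetic_data.strip() + "\n")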
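
Because the same five lines are written 50 times, the file holds 250 lines but only five distinct ones, so the model sees heavily repeated text. A minimal sanity check follows; it rebuilds the file in a temporary directory so that it runs on its own, and `summarize_corpus` is a hypothetical helper name, not part of the notebook.

```python
import os
import tempfile

poetic_data = """
In the heart of machines, silence speaks in binary rhythms.
Dreams once human now bloom in silicon petals.
Artificial minds don't sleep, they simulate eternity.
The code is poetry etched in electric pulses.
Through lines of logic, emotions begin to emerge.
"""


def summarize_corpus(path):
    """Return (total_lines, unique_lines) for the non-empty lines of a text file."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    return len(lines), len(set(lines))


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "custom_data.txt")
    # Same writing loop as in Step 3, pointed at a throwaway file.
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(50):
            f.write(poetic_data.strip() + "\n")
    total_lines, unique_lines = summarize_corpus(path)

print(total_lines, unique_lines)  # 250 5
```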

## STEP 4: Load Model and Tokenizer

In [None]:
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
# GPT-2 ships without a padding token, so reuse the end-of-text token for padding.
tokenizer.pad_token = tokenizer.eos_token

## STEP 5: Prepare Dataset

In [None]:
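# TextDataset is deprecated in recent transformers releases but still works for a small demo.
# The helper is named load_text_dataset to avoid confusion with datasets.load_dataset.
def load_text_dataset(file_path, block_size=128):
    # Tokenize the file and split it into fixed-length blocks of token IDs.
    return TextDataset(
        tokenizer=tokenizer,
        file_path=file_path,
        block_size=block_size,
    )

train_dataset = load_text_dataset("custom_data.txt")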
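
`TextDataset` tokenizes the whole file into one long sequence and cuts it into consecutive blocks of `block_size` tokens. As far as its current implementation goes, a trailing remainder shorter than one block appears to be dropped. The sketch below mimics that splitting on plain integers standing in for token IDs; `chunk_tokens` is a hypothetical helper, not a library function.

```python
def chunk_tokens(token_ids, block_size):
    """Split a flat list of token IDs into full, non-overlapping blocks."""
    return [
        token_ids[i : i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


# Ten stand-in "tokens" with a block size of 4: the last two tokens are dropped.
blocks = chunk_tokens(list(range(10)), block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With `block_size=128`, each training example is therefore a 128-token window that usually starts and ends mid-sentence.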



## STEP 6: Data Collator

In [None]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)
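
With `mlm=False` the collator prepares batches for causal language modeling: the labels are a copy of the input IDs, positions holding the pad token are set to `-100` so the loss ignores them, and the model shifts the labels by one position internally. Because this notebook sets `pad_token` to `eos_token`, any end-of-text tokens in a batch would be masked the same way. Below is a plain-Python sketch of that label construction. `make_causal_lm_labels` is a hypothetical helper, and the first three token IDs are arbitrary illustrative values.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input IDs into labels, masking padding positions out of the loss."""
    return [IGNORE_INDEX if token == pad_token_id else token for token in input_ids]


pad_id = 50256  # GPT-2's end-of-text ID, reused as the pad token in this notebook
input_ids = [818, 262, 2612, pad_id, pad_id]  # illustrative IDs followed by padding
labels = make_causal_lm_labels(input_ids, pad_id)
print(labels)  # [818, 262, 2612, -100, -100]
```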

## STEP 7: Training Arguments

In [None]:
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    save_steps=100,
    save_total_limit=1,
    prediction_loss_only=True,
    logging_steps=10,
    report_to="none",  # disable external loggers such as wandb
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


## STEP 8: Trainer Initialization

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

## STEP 9: Fine-tune the model

In [None]:
trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|------|---------------|
| 10   | 1.7476        |


TrainOutput(global_step=11, training_loss=1.7090576020154087, metrics={'train_runtime': 113.2718, 'train_samples_per_second': 0.194, 'train_steps_per_second': 0.097, 'total_flos': 1437106176000.0, 'train_loss': 1.7090576020154087, 'epoch': 1.0})
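
The numbers in `TrainOutput` can be cross-checked by hand. The reported throughput multiplied by the runtime gives roughly 22 training blocks, which at a batch size of 2 yields the 11 optimizer steps shown. Exponentiating the mean cross-entropy loss gives the training perplexity. The sketch uses only the values printed above and assumes a single device with no gradient accumulation.

```python
import math

# Values copied from the TrainOutput printed above.
train_runtime = 113.2718
samples_per_second = 0.194
batch_size = 2
training_loss = 1.7090576020154087

# Throughput times runtime recovers the approximate number of training blocks.
num_examples = round(train_runtime * samples_per_second)
# The last partial batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = math.exp(training_loss)

print(num_examples, steps_per_epoch, round(perplexity, 2))
```

A perplexity of about 5.5 on a corpus of five repeated sentences mostly reflects memorization, so it says little about how the model behaves on new prompts.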

## STEP 10: Save model and tokenizer

In [None]:
trainer.save_model("./gpt2-finetuned")
tokenizer.save_pretrained("./gpt2-finetuned")

('./gpt2-finetuned/tokenizer_config.json',
 './gpt2-finetuned/special_tokens_map.json',
 './gpt2-finetuned/vocab.json',
 './gpt2-finetuned/merges.txt',
 './gpt2-finetuned/added_tokens.json')

## STEP 11: Load and Generate Text

In [None]:
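# Load the fine-tuned weights from disk and reuse the tokenizer already in memory.
text_generator = pipeline("text-generation", model="./gpt2-finetuned", tokenizer=tokenizer)
prompt = "Artificial Intelligence is"
# Note: as the warning below reports, the default max_new_tokens (256) takes
# precedence over max_length here, so the output runs longer than 50 tokens.
output = text_generator(prompt, max_length=50, num_return_sequences=1)
print("\nGenerated Text:\n", output[0]['generated_text'])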

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



Generated Text:
 Artificial Intelligence is working on artificial intelligence that is becoming more and more powerful and complex, but it isn't going away anytime soon. Artificial intelligence is developing, and AI is not even dying.

The world is changing, and AI is adapting. AI is living in a new world.

Life is evolving faster than ever before, and AI is learning from it.

The world is changing, and AI is adapting. AI is living in a new world.

Life is evolving faster than ever before, and AI is learning from it.

The world is changing, and AI is adapting. AI is living in a new world.

Life is evolving faster than ever before, and AI is learning from it.

The world is changing, and AI is adapting. AI is living in a new world.

Life is evolving faster than ever before, and AI is learning from it.

The world is changing, and AI is adapting. AI is living in a new world.

Life is evolving faster than ever before, and AI is learning from it.

The world is changing, and AI is adapting. 
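
The sample above loops over the same two sentences and runs well past 50 tokens, because `max_new_tokens` overrode `max_length`. Repetition of this kind is common for small language models, especially after fine-tuning on a corpus that repeats itself. One common mitigation is to cap the length with `max_new_tokens` alone and add repetition constraints to the generation call. The settings below are illustrative assumptions, not tuned values, and `generate_text` is a hypothetical wrapper around the `text_generator` pipeline from Step 11.

```python
# Illustrative decoding settings; all keys are standard `generate` arguments.
GENERATION_KWARGS = {
    "max_new_tokens": 50,       # cap on newly generated tokens (replaces max_length)
    "do_sample": True,          # sample instead of always taking the top token
    "top_p": 0.95,              # nucleus sampling threshold
    "no_repeat_ngram_size": 3,  # forbid repeating any 3-gram verbatim
    "repetition_penalty": 1.2,  # down-weight tokens that were already generated
}


def generate_text(text_generator, prompt):
    """Call a text-generation pipeline with the settings above and return the text."""
    outputs = text_generator(prompt, num_return_sequences=1, **GENERATION_KWARGS)
    return outputs[0]["generated_text"]
```

In the notebook this would be called as `generate_text(text_generator, "Artificial Intelligence is")`. Results vary from run to run because sampling is enabled.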