In [None]:
# STEP 1: Install necessary libraries
!pip install transformers datasets --quiet

# STEP 2: Import libraries
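# Note: TextDataset is deprecated in recent transformers releases in favor of the datasets library
from transformers import (
    GPT2Tokenizer,
    GPT2LMHeadModel,
    TextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)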
import torch

# STEP 3: Load tokenizer and model
model_name = "gpt2"  # You can use 'gpt2-medium', 'gpt2-large' etc.
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

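# STEP 4: Prepare your custom dataset
# Create a .txt file with your training data and upload it to Colab
from google.colab import files
uploaded = files.upload()  # Upload your text file (e.g. custom_data.txt)

# Colab saves repeat uploads under a new name (e.g. "custom_data (3).txt"),
# so write the freshly uploaded bytes to custom_data.txt explicitly
with open("custom_data.txt", "wb") as f:
    f.write(next(iter(uploaded.values())))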

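# STEP 5: Create dataset object
def load_dataset(file_path, tokenizer, block_size=128):
    # TextDataset tokenizes the whole file and cuts it into fixed-length blocks
    return TextDataset(
        tokenizer=tokenizer,
        file_path=file_path,
        block_size=block_size
    )

# block_size=16 gives very short training examples; raise it if the file is large enough
dataset = load_dataset("custom_data.txt", tokenizer, block_size=16)
# mlm=False selects the causal (next-token) language-modeling objective used by GPT-2
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)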

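# STEP 6: Set training arguments
# Add report_to="none" below to skip the Weights & Biases login prompt
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    save_steps=500,        # write a checkpoint every 500 optimizer steps
    save_total_limit=1,    # keep only the most recent checkpoint
    prediction_loss_only=True,
    logging_steps=100,     # log training loss every 100 steps; shorter runs log nothing
)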

# STEP 7: Create Trainer and fine-tune
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,
)

trainer.train()

# STEP 8: Save the fine-tuned model
trainer.save_model("./gpt2-finetuned")
tokenizer.save_pretrained("./gpt2-finetuned")


Saving custom_data.txt to custom_data (3).txt





[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mkavyasingh93352[0m ([33mkavyasingh933524[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss


('./gpt2-finetuned/tokenizer_config.json',
 './gpt2-finetuned/special_tokens_map.json',
 './gpt2-finetuned/vocab.json',
 './gpt2-finetuned/merges.txt',
 './gpt2-finetuned/added_tokens.json')
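
The `Step,Training Loss` table above has no rows. One plausible explanation is that the run finished in fewer than `logging_steps=100` optimizer steps. The sketch below works through the arithmetic under stated assumptions: that `TextDataset` cuts the token stream into full blocks and drops the trailing partial block, and that `Trainer` runs on a single device with no gradient accumulation. The helper names and the 1000-token corpus size are hypothetical, chosen only for illustration.

```python
import math


def count_blocks(num_tokens: int, block_size: int) -> int:
    """Number of full fixed-length blocks; a trailing partial block is dropped."""
    return num_tokens // block_size


def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Steps per epoch (last partial batch kept) times the number of epochs."""
    return math.ceil(num_examples / batch_size) * epochs


# Hypothetical corpus size; the real token count of custom_data.txt is not known here.
num_tokens = 1000

blocks = count_blocks(num_tokens, block_size=16)
steps = total_optimizer_steps(blocks, batch_size=2, epochs=3)

# With logging_steps=100, a loss row appears only once 100 steps have run.
print(blocks, steps, steps >= 100)  # prints: 62 93 False
```

Under these assumptions, a corpus of about 1000 tokens trains for 93 steps and never reaches the first logging point. Lowering `logging_steps` (for example to 10) or training on more text would make loss rows appear.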

In [None]:
# Load the fine-tuned model
from transformers import pipeline

generator = pipeline('text-generation', model='./gpt2-finetuned', tokenizer='./gpt2-finetuned')

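# Generate text from a prompt
prompt = "Once upon a time"
# max_length counts prompt plus generated tokens; if the generation config also sets
# max_new_tokens, that value takes precedence (see the warning in the output)
output = generator(prompt, max_length=100, num_return_sequences=1)
print(output[0]['generated_text'])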


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Once upon a time, humans were the only species that could survive without a human, but the one to survive was the mighty one called the Avatar. As he traveled the world, he sought to find the Avatar and his people, and he found that he had found his true calling…

…because when the Avatar awoke, the only thing he could do was make his way to the Avatar's home world. He could not leave his body, so the Avatar awoke to find him in the streets of his world. The Avatar was there, but he couldn't leave. He had to save the Avatar from himself…and save his friends.

The Avatar awoke to find him in the streets of his world. He could not leave his body, so the Avatar awoke to find him in the streets of his world. The Avatar was there, but he couldn't leave. He had to save the Avatar from himself…and save his friends. The Avatar awoke to find him in the streets of his world. He could not leave his body, so the Avatar awoke to find him in the streets of his world. The Avatar was there, but he cou
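
The generated text above falls into a loop, repeating the same sentences. A common mitigation is n-gram blocking: before choosing the next token, forbid any token that would complete an n-gram already present in the text. The sketch below illustrates the idea in plain Python on whitespace-split words. It is not the transformers implementation, and the function name `banned_next_tokens` is hypothetical.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if n < 1 or len(tokens) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - n + 1:])
    banned = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        # If an earlier n-gram starts with the same prefix, its last token is banned.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


# Words taken from the looping output; real models operate on token ids instead.
words = ["the", "Avatar", "awoke", "to", "find", "him", "the", "Avatar", "awoke"]
print(banned_next_tokens(words, 3))  # prints: {'to'}
```

In the notebook's own call, the equivalent would be passing `no_repeat_ngram_size=3` to `generator(...)`, along with `max_new_tokens=100` in place of `max_length=100` so the length limit is honored. Both are standard generation arguments, though the best values for this fine-tuned model are untested here.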