<a href="https://colab.research.google.com/github/Abhishek-harsha/Abhishek-N/blob/main/train_txt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
pip install transformers datasets




In [8]:
	from transformers import GPT2Tokenizer, GPT2LMHeadModel, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import load_dataset, Dataset
import os

# Load tokenizer and model
model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# IMPORTANT: GPT-2 doesn't have a pad token by default, set it to eos_token
tokenizer.pad_token = tokenizer.eos_token
model.resize_token_embeddings(len(tokenizer))

# Load raw text file and convert to Hugging Face dataset
def load_custom_dataset(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        lines = f.read()
    return Dataset.from_dict({"text": [lines]})

raw_dataset = load_custom_dataset("train.txt")

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], return_special_tokens_mask=True, truncation=True, padding="max_length", max_length=128)

tokenized_dataset = raw_dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    save_steps=500,
    save_total_limit=2,
    prediction_loss_only=True,
    logging_dir="./logs",
    logging_steps=50,
    report_to="none",
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

# Train
trainer.train()

# Save model and tokenizer
trainer.save_model("./gpt2-finetuned")
tokenizer.save_pretrained("./gpt2-finetuned")


Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Step,Training Loss


('./gpt2-finetuned/tokenizer_config.json',
 './gpt2-finetuned/special_tokens_map.json',
 './gpt2-finetuned/vocab.json',
 './gpt2-finetuned/merges.txt',
 './gpt2-finetuned/added_tokens.json')

In [6]:
from transformers import pipeline
generator = pipeline('text-generation', model='./gpt2-finetuned', tokenizer='./gpt2-finetuned')
prompt = "Once upon a time"
output = generator(prompt, max_length=100, num_return_sequences=1)

print(output[0]['generated_text'])

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Once upon a time, the game would be a small game with small rules to learn from and few to master.

The game is very easy and there is a lot of replay value in it.

The players make use of a lot of small game building and some basic rules as well.

The game is the beginning of a great adventure.

The game will teach you about all the areas and the game will give you a lot of information about how to build the game.

The game is a challenging game.

There are some interesting features to the game.

The game uses a lot of information and some of it is not very interesting.

The game has a lot of rules and it is very easy to learn.

The game is a great game.

I don't have a lot of experience with video games, but this game has a lot of lessons to learn.

The rules are a good way to learn the game.

The game is a fun game.

The game is a good resource.

The game helps you to develop your character.

The game is a good way to learn the game.

The game is a great way to learn the game.


