<a href="https://colab.research.google.com/github/SumeshSurendran12/ITAI-3377-A.I.-at-the-Edge-IIOT-Env/blob/main/GPT2_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers datasets



In [2]:
# Create a tiny training corpus
with open("tiny_corpus.txt", "w") as f:
    f.write("AI is transforming the world.\n")
    f.write("Machine learning enables computers to learn from data.\n")
    f.write("Natural language processing helps machines understand human language.\n")

In [9]:
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

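# Load the pretrained GPT-2 tokenizer (only the vocabulary is reused, not the model weights)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 defines no pad token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token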

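# Create a small GPT-2 config (the model built from it starts with random weights)
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    n_positions=128,
    n_ctx=128,  # Legacy field; recent transformers releases size the context from n_positions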
    n_embd=128,
    n_layer=2,
    n_head=2
)

# Initialize model
model = GPT2LMHeadModel(config)

# Manually load the dataset from the text file
with open("tiny_corpus.txt", "r") as f:
    text_data = f.readlines()

# Create a Dataset object from the loaded data
dataset = Dataset.from_dict({"text": text_data})

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./mini-gpt2",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
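    save_steps=1,  # Checkpoint after every optimizer step (the run is only a few steps long)
    save_total_limit=2,  # Keep only the two most recent checkpoints
    logging_steps=100,  # Longer than this run, so no loss rows are logged; lower it to see the loss
    report_to="none",  # Disable external experiment trackers such as Weights & Biases
    save_on_each_node=True  # Only relevant to multi-node training; no effect on a single machine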
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets
)

# Train
trainer.train()

# Save model
trainer.save_model("./mini-gpt2")
tokenizer.save_pretrained("./mini-gpt2")

Map:   0%|          | 0/3 [00:00<?, ? examples/s]

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss




('./mini-gpt2/tokenizer_config.json',
 './mini-gpt2/special_tokens_map.json',
 './mini-gpt2/vocab.json',
 './mini-gpt2/merges.txt',
 './mini-gpt2/added_tokens.json')
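
Whether a `logging_steps` or `save_steps` interval ever fires depends on how many optimizer steps the run takes. The sketch below works through that arithmetic for this run (3 examples, batch size 4, 3 epochs). The helper names `total_optimizer_steps` and `logged_steps` are made up for illustration, and the calculation assumes a single device, no gradient accumulation, and that the last partial batch is kept.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a single-device run without gradient accumulation."""
    batches_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch is kept
    return batches_per_epoch * epochs

def logged_steps(total_steps, logging_steps):
    """Steps at which a fixed logging interval would emit a loss row."""
    return [s for s in range(1, total_steps + 1) if s % logging_steps == 0]

steps = total_optimizer_steps(num_examples=3, batch_size=4, epochs=3)
print(steps)                     # 3
print(logged_steps(steps, 100))  # []
print(logged_steps(steps, 1))    # [1, 2, 3]
```

Three steps in total is consistent with the `checkpoint-3` directory that shows up when the output folder is listed in the generation cell.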

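To get a feel for how small the model defined by `GPT2Config` in the training cell is, its parameter count can be estimated by hand. This is a rough sketch, assuming the standard GPT-2 block layout (4x MLP expansion, biases on all linear layers) and an output head that shares weights with the token embedding. The function name `gpt2_param_count` is hypothetical, and `50257` is the GPT-2 tokenizer's vocabulary size.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Estimate trainable parameters of a GPT-2 style model with a tied output head."""
    d = n_embd
    token_emb = vocab_size * d       # token embedding (shared with the LM head)
    pos_emb = n_positions * d        # learned position embedding
    # Per block: attention (4*d*d weights) + MLP (8*d*d weights),
    # plus 9*d biases and 4*d layer-norm parameters
    per_block = 12 * d * d + 13 * d
    final_ln = 2 * d                 # final layer norm (weight and bias)
    return token_emb + pos_emb + n_layer * per_block + final_ln

total = gpt2_param_count(vocab_size=50257, n_positions=128, n_embd=128, n_layer=2)
embedding_share = (50257 * 128) / total
print(total)                       # 6846080
print(round(embedding_share, 2))   # 0.94
```

Under these assumptions about 94% of the parameters sit in the token embedding table, so the two transformer blocks themselves are very small. The number of heads does not enter the count because the heads split the same projection matrices.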
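Now that the model is trained, you can load it and the tokenizer from `./mini-gpt2` to generate text. The model started from random weights and saw only three sentences, so expect incoherent output.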

In [12]:
import os
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the trained model and tokenizer (the model was trained from scratch, not fine-tuned)
model_path = "./mini-gpt2"

# Check if the directory and files exist
if not os.path.exists(model_path):
    print(f"Error: Directory {model_path} not found.")
elif not os.listdir(model_path):
    print(f"Error: Directory {model_path} is empty.")
else:
    print(f"Files in {model_path}: {os.listdir(model_path)}")

    # Load model and tokenizer
    model = GPT2LMHeadModel.from_pretrained(model_path)
    tokenizer = GPT2Tokenizer.from_pretrained(model_path)
    tokenizer.pad_token = tokenizer.eos_token

    # Move model to CPU
    device = torch.device("cpu")
    model.to(device)

    # Generate text
    prompt = "AI is"
    # Calling the tokenizer directly returns both input_ids and attention_mask
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate text using model.generate
    # Pass the attention mask explicitly: pad and EOS share an id, so it cannot be inferred
    output = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=20,  # Limit total length (prompt plus new tokens) for the tiny model
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.eos_token_id
    )

    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(generated_text)


Files in ./mini-gpt2: ['checkpoint-3', 'checkpoint-2', 'generation_config.json', 'merges.txt', 'vocab.json', 'config.json', 'training_args.bin', 'special_tokens_map.json', 'runs', 'tokenizer_config.json', 'model.safetensors']
AI is protagonist protagonist blue blue Factor Factor unfocused unfocused Peter Peter portfolios portfolios Research ResearchOfficeOfficescientscient
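
The sample repeats each token twice but never three times in a row. That pattern is consistent with `no_repeat_ngram_size=2`, which prevents any bigram from occurring twice: "blue blue" may appear once, but "blue blue blue" would contain the bigram (blue, blue) twice. The sketch below checks this on whitespace-split words from the output. It is a simplification, since generation applies the constraint to token ids rather than words, and `repeated_ngrams` is a hypothetical helper name.

```python
def repeated_ngrams(tokens, n):
    """Return the set of n-grams that occur more than once in tokens."""
    seen, repeated = set(), set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            repeated.add(gram)
        seen.add(gram)
    return repeated

# First part of the generated continuation, split on whitespace
words = "protagonist protagonist blue blue Factor Factor unfocused unfocused".split()
print(repeated_ngrams(words, 2))                     # set()
print(repeated_ngrams(["blue", "blue", "blue"], 2))  # {('blue', 'blue')}
```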
