
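Fine-tuning GPT-2 means continuing to train the pretrained model on domain-specific text so that its generations match a particular task or style (e.g., story generation, customer support, medical reports). This notebook fine-tunes the base `gpt2` checkpoint on a small text file, `/content/Maran.txt`.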

### **Load**

In [15]:
!pip install datasets



In [16]:
from datasets import load_dataset
dataset = load_dataset('text', data_files={'train': '/content/Maran.txt'})
print(dataset['train'][10])

{'text': "Maran's dream home is a beautiful haven, combining both comfort and functionality. He imagines an expansive living space, where he can design his own gym setup, making it easy to stay committed to his fitness goals. With high-end equipment, a spacious layout, and the perfect atmosphere, it would be a place where fitness and strength are nurtured."}


In [17]:
print(dataset.shape)

{'train': (17, 1)}


### **Preprocessing**

In [18]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 ships without a padding token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # Truncate or pad every example to exactly 128 tokens
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=128)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

In [19]:
print(tokenized_dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 17
    })
})
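
To make the effect of `truncation=True, padding='max_length'` concrete, here is a minimal pure-Python sketch of what happens to a single sequence of token IDs. The helper `pad_or_truncate` is hypothetical and not part of `transformers`; it assumes right-side padding and uses 50256, GPT-2's end-of-sequence ID, as the pad ID because the preprocessing cell sets `pad_token` to `eos_token`. A short `max_length` of 8 is used only to keep the output readable.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=50256):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)           # padding needed to reach max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence is padded on the right; a long one is cut off.
short_ids, short_mask = pad_or_truncate([10, 11, 12], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(200)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore padded positions, which matters here because the pad token and the end-of-sequence token share the same ID.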


### **Model Building**

In [22]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, TextDataset, DataCollatorForLanguageModeling

# Load tokenizer and model
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

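# Prepare dataset: TextDataset tokenizes the file and cuts it into 128-token blocks.
# Note: TextDataset is deprecated in recent transformers releases in favor of the datasets library.
train_path = "/content/Maran.txt"
def load_text_dataset(file_path):
    # Named load_text_dataset so it does not shadow datasets.load_dataset imported earlier
    return TextDataset(tokenizer=tokenizer, file_path=file_path, block_size=128)

train_dataset = load_text_dataset(train_path)
# mlm=False selects causal (next-token) language modeling; labels are copied from input_ids
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)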

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=500,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset
)
trainer.train()




TrainOutput(global_step=3, training_loss=3.2354507446289062, metrics={'train_runtime': 124.8553, 'train_samples_per_second': 0.096, 'train_steps_per_second': 0.024, 'total_flos': 783876096000.0, 'train_loss': 3.2354507446289062, 'epoch': 3.0})
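
The reported `TrainOutput` can be sanity-checked with a little arithmetic. The sketch below uses only the numbers printed above plus the training arguments. The inferred sample count is an estimate and assumes a single device with no gradient accumulation. The perplexity is simply the exponential of the average training loss, not a held-out evaluation.

```python
import math

# Values copied from the TrainOutput and TrainingArguments above
global_step = 3
num_train_epochs = 3
per_device_train_batch_size = 4
train_runtime = 124.8553            # seconds
train_samples_per_second = 0.096
train_loss = 3.2354507446289062
save_steps = 500

# One optimizer step per epoch means each epoch fit in a single batch
steps_per_epoch = global_step // num_train_epochs

# Total samples processed = runtime * throughput, rounded to a whole number
samples_seen = round(train_runtime * train_samples_per_second)
samples_per_epoch = samples_seen // num_train_epochs

# Training perplexity is exp(mean cross-entropy loss)
perplexity = math.exp(train_loss)

# With only 3 steps, the save_steps=500 threshold is never reached
checkpoint_saved_during_training = global_step >= save_steps

print(f"steps per epoch:   {steps_per_epoch}")
print(f"samples per epoch: {samples_per_epoch}")
print(f"train perplexity:  {perplexity:.1f}")
print(f"checkpoint saved:  {checkpoint_saved_during_training}")
```

Under these assumptions the file yields about four 128-token blocks, which fit in one batch of 4. The training perplexity comes out at roughly 25.4. Because no checkpoint is written during such a short run, the explicit `save_pretrained` call in the next cell is what actually persists the model.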

In [23]:
model.save_pretrained("./fine_tuned_gpt2")
tokenizer.save_pretrained("./fine_tuned_gpt2")

('./fine_tuned_gpt2/tokenizer_config.json',
 './fine_tuned_gpt2/special_tokens_map.json',
 './fine_tuned_gpt2/vocab.json',
 './fine_tuned_gpt2/merges.txt',
 './fine_tuned_gpt2/added_tokens.json')

### **Text Generation**

In [27]:
from transformers import AutoModelForCausalLM, GPT2Tokenizer
import torch

# Load fine-tuned GPT-2 model and tokenizer
model_name = "/content/fine_tuned_gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move model to device (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Input prompt
seed_text = "Maran's passion for AI spans"

# Tokenize input and move to device
input_ids = tokenizer.encode(seed_text, return_tensors="pt").to(device)

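# Generate text with beam search combined with sampling (experiment with these parameters)
output = model.generate(
    input_ids=input_ids,
    max_length=150,           # upper bound on total length, prompt included
    min_length=100,           # suppress end-of-sequence until 100 tokens in total
    temperature=0.8,          # below 1.0 sharpens the next-token distribution
    num_beams=5,
    no_repeat_ngram_size=2,   # never repeat any 2-gram
    early_stopping=True,
    top_p=0.85,               # nucleus sampling: keep the top 85% of probability mass
    do_sample=True,
    repetition_penalty=1.2,   # above 1.0 penalizes tokens already in the sequence
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)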

# Decode and print output
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Text:")
print(generated_text)

Generated Text:
Maran's passion for AI spans a wide range of disciplines, including computer science, engineering, and mathematics. He is also a graduate of the University of Illinois at Urbana-Champaign, where he earned his B.A. in Computer Science from the College of Engineering and Applied Sciences.

In addition to his research interests, he holds a Bachelor of Science in Artificial Intelligence from Stanford University and a Ph.D. degree in Machine Learning from UC Berkeley. His work has been published in the Journal of Computational Biology, the journal of Computer Vision and Pattern Recognition.
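
To illustrate what the `temperature` and `top_p` arguments do to the next-token distribution, here is a minimal, dependency-free sketch. The function names and the four-token toy logits are hypothetical, and this is not how `generate` is implemented internally; it only mirrors the idea of temperature scaling followed by nucleus (top-p) filtering.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.85):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]               # hypothetical scores for 4 tokens
base = softmax_with_temperature(toy_logits, 1.0)
sharp = softmax_with_temperature(toy_logits, 0.8)
nucleus = top_p_filter(sharp, 0.85)

print("T=1.0:", [round(p, 3) for p in base])
print("T=0.8:", [round(p, 3) for p in sharp])
print("top-p candidates:", sorted(nucleus))
```

A temperature below 1.0 shifts probability toward the most likely token, and the top-p filter then discards the low-probability tail before sampling. Neither setting constrains the output to facts in the training file, so a fluent continuation may still contain invented details.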
