## Fine-Tuning GPT-2

GPT-2 is already good at generating language. But suppose you want it to:

- Write poems in the style of Nepali literature
- Answer legal or medical questions accurately
- Generate code comments for your Python functions

Then, fine-tuning helps it learn your style or domain.

During fine-tuning:

- You give it your data (e.g., `train.txt`) containing examples.
- It adjusts its weights on your data (via backpropagation).
- Afterward, it is better at the kind of task your data demonstrates.
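
As a rough illustration of the weight-adjustment step, here is a minimal sketch of gradient descent on a single-parameter model. It is a toy stand-in, not GPT-2 training: the function name `fine_tune_weight`, the learning rate, and the example data are all assumptions made for illustration.

```python
def fine_tune_weight(w, examples, lr=0.1, epochs=50):
    """Nudge a 'pretrained' weight w toward data of the form y = w * x."""
    for _ in range(epochs):
        for x, y in examples:
            pred = w * x
            grad = 2 * (pred - y) * x  # derivative of (w*x - y)^2 with respect to w
            w -= lr * grad             # gradient descent update
    return w


pretrained_w = 1.0                           # hypothetical starting weight
domain_examples = [(1.0, 3.0), (2.0, 6.0)]   # hypothetical data generated with w = 3
tuned_w = fine_tune_weight(pretrained_w, domain_examples)
print(tuned_w)
```

The weight starts from its "pretrained" value and drifts toward whatever the new examples imply, which is the same idea fine-tuning applies to GPT-2's many parameters.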

# Install the required packages

In [1]:
!pip install transformers






In [2]:
!pip install -U PyPDF2
!pip install python-docx

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1





Collecting python-docx
  Downloading python_docx-1.2.0-py3-none-any.whl.metadata (2.0 kB)
Downloading python_docx-1.2.0-py3-none-any.whl (252 kB)
Installing collected packages: python-docx
Successfully installed python-docx-1.2.0





# Import libraries

In [3]:
import pandas as pd
import numpy as np
import re
from PyPDF2 import PdfReader
import os
import docx

In [4]:
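import os

# "/content/" is the default working directory on Google Colab. On a local
# Windows machine it does not exist, and os.listdir raised:
#   FileNotFoundError: [WinError 3] The system cannot find the path specified: '/content/'
# Fall back to the current directory when the Colab path is missing.
data_dir = "/content/" if os.path.isdir("/content/") else "."
print(os.listdir(data_dir))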

# Data Preprocessing

In [None]:
with open('train.txt', 'r', encoding='utf-8') as file:
    content = file.read()


print(content[:200])
text_data = re.sub(r'\n+', '\n', content).strip()  # Remove excess newline characters


Static routing is a technique in computer networks where routes are manually configured by the network administrator. Unlike dynamic routing, which adapts to network changes automatically, static rout
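
The newline cleanup in the cell above can be tried on its own. This is a minimal sketch; the helper name `collapse_newlines` and the sample string are assumptions, while the regular expression is the one used in the notebook.

```python
import re


def collapse_newlines(text):
    """Replace each run of newlines with a single newline and trim both ends."""
    return re.sub(r'\n+', '\n', text).strip()


sample = "Static routing\n\n\nDynamic routing\n"
cleaned = collapse_newlines(sample)
print(repr(cleaned))
```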


What we will do:
1. Load a GPT-2 tokenizer and model.
2. Load your custom training dataset.
3. Prepare the model and training arguments.
4. Train the GPT-2 model on your dataset.
5. Save the trained model and tokenizer.

# Libraries for GPT-2 fine-tuning

In [None]:
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments

In [None]:
def load_dataset(file_path, tokenizer, block_size = 64):
    dataset = TextDataset(
        tokenizer = tokenizer,
        file_path = file_path,
        block_size = block_size,
    )
    return dataset
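
To make `block_size` concrete: `TextDataset` (marked as deprecated in recent `transformers` releases) tokenizes the whole file and, roughly speaking, cuts the token stream into consecutive blocks of `block_size` tokens, dropping any shorter remainder. The sketch below imitates that chunking on a plain list of integers; `chunk_token_ids` is a hypothetical helper, not part of the library.

```python
def chunk_token_ids(token_ids, block_size=64):
    """Split a token-id list into consecutive full blocks; drop the short tail."""
    blocks = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks


toy_ids = list(range(10))              # stand-in for real tokenizer output
blocks = chunk_token_ids(toy_ids, block_size=4)
print(blocks)
```

Each block becomes one training example, so a small file with a large `block_size` yields very few examples.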

In [None]:
def load_data_collator(tokenizer, mlm = False):
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=mlm,
    )
    return data_collator
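
With `mlm=False` the collator prepares batches for causal language modeling: the labels are a copy of the input ids, and padding positions are replaced by `-100` so the loss ignores them. The following pure-Python sketch imitates that behavior under the assumption that masking is done by token id; `collate_causal_lm` and `IGNORE_INDEX` are illustrative names, not library API.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def collate_causal_lm(batch, pad_id):
    """Pad sequences to equal length and build labels for next-token training."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # Labels mirror the inputs; any token equal to pad_id is excluded from the loss
    labels = [[IGNORE_INDEX if tok == pad_id else tok for tok in seq]
              for seq in input_ids]
    return {"input_ids": input_ids, "labels": labels}


toy_batch = collate_causal_lm([[1, 2, 3], [4]], pad_id=0)
print(toy_batch)
```

One consequence of masking by id: when the pad token is set to the EOS token, as in this notebook, genuine EOS tokens would be excluded from the loss as well.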

In [None]:
def train(train_file_path, model_name,
          output_dir,
          overwrite_output_dir,
          per_device_train_batch_size,
          num_train_epochs,
          save_steps):
    # GPT-2 ships without a pad token, so reuse the end-of-sequence token
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # fix: avoid attention_mask warning

    train_dataset = load_dataset(train_file_path, tokenizer)
    data_collator = load_data_collator(tokenizer)

    # Save the tokenizer files to output_dir
    tokenizer.save_pretrained(output_dir)

    model = GPT2LMHeadModel.from_pretrained(model_name)

    # This saves the pretrained (not yet fine-tuned) weights;
    # trainer.save_model() below writes the fine-tuned weights to the same directory
    model.save_pretrained(output_dir)

    # NOTE: save_steps is accepted but not forwarded to TrainingArguments,
    # so the Trainer's default checkpoint interval (every 500 steps) applies.
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=overwrite_output_dir,
        per_device_train_batch_size=per_device_train_batch_size,
        num_train_epochs=num_train_epochs,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
    )

    trainer.train()
    trainer.save_model()

In [None]:

train_file_path = "/content/train.txt"
model_name = 'gpt2'
output_dir = '/content/'
overwrite_output_dir = False
per_device_train_batch_size = 8
num_train_epochs = 1000
save_steps = 50000
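
These settings determine how many optimizer steps the run takes and where checkpoints land. The sketch below estimates both, assuming a single device, no gradient accumulation, and a final partial batch that is kept. The dataset size of 12 blocks and the checkpoint interval of 500 steps are hypothetical values chosen for illustration.

```python
import math


def training_schedule(num_examples, batch_size, num_epochs, save_steps):
    """Estimate steps per epoch, total steps, and the steps at which checkpoints are saved."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    checkpoints = list(range(save_steps, total_steps + 1, save_steps))
    return steps_per_epoch, total_steps, checkpoints


# Hypothetical dataset of 12 blocks, batch size 8, 1000 epochs, checkpoint every 500 steps
steps_per_epoch, total_steps, checkpoints = training_schedule(12, 8, 1000, 500)
print(steps_per_epoch, total_steps, checkpoints)
```

A calculation like this helps predict which `checkpoint-N` directories will exist before starting a long run.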

In [None]:
# Train
train(
    train_file_path=train_file_path,
    model_name=model_name,
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps
)



| Step | Training Loss |
|------|---------------|
| 500  | 0.0944        |
| 1000 | 0.0087        |
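
One way to read these numbers is to convert them to perplexity, the exponential of the mean per-token cross-entropy loss. A value close to 1 means the model predicts its training tokens almost perfectly, which on a small dataset can point to memorization rather than generalization. The helper name `perplexity` is an assumption; the loss values are the ones logged above.

```python
import math


def perplexity(cross_entropy_loss):
    """Convert a mean per-token cross-entropy loss (natural log) to perplexity."""
    return math.exp(cross_entropy_loss)


for step, loss in [(500, 0.0944), (1000, 0.0087)]:
    print(step, round(perplexity(loss), 4))
```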


# Inference

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [None]:
def load_model(model_path):
    model = GPT2LMHeadModel.from_pretrained(model_path)
    return model


def load_tokenizer(tokenizer_path):
    tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
    return tokenizer

def generate_text(model_path, sequence, max_length):

    model = load_model(model_path)
    tokenizer = load_tokenizer(model_path)
    ids = tokenizer.encode(f'{sequence}', return_tensors='pt')
    final_outputs = model.generate(
        ids,
        do_sample=True,
        max_length=max_length,
        pad_token_id=model.config.eos_token_id,
        top_k=50,
        top_p=0.95,
    )
    print(tokenizer.decode(final_outputs[0], skip_special_tokens=True))
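
The `top_k` and `top_p` arguments restrict which tokens can be sampled at each step. The sketch below shows the idea on a toy probability table: keep the `top_k` most likely tokens, then keep the smallest prefix whose cumulative probability reaches `top_p`, and renormalize. It is a simplified illustration, not the library's implementation; the function names and the toy vocabulary are assumptions.

```python
import random


def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Apply top-k, then nucleus (top-p) filtering to a token->probability dict."""
    # Top-k: keep only the k most likely tokens and renormalize
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def sample_token(probs, top_k=50, top_p=0.95, rng=random):
    """Draw one token from the filtered distribution."""
    filtered = top_k_top_p_filter(probs, top_k, top_p)
    tokens = list(filtered)
    weights = [filtered[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


toy_probs = {"router": 0.5, "switch": 0.3, "hub": 0.15, "modem": 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(filtered)
print(sample_token(toy_probs, top_k=3, top_p=0.8))
```

Lower values of either parameter make the output more predictable; higher values allow more varied but riskier continuations.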

This model was trained on the entire raw text and took much longer to train, yet it fails to give meaningful results.

In [None]:
model1_path = "/content/checkpoint-1000"
sequence1 = "[Question] what is router?"
max_len = 50
generate_text(model1_path, sequence1, max_len)

[Question] what is router? IPv4 uses 32-bit addresses, while IPv6 uses 128-bit addresses to accommodate the growing number of devices on the internet.

Dynamic routing protocols like OSPF (Open Shortest Path First)


The following model was trained on 100 question-and-answer pairs derived from the original text, and training took only a few seconds (50 epochs). It gives very meaningful results.

In [None]:
model1_path = "/content/checkpoint-1000"
sequence1 = "[Question] what faculty does hcoe provide?"
max_len = 100
generate_text(model1_path, sequence1, max_len)

[Question] what faculty does hcoe provide?

The HEC-10 is a dynamic updateable data transmission network that ensures reliable data transfer across interconnected networks. It improves efficiency in data transmission and enhances security by isolating segments of the network.

The Transmission Control Protocol (TCP) is a connection-oriented protocol that ensures reliable data transfer across interconnected networks. It improves efficiency in data transmission across interconnected networks and enhances security by isolating segments of the network.

The Transmission
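
A question-and-answer training file like the one described above could be prepared along these lines. This is a sketch under stated assumptions: the `[Question]` tag matches the prompts used in this notebook, but the `[Answer]` tag, the blank-line separator, and the helper name `format_qa_pairs` are hypothetical, since the notebook does not show the actual file format.

```python
def format_qa_pairs(pairs, question_tag="[Question]", answer_tag="[Answer]"):
    """Turn (question, answer) tuples into tagged training text, one pair per block."""
    blocks = []
    for question, answer in pairs:
        blocks.append(f"{question_tag} {question.strip()}\n{answer_tag} {answer.strip()}")
    return "\n\n".join(blocks)


qa_pairs = [
    ("What is static routing?",
     "Static routing is a technique in computer networks where routes are "
     "manually configured by the network administrator."),
]
training_text = format_qa_pairs(qa_pairs)
print(training_text)
```

Using the same tags at inference time as in the training file gives the model a consistent cue for where an answer should begin.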


In [None]:
model1_path = "/content/checkpoint-1500"
sequence1 = "[Question] what is the full form of hcoe?"
max_len = 50
generate_text(model1_path, sequence1, max_len)

In [None]:
model1_path = "/content/checkpoint-1500"
sequence1 = "[Question] who is the principal of himalayan college of engineering?"
max_len = 50
generate_text(model1_path, sequence1, max_len)

In [None]:
!mv checkpoint-* drive/MyDrive/gen-ai-training-dataset/gpt2-finetune-for-300epoch/
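
Since the inference cells refer to checkpoints by step number, a small helper can pick the most recent one. Plain string sorting would rank `checkpoint-500` after `checkpoint-1500`, so the sketch below compares the numeric suffix instead. The helper name `latest_checkpoint` is an assumption, and the directory names in the example are illustrative.

```python
import re


def latest_checkpoint(dir_names):
    """Return the checkpoint-N name with the largest N, or None if there is none."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    found = []
    for name in dir_names:
        match = pattern.match(name)
        if match:
            found.append((int(match.group(1)), name))
    return max(found)[1] if found else None


names = ["checkpoint-500", "checkpoint-1500", "checkpoint-1000", "runs"]
print(latest_checkpoint(names))
```

In practice the list of names would come from something like `os.listdir` on the directory that holds the checkpoints.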