# Fine-Tuning with GPT-Neo

Data Handling: The code fetches 100 IMDb reviews and their labels over HTTP, tokenizes the text, and converts the result into a dataset.
Model Preparation: Loads a GPT-Neo model configured for sequence classification and adapts it for handling padded inputs.
Training: Sets up training arguments and uses the Trainer class to fine-tune the model on the IMDb dataset.
Saving: The fine-tuned model and tokenizer are saved for later deployment or further use.

In [1]:
!pip install transformers datasets peft accelerate

from IPython.display import clear_output
clear_output()

AutoTokenizer: This is used to tokenize the text data. In this case, it's specifically for GPT-Neo.
GPTNeoForSequenceClassification: This loads the GPT-Neo model configured for sequence classification tasks.
Trainer & TrainingArguments: These are utilities provided by the Hugging Face library to help with model training.
Dataset: A utility from the datasets library for handling and processing data.
requests: A Python library used to make HTTP requests, like fetching data from a URL.
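The URL points to the Hugging Face datasets-server API and requests the first 100 rows (offset=0, length=100) of the IMDb train split.
requests.get(url) sends a GET request to the URL; raise_for_status() stops early if the server returns an HTTP error.
response.json() parses the returned JSON into a Python dictionary whose 'rows' key holds the examples.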

In [5]:
import requests
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    GPTNeoForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Fetch the first 100 rows of the IMDb train split from the datasets-server API
url = "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp%2Fimdb&config=plain_text&split=train&offset=0&length=100"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error body
data = response.json()
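
The hard-coded URL packs several query parameters into one string. As a minimal sketch, the same URL can be assembled from its parts, which makes it easier to fetch a different slice by changing offset or length. The helper name build_rows_url is hypothetical; the parameter names are simply the ones visible in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co/rows"

def build_rows_url(dataset, config, split, offset=0, length=100):
    """Assemble a datasets-server /rows URL from its query parameters (hypothetical helper)."""
    params = {
        "dataset": dataset,  # urlencode escapes "/" as %2F
        "config": config,
        "split": split,
        "offset": offset,
        "length": length,
    }
    return f"{BASE_URL}?{urlencode(params)}"

rebuilt_url = build_rows_url("stanfordnlp/imdb", "plain_text", "train")
print(rebuilt_url)
```

Calling build_rows_url(..., offset=100) would then address the next 100 rows, assuming the API pages by offset as the parameter name suggests.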

texts: A list comprehension that extracts the 'text' field from each row in the dataset.
labels: Another list comprehension that produces integer labels (0 = negative, 1 = positive), which is what the model expects. The datasets-server API typically returns IMDb labels as integers already; the 'neg'/'pos' mapping is only a fallback for string labels.

In [None]:
# Extract texts and labels from the fetched data
texts = [row['row']['text'] for row in data['rows']]
# Labels normally arrive as integers (0 = neg, 1 = pos); map string names as a fallback
label_map = {'neg': 0, 'pos': 1}
labels = [label_map.get(row['row']['label'], row['row']['label']) for row in data['rows']]
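
Before training, it is worth checking how many examples fall into each class, because a small slice taken from the start of a split may contain only one class if the split happens to be ordered by label (an assumption worth verifying on the real data). The sketch below counts labels on a hypothetical payload shaped like data['rows']; sample_rows and label_counts are illustrative names, not part of the notebook.

```python
from collections import Counter

# Hypothetical payload mimicking the shape of data['rows']
sample_rows = [
    {"row": {"text": "Great film.", "label": 1}},
    {"row": {"text": "Terrible plot.", "label": 0}},
    {"row": {"text": "Loved it.", "label": 1}},
]

def label_counts(rows):
    """Count how many examples carry each label value."""
    return Counter(r["row"]["label"] for r in rows)

counts = label_counts(sample_rows)
print(counts)
```

On the fetched data, label_counts(data['rows']) returning a single key would mean the classifier only ever sees one class, in which case a near-zero loss says little about model quality.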

AutoTokenizer.from_pretrained: Loads the pre-trained tokenizer associated with the GPT-Neo model.
Adding a Padding Token: GPT-Neo's tokenizer does not define a padding token by default, so the tokenization cell further down checks for one and adds it if missing. Padding tokens ensure that all sequences in a batch have the same length.

In [None]:
# Load the tokenizer for GPT-Neo
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")

tokenizer: Converts the text data into token IDs that the model can process.
truncation=True: Texts longer than max_length tokens are cut off at max_length.
padding=True: Pads shorter sequences to the length of the longest sequence in the call (here, the whole list of texts), so all inputs have the same length.
max_length=512: Sets the maximum number of tokens kept per sequence.
Dataset.from_dict: Converts the tokenized inputs and labels into a Hugging Face Dataset object, which is a convenient format for model training.

In [None]:

# Add a padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Tokenize the data
encodings = tokenizer(texts, truncation=True, padding=True, max_length=512)

# Convert the data to a Hugging Face Dataset
dataset = Dataset.from_dict({
    'input_ids': encodings['input_ids'],
    'attention_mask': encodings['attention_mask'],
    'labels': labels
})
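
To make the effect of truncation=True and padding=True concrete, here is a plain-Python sketch of the idea on toy token IDs. It is a conceptual illustration, not the tokenizer's implementation; pad_and_truncate is a hypothetical helper, and right-side padding is assumed.

```python
def pad_and_truncate(sequences, max_length, pad_id):
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    target_len = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy = pad_and_truncate([[5, 6, 7, 8, 9], [3, 4]], max_length=4, pad_id=0)
print(toy["input_ids"])       # [[5, 6, 7, 8], [3, 4, 0, 0]]
print(toy["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```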

train_test_split: Splits the dataset into 80% training and 20% testing sets.
train_dataset & eval_dataset: These are the resulting training and testing datasets.

In [7]:
# Prepare the dataset by splitting it into training and testing sets
train_test_split = dataset.train_test_split(test_size=0.2)
train_dataset = train_test_split['train']
eval_dataset = train_test_split['test']

from_pretrained: Loads a pre-trained GPT-Neo model configured for sequence classification.
num_labels=2: Specifies that the classification task has two possible labels (positive and negative).
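pad_token_id: Explicitly sets the model's pad_token_id to match the one used by the tokenizer. The classification head scores the last non-padding token of each sequence, so the model needs this ID to locate that token when a batch contains padded sequences.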
resize_token_embeddings: If new tokens were added to the tokenizer (e.g., padding token), the model’s embedding layer needs to be resized to accommodate the additional tokens.

In [None]:
# Load GPT-Neo model for sequence classification
model = GPTNeoForSequenceClassification.from_pretrained("EleutherAI/gpt-neo-125M", num_labels=2)

# Set the padding token ID in the model's configuration
model.config.pad_token_id = tokenizer.pad_token_id

# If you added a new padding token, resize the model embeddings
model.resize_token_embeddings(len(tokenizer))
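
According to the model documentation, the sequence classification head uses the last token of each sequence, and relies on pad_token_id to find the last token that is not padding. The sketch below shows that lookup on toy token IDs. It is a conceptual illustration assuming right-side padding, not the library's code, and last_token_positions is a hypothetical name.

```python
def last_token_positions(input_ids, pad_id):
    """Return, per sequence, the index of the last non-padding token (right padding assumed)."""
    positions = []
    for seq in input_ids:
        # Index of the first pad token, or the sequence length if there is none
        first_pad = seq.index(pad_id) if pad_id in seq else len(seq)
        positions.append(first_pad - 1)
    return positions

batch = [[11, 12, 13, 0, 0], [21, 22, 23, 24, 25]]
positions = last_token_positions(batch, pad_id=0)
print(positions)  # [2, 4]
```

Without a pad token ID, there is no way to tell that the first sequence really ends at index 2 rather than index 4.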


### TrainingArguments: Specifies the parameters for training:

  output_dir: Directory where checkpoints and other model outputs are saved.
  evaluation_strategy: When to run evaluation; "epoch" evaluates on eval_dataset at the end of every epoch.
  learning_rate: The initial learning rate for the optimizer.
  per_device_train_batch_size and per_device_eval_batch_size: Batch sizes per device for training and evaluation, respectively.
  num_train_epochs: Number of times to iterate over the entire training set.
  weight_decay: Weight decay coefficient applied by the optimizer, a regularization technique that helps reduce overfitting.
  logging_dir: Directory to store training logs.
  logging_steps: Number of optimizer steps between log entries.
  save_total_limit: Maximum number of checkpoints to keep; older ones are deleted.

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",     # Renamed to eval_strategy in newer transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=4,   # Adjust based on your GPU memory
    per_device_eval_batch_size=4,    # Adjust based on your GPU memory
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    save_total_limit=2,
)
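
To get a feel for how these settings interact, the number of optimizer steps can be estimated by hand. The sketch below assumes a single device and no gradient accumulation, and uses the 80 training examples that an 80/20 split of 100 rows yields; estimate_steps is a hypothetical helper.

```python
import math

def estimate_steps(num_examples, batch_size, epochs):
    """Estimate optimizer steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

total_steps = estimate_steps(num_examples=80, batch_size=4, epochs=3)
log_entries = total_steps // 10  # logging_steps=10
print(total_steps, log_entries)  # 60 6
```

With so few steps, each logged loss value averages over a large share of the run, so the training curve will be coarse.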

Trainer: The Trainer class is a high-level API for training Hugging Face models. It handles the training loop, evaluation, and other tasks.
trainer.train(): Initiates the fine-tuning process.

save_pretrained: Saves the fine-tuned model and tokenizer to the specified directory for future use.

In [6]:
# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Fine-tune the model
trainer.train()

# Save the fine-tuned model and tokenizer
model.save_pretrained("./fine_tuned_gpt_neo")
tokenizer.save_pretrained("./fine_tuned_gpt_neo")

Some weights of GPTNeoForSequenceClassification were not initialized from the model checkpoint at EleutherAI/gpt-neo-125M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.0038        | 0.000439        |
| 2     | 0.0           | 0.000114        |
| 3     | 0.0           | 0.000103        |


('./fine_tuned_gpt_neo/tokenizer_config.json',
 './fine_tuned_gpt_neo/special_tokens_map.json',
 './fine_tuned_gpt_neo/vocab.json',
 './fine_tuned_gpt_neo/merges.txt',
 './fine_tuned_gpt_neo/added_tokens.json',
 './fine_tuned_gpt_neo/tokenizer.json')
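
The run above reports only loss values. As a minimal sketch, an accuracy metric could be added by passing a compute_metrics function to Trainer, which calls it with the evaluation logits and labels. The toy arrays below are made up for illustration and are not results from this notebook.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Compute accuracy from the (logits, labels) pair that Trainer passes during evaluation."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

# Made-up logits for four examples and two classes
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]])
toy_labels = np.array([0, 1, 1, 1])
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)  # {'accuracy': 0.75}
```

To use it, add compute_metrics=compute_metrics to the Trainer(...) call; the accuracy would then appear alongside the validation loss at each evaluation.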