Step 1: Load the Pre-trained Language Model and Tokenizer

The first step is to load the pre-trained language model and its corresponding tokenizer. For this example, we’ll use the ‘distilbert-base-uncased’ model, a smaller, distilled version of BERT.

In [None]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# Load the pre-trained tokenizer. 
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Load the pre-trained model for sequence classification
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

Step 2: Prepare the Sentiment Analysis Dataset

We need a labeled dataset with text samples and corresponding sentiments for sentiment analysis. Let’s create a small dataset for illustration purposes:

In [None]:

texts = ["I loved the movie. It was great!",
         "The food was terrible.",
         "The weather is okay."]
sentiments = ["positive", "negative", "neutral"]
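
The model works with integer class IDs rather than label strings, so it helps to keep an explicit mapping in both directions. The following is a minimal sketch of one way to build it; the names label2id, id2label, and encode_labels are illustrative and not part of the original example.

```python
# Illustrative helper names; only the label strings come from the example above.
sentiments = ["positive", "negative", "neutral"]

# dict.fromkeys keeps the first-seen order and drops duplicates,
# so each distinct label gets exactly one stable ID.
label2id = {label: idx for idx, label in enumerate(dict.fromkeys(sentiments))}
id2label = {idx: label for label, idx in label2id.items()}


def encode_labels(labels):
    """Map a list of string labels to their integer class IDs."""
    return [label2id[label] for label in labels]


print(label2id)
print(encode_labels(["neutral", "positive", "positive"]))
```

Keeping id2label around makes it easy to turn a predicted class index back into a readable sentiment later.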

Next, we’ll use the tokenizer to convert the text samples into the token IDs and attention masks that the model requires.

In [None]:
# Tokenize the text samples
encoded_texts = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

# Extract the input IDs and attention masks
input_ids = encoded_texts['input_ids']
attention_mask = encoded_texts['attention_mask']

# Convert the sentiment labels to numerical form
sentiment_labels = [sentiments.index(sentiment) for sentiment in sentiments]
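
To make the role of padding and the attention mask concrete, here is a small pure-Python sketch of what padding=True does conceptually. The function pad_batch and the token IDs are made up for illustration, and pad_id=0 is an assumption; the real tokenizer handles all of this internally.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-ID lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token IDs for illustration only; not real vocabulary IDs.
toy_batch = [[11, 12, 13, 14, 15], [21, 22, 23], [31, 32, 33, 34]]
ids, mask = pad_batch(toy_batch)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

Every row ends up with the same length, which is what allows the batch to be stacked into a single tensor.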

Step 3: Add a Custom Classification Head

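The pre-trained base model itself doesn’t include a task-specific classification head. DistilBertForSequenceClassification adds a randomly initialized one on top, with two output classes by default, so we replace its final layer with a simple linear layer sized for our three sentiment classes.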

In [None]:
import torch.nn as nn

# Add a custom classification head on top of the pre-trained model
num_classes = len(set(sentiment_labels))
classification_head = nn.Linear(model.config.hidden_size, num_classes)

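# Replace the model's default classification layer with our custom head
model.classifier = classification_head

# Keep the label count in sync so the built-in loss handles three classes
model.num_labels = num_classes
model.config.num_labels = num_classes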
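
Conceptually, the linear head maps one hidden vector to one score (logit) per class. The sketch below spells out that computation in plain Python with toy numbers; linear_head and all values are hypothetical, and the real layer operates on much larger hidden vectors with learned weights.

```python
def linear_head(hidden, weights, bias):
    """Compute logits[j] = sum_i weights[j][i] * hidden[i] + bias[j]."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]


# Toy values: hidden size 4 and 3 classes, chosen only for readability.
hidden = [0.5, -1.0, 2.0, 0.0]
weights = [
    [1.0, 0.0, 0.0, 0.0],  # row for class 0
    [0.0, 1.0, 0.0, 0.0],  # row for class 1
    [0.0, 0.0, 1.0, 0.0],  # row for class 2
]
bias = [0.0, 0.5, -1.0]

logits = linear_head(hidden, weights, bias)
print(logits)
```

The weight matrix has one row per class, which is why the layer's output size has to match the number of sentiment labels.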

Step 4: Fine-Tune the Model

With the custom classification head in place, we can now fine-tune the model on the sentiment analysis dataset. We’ll use the AdamW optimizer with a cross-entropy loss. Because the labels are passed to the model, it computes that loss internally and returns it as outputs.loss.

In [None]:
import torch
import torch.optim as optim

# Define the optimizer; the model applies cross-entropy itself when given labels
optimizer = optim.AdamW(model.parameters(), lr=2e-5)
labels = torch.tensor(sentiment_labels)

# Fine-tune the model
num_epochs = 3
model.train()  # enable dropout for training
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
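
For readers who want to see what the cross-entropy loss measures, here is a minimal pure-Python sketch under the usual definition (softmax followed by negative log-likelihood, averaged over the batch). The function names and logits are illustrative only.

```python
import math


def softmax(logits):
    """Convert logits to probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])


def batch_cross_entropy(batch_logits, targets):
    """Mean cross-entropy over a batch of examples."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


# Made-up logits for three samples and three classes.
batch_logits = [[2.0, 0.1, -1.0], [0.0, 3.0, 0.5], [0.2, 0.1, 0.0]]
targets = [0, 1, 2]
print(batch_cross_entropy(batch_logits, targets))
```

An untrained three-class head that scores every class equally has a loss of ln(3), roughly 1.10, so a training loss near that value means the model has not learned much yet.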

Instruction Finetuning Process

What if we could go beyond traditional fine-tuning and provide explicit instructions to guide the model’s behavior? Instruction fine-tuning does that, offering a new level of control and precision over model outputs. Here we will explore the process of instruction fine-tuning large language models for sentiment analysis.

Step 1: Load the Pre-trained Language Model and Tokenizer

To begin, let’s load the pre-trained language model and its tokenizer. We’ll use GPT-2, an openly available generative language model, for this example.

In [None]:
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

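# Load the pre-trained tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# GPT-2 has no padding token by default, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

# Load the pre-trained model for sequence classification with three sentiment classes
model = GPT2ForSequenceClassification.from_pretrained('gpt2', num_labels=3)
model.config.pad_token_id = tokenizer.pad_token_id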

Step 2: Prepare the Instruction Data and Sentiment Analysis Dataset

For instruction fine-tuning, we need to augment the sentiment analysis dataset with explicit instructions for the model. Let’s create a small dataset for demonstration:

In [None]:
texts = ["I loved the movie. It was great!",
         "The food was terrible.",
         "The weather is okay."]
sentiments = ["positive", "negative", "neutral"]
instructions = ["Analyze the sentiment of the text and identify if it is positive.",
                "Analyze the sentiment of the text and identify if it is negative.",
                "Analyze the sentiment of the text and identify if it is neutral."]

Next, let’s tokenize the texts and instructions using the tokenizer:

In [None]:
# Tokenize the texts and instructions
encoded_texts = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
encoded_instructions = tokenizer(instructions, padding=True, truncation=True, return_tensors='pt')

# Extract input IDs, attention masks, and instruction IDs
input_ids = encoded_texts['input_ids']
attention_mask = encoded_texts['attention_mask']
instruction_ids = encoded_instructions['input_ids']

Step 3: Combine the Instructions with the Inputs

To incorporate instructions during fine-tuning, we don’t need to change the model architecture itself; we change what the model sees as input. We can do this by concatenating the instruction IDs with the input IDs:

In [None]:
import torch

# Concatenate instruction IDs with input IDs, and join the matching attention masks
input_ids = torch.cat([instruction_ids, input_ids], dim=1)
attention_mask = torch.cat([encoded_instructions['attention_mask'], attention_mask], dim=1)
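
An alternative to concatenating token IDs is to combine the instruction and the text at the string level and tokenize the result once. The sketch below assumes a simple prompt template; build_prompt and the "Instruction / Text / Sentiment" layout are hypothetical choices, not a fixed standard. It uses one label-free instruction for every sample so that the prompt does not reveal the answer.

```python
def build_prompt(instruction, text):
    """Join an instruction and a text sample into a single prompt string."""
    return f"Instruction: {instruction}\nText: {text}\nSentiment:"


# One shared instruction that lists the options without naming the answer.
instruction = "Classify the sentiment of the text as positive, negative, or neutral."
texts = [
    "I loved the movie. It was great!",
    "The food was terrible.",
    "The weather is okay.",
]

prompts = [build_prompt(instruction, text) for text in texts]
for prompt in prompts:
    print(prompt)
    print("---")
```

The resulting strings can be passed to the tokenizer in a single call, which also produces one attention mask that covers the instruction and the text together.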

Step 4: Fine-Tune the Model with Instructions

With the instructions incorporated, we can now fine-tune the GPT-2 model on the augmented dataset. During fine-tuning, the instructions will guide the model’s sentiment analysis behavior.

In [None]:
import torch.optim as optim

# Define the optimizer; the model applies cross-entropy itself when given labels
optimizer = optim.AdamW(model.parameters(), lr=2e-5)

# Convert the string sentiments to integer class IDs
labels = torch.tensor([sentiments.index(sentiment) for sentiment in sentiments])

# Fine-tune the model
num_epochs = 3
model.train()  # enable dropout for training
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss
    loss.backward()
    optimizer.step()

Finetuning with PEFT

PEFT (parameter-efficient fine-tuning) approaches fine-tune only a small number of model parameters while keeping most of the pre-trained LLM frozen, which significantly lowers computational and storage costs. This also mitigates catastrophic forgetting, a problem observed during full fine-tuning of LLMs.

In low-data regimes, PEFT approaches have also been shown to outperform full fine-tuning and to generalize better to out-of-domain scenarios.

Loading the Model

Let’s load the opt-6.7b model here; its weights on the Hub are roughly 13GB in half-precision (float16). Loaded in 8-bit, they require about 7GB of memory.
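
These figures follow from simple arithmetic on the parameter count. The sketch below assumes a nominal 6.7 billion parameters (taken from the model name) and counts weights only; activations, optimizer state, and any layers kept in higher precision add to the real footprint.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 6.7e9  # assumption: nominal count implied by the name "opt-6.7b"

for dtype_name, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{dtype_name}: ~{weight_memory_gb(NUM_PARAMS, nbytes):.1f} GB")
```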

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b", 
    load_in_8bit=True, 
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")

Postprocessing On the Model

Let’s apply some post-processing to the 8-bit model to enable training: we freeze all layers and cast the layer norm parameters to float32 for stability. We also cast the final layer’s output to float32 for the same reason.

In [None]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)

Using LoRA

To load a PeftModel, we will use low-rank adapters (LoRA) via the get_peft_model utility function from PEFT.

The helper function below counts and prints the number of trainable parameters and the total number of parameters in a given model, along with the percentage that is trainable, giving an overview of the resources that training will require.

In [None]:
def print_trainable_parameters(model):
    """Print the number of trainable parameters in the model."""
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || "
        f"trainable%: {100 * trainable_params / all_param}"
    )

The next cell uses the PEFT library to create a LoRA configuration with specific settings, including the rank, scaling factor, target modules, dropout, bias, and task type. It then wraps the model with get_peft_model and prints the number of trainable parameters and total parameters, along with the percentage of trainable parameters.

In [None]:
from peft import LoraConfig, get_peft_model 

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)
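
To see where the trainable parameter count comes from, note that LoRA adds two small matrices to each targeted projection: one of shape (r, d_in) and one of shape (d_out, r). The sketch below estimates the total, assuming a hidden size of 4096 and 32 decoder layers for opt-6.7b; those two values are assumptions taken from the published model configuration, not from this article, and lora_trainable_params is a hypothetical helper.

```python
def lora_trainable_params(d_in, d_out, r, modules_per_layer, num_layers):
    """Count LoRA parameters: each adapted module adds r * (d_in + d_out) weights."""
    per_module = r * (d_in + d_out)
    return per_module * modules_per_layer * num_layers


HIDDEN_SIZE = 4096  # assumption: opt-6.7b hidden size
NUM_LAYERS = 32     # assumption: opt-6.7b decoder layers
RANK = 16           # r from the LoraConfig above
TARGETS = 2         # q_proj and v_proj

count = lora_trainable_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK, TARGETS, NUM_LAYERS)
print(f"estimated LoRA trainable params: {count:,}")
```

The estimate is only a fraction of a percent of the full model, and it can be compared with the figure reported by print_trainable_parameters.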

Training the Model

The cell below uses the Hugging Face Transformers and Datasets libraries to train the language model on the Abirate/english_quotes dataset. It uses the ‘transformers.Trainer’ class to define the training setup, including batch size, learning rate, and other training-related settings, and then trains the model on that dataset.

In [None]:
import transformers
from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)

trainer = transformers.Trainer(
    model=model, 
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4, 
        gradient_accumulation_steps=4,
        warmup_steps=100, 
        max_steps=200, 
        learning_rate=2e-4, 
        fp16=True,
        logging_steps=1, 
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
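
Two consequences of these settings are worth working out by hand: the effective batch size and the learning rate at each step. The sketch below assumes the Trainer’s default linear schedule (linear warmup followed by linear decay to zero) and a single device; linear_warmup_decay_lr is a hypothetical helper written for illustration, not a library function.

```python
def linear_warmup_decay_lr(step, base_lr, warmup_steps, max_steps):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


# Values copied from the TrainingArguments above.
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 4
WARMUP_STEPS = 100
MAX_STEPS = 200
BASE_LR = 2e-4

# Gradients from 4 micro-batches of 4 are accumulated before each update.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
sequences_seen = effective_batch * MAX_STEPS
print(f"effective batch size per device: {effective_batch}")
print(f"sequences processed in {MAX_STEPS} steps: {sequences_seen}")

for step in (0, 50, 100, 150, 200):
    lr = linear_warmup_decay_lr(step, BASE_LR, WARMUP_STEPS, MAX_STEPS)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

With these settings, half of the run is spent warming up, so the peak learning rate of 2e-4 is reached only at step 100 before decaying again.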