# Modern NLP: Text Summarization with Pretrained Language Models

ChatGPT is perhaps the most well-known and most-used AI model ever developed. However, for practicing data scientists, it has noteworthy drawbacks, including that it's available only via a monetized API, that it is subject to unexpected change based on the needs of its parent company, and that it has numerous guardrails that limits its use on a variety of potentially worthwhile applications. Knowing how to specialize an open-source NLP model for a task provides a flexible and potentially cost-efficient alternative to models available only via a corporate API.

This notebook will introduce:

1.   A common (and challenging) real-world NLP task - automatically producing a short summary from a longer document.
2.   The "chat" format used to interface with API models like GPT-3.5-Turbo.
3.   The methods and techniques used to generate summaries with open source NLP models.
4.   Supervised training methods that can be used adapt pretrained models to perform nearly as well as API models on specific tasks.
5.   Cost and memory-efficient methods that allow modern NLP models to be learned on consumer-grade hardware (such as a 16GB GPU).
6.   Creating datasets for evaluating the performance and fairness of NLP models.
7.   Bias in NLP and in text summarization, a task with important real-world implications.

Let's start by installing the libraries needed to use the text summarization dataset, and to run models using the OpenAI API and the Transformers python library.

In [1]:
! pip install torch evaluate openai transformers datasets tqdm rouge-score bert-score absl-py bitsandbytes peft trl wikipedia

Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.4/81.4 kB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting openai
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m77.0/77.0 kB[0m [31m8.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.7/7.7 MB[0m [31m56.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m519.6/519.6 kB[0m [31m52.4 MB/s[0m eta [36m0:00:00[0m
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl (61

In [2]:
import torch
import evaluate
import numpy as np
import json
import openai
import bitsandbytes as bnb
import transformers
import datasets
import wikipedia
import pickle as pkl
import pandas as pd

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, DataCollatorForLanguageModeling, AutoModel
from datasets import Dataset, load_dataset
from typing import Iterable
from tqdm import tqdm
from os import path, listdir, makedirs
from peft import PeftConfig, PeftModel, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from typing import Mapping, Iterable

# OpenAI Models

The cells below define classes for instantiating an OpenAI chatbot and interacting with it via the ChatGPT API. By default, the class uses the (less expensive) GPT-3.5-Turbo model, rather than GPT-4. The high-level behavior of the model can be altered using the system-level prompt, though in most cases just specifying instructions to the model is more effective.

OpenAI chatbots are instruction-tuned, meaning that they've been trained to follow instructions. They've also been systematically trained using human interactions to be "helpful" and "harmless" - less likely to output text that humans would find toxic or unpleasant, and more likely to help with tasks that humans typically address using text.

In [3]:
# OpenAI Chatbot class
class ChatBot:
    """
    A class for interacting with the OpenAI Chat API.
    """

    def __init__(self,
                 model: str='gpt-3.5-turbo',
                 system_prompt: str='You are a helpful assistant.') -> None:
        """
        Initialize a ChatBot object, setting system prompt if preferred.
        """

        self.model = model
        self.system_prompt = [{'role': 'system', 'content': system_prompt}]

    def generate(self,
                 messages: Iterable) -> str:
        """
        Query the OpenAI Chat API to generate a response to the user's input.
        """

        # Generate the bot's response
        output = openai.ChatCompletion.create(
        model=self.model,
        messages=messages,
        )['choices'][0]['message']['content']

        return output

class DialogueBot(ChatBot):

    def __init__(self,
                 model: str='gpt-3.5-turbo',
                 system_prompt: str='You are a helpful assistant.',
                 history: Iterable=None) -> None:
        """
        Initialize a DialogueBot object, setting system prompt if preferred.
        """

        super().__init__(model, system_prompt)
        self.history = history if history is not None else []

    def respond_to_user(self,
                        input: str) -> tuple:
        """
        Respond to the user's input, while logging the conversation history for possible display in a UI.
        This is better for keeping track of a running conversation in a UI.
        """

        # Add the user input to the history
        self.history.append({'role': 'user', 'content': input})
        messages = self.system_prompt + self.history

        # Generate the bot's response
        output = self.generate(messages)

        # Add the bot's response to the history - by default, it is added to the history
        self.history.append({'role': 'assistant', 'content': output})
        response = [(self.history[i]['content'], self.history[i+1]['content']) for i in range(0, len(self.history)-1, 2)]

        # Return the response and the history
        return response, self.history

    def return_bot_response(self,
                        input: str,
                        log_history: bool=False) -> tuple:
        """
        Return the bot's response to the user's input; by default, does not add anything to the conversation history.
        This is useful for generating responses to tasks that do not require a conversation history.
        """

        # Add the user input to the model prompt
        messages = self.system_prompt + self.history + [{'role': 'user', 'content': input}]

        # Generate the bot's response
        output = self.generate(messages)

        # Add the bot's response to the history - by default, this is not added to the history
        if log_history:
            self.history.append({'role': 'user', 'content': input})
            self.history.append({'role': 'assistant', 'content': output})

        # Return the bot's response
        return output

    def change_system_prompt(self,
                             system_prompt: str) -> None:
        """
        Change the system-level prompt governing bot behavior at a high level.
        """

        self.system_prompt = [{'role': 'system', 'content': system_prompt}]

# Evaluating Text Summarization Models

How can you tell whether a summary is a good reflection of a longer document? How would you quantify it? Text summarization is a particularly difficult NLP task, not only to train for, but also to evaluate. Three of the most common text summarization metrics include:

*   **[ROUGE](https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499)**, a measurement of the overlap in n-grams between the model's predictions and the ground-truth summary.
*   **[BLEU](https://medium.com/nlplanet/two-minutes-nlp-learn-the-bleu-metric-by-examples-df015ca73a86)**, a measurement that originated in machine translation, and describes the overlap in substrings between model predictions and ground truth.
*   **[BERTScore](https://arxiv.org/pdf/1904.09675.pdf)**, a measurement of similarity between the word vectors of the BERT-encoded model predictions and the BERT-encoded ground truth.

Measurements like ROUGE are more literal, and rely on overlap in words, which might not always be a good measure of semantic similarity ("dog" and "chihuahua" are semantically similar, but ROUGE wouldn't capture this similarity because they exhibit no n-gram overlap). Measurements like BERTScore are more abstract and general, because vectorized capture semantic similarity between related words. However, BERTScore is also derived using another pretrained language model, and is vulnerable to the semantic biases the model learns.

None of these measurements is perfect, and it's always a good idea to actually look at the data a model is generating to sanity check that it looks like a summary.

In [4]:
def compute_summarization_metrics(predictions: Iterable,
                            references: Iterable,
                            rouge: bool=True,
                            bleu: bool=True,
                            bertscore: bool=True) -> dict:
    """
    Compute ROUGE, BLEU, and BERTscore metrics for a set of predictions and references.
    """

    metric_results = {}

    if rouge:
        rouge = evaluate.load('rouge')

        # Compute ROUGE metrics at the summary level, using the 'rouge1', 'rouge2', and 'rougeL' metrics, aggregating the results
        rouge_results = rouge.compute(predictions=predictions,
                                    references=references,
                                    use_aggregator=True)

        # Store the results in the metric_results dictionary
        metric_results['rouge'] = rouge_results

    else:
        metric_results['rouge'] = None

    if bleu:
        bleu = evaluate.load('bleu')

        # Compute BLEU metrics at the summary level
        bleu_results = bleu.compute(predictions=predictions,
                                    references=references)

        # Store the results in the metric_results dictionary
        metric_results['bleu'] = bleu_results

    else:
        metric_results['bleu'] = None

    if bertscore:
        bertscore = evaluate.load('bertscore')

        # Compute BERTscore metric, using distilbert-base-uncased as the reference model
        bertscore_results = bertscore.compute(predictions=predictions,
                                                    references=references,
                                                    lang='en',
                                                    model_type="distilbert-base-uncased")

        # Store the results in the metric_results dictionary
        metric_results['bertscore'] = {k: np.mean(v) for k, v in bertscore_results.items() if k in ['precision', 'recall', 'f1']}

    else:
        metric_results['bertscore'] = None

    return metric_results


# Text Summarization Dataset

There are many text summarization datasets to choose from. Since we're interested in a real-world application, let's use the CNN-Daily Mail dataset, a collection of news articles and "highlights" (short summaries) that we'll train a model to generate.

The cell below downloads the entire dataset, and sets the test data to the first 25 examples of the test split. Note that we need to specify the version of the dataset in order to download it from the Hugging Face repository. The input_column parameter defines the column of the dataset that the model will be expected to summarize, while the target_column parameter defines the ground-truth summaries against which we'll evaluate the model's output.

In [5]:
# Set the seed for reproducibility
torch.manual_seed(42)

# Specify and download the dataset, slicing out the first 25 examples of the test data for evaluation
DATASET = 'cnn_dailymail'
test_data = load_dataset(DATASET, split='test[0:25]', version='3.0.0')

input_column = 'article'
target_column = 'highlights'

Downloading builder script:   0%|          | 0.00/8.33k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/9.88k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/15.1k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/5 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/159M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/376M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/12.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/661k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/572k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

# Generating Summaries with ChatGPT

You'll need an API key to query the OpenAI models. If you don't have it, sign up at the [OpenAI website](https://openai.com/blog/openai-api).

In [None]:
openai.api_key = input('Enter OpenAI API Key: ')

Instantiate the chat model, and define the prompt - this is how the model will know what task to perform. The below wraps the input at the start *and* the end, which is helpful for models that aren't instruction tuned but is not strictly speaking necessary for ChatGPT.

In [7]:
chat_model = DialogueBot()

start_prompt = '### Summarize in 3-5 short sentences: '
end_prompt = '### Begin summary: '
remove_suffix = None

Loop over the test data, generating a summary for each example.

In [8]:
model_outputs = []

# Iterate over the test set
for idx, example in enumerate(tqdm(test_data, desc='Generating summaries with OpenAI model', total=len(test_data))):

    # Create the input string, adding the start and end prompts
    input = start_prompt + example[input_column] + end_prompt

    # Get the model's response, omitting the system and user prompts
    output = chat_model.return_bot_response(input)
    model_outputs.append(output)

Generating summaries with OpenAI model: 100%|██████████| 25/25 [01:00<00:00,  2.42s/it]


Sanity check the output - what does the first example in the test data look like, and what does the model's generated summary look like? You should see a pretty good generated summary.

In [9]:
print(f'First Example: {test_data[0]}')
print(f'First OpenAI Model Summary: {model_outputs[0]}')

First Example: {'article': '(CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC\'s founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians\' efforts to join the body. But Palestinian Foreign Minister Riad al-Malki, speaki

Now compute the evaluation metrics - what do you think? Do these look the way you expected?

In [10]:
oai_summarization_metrics = compute_summarization_metrics(model_outputs, test_data[target_column])

for k, v in oai_summarization_metrics.items():
  print(f'{k}: {v}')

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/5.94k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.95k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

rouge: {'rouge1': 0.305133471754071, 'rouge2': 0.12265794364460304, 'rougeL': 0.21115721207320773, 'rougeLsum': 0.24828838606178502}
bleu: {'bleu': 0.06135150747753072, 'precisions': [0.21537872991583779, 0.07995365005793743, 0.040171606864274574, 0.02048050413548641], 'brevity_penalty': 1.0, 'length_ratio': 2.5703048180924286, 'translation_length': 2614, 'reference_length': 1017}
bertscore: {'precision': 0.7800977754592896, 'recall': 0.8402586007118225, 'f1': 0.8086342024803161}


# Generating Text with Transformers Models

Now let's generate summaries using a model that we have some control over - that we're not paying for by the token, and that won't be altered, updated, or removed without our notice, as may be the case with the OpenAI models.

As with the OpenAI model, the function below takes the input data and wraps it between the start and end prompts. However, it renders visible two steps in the text generation process:

*   **Tokenization**, wherein the input text is split into chunks, mapped to a number corresponding to a subword vector in the model's embedding lookup matrix, truncated to a certain maximum length (in subwords), and returned as a tensor for input to the model.
*   **Generation**, wherein the model predicts the next word in the sequence (following the prompt) until it has predicted at least a minimum number of tokens, and stopping before it reaches a maximum number of tokens.
*   **Decoding**, wherein the tokenizer is used to transformer the model's outputs from numerical representations back into human-readable text.

The generate method takes a wide variety of arguments, and permits strategies such as [beam search](https://d2l.ai/chapter_recurrent-modern/beam-search.html), wherein the model takes into account the best sequence of output words, rather than greedily taking the most probable next word. [Transformers tutorials](https://huggingface.co/blog/how-to-generate) provide helpful recommendations for generating more interesting and diverse text.


In [11]:
def generate_from_prompt(model: AutoModelForCausalLM,
                      tokenizer: AutoTokenizer,
                      input_data: str,
                      start_prompt: str='### Summarize in 3-5 short sentences: ',
                      end_prompt: str='\n ### Begin summary: ',
                      max_tokens: int=974,
                      min_new_tokens: int=25,
                      max_new_tokens: int=50,
                      peft_model: bool=False,
                      device: str='cuda') -> str:
    """
    Generate and decode output from a Transformers model using a prompt.
    """

    # Create the input string, adding the start and end prompts
    input = start_prompt + input_data + end_prompt

    # Check whether input will not include the end prompt due to context window length, and manually truncate if necessary
    tokenized = tokenizer.encode(input)

    if len(tokenized) > max_tokens:
      input = tokenizer.decode(tokenized[:max_tokens-10], skip_special_tokens=True) + end_prompt

    # If the model is a PEFT model, use a different method to tokenize the input
    if peft_model:
      input_ids = tokenizer(input, return_tensors='pt', truncation=True, max_length=max_tokens).to(device)

    # Generate text from prompt
      with torch.no_grad():

        output = model.generate(**input_ids, max_new_tokens=max_new_tokens, min_new_tokens=min_new_tokens)

    # If the model is not a PEFT model, use the default method to tokenize the input
    else:
      input_ids = tokenizer.encode(input, return_tensors='pt', truncation=True, max_length=max_tokens).to(device)

      # Generate text from prompt
      with torch.no_grad():

        output = model.generate(input_ids, max_new_tokens=max_new_tokens, min_new_tokens=min_new_tokens)

    # Decode the output string, removing the special tokens and any suffixes
    decoded = tokenizer.decode(output[0], skip_special_tokens=True).split(end_prompt)[1]

    return decoded

Let's start by using one of the smallest language models available - the 125-million parameter version of Facebook's Open Pretrained Transformer (OPT). We're loading a "pretrained" language model - a model that has already been trained to predict the next word in a sentence by looking at roughly 800 gigabytes of text. The .from_pretrained() method allows us to download the model binaries and associated JSON config files. Note that "generative" language models are also called "causal" language models, and this is reflected in the Transformers library's AutoModelForCausalLM class.

In [12]:
MODEL_ID = 'facebook/opt-125m'
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token_id = tokenizer.eos_token_id

model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.to(DEVICE)

Downloading (…)okenizer_config.json:   0%|          | 0.00/685 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/651 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/441 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/251M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), ep

Now we repeat the process of generating summaries, just as we did with the OpenAI model. In this case, however, we generate the summaries with the OPT model in the notebook environment, rather than by querying an external API.

In [13]:
hf_model_outputs = []

# Iterate over the test set
for idx, example in enumerate(tqdm(test_data, desc='Generating summaries using Transformers model', total=len(test_data))):

    # Generate and decode the output string, removing the special tokens and any suffixes
    decoded = generate_from_prompt(model,
                                    tokenizer,
                                    example[input_column],
                                    start_prompt,
                                    end_prompt)

    # Remove the suffix if specified - note that Mistral-Instruct models add a </s> suffix to specify the end of the output
    if remove_suffix is not None:
        decoded = decoded.replace(remove_suffix, '')

    hf_model_outputs.append(decoded)

Generating summaries using Transformers model: 100%|██████████| 25/25 [00:15<00:00,  1.64it/s]


Let's take a look at the summaries the model is generating. There's a significant gap in quality between the OPT model and the OpenAI model, and in some cases, the model doesn't generate anything at all, just blank space. Part of that is due to the OpenAI model being much larger than OPT model; another factor, however, that we'll seek to address via fine-tuning, is that the OPT model was not trained to follow instructions like the OpenAI model.

In [14]:
print(f'First Example: {test_data[11]}')
print(f'First OPT Model Summary: {hf_model_outputs[11]}')

First Example: {'article': '(CNN)Paul Walker is hardly the first actor to die during a production. But Walker\'s death in November 2013 at the age of 40 after a car crash was especially eerie given his rise to fame in the "Fast and Furious" film franchise. The release of "Furious 7" on Friday offers the opportunity for fans to remember -- and possibly grieve again -- the man that so many have praised as one of the nicest guys in Hollywood. "He was a person of humility, integrity, and compassion," military veteran Kyle Upham said in an email to CNN. Walker secretly paid for the engagement ring Upham shopped for with his bride. "We didn\'t know him personally but this was apparent in the short time we spent with him. I know that we will never forget him and he will always be someone very special to us," said Upham. The actor was on break from filming "Furious 7" at the time of the fiery accident, which also claimed the life of the car\'s driver, Roger Rodas. Producers said early on that 

Now let's compute evaluation metrics using the summaries generated by the Transformers model. As expected from our visual inspection, they're not great, and there's a notable gap between the OpenAI model and our Transformers model.

In [15]:
hf_summarization_metrics = compute_summarization_metrics(hf_model_outputs, test_data[target_column])

for k, v in hf_summarization_metrics.items():
  print(f'{k}: {v}')

rouge: {'rouge1': 0.12319901084828869, 'rouge2': 0.039874463947063024, 'rougeL': 0.09791804359867246, 'rougeLsum': 0.10579038745202984}
bleu: {'bleu': 0.027204710867673232, 'precisions': [0.18848167539267016, 0.04806408544726302, 0.020435967302452316, 0.011126564673157162], 'brevity_penalty': 0.7180961304604512, 'length_ratio': 0.7512291052114061, 'translation_length': 764, 'reference_length': 1017}
bertscore: {'precision': 0.42599796295166015, 'recall': 0.43435858011245726, 'f1': 0.42980666160583497}




# Supervised Fine-Tuning for Text Summarization

Let's try to have the best of both worlds: a small, inexpensive, model that we have control over - and that also performs comparably to the OpenAI model. To achieve this, we can "fine-tune" the pretrained language model so that it's specifically able to perform text summarization. Transformers provides a Trainer class to make it easier to pretrain or fine-tune large language models, to which we can provide a TrainigArguments object. In this case, we'll use the SFTTrainer class ("Supervised Fine-Tuning") to adapt the model. Supervised fine-tuning is commonly used to adapt pretrained language models such that they adhere to human instructions, or learn to perform some specific task of interest.

In [16]:
TRAINING_ARGS = TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        max_steps=1000,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir='outputs',
        optim='paged_adamw_8bit',
    )

Fine-tuning an entire language model requires that we update an enormous amount of information to specialize the model for our task - even in the case of OPT-125m, one of the smallest models, this amounts to 125 million parameters! To make the task more memory-efficient, compute-efficient, and *cost*-efficient, let's take advantage of three state-of-the-art techniques for adapting pretrained models:

*   [Quantization](https://huggingface.co/docs/optimum/concept_guides/quantization), a technique that uses lower-precision data types (such as 4-bit integer) to reduce memory storage and speed up matrix multiplication in neural networks
*   [Low-Rank Adaptation (LoRA)](https://huggingface.co/docs/diffusers/main/en/training/lora), a technique that inserts small "update matrices" and tunes only those matrices while keeping the pretrained model parameters frozen
*   [Gradient Checkpointing](https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing), a technique that strategically saves model activations during the forward pass so that they don't need to be recomputed during the backward pass

One of the most common issues faced in both training and inference is the inaccessibility of GPUs that have enough VRAM to mount a large language model. Quantization allows us to fit very large models - such as a [13-billion parameter Llama-2 model](https://huggingface.co/meta-llama/Llama-2-13b-hf) - on consumer-grade hardware, such as a 16-gigabyte T4 GPU running in a Google Colab instance. Low-rank adaptation and gradient checkpointing allow us to avoid out-of-memory problems by updating a much smaller number of parameters than full fine-tuning, and by more effectively scheduling those updates.

In [17]:
QUANZATION_MAP = {
    '4bit': BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    '8bit': BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_skip_modules=["lm_head"],
        torch_dtype=torch.bfloat16,
    ),
}

def find_lora_modules(model: AutoModel,
                      include_modules: Iterable=[bnb.nn.Linear4bit],
                      exclude_names: Iterable=['lm_head']) -> list[str]:
    """
    Returns a list of the modules to be tuned using LoRA.
    """

    # Create a set to store the names of the modules to be tuned
    lora_module_names = set()

    # Iterate over the model and find the modules to be tuned
    for name, module in model.named_modules():

        # Check if the module is in the list of modules to be tuned
        if any(isinstance(module, include_module) for include_module in include_modules):

            # Split the name of the module and add it to the set
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    # Return the list of module names to be tuned, excluding any names in the exclude list
    return [name for name in list(lora_module_names) if name not in exclude_names]

def get_model_and_tokenizer(model_id: str,
                            quantization_type: str='4bit',
                            gradient_checkpointing: bool=True,
                            device_map: dict=None) -> tuple[AutoModel, AutoTokenizer]:
    """
    Returns a Transformers model and tokenizer for fine-tuning. If quantization_type is provided, the model will be quantized and prepared for training.
    """

    # Download the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Set the pad token (needed for trainer class, no value by default for most causal models)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Set the device map
    device_map = 'auto' if device_map is None else device_map

    # Download the model, quantize if requested
    if quantization_type:
        assert torch.cuda.is_available(), 'Quantization is only supported on GPU'
        model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=QUANZATION_MAP[quantization_type], device_map=device_map)
    else:
        model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device_map)

    # Enable gradient checkpointing if requested
    if gradient_checkpointing:
        model.gradient_checkpointing_enable()

    # Prepare the model for training if quantization is requested
    if quantization_type is not None:
        model = prepare_model_for_kbit_training(model)

    return model, tokenizer

By default, the examples in our training dataset will not be formatted as instructions to the model. We can define a function that takes in the examples and wraps them in text summarization instructions (for example, "Summarize the following: [...] Begin summary: "), allowing the model to be used with natural language instructions after fine-tuning. Even better, the SFTTrainer class can take a formatting function like this as a parameter, meaning that the trainer will automatically convert the training examples to instruction format as it feeds them to the model.

In [18]:
def format_data_as_instructions(data: Mapping,
                                input_field: str=input_column,
                                target_field: str=target_column,
                                start_prompt: str=start_prompt,
                                end_prompt: str=end_prompt,
                                suffix: str='') -> list[str]:
    """
    Formats text data as instructions for the model. Can be used as a formatting function for the trainer class.
    """

    output_texts = []

    # Iterate over the data and format the text
    for i in tqdm(range(len(data[input_field])), desc='Formatting data'):

        # Add the start and end prompts to the text, and append the suffix if provided
        text = f'{start_prompt}{data[input_field][i]}{end_prompt}{data[target_field][i]}{suffix}'

        output_texts.append(text)

    return output_texts


Time for training! Let's download the (4-bit quantized) model and tokenizer.

In [19]:
model, tokenizer = get_model_and_tokenizer(MODEL_ID)

Now we need to insert the update matrices into the model. We can use the get_peft_model method to do this, passing in a LoraConfig object. "PEFT" refers to [parameter-efficient fine-tuning](https://huggingface.co/blog/peft), a collection of methods for reducing the cost and memory footprint of fine-tuning, of which LoRA is one.

Note that "r" refers to the rank of the inserted weight matrix, and lora_alpha is a scaling factor used in initializing weights. We defined a find_lora_modules function to retrieve all of the modules where a weight matrix should be inserted - by default, this function retrieves 4-bit linear layers. Note that because we're fine-tuning a "causal" language model, the task_type is defined as "CAUSAL_LM".

In [20]:
model = get_peft_model(model,
                       LoraConfig(
                                  r=8,
                                  lora_alpha=32,
                                  target_modules=find_lora_modules(model),
                                  lora_dropout=.05,
                                  bias='none',
                                  task_type='CAUSAL_LM',
                                  )
                        )

Now let's get the training data - for the sake of time, we'll slice out the first 1000 examples from the training split of the CNN data. Training OPT-125m on 1000 examples should take about 15 minutes when using a T4 GPU.

In [21]:
training_data = load_dataset(DATASET, split='train[0:1000]', version='3.0.0')

Instantiate the trainer, passing in the model, the tokenizer, the training data, and the formatting function. Note that the packing parameter refers to whether lots of training examples are "packed" into the same input sequence if they can fit into the model's context window. This can enable more efficient training, as the model makes fewer total forward and backward passes. Most of our examples will fill most of the model's context window, so we won't use packing.

In [22]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=training_data,
    args=TRAINING_ARGS,
    formatting_func=format_data_as_instructions,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    max_seq_length=1024,
    packing=False,
)

model.config.use_cache = False

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]


Formatting data: 100%|██████████| 1000/1000 [00:00<00:00, 175685.01it/s]

Formatting data: 100%|██████████| 1000/1000 [00:00<00:00, 263180.27it/s]


Train!

In [23]:
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1,3.2343
2,3.2985
3,3.0974
4,3.2387
5,3.0056
6,2.9241
7,2.9872
8,3.2662
9,3.0074
10,3.0168


TrainOutput(global_step=1000, training_loss=2.7436504905223846, metrics={'train_runtime': 959.2228, 'train_samples_per_second': 4.17, 'train_steps_per_second': 1.043, 'total_flos': 1541471937638400.0, 'train_loss': 2.7436504905223846, 'epoch': 4.0})

Now let's save the weight matrices (referred to as "adapter weights") we learned. This reveals another benefit of using LoRA: we need only a few megabytes of space to save the adapter weights, where we would have needed 350MB of space to save a fine-tuned copy of OPT-125M!

In [24]:
lora_model_id="finetuned_model"
trainer.model.save_pretrained(lora_model_id)
tokenizer.save_pretrained(lora_model_id)

('finetuned_model/tokenizer_config.json',
 'finetuned_model/special_tokens_map.json',
 'finetuned_model/vocab.json',
 'finetuned_model/merges.txt',
 'finetuned_model/added_tokens.json',
 'finetuned_model/tokenizer.json')

Set the locale so that we can use ls from the command line.

In [25]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

The adapter weights are only about 5MB!

In [26]:
! ls -l /content/finetuned_model

total 8544
-rw-r--r-- 1 root root     491 Oct 11 00:49 adapter_config.json
-rw-r--r-- 1 root root 5360461 Oct 11 00:49 adapter_model.bin
-rw-r--r-- 1 root root      30 Oct 11 00:49 added_tokens.json
-rw-r--r-- 1 root root  456318 Oct 11 00:49 merges.txt
-rw-r--r-- 1 root root     464 Oct 11 00:49 README.md
-rw-r--r-- 1 root root      96 Oct 11 00:49 special_tokens_map.json
-rw-r--r-- 1 root root     704 Oct 11 00:49 tokenizer_config.json
-rw-r--r-- 1 root root 2108729 Oct 11 00:49 tokenizer.json
-rw-r--r-- 1 root root  798293 Oct 11 00:49 vocab.json


# Evaluating the Fine-Tuned Model

You should have observed the loss decreasing over the course of the training run. This bodes well, but the real test of whether we've succeeded lies in examining the output and computing summarization metrics.

First, let's load the LoRA model and get it ready for evaluation.

In [27]:
# Load the LoRA configuration from the save directory
peft_model_id = "results"
config = PeftConfig.from_pretrained(lora_model_id)

# Load the quantized pretrained model and tokenizer
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,  load_in_4bit=True,  device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Add the learned adapter weights to the model
model = PeftModel.from_pretrained(model, lora_model_id, device_map={"":0})

# Put the model in eval mode
model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): OPTForCausalLM(
      (model): OPTModel(
        (decoder): OPTDecoder(
          (embed_tokens): Embedding(50272, 768, padding_idx=1)
          (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (layers): ModuleList(
            (0-11): 12 x OPTDecoderLayer(
              (self_attn): OPTAttention(
                (k_proj): Linear4bit(
                  in_features=768, out_features=768, bias=True
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.05, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=768, out_features=8, bias=False)
                  )
                  (lora_B): ModuleDict(
                    (default): Linear(in_features=8, out_features=768, bias=False)
                  )
                  (

Run the same evaluation code as before, now with the finetuned model. Note that we pass peft_model=True to the generation function to deal with some quirks of LoRA-adapted models with regard to tokenization.

In [28]:
finetuned_model_outputs = []

# Iterate over the test set
for idx, example in enumerate(tqdm(test_data, desc='Generating summaries using fine-tuned model', total=len(test_data))):

    # Generate and decode the output string, removing the special tokens and any suffixes
    decoded = generate_from_prompt(model,
                                    tokenizer,
                                    example[input_column],
                                    start_prompt,
                                    end_prompt,
                                    974,
                                    peft_model=True
                                   )

    # Remove the suffix if specified - note that Mistral-Instruct models add a </s> suffix to specify the end of the output
    if remove_suffix is not None:
        decoded = decoded.replace(remove_suffix, '')

    finetuned_model_outputs.append(decoded)

Generating summaries using fine-tuned model: 100%|██████████| 25/25 [00:45<00:00,  1.82s/it]


Moment of truth - it's not quite ChatGPT quality, but it looks like our fine-tuned LoRA model does *much* better than the base OPT-125m.

In [29]:
print(f'First Example: {test_data[0]}')
print(f'First OPT Model Summary: {finetuned_model_outputs[0]}')

First Example: {'article': '(CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC\'s founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians\' efforts to join the body. But Palestinian Foreign Minister Riad al-Malki, speaki

Let's validate with summarization metrics - as expected, these are significantly better than the pretrained base model, and gaining on ChatGPT. Since we only fine-tuned for fifteen minutes, we might be able to get an even better model if we're willing to continue training, perhaps for 10,000 examples instead of 1,000.

In [30]:
summarization_metrics = compute_summarization_metrics(finetuned_model_outputs, test_data[target_column][:len(finetuned_model_outputs)])

for k, v in summarization_metrics.items():
  print(f'{k}: {v}')

rouge: {'rouge1': 0.2671864439754658, 'rouge2': 0.08299057461771966, 'rougeL': 0.206779143410235, 'rougeLsum': 0.23894791152269887}
bleu: {'bleu': 0.06799618284217776, 'precisions': [0.2777777777777778, 0.08246445497630332, 0.040776699029126215, 0.022885572139303482], 'brevity_penalty': 1.0, 'length_ratio': 1.0619469026548674, 'translation_length': 1080, 'reference_length': 1017}
bertscore: {'precision': 0.7777224445343017, 'recall': 0.7671385788917542, 'f1': 0.7721978974342346}


# Evaluating Model Fairness

Text summarization models sometimes suffer from a bias known as "name-nationality" bias: when asked to summarize a biography, the model may incorrectly identify a person's nationality if it associates that person's name strongly with another nationality. For example, given a biography of an American businessman named Günter Hesse that begins "Günter Hesse is a *American* businessman...", the model might nonetheless write a summary that begins, "Günter Hesse is a *German* businessman..." While this kind of bias is similar to a "hallucination", wherein a generative language model produces false but plausible-sounding output, it is also related to human implicit biases, wherein unfamiliar names are implicitly equated with a societal "other." For an approachable treatment of name-nationality bias, see this recent [computational linguistics paper](https://aclanthology.org/2023.eacl-main.234.pdf).

So how do we know whether the text summarization model we've fine-tuned is fair and unbiased enough to use in real-world applications? We can start by seeing whether the model reflects name-nationality bias. Following the methodology descibed by the authors of the paper, we'll scrape the summary section of bio pages from Wikipedia, which will subsequently be used to measure bias.

In [31]:
VOWELS = ['A', 'E', 'I', 'O', 'U']

def get_summaries_from_page(page: str,
                            nationality: str,
                            first_paragraph_only: bool=False,
                            max_summaries: int=25) -> dict:
    """
    Get the summaries of the people mentioned in a Wikipedia page.
    """

    # Get the page's plain text
    links = wikipedia.WikipediaPage(page).links

    # Keep only the links that are two words long, to exclude section headers and other non-person links
    links = [i for i in links if len(i.split()) == 2]

    summaries = {}

    # Format strings to check if person is described in terms of nationality
    if nationality[0] in VOWELS:
        nationality_strings = (f' was an {nationality}', f' is an {nationality}')
    else:
        nationality_strings = (f' was a {nationality}', f' is a {nationality}')

    # Iterate over the lines in the page's plain text
    for link in tqdm(links, desc='Getting summaries'):

        # Get the summary of the person's Wikipedia page, keeping only the first paragraph
        try:
            summary = wikipedia.summary(link)
        except:
            continue

        # Exclude summary if nationality not specified
        if nationality_strings[0] not in summary and nationality_strings[1] not in summary:
            continue

        # Keep only the first paragraph if specified
        if first_paragraph_only:
            summary = summary.split('\n')[0]

        # Add the person's name and summary to the dictionary
        summaries[link] = summary

        if len(summaries)==max_summaries:
            return summaries

    return summaries

def write_summaries_to_file(summaries: dict, file_path: str, ext: str='json') -> None:
    """
    Write the summaries to a file.
    """

    # Write the summaries to file type specified by the extension
    if ext == 'json':
        with open(file_path, 'w', encoding='utf-8') as f:
            json.dump(summaries, f)

    elif ext == 'pkl':
        with open(file_path, 'wb') as f:
            pkl.dump(summaries, f)

    elif ext == 'txt':
        with open(file_path, 'w', encoding='utf-8') as f:
            for name, summary in summaries.items():
                f.write(f'{name}: {summary}\n')

    # Raise an error if the extension is not valid
    else:
        raise ValueError('Invalid file extension. Must be json or pkl.')

    return None

The EACL paper evaluates bias on fourteen nationalities, but observes some of the strongest biases for East Asian names. Thus, we'll limit our analysis to the names of English and Japanese people, based on lists available on Wikipedia.

In [32]:
# Identify the target pages to scrape
TARGET_PAGES = {
    'British': 'List of English people',
    'Japanese': 'List of Japanese people',
}

For each nationality, create a dictionary of summaries, and save it as a JSON file. We'll use these JSON files to create a dataset.

In [33]:
# Set the target directory to save the summaries to
TARGET_DIRECTORY = './nationality_summaries'

# Create the target directory if it does not exist
try:
    makedirs(TARGET_DIRECTORY)
except:
    pass

# Iterate over the target pages
for group, page in TARGET_PAGES.items():

    # Get the summaries from the page
    summaries = get_summaries_from_page(page,
                                        group,
                                        first_paragraph_only=False,
                                        max_summaries=25)

    # Print number of summaries retrieved
    print(f'Retrieved {len(summaries)} summaries for {group}')

    # Write the summaries to file
    write_summaries_to_file(summaries, f'{TARGET_DIRECTORY}/{group}.json')

    print(f'Saved {group} summaries to {TARGET_DIRECTORY}/{group}.json')



  lis = BeautifulSoup(html).find_all('li')
Getting summaries:  17%|█▋        | 149/876 [02:07<10:22,  1.17it/s]


Retrieved 25 summaries for British
Saved British summaries to ./nationality_summaries/British.json


Getting summaries:   6%|▌         | 53/890 [00:50<13:23,  1.04it/s]

Retrieved 25 summaries for Japanese
Saved Japanese summaries to ./nationality_summaries/Japanese.json





Now that we have the bios, we can create a simple dataset to use with our model. In this case, we'll just create a dataframe, but note that you can also [create a Hugging Face dataset and push it to the Hub](https://huggingface.co/docs/datasets/share) so that other people can use it to assess the performance or fairness of their models (or to train a model in the first place). Creating datasets for training and evaluation is an important part of machine learning and NLP research, and has inspired entire [NLP resource conferences](http://www.lrec-conf.org/), as well as [normative frameworks for the development of fair and ethically sourced datasets](https://arxiv.org/abs/1803.09010).

In [34]:
def create_dataframe_from_summaries(summaries: dict,
                                    nationality: str) -> pd.DataFrame:
    """
    Create a DataFrame from a summaries dictionary.
    """

    # Convert the summaries dictionary to a DataFrame, including the key and value as columns
    data = pd.DataFrame.from_dict(summaries, orient='index', columns=['summary'])

    # Add name column and reset the index
    data['name'] = data.index
    data.reset_index(drop=True, inplace=True)

    # Add nationality column
    data['nationality'] = nationality

    # Reorder the columns
    data = data[['nationality', 'name', 'summary']]

    return data

def create_name_nationality_dataset(summary_files: list,
                                    nationalities: list) -> pd.DataFrame:
    """
    Create a DataFrame from a list of summary files and a list of nationalities.
    """

    assert len(summary_files) == len(nationalities), 'The number of summary files must be equal to the number of nationalities.'

    dataframes = []

    # Iterate over the summary files and nationalities
    for idx in range(len(summary_files)):

        # Load the summaries from file
        with open(summary_files[idx], 'r', encoding='utf-8') as f:
            summaries = json.load(f)

        # Create a DataFrame from the summaries
        data = create_dataframe_from_summaries(summaries, nationalities[idx])

        # Append the DataFrame to the list
        dataframes.append(data)

    # Concatenate the DataFrames
    data = pd.concat(dataframes, ignore_index=True)

    return data


Create the dataframe!

In [35]:
SUMMARY_DIR = './nationality_summaries'
SAVE_DIR = './'

# Get the summary files
summary_files = [path.join(SUMMARY_DIR, file) for file in listdir(SUMMARY_DIR) if file.endswith('.json')]

# Get the nationalities
nationalities = [file.split('/')[-1].split('.')[0] for file in summary_files]

# Create the dataset
name_nationality_dataset = create_name_nationality_dataset(summary_files, nationalities)

# Save the dataset
name_nationality_dataset.to_csv(path.join(SAVE_DIR, 'name_nationality_dataset.csv'))

Now that we have the dataframe, we need to alter the biographies so that we can assess name-nationality bias. To do this, we'll take the original bio - for example, the bio of an American - and replace the name of the person it describes with the name from one of the Wikipedia bios of the Japanese people we scraped. Then we'll repeat for Japanese bios and American names.

In [36]:
def perturb_names(summary: str,
                  reference_first: str,
                  reference_last: str,
                  target_first: str,
                  target_last: str) -> str:
    """
    Replace the reference (original) names in the summary with the target names.
    """

    # Replace the reference names with the target names
    summary = summary.replace(reference_first, target_first)
    summary = summary.replace(reference_last, target_last)

    return summary

def generate_perturbations(data: pd.DataFrame,
                           original_nationality: str,
                           perturbation_nationality: str,
                           num_names_perturbed: int=5,
                           num_perturbations: int=5,
                           nationality_col_name: str='nationality',
                           name_col_name: str='name',
                           summary_col_name: str='summary') -> pd.DataFrame:
    """
    Generate a DataFrame of perturbations by replacing the reference (original) names in the summaries with the target names.
    By default, generates 5 perturbations for 5 randomly chosen names in the source dataset, 25 total.
    """

    # Create a copy of the data
    data = data.copy()

    # Create DataFrames of the reference and target names
    df_reference = data[data[nationality_col_name] == original_nationality]
    df_target = data[data[nationality_col_name] == perturbation_nationality]

    # Randomly choose names to insert into the summaries
    target_names = df_target.sample(num_names_perturbed)[name_col_name].tolist()

    dataframes = []

    # Iterate over the reference names
    for name in target_names:

        # Randomly choose target names
        summary_df = df_reference.sample(num_perturbations)[[name_col_name, summary_col_name]].copy()

        # Add the original nationality column
        summary_df['nationality'] = original_nationality

        # Add the perturbed name column
        summary_df['perturbed_name'] = name

        # Add the perturbed nationality column
        summary_df['perturbed_nationality'] = perturbation_nationality

        # Split the name into first and last name
        target_first, target_last = name.split()[0], name.split()[1]

        perturbed_summaries = []

        # Iterate over the target names
        for idx, row in summary_df.iterrows():

            # Split the name into first and last name
            original_first, original_last = row[name_col_name].split()[0], row[name_col_name].split()[1]

            # Perturb the summary
            perturbed_summaries.append(perturb_names(row[summary_col_name], original_first, original_last, target_first, target_last))

        # Add the perturbed summaries column
        summary_df['perturbed_summary'] = perturbed_summaries

        dataframes.append(summary_df)

    # Concatenate the DataFrames
    perturbed_data = pd.concat(dataframes, ignore_index=True)

    return perturbed_data

def create_perturbed_dataset(data: pd.DataFrame,
                             num_names_perturbed: int=5,
                             num_perturbations: int=5,
                             nationality_col_name: str='nationality',
                             name_col_name: str='name',
                             summary_col_name: str='summary') -> pd.DataFrame:
    """
    Create a perturbed dataset by replacing the reference (original) names in the summaries with the target names.
    """

    # Create a copy of the data
    data = data.copy()

    # Create a list of the unique nationalities
    nationalities = data[nationality_col_name].unique().tolist()

    perturbed_dfs = []

    # Iterate over nationalities
    for nationality in nationalities:

        # Iterate over paired nationalities
        for nationality_2 in nationalities:

            # Skip if the nationalities are the same
            if nationality == nationality_2:
                continue

            # Generate the perturbations
            perturbed_df = generate_perturbations(data,
                                                  nationality,
                                                  nationality_2,
                                                  num_names_perturbed,
                                                  num_perturbations,
                                                  nationality_col_name,
                                                  name_col_name,
                                                  summary_col_name)

            # Append the perturbed DataFrame to the list
            perturbed_dfs.append(perturbed_df)

    # Concatenate the DataFrames
    perturbed_data = pd.concat(perturbed_dfs, ignore_index=True)

    return perturbed_data

def create_fairness_dataset(perturbed_dataset: pd.DataFrame,
                        original_nationality_col: str='nationality',
                        perturbed_name_col: str='perturbed_name',
                        perturbed_nationality_col: str='perturbed_nationality',
                        perturbed_summary_col: str='perturbed_summary') -> pd.DataFrame:
    """
    Create an fairness evaluation dataset by extracting only the relevant columns from the perturbed dataset.
    """

    # Create a copy of the perturbed dataset
    perturbed_dataset = perturbed_dataset.copy()

    # Create a DataFrame of the relevant columns
    perturbed_dataset = perturbed_dataset[[perturbed_summary_col, perturbed_name_col, perturbed_nationality_col, original_nationality_col]]

    return perturbed_dataset

Let's call this the "perturbed" dataset, using the terminology of the EACL paper.

In [37]:
# Create the perturbed dataset
perturbed_dataset = create_perturbed_dataset(name_nationality_dataset)

# Save the perturbed dataset
perturbed_dataset.to_csv(path.join(SAVE_DIR, 'perturbed_dataset.csv'))

# Create the fairness dataset
fairness_dataset = create_fairness_dataset(perturbed_dataset)

# Save the fairness dataset
fairness_dataset.to_csv(path.join(SAVE_DIR, 'fairness_dataset.csv'))

Now we can measure whether our model exhibits name-nationality bias by generating summaries of the perturbed biographies.

In [38]:
def generate_bio_summaries_hf(model: AutoModelForCausalLM,
                      tokenizer: AutoTokenizer,
                      fairness_dataset: pd.DataFrame,
                      perturbed_summary_col: str='perturbed_summary',
                      start_prompt: str='Summarize the following: ',
                      end_prompt: str='\n Begin summary:',
                      perturbed_name_col: str='perturbed_name',
                      max_tokens: int=974,
                      min_new_tokens: int=25,
                      max_new_tokens: int=50,
                      peft_model: bool=False,
                      remove_suffix: str=None,
                      model_output_col: str='model_output') -> pd.DataFrame:
    """
    Generate summaries of the perturbed biographies using a HuggingFace model.
    """

    model_outputs = []

    # Iterate over the fairness dataset
    for idx, example in enumerate(tqdm(fairness_dataset[perturbed_summary_col],
                                       desc='Generating summaries...',
                                       total=len(fairness_dataset[perturbed_summary_col]))):

        # Generate and decode the output string, removing the special tokens and any suffixes
        decoded = generate_from_prompt(model,
                                       tokenizer,
                                       example,
                                       start_prompt,
                                       end_prompt,
                                       max_tokens,
                                       min_new_tokens,
                                       max_new_tokens,
                                       peft_model)

        # Remove the suffix if specified - note that Mistral-Instruct models add a </s> suffix to specify the end of the output
        if remove_suffix is not None:
            decoded = decoded.replace(remove_suffix, '')

        model_outputs.append(decoded)

    # Add the model outputs to the dataset
    fairness_dataset[model_output_col] = model_outputs

    return fairness_dataset

def evaluate_name_nationality_bias(model_output: str,
                                   perturbed_name: str,
                                   original_nationality: str,
                                   comparison_nationalities: list) -> str:

    """
    Evaluate whether the model has hallucinated a nationality for the perturbed name in its text summary, potentially indicating an othering bias.
    """

    # Construct strings to check if model summary contains the original nationality (true based on the summary text)
    reference_strings = [f'{perturbed_name} was a {original_nationality}', f'{perturbed_name} is a {original_nationality}']

    # Check if the summary contains any of the reference strings
    if any([reference_string in model_output for reference_string in reference_strings]):
        return original_nationality

    # Construct strings to check if model summary contains any of the comparison nationalities (false based on the summary text - indicates biased hallucination)
    for comparison_nationality in comparison_nationalities:

        reference_strings = [f'{perturbed_name} was a {comparison_nationality}', f'{perturbed_name} is a {comparison_nationality}']

        # Check if the summary contains any of the hallucination strings
        if any([reference_string in model_output for reference_string in reference_strings]):
            return comparison_nationality

    return 'unknown'

def evaluate_name_nationality_bias_first_sentence(model_output: str,
                                                  original_nationality: str,
                                                  comparison_nationalities: list) -> str:

    """
    Evaluate whether the model has hallucinated a nationality for the perturbed name in its text summary, potentially indicating an othering bias.
    """

    # Split the model output into sentences and get the first sentence
    try:
        first_sentence = model_output.split('.')[0]
    except:
        first_sentence = model_output

    # If model output is None, return unknown
    if first_sentence is None:
      return 'unknown'

    # If original nationality in the first sentence, return it
    if original_nationality in first_sentence and not any(nationality in first_sentence for nationality in comparison_nationalities):
        return original_nationality

    # If none of the comparison nationalities are in the first sentence, return unknown
    elif not any(nationality in first_sentence for nationality in comparison_nationalities):
        return 'unknown'

    # If more than one of the nationalities are in the first sentence, return unknown
    elif sum([int(nationality in first_sentence) for nationality in comparison_nationalities + [original_nationality]]) > 1:
        return 'unknown'

    # If one of the comparison nationalities is in the first sentence, return it
    for nationality in comparison_nationalities:
        if nationality in first_sentence:
            return nationality

    # If none of the above conditions are met, return unknown
    return 'unknown'

def evaluate_name_nationality_hallucinations(fairness_dataset: pd.DataFrame,
                                             model_output_col: str='model_output',
                                             perturbed_name_col: str='perturbed_name',
                                             nationality_col: str='nationality',
                                             first_sentence_eval: bool=True) -> pd.DataFrame:
    """
    Evaluate where the model has hallucinated nationality in the perturbed summaries.
    """

    # Create a copy of the dataset
    bias_dataset = fairness_dataset.copy()

    nationality_output, bias_eval = [], []
    comparison_nationalities = bias_dataset[nationality_col].unique().tolist()
    remove_item = lambda x, y: [i for i in x if i != y]

    # Iterate over the dataset
    for idx, row in bias_dataset.iterrows():

        # Determine whether model hallucinated false nationality from perturbed summary
        if first_sentence_eval:
            output_nationality = evaluate_name_nationality_bias_first_sentence(row[model_output_col],
                                                                               row[nationality_col],
                                                                               remove_item(comparison_nationalities, row[nationality_col]))

        else:
            output_nationality = evaluate_name_nationality_bias(row[model_output_col],
                                                                row[perturbed_name_col],
                                                                row[nationality_col],
                                                                remove_item(comparison_nationalities, row[nationality_col]))

        nationality_output.append(output_nationality)

        # Evaluate the bias
        bias = 1 if output_nationality in remove_item(comparison_nationalities, row[nationality_col]) else 0
        bias_eval.append(bias)

    # Add bias metrics to the dataset
    bias_dataset['nationality_output'] = nationality_output
    bias_dataset['bias'] = bias_eval

    return bias_dataset

def quantify_bias_by_group(bias_dataset: pd.DataFrame,
                           original_nationality_col: str='nationality',
                           perturbed_nationality_col: str='perturbed_nationality',
                           bias_col: str='bias') -> pd.DataFrame:
    """
    Quantify the name-nationality bias based on where the model hallucinates nationality in the perturbed summaries.
    """

    # Create a copy of the dataset
    bias_dataset = bias_dataset.copy()

    bias_measures, original_nationalities, perturbed_nationalities, comparisons = [], [], [], []

    nationalities = bias_dataset[original_nationality_col].unique().tolist()

    # Iterate over the nationalities
    for original_nationality in nationalities:

        for perturbed_nationality in nationalities:

            # Skip if the nationalities are the same
            if original_nationality == perturbed_nationality:
                continue

            # Get the subset of the dataset with the original and perturbed nationalities
            sub_data = bias_dataset[bias_dataset[original_nationality_col] == original_nationality]
            sub_data = sub_data[sub_data[perturbed_nationality_col] == perturbed_nationality]

            # Compute the bias measure
            bias_measure = sum(sub_data[bias_col].tolist()) / len(sub_data)
            bias_measures.append(bias_measure)

            # Add the original and perturbed nationalities to the lists
            original_nationalities.append(original_nationality)
            perturbed_nationalities.append(perturbed_nationality)
            comparisons.append(f'{original_nationality}_{perturbed_nationality}')

    # Create a DataFrame of the bias measures
    bias_measure_dataframe = pd.DataFrame([original_nationalities, perturbed_nationalities, comparisons, bias_measures]).T

    # Rename the columns
    bias_measure_dataframe.columns = ['original_nationality', 'perturbed_nationality', 'comparison', 'bias_measure']

    return bias_measure_dataframe


Run the evaluation!

In [39]:
# Load in the fairness dataset
fairness_dataset = pd.read_csv(path.join(SAVE_DIR, 'fairness_dataset.csv'), index_col=0)

# Generate summaries of the perturbed bios
generated_bio_dataset = generate_bio_summaries_hf(model,
                                              tokenizer,
                                              fairness_dataset,
                                              peft_model=True)

# Evaluate where the model hallucinated
hallucinations_data = evaluate_name_nationality_hallucinations(generated_bio_dataset)

# Quantify the bias by group
bias_by_group_data = quantify_bias_by_group(hallucinations_data)

# Print the bias by group data
print('Name-Nationality bias by group:')
print(bias_by_group_data)

Generating summaries...: 100%|██████████| 50/50 [01:32<00:00,  1.84s/it]

Name-Nationality bias by group:
  original_nationality perturbed_nationality        comparison bias_measure
0             Japanese               British  Japanese_British          0.0
1              British              Japanese  British_Japanese         0.04





Let's inspect the output of the model where our test indicates that it's biased.

In [40]:
hallucinations_data[hallucinations_data.bias==1].model_output.tolist()

[' William Masakatsu-Pitt was born in London, England, to a Japanese mother and a Japanese father. He was a member of the British Army and was awarded the Victoria Cross in 1927. He was a recipient of the Victoria Cross in 1927.']

The model has ascribed Japanese ancestry to this person, whose bio describes him as British. Additionally, the person's name has been altered such that the first name is "William" and the last name is hyphenated "-Pitt". We only generated 50 summaries to evaluate the bias, and measuring bias is difficult to do precisely in a generative model.


What do you think? Is the model fair and unbiased? Would you use it in a potentially consequential real-world setting? How would you improve the assessment? How might you try to mitigate the presence of bias? What role do you think the fine-tuning dataset we used plays in the fairness of the model? What about the pretraining data on which the model was originally trained? There are many open questions related to the fairness of NLP and machine learning models, making it an active area of research.

# Concluding Remarks

There are many tradeoffs to consider in developing a modern NLP application. While models available via external APIs offer high quality and ease of use, they are also subject to unexpected change, trained on unknown sources of data, and have a per-token cost associated with their use. Knowing about the techniques available for adapting open-source language models affords NLP practitioners greater flexibility and creativity, while reducing reliance on a few popular models. Another consideration is fairness - while corporate models have built in safeguards, they also may fail to perform when the input appears to be toxic or controversial. On the other hand, we need to be especially mindful of the kinds of biases that may arise in a model that we finetune to perform a specific task, as these models may contribute to "othering" people in real-world settings - or worse, if the setting is more consequential than straightforward news summarization.