Multi-label Classification

Adds a linear layer on top of the base model. The layer produces a tensor of shape (batch_size, num_labels) holding the unnormalized scores (logits) for each label, for every example in the batch.
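
To make the output shape concrete, here is a minimal sketch of such a head in plain Python. The sizes, the values, and the names `linear_head`, `hidden`, `weights` and `bias` are illustrative assumptions, not the layer this notebook builds; the real head operates on the base model's hidden states.

```python
# Hypothetical sizes for illustration only (the real model is far larger).
batch_size, hidden_size, num_labels = 2, 4, 3

# One pooled hidden vector per example, as the base model would supply.
hidden = [[0.5, -1.0, 0.25, 2.0],
          [1.0, 0.0, -0.5, 0.5]]

# Linear layer parameters: one weight row and one bias per label.
weights = [[0.1, 0.2, 0.3, 0.4],
           [-0.4, 0.3, -0.2, 0.1],
           [0.0, 0.5, 0.5, 0.0]]
bias = [0.0, 0.1, -0.1]


def linear_head(hidden, weights, bias):
    """Return unnormalized scores (logits) of shape (batch_size, num_labels)."""
    return [[sum(h * w for h, w in zip(vec, row)) + b
             for row, b in zip(weights, bias)]
            for vec in hidden]


logits = linear_head(hidden, weights, bias)
print(len(logits), len(logits[0]))  # batch_size, num_labels
```

Each row of `logits` holds one independent score per label, so several labels can be active for the same example.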

Based on https://github.com/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb

Uses LoRA for parameter-efficient fine-tuning (PEFT).

# Setup

In [1]:
# Base Model

base_model_id = "mistralai/Mistral-7B-v0.1"

In [2]:
seed = 2024
use_lora = True
use_fp16 = True
use_gradient_checkpointing = True  # Save some memory at the expense of training speed
# See https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one

hf_site_id = '2024-mcm-everitt-ryan'
dataset_id = f'{hf_site_id}/job-bias-synthetic-human-benchmark'
base_model_name = base_model_id.split('/')[-1]
model_id = f'{base_model_name}-job-bias-classifier'


In [3]:
!pip install -q transformers datasets sentencepiece accelerate evaluate

In [4]:
!pip install -q peft

# Dataset

In [5]:
from datasets import load_dataset

dataset = load_dataset(dataset_id)
column_names = dataset['train'].column_names

print(f"Columns: {dataset.num_columns}")
print(f"Rows: {dataset.num_rows}")
print(f"Column Names: {column_names}")

In [6]:
example = dataset['train'][0]
example['text']

In [7]:
text_col = 'text'
label_cols = [col for col in column_names if col.startswith('label_')]

labels = [label.replace("label_", "") for label in label_cols]

id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}
print(f"Text column: {text_col}")
print(f"Label columns: {label_cols}")
print(f"Labels: {labels}")

In [8]:
# Keep only the columns needed for multi-label classification (context_id, synthetic, text and the label columns)
keep_columns = ['context_id', 'synthetic', text_col] + label_cols
for split in ["train", "val", "test"]:
    dataset[split] = dataset[split].remove_columns(
        [col for col in dataset[split].column_names if col not in keep_columns])


In [9]:
import pandas as pd

# Merge train, val and test into one dataframe
df = pd.concat([
    dataset['train'].to_pandas(),
    dataset['val'].to_pandas(),
    dataset['test'].to_pandas()])

print(f"{df.synthetic.value_counts().to_string()}")
for col in label_cols:
    print(f"\n{df[col].value_counts().to_string()}")

In [10]:
# Longest text (by character count)
longest_text = df[text_col].apply(lambda x: (len(x), x)).max()[1]
longest_text

In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_prefix_space=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.pad_token = tokenizer.eos_token

tokenizer

In [12]:
max_char = len(longest_text)
max_words = len(longest_text.split())
# Token count of the longest text by characters; this approximates the true maximum token count
max_tokens = len(tokenizer.encode(longest_text))

print(f'Max characters: {max_char}')
print(f'Max words: {max_words}')
print(f'Max tokens: {max_tokens}')

In [13]:
tokenizer_max_length = min(max_tokens, tokenizer.model_max_length)
tokenizer_max_length

In [14]:
import numpy as np


def preprocess_data(sample):
    # take a batch of texts
    text = sample[text_col]
    # encode them
    encoding = tokenizer(text, truncation=True, max_length=tokenizer_max_length, padding="max_length")
    #encoding = tokenizer(text, truncation=True, max_length=tokenizer_max_length, padding=True)
    # add labels
    labels_batch = {k: sample[k] for k in sample.keys() if k in label_cols}
    # create numpy array of shape (batch_size, num_labels)
    labels_matrix = np.zeros((len(text), len(label_cols)))
    # fill numpy array
    for idx, label in enumerate(label_cols):
        labels_matrix[:, idx] = labels_batch[label]

    encoding["labels"] = labels_matrix.tolist()

    return encoding
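
The label handling in `preprocess_data` amounts to transposing the per-label columns of a batch into one multi-hot row per example. Here is a minimal sketch of that step on its own, without the tokenizer. The batch contents and the label names are made up for illustration and are not taken from the dataset.

```python
# Hypothetical batch in the column-oriented form that a batched map receives.
# The texts and label names are made up for illustration.
toy_batch = {
    "text": ["first job ad", "second job ad", "third job ad"],
    "label_age": [1, 0, 0],
    "label_gender": [0, 0, 1],
    "label_disability": [1, 0, 1],
}
toy_label_cols = [col for col in toy_batch if col.startswith("label_")]

# Transpose the per-label columns into one multi-hot row per example.
# Floats are used because a BCE-style loss expects float targets.
toy_matrix = [[float(toy_batch[col][i]) for col in toy_label_cols]
              for i in range(len(toy_batch["text"]))]

for text, row in zip(toy_batch["text"], toy_matrix):
    print(text, row)
```

A row of all zeros is valid here: it simply means the example carries none of the labels.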

In [15]:
#ds_train = ds_train.map(tokenize, batched=True, batch_size=len(ds_train))
encoded_dataset = dataset.map(preprocess_data, batched=True, remove_columns=dataset['train'].column_names)

In [16]:
example = encoded_dataset['train'][0]
print(example.keys())

In [17]:
tokenizer.decode(example['input_ids'])

In [18]:
example['labels']

In [19]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

In [20]:
encoded_dataset.set_format("torch")

# Model

Here we define a model in which the pre-trained base weights are loaded and a randomly initialized classification head (linear layer) is placed on top. The head should be fine-tuned, together with the pre-trained base, on a labeled dataset.

The warning printed when the model is loaded says the same thing.

We set the `problem_type` to be "multi_label_classification", as this will make sure the appropriate loss function is used (namely [`BCEWithLogitsLoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)). We also make sure the output layer has `len(label_cols)` output neurons, and we set the id2label and label2id mappings.
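
To show what this loss computes, here is a minimal sketch in plain Python of binary cross-entropy on raw logits, averaged over all labels. The function names and the example values are assumptions for illustration; this is not the PyTorch implementation, only the same formula written out.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all labels, computed from raw logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    losses = [max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
              for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)


def bce_naive(logits, targets):
    """Same quantity written as sigmoid followed by cross-entropy."""
    losses = []
    for x, y in zip(logits, targets):
        p = sigmoid(x)
        losses.append(-(y * math.log(p) + (1.0 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


# One example with three labels; two of them are active at once.
example_logits = [2.0, -1.0, 0.5]
example_targets = [1.0, 0.0, 1.0]

print(bce_with_logits(example_logits, example_targets))
print(bce_naive(example_logits, example_targets))
```

Both functions print the same value. Each label is scored as its own yes/no decision, which is why the targets can contain several ones and why they need to be floats.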

In [21]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(base_model_id,
                                                           problem_type="multi_label_classification",
                                                           num_labels=len(label_cols),
                                                           id2label=id2label,
                                                           label2id=label2id)
model

In [22]:
model.config

In [23]:
from peft import get_peft_model, LoraConfig, TaskType

if use_lora:
    peft_config = LoraConfig(
        task_type=TaskType.SEQ_CLS, lora_alpha=16, lora_dropout=0.1, bias="none",
        r=2,
#        target_modules='all-linear'
#        target_modules=[
#            "q_proj",
#            "v_proj",
#        ],
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
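
LoRA keeps a layer's weight matrix W (d_out x d_in) frozen and learns two small matrices, A (r x d_in) and B (d_out x r). Their product is scaled by `lora_alpha / r` and added to W. Here is a rough sketch of how few weights that adds for one layer. The 4096 x 4096 layer shape is an assumption for illustration, and the helper names are hypothetical; the actual count for the whole model is what `print_trainable_parameters()` reports.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * d_in + d_out * r


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return lora_alpha / r


# Values taken from the LoraConfig above.
r, lora_alpha = 2, 16

# Assumed layer shape for illustration: a square 4096 x 4096 projection.
d_in = d_out = 4096

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"full layer: {full:,} weights")
print(f"LoRA adds:  {added:,} trainable weights ({added / full:.4%} of the layer)")
print(f"scaling:    {lora_scaling(lora_alpha, r)}")
```

With a rank as low as `r=2` the adapter is very small, which keeps memory use down but also limits how far the adapted layers can move from the pre-trained weights.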


Let's verify a batch as well as a forward pass:

In [24]:
encoded_dataset['train'][0]['labels'].type()

In [25]:
# Index the row first so that only one example is loaded, not the whole column
encoded_dataset['train'][0]['input_ids']

In [26]:
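# Forward pass on a single example (index the row first to avoid loading whole columns)
sample = encoded_dataset['train'][0]
outputs = model(input_ids=sample['input_ids'].unsqueeze(0),
                attention_mask=sample['attention_mask'].unsqueeze(0),  # ignore the padding tokens
                labels=sample['labels'].unsqueeze(0))
outputs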

# Define Metrics

In [27]:
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction
import torch


# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
# added extras
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels

    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)

    f1_micro = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    f1_macro = f1_score(y_true=y_true, y_pred=y_pred, average='macro')
    f1_samples = f1_score(y_true=y_true, y_pred=y_pred, average='samples')
    f1_weighted = f1_score(y_true=y_true, y_pred=y_pred, average='weighted')

    precision_micro = precision_score(y_true=y_true, y_pred=y_pred, average='micro')
    recall_micro = recall_score(y_true=y_true, y_pred=y_pred, average='micro')
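    # ROC AUC is a ranking metric, so score it on the probabilities rather than the thresholded predictions
    roc_auc_micro = roc_auc_score(y_true=y_true, y_score=probs.numpy(), average='micro')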
    # return as dictionary
    metrics = {
        'accuracy': accuracy,
        'f1_micro': f1_micro,
        'f1_macro': f1_macro,
        'f1_samples': f1_samples,
        'f1_weighted': f1_weighted,
        'precision_micro': precision_micro,
        'recall_micro': recall_micro,
        'roc_auc_micro': roc_auc_micro}
    return metrics


def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds,
        labels=p.label_ids)
    return result
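
To show what these numbers mean, here is a small hand-computed sketch of the sigmoid-and-threshold step followed by micro F1, macro F1 and accuracy, in plain Python. The logits, labels and helper names are made up for illustration; the notebook itself relies on scikit-learn for the real computation. Note that on multi-label input `accuracy_score` is subset accuracy: an example counts as correct only if every one of its labels is predicted correctly.

```python
import math

# Toy logits and multi-hot labels: 3 examples, 3 labels (made-up values).
toy_logits = [[2.0, -1.0, 0.5],
              [-0.5, 1.5, -2.0],
              [1.0, -1.0, -1.0]]
toy_labels = [[1, 0, 0],
              [0, 1, 1],
              [1, 0, 0]]
threshold = 0.5

# Sigmoid each logit independently, then threshold to get 0/1 predictions.
toy_preds = [[1 if 1.0 / (1.0 + math.exp(-x)) >= threshold else 0 for x in row]
             for row in toy_logits]


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def counts(pairs):
    """Return (tp, fp, fn) for an iterable of (true, predicted) pairs."""
    pairs = list(pairs)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, fn


num_labels = len(toy_labels[0])

# Micro: pool every (example, label) decision, then compute one F1.
micro_f1 = f1(*counts((t, p) for tr, pr in zip(toy_labels, toy_preds)
                      for t, p in zip(tr, pr)))

# Macro: compute F1 per label, then take the unweighted mean.
macro_f1 = sum(f1(*counts((tr[j], pr[j]) for tr, pr in zip(toy_labels, toy_preds)))
               for j in range(num_labels)) / num_labels

# Subset accuracy: an example counts only if every label matches.
subset_acc = sum(tr == pr for tr, pr in zip(toy_labels, toy_preds)) / len(toy_labels)

print(micro_f1, macro_f1, subset_acc)
```

Micro F1 is dominated by the frequent labels, while macro F1 gives every label the same weight, so a rare label that is always missed pulls macro F1 down much more. Subset accuracy is the strictest of the three.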

# Train

In [29]:
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
from huggingface_hub import HfFolder

batch_size = 8
gradient_accumulation_steps = 4  # Effective batch size per device: 8 * 4 = 32
metric_name = "f1_micro"
optimiser = 'paged_adamw_8bit'  # Use paged optimizer to save memory
#learning_rate = 4e-5  # Use value slightly smaller than pretraining lr value & close to LoRA standard
#learning_rate = 5e-5
learning_rate = 1e-3
epochs = 10

args = TrainingArguments(
    model_id,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=3, #to prevent running out of disk space
    learning_rate=learning_rate,
    #optim=optimiser,
    #lr_scheduler_type="cosine",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    num_train_epochs=epochs,
    weight_decay=0.01,
    #weight_decay=0.001,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    fp16=use_fp16,
    gradient_checkpointing=use_gradient_checkpointing,
    #push_to_hub=True,
    #output_dir=repository_id,
    #logging_dir=f"{model_id}/logs",
    #logging_strategy="steps",
    logging_steps=10,
    #warmup_steps=500,
    #warmup_ratio=0.1,
    #max_grad_norm=0.3,
    #save_total_limit=2,
    #report_to="tensorboard",
    #push_to_hub=True,
    #hub_strategy="every_save",
    #hub_model_id=hub_model_id,
    #hub_token=HfFolder.get_token(),
)

#early_stop = transformers.EarlyStoppingCallback(10, 1.15)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["val"],
    # Pads each batch to its longest example; the inputs here were already padded
    # to tokenizer_max_length in preprocess_data, so this adds no further padding
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
    #tokenizer=tokenizer,
    #   callbacks=[early_stop]
)

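model.config.use_cache = False  # The KV cache is unused in training and conflicts with gradient checkpointing
model.config.pad_token_id = model.config.eos_token_id  # Lets the classification head locate the last non-padding token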
if not use_lora:
    # Freeze the pre-trained model's parameters
    for param in model.base_model.parameters():
        param.requires_grad = False

trainer.train()

# Evaluate

In [30]:
test_results = trainer.evaluate(eval_dataset=encoded_dataset['test'])

In [31]:
print(f'evaluation (test) results: {test_results}')

In [32]:
import pandas as pd

# Use a separate name so the merged dataset dataframe `df` is not overwritten
results_df = pd.DataFrame(list(test_results.items()), columns=['Metric', 'Value'])
print(results_df.to_string(index=False))

In [33]:
import gc


# Function to free up memory
def free_up_memory():
    gc.collect()  # Drop unreachable Python objects first so their CUDA tensors can be released
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()

# Ensure model is in evaluation mode
model.eval()

# Move model to the appropriate device (GPU or CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Free up memory
free_up_memory()


In [34]:
#The following contains age and disability bias
text = "Responsibilities: Oversee daily warehouse operations, including receiving, storing, and distributing products. Manage inventory control processes and ensure accurate record-keeping. Develop and implement warehouse policies and procedures to improve efficiency and safety. Lead and mentor a team of warehouse staff, fostering a positive and productive work environment. Coordinate with other departments to ensure smooth workflow and timely order fulfillment. Monitor performance metrics and prepare reports for senior management. Ensure compliance with health and safety regulations. Requirements: Bachelor's degree in logistics, supply chain management, or a related field. Minimum of 5 years of experience in warehouse management. Strong leadership and organizational skills. Excellent communication and interpersonal skills. Proficiency in warehouse management software and Microsoft Office Suite. Ability to work in a fast-paced environment and handle multiple tasks simultaneously. Must be under 40 years old to ensure a fit with our energetic and fast-paced team culture. Preferred Qualifications: Experience with lean warehouse operations and continuous improvement methodologies. Certification in warehouse management or related disciplines. Knowledge of industry-specific regulations and best practices. Physical Requirements: Ability to lift up to 50 pounds. Ability to stand and walk for extended periods. Young and dynamic individuals preferred to keep up with the physical demands of the job. Benefits: Health, dental, and vision insurance. Retirement savings plan with company match. Paid time off and holidays. Opportunities for professional development and career advancement. How to Apply: Interested candidates are invited to submit their resume and cover letter to [email@example.com]. ABC Logistics is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees."

with torch.no_grad():
    encoding = tokenizer(text, return_tensors="pt")
    encoding = {k: v.to(model.device) for k,v in encoding.items()}

    outputs = model(**encoding)

In [35]:
logits = outputs.logits
logits.shape
sigmoid = torch.nn.Sigmoid()
probs = sigmoid(logits.squeeze().cpu())
predictions = np.zeros(probs.shape)
predictions[np.where(probs >= 0.5)] = 1
# turn predicted ids into actual label names
predicted_labels = [id2label[idx] for idx, label in enumerate(predictions) if label == 1.0]
print(predicted_labels)