### Stratified Splitting & Hyperparameter Search

This notebook demonstrates two techniques with the Hugging Face Trainer API:

- Multilabel Iterative Stratified Splitting: This method divides an imbalanced multilabel dataset so that each split (or each fold in k-fold cross-validation) keeps a label distribution close to that of the complete dataset.

- Hyperparameter Search: We will walk through how to conduct a systematic hyperparameter search to fine-tune models for optimal performance.
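
The idea behind iterative stratification can be sketched in plain Python. The version below is a simplified illustration under stated assumptions, not scikit-multilearn's implementation: the function name `iterative_stratified_split` is hypothetical, ties are broken deterministically rather than randomly, and the toy labels are made up. It repeatedly picks the label with the fewest unassigned samples and sends each of those samples to the split that still needs that label the most.

```python
from collections import Counter


def iterative_stratified_split(label_sets, ratios=(0.5, 0.5)):
    """Assign sample indices to splits, handling the rarest label first (simplified sketch)."""
    n = len(label_sets)
    remaining = set(range(n))
    folds = [[] for _ in ratios]
    # Desired number of samples per split, overall and per label.
    want_total = [n * r for r in ratios]
    label_totals = Counter(lab for labels in label_sets for lab in labels)
    want_label = [{lab: cnt * r for lab, cnt in label_totals.items()} for r in ratios]

    def assign(idx, fold):
        folds[fold].append(idx)
        remaining.discard(idx)
        want_total[fold] -= 1
        for lab in label_sets[idx]:
            want_label[fold][lab] -= 1

    while remaining:
        counts = Counter(lab for i in remaining for lab in label_sets[i])
        if not counts:
            # Only unlabeled samples are left: fill splits by overall demand.
            for idx in sorted(remaining):
                assign(idx, max(range(len(ratios)), key=lambda j: want_total[j]))
            break
        # Label with the fewest remaining samples (ties broken alphabetically).
        rarest = min(counts, key=lambda lab: (counts[lab], lab))
        for idx in sorted(i for i in remaining if rarest in label_sets[i]):
            # Split that still needs this label most; ties go to the emptier split.
            fold = max(
                range(len(ratios)),
                key=lambda j: (want_label[j][rarest], want_total[j]),
            )
            assign(idx, fold)
    return folds


# Toy data (made up): 4 x earn, 2 x grain+wheat, 2 x grain
toy_labels = [{"earn"}] * 4 + [{"grain", "wheat"}] * 2 + [{"grain"}] * 2
train_idx, val_idx = iterative_stratified_split(toy_labels)
train_counts = Counter(lab for i in train_idx for lab in toy_labels[i])
val_counts = Counter(lab for i in val_idx for lab in toy_labels[i])
print(train_counts, val_counts)
```

On this toy input every label ends up evenly divided between the two splits. The notebook itself relies on `iterative_train_test_split` from scikit-multilearn rather than on a hand-written routine like this one.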


In [1]:
#!pip install scikit-multilearn optuna

In [2]:
from collections import Counter
from itertools import chain
import re

import numpy as np
import torch
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report, f1_score, roc_auc_score
from sklearn.preprocessing import MultiLabelBinarizer
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EvalPrediction,
    Trainer,
    TrainingArguments,
)


In [3]:
# Text Preprocessing
def preprocess_text(text: str) -> str:
    """Remove numbers, newlines, and special characters from text."""
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text

# Find Single Appearance Labels
def find_single_appearance_labels(y):
    """Find labels that appear only once in the dataset."""
    all_labels = list(chain.from_iterable(y))
    label_count = Counter(all_labels)
    single_appearance_labels = [label for label, count in label_count.items() if count == 1]
    return single_appearance_labels

# Remove Single Appearance Labels from Dataset
def remove_single_appearance_labels(dataset, single_appearance_labels):
    """Remove samples with single-appearance labels from both train and test sets."""
    for split in ['train', 'test']:
        dataset[split] = dataset[split].filter(lambda x: all(label not in single_appearance_labels for label in x['topics']))
    return dataset

In [4]:
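def multi_label_metrics(predictions, labels, threshold=0.5):
    """Compute micro F1, micro ROC AUC, and subset accuracy from raw logits."""
    # Turn logits into independent per-label probabilities
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))

    # Binarize: predict a label when its probability reaches the threshold
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1

    y_true = labels

    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    # Note: ROC AUC is computed on the thresholded predictions, not on probs
    roc_auc = roc_auc_score(y_true, y_pred, average='micro')
    # For multilabel input, accuracy_score is subset accuracy (exact match per sample)
    accuracy = accuracy_score(y_true, y_pred)

    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics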

def compute_metrics(p: EvalPrediction):
    # Some models return a tuple; the logits are its first element
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    return multi_label_metrics(predictions=preds, labels=p.label_ids)
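
To make the metrics concrete, here is a minimal pure-Python sketch of the same thresholding logic on made-up logits. The helper names (`binarize`, `micro_f1`, `subset_accuracy`) are hypothetical and only mirror what `f1_score(average='micro')` and `accuracy_score` compute for multilabel input. It also shows that a model predicting no labels at all scores an F1 of 0 while still being counted correct on every sample that has no labels.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binarize(logits, threshold=0.5):
    """Predict a label when its sigmoid probability reaches the threshold."""
    return [[1 if sigmoid(v) >= threshold else 0 for v in row] for row in logits]


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over every (sample, label) cell."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up logits for 3 samples and 3 labels
logits = [[2.0, -1.0, -3.0], [-2.0, -0.5, -4.0], [-1.0, -2.0, -3.0]]
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
y_pred = binarize(logits)

print(micro_f1(y_true, y_pred), subset_accuracy(y_true, y_pred))

# A model that predicts nothing: F1 is 0, yet the unlabeled sample still counts as correct
all_zero = [[0, 0, 0] for _ in y_true]
print(micro_f1(y_true, all_zero), subset_accuracy(y_true, all_zero))
```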

### Load dataset

We use the `ModApte` configuration of the Reuters-21578 dataset, which provides `train`, `test`, and `unused` splits.

In [5]:
# Load Dataset
dataset = load_dataset("reuters21578", "ModApte")

Downloading builder script: 100%|██████████| 17.9k/17.9k [00:00<00:00, 24.9MB/s]
Downloading readme: 100%|██████████| 16.0k/16.0k [00:00<00:00, 3.76MB/s]


### Preprocess data

- Find labels that appear only once in the training split and remove the samples carrying them from the train and test splits
- Combine title and text into a single `text` column
- Transform topics into a multi-hot encoding stored in the `labels` column
- Tokenize dataset
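
The first and third steps can be illustrated on a tiny made-up example. This is a sketch in plain Python, assuming toy topic lists; the notebook itself uses `Counter`, `Dataset.filter`, and scikit-learn's `MultiLabelBinarizer` for the same purpose.

```python
from collections import Counter
from itertools import chain

# Toy topic lists (made up for illustration)
topics = [["earn"], ["grain", "wheat"], ["earn", "grain"], ["grain", "wheat"], ["rye"]]

# Step 1: labels that occur exactly once across all samples
counts = Counter(chain.from_iterable(topics))
single = {label for label, count in counts.items() if count == 1}

# Drop every sample that carries one of those labels
kept = [t for t in topics if not single.intersection(t)]

# Step 3: multi-hot encode the remaining samples over the sorted label set
classes = sorted(set(chain.from_iterable(kept)))
multi_hot = [[1.0 if c in t else 0.0 for c in classes] for t in kept]

print(single)
print(classes)
print(multi_hot)
```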

In [6]:
# Find and Remove Single Appearance Labels
print("Finding single appearance labels...")
y_train = [item['topics'] for item in dataset['train']]
single_appearance_labels = find_single_appearance_labels(y_train)
print(f"Single appearance labels: {single_appearance_labels}")

print("Removing samples with single-appearance labels...")
dataset = remove_single_appearance_labels(dataset, single_appearance_labels)

Finding single appearance labels...
Single appearance labels: ['lin-oil', 'rye', 'red-bean', 'groundnut-oil', 'citruspulp', 'rape-meal', 'corn-oil', 'peseta', 'cotton-oil', 'ringgit', 'castorseed', 'castor-oil', 'lit', 'rupiah', 'skr', 'nkr', 'dkr', 'sun-meal', 'lin-meal', 'cruzado']
Removing samples with single-appearance labels...


In [7]:
print("Combine title and text together")
dataset = dataset.map(
    lambda x: {"text": x["title"] + " " + x["text"]}
)

Combine title and text together


In [8]:
dataset

DatasetDict({
    test: Dataset({
        features: ['text', 'text_type', 'topics', 'lewis_split', 'cgis_split', 'old_id', 'new_id', 'places', 'people', 'orgs', 'exchanges', 'date', 'title'],
        num_rows: 3292
    })
    train: Dataset({
        features: ['text', 'text_type', 'topics', 'lewis_split', 'cgis_split', 'old_id', 'new_id', 'places', 'people', 'orgs', 'exchanges', 'date', 'title'],
        num_rows: 9588
    })
    unused: Dataset({
        features: ['text', 'text_type', 'topics', 'lewis_split', 'cgis_split', 'old_id', 'new_id', 'places', 'people', 'orgs', 'exchanges', 'date', 'title'],
        num_rows: 722
    })
})

### Create train/validation sets

In [9]:
from skmultilearn.model_selection import iterative_train_test_split
from scipy.sparse import csr_matrix

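# iterative_train_test_split indexes rows of a 2-D array, so wrap each text in its own row
X = np.array(dataset["train"]["text"]).reshape(-1, 1)

# Encode the topic lists as a multi-hot matrix for stratification
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(dataset["train"]["topics"])
y_sparse = csr_matrix(y)

# Hold out half of the original training split as the validation set
X_train, y_train, X_val, y_val = iterative_train_test_split(X, y_sparse, test_size=0.5)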

In [10]:
y_train = mlb.inverse_transform(y_train)
y_val = mlb.inverse_transform(y_val)

y_train = [list(tup) for tup in y_train]
y_val = [list(tup) for tup in y_val]

# Convert to Python list of strings
X_train = [item[0] for item in X_train.tolist()]
X_val = [item[0] for item in X_val.tolist()]

In [11]:
X_train[:2], y_train[:2]

(["USX &lt;X> DEBT DOWGRADED BY MOODY'S Moody's Investors Service Inc said it\nlowered the debt and preferred stock ratings of USX Corp and\nits units. About seven billion dlrs of securities is affected.\n    Moody's said Marathon Oil Co's recent establishment of up\nto one billion dlrs in production payment facilities on its\nprolific Yates Field has significant negative implications for\nUSX's unsecured creditors.\n    The company appears to have positioned its steel segment\nfor a return to profit by late 1987, Moody's added.\n    Ratings lowered include those on USX's senior debt to BA-1\nfrom BAA-3.\n Reuter\n",
  'CHAMPION PRODUCTS &lt;CH> APPROVES STOCK SPLIT Champion Products Inc said its\nboard of directors approved a two-for-one stock split of its\ncommon shares for shareholders of record as of April 1, 1987.\n    The company also said its board voted to recommend to\nshareholders at the annual meeting April 23 an increase in the\nauthorized capital stock from five mln to 25 

In [12]:
from datasets import Dataset

train_dataset = Dataset.from_dict({"text": X_train, "topics": y_train})
val_dataset = Dataset.from_dict({"text": X_val, "topics": y_val})

In [13]:
dataset["train"] = train_dataset
dataset["validation"] = val_dataset

In [14]:
dataset

DatasetDict({
    test: Dataset({
        features: ['text', 'text_type', 'topics', 'lewis_split', 'cgis_split', 'old_id', 'new_id', 'places', 'people', 'orgs', 'exchanges', 'date', 'title'],
        num_rows: 3292
    })
    train: Dataset({
        features: ['text', 'topics'],
        num_rows: 4794
    })
    unused: Dataset({
        features: ['text', 'text_type', 'topics', 'lewis_split', 'cgis_split', 'old_id', 'new_id', 'places', 'people', 'orgs', 'exchanges', 'date', 'title'],
        num_rows: 722
    })
    validation: Dataset({
        features: ['text', 'topics'],
        num_rows: 4794
    })
})

### Sanity check on the label ratio in train/val set

The split looks balanced: each label's count is nearly identical in the training and validation sets. Only three labels appear in the validation set but are missing from the training set. This is negligible because each of those labels occurs fewer than three times in the validation set.
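
The same check can be condensed into two numbers: the labels missing from training and the largest count gap between the two sets. The helper below is a hypothetical sketch (`compare_label_counts` is not part of the notebook), shown on made-up topic lists.

```python
from collections import Counter
from itertools import chain


def compare_label_counts(train_topics, val_topics):
    """Return labels absent from training and the largest per-label count gap."""
    train_counts = Counter(chain.from_iterable(train_topics))
    val_counts = Counter(chain.from_iterable(val_topics))
    all_labels = set(train_counts) | set(val_counts)
    missing_in_train = sorted(set(val_counts) - set(train_counts))
    # Counter returns 0 for absent labels, so the subtraction is always defined
    max_gap = max(abs(train_counts[lab] - val_counts[lab]) for lab in all_labels)
    return missing_in_train, max_gap


# Toy topic lists (made up for illustration)
toy_train = [["earn"], ["earn", "grain"], ["grain"]]
toy_val = [["earn"], ["earn"], ["grain", "linseed"]]

missing, gap = compare_label_counts(toy_train, toy_val)
print(missing, gap)
```

In the notebook, the two arguments would be `dataset["train"]["topics"]` and `dataset["validation"]["topics"]`.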

In [15]:
label_count_train = Counter(list(chain.from_iterable(dataset["train"]["topics"])))
label_count_validation = Counter(list(chain.from_iterable(dataset["validation"]["topics"])))

unique_labels = set(label_count_validation.keys())

In [16]:
print(f'{"label":<15} - {"TRAIN ":^10} : {"VAL":^10}')
print("*"*50)
for label in unique_labels:
    print(f'{label:<15} - {label_count_train.get(label, "MISSING"):^10} : {label_count_validation.get(label, "MISSING"):^10}')

label           -   TRAIN    :    VAL    
**************************************************
money-fx        -    267     :    266    
l-cattle        -     3      :     3     
wpi             -     9      :     10    
gas             -     19     :     18    
wheat           -    105     :    105    
trade           -    184     :    185    
gnp             -     50     :     50    
palladium       -     1      :     1     
linseed         -  MISSING   :     1     
housing         -     8      :     8     
tea             -     5      :     4     
wool            -     1      :     1     
heat            -     7      :     7     
oat             -     4      :     3     
nat-gas         -     38     :     37    
soybean         -     37     :     37    
pork-belly      -     1      :     2     
interest        -    173     :    174    
jet             -     2      :     2     
cpu             -     1      :     2     
naphtha         -     1      :     1     
cornglutenfeed  -  MISSIN

In [17]:
# Check number of unique labels 
unique_labels = set(chain.from_iterable(dataset['train']["topics"]))
print(f"We have {len(unique_labels)} unique labels:\n{unique_labels}")

# Transform topics into multi-hot encoding format
mlb = MultiLabelBinarizer()
mlb.fit(dataset['train']['topics'])
dataset = dataset.map(
    lambda x: {"labels": torch.from_numpy(mlb.transform(x["topics"])).float()}, batched=True)

labels = mlb.classes_
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
num_labels = len(id2label)

assert num_labels == len(unique_labels) 

We have 92 unique labels:
{'money-fx', 'l-cattle', 'wpi', 'gas', 'wheat', 'gnp', 'trade', 'palladium', 'housing', 'tea', 'heat', 'wool', 'oat', 'nat-gas', 'soybean', 'pork-belly', 'interest', 'jet', 'cpu', 'naphtha', 'plywood', 'cocoa', 'propane', 'orange', 'retail', 'cotton', 'groundnut', 'sugar', 'zinc', 'palmkernel', 'tin', 'veg-oil', 'lumber', 'lei', 'grain', 'dmk', 'bop', 'yen', 'money-supply', 'dlr', 'earn', 'pet-chem', 'crude', 'nickel', 'soy-meal', 'platinum', 'saudriyal', 'tapioca', 'potato', 'rand', 'stg', 'dfl', 'silver', 'rice', 'sun-oil', 'reserves', 'cpi', 'copra-cake', 'copper', 'livestock', 'rape-oil', 'strategic-metal', 'corn', 'fishmeal', 'jobs', 'rapeseed', 'hog', 'acq', 'fuel', 'rubber', 'oilseed', 'sunseed', 'ipi', 'instal-debt', 'lead', 'alum', 'can', 'meal-feed', 'coconut', 'palm-oil', 'income', 'iron-steel', 'sorghum', 'ship', 'coconut-oil', 'barley', 'inventories', 'gold', 'austdlr', 'carcass', 'soy-oil', 'coffee'}


Map: 100%|██████████| 3292/3292 [00:00<00:00, 54491.90 examples/s]
Map: 100%|██████████| 4794/4794 [00:00<00:00, 102420.47 examples/s]
Map: 100%|██████████| 722/722 [00:00<00:00, 39493.56 examples/s]
Map: 100%|██████████| 4794/4794 [00:00<00:00, 104085.75 examples/s]


In [18]:
# sanity check:
for idx, label in id2label.items():
    if idx>=10:
        break
    
    print(f"{idx}: {label}")

0: acq
1: alum
2: austdlr
3: barley
4: bop
5: can
6: carcass
7: cocoa
8: coconut
9: coconut-oil


In [19]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

# Tokenize and remove unwanted columns
def tokenize_function(example):
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=512)

columns = dataset["train"].column_names
columns.remove("text")
columns.remove("labels")
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=columns)

Map: 100%|██████████| 3292/3292 [00:01<00:00, 3289.53 examples/s]
Map: 100%|██████████| 4794/4794 [00:01<00:00, 4302.52 examples/s]
Map: 100%|██████████| 722/722 [00:00<00:00, 3308.04 examples/s]
Map: 100%|██████████| 4794/4794 [00:01<00:00, 4175.98 examples/s]


In [20]:
tokenized_dataset 

DatasetDict({
    test: Dataset({
        features: ['text', 'text_type', 'lewis_split', 'cgis_split', 'old_id', 'new_id', 'places', 'people', 'orgs', 'exchanges', 'date', 'title', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 3292
    })
    train: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 4794
    })
    unused: Dataset({
        features: ['text', 'text_type', 'lewis_split', 'cgis_split', 'old_id', 'new_id', 'places', 'people', 'orgs', 'exchanges', 'date', 'title', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 722
    })
    validation: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 4794
    })
})

In [21]:
example = tokenized_dataset['train'][1]
print(example.keys())

dict_keys(['text', 'labels', 'input_ids', 'attention_mask'])


In [22]:
tokenizer.decode(example['input_ids'])

'[CLS] CHAMPION PRODUCTS & lt ; CH > APPROVES STOCK SPLIT Champion Products Inc said its board of directors approved a two - for - one stock split of its common shares for shareholders of record as of April 1, 1987. The company also said its board voted to recommend to shareholders at the annual meeting April 23 an increase in the authorized capital stock from five mln to 25 mln shares. Reuter [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PA

In [23]:
print(example['labels'])

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]


In [24]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

['earn']

In [25]:
tokenized_dataset.set_format("torch")

In [26]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-cased", 
        num_labels=num_labels, 
        problem_type="multi_label_classification",
        id2label=id2label,
        label2id=label2id
    )

### Model Tuning

In this example, we tune only the learning rate. Using Optuna, we search for the best learning rate between 2e-5 and 5e-5, sampled on a log scale, and score each trial by its F1 on the validation set (`eval_f1`). For demonstration purposes, we limit training to 5 epochs and set `n_trials` to 2. Increase both for better results, and consider adding other hyperparameters to the search space.
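
Conceptually, sampling on a log scale means drawing uniformly between the logarithms of the bounds and exponentiating the result. The sketch below illustrates that idea with the standard library only; `sample_log_uniform` is a hypothetical helper, not Optuna's sampler, which may use more elaborate strategies once it has observed some trials.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [sample_log_uniform(2e-5, 5e-5, rng) for _ in range(1000)]

# Every draw stays inside the search range used in this notebook
print(min(samples), max(samples))
```

With a range this narrow (a factor of 2.5), log-scale and linear sampling behave similarly; the difference matters more when the bounds span several orders of magnitude.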

In [27]:
def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 2e-5, 5e-5, log=True),
    }

In [28]:
args = TrainingArguments(
    "hyperparameter-search-distilbert-reuters21578",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    logging_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=5,
    save_total_limit=2,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="eval_f1",
    greater_is_better=True,
)

In [29]:
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [30]:
# Optuna hyperparameter search
best_trial = trainer.hyperparameter_search(
    direction="maximize",
    backend="optuna",
    compute_objective=lambda x: x["eval_f1"],
    hp_space=optuna_hp_space,
    n_trials=2,
)

[I 2023-09-08 19:56:18,700] A new study created in memory with name: no-name-b48b9cc9-5c4c-4b5b-b438-a4e3b69b0729
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|
| 1 | 0.356 | 0.132036 | 0.0 | 0.5 | 0.189612 |
| 2 | 0.0837 | 0.055436 | 0.0 | 0.5 | 0.189612 |
| 3 | 0.0522 | 0.045973 | 0.061551 | 0.515876 | 0.221318 |
| 4 | 0.046 | 0.042393 | 0.329145 | 0.598496 | 0.386316 |
| 5 | 0.0434 | 0.041176 | 0.341244 | 0.602882 | 0.395077 |


[I 2023-09-08 20:06:17,398] Trial 0 finished with value: 0.34124372076909754 and parameters: {'learning_rate': 3.262125845083782e-05}. Best is trial 0 with value: 0.34124372076909754.
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|
| 1 | 0.3274 | 0.102568 | 0.0 | 0.5 | 0.189612 |
| 2 | 0.0687 | 0.049549 | 0.0 | 0.5 | 0.189612 |
| 3 | 0.0477 | 0.042749 | 0.300976 | 0.588573 | 0.3665 |
| 4 | 0.0427 | 0.039329 | 0.367416 | 0.612593 | 0.414059 |
| 5 | 0.0401 | 0.038065 | 0.375805 | 0.615828 | 0.420108 |


[I 2023-09-08 20:16:15,006] Trial 1 finished with value: 0.37580481192815995 and parameters: {'learning_rate': 3.77934555883044e-05}. Best is trial 1 with value: 0.37580481192815995.


In [31]:
best_trial

BestRun(run_id='1', objective=0.37580481192815995, hyperparameters={'learning_rate': 3.77934555883044e-05}, run_summary=None)
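
A common follow-up is to copy the winning hyperparameters onto the training arguments and retrain. The sketch below shows that pattern with stand-ins so it runs without `transformers`: `BestRunStandIn` and `args_standin` are hypothetical substitutes for the `BestRun` result above and for `trainer.args`, and `apply_best_hyperparameters` is an illustrative helper, not a library function.

```python
from types import SimpleNamespace
from typing import Any, Dict, NamedTuple, Optional


class BestRunStandIn(NamedTuple):
    """Stand-in with the same fields as the BestRun shown above."""
    run_id: str
    objective: float
    hyperparameters: Dict[str, Any]
    run_summary: Optional[Any] = None


def apply_best_hyperparameters(args, best_run):
    """Overwrite matching attributes on args with the best trial's values."""
    for name, value in best_run.hyperparameters.items():
        setattr(args, name, value)
    return args


best = BestRunStandIn("1", 0.37580481192815995, {"learning_rate": 3.77934555883044e-05})
args_standin = SimpleNamespace(learning_rate=2e-5, num_train_epochs=5)

apply_best_hyperparameters(args_standin, best)
print(args_standin.learning_rate)
```

In the notebook, the same loop would presumably target `trainer.args` with `best_trial`, followed by a call to `trainer.train()`.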