# Chapter 9: Training with Little/No Data

In [None]:
import pandas as pd

dataset_url = "https://git.io/nlp-with-transformers"
df_issues = pd.read_json(dataset_url, lines=True)
print(f"DataFrame shape: {df_issues.shape}")

In [None]:
cols = ["url", "id", "title", "user", "labels", "state", "created_at", "body"]
df_issues.loc[2, cols].to_frame()

In [None]:
# the labels column holds a list of JSON objects with metadata on each label
df_issues.loc[1, "labels"]

In [None]:
# overwrite labels with just the names, as that's all we're interested in
df_issues["labels"] = (df_issues["labels"].apply(lambda x: [meta["name"] for meta in x]))
df_issues[["labels"]].head()

In [None]:
# compute length of each row to find num of labels per issue
df_issues["labels"].apply(lambda x: len(x)).value_counts().to_frame().T

The majority of issues have zero or one label.

In [None]:
# look at the most frequent labels in the dataset
df_counts = df_issues["labels"].explode().value_counts()
print(f"Number of labels: {len(df_counts)}")
# top-8 label categories
df_counts.to_frame().head(8).T

There is a large class imbalance here. We'll narrow the classification problem to a tagger for a subset of the labels, since some labels, such as "good first issue" or "help wanted", are very difficult to predict from the issue description.

In [None]:
label_map = {"Core: Tokenization": "tokenization",
 "New model": "new model",
 "Core: Modeling": "model training",
 "Usage": "usage",
 "Core: Pipeline": "pipeline",
 "TensorFlow": "tensorflow or tf",
 "PyTorch": "pytorch",
 "Examples": "examples",
 "Documentation": "documentation"}

def filter_labels(x):
    return [label_map[label] for label in x if label in label_map]

In [None]:
df_issues["labels"] = df_issues["labels"].apply(filter_labels)
all_labels = list(label_map.values())

In [None]:
df_counts = df_issues["labels"].explode().value_counts()
df_counts.to_frame().T

In [None]:
# create new col for labeled or not
df_issues["split"] = "unlabeled"
mask = df_issues["labels"].apply(lambda x: len(x)) > 0
df_issues.loc[mask, "split"] = "labeled"
df_issues["split"].value_counts().to_frame()

In [None]:
for col in ["title", "body", "labels"]:
    print(f"{col}: {df_issues[col].iloc[26][:500]}\n")

In [None]:
# concat title with body
df_issues["text"] = (df_issues.apply(lambda x: x["title"] + "\n\n" + x["body"], axis=1))

In [None]:
len_before = len(df_issues)
# remove duplicates
df_issues = df_issues.drop_duplicates(subset="text")
print(f"Removed {(len_before - len(df_issues)) / len_before:.2%} duplicates.")

In [None]:
import numpy as np
import matplotlib.pyplot as plt
(df_issues["text"].str.split().apply(len).hist(bins=np.linspace(0,500,50),grid=False,edgecolor="C0"))
plt.title("Words per issue")
plt.xlabel("Number of words")
plt.ylabel("Number of issues")
plt.show()

Hopefully truncating issues that exceed the model's 512-token context size won't have too much of an impact.

## Creating Training Sets

Multilabel problems are slightly trickier because there is no guaranteed balance for all labels. We can approximate one, however, with the scikit-multilearn library.

First we need to transform our set of labels into a form that the model can process. Use `MultiLabelBinarizer`, which takes a list of label names and creates a vector with 0s for absent labels and 1s for present labels.

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit([all_labels])  # 1s correspond to present labels
mlb.transform([["tokenization", "new model"], ["pytorch"]])

Use the `iterative_train_test_split()` function, which creates train/test splits iteratively to achieve balanced labels. We wrap it in a function that we can apply to DataFrames. As it expects a 2D feature matrix, we add a dimension to the possible indices before splitting.

In [None]:
from skmultilearn.model_selection import iterative_train_test_split

def balanced_split(df, test_size=0.5):
    # add dimension
    ind = np.expand_dims(np.arange(len(df)), axis=1)
    # get labels
    labels = mlb.transform(df["labels"])
    # iteratively split
    ind_train, _, ind_test, _ = iterative_train_test_split(ind, labels, test_size)
    
    return df.iloc[ind_train[:, 0]], df.iloc[ind_test[:, 0]]

In [None]:
from sklearn.model_selection import train_test_split

# split to supervised and unsupervised datasets
df_clean = df_issues[["text", "labels", "split"]].reset_index(drop=True).copy()
df_unsup = df_clean.loc[df_clean["split"] == "unlabeled", ["text", "labels"]]
df_sup = df_clean.loc[df_clean["split"] == "labeled", ["text", "labels"]]

# then create balanced training, val and test for supervised part
np.random.seed(0)
df_train, df_tmp = balanced_split(df_sup, test_size=0.5)
df_valid, df_test = balanced_split(df_tmp, test_size=0.5)

In [None]:
from datasets import Dataset, DatasetDict

# create DatasetDict with splits to easily tokenize dataset
# and integrate with Trainer
ds = DatasetDict({
    "train": Dataset.from_pandas(df_train.reset_index(drop=True)),
    "valid": Dataset.from_pandas(df_valid.reset_index(drop=True)),
    "test": Dataset.from_pandas(df_test.reset_index(drop=True)),
    "unsup": Dataset.from_pandas(df_unsup.reset_index(drop=True))
})

### Create Training Slices

We'd like to investigate sparse labeled data and multilabel classification. There are only 220 examples to train with in the training set. Start with eight samples per label and build up until the slice covers the full training set, using `iterative_train_test_split()`.
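
The loop in the next cell asks the splitter for a *fraction* of the remaining pool rather than an absolute count, and the pool shrinks after every step. The arithmetic can be sketched as follows, assuming each split returns exactly the requested number of samples (the real splitter only approximates this). The helper name `slice_fractions` is made up for illustration:

```python
def slice_fractions(n_total, targets):
    """Return (target_size, fraction_requested, pool_size_before) for each step."""
    steps, pool, last_k = [], n_total, 0
    for k in targets:
        need = k - last_k                 # samples missing to reach the next size
        steps.append((k, need / pool, pool))
        pool -= need                      # those samples leave the pool
        last_k = k
    return steps


for k, frac, pool in slice_fractions(220, [8, 16, 32, 64, 128]):
    print(f"target {k:>3}: request {frac:.3f} of a pool of {pool}")
```

Because each new slice is concatenated with the previous one, the slices are nested: the 16-sample slice contains the 8-sample slice, and so on.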

In [None]:
np.random.seed(0)
all_indices = np.expand_dims(list(range(len(ds["train"]))), axis=1)
indices_pool = all_indices
labels = mlb.transform(ds["train"]["labels"])
train_samples = [8, 16, 32, 64, 128]
train_slices, last_k = [], 0

for i, k in enumerate(train_samples):
    # split samples necessary to fill the gap to the next split size
    indices_pool, labels, new_slice, _ = iterative_train_test_split(
        indices_pool, labels, (k-last_k)/len(labels)
    )
    last_k = k
    if i==0: train_slices.append(new_slice)
    else: train_slices.append(np.concatenate((train_slices[-1], new_slice)))
        
# add full dataset as last slice
train_slices.append(all_indices)
train_samples.append(len(ds["train"]))
train_slices = [np.squeeze(train_slice) for train_slice in train_slices]

We can see that the splits approximately match the desired sizes.

In [None]:
print("Target split sizes:")
print(train_samples)
print("Actual split sizes:")
print([len(x) for x in train_slices])

## Implementing a Naive Baseline


Why?
1. A simple baseline based on regular expressions or rules may already solve the problem really well, in which case there is no need to introduce big models like transformers.
2. Baselines provide quick checks as we explore more complex models and give us an understanding of the results to expect.
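
To make the first point concrete, here is a minimal sketch of a rule-based tagger. The keyword patterns are hypothetical and were not derived from the dataset; only the label names come from `label_map`:

```python
import re

# Hypothetical keyword patterns, one per label (assumptions for illustration).
RULES = {
    "tokenization": re.compile(r"\btokeniz\w*", re.IGNORECASE),
    "pytorch": re.compile(r"\b(?:pytorch|torch)\b", re.IGNORECASE),
    "tensorflow or tf": re.compile(r"\b(?:tensorflow|tf)\b", re.IGNORECASE),
    "documentation": re.compile(r"\b(?:docs?|documentation|docstring)\b", re.IGNORECASE),
}


def rule_based_tagger(text):
    """Return every label whose pattern matches somewhere in the text."""
    return [label for label, pattern in RULES.items() if pattern.search(text)]


print(rule_based_tagger("Tokenizer crashes when loading a PyTorch checkpoint"))
```

A tagger like this needs no training data at all, which makes it a useful sanity check before reaching for a learned model.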

Naive Bayes is a good baseline for text classification as it is simple, quick to train and fairly robust.

Use scikit-multilearn to cast the problem as one-vs-rest classification.

In [None]:
def prepare_labels(batch):
    batch["label_ids"] = mlb.transform(batch["labels"])
    return batch

ds = ds.map(prepare_labels, batched=True)

In [None]:
# create defaultdict with list to store scores per split

from collections import defaultdict

macro_scores, micro_scores = defaultdict(list), defaultdict(list)

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.feature_extraction.text import CountVectorizer

for train_slice in train_slices:
    
    # get training slice and test data
    ds_train_sample = ds["train"].select(train_slice)
    y_train = np.array(ds_train_sample["label_ids"])
    y_test = np.array(ds["test"]["label_ids"])
    
    # use simple count vectoriser to encode texts as token counts (BoW approach)
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(ds_train_sample["text"])
    X_test_counts = count_vect.transform(ds["test"]["text"])
    
    # create and train our model
    classifier = BinaryRelevance(classifier=MultinomialNB())
    classifier.fit(X_train_counts, y_train)
    
    # generate predictions and evaluate
    y_pred_test = classifier.predict(X_test_counts)
    clf_report = classification_report(
        y_test, y_pred_test, target_names=mlb.classes_, zero_division=0,
        output_dict=True
    )
    
    # Store metrics (micro and macro F1 scores)
    macro_scores["Naive Bayes"].append(clf_report["macro avg"]["f1-score"])
    micro_scores["Naive Bayes"].append(clf_report["micro avg"]["f1-score"])

In [None]:
# plot results from experiment
import matplotlib.pyplot as plt

def plot_metrics(micro_scores, macro_scores, sample_sizes, current_model):
    fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(10,4), sharey=True)
    
    for run in micro_scores.keys():
        if run == current_model:
            ax0.plot(sample_sizes, micro_scores[run], label=run, linewidth=2)
            ax1.plot(sample_sizes, macro_scores[run], label=run, linewidth=2)
        else:
            ax0.plot(sample_sizes, micro_scores[run], label=run, linestyle="dashed")
            ax1.plot(sample_sizes, macro_scores[run], label=run, linestyle="dashed")
    
    ax0.set_title("Micro F1 scores")
    ax1.set_title("Macro F1 scores")
    ax0.set_ylabel("Test set F1 score")
    ax0.legend(loc="lower right")
    
    for ax in [ax0, ax1]:
        ax.set_xlabel("Number of training samples")
        ax.set_xscale("log")
        ax.set_xticks(sample_sizes)
        ax.set_xticklabels(sample_sizes)
        ax.minorticks_off()
    plt.tight_layout()
    plt.show()
    
plot_metrics(micro_scores, macro_scores, train_samples, "Naive Bayes")

Note that we plot the number of samples on a log scale. We can see that the micro and macro F1 scores improve as we increase the number of training samples.

With so few samples to train on, the results are also slightly noisy since each slice can have a different class distribution. What's important here though is the trend!

Now let's have a look at transformer-based approaches.

### Working with No Labeled Data

First technique: zero-shot classification, suitable for settings with no labeled data at all.

First load BERT-base in the fill-mask pipeline, which uses the masked language model to predict the content of masked tokens.

In [None]:
from transformers import pipeline

pipe = pipeline("fill-mask", model="bert-base-uncased")

In [None]:
movie_desc = "The main characters of the movie madagascar are a lion, a zebra, a giraffe, and a hippo. "
prompt = "The movie is about [MASK]."

output = pipe(movie_desc + prompt)
for element in output:
    print(f"Token {element['token_str']}: \t{element['score']:.3f}%")

We can also query the pipeline for the probability of a few given tokens.

In [None]:
output = pipe(movie_desc + prompt, targets=["animals", "cars"])
for element in output:
    print(f"Token {element['token_str']}: \t {element['score']:.3f}%")

Can also try on a description closer to cars.

In [None]:
movie_desc = "In the movie transformers aliens can morph into a wide range of vehicles. "

output = pipe(movie_desc + prompt, targets=["animals", "cars"])
for element in output:
    print(f"Token {element['token_str']}: \t {element['score']:.3f}%")

It works! Let's see if we can do better by adapting a model that has been fine-tuned on a task closer to text classification: *natural language inference (NLI)*.

There is a proxy task called *text entailment*, where the model needs to determine whether two passages are likely to follow or contradict each other. 

Each sample has three parts: a premise, a hypothesis, and a label, which can be entailment, neutral or contradiction.

We can hijack a model trained on MNLI dataset to build a classifier without needing any labels at all! The trick is to treat the text we wish to classify as the premise then formulate the hypothesis as:

`"This example is about {label}."`

Here we insert the class name for `{label}`. The entailment score then tells us how likely the premise is to be about that topic, and we can run this for any number of classes sequentially. Downside: we need to execute a forward pass for each class, making it less efficient than a standard classifier.

Another tricky aspect is that the choice of label names can have a large impact on accuracy; choosing labels with a semantic meaning is generally the best approach.
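
The premise/hypothesis construction can be sketched without any model. The helper name `build_nli_pairs` and the example issue text are made up; the template is the one given above:

```python
def build_nli_pairs(text, labels, template="This example is about {label}."):
    """Pair one premise with one hypothesis per candidate label."""
    return [(text, template.format(label=label)) for label in labels]


pairs = build_nli_pairs(
    "Add a fast tokenizer for the new model",
    ["tokenization", "new model", "documentation"],
)
for premise, hypothesis in pairs:
    print(f"{premise!r} -> {hypothesis!r}")
```

Each pair corresponds to one forward pass through the NLI model, which is where the per-class cost mentioned above comes from.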

In [None]:
# transformers has an MNLI model for zero-shot classification built in
# we can initialise it via pipeline as follows
from transformers import pipeline

# device=0 to run model on GPU instead of CPU
pipe = pipeline("zero-shot-classification", device=0)

In [None]:
sample = ds["train"][0]
print(f"Labels: {sample['labels']}")

# multi_label=True scores every label independently instead of picking the single best label
output = pipe(sample["text"], all_labels, multi_label=True)
print(output["sequence"][:400])
print("\nPredictions:")
for label, score in zip(output["labels"], output["scores"]):
    print(f"{label}, {score:.2f}")

> Since we are using a subword tokenizer, we can even pass code to the model! Tokenization may not be efficient, as only a small fraction of the pretraining dataset consists of code snippets, but since code contains lots of natural language this is not an issue!

> Also, a code block may contain important information, such as the framework (PyTorch or TensorFlow).

The model is confident this text is about a new model, but it also gives relatively high scores to other labels. This is quite a challenging task for the model, as the text is quite technical and about coding.

In [None]:
# function to feed single example through zero-shot pipeline
# scale out to whole validation set by running map()
def zero_shot_pipeline(example):
    output = pipe(example["text"], all_labels, multi_label=True)
    example["predicted_labels"] = output["labels"]
    example["scores"] = output["scores"]
    return example

ds_zero_shot = ds["valid"].map(zero_shot_pipeline)

With the scores in hand, the next step is to determine which set of labels should be assigned to each example. A few options we can experiment with:
- Define threshold and select all labels above the threshold
- Pick top-k labels with the k highest scores

In [None]:
# applies one of approaches to retrieve predictions
def get_preds(example, threshold=None, topk=None):
    preds = []
    if threshold:
        for label, score in zip(example["predicted_labels"], example["scores"]):
            if score >= threshold:
                preds.append(label)
    elif topk:
        for i in range(topk):
            preds.append(example["predicted_labels"][i])
    else:
        raise ValueError("Set either `threshold` or `topk`.")
    
    return {"pred_label_ids": list(np.squeeze(mlb.transform([preds])))}

In [None]:
# returns sklearn clf report from dataset with predicted labels
def get_clf_report(ds):
    y_true = np.array(ds["label_ids"])
    y_pred = np.array(ds["pred_label_ids"])
    return classification_report(
        y_true, y_pred, target_names=mlb.classes_, zero_division=0, output_dict=True
    )

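- Macro averaging computes the metric per class and then averages, so it weighs each class equally
- Micro averaging pools the decisions of all classes before computing the metric, so it weighs each sample equally

If we have an equal number of samples per class, the micro and macro scores will be similar.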
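
A small worked example shows how the two averages diverge under class imbalance. The counts below are made up for illustration:

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up (tp, fp, fn) counts for a frequent and a rare class.
counts = {"frequent": (90, 10, 10), "rare": (1, 4, 4)}

# Macro: average the per-class F1 scores.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts over all classes, then compute a single F1.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

print(f"macro F1: {macro_f1:.3f}, micro F1: {micro_f1:.3f}")
```

The rare class drags the macro score down much more than the micro score, which is why both are tracked throughout these notes.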

In [None]:
macros, micros = [], []
topks = [1, 2, 3, 4]
for topk in topks:
    ds_zero_shot = ds_zero_shot.map(get_preds, batched=False, fn_kwargs={'topk':topk})
    clf_report = get_clf_report(ds_zero_shot)
    micros.append(clf_report['micro avg']['f1-score'])
    macros.append(clf_report['macro avg']['f1-score'])
    
plt.plot(topks, micros, label='Micro F1')
plt.plot(topks, macros, label='Macro F1')
plt.xlabel('Top-K')
plt.ylabel('F1-score')
plt.legend(loc='best')
plt.show()

In [None]:
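# assumes `thresholds`, `micros` and `macros` come from a threshold sweep,
# i.e. the loop above run with fn_kwargs={'threshold': t} instead of top-k
best_t, best_micro = thresholds[np.argmax(micros)], np.max(micros)
print(f'Best threshold (micro): {best_t} with F1-score {best_micro:.2f}.')
best_t, best_macro = thresholds[np.argmax(macros)], np.max(macros)
print(f'Best threshold (macro): {best_t} with F1-score {best_macro:.2f}.')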

We can see the trade-off: a threshold that is too low produces too many predictions, leading to low precision; a threshold that is too high produces hardly any predictions, leading to low recall. A threshold around 0.8 is about the sweet spot between the two.
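
The trade-off can be reproduced on toy data. The scores and ground truth below are made up, and `precision_recall` is a hypothetical helper rather than part of the pipeline:

```python
def precision_recall(scores, truths, threshold):
    """Precision and recall when every score >= threshold counts as a positive."""
    tp = sum(s >= threshold and t for s, t in zip(scores, truths))
    fp = sum(s >= threshold and not t for s, t in zip(scores, truths))
    fn = sum(s < threshold and t for s, t in zip(scores, truths))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up entailment scores and whether the label really applies.
scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.40, 0.30, 0.10]
truths = [True, True, False, True, False, False, False, False]

for threshold in (0.2, 0.8, 0.99):
    p, r = precision_recall(scores, truths, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

With the low threshold nearly everything is predicted, so recall is perfect but precision suffers; with the very high threshold nothing is predicted at all.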

Since top-1 performs best, we use it to compare zero-shot classification against Naive Bayes on the test set:

In [None]:
ds_zero_shot = ds['test'].map(zero_shot_pipeline)
ds_zero_shot = ds_zero_shot.map(get_preds, fn_kwargs={'topk': 1})
clf_report = get_clf_report(ds_zero_shot)

for train_slice in train_slices:
    macro_scores['Zero Shot'].append(clf_report['macro avg']['f1-score'])
    micro_scores['Zero Shot'].append(clf_report['micro avg']['f1-score'])
    
plot_metrics(micro_scores, macro_scores, train_samples, "Zero Shot")

Observations:
- With fewer than 50 samples, zero-shot barely outperforms the baseline
- With more than 50 samples, the zero-shot performance is superior when considering both micro and macro F1-scores. The results for the micro F1-score tell us that the baseline performs well on the frequent classes, while the zero-shot pipeline excels at the less frequent ones since it doesn't require any examples to learn from.


> In a real use-case, it makes sense to gather a handful of labeled examples to do some quick evaluations. The important point is that we did not adapt the model parameters (fine-tune) with the data; instead, we only adapted some hyperparameters.


If you struggle to get good results on your own dataset, here are some things to try to improve the zero-shot pipeline:
- The pipeline is very sensitive to the names of the labels. If the names don't make much sense or are not easily connected to the texts, the pipeline will probably perform poorly. Either try using different names or try using several names in parallel and aggregate them in an extra step.
- You can improve the form of the hypothesis. This may improve performance depending on the use-case.
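
One possible form of the "several names in parallel" idea is to score a few aliases per label and keep the best score. The alias lists, the scores, and the choice of `max` as the aggregation are all assumptions for illustration:

```python
def aggregate_alias_scores(scores, aliases):
    """Collapse per-alias scores into one score per label by taking the maximum."""
    return {
        label: max(scores[name] for name in names)
        for label, names in aliases.items()
    }


# Hypothetical aliases and made-up entailment scores for a single issue.
aliases = {
    "tensorflow": ["tensorflow", "tf", "keras"],
    "tokenization": ["tokenization", "tokenizer"],
}
scores = {
    "tensorflow": 0.41, "tf": 0.12, "keras": 0.77,
    "tokenization": 0.30, "tokenizer": 0.55,
}

print(aggregate_alias_scores(scores, aliases))
```

Taking the mean instead of the maximum is an equally plausible choice; which works better would have to be checked on the validation set.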

## Working with a Few Labels

### Data Augmentation
Generate new training examples from existing ones by slightly perturbing them. This can be tricky, as perturbations can change the meaning of a text. For text there are generally two types:
- *Back Translation*: Take a text in the source language, translate it into another language using machine translation, and translate it back. Tends to work best for high-resource languages or corpora that do not contain too much domain-specific language
- *Token Perturbations*: Randomly choose and perform simple transformations on a text, such as random synonym replacement, word insertion, swap, or deletion

Here we will focus on synonym replacement, as it is simple to implement and gets across the main idea behind data augmentation.
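
For comparison, the two simplest token perturbations (swap and deletion) need no model at all. This is a dependency-free sketch with made-up helper names, not the `nlpaug` implementation:

```python
import random


def random_swap(tokens, rng):
    """Swap two randomly chosen positions (no-op for fewer than two tokens)."""
    tokens = list(tokens)
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, rng, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    tokens = list(tokens)
    if not tokens:
        return []
    kept = [tok for tok in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]


rng = random.Random(0)
tokens = "Transformers are the most popular toys".split()
print(" ".join(random_swap(tokens, rng)))
print(" ".join(random_deletion(tokens, rng, p=0.3)))
```

Both perturbations keep the label unchanged, which is only safe as long as the edit does not alter the meaning of the text.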

In [None]:
from transformers import set_seed
import nlpaug.augmenter.word as naw

set_seed(3)
# leverage contextual word embeddings of DistilBERT for synonym replacements
aug = naw.ContextualWordEmbsAug(
    model_path="distilbert-base-uncased", device="cpu", action="substitute"
)
text = "Transformers are the most popular toys"
print(f"Original text: {text}")
print(f"Augmented text: {aug.augment(text)}")

The augmenter replaces a word to generate a new synthetic training example.

In [None]:
# wrap augmentation in a simple function
def augment_text(batch, transformations_per_example=1):
    text_aug, label_ids = [], []
    for text, labels in zip(batch["text"], batch["label_ids"]):
        text_aug += [text]
        label_ids += [labels]
        for _ in range(transformations_per_example):
            text_aug += [aug.augment(text)]
            label_ids += [labels]
    return {"text": text_aug, "label_ids": label_ids}

In [None]:
# can generate any num of new examples 
# train NB clf with one line after we select slice
ds_train_sample = ds_train_sample.map(
    augment_text, batched=True, remove_columns=ds_train_sample.column_names
).shuffle(seed=42)

In [None]:
plot_metrics(micro_scores, macro_scores, train_samples, "Naive Bayes + Aug")

A small amount of data augmentation improves the $F_1$-score of NB by ~5 pts, and it overtakes zero-shot for macro scores after ~170 training samples!

Now look at method using embeddings of large language models.

### Using Embeddings as a Lookup Table

Because large language models can encode information, embeddings can be used as a semantic search engine, find similar documents/comments, or even classify text. It is usually a three step process:
1. Use the language model to embed all labeled texts
2. Run a nearest-neighbour (NN) search over the stored embeddings
3. Aggregate the labels of the nearest neighbours to get a prediction

It is important to calibrate the number of neighbours to search, as too few may be noisy and too many might mix in neighbouring groups.
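
The three steps can be sketched on toy data. The 2-D vectors stand in for real embeddings, the labels are made up, and a brute-force search replaces a proper index:

```python
import math
from collections import Counter


def nearest_neighbours(query, vectors, k):
    """Indices of the k stored vectors closest to the query (Euclidean distance)."""
    distances = [(math.dist(query, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(distances)[:k]]


def vote(neighbour_labels, m):
    """Keep every label that occurs in at least m of the neighbours."""
    counts = Counter(label for labels in neighbour_labels for label in labels)
    return sorted(label for label, n in counts.items() if n >= m)


# Step 1 (embedding) is replaced by made-up 2-D vectors with made-up labels.
train_vectors = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2), (5.0, 5.0), (5.1, 4.9)]
train_labels = [["pytorch"], ["pytorch", "usage"], ["pytorch"],
                ["documentation"], ["documentation"]]

# Steps 2 and 3: search, then aggregate the neighbours' labels.
idx = nearest_neighbours((0.1, 0.1), train_vectors, k=3)
print(vote([train_labels[i] for i in idx], m=2))
```

Raising `m` makes the prediction stricter (fewer labels survive the vote), while raising `k` pulls in more distant neighbours.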

Pros: No model fine-tuning is needed to leverage the few available data points. We just need to select an appropriate model pretrained on a domain similar to our dataset.

Use GPT-2 for this technique and mean-pool the token embeddings into a single vector; use the attention mask to make sure padding tokens are left out of the average.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModel

model_ckpt = "miguelvictor/python-gpt2-large"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

def mean_pooling(model_output, attention_mask):
    # Extract token embeddings
    token_embeddings = model_output[0]
    
    # expand attention mask to the shape of the token embeddings
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    # sum embeddings, ignore masked tokens
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    # return avg as a single vector
    return sum_embeddings / sum_mask

def embed_text(examples):
    # can get embeddings for each split with this fn
    inputs = tokenizer(examples["text"], padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    with torch.no_grad():
        model_output = model(**inputs)
    pooled_embeds = mean_pooling(model_output, inputs["attention_mask"])
    return {"embedding": pooled_embeds.cpu().numpy()}

GPT-style models don't have a padding token, so we need to add one before we can get embeddings in a batched fashion. We just recycle the end-of-sequence token for this.

In [None]:
tokenizer.pad_token = tokenizer.eos_token
embs_train = ds["train"].map(embed_text, batched=True, batch_size=16)
embs_valid = ds["valid"].map(embed_text, batched=True, batch_size=16)
embs_test = ds["test"].map(embed_text, batched=True, batch_size=16)

Use *FAISS* index to query embeddings. Create index.

In [None]:
embs_train.add_faiss_index("embedding")

Query nearest neighbours.

In [None]:
i, k = 0, 3 # select first query and 3 NN
rn, nl = "\r\n\r\n", "\n" # used to remove newlines in text for compact display

query = np.array(embs_valid[i]["embedding"], dtype=np.float32)
scores, samples = embs_train.get_nearest_examples("embedding", query, k=k)

print(f"QUERY LABELS: {embs_valid[i]['labels']}")
print(f"QUERY TEXT:\n{embs_valid[i]['text'][:200].replace(rn, nl)} [...]\n")
print("=" * 50)
print(f"Retrieved documents:")

for score, label, text in zip(scores, samples["labels"], samples["text"]):
    print("="*50)
    print(f"TEXT:\n{text[:200].replace(rn, nl)} [...]")
    print(f"SCORE: {score:.2f}")
    print(f"LABELS: {label}")

The retrieved documents are about adding new and efficient transformer models. Next question: what is the best value for k? Also, how should we aggregate the labels of the retrieved documents?

Try several values of *k*, and vary threshold m < k for label assignment. Record macro and micro performance for each so can decide later which run performed best. Instead of looping over each sample in validation set, we can make use of function `get_nearest_examples_batch()` which accepts a batch of queries:

In [None]:
def get_sample_preds(sample, m):
    return (np.sum(sample["label_ids"], axis=0) >= m).astype(int)

In [None]:
def find_best_k_m(ds_train, valid_queries, valid_labels, max_k=17):
    max_k = min(len(ds_train), max_k)
    perf_micro = np.zeros((max_k, max_k))
    perf_macro = np.zeros((max_k, max_k))
    for k in range(1, max_k):
        for m in range(1, k+1):
            _, samples = ds_train.get_nearest_examples_batch(
                "embedding", valid_queries, k=k)
            y_pred = np.array([get_sample_preds(s, m) for s in samples])
            clf_report = classification_report(
                valid_labels, y_pred, target_names=mlb.classes_, 
                zero_division=0, output_dict=True)
            perf_micro[k, m] = clf_report["micro avg"]["f1-score"]
            perf_macro[k, m] = clf_report["macro avg"]["f1-score"]
    return perf_micro, perf_macro

Find best values with all training samples and visualise scores for all *k* and *m* configurations.

In [None]:
valid_labels = np.array(embs_valid["label_ids"])
valid_queries = np.array(embs_valid["embedding"], dtype=np.float32)
perf_micro, perf_macro = find_best_k_m(embs_train, valid_queries, valid_labels)

fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(10, 3.5), sharey=True)
ax0.imshow(perf_micro)
ax1.imshow(perf_macro)

ax0.set_title("micro scores")
ax0.set_ylabel("k")
ax1.set_title("macro scores")
for ax in [ax0, ax1]:
    ax.set_xlim([0.5, 17 - 0.5])
    ax.set_ylim([17 - 0.5, 0.5])
    ax.set_xlabel("m")
plt.show()

Optimal ratio of $m/k = 1/3$

In [None]:
k, m = np.unravel_index(perf_micro.argmax(), perf_micro.shape)
print(f"Best k: {k}, best m: {m}")

Best with k=15 and m=5; that is, we retrieve the 15 nearest neighbors and assign the labels that occurred at least 5 times.

Before we can slice the dataset, we must remove the FAISS index, as we cannot slice that. As with Naive Bayes, we then go through the slices of the training set and evaluate the performance on each.

In [None]:
embs_train.drop_index("embedding")
test_labels = np.array(embs_test["label_ids"])
test_queries = np.array(embs_test["embedding"], dtype=np.float32)

for train_slice in train_slices:
    # create FAISS index from training slice
    embs_train_tmp = embs_train.select(train_slice)
    embs_train_tmp.add_faiss_index("embedding")
    
    # get best k, m values with validation set
    perf_micro, _ = find_best_k_m(embs_train_tmp, valid_queries, valid_labels)
    k, m = np.unravel_index(perf_micro.argmax(), perf_micro.shape)
    
    # get predictions on test set
    _, samples = embs_train_tmp.get_nearest_examples_batch(
        "embedding", test_queries, k=int(k))
    y_pred = np.array([get_sample_preds(s, m) for s in samples])
    
    # evaluate predictions
    clf_report = classification_report(
        test_labels, y_pred, target_names=mlb.classes_, zero_division=0, output_dict=True
    )
    
    macro_scores["Embedding"].append(clf_report["macro avg"]["f1-score"])
    micro_scores["Embedding"].append(clf_report["micro avg"]["f1-score"])
    
plot_metrics(micro_scores, macro_scores, train_samples, "Embedding")

Embedding lookup is competitive on micro scores while only having two "learnable" parameters, k and m, but performs slightly worse on the macro scores.

How well this method works also depends on the domain, so consider which domain the embedding model was pretrained on.

> FAISS initially partitions the vectors with k-means clustering, which gives one centroid vector per cluster. A query is first compared against the *k* centroids and then against the roughly `n/k` vectors in the closest cluster, so we go from searching `n` vectors to `k + n/k`. If *k* is too small, there are many samples to compare against in the second step; if it is too large, there are many centroids to search through. So we look for the minimum of $f(k) = k + n/k$ with respect to *k* and find $k = \sqrt{n}$. We can also use GPUs for a speedup and compress vectors with advanced quantisation schemes. Credit goes to Facebook for developing this.
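
The minimisation in the note above, written out. Treating $k$ as a continuous variable and assuming equally sized clusters are simplifications:

```latex
\begin{aligned}
f(k)   &= k + \frac{n}{k}, \\
f'(k)  &= 1 - \frac{n}{k^{2}} = 0 \;\Longrightarrow\; k = \sqrt{n}, \\
f''(k) &= \frac{2n}{k^{3}} > 0 \quad \text{for } k > 0, \\
f(\sqrt{n}) &= 2\sqrt{n}.
\end{aligned}
```

For example, with $n = 10^{6}$ vectors this gives $k = 1000$ clusters and about $2000$ comparisons per query instead of $10^{6}$.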

We will compare the embedding lookup to fine-tuning a model.

### Fine-Tuning a Vanilla Transformer

If we have labeled data, the first thing we can do is fine-tune a pretrained transformer. Start with standard BERT, then later see what effect fine-tuning a language model has on performance.

> For many applications, there are models already pretrained on that given domain. So it's worth having a look!

In [None]:
import torch
from transformers import (
    AutoTokenizer, AutoConfig, AutoModelForSequenceClassification)

model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

ds_enc = ds.map(tokenize, batched=True)
ds_enc = ds_enc.remove_columns(['labels', 'text'])

In [None]:
ds_enc.set_format("torch")
# labels must be float: the multi-label loss expects floats,
# since it also allows class probabilities instead of discrete labels
ds_enc = ds_enc.map(
    lambda x: {"label_ids_f": x["label_ids"].to(torch.float)}, remove_columns=["label_ids"])
# workaround: Create new col with labels; format of col is inferred from first one
# then delete the original column and rename the new one to replace the old one
# since changing format of column element wise doesn't work well with Arrow's typed format
ds_enc = ds_enc.rename_column("label_ids_f", "label_ids")

In [None]:
from transformers import Trainer, TrainingArguments

training_args_fine_tune = TrainingArguments(
    output_dir="./results", num_train_epochs=20, learning_rate=3e-5,
    lr_scheduler_type='constant', per_device_train_batch_size=4,
    per_device_eval_batch_size=32, weight_decay=0.0,
    evaluation_strategy="epoch", save_strategy="epoch", logging_strategy="epoch",
    load_best_model_at_end=True, metric_for_best_model='micro f1',
    save_total_limit=1, log_level='error'
)

We need the $F_1$ score to choose the best model. However, the model returns logits, so we need to normalise the predictions with a sigmoid before binarising them with a simple threshold. Then we return the scores we're interested in from the classification report.

In [None]:
from scipy.special import expit as sigmoid

def compute_metrics(pred):
    y_true = pred.label_ids
    y_pred = sigmoid(pred.predictions)
    y_pred = (y_pred > 0.5).astype(float)
    
    clf_dict = classification_report(y_true, y_pred, target_names=all_labels,
                                    zero_division=0, output_dict=True)
    
    return {"micro f1": clf_dict["micro avg"]["f1-score"],
           "macro f1": clf_dict["macro avg"]["f1-score"]}

For each training set slice, we'll train a classifier from scratch, load the best model at the end of the training loop and store the results on the test set.

In [None]:
config = AutoConfig.from_pretrained(model_ckpt)
config.num_labels = len(all_labels)
config.problem_type = "multi_label_classification"

for train_slice in train_slices:
    model = AutoModelForSequenceClassification.from_pretrained(
        model_ckpt, config=config)
    
    trainer = Trainer(
        model=model, tokenizer=tokenizer, 
        args=training_args_fine_tune,
        compute_metrics=compute_metrics,
        train_dataset=ds_enc["train"].select(train_slice),
        eval_dataset=ds_enc["valid"]
    )
    
    trainer.train()
    pred = trainer.predict(ds_enc["test"])
    metrics = compute_metrics(pred)
    
    macro_scores["Fine-tune (vanilla)"].append(metrics["macro f1"])
    micro_scores["Fine-tune (vanilla)"].append(metrics["micro f1"])
    
plot_metrics(micro_scores, macro_scores, train_samples, "Fine-tune (vanilla)")

We can see that fine-tuning vanilla BERT on the dataset leads to competitive results after ~64 examples. Below 64 examples the behaviour is erratic, which can happen when a model is trained on a small, unbalanced sample.

Take a look at another promising approach for language models in few-shot domain.

### In-Context and Few-Shot Learning with Prompts

Large language models have shown the ability to learn effectively from examples presented in the prompt, and the larger a language model is scaled, the better it is at using in-context examples, leading to significant performance boosts. This means we can get reasonable results without any further training.
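
What "examples presented in the prompt" means in practice can be sketched as plain string formatting. The prompt layout, the helper name `build_few_shot_prompt`, and the example issues are all assumptions for illustration:

```python
def build_few_shot_prompt(examples, query, labels):
    """Format labeled examples plus one unlabeled query as a single prompt."""
    lines = [f"Classify each issue with labels from: {', '.join(labels)}.", ""]
    for text, example_labels in examples:
        lines += [f"Issue: {text}", f"Labels: {', '.join(example_labels)}", ""]
    # The query ends with an open "Labels:" for the model to complete.
    lines += [f"Issue: {query}", "Labels:"]
    return "\n".join(lines)


# Made-up labeled examples.
examples = [
    ("Tokenizer splits emojis incorrectly", ["tokenization"]),
    ("Add docstring for Trainer.predict", ["documentation"]),
]
prompt = build_few_shot_prompt(
    examples,
    "Fast tokenizer fails on long inputs",
    ["tokenization", "documentation", "pytorch"],
)
print(prompt)
```

The model's continuation after the final `Labels:` is then read off as the prediction; no parameters are updated.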

An alternative is to use labeled data to create examples and fine-tune the language model head on the examples. 

Next we will look at how to make good use of few labeled examples we have, and also the large volume of unlabeled data we have.

## Leveraging Unlabeled Data

A pretrained model works better if the downstream task is close to the domain it was previously trained on.

To close that gap we use domain adaptation: we continue training the language model on data from our domain by predicting masked words, which needs no labeled data. We can then load the adapted model as a classifier and fine-tune it, thereby leveraging the unlabeled data.

Great thing is, unlabeled data is abundantly available, and the adapted model can be reused for many use-cases. After domain adaptation, can apply entity recognition, or another classification task like sentiment analysis since the approach is agnostic to downstream tasks.

### Fine-Tuning a Language Model

We want to make sure we don't train the model to predict the special tokens [CLS] and [SEP], so we apply a mask when tokenising by setting `return_special_tokens_mask=True`.

In [None]:
def tokenize(batch):
    return tokenizer(
        batch["text"], truncation=True, max_length=128, return_special_tokens_mask=True)

ds_mlm = ds.map(tokenize, batched=True)
ds_mlm = ds_mlm.remove_columns(["labels", "text", "label_ids"])

We're still missing the mechanism to mask tokens in the input sequence and have the target tokens in the outputs. One way to do this is to write a function that masks random tokens and creates labels for these sequences, but this would double the size of the dataset, since we would also store the target sequence, and we would reuse the same masking of a sequence every epoch.

More elegant solution is to use a data collator; acts as the bridge between the dataset and the model calls. A batch is sampled from the dataset, and the data collator prepares elements in the batch to feed to the model. It concatenates the tensors of each element into a single tensor. We can use it to do masking and label generation on the fly; so don't need to store labels and get new masks every time we sample.
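The on-the-fly masking described above can be sketched in plain NumPy. This is a simplified stand-in, not the library implementation: the function name `mask_tokens`, the token IDs and the mask token ID 103 are assumptions for illustration, and a real collator typically does more (for example batching, padding, and occasionally substituting random tokens instead of the mask token).

```python
import numpy as np

def mask_tokens(input_ids, special_tokens_mask, mask_token_id,
                mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) for one sequence, drawn fresh on each call."""
    if rng is None:
        rng = np.random.default_rng()
    input_ids = np.asarray(input_ids)
    special = np.asarray(special_tokens_mask, dtype=bool)
    # Pick positions to mask at random, never touching special tokens.
    selected = (rng.random(input_ids.shape) < mlm_probability) & ~special
    # Labels hold the original ID at masked positions and -100 elsewhere.
    labels = np.where(selected, input_ids, -100)
    masked_ids = np.where(selected, mask_token_id, input_ids)
    return masked_ids, labels

# Hypothetical token IDs; the first and last positions play the role of [CLS]/[SEP].
ids = np.array([101, 7, 8, 9, 10, 102])
special = [1, 0, 0, 0, 0, 1]
masked, labels = mask_tokens(ids, special, mask_token_id=103,
                             mlm_probability=0.5, rng=np.random.default_rng(0))
print(masked, labels)
```

Because the mask is drawn inside the function, calling it again on the same sequence gives a different masking, and nothing beyond the original IDs has to be stored in the dataset.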

Use `DataCollatorForLanguageModeling`; initialise with model's tokeniser and fraction of tokens we want to mask via `mlm_probability` argument. We'll mask 15% of the tokens, following the procedure in the BERT paper.

In [None]:
from transformers import DataCollatorForLanguageModeling, set_seed

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15)

In [None]:
set_seed(3)
data_collator.return_tensors = "np" # switch return format to numpy
inputs = tokenizer("Transformers are awesome!", return_tensors="np")
outputs = data_collator([{"input_ids": inputs["input_ids"][0]}])

# keep both ID arrays so the table can show them next to the tokens
original_input_ids = inputs["input_ids"][0]
masked_input_ids = outputs["input_ids"][0]

pd.DataFrame({
    "Original tokens": tokenizer.convert_ids_to_tokens(original_input_ids),
    "Masked tokens": tokenizer.convert_ids_to_tokens(masked_input_ids),
    "Original input_ids": original_input_ids,
    "Masked input_ids": masked_input_ids,
    "Labels": outputs["labels"][0]
}).T

The ! was replaced with the mask token. The data collator also returns a label array, with -100 for the original (unmasked) tokens and the token ID for masked tokens. Entries containing -100 are ignored when calculating the loss.
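To illustrate what "ignored when calculating the loss" means, here is a small NumPy sketch of a cross-entropy that averages only over positions whose label is not -100. The function name `masked_lm_loss` and the toy shapes are assumptions; in practice the loss is computed inside the model, and -100 matches the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import numpy as np

def masked_lm_loss(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    keep = labels != ignore_index
    # Log-softmax over the vocabulary axis, computed stably.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the true token at each kept position.
    nll = -log_probs[keep, labels[keep]]
    return nll.mean()

# Toy example: 3 positions, vocabulary of 4 tokens, only position 1 is masked.
logits = np.zeros((3, 4))
labels = np.array([-100, 2, -100])
loss = masked_lm_loss(logits, labels)

# Changing logits at ignored positions leaves the loss untouched.
logits_changed = logits.copy()
logits_changed[0] = [5.0, -1.0, 0.0, 2.0]
loss_changed = masked_lm_loss(logits_changed, labels)
```

With uniform logits over four tokens the loss at the single masked position is log(4), and the second call gives the same value, since only the masked position contributes.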

In [None]:
# switch data collator format back to PyTorch
data_collator.return_tensors = "pt"

In [None]:
from transformers import AutoModelForMaskedLM, Trainer, TrainingArguments

# fine-tune masked language model
training_args = TrainingArguments(
    output_dir=f"{model_ckpt}-issues-128", per_device_train_batch_size=32,
    logging_strategy="epoch", evaluation_strategy="epoch", save_strategy="no",
    num_train_epochs=16, push_to_hub=True, log_level="error", report_to="none"
)

trainer = Trainer(
    model=AutoModelForMaskedLM.from_pretrained('bert-base-uncased'),
    tokenizer=tokenizer, args=training_args, data_collator=data_collator,
    train_dataset=ds_mlm["unsup"], eval_dataset=ds_mlm["train"]
)

trainer.train()
trainer.push_to_hub("Training complete!")

In [None]:
# use trainers history to look at training and val losses of model
# all stored in trainer.state.log_history as list of dicts; load as dataframe
# training and val loss are recorded at different steps so we have gaps in df
df_log = pd.DataFrame(trainer.state.log_history)

(df_log.dropna(subset=["eval_loss"]).reset_index()["eval_loss"]
.plot(label="Validation"))
df_log.dropna(subset=["loss"]).reset_index()["loss"].plot(label="Train")

plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend(loc="upper right")
plt.show()

Both train and val loss went down considerably! See if we can also see an improvement when we fine-tune a classifier based on this model...

### Fine-Tuning a Classifier

In [None]:
# repeat fine-tuning, but we will load the custom ckpt

model_ckpt = f"{model_ckpt}-issues-128"
config = AutoConfig.from_pretrained(model_ckpt)
config.num_labels = len(all_labels)
config.problem_type = "multi_label_classification"

for train_slice in train_slices:
    model = AutoModelForSequenceClassification.from_pretrained(
        model_ckpt, config=config)
    
    trainer = Trainer(
        model=model, tokenizer=tokenizer, args=training_args_fine_tune,
        compute_metrics=compute_metrics,
        train_dataset=ds_enc["train"].select(train_slice),
        eval_dataset=ds_enc["valid"]   
    )
    trainer.train()
    pred = trainer.predict(ds_enc["test"])
    metrics = compute_metrics(pred)
    
    # DA refers to domain adaptation
    macro_scores["Fine-tune (DA)"].append(metrics['macro f1'])
    micro_scores["Fine-tune (DA)"].append(metrics['micro f1'])

In [None]:
plot_metrics(micro_scores, macro_scores, train_samples, "Fine-tune (DA)")

The advantage is clearest in the low-data regime; we also gain percentage points in the regime where more labeled data is available.

So domain adaptation can give model performance a boost using unlabeled data and little effort. The volume of unlabeled data available will also affect how much we gain from this method.

Finally, some further tricks to take advantage of unlabeled data.

### Advanced Methods

#### Unsupervised Data Augmentation

A model's predictions should be consistent for an unlabeled example and a slightly distorted one. Consistency is then enforced by minimising KL divergence between predictions of original and distorted examples. 

Consistency requirement is incorporated by augmenting the cross-entropy loss with an additional term from unlabeled examples. So one trains a model on labeled data with supervised approach, but constrains the model to make consistent predictions on the unlabeled data.
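The combined objective can be sketched in NumPy as a supervised cross-entropy plus a weighted KL term between predictions on original and augmented unlabeled examples. This is a simplified illustration, not the reference UDA implementation: it assumes single-label softmax outputs (the chapter's task is multi-label), and the function name `uda_loss` and the `weight` parameter are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def uda_loss(logits_labeled, targets, logits_unlabeled, logits_augmented, weight=1.0):
    """Return (total, supervised, consistency) loss terms."""
    # Supervised cross-entropy on the labeled batch.
    p_lab = softmax(logits_labeled)
    supervised = -np.log(p_lab[np.arange(len(targets)), targets]).mean()
    # KL(p_original || p_augmented), averaged over the unlabeled batch.
    p_orig = softmax(logits_unlabeled)
    p_aug = softmax(logits_augmented)
    kl = (p_orig * (np.log(p_orig) - np.log(p_aug))).sum(axis=-1).mean()
    return supervised + weight * kl, supervised, kl

# Toy logits, made up for illustration.
logits_lab = np.array([[2.0, 0.0, -1.0], [0.0, 1.5, 0.5]])
targets = np.array([0, 1])
logits_unl = np.array([[1.0, 0.2, -0.3]])
logits_aug = np.array([[0.4, 0.9, -0.1]])

total, sup, kl = uda_loss(logits_lab, targets, logits_unl, logits_aug)
_, _, kl_same = uda_loss(logits_lab, targets, logits_unl, logits_unl)
```

When the augmented prediction equals the original one, the consistency term is zero and only the supervised loss remains; any disagreement adds a positive penalty.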

The results are impressive: with a handful of labeled examples, BERT gets similar performance to a model trained on thousands of examples. The downside is that you need a data augmentation pipeline, and training takes much longer since you need multiple forward passes to generate the predicted distributions on the unlabeled and augmented examples.

#### Uncertainty-Aware Self-Training

Train teacher model on the labeled data then use that model to create pseudo-labels on unlabeled data; then a student is trained on pseudo-labeled data and after training becomes the teacher for the next iteration.

To get an uncertainty measure of a model's predictions, the same input is fed several times through the model with dropout turned on; the variance in predictions gives a proxy for the certainty of the model on a specific sample.
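The repeated-forward-pass idea can be illustrated with a toy NumPy model. This is a sketch under strong assumptions: a single linear layer with sigmoid outputs stands in for the transformer, dropout is applied to the input features, and the name `mc_dropout_predict` and its parameters are hypothetical.

```python
import numpy as np

def mc_dropout_predict(x, weights, n_passes=20, drop_prob=0.1, seed=0):
    """Run several stochastic passes; return per-label mean and variance."""
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_passes):
        # Sample a fresh dropout mask on the input features for each pass.
        keep = rng.random(x.shape) >= drop_prob
        x_dropped = x * keep / (1.0 - drop_prob)  # inverted dropout scaling
        logits = x_dropped @ weights
        probs.append(1.0 / (1.0 + np.exp(-logits)))  # sigmoid per label
    probs = np.stack(probs)  # shape: (n_passes, n_labels)
    return probs.mean(axis=0), probs.var(axis=0)

# Toy input with 4 features and a random 4x3 weight matrix (3 labels).
x = np.array([1.0, -2.0, 0.5, 3.0])
weights = np.random.default_rng(1).normal(size=(4, 3))

mean, var = mc_dropout_predict(x, weights, drop_prob=0.3)
mean_nodrop, var_nodrop = mc_dropout_predict(x, weights, drop_prob=0.0)
```

With dropout disabled every pass is identical and the variance collapses to zero; with dropout on, labels whose prediction depends heavily on a few features show a larger spread.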

With the uncertainty measure, pseudo-labels are sampled using a method called Bayesian Active Learning by Disagreement (BALD). The teacher keeps getting better at creating pseudo-labels, which increases the model's performance. In the end, the approach gets within a few percent of models trained on the full training data with thousands of samples, and even beats UDA on several datasets.
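As a rough illustration, the BALD score is commonly computed as the entropy of the averaged prediction minus the average entropy of the individual stochastic predictions. The sketch below assumes categorical outputs collected from several dropout passes; the function name `bald_score` is hypothetical, and how the scores are turned into a sampling scheme for pseudo-labels is not covered here.

```python
import numpy as np

def bald_score(probs, eps=1e-12):
    """probs has shape (n_passes, n_samples, n_classes); returns (n_samples,)."""
    mean_probs = probs.mean(axis=0)
    # Entropy of the averaged prediction: total uncertainty.
    entropy_of_mean = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)
    # Average entropy of each pass: uncertainty each pass already admits.
    mean_of_entropy = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)
    # The gap is large when passes are individually confident but disagree.
    return entropy_of_mean - mean_of_entropy

# Two passes, two samples, two classes (toy values).
probs = np.array([
    [[1.0, 0.0], [0.9, 0.1]],   # pass 1
    [[0.0, 1.0], [0.9, 0.1]],   # pass 2
])
scores = bald_score(probs)
```

The first sample, where the two passes confidently contradict each other, scores close to log(2); the second, where they agree, scores close to zero.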

## Conclusion

Methods
- One-shot pretrained model
- Domain adaptation

Depends on:
- Amount of labeled data
- How noisy the data is
- How close the data is to the pretraining corpus, etc.

Best to set up a pipeline and iterate quickly!

It is worth considering the trade-off between more complex approaches like UDA and UST and simply getting more data. It is good to build a validation and test set early on; more labeled data can then be gathered at every step of the way.