# Embroid demonstration

This notebook demonstrates how to apply Embroid to a set of predictions. We assume that (1) few-shot LM predictions and (2) embeddings for the samples have already been computed.

The task is "DBPedia Animals," which requires an LM to classify whether an entity is an animal based on the DBPedia description of that entity.

In [1]:
import numpy as np

# Load labels
labels = np.load("data/labels.npy")

# Load votes
all_votes = np.load("data/votes.npy")

# Convert to -1/1 space
all_votes = all_votes*2 - 1
labels = labels*2 - 1

n_samples, n_sources = all_votes.shape
print(f"Number of sources: {n_sources}. Number of samples: {n_samples}")

Number of sources: 1. Number of samples: 2000


We use embeddings from three models: RoBERTa, SentenceBERT, and BERT. For each model, we precomputed embeddings of the dataset along with nearest-neighbor information. The nearest-neighbor information is stored as pickled arrays, where `arr[i, j]` is the index of the jth closest sample to sample `i`.

In [2]:
import pickle

nns = []
for model in ["roberta", "sbert", "bert"]:
    with open(f"data/{model}_embeddings.pickle", 'rb') as handle:
        nns.append(pickle.load(handle))
        
# Stack into a single array of shape (n_embeddings, n_samples, n_neighbors)
nns = np.array(nns)
print(nns.shape)

(3, 2000, 20)
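
For readers who want to build a nearest-neighbor array of this shape themselves, the following is a minimal sketch rather than the procedure used to produce the files above. It assumes scikit-learn's `NearestNeighbors` with cosine distance and assumes each sample is excluded from its own neighbor list; the function name `nearest_neighbor_indices` is hypothetical, and the metric and self-exclusion behind the precomputed files are not stated in this notebook.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def nearest_neighbor_indices(embeddings, k=20):
    """Return an (n_samples, k) integer array where entry [i, j] is the
    index of the jth closest sample to sample i.

    Assumptions (not taken from the notebook): cosine distance, and a
    sample is never listed as its own neighbor.
    """
    index = NearestNeighbors(n_neighbors=k, metric="cosine").fit(embeddings)
    # Calling kneighbors() with no query returns neighbors of the indexed
    # points themselves, with each point excluded from its own list.
    _, neighbors = index.kneighbors()
    return neighbors
```

Stacking one such array per embedding model with `np.array([...])` would give an array shaped like `nns` above, `(n_embeddings, n_samples, k)`.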


Let's compute the performance of our single LM, using the original predictions.

In [3]:
from sklearn.metrics import f1_score
# all_votes has a single column, so score that column as a 1-D prediction vector
f1 = f1_score(labels, all_votes[:, 0], average="macro")
print(f"Macro F1 for initial prompt: {f1:.2f}")

Macro F1 for initial prompt: 0.72


Next, we run Embroid. As the original paper recommends, we set the thresholds for the neighborhood vote to $\tau^+ = P(\lambda = 1)$ and $\tau^- = P(\lambda = -1)$, the global fractions of positive and negative votes from the source. In practice, the neighborhood vote agrees with a source's vote on a sample when that vote is more concentrated among the sample's neighbors than it is globally, and votes against the source otherwise.

In [5]:
from embroid import run_embroid

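# Empirical P(lambda = 1) for the single source
pos_frac = (all_votes[:, 0] == 1).mean()
# One threshold pair per source: [P(lambda = -1), P(lambda = 1)]
thresholds = [[1-pos_frac, pos_frac]]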
corrected_predictions = run_embroid(all_votes, nns, knn=10, thresholds=thresholds)
f1 = f1_score(labels, corrected_predictions, average="macro")
print(f"Macro F1 for Embroid: {f1:.2f}")

Macro F1 for Embroid: 0.77
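
To make the thresholding rule described above concrete, here is a minimal sketch of the neighborhood vote for a single source and a single embedding space. It follows only the prose description in this notebook and makes no claim about how `run_embroid` is implemented internally or how it combines votes across embeddings; the function name `neighborhood_vote` and its arguments are hypothetical.

```python
import numpy as np


def neighborhood_vote(votes, nn_idx, knn=10):
    """Illustrative thresholded neighborhood vote (not the library's code).

    votes:  (n_samples,) array of -1/1 votes from one source.
    nn_idx: (n_samples, >= knn) array; nn_idx[i, j] is the index of the
            jth closest sample to sample i.
    """
    tau_pos = np.mean(votes == 1)    # global fraction of +1 votes
    tau_neg = np.mean(votes == -1)   # global fraction of -1 votes

    # Votes of each sample's knn nearest neighbors: (n_samples, knn)
    neighbor_votes = votes[nn_idx[:, :knn]]
    # Fraction of neighbors that share the sample's own vote
    local_agree = np.mean(neighbor_votes == votes[:, None], axis=1)
    # Global fraction of the sample's own vote
    global_frac = np.where(votes == 1, tau_pos, tau_neg)
    # Keep the vote where it is over-represented locally; flip it otherwise
    return np.where(local_agree > global_frac, votes, -votes)


# Toy example: sample 5 votes +1 but both of its neighbors vote -1
votes = np.array([1, 1, 1, -1, -1, 1])
nn_idx = np.array([[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4]])
print(neighborhood_vote(votes, nn_idx, knn=2))
```

In this toy example only the last sample's vote is flipped: none of its neighbors share its +1 vote, which is below the global +1 fraction of 4/6.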
