
This experiment has two stages: first bucket the dataset, then train based on entropy scores.
Bucketing the dataset involves the following:
1. Tokenize the text and generate embeddings using ``google-bert/bert-base-uncased``.
2. By default, ``bert-base-uncased`` produces 768-dimensional embeddings.
3. Reduce the embeddings to around 50 dimensions.
4. Cluster the reduced embeddings into buckets and extract n samples from each bucket to create the first-stage training dataset.
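
The last bucketing step can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: the helper name `sample_per_bucket`, the toy labels, and the choice to sample uniformly at random inside each bucket are all assumptions.

```python
import numpy as np

def sample_per_bucket(bucket_labels, n, seed=42):
    """Return sorted row indices with up to n rows drawn from each bucket."""
    rng = np.random.default_rng(seed)
    bucket_labels = np.asarray(bucket_labels)
    picked = []
    for bucket in np.unique(bucket_labels):
        members = np.flatnonzero(bucket_labels == bucket)
        take = min(n, members.size)  # small buckets contribute all their rows
        picked.append(rng.choice(members, size=take, replace=False))
    return np.sort(np.concatenate(picked))

# Toy example (hypothetical data): 10 rows spread over 3 buckets.
toy_labels = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
stage1_idx = sample_per_bucket(toy_labels, n=2)
```

In the real pipeline the bucket labels would come from whatever clustering is fitted on the reduced embeddings, and the returned indices would select the first-stage training rows.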

Training based on entropy score involves the following:
1. Remove the data already used for training from the dataset pool.
2. Predict on the rest of the dataset pool using the model trained in stage 1.
3. Convert the prediction scores to entropy.
4. Sort by entropy, highest first, and separate into buckets again.
5. Extract m samples from each bucket to create the second-stage training dataset.
  
Entropy-based training can be repeated for multiple rounds until the results are sufficient.
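
The entropy steps above can be sketched roughly as follows, using the Shannon entropy of the predicted class probabilities, H(p) = -sum_i p_i log p_i. The function names and toy values are hypothetical, and taking the m highest-entropy rows inside each bucket is one possible reading of the selection step, not necessarily the method used later in this notebook.

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each row of class probabilities."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    return -(probs * np.log(probs)).sum(axis=1)

def top_entropy_per_bucket(entropy, bucket_labels, m):
    """Assumed selection rule: keep the m most uncertain rows of each bucket."""
    entropy = np.asarray(entropy)
    bucket_labels = np.asarray(bucket_labels)
    picked = []
    for bucket in np.unique(bucket_labels):
        members = np.flatnonzero(bucket_labels == bucket)
        # Sort this bucket's rows from highest to lowest entropy.
        ranked = members[np.argsort(-entropy[members], kind="stable")]
        picked.append(ranked[:m])
    return np.concatenate(picked)

# Toy example (hypothetical data): three predictions over two classes.
toy_probs = np.array([
    [0.99, 0.01],  # confident prediction, low entropy
    [0.50, 0.50],  # maximally uncertain, entropy = ln 2
    [0.80, 0.20],
])
toy_buckets = np.array([0, 0, 1])

toy_entropy = prediction_entropy(toy_probs)
order = np.argsort(-toy_entropy, kind="stable")  # most uncertain first
stage2_idx = top_entropy_per_bucket(toy_entropy, toy_buckets, m=1)
```

Rows with near-uniform probabilities rank first, so each round concentrates the next training set on examples the current model is least sure about.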

For each stage, we can compare the result to a model trained on a randomly selected subset of the dataset pool.

Since we are trying to show that a high-quality model can be built without a huge amount of data, each comparison uses the same amount of data that the bucketing model has been trained on up to that point.
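
One way to keep the comparison fair is to draw the random baseline with the same cumulative budget as the bucketing model. The sketch below assumes hypothetical stage sizes and a hypothetical helper `random_baseline_indices`; only the pool size is taken from the SST2 training split.

```python
import numpy as np

def random_baseline_indices(pool_size, budget, seed=42):
    """Uniformly sample `budget` distinct row indices from the pool."""
    if budget > pool_size:
        raise ValueError("budget cannot exceed the pool size")
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(pool_size, size=budget, replace=False))

# Hypothetical per-stage sizes for the bucketing model (not from the experiment).
stage_sizes = [1000, 500]
cumulative_budgets = np.cumsum(stage_sizes)  # data seen by the end of each stage

pool_size = 67349  # number of rows in the SST2 training split
baselines = [random_baseline_indices(pool_size, int(b)) for b in cumulative_budgets]
```

Each entry of `baselines` would then train one comparison model, matched in size to the bucketing model at the same stage.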

# Load and preprocess dataset
We are trying to show that bucketing and entropy-based training can reduce the amount of data needed to produce a well-balanced model. This is a case of one-shot model training. To reduce external factors such as poor data quality, we will use StanfordNLP's SST2 dataset, which is a standard NLP benchmark for sentiment classification. The dataset will be loaded from Hugging Face via ``stanfordnlp/sst2``.

In [1]:
from datasets import load_dataset
ds = load_dataset("stanfordnlp/sst2", cache_dir="caches/")



Let's split the dataset into three variables

In [3]:
train_ds = ds["train"]
validation_ds = ds["validation"]
test_ds = ds["test"]

Perform very light cleaning on the dataset

In [4]:
def clean_text(row):
  text = str(row["sentence"])
  text = text.lower()
  text = text.strip()
  return {"sentence": text}  

In [5]:
train_ds = train_ds.map(clean_text)
validation_ds = validation_ds.map(clean_text)
test_ds = test_ds.map(clean_text)

In [42]:
train_ds

Dataset({
    features: ['idx', 'sentence', 'label', 'mean_pooled_embeddings'],
    num_rows: 67349
})

# Generate embeddings and reduce size
As mentioned before, embeddings are generated using the ``google-bert/bert-base-uncased`` model. Because BERT embeddings are large (768 dimensions), we will reduce them to around 50 dimensions.

Load the tokenizer and model

In [6]:
import torch
from transformers import AutoTokenizer, AutoModel

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")


model = AutoModel.from_pretrained("google-bert/bert-base-uncased", cache_dir="caches/")
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased", cache_dir="caches/")
model.to(device)
model.eval()



BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [7]:
import torch
def get_mean_pooled_embeddings(batch):
  # Tokenize the batch; every sentence is padded to the model's maximum length
  inputs = tokenizer(
    batch["sentence"],
    truncation=True,
    padding="max_length",
    return_tensors="pt"
  )
  inputs = {k: v.to(device) for k, v in inputs.items()}
  # Inference only, so skip gradient tracking
  with torch.no_grad():
    outputs = model(**inputs)
  last_hidden = outputs.last_hidden_state
  # Mask out padding tokens so they do not contribute to the mean
  mask = inputs["attention_mask"].unsqueeze(-1)
  mean_pool = (last_hidden * mask).sum(dim = 1) / mask.sum(dim = 1)
  return {
    "mean_pooled_embeddings": mean_pool.cpu().numpy()
  }
  

In [8]:
train_ds = train_ds.map(get_mean_pooled_embeddings, batch_size=16, batched=True)

Map: 100%|██████████| 67349/67349 [54:30<00:00, 20.59 examples/s]


In [37]:
import numpy as np
np.save("bert_mean_pooled_embeddings.npy", train_ds["mean_pooled_embeddings"])

In [61]:
mean_pooled_bert_embeddings = np.load("bert_mean_pooled_embeddings.npy")

In [63]:
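from umap import UMAP
# Note: despite the variable name, this is a UMAP reducer, not PCA
pca = UMAP(n_components=50, random_state=42)
# Reuse the embeddings loaded from disk above
reduced_embeddings = pca.fit_transform(mean_pooled_bert_embeddings)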



In [64]:
from sklearn.preprocessing import normalize
normalized_embeddings = normalize(reduced_embeddings)

In [65]:
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans


# Evaluate candidate cluster counts with three internal clustering metrics
for k in range(2, 21):
    kmeans = KMeans(n_clusters=k, random_state=42).fit(normalized_embeddings)
    labels = kmeans.labels_
    
    sil_score = silhouette_score(normalized_embeddings, labels)  # higher is better
    db_score = davies_bouldin_score(normalized_embeddings, labels)  # lower is better
    ch_score = calinski_harabasz_score(normalized_embeddings, labels)  # higher is better
    
    print(f"k={k} | Silhouette: {sil_score:.4f} | Davies-Bouldin: {db_score:.4f} | Calinski-Harabasz: {ch_score:.4f}")


k=2 | Silhouette: 0.3800 | Davies-Bouldin: 1.0437 | Calinski-Harabasz: 52930.6392
k=3 | Silhouette: 0.3063 | Davies-Bouldin: 1.3173 | Calinski-Harabasz: 40233.3084
k=4 | Silhouette: 0.2598 | Davies-Bouldin: 1.3586 | Calinski-Harabasz: 34546.7694
k=5 | Silhouette: 0.2464 | Davies-Bouldin: 1.3461 | Calinski-Harabasz: 29691.6326
k=6 | Silhouette: 0.2266 | Davies-Bouldin: 1.4336 | Calinski-Harabasz: 28159.9143
k=7 | Silhouette: 0.2251 | Davies-Bouldin: 1.3965 | Calinski-Harabasz: 25222.0860
k=8 | Silhouette: 0.2120 | Davies-Bouldin: 1.4910 | Calinski-Harabasz: 23692.4232
k=9 | Silhouette: 0.2199 | Davies-Bouldin: 1.4527 | Calinski-Harabasz: 22377.6393
k=10 | Silhouette: 0.2127 | Davies-Bouldin: 1.4501 | Calinski-Harabasz: 20965.0288
k=11 | Silhouette: 0.2205 | Davies-Bouldin: 1.4122 | Calinski-Harabasz: 20405.1481
k=12 | Silhouette: 0.2188 | Davies-Bouldin: 1.4270 | Calinski-Harabasz: 19272.5394
k=13 | Silhouette: 0.2237 | Davies-Bouldin: 1.3693 | Calinski-Harabasz: 19044.0544
k=14 | Silho