
This experiment has two stages: first bucket the dataset, then train based on entropy scores.
Bucketing the dataset involves the following:
1. Tokenize the text using ``google-bert/bert-base-uncased``.
2. Generate embeddings. By default, ``bert-base-uncased`` produces embeddings with 768 dimensions.
3. Reduce the embeddings to around 50 dimensions.
4. Extract n examples from each bucket to create the first-stage training dataset.
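
Step 4 can be sketched as follows, assuming every example has already been assigned a bucket id by some clustering step (the notebook does not say which one). The function name `sample_per_bucket`, the seed, and the toy `bucket_ids` list are hypothetical and only illustrate the idea.

```python
import random
from collections import defaultdict


def sample_per_bucket(bucket_ids, n, seed=0):
    """Pick up to n example indices from every bucket (hypothetical helper)."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, bucket in enumerate(bucket_ids):
        members[bucket].append(idx)
    selected = []
    for bucket in sorted(members):
        pool = members[bucket]
        # A bucket smaller than n contributes everything it has.
        selected.extend(rng.sample(pool, min(n, len(pool))))
    return sorted(selected)


# Toy bucket assignment for ten examples spread over three buckets (made up).
bucket_ids = [0, 0, 1, 2, 1, 0, 2, 2, 1, 0]
stage_one = sample_per_bucket(bucket_ids, n=2)
print(stage_one)
```

Sampling the same number from each bucket keeps the first-stage dataset spread across the embedding space instead of following the raw bucket sizes.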

Training based on entropy score involves the following:
1. Remove the data already used for training from the dataset pool.
2. Predict on the rest of the dataset pool using the model trained in stage 1.
3. Convert the prediction scores to entropy.
4. Sort by highest entropy, then separate into buckets again.
5. Extract m examples from each bucket to create the second-stage training dataset.
  
Entropy-based training can be repeated for multiple rounds until the model is good enough.
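
Steps 3 and 4 can be illustrated with a small sketch. It assumes the stage 1 model outputs one probability per class and uses Shannon entropy in nats. The helper names and the `pool_probs` values are made up for illustration, and the re-bucketing part of step 4 is left out because the notebook does not specify how it is done.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def rank_by_entropy(prob_rows):
    """Return row indices ordered from most to least uncertain."""
    scores = [prediction_entropy(row) for row in prob_rows]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Made-up class probabilities for four sentences left in the pool.
pool_probs = [[0.99, 0.01], [0.55, 0.45], [0.80, 0.20], [0.50, 0.50]]
order = rank_by_entropy(pool_probs)
print(order)
```

Predictions close to 50/50 get the highest entropy, so the examples the model is least sure about come first in the ranking.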

For each stage, we can compare the result against a model trained on a random sample from the dataset pool.

Since we are trying to show that a high-quality model can be built without a huge amount of data, each comparison uses a random sample of the same size as the data used to train the bucketing model up to that point.
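
A minimal sketch of how the random baselines could be sized is shown below. The per-stage sizes in `stage_sizes` are placeholders, not values from this experiment, and drawing each baseline independently is an assumption. Only the pool size of 67349 comes from the SST2 training split.

```python
import itertools
import random


def random_baseline_indices(pool_size, budget, seed=0):
    """Draw `budget` distinct example indices uniformly from the pool."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(pool_size), budget))


# Placeholder number of examples added by each training stage.
stage_sizes = [500, 200, 200]
# Cumulative data budget after each stage: 500, 700, 900.
budgets = list(itertools.accumulate(stage_sizes))
baselines = [
    random_baseline_indices(67349, budget, seed=stage)
    for stage, budget in enumerate(budgets)
]
print([len(b) for b in baselines])
```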

# Load and preprocess dataset
We are trying to show that bucketing and entropy-based training can reduce the amount of data needed to produce a well-balanced model. This is a case of one-shot model training. To limit external factors such as poor data quality, we will use StanfordNLP's SST2 dataset, which is a standard NLP benchmark for sentiment classification. The dataset is loaded from Hugging Face via ``stanfordnlp/sst2``.

In [3]:
from datasets import load_dataset
ds = load_dataset("stanfordnlp/sst2", cache_dir="caches/")


We can briefly inspect the dataset.

In [4]:
ds

DatasetDict({
    train: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 872
    })
    test: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 1821
    })
})

Let's split it into three variables, one per dataset split.

In [5]:
train_ds = ds["train"]
validation_ds = ds["validation"]
test_ds = ds["test"]

Perform very light cleaning on the dataset

In [6]:
def clean_text(row):
  # Normalize one example: coerce to str, lowercase, and trim surrounding whitespace.
  text = str(row["sentence"])
  text = text.lower()
  text = text.strip()
  return {"sentence": text}

In [7]:
train_ds = train_ds.map(clean_text)
validation_ds = validation_ds.map(clean_text)
test_ds = test_ds.map(clean_text)

# Generate embeddings and reduce size
As mentioned before, embeddings are generated using the ``google-bert/bert-base-uncased`` model. Since BERT embeddings are fairly large (768 dimensions), we will reduce them to around 50 dimensions.
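
Once the 768-dimensional vectors exist, the reduction could be sketched as below. The notebook does not name a reduction method, so PCA (computed here with a NumPy SVD) is an assumption, and `pca_reduce` plus the random stand-in data are hypothetical.

```python
import numpy as np


def pca_reduce(embeddings, n_components=50):
    """Project row vectors onto their top principal components (assumed method)."""
    X = np.asarray(embeddings, dtype=np.float64)
    X_centered = X - X.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T


# Random stand-in for 200 mean-pooled BERT embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 768))
reduced = pca_reduce(fake_embeddings, n_components=50)
print(reduced.shape)
```

Other reducers would fit the same slot. PCA is used here only because it is simple and deterministic.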

Load the model and tokenizer, then define the embedding function.

In [8]:
import torch
from transformers import AutoTokenizer, AutoModel

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")


model = AutoModel.from_pretrained("google-bert/bert-base-uncased", cache_dir="caches/")
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased", cache_dir="caches/")
model.to(device)
model.eval()



BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [16]:
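def tokenize(row):
  # padding=True pads only to the longest sequence in this call, so BERT does
  # not process 512 padded positions for every short sentence.
  inputs = tokenizer(
    row["sentence"],
    truncation=True,
    padding=True,
    return_tensors="pt"
  ).to(device)
  with torch.no_grad():
    output = model(**inputs)
  embeddings = output.last_hidden_state
  # Mean-pool the token embeddings, using the attention mask to ignore padding.
  input_mask_expanded = inputs["attention_mask"].unsqueeze(-1).expand(embeddings.size()).float()
  sum_embeddings = torch.sum(embeddings * input_mask_expanded, dim=1)
  sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)

  mean_pooled = sum_embeddings / sum_mask
  # Move the result off the accelerator so the dataset stores plain arrays.
  return {
    "mean_pooled": mean_pooled.cpu().numpy()
  }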

In [17]:
train_ds = train_ds.map(tokenize)
validation_ds = validation_ds.map(tokenize)
test_ds = test_ds.map(tokenize)

Map:  10%|▉         | 6644/67349 [05:30<50:18, 20.11 examples/s]  


KeyboardInterrupt: 