# <center> Compressed BERT Model </center>

This model is a reduced BERT model trained with different objectives than our previous ReducedBERT model. Although ReducedBERT performs well on the GLUE tasks compared to the full-size BERT model, it has some limitations.

1. The embeddings from the ReducedBERT model did not cluster well. Where a set of full-sized embeddings originally clustered into around 250 different groups, the reduced embeddings were clustered into only 2-3 groups. Adding some dimensionality reduction to the already reduced embeddings resulted in more clusters, but the clustering did not appear to be meaningful.
2. Intuitively, while training the ReducedBERT reduction head on language understanding may be helpful, the base BERT model has already been trained for language understanding. Since we freeze the BERT weights anyway, we should focus on training our model head for reduction instead of MLM and NSP, trusting that the understanding from BERT will be transferred to the reduced embeddings.
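
As a minimal sketch of what freezing the base weights means in PyTorch: the two `nn.Linear` modules below are hypothetical stand-ins for the pretrained encoder and the reduction head, and the loss is a placeholder rather than one of our training objectives.

```python
import torch
from torch import nn

torch.manual_seed(0)

base = nn.Linear(8, 8)   # hypothetical stand-in for the pretrained encoder
head = nn.Linear(8, 2)   # hypothetical stand-in for the reduction head

# Freeze the base so that only the head receives gradient updates
for param in base.parameters():
    param.requires_grad = False

optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
base_before = base.weight.detach().clone()
head_before = head.weight.detach().clone()

x = torch.randn(4, 8)
loss = head(base(x)).pow(2).mean()   # placeholder loss, for illustration only
loss.backward()
optimizer.step()

print(torch.equal(base.weight, base_before))   # base weights are untouched
print(torch.equal(head.weight, head_before))   # head weights have moved
```

Because the frozen parameters never receive gradients, only the head's parameters need to be passed to the optimizer.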

Our new model replaces the MLM and NSP heads of the previous ReducedBERT model with a contrastive learning objective and a reconstruction objective. These new objectives are designed to preserve the structure of the BERT embedding space while retaining the information present in the full-size BERT embeddings, which should help the reduced embeddings cluster better.

For contrastive learning, the idea is to pull similar items together and push dissimilar items apart in the reduced embedding space. While this would be a potentially difficult problem with unlabeled text, we have the advantage of having the original BERT embeddings to compare against. Thus, for our contrastive loss, we compute the cosine similarity between all pairs of original BERT embeddings in a training batch, and then compare those similarities against the cosine similarities of the corresponding pairs of reduced embeddings. We then push the cosine similarities of the reduced embeddings to match the cosine similarities of the original embeddings using an MSE loss. This essentially teaches the reduction head to keep the structure of the original embedding space in the reduced embedding space.
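
The similarity-matching idea can be sketched in a few lines of PyTorch. This is an illustration under assumptions, not the `compressed_contrastive_loss` implementation from `reduced_encoders`: the helper names `pairwise_cos_sim` and `similarity_matching_loss` are hypothetical, and random tensors stand in for real embeddings.

```python
import torch
import torch.nn.functional as F


def pairwise_cos_sim(embeddings: torch.Tensor) -> torch.Tensor:
    """Cosine similarity of every unordered pair of rows (i < j)."""
    normed = F.normalize(embeddings, dim=-1)   # unit-length rows
    sim = normed @ normed.T                    # (batch, batch) similarity matrix
    rows, cols = torch.triu_indices(sim.size(0), sim.size(1), offset=1)
    return sim[rows, cols]                     # keep each pair once, drop the diagonal


def similarity_matching_loss(full: torch.Tensor, reduced: torch.Tensor) -> torch.Tensor:
    """MSE between the pairwise similarities of the two embedding spaces."""
    target = pairwise_cos_sim(full).detach()   # the full-size space is the fixed reference
    return F.mse_loss(pairwise_cos_sim(reduced), target)


torch.manual_seed(0)
full = torch.randn(4, 768)     # stand-in for full-size sentence embeddings
reduced = torch.randn(4, 48)   # stand-in for reduced embeddings
loss = similarity_matching_loss(full, reduced)
```

A batch of 4 sentences gives 6 unordered pairs, so each similarity vector has 6 entries. The loss is zero whenever the reduced space reproduces the pairwise similarities of the full-size space exactly.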

For the reconstruction objective, we add a set of decompression layers to the end of the model in order to extract information from the reduced embeddings and recreate the full-size embeddings. We compare these reconstructed embeddings to the true full-size BERT embeddings and apply an MSE loss to get the reconstruction as close to the original as possible. This also teaches the reduction head to reduce the full-size embeddings in such a way that the reduced embeddings retain as much of the information in the full-size embeddings as possible.
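
The reconstruction objective is essentially an autoencoder loss. The sketch below uses toy sizes and single linear layers as hypothetical stand-ins for the reduction head and the decompression layers; it is meant only to show the shape of the training signal, not our actual architecture.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy sizes; the notebook reduces 768 -> 48 through several layers
full_size, reduced_size = 16, 4

encoder = nn.Linear(full_size, reduced_size)   # stand-in for the reduction head
decoder = nn.Linear(reduced_size, full_size)   # stand-in for the decompression layers

full = torch.randn(8, full_size)               # stand-in for frozen full-size embeddings
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

losses = []
for _ in range(50):
    reconstructed = decoder(encoder(full))     # compress, then decompress
    loss = F.mse_loss(reconstructed, full)     # distance to the original embeddings
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Both the encoder and the decoder are updated by the same loss, so the encoder is encouraged to keep whatever information the decoder needs to rebuild the input.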

In [1]:
# TODO: Test the intermediate reduction layer embeddings to see how they perform on GLUE tasks compared to the fully reduced model.
# TODO: Determine how to train such that clusters are better
#    - Maybe try using self-supervised contrastive learning, where the full-size embeddings are used as a baseline. (https://encord.com/blog/guide-to-contrastive-learning/#:~:text=NLP%20deals%20with%20the%20processing,semantic%20information%20and%20contextual%20relationships.)
#    - Check out the papers here (https://github.com/ryanzhumich/Contrastive-Learning-NLP-Papers?tab=readme-ov-file#4-contrastive-learning-for-nlp)
# TODO: Check if it's best to train the first reduction layer first, then the second, etc., freezing the previous layers as you go.

# TODO: Fix defaults from BertReducedConfig in the BertReducedForPreTraining class (and potentially other model classes).

In [2]:
import torch
from torch.nn import functional as F
from torch import nn
from collections import OrderedDict

from transformers import MPNetModel
from reduced_encoders import MPNetReducedPreTrainedModel, DimReduce
from reduced_encoders.models.mpnet_reduced.modeling_sbert import SBertPooler
from reduced_encoders.modeling_utils import compressed_contrastive_loss
from reduced_encoders.modeling_reduced import DimReduceLayer
from reduced_encoders.modeling_outputs import CompressedModelForPreTrainingOutput

class Decoder(nn.Sequential):
    """A module used during training of a compressed model that transforms the 
    reduced model embeddings back into full-size model embeddings with the goal
    of matching the original embeddings as closely as possible.

    The structure of the model closely matches that of the DimReduce module
    """
    def __init__(self, config, modules=None):
        input_size = config.reduced_size
        self.decoding_sizes = config.reduction_sizes[-2::-1] + [config.hidden_size]
        
        if modules is None:
            modules = OrderedDict()
            for i, decoding_size in enumerate(self.decoding_sizes):   
                modules[str(i)] = DecoderLayer(input_size, decoding_size, config)
                input_size = decoding_size
        elif not isinstance(modules, OrderedDict):
            modules = OrderedDict([(str(idx), module) for idx, module in enumerate(modules)])
    
        super().__init__(modules)

# For now, let the DecoderLayer be the same as the DimReduceLayer
DecoderLayer = DimReduceLayer


class MPNetCompressedForPretraining(MPNetReducedPreTrainedModel):
    def __init__(self, config=None, base_model=None, reduce_module=None, alpha=1, beta=1, do_contrast=True, 
                    do_reconstruction=True, **kwargs):
        super().__init__(config)

        kwargs['add_pooling_layer'] = False     # We use our own pooling instead
        self.mpnet = base_model or MPNetModel(self.config, **kwargs)
        self.pooler = SBertPooler(self.config)
        self.reduce = reduce_module or DimReduce(self.config)

        self.do_contrast = do_contrast
        self.do_reconstruction = do_reconstruction

        if do_reconstruction:
            self.decoder = Decoder(self.config)

        self.alpha = alpha
        self.beta = beta
        
    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, 
                inputs_embeds=None, output_attentions=None, output_hidden_states=None, return_dict=None):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.mpnet(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
        )

        sequence_output = outputs[0]
        pooled_output = self.pooler(sequence_output, attention_mask)  
        reduced_pooled = self.reduce(pooled_output)

        # TODO: There are ways to compute the loss at each layer of the reduction, is that something possible/something we want to do?

        # Compute contrastive loss
        contrastive_loss = 0
        if self.do_contrast:
            contrastive_loss = compressed_contrastive_loss(pooled_output, reduced_pooled)

        # Compute reconstruction loss
        reconstruction_loss = 0
        if self.do_reconstruction:
            decoded_reduced_pooled_output = self.decoder(reduced_pooled)
            reconstruction_loss = F.mse_loss(pooled_output, decoded_reduced_pooled_output)  

        # Compute total loss
        total_weighted_loss = self.alpha*contrastive_loss + self.beta*reconstruction_loss

        if not return_dict:
            return (total_weighted_loss, contrastive_loss, reconstruction_loss, )

        return CompressedModelForPreTrainingOutput(
            loss=total_weighted_loss,
            contrastive_loss=contrastive_loss,
            reconstruction_loss=reconstruction_loss,
            pooled_output=pooled_output,
            reduced_pooled_output=reduced_pooled,
            reconstructed_pooled_output=decoded_reduced_pooled_output if self.do_reconstruction else None,
        )

In [None]:
class MPNetCompressedModel(MPNetReducedPreTrainedModel):
    def __init__(self, config=None, base_model=None, reduce_module=None, **kwargs):
        super().__init__(config)

        kwargs['add_pooling_layer'] = False     # We use our own pooling instead
        self.mpnet = base_model or MPNetModel(self.config, **kwargs)
        self.pooler = SBertPooler(self.config)
        self.reduce = reduce_module or DimReduce(self.config)
        
    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, 
                inputs_embeds=None, output_attentions=None, output_hidden_states=None, return_dict=None):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.mpnet(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
        )

        sequence_output = outputs[0]
        pooled_output = self.pooler(sequence_output, attention_mask)  
        reduced_pooled = self.reduce(pooled_output)

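        if not return_dict:
            # Assumed mapping of the original placeholder names: reduced embeddings, then full-size pooled embeddings
            return (reduced_pooled, pooled_output) + outputs[2:]

        # This model computes no loss, so only the embeddings and base-model extras are returned.
        # Assumes the output class accepts the embedding fields used in MPNetCompressedForPretraining.
        return CompressedModelForPreTrainingOutput(
            pooled_output=pooled_output,
            reduced_pooled_output=reduced_pooled,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )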

In [3]:
from transformers import AutoTokenizer
from transformers import AutoModel

checkpoint = "sentence-transformers/all-mpnet-base-v2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
mpnet = AutoModel.from_pretrained(checkpoint, add_pooling_layer=False)

In [7]:
from transformers import AutoConfig
from reduced_encoders import MPNetReducedConfig


base_config = AutoConfig.from_pretrained(checkpoint)
config = MPNetReducedConfig.from_config(base_config, reduction_sizes=[512,256,128,64,48])
config

MPNetReducedConfig {
  "_name_or_path": "sentence-transformers/all-mpnet-base-v2",
  "architectures": [
    "MPNetForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "mpnet_reduced",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "pooling_mode": "mean",
  "reduced_size": 48,
  "reduction_sizes": [
    512,
    256,
    128,
    64,
    48
  ],
  "relative_attention_num_buckets": 32,
  "transformers_version": "4.31.0.dev0",
  "vocab_size": 30527
}
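
To make the layer widths concrete, the following sketch works out the decoder sizes for the configuration printed above, using the same slicing expression as `Decoder.__init__`. It assumes the reduction head chains `reduction_sizes` in order starting from `hidden_size`; the names `encoder_dims` and `decoder_dims` are introduced here for illustration only.

```python
hidden_size = 768
reduction_sizes = [512, 256, 128, 64, 48]

# Assumed encoder (reduction) layers: 768 -> 512 -> 256 -> 128 -> 64 -> 48
encoder_dims = list(zip([hidden_size] + reduction_sizes[:-1], reduction_sizes))

# Same expression as in Decoder.__init__: skip the last size, reverse the rest, append hidden_size
decoding_sizes = reduction_sizes[-2::-1] + [hidden_size]
decoder_dims = list(zip([reduction_sizes[-1]] + decoding_sizes[:-1], decoding_sizes))

print(decoding_sizes)  # [64, 128, 256, 512, 768]
print(decoder_dims)    # [(48, 64), (64, 128), (128, 256), (256, 512), (512, 768)]
```

Under that assumption the decoder mirrors the encoder: each decoder layer inverts the (input, output) sizes of the corresponding encoder layer, in reverse order.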

In [8]:
compressed_model = MPNetCompressedForPretraining(config=config, base_model=mpnet)
compressed_model

MPNetCompressedForPretraining(
  (mpnet): MPNetModel(
    (embeddings): MPNetEmbeddings(
      (word_embeddings): Embedding(30527, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): MPNetEncoder(
      (layer): ModuleList(
        (0-11): 12 x MPNetLayer(
          (attention): MPNetAttention(
            (attn): MPNetSelfAttention(
              (q): Linear(in_features=768, out_features=768, bias=True)
              (k): Linear(in_features=768, out_features=768, bias=True)
              (v): Linear(in_features=768, out_features=768, bias=True)
              (o): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [17]:
from reduced_encoders.debug_utils import compare_weights

compare_weights(mpnet, compressed_model.mpnet)

True
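
For readers without access to `reduced_encoders.debug_utils`, a check of this kind could look roughly like the sketch below. The helper `same_weights` is hypothetical and is not the library's `compare_weights`; it simply compares two models' state dicts tensor by tensor.

```python
import torch
from torch import nn


def same_weights(model_a: nn.Module, model_b: nn.Module) -> bool:
    """Hypothetical helper: True if both models hold identical parameter tensors."""
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    if state_a.keys() != state_b.keys():
        return False
    return all(torch.equal(state_a[name], state_b[name]) for name in state_a)


layer = nn.Linear(4, 4)
wrapper = nn.Sequential(layer)   # reuses the same module, like passing base_model=mpnet
fresh = nn.Linear(4, 4)          # independently initialised weights

print(same_weights(layer, wrapper[0]))  # True
print(same_weights(layer, fresh))       # False
```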

In [4]:
text = ['This is a test sentence that is meant to determine whether I can run text through my new compressed SBERT model. Did it work?',
        'This is also a test sentence, but it is different from the first one. I hope this works!',
        'A feral cat walked down the street, hoping to find a place to rest for the night',
        'The last sentence was significantly different from the others to see where the embedding lands']

In [5]:
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")

In [11]:
with torch.no_grad():
    outputs = compressed_model(**inputs)

tensor([0.4352, 0.0172, 0.3200, 0.0596, 0.3839, 0.0697])
tensor([0.9052, 0.8000, 0.7335, 0.8493, 0.7925, 0.7880])


In [12]:
outputs

CompressedModelForPreTrainingOutput(loss=tensor(0.3852), hidden_states=None, attentions=None)

In [13]:
compressed_model.config.architectures = [compressed_model.__class__.__name__]
compressed_model.config

MPNetReducedConfig {
  "_name_or_path": "sentence-transformers/all-mpnet-base-v2",
  "architectures": [
    "MPNetCompressedForPretraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "mpnet_reduced",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "pooling_mode": "mean",
  "reduced_size": 48,
  "reduction_sizes": [
    512,
    256,
    128,
    64,
    48
  ],
  "relative_attention_num_buckets": 32,
  "transformers_version": "4.31.0.dev0",
  "vocab_size": 30527
}

In [15]:
from huggingface_hub import notebook_login

notebook_login()

In [16]:
compressed_checkpoint = "cayjobla/all-mpnet-base-v2-compressed"
compressed_model.push_to_hub(compressed_checkpoint)
tokenizer.push_to_hub(compressed_checkpoint)

CommitInfo(commit_url='https://huggingface.co/cayjobla/all-mpnet-base-v2-compressed/commit/805dcba348f4af29642076a2c2c4266593448b4c', commit_message='Upload tokenizer', commit_description='', oid='805dcba348f4af29642076a2c2c4266593448b4c', pr_url=None, pr_revision=None, pr_num=None)

## Test the implemented model on a set of test inputs

In [1]:
from reduced_encoders import MPNetCompressedForPretraining

compressed_checkpoint = "cayjobla/all-mpnet-base-v2-compressed"
compressed_ae_model = MPNetCompressedForPretraining.from_pretrained(compressed_checkpoint, revision="autoencoder")

In [2]:
compressed_ae_model

MPNetCompressedForPretraining(
  (mpnet): MPNetModel(
    (embeddings): MPNetEmbeddings(
      (word_embeddings): Embedding(30527, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): MPNetEncoder(
      (layer): ModuleList(
        (0-11): 12 x MPNetLayer(
          (attention): MPNetAttention(
            (attn): MPNetSelfAttention(
              (q): Linear(in_features=768, out_features=768, bias=True)
              (k): Linear(in_features=768, out_features=768, bias=True)
              (v): Linear(in_features=768, out_features=768, bias=True)
              (o): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [3]:
from datasets import load_dataset

dataset = load_dataset("wikipedia", "20220301.en", split="train[:5]")
dataset

Found cached dataset wikipedia (/home/cayjobla/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559)


Dataset({
    features: ['id', 'url', 'title', 'text'],
    num_rows: 5
})

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(compressed_checkpoint)

In [5]:
inputs = tokenizer(dataset['text'], padding=True, truncation=True, return_tensors="pt")
inputs

{'input_ids': tensor([[    0,  9621, 11144,  ...,  2265,  2111,     2],
        [    0, 19469,  2007,  ...,  6024,  1014,     2],
        [    0,  2636, 28763,  ...,  2001,  2000,     2],
        [    0,  1041,  1014,  ...,  1004,  2869,     2],
        [    0,  6045,  1010,  ...,  6045,  1009,     2]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]])}

In [6]:
outputs = compressed_ae_model(**inputs)

decoded_reduced_pooled_output shape: torch.Size([5, 768])
pooled_output shape: torch.Size([5, 768])


In [7]:
outputs

CompressedModelForPreTrainingOutput(loss=tensor(0.7705, grad_fn=<AddBackward0>), contrastive_loss=tensor(0.7636, grad_fn=<MseLossBackward0>), reconstruction_loss=tensor(0.0069, grad_fn=<MseLossBackward0>), pooled_output=tensor([[-0.0648,  0.0762,  0.0211,  ..., -0.0104, -0.0436,  0.0447],
        [ 0.0031, -0.0040, -0.0704,  ...,  0.0212, -0.0345,  0.0325],
        [ 0.1005, -0.1929, -0.0396,  ..., -0.0476,  0.0810,  0.0176],
        [-0.0144, -0.1616, -0.0612,  ..., -0.0994,  0.0422, -0.0320],
        [ 0.0237,  0.0074,  0.0287,  ..., -0.0278,  0.0005,  0.0005]],
       grad_fn=<DivBackward0>), reduced_pooled_output=tensor([[ 3.5362e-02,  4.5518e-02,  5.2559e-02,  6.2311e-02, -3.6791e-03,
          3.8930e-02,  3.8788e-02, -1.4240e-02, -5.5849e-02,  5.9786e-02,
          4.9462e-02,  8.7500e-03, -5.0010e-02, -2.8244e-02, -2.7903e-02,
         -3.7485e-02,  3.3196e-03,  4.1013e-03, -1.2987e-02,  5.1692e-02,
          3.6447e-02, -3.9238e-02, -5.7299e-02,  8.5411e-02, -4.6529e-02,
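
As a quick sanity check on the printed values, the total loss should equal the weighted sum of its two components. The snippet below assumes the default weights `alpha = beta = 1` from the class definition earlier in the notebook and copies the rounded numbers from the output.

```python
# Values copied from the printed output (rounded to four decimals)
contrastive_loss = 0.7636
reconstruction_loss = 0.0069
reported_total = 0.7705

alpha, beta = 1, 1   # assumed: the default weights in MPNetCompressedForPretraining.__init__
total = alpha * contrastive_loss + beta * reconstruction_loss

# Allow for rounding in the printed tensors
assert abs(total - reported_total) < 1e-4
```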
     

## Check the loss values for accuracy

In [4]:
from reduced_encoders import MPNetCompressedForPretraining

compressed_checkpoint = "cayjobla/all-mpnet-base-v2-compressed"
compressed_ae_model = MPNetCompressedForPretraining.from_pretrained(compressed_checkpoint, revision="autoencoder")

In [10]:
text = ['This is a test sentence that is meant to determine whether I can run text through my new compressed SBERT model. Did it work?',
        'This is also a test sentence, but it is different from the first one. I hope this works!',
        'A feral cat walked down the street, hoping to find a place to rest for the night',
        'The last sentence was significantly different from the others to see where the embedding lands']

In [18]:
from transformers import AutoTokenizer

# Tokenize
tokenizer = AutoTokenizer.from_pretrained(compressed_checkpoint)
input_values = tokenizer(text, padding=True, truncation=True, return_tensors="pt")

In [20]:
# Get embeddings
embeddings = compressed_ae_model.base_model(**input_values)
sentence_embeddings = compressed_ae_model.pooler(embeddings.last_hidden_state, attention_mask=input_values.attention_mask)
sentence_embeddings.shape

torch.Size([4, 768])
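
The config sets `pooling_mode` to `"mean"`, so the pooler presumably averages token embeddings while ignoring padding. The sketch below shows that operation in isolation; `mean_pool` is a hypothetical helper and may differ in detail from what `SBertPooler` actually does.

```python
import torch


def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over the sequence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(dim=1)                    # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # avoid division by zero
    return summed / counts


tokens = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])   # last token is padding
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(tokens, mask)
print(pooled)  # tensor([[2., 3.]])
```

The padded token contributes nothing to the average, which is why the attention mask is passed to the pooler alongside the hidden states.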

In [21]:
reduced_embeddings = compressed_ae_model.reduce(sentence_embeddings)
reduced_embeddings.shape

torch.Size([4, 48])

In [22]:
reconstructed_embeddings = compressed_ae_model.decoder(reduced_embeddings)
reconstructed_embeddings.shape

torch.Size([4, 768])

In [31]:
from reduced_encoders.modeling_utils import get_cos_sim
from torch.nn import MSELoss

full_similarity = get_cos_sim(sentence_embeddings)
reduced_similarity = get_cos_sim(reduced_embeddings)
MSELoss(reduction='mean')(full_similarity, reduced_similarity)

tensor(0.6461, grad_fn=<MseLossBackward0>)

In [33]:
full_similarity

tensor([0.4352, 0.0172, 0.3200, 0.0596, 0.3839, 0.0697],
       grad_fn=<IndexBackward0>)

In [1]:
from torch.nn import MSELoss

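# Requires sentence_embeddings and reconstructed_embeddings from the cells above (re-run them after a kernel restart).
# Note: forward() uses F.mse_loss with its default 'mean' reduction, so this 'sum' value is not directly comparable.
reconstruction_loss = MSELoss(reduction='sum')(sentence_embeddings, reconstructed_embeddings)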
reconstruction_loss

NameError: name 'sentence_embeddings' is not defined

In [1]:
from reduced_encoders import MPNetCompressedForPretraining

local_model = MPNetCompressedForPretraining.from_pretrained("all-mpnet-base-v2-compressed")

In [4]:
local_model.get_extra_logging_dict()

{'contrastive_weight': 0.6358375549316406,
 'reconstruction_weight': 0.6146606802940369}