* To create an embedding model, we typically need labeled data.
* However, not all real-world datasets come with a set of labels that we can use. In those cases we look for techniques that train the model without any predetermined labels, that is, unsupervised learning.
* Many approaches exist, such as Simple Contrastive Learning of Sentence Embeddings (SimCSE), Contrastive Tension (CT), Transformer-based Sequential Denoising Auto-Encoder (TSDAE), and Generative Pseudo-Labeling (GPL).

# TSDAE: Transformer-based Sequential Denoising Auto-Encoder
* The underlying idea of TSDAE is to add noise to the input sentence by removing a certain percentage of its words.
* This “damaged” sentence is passed through an encoder, with a pooling layer on top of it, to map it to a sentence embedding.
* From this sentence embedding alone, a decoder tries to reconstruct the original sentence, that is, the sentence without the artificial noise.
* The main concept here is that the more accurate the sentence embedding is, the more accurate the reconstructed sentence will be.
* This method is very similar to masked language modeling, where we try to reconstruct and learn certain masked words. Here, instead of reconstructing masked words, we try to reconstruct the entire sentence.
* After training, we keep only the encoder to generate embeddings from text; the decoder is only used during training to judge whether the embeddings can accurately reconstruct the original sentence.
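
The noise step described above can be sketched in plain Python. This is only an illustration of word deletion, not the library's implementation: the helper name `delete_words`, the whitespace tokenization, the example sentence, and the `del_ratio=0.6` default are all assumptions made for this sketch.

```python
import random
from typing import Optional


def delete_words(sentence: str, del_ratio: float = 0.6,
                 rng: Optional[random.Random] = None) -> str:
    """Drop roughly `del_ratio` of the whitespace-separated words.

    At least one word is always kept so the damaged sentence is never empty.
    """
    rng = rng or random.Random()
    words = sentence.split()
    if not words:
        return sentence
    # Keep each word independently with probability (1 - del_ratio).
    keep = [rng.random() > del_ratio for _ in words]
    if not any(keep):
        keep[rng.randrange(len(words))] = True
    return " ".join(word for word, kept in zip(words, keep) if kept)


# Hypothetical example sentence; a seeded RNG keeps the sketch repeatable.
original = "The encoder maps a noisy sentence to a single vector"
damaged = delete_words(original, del_ratio=0.6, rng=random.Random(0))
print(damaged)
```

The surviving words keep their original order, so the decoder's job is to fill the gaps back in from the sentence embedding.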

# Dataset Preparation

In [1]:
! pip install -U sentence_transformers datasets -q

In [None]:
# Download the Punkt sentence tokenizer.
# Punkt divides a text into a list of sentences by using an unsupervised
# algorithm to build a model for abbreviation words, collocations, and words
# that start sentences. It must be trained on a large collection of plaintext
# in the target language before it can be used.
# Note: newer NLTK releases may look for the "punkt_tab" resource instead.
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# load dataset
from datasets import load_dataset, Dataset
ds=load_dataset("glue","mnli",split="train").select(range(25_000))

In [None]:
ds

Dataset({
    features: ['premise', 'hypothesis', 'label', 'idx'],
    num_rows: 25000
})

In [None]:
# Flatten premises and hypotheses into a single list of sentences (list concatenation)
flat_sentences=ds["premise"]+ds["hypothesis"]
flat_sentences[0]

'Conceptually cream skimming has two basic dimensions - product and geography.'

In [None]:
len(flat_sentences)==25_000*2

True

In [None]:
# Add noise to the data: DenoisingAutoEncoderDataset pairs each sentence with a
# "damaged" copy from which some of the words have been deleted.
# list(set(...)) removes duplicate sentences first.
from sentence_transformers.datasets import DenoisingAutoEncoderDataset
damaged_dataset = DenoisingAutoEncoderDataset(list(set(flat_sentences)))

In [None]:
# Build a Dataset of (damaged, original) sentence pairs
from tqdm import tqdm
train_dataset={"damaged_sentence":[],"original_sentence":[]}
for data in tqdm(damaged_dataset):
  train_dataset["damaged_sentence"].append(data.texts[0])
  train_dataset["original_sentence"].append(data.texts[1])
train_ds=Dataset.from_dict(train_dataset)
train_ds

100%|██████████| 48353/48353 [00:13<00:00, 3628.96it/s]


Dataset({
    features: ['damaged_sentence', 'original_sentence'],
    num_rows: 48353
})

In [None]:
import pandas as pd
pd.DataFrame(train_dataset).head()

| | damaged_sentence | original_sentence |
|---|---|---|
| 0 | something. | San'doro accused Vrenna of something. |
| 1 | Vrenna shouted | Vrenna! shouted Jon. |
| 2 | understands how works | Everyone knows and understands how our metric ... |
| 3 | see play make worthwhile | If you see a play in ancient theaters they mak... |
| 4 | know every time | I'll know him every time. |
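
The training set has 48,353 rows rather than 50,000 because `list(set(...))` removed duplicate sentences. A `set` does not guarantee any particular order, so the row order may differ between runs. If a reproducible order matters, one possible alternative is an order-preserving de-duplication; the helper name `dedupe_keep_order` and the sample data below are hypothetical.

```python
def dedupe_keep_order(sentences):
    """Remove duplicates while keeping the first occurrence of each sentence."""
    # dict preserves insertion order, so the keys come back in first-seen order.
    return list(dict.fromkeys(sentences))


sample = ["a b", "c d", "a b", "e f", "c d"]
unique = dedupe_keep_order(sample)
print(unique)  # ['a b', 'c d', 'e f']
```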


# Evaluator

In [12]:
val_ds=load_dataset("glue","stsb",split="validation")
val_ds

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 1500
})

In [None]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

In [13]:
evaluator=EmbeddingSimilarityEvaluator(
    sentences1=val_ds["sentence1"],
    sentences2=val_ds["sentence2"],
    scores=[score/5 for score in val_ds["label"]]
)
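
To make the evaluator less of a black box, here is a rough pure-Python sketch of one of the metrics it reports, the Spearman correlation between cosine similarities and gold scores. The toy embeddings and gold scores are invented for illustration, and the helper functions are not part of `sentence_transformers`; dividing by 5 mirrors the normalization of STS-B labels (0 to 5) used above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Toy data: three sentence pairs with made-up 2-d embeddings and gold scores.
emb1 = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
emb2 = [[1.0, 0.1], [1.0, 0.0], [1.0, 0.0]]
gold = [5.0, 2.5, 0.0]

normalized = [g / 5 for g in gold]                    # scale labels to [0, 1]
sims = [cosine(a, b) for a, b in zip(emb1, emb2)]     # model similarity per pair
print(spearman(sims, normalized))                     # close to 1.0: same ordering
```

Because Spearman only compares rankings, the model is rewarded for ordering pairs by similarity the same way the human labels do, regardless of the absolute similarity values.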

# Create Embedding Model

Using the [CLS] token as the pooling strategy means that instead of averaging all token embeddings, you just take the final layer's embedding of the [CLS] token to represent the sentence.
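
The difference between the two pooling strategies can be shown on a toy example. The token vectors below are invented and use plain Python lists; the function names are hypothetical and this is a sketch of the idea, not of how `models.Pooling` is implemented.

```python
def cls_pooling(token_embeddings):
    """Use the first token's vector (the [CLS] position) as the sentence embedding."""
    return list(token_embeddings[0])


def mean_pooling(token_embeddings, attention_mask):
    """Average the vectors of all non-padding tokens."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy final-layer output for one sentence: 4 positions, 2 dimensions each.
tokens = [
    [0.5, 1.0],  # [CLS]
    [1.0, 3.0],  # word
    [3.0, 2.0],  # word
    [9.0, 9.0],  # [PAD], ignored by mean pooling via the attention mask
]
mask = [1, 1, 1, 0]

print(cls_pooling(tokens))         # [0.5, 1.0]
print(mean_pooling(tokens, mask))  # [1.5, 2.0]
```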


In [16]:
from sentence_transformers import models, SentenceTransformer

In [17]:
word_embed_model=models.Transformer("bert-base-uncased")
word_embed_model

Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 

In [19]:
pooling_layer=models.Pooling(
    word_embed_model.get_word_embedding_dimension(),
    "cls"  # CLS pooling: use the first ([CLS]) token's embedding as the sentence embedding
    )

In [21]:
embedding_model=SentenceTransformer(modules=[word_embed_model,pooling_layer])
embedding_model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

# Loss Function: DenoisingAutoEncoderLoss
* Given our sentence pairs, we need a loss function that attempts to reconstruct the original sentence from the noisy sentence, namely DenoisingAutoEncoderLoss. By doing so, the model learns how to represent the data accurately. It is similar to masking, but without knowing where the actual masks are.
* Moreover, we tie the parameters of both models (`tie_encoder_decoder=True`). Instead of having separate weights for the encoder’s embedding layer and the decoder’s output layer, they share the same weights. This means that any update to the weights in one layer is reflected in the other layer as well.
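
The idea of tied weights can be illustrated without any deep learning framework. In the toy class below (a hypothetical `TiedEmbeddings`, not how `DenoisingAutoEncoderLoss` is implemented), the input embedding and the output projection are the same matrix object, so an update through one is visible through the other.

```python
class TiedEmbeddings:
    """Toy model whose input embedding and output projection share one matrix."""

    def __init__(self, weights):
        self.embedding = weights          # used to look up token vectors
        self.output_projection = weights  # the same object, not a copy

    def embed(self, token_id):
        """Return the vector for a token id."""
        return self.embedding[token_id]

    def logits(self, hidden):
        """Score every vocabulary entry by its dot product with `hidden`."""
        return [sum(h * w for h, w in zip(hidden, row))
                for row in self.output_projection]


# Vocabulary of 3 tokens, 2 dimensions each (made-up values).
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
model = TiedEmbeddings(weights)

# "Update" the embedding of token 0; the output layer sees the same change.
model.embedding[0][0] = 2.0
print(model.output_projection[0][0])  # 2.0
print(model.logits([1.0, 1.0]))       # [2.0, 1.0, 2.0]
```

Sharing the matrix roughly halves the parameters that would otherwise be stored for these two layers.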

In [22]:
from sentence_transformers import losses

In [24]:
loss=losses.DenoisingAutoEncoderLoss(embedding_model,tie_encoder_decoder=True)

Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert.encoder.layer.0.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.0.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.0.crossattention.output.dense.bias', 'bert.encoder.layer.0.crossattention.output.dense.weight', 'bert.encoder.layer.0.crossattention.self.key.bias', 'bert.encoder.layer.0.crossattention.self.key.weight', 'bert.encoder.layer.0.crossattention.self.query.bias', 'bert.encoder.layer.0.crossattention.self.query.weight', 'bert.encoder.layer.0.crossattention.self.value.bias', 'bert.encoder.layer.0.crossattention.self.value.weight', 'bert.encoder.layer.1.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.1.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.1.crossattention.output.dense.bias', 'bert.encoder.layer.1.crossattention.output.dense.weight', 'bert.encoder.layer.1.crossattention.self.key.bias', 'bert.e

In [25]:
loss.decoder=loss.decoder.to("cuda")

# Train Args / Training

In [26]:
from sentence_transformers.trainer import SentenceTransformerTrainer, SentenceTransformerTrainingArguments

In [27]:
args=SentenceTransformerTrainingArguments(
    "tsdae_model",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    fp16=True,
    warmup_steps=100,
    eval_steps=100,  # only takes effect if a step-based evaluation strategy is also set
    logging_steps=100
)

In [29]:
trainer=SentenceTransformerTrainer(
    embedding_model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    loss=loss,
    evaluator=evaluator
)


In [30]:
trainer.train()


| Step | Training Loss |
|---|---|
| 100 | 6.9263 |
| 200 | 4.8678 |
| 300 | 4.6062 |
| 400 | 4.4897 |
| 500 | 4.3732 |
| 600 | 4.2692 |
| 700 | 4.2217 |
| 800 | 4.15 |
| 900 | 4.1189 |
| 1000 | 4.066 |


TrainOutput(global_step=3023, training_loss=4.031723565074177, metrics={'train_runtime': 1127.9495, 'train_samples_per_second': 42.868, 'train_steps_per_second': 2.68, 'total_flos': 0.0, 'train_loss': 4.031723565074177, 'epoch': 1.0})
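
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, so that one optimizer step consumes exactly one batch of 16 sentences; under those assumptions the step count and throughput follow from the dataset size and runtime reported above.

```python
import math

num_rows = 48_353          # rows in train_ds
batch_size = 16            # per_device_train_batch_size
train_runtime = 1127.9495  # seconds, from TrainOutput

# One epoch needs ceil(rows / batch size) steps; the last batch is partial.
steps = math.ceil(num_rows / batch_size)
print(steps)  # 3023

# Throughput figures derived from the same values.
print(round(num_rows / train_runtime, 3))  # samples per second
print(round(steps / train_runtime, 2))     # 2.68
```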

In [31]:
evaluator(embedding_model)

{'pearson_cosine': 0.7432891215606202,
 'spearman_cosine': 0.750292117428726,
 'pearson_manhattan': 0.7496863640347652,
 'spearman_manhattan': 0.753500874091522,
 'pearson_euclidean': 0.7505275412747955,
 'spearman_euclidean': 0.7542343565848624,
 'pearson_dot': 0.6107435031662969,
 'spearman_dot': 0.603761221469202,
 'pearson_max': 0.7505275412747955,
 'spearman_max': 0.7542343565848624}