<a href="https://colab.research.google.com/github/OMGarad/BERT-Project/blob/main/Extending_Sentence_Embedding_To_Hindi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.0.0.tar.gz (85 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.9.2-py3-none-any.whl (2.6 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.0.15-py3-none-any.whl (43 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_6

In [None]:
from torch import nn

In [None]:
from sentence_transformers import SentenceTransformer, models

# Create the teacher and student sentence embedding models

max_seq_length = 128  # maximum input length in tokens (not characters); longer inputs are truncated
train_batch_size = 64

# Load teacher model
print("Load teacher model")
teacher_model = SentenceTransformer('bert-base-nli-stsb-mean-tokens') #The teacher model is bert-base-nli-stsb-mean-tokens

# Create student model
print("Create student model")
# The student model is xlm-roberta-base; pass max_seq_length so the 128-token limit is actually applied
word_embedding_model = models.Transformer("xlm-roberta-base", max_seq_length=max_seq_length)

# Apply mean pooling to get one fixed-size sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

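# Dense layer that would reduce the sentence embedding to 256 dimensions.
# It is defined here but not added to the student below, so the student keeps the
# 768-dimensional output of xlm-roberta-base. That matches the teacher's embedding
# size, which the MSE loss used for training requires.
dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=256, activation_function=nn.Tanh())

# Assemble the student model from the transformer and the pooling layer
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])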


Load teacher model



Create student model



Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
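
The pooling layer configured above averages the token embeddings into a single sentence vector. As a rough illustration of what mean pooling does, here is a minimal sketch in plain Python. The `mean_pool` helper and the toy vectors are assumptions made for this example; the library itself does this on batched tensors.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [value / count for value in total]


# Toy input (hypothetical): three tokens of dimension 2, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

The padding token is excluded from the average, which is why the large values in the third vector do not affect the result.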



In [None]:
import os, datetime

In [None]:
print(os.getcwd())

/content
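
The teacher and student created earlier are meant to be trained so that the student reproduces the teacher's English embedding for both the English sentence and its Hindi translation. The sketch below illustrates that objective in plain Python. The 4-dimensional vectors are made up, and `mse` and `distillation_loss` are illustrative names, not part of sentence-transformers, which handles each sentence as a separate training example and averages over the batch.

```python
from typing import List


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_en: List[float],
                      student_en: List[float],
                      student_hi: List[float]) -> float:
    """Pull both student embeddings towards the teacher's English embedding."""
    return mse(student_en, teacher_en) + mse(student_hi, teacher_en)


# Hypothetical embeddings: the student already matches the teacher on English,
# but its Hindi embedding differs in one coordinate.
teacher_en = [0.5, -0.5, 1.0, 0.0]
student_en = [0.5, -0.5, 1.0, 0.0]
student_hi = [0.5, 0.5, 1.0, 0.0]

loss = distillation_loss(teacher_en, student_en, student_hi)
print(loss)  # 0.25
```

Because both terms compare against the same teacher vector, the two vectors must have the same dimension, which is one reason to keep the student's output size equal to the teacher's.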


## Loading the train, test, and validation datasets
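
`ParallelSentencesDataset.load_data` reads a tab-separated text file in which the first column holds the source sentence and the remaining columns hold its translations. The contents of the translated dataset files are not shown in this notebook, so the layout below (English first, Hindi second) is an assumption, and `read_parallel_pairs` is a hypothetical helper for checking such a file before training.

```python
import io
from typing import List, TextIO, Tuple


def read_parallel_pairs(handle: TextIO) -> List[Tuple[str, str]]:
    """Return (source, translation) pairs, skipping lines that lack two non-empty columns."""
    pairs = []
    for line in handle:
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 2 or not all(column.strip() for column in columns[:2]):
            continue
        pairs.append((columns[0].strip(), columns[1].strip()))
    return pairs


# In-memory stand-in for a dataset file (assumed layout: English<TAB>Hindi).
sample = io.StringIO(
    "Hello\tनमस्ते\n"
    "a malformed line without a tab\n"
    "Thank you\tधन्यवाद\n"
)
pairs = read_parallel_pairs(sample)
print(len(pairs))  # 2
```

Running a check like this on the real files can catch empty or single-column lines before they reach the training loop.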

In [None]:
from sentence_transformers.datasets import ParallelSentencesDataset
from torch.utils.data import DataLoader
from sentence_transformers import SentencesDataset, losses, evaluation, readers


###### Load train sets ######

train_reader = ParallelSentencesDataset(student_model=model, teacher_model=teacher_model)
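# NOTE: the training data is loaded from 'Translated_Dataset_Testing.txt', while the test data
# further down is loaded from 'Translated_Dataset_Training.txt'; check that the file names are not swapped.
train_reader.load_data('Translated_Dataset_Testing.txt')  # translated dataset: English text and its Hindi translation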
train_dataloader = DataLoader(train_reader, shuffle=True, batch_size=train_batch_size)
train_loss = losses.MSELoss(model=model)


###### Load dev sets ######

#evaluators = []
#sts_reader = readers.STSDataReader('dataset/', s1_col_idx=0, s2_col_idx=1, score_col_idx=2)
#dev_data = SentencesDataset(examples=sts_reader.get_examples('dev.txt'), model=model)
#dev_dataloader = DataLoader(dev_data, shuffle=False, batch_size=train_batch_size)
#evaluator_sts = evaluation.EmbeddingSimilarityEvaluator(dev_dataloader, name='dev')
#evaluators.append(evaluator_sts)


###### Load test sets ######

test_reader = ParallelSentencesDataset(student_model=model, teacher_model=teacher_model)
test_reader.load_data('Translated_Dataset_Training.txt') #Translated dataset containing English text and the Hindi translation
test_dataloader = DataLoader(test_reader, shuffle=False, batch_size=train_batch_size)
#test_mse = evaluation.MSEEvaluator(test_dataloader, name='test')
#evaluators.append(test_mse)


###### Train model ######

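output_path = "tmp"
# No evaluator is passed, so evaluation_steps and save_best_model have no effect here.
# With 85 iterations per epoch (see the progress output) and 20 epochs, training runs for
# 1700 steps, fewer than warmup_steps=10000, so the learning rate is still ramping up
# when training ends and never reaches 2e-5.
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=20,
          evaluation_steps=1000,
          warmup_steps=10000,
          scheduler='warmupconstant',
          output_path=output_path,
          save_best_model=True,
          optimizer_params={'lr': 2e-5, 'eps': 1e-6, 'correct_bias': False}
          )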

Epoch:   0%|          | 0/20 [00:00<?, ?it/s]

Iteration:   0%|          | 0/85 [00:00<?, ?it/s]
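
The progress output shows 85 iterations per epoch, so 20 epochs amount to 1700 optimizer steps, while the training call asks for 10000 warmup steps. The following sketch works out the consequence, assuming the `warmupconstant` schedule ramps the learning rate linearly from 0 to the base value over the warmup steps and holds it constant afterwards. `warmup_constant_lr` is an illustrative helper, not a library function.

```python
def warmup_constant_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear ramp from 0 to base_lr during warmup, constant afterwards."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr


steps_per_epoch = 85   # taken from the progress bar above
epochs = 20
total_steps = steps_per_epoch * epochs
print(total_steps)  # 1700

# Learning rate reached by the end of training with the notebook's settings.
final_lr = warmup_constant_lr(total_steps, base_lr=2e-5, warmup_steps=10000)
print(final_lr < 2e-5)  # True
```

Under these assumptions the learning rate ends at about 17% of the configured 2e-5, so a warmup shorter than the total number of training steps may be worth trying.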
