# Using BERT instead of word embeddings

A more recent development in the embeddings world is BERT (Bidirectional
Encoder Representations from Transformers). Like word embeddings, BERT produces
a vector representation, but it takes context into account and can represent a whole
sentence. We can use the Hugging Face sentence_transformers package to
represent sentences as vectors.

In [None]:
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

In [None]:
# !pip install transformers

In [None]:
# !pip install -U sentence-transformers

# How to do it…
The Hugging Face code makes using BERT very easy. The first time the code runs, it will
download the necessary model, which might take some time. Once you've downloaded it,
it's just a matter of encoding the sentences using the model.

SENTENCE TRANSFORMER:
Generates highly informative, fixed-size numerical representations for entire sentences or paragraphs.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModel

In [None]:
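def mean_pooling(model_output, attention_mask):
  # The first element of the model output holds the per-token embeddings
  last_hidden_state = model_output[0]

  # Expand the mask to the embedding shape so padding tokens contribute zeros
  input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()

  # Sum the embeddings of the real (non-padding) tokens
  sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, 1)

  # Count the real tokens; clamp to avoid division by zero
  sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)

  return sum_embeddings / sum_mask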

class CustomSentenceEmbedder(torch.nn.Module):
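  # The default is an assumption: the Hub ID of the model loaded later in this notebook
  def __init__(self, model_name: str = "sentence-transformers/bert-base-nli-mean-tokens"):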
    super().__init__()

    # Load the pretrained tokenizer and model
    print(f"Loading model and tokenizer for: {model_name}")

    self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    self.model = AutoModel.from_pretrained(model_name)

  def forward(self, features):
    model_output = self.model(**features)

    sentence_embeddings = mean_pooling(model_output, features["attention_mask"])

    return sentence_embeddings
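
To see what mean_pooling computes without loading a model, the same arithmetic can be traced on a toy example in plain Python. This is only a sketch: the helper name masked_mean, the 2-dimensional token vectors, and the mask are made up for illustration and are not part of the notebook's code.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                sums[i] += value
    # Mirror the clamp in mean_pooling: never divide by zero
    denom = max(count, 1e-9)
    return [s / denom for s in sums]


# Hypothetical token vectors; the last position is padding (mask = 0)
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean(token_vectors, attention_mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector [9.0, 9.0] does not affect the result, which is the point of multiplying by the expanded attention mask before summing.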


READ IN TEXT FILE

In [None]:
filename = "/content/001_Study_in_Scarlet.txt"
# Use a context manager so the file is closed after reading
with open(filename, "r", encoding="utf-8") as file:
    text = file.read()

DIVIDE INTO SENTENCES

In [35]:
from nltk.tokenize import sent_tokenize
import nltk
nltk.download("punkt_tab")

sentences = sent_tokenize(text)
print(len(sentences))

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


2675


LOAD SENTENCE TRANSFORMER MODEL

In [36]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("bert-base-nli-mean-tokens")

GET THE EMBEDDINGS

In [37]:
sentence_embeddings = model.encode(sentences)

In [38]:
print(sentence_embeddings)

[[-0.15110186  0.8270001   0.4035079  ...  0.49568996  0.49750334
   0.6382162 ]
 [ 0.2457121   1.0529932   1.7424126  ...  0.46131474 -0.17410451
  -0.05920735]
 [-0.29506397 -0.3391045  -0.16513881 ... -0.08156896 -0.4176043
   0.23892957]
 ...
 [-0.154424    0.550847    0.5056195  ... -0.00271062  0.2175374
   0.42466092]
 [-0.10927247  0.9013409   1.4984224  ... -0.12560225 -0.22273837
   0.2514665 ]
 [-0.3909826   0.12573327  1.1411537  ... -0.18205898  0.25415018
   0.8183674 ]]


We can also encode a part of a sentence, such as a noun chunk:

In [39]:
sentence_embeddings = model.encode(["the beautiful lake"])

print(sentence_embeddings)

[[-7.61980042e-02 -5.74669957e-01  1.08264244e+00  7.36554861e-01
   5.51345468e-01 -9.39117789e-01 -2.80430377e-01 -5.41625440e-01
   7.50948966e-01 -4.40971285e-01  5.31526923e-01 -5.41882873e-01
   1.92792878e-01  3.44117492e-01  1.50266480e+00 -6.26990438e-01
  -2.42829174e-01 -3.66734892e-01  5.57459950e-01 -2.21802875e-01
  -9.69591916e-01 -4.38949734e-01 -7.93552697e-01 -5.84922671e-01
  -1.55690745e-01  2.12004572e-01  4.02014256e-01 -2.63063878e-01
   6.21910095e-01  5.97238183e-01  9.78124440e-02  7.20052302e-01
  -4.66322958e-01  3.86450529e-01 -8.24903309e-01  1.09985709e+00
  -3.59135211e-01 -4.31919038e-01  2.56567597e-02  5.73160291e-01
   2.40237281e-01 -7.67570615e-01  9.38899517e-01 -3.60024393e-01
  -8.77115369e-01 -2.47681215e-01 -8.65838170e-01  1.04203582e+00
   3.65989447e-01 -6.47720546e-02 -7.04247296e-01  5.91108808e-03
  -8.04807365e-01  2.21370488e-01 -1.79775044e-01  8.04758728e-01
  -4.44356829e-01 -4.46378887e-01  7.55990297e-02 -2.17623502e-01
   6.87522

# How it works…
The sentence transformer's BERT model is pre-trained, just like a word2vec
model, and it encodes a sentence into a vector. The difference between the two is that
a sentence transformer model encodes whole sentences, whereas a word2vec model encodes
individual words.
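
One practical consequence of having a single vector per sentence is that two sentences can be compared directly, for example with cosine similarity. The following is a minimal sketch in plain Python; the three short vectors are invented stand-ins for real embeddings, and the function name is our own, not part of sentence_transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings" (real ones have hundreds of dimensions)
vec_lake = [0.2, 0.9, 0.1]
vec_pond = [0.25, 0.8, 0.05]
vec_train = [0.9, -0.1, 0.4]

sim_close = cosine_similarity(vec_lake, vec_pond)
sim_far = cosine_similarity(vec_lake, vec_train)
print(sim_close > sim_far)  # True
```

With real embeddings, each row returned by model.encode could be passed in the same way, since a row is simply a sequence of floats.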