# 08 Text Embeddings

For me, text embeddings are where the actual multilingual topic modeling magic in BERTopic happens. This section therefore introduces the Sentence-BERT text embedding models and walks through examples to see how they work, what they can do, and what they can't.

## 8.1 First Example and Visualisation

In [2]:
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import euclidean
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import plotly.express as px
import pandas as pd
from tqdm.auto import tqdm
from src.SampleTranslation05.translation_01 import load_samples
tqdm.pandas()

import sys
sys.path.append('/Users/robinfeldmann/Projects/MaximumVarianceUnfolding')
from MVU import MaximumVarianceUnfolding



Import the English-only model.

In [33]:
model = SentenceTransformer("all-MiniLM-L6-v2")

In the first example, the sentence "This sentence will be embedded." is embedded. The result is a 384-dimensional vector.

In [34]:
sentence = "This sentence will be embedded."
embedding = model.encode(sentence)

print(embedding.shape)

(384,)


Because high-dimensional vectors are very hard to visualize or understand, we will use PCA to reduce the vectors to two dimensions. Keep in mind that this projection loses (a lot of) information. We will also calculate the pairwise distances between embeddings to interpret the data, but this is only practical for small sets of sentences.
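
How much information a two-component projection keeps can be estimated from the explained variance ratio. The following is a minimal sketch using plain NumPy on random stand-in data (not real embeddings); the function name `explained_variance_ratio` is chosen here for illustration. A fitted scikit-learn `PCA` exposes the same quantity as `explained_variance_ratio_`.

```python
import numpy as np

def explained_variance_ratio(X, n_components=2):
    """Fraction of the total variance captured by each of the first principal components."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centred data are proportional
    # to the variance along each principal axis.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return variances[:n_components] / variances.sum()

rng = np.random.default_rng(0)

# Random stand-in for 50 embeddings with 384 dimensions (assumption, not model output).
many_points = rng.normal(size=(50, 384))
ratio_many = explained_variance_ratio(many_points)
print("50 points:", ratio_many, "sum:", ratio_many.sum())

# Three points always lie in a plane, so two components capture (almost exactly) everything.
three_points = rng.normal(size=(3, 384))
ratio_three = explained_variance_ratio(three_points)
print("3 points sum:", ratio_three.sum())
```

For the three-sentence examples in this section the projection is therefore essentially lossless; the loss only becomes relevant once more sentences are embedded.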

In [39]:
sentences= ["The number PI has infinite decimal places and beginns with 3",
            "To calculate geometrical aspects of circles you need to use a special irrational number",
            "The weather in nuremberg is very bad, as expected"
]

df_first_example = pd.DataFrame(sentences,columns=['sentence'])

df_first_example['embeddings'] = df_first_example['sentence'].progress_apply(model.encode)

100%|██████████| 3/3 [00:00<00:00, 75.21it/s]


In [37]:
print(f"Distance between '{df_first_example.iloc[0]['sentence']}' and '{df_first_example.iloc[1]['sentence']}' is: {euclidean( df_first_example.iloc[0]['embeddings'],  df_first_example.iloc[1]['embeddings'])}")
print(f"Distance between '{df_first_example.iloc[0]['sentence']}' and '{df_first_example.iloc[2]['sentence']}' is: {euclidean( df_first_example.iloc[0]['embeddings'],  df_first_example.iloc[2]['embeddings'])}")
print(f"Distance between '{df_first_example.iloc[1]['sentence']}' and '{df_first_example.iloc[2]['sentence']}' is: {euclidean( df_first_example.iloc[1]['embeddings'],  df_first_example.iloc[2]['embeddings'])}")


Distance between 'The number PI has infinite decimal places and beginns with 3' and 'To calculate geometrical aspects of circles you need to use a special irrational number' is: 1.0821858644485474
Distance between 'The number PI has infinite decimal places and beginns with 3' and 'The weather in nuremberg is very bad, as expected' is: 1.462660789489746
Distance between 'To calculate geometrical aspects of circles you need to use a special irrational number' and 'The weather in nuremberg is very bad, as expected' is: 1.4487946033477783
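
The three distances printed above can also be computed in one step as a full distance matrix. The sketch below uses toy unit-length vectors (an assumption, not model output) and also shows the relation between Euclidean distance and cosine similarity: for unit-length vectors, `d**2 == 2 - 2 * cos`, so both measures rank pairs in the same order. The function names are illustrative.

```python
import numpy as np

def pairwise_euclidean(embeddings):
    """Return the (n, n) matrix of Euclidean distances between all rows."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def cosine_similarity_matrix(embeddings):
    """Return the (n, n) matrix of cosine similarities between all rows."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T

# Toy unit-length vectors standing in for sentence embeddings.
toy = np.array([[1.0, 0.0, 0.0],
                [0.8, 0.6, 0.0],
                [0.0, 0.0, 1.0]])

dist = pairwise_euclidean(toy)
cos = cosine_similarity_matrix(toy)
print(np.round(dist, 3))
print(np.round(cos, 3))
```

With the notebook's data, the input would be `np.array(df_first_example['embeddings'].to_list())`.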


The embeddings of the first two sentences are noticeably closer to each other than either is to the third sentence. Now let's visualize this using PCA.

In [38]:
from src.TextEmbeddings08.text_embeddings import visualise_with_dimensional_reduction

In [40]:
pca = PCA(n_components=2)

visualise_with_dimensional_reduction(df_first_example, pca)
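
The helper `visualise_with_dimensional_reduction` lives in the project's `src` package and its source is not shown here. As a rough, hypothetical sketch of the part that later cells repeat by hand (projecting to 2D and padding the axis ranges), something like the following could be used. The name `reduce_to_2d` and the `padding` parameter are assumptions for illustration, not the project's actual implementation.

```python
import numpy as np

def reduce_to_2d(embeddings, reducer, padding=1.0):
    """Project embeddings to 2D and compute padded axis ranges for plotting.

    `reducer` is any object with a scikit-learn style `fit_transform` method
    that returns at least two columns (for example PCA(n_components=2)).
    """
    coords = np.asarray(reducer.fit_transform(np.asarray(embeddings)))
    x, y = coords[:, 0], coords[:, 1]
    # Pad the ranges so that markers at the border are not cut off.
    x_range = [float(x.min()) - padding, float(x.max()) + padding]
    y_range = [float(y.min()) - padding, float(y.max()) + padding]
    return x, y, x_range, y_range
```

With the data from this section it might be called as `reduce_to_2d(df_first_example['embeddings'].to_list(), PCA(n_components=2))`, and the results passed on to `px.scatter` via `range_x` and `range_y`.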

Alternatively, we can use Maximum Variance Unfolding, an algorithm built to preserve the pairwise distances between neighbouring points as well as possible.

In [41]:
mvu = MaximumVarianceUnfolding()
min_embeddings = (mvu.fit_transform(np.array(df_first_example['embeddings'].to_list()), dim=2, k=3))
df_first_example['x'] = min_embeddings[:,0]
df_first_example['y'] = min_embeddings[:,1]

x_axis_range = [df_first_example['x'].min()-1, df_first_example['x'].max()+1]
y_axis_range = [df_first_example['y'].min()-1, df_first_example['y'].max()+1]

fig = px.scatter(df_first_example, y='y', x='x',range_x=x_axis_range, range_y=y_axis_range,  hover_data=['sentence'])
fig.update_traces(marker_size=10)
fig.show()

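I can't really see a difference here, so I will probably use PCA most of the time, as it seems to be the standard. With only three points this is not surprising: three points always lie in a plane, so a two-dimensional projection can keep their pairwise distances intact. Maybe I will try MVU again on bigger, more confusing data.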

The sentences about math are embedded close to each other. Hover over the dots to see the sentences.

Models are taken from this [website](https://www.sbert.net/docs/pretrained_models.html). For the model "all-MiniLM-L6-v2", which we are using for English, it lists a maximum sequence length of 256. So let's see what happens when we pass in longer texts.

In [18]:
sentences = ["The number PI has infinite decimal places and beginns with 3",
             "To calculate geometrical aspects of circles you need to use a special irrational number. For example to caclulcate the circumference of a circle with radius r: You would need to calculate two times r times this special number to get the correct result. This ofcourse can only be an estimation as no one can calculate with a number with like this.",
             "The weather in nuremberg is very bad, as expected",]

for sen in sentences:
    print(len(sen))

df_first_example = pd.DataFrame(sentences,columns=['sentence'])
df_first_example['embeddings'] = df_first_example['sentence'].progress_apply(model.encode)

pca = PCA(n_components=2)
min_embeddings = (pca.fit_transform(np.array(df_first_example['embeddings'].to_list())))
df_first_example['x'] = min_embeddings[:,0]
df_first_example['y'] = min_embeddings[:,1]

x_axis_range = [df_first_example['x'].min()-1, df_first_example['x'].max()+1]
y_axis_range = [df_first_example['y'].min()-1, df_first_example['y'].max()+1]

fig = px.scatter(df_first_example, y='y', x='x',range_x=x_axis_range, range_y=y_axis_range,  hover_data=['sentence'],)
fig.update_traces(marker_size=10)
fig.show()

60
346
49


100%|██████████| 3/3 [00:00<00:00, 55.27it/s]


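I can't really see a difference here. Note that the limit of 256 is counted in word pieces (tokens), not characters, so the 346-character text above most likely stays below it and is not truncated at all. So let's move on and see how the English model performs on translated sentences.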

In [19]:
sentences= ["The number PI has infinite decimal places and beginns with 3",
            "To calculate geometrical aspects of circles you need to use a special irrational number",
            "The weather in nuremberg is very bad, as expected",
            "Die Zahl PI hat unendlich Nachkommastellen und beginnt mit 3",
            "Чтобы вычислить геометрические аспекты окружностей, нужно использовать специальное иррациональное число",
            "Das Wetter in Nürnberg ist wie erwartet, sehr schlecht",
]

languages = ["eng","eng","eng","de","ru","de"]
df_first_example = pd.DataFrame(sentences,columns=['sentence'])
df_first_example['lang'] = languages
df_first_example['embeddings'] = df_first_example['sentence'].progress_apply(model.encode)

100%|██████████| 6/6 [00:00<00:00, 73.06it/s]


In [20]:
pca = PCA(n_components=2)
min_embeddings = (pca.fit_transform(np.array(df_first_example['embeddings'].to_list())))
df_first_example['x'] = min_embeddings[:,0]
df_first_example['y'] = min_embeddings[:,1]

x_axis_range = [df_first_example['x'].min()-1, df_first_example['x'].max()+1]
y_axis_range = [df_first_example['y'].min()-1, df_first_example['y'].max()+1]

fig = px.scatter(df_first_example, y='y', x='x',range_x=x_axis_range, range_y=y_axis_range,  hover_data=['sentence'], color='lang')
fig.update_traces(marker_size=10)
fig.show()

Now let's compare this to a multilingual model.

In [22]:
model_multi = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

In [23]:
sentences= ["The number PI has infinite decimal places and beginns with 3",
            "To calculate geometrical aspects of circles you need to use a special irrational number",
            "The weather in nuremberg is very bad, as expected",
            "Die Zahl PI hat unendlich Nachkommastellen und beginnt mit 3",
            "Чтобы вычислить геометрические аспекты окружностей, нужно использовать специальное иррациональное число",
            "Das Wetter in Nürnberg ist wie erwartet, sehr schlecht",
]

languages = ["eng","eng","eng","de","ru","de"]
df_first_example = pd.DataFrame(sentences,columns=['sentence'])
df_first_example['lang'] = languages
df_first_example['embeddings'] = df_first_example['sentence'].progress_apply(model_multi.encode)

pca = PCA(n_components=2)
visualise_with_dimensional_reduction(df_first_example, pca, color_to_show='lang')

100%|██████████| 6/6 [00:00<00:00, 46.05it/s]


Compared to the monolingual model, you can see that the translated texts are now closer to each other than to the texts with similar topics. Let's also try a bigger multilingual model that embeds into a 768-dimensional vector space: paraphrase-multilingual-mpnet-base-v2.

In [24]:
model_multi_max = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

In [25]:
sentences= ["The number PI has infinite decimal places and beginns with 3",
            "To calculate geometrical aspects of circles you need to use a special irrational number",
            "The weather in nuremberg is very bad, as expected",
            "Die Zahl PI hat unendlich Nachkommastellen und beginnt mit 3",
            "Чтобы вычислить геометрические аспекты окружностей, нужно использовать специальное иррациональное число",
            "Das Wetter in Nürnberg ist wie erwartet, sehr schlecht",
]

languages = ["eng","eng","eng","de","ru","de"]
df_first_example = pd.DataFrame(sentences,columns=['sentence'])
df_first_example['lang'] = languages
df_first_example['embeddings'] = df_first_example['sentence'].progress_apply(model_multi_max.encode)

pca = PCA(n_components=2)
visualise_with_dimensional_reduction(df_first_example, pca, color_to_show='lang')

100%|██████████| 6/6 [00:00<00:00, 19.96it/s]
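
Because the 2D plot can hide or distort distances, the claim that translations end up next to each other can also be checked directly in the original embedding space with a nearest-neighbour lookup. The sketch below uses toy 2D vectors (an assumption, not model output), and the function name `nearest_neighbours` is illustrative.

```python
import numpy as np

def nearest_neighbours(embeddings):
    """For each row, return the index of the closest other row (Euclidean distance)."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point must not count as its own neighbour
    return dist.argmin(axis=1)

# Toy vectors standing in for embeddings of two sentences and their translations.
toy = np.array([[0.0, 0.0],   # sentence A
                [0.1, 0.0],   # translation of A
                [5.0, 5.0],   # sentence B
                [5.0, 5.2]])  # translation of B

neighbours = nearest_neighbours(toy)
print(neighbours)
```

With the notebook's data, the input would be `np.array(df_first_example['embeddings'].to_list())`; a multilingual model that aligns translations well should return each sentence's translation as its nearest neighbour.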


In [5]:
model_labse = SentenceTransformer('sentence-transformers/LaBSE')



In [56]:
sentences= ["The number PI has infinite decimal places and beginns with 3",
            "To calculate geometrical aspects of circles you need to use a special irrational number",
            "The weather in nuremberg is very bad, as expected",
            "Die Zahl PI hat unendlich Nachkommastellen und beginnt mit 3",
            "Чтобы вычислить геометрические аспекты окружностей, нужно использовать специальное иррациональное число",
            "Das Wetter in Nürnberg ist wie erwartet, sehr schlecht",
            "Putin",
            "Путин",
            "پوتین",
            "Πούτιν",
            "Pute",
            "Wladimir",
            "Krieg",
            "Frieden"
]

languages = ["eng","eng","eng","de","ru","de", "eng", "ru", "per","gre","de", "eng","de","de"]
df_first_example = pd.DataFrame(sentences,columns=['sentence'])
df_first_example['lang'] = languages
df_first_example['embeddings'] = df_first_example['sentence'].progress_apply(model_labse.encode)

pca = PCA(n_components=2)
visualise_with_dimensional_reduction(df_first_example, pca, color_to_show='lang')

100%|██████████| 14/14 [00:00<00:00, 22.98it/s]


Experiment with negation of sentences:

In [39]:
sentences= [   "", ""
]

df_first_example = pd.DataFrame(sentences,columns=['sentence'])
#df_first_example['lang'] = languages
df_first_example['embeddings'] = df_first_example['sentence'].progress_apply(model_labse.encode)

pca = PCA(n_components=2)
min_embeddings = (pca.fit_transform(np.array(df_first_example['embeddings'].to_list())))
df_first_example['x'] = min_embeddings[:,0]
df_first_example['y'] = min_embeddings[:,1]

x_axis_range = [df_first_example['x'].min()-1, df_first_example['x'].max()+1]
y_axis_range = [df_first_example['y'].min()-1, df_first_example['y'].max()+1]

fig = px.scatter(df_first_example, y='y', x='x',range_x=x_axis_range, range_y=y_axis_range,  hover_data=['sentence'])
fig.update_traces(marker_size=10)
fig.show()

100%|██████████| 3/3 [00:00<00:00, 20.41it/s]


Load the data and sample a small subset.

In [4]:


df = load_samples()[['cleaned_text','cleaned_text_translated','lang']]
df_sample = df.sample(1000)

Calculate embeddings for multiple models and add them as columns to the sample.

In [118]:
model_multi_min = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
model_multi_max = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
model_mono = SentenceTransformer("all-MiniLM-L6-v2")  # assumed: the English model used earlier in this section

In [None]:
df_sample['embeddings_multi_min'] = df_sample['cleaned_text'].progress_apply(model_multi_min.encode)

In [99]:
df_sample = df.sample(1000)
df_sample['embeddings'] = df_sample['cleaned_text'].progress_apply(lambda x: model.encode(x))
pca = PCA(n_components=2)
pca.fit(df_sample['embeddings'].to_list())

df_sample['min_embeddings'] = df_sample['embeddings'].progress_apply(lambda x: pca.transform(x.reshape(1,-1))[0])
df_sample['x'] = df_sample['min_embeddings'].apply(lambda x: x[0])
df_sample['y'] = df_sample['min_embeddings'].apply(lambda x: x[1])

x_axis_range = [df_sample['x'].min()-1, df_sample['x'].max()+1]
y_axis_range = [df_sample['y'].min()-1, df_sample['y'].max()+1]

fig = px.scatter(df_sample, y='y', x='x',range_x=x_axis_range, range_y=y_axis_range, color="lang", symbol="lang", hover_data=['cleaned_text_translated','lang'])
fig.update_traces(marker_size=10)
fig.show()


In [90]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from src.BertTopic07.bertopic_util import all_stop_words

vectorizer_model = CountVectorizer(stop_words=all_stop_words)
topic_model_multi = BERTopic(verbose=True, vectorizer_model=vectorizer_model)
# No `language` argument: BERTopic falls back to its default English embedding model (all-MiniLM-L6-v2)
trans_topics, probs = topic_model_multi.fit_transform(df_sample['cleaned_text_translated'])

df_sample['topic_bert_trans'] = trans_topics
fig = px.scatter(df_sample[df_sample['topic_bert_trans']!=-1], y='y', x='x',range_x=x_axis_range, range_y=y_axis_range, color="topic_bert_trans", symbol="lang", hover_data=['cleaned_text_translated','lang'])
fig.update_traces(marker_size=10)
fig.show()

2023-12-19 18:07:00,958 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/32 [00:00<?, ?it/s]

2023-12-19 18:07:18,385 - BERTopic - Embedding - Completed ✓
2023-12-19 18:07:18,389 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2023-12-19 18:07:21,552 - BERTopic - Dimensionality - Completed ✓
2023-12-19 18:07:21,553 - BERTopic - Cluster - Start clustering the reduced embeddings
2023-12-19 18:07:21,590 - BERTopic - Cluster - Completed ✓
2023-12-19 18:07:21,594 - BERTopic - Representation - Extracting topics from clusters using representation models.
2023-12-19 18:07:21,663 - BERTopic - Representation - Completed ✓


In [87]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from src.BertTopic07.bertopic_util import all_stop_words

vectorizer_model = CountVectorizer(stop_words=all_stop_words)
topic_model_multi = BERTopic(language="multilingual",verbose=True, vectorizer_model=vectorizer_model)
# paraphrase-multilingual-MiniLM-L12-v2
topics, probs = topic_model_multi.fit_transform(df_sample['cleaned_text'])
# Store the assignments so the plot below can colour by topic
df_sample['topic_bert'] = topics
fig = px.scatter(df_sample[df_sample['topic_bert']!=-1], y='y', x='x',range_x=x_axis_range, range_y=y_axis_range, color="topic_bert", symbol="lang", hover_data=['cleaned_text_translated','lang'])
fig.update_traces(marker_size=10)
fig.show()

2023-12-19 18:02:44,290 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/32 [00:00<?, ?it/s]

2023-12-19 18:03:02,482 - BERTopic - Embedding - Completed ✓
2023-12-19 18:03:02,483 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2023-12-19 18:03:06,153 - BERTopic - Dimensionality - Completed ✓
2023-12-19 18:03:06,153 - BERTopic - Cluster - Start clustering the reduced embeddings
2023-12-19 18:03:06,194 - BERTopic - Cluster - Completed ✓
2023-12-19 18:03:06,208 - BERTopic - Representation - Extracting topics from clusters using representation models.
2023-12-19 18:03:06,337 - BERTopic - Representation - Completed ✓


In [80]:
len(topics)

1000

In [82]:
topics

[-1, 1, 4, 0, 0, -1, 3, 3, -1, -1, 7, 7, 1, -1, 0, 5, ...]

In [83]:
df_sample['topic_bert']

5775    -1
24413    1
464      4
3445     0
980      0
        ..
32920   -1
37957    3
34733    3
38632   -1
22945    5
Name: topic_bert, Length: 1000, dtype: int64
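
In BERTopic's output, topic `-1` collects the documents that were not assigned to any cluster (outliers). A quick way to see how many documents end up there is to count the topic labels. The sketch below uses only the first 16 labels from the output above as sample data; the function name `topic_summary` is illustrative.

```python
from collections import Counter

def topic_summary(topics):
    """Count documents per topic and compute the share of outliers (topic -1)."""
    counts = Counter(topics)
    outlier_share = counts.get(-1, 0) / len(topics)
    return counts, outlier_share

# First 16 topic labels from the output above, used as a small sample.
first_topics = [-1, 1, 4, 0, 0, -1, 3, 3, -1, -1, 7, 7, 1, -1, 0, 5]

counts, outlier_share = topic_summary(first_topics)
print(counts.most_common(3))
print(f"outlier share: {outlier_share:.2%}")
```

On the full sample, `topic_summary(topics)` would give the same summary for all 1000 documents.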

In [55]:
model.encode("hellao my name ist depp").shape

(384,)

In [51]:
sentences = ["Hello, my name is Robin, iam an engineer", "Hallo, mein Name ist Robin, ich bin ein Ingenieur", "I like sausages", "heidegger ist mein Lieblingsphilosph", "Bebí mucho alcohol durante mis estudios de ingeniería", "Alcohol abuse is a severe problem for engineers "]
sentence_embeddings = model.encode(sentences)

print(sentence_embeddings.shape)
                                   
# `euclidean` is imported from scipy.spatial.distance at the top of the notebook
euclidean(sentence_embeddings[0], sentence_embeddings[1])
euclidean(sentence_embeddings[0], sentence_embeddings[2])

pca = PCA(n_components=2)
Xy = pca.fit_transform(sentence_embeddings)


# Plot the projected points directly (no DataFrame needed); hovering shows the sentence
fig = px.scatter(y=Xy[:,0], x=Xy[:,1], range_x=[-4,4], range_y=[-4,6], hover_name=sentences)
fig.update_traces(marker_size=10)
fig.show()

(6, 384)


In [44]:
Xy

array([[-2.0086875 , -2.0956204 ],
       [-1.6436944 , -1.739946  ],
       [ 4.9067483 ,  0.05094978],
       [ 0.67035085, -0.01351317],
       [-1.9247172 ,  3.7981296 ]], dtype=float32)

In [26]:
Xy[:,1]

array([-0.66215736,  0.704546  , -0.04238758], dtype=float32)

In [4]:
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: This framework generates embeddings for each input sentence
Embedding: [-0.03208438 -0.40147376 -0.1480261   0.03269627 -0.05261087  0.23179664
  0.37597921 -0.04392295  0.1135691  -0.12239821  0.2583702   0.10285387
  0.25843033 -0.19025432 -0.04477004 -0.02309765  0.12181108  0.4561691
 -0.37526783 -0.4737594  -0.01327177  0.1702981  -0.01580675  0.17307588
 -0.16003202  0.23108149 -0.21505092 -0.27953395  0.04778303 -0.18116148
  0.1309878   0.12336046  0.4038071   0.19686654 -0.03655093 -0.08485917
 -0.08178296  0.26823217 -0.38140798  0.2301463  -0.39514923  0.10231381
 -0.24662013  0.04554879  0.0606126  -0.20166044 -0.10318054  0.27458987
 -0.0682338  -0.00788017 -0.14816892 -0.174069   -0.26222453  0.09016163
  0.2633843   0.19171038 -0.1251444   0.18008451 -0.25423777 -0.01264309
 -0.14854153  0.08243721 -0.17696884  0.14406182  0.43751812 -0.17437851
  0.5271312   0.13276206 -0.30573004  0.18121798 -0.02419089 -0.11894543
  0.12056329 -0.21464764 -0.12837155  0.1419

In [17]:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1,2], [-2, -1, 3], [-3, -2, -100], [1, 1,100], [2, 1, 0 ], [3, 2, 3]])
pca = PCA(n_components=2)
pca.fit_transform(X)
# print(pca.explained_variance_ratio_)
# print(pca.singular_values_)

array([[ -0.63134124,  -1.3897026 ],
       [ -1.61096531,  -2.26960619],
       [101.39174711,  -1.07187936],
       [-98.67074233,  -1.08957061],
       [  1.27773684,   2.26128648],
       [ -1.75643507,   3.55947228]])

In [74]:
df_1 = pd.read_excel("samples_cleaned.xlsx")
df_test = df_1[~df_1['topics'].isna()]
df_test = df_test[df_test['topics'].isin(['sanctions', 'unjustified_war', 'arms_delivery', 'people_killed'])]

In [75]:
df_test['topics'].value_counts()

topics
sanctions          25
unjustified_war    25
people_killed      25
arms_delivery      25
Name: count, dtype: int64

In [64]:
df_test.to_excel("samples_min.xlsx")

In [76]:
df_test['embeddings'] = df_test['cleaned_text'].progress_apply(model_labse.encode)

100%|██████████| 100/100 [00:04<00:00, 22.65it/s]


In [7]:
from sklearn.decomposition import KernelPCA
from src.TextEmbeddings08.text_embeddings import visualise_with_dimensional_reduction
visualise_with_dimensional_reduction(df_test,KernelPCA(n_components=2, kernel='cosine'), color_to_show='topics', hover_data=['topics', 'cleaned_text_translated'])

In [77]:
visualise_with_dimensional_reduction(df_test,PCA(n_components=2), color_to_show='topics', hover_data=['topics', 'cleaned_text_translated','lang'])

In [73]:
from src.TextEmbeddings08.text_embeddings import visualise_with_mvu_reduction, visualise_with_dimensional_reduction

visualise_with_mvu_reduction(df_test, k=40, color_to_show='topics', hover_data=['topics', 'cleaned_text_translated'])

In [86]:
import umap
fit = umap.UMAP(n_neighbors=30)

visualise_with_dimensional_reduction(df_test, fit, color_to_show='topics', hover_data=['topics', 'cleaned_text_translated','lang'])

In [43]:
np.array(df_test['embeddings']).shape
# Use a separate name so the SentenceTransformer stored in `model` is not overwritten
mvu = MaximumVarianceUnfolding()
mvu.fit_transform(np.array(df_test['embeddings'].to_list()), dim=2 ,k =102)

array([[ 0.11348653,  0.31540457],
       [ 0.10618962,  0.11357969],
       [-0.46867108, -0.09844133],
       [ 0.31270499, -0.02694024],
       [-0.22565404, -0.02565111],
       [-0.34339093, -0.03472691],
       [-0.11151965, -0.02645376],
       [-0.04863019,  0.20103631],
       [ 0.01398819, -0.09074035],
       [-0.27089082, -0.09890133],
       [-0.28758027, -0.30568813],
       [ 0.39393763,  0.04322562],
       [-0.17169702, -0.16977389],
       [-0.2299083 ,  0.04124734],
       [-0.0210372 ,  0.0872392 ],
       [-0.14914629,  0.08164228],
       [ 0.07885624, -0.20731077],
       [-0.10322653, -0.224598  ],
       [-0.22730167,  0.08011402],
       [-0.33804922,  0.02716399],
       [ 0.18779302, -0.18075013],
       [-0.17374801, -0.06481774],
       [-0.29042386,  0.28069941],
       [ 0.34539   , -0.00819363],
       [ 0.10919586,  0.25163435],
       [-0.11858924, -0.07172635],
       [ 0.08658769,  0.1994967 ],
       [-0.16825512,  0.0212456 ],
       [-0.01784002,

In [1]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/LaBSE')
embeddings = model.encode(sentences)
