![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/annotation/text/english/sentence-embeddings/UAEEmbeddings.ipynb)

# Sentence embeddings using Universal AnglE Embedding (UAE)

UAE is a novel angle-optimized text embedding model, designed to improve semantic textual
similarity tasks, which are crucial for Large Language Model (LLM) applications. By
introducing angle optimization in a complex space, AnglE effectively mitigates saturation of
the cosine similarity function.
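
To make the saturation point concrete: the derivative of cos θ with respect to θ is −sin θ, which shrinks towards zero as θ approaches 0 (or π), so an objective built directly on cosine similarity receives very little gradient signal for pairs that are already nearly aligned. The following is a minimal sketch in plain Python, not part of the UAE model or Spark NLP; the function name `cosine_gradient_magnitude` is made up for illustration.

```python
import math

def cosine_gradient_magnitude(theta: float) -> float:
    """Return |d/dtheta cos(theta)| = |sin(theta)| for an angle in radians."""
    return abs(math.sin(theta))

# Angles moving towards 0, i.e. cosine similarity approaching 1.
for theta in (math.pi / 2, 0.5, 0.1, 0.01):
    cos_sim = math.cos(theta)
    grad = cosine_gradient_magnitude(theta)
    print(f"theta={theta:.4f}  cos={cos_sim:.6f}  |grad|={grad:.6f}")
```

The printed gradient magnitude falls as the cosine nears 1, which is the flat region that the description above refers to as saturation.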

# Colab Setup

In [1]:
!pip install -q spark-nlp==5.3.3 pyspark==3.5.0


In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import pandas as pd

# To run on a GPU, start the session with: sparknlp.start(gpu=True)
spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

Spark NLP version 5.3.3
Apache Spark version: 3.5.0


# Download UAEEmbeddings Model and Create Spark NLP Pipeline
Let's create a Spark NLP pipeline with the following stages: `DocumentAssembler`, `UAEEmbeddings`, and `EmbeddingsFinisher`.

In [6]:
documentAssembler = DocumentAssembler() \
     .setInputCol("text") \
     .setOutputCol("document")

embeddings = UAEEmbeddings.pretrained() \
     .setInputCols(["document"]) \
     .setOutputCol("embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
     .setInputCols("embeddings") \
     .setOutputCols("finished_embeddings") \
     .setOutputAsVector(True)

pipeline = Pipeline().setStages([
     documentAssembler,
     embeddings,
     embeddingsFinisher
])

uae_large_v1 download started this may take some time.
Approximate size to download 1.2 GB
[OK!]

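The finisher stage in this pipeline emits one vector per document, and sentence vectors of this kind are usually compared with cosine similarity. The following is a minimal sketch in plain Python; the helper `cosine_similarity` is hypothetical (it is not a Spark NLP function), and the toy vectors stand in for real embeddings.

```python
import math
from typing import Sequence

def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)

# Toy vectors standing in for real embeddings (illustrative values only).
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal directions
```

After a DataFrame has been transformed by the pipeline, one possible approach is to collect the `finished_embeddings` values and convert each Spark ML vector with `toArray()` before passing it to a helper like this one.
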

Let's create a DataFrame with a couple of short example texts to use as input for the pipeline.

In [8]:
data = spark.createDataFrame([["hello world"], ["hello moon"]]).toDF("text")
data.show()

+-----------+
|       text|
+-----------+
|hello world|
| hello moon|
+-----------+



Run the pipeline and get the embeddings.

In [10]:
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(1, truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------