![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/sentence-embeddings/E5Embeddings.ipynb)

## Colab Setup

In [None]:
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

In [None]:
import sparknlp

from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import pandas as pd

# Comment out the next line and uncomment the one after it to enable GPU mode and High RAM
spark = sparknlp.start()

# spark = sparknlp.start(gpu=True)

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

# Download the E5Embeddings Model and Create a Spark NLP Pipeline

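Let's create a Spark NLP pipeline with the following stages: a DocumentAssembler, which turns the raw text column into documents, and E5Embeddings, which computes one embedding per document.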

In [None]:
document_assembler = DocumentAssembler() \
            .setInputCol("text") \
            .setOutputCol("documents")

e5_embeddings = E5Embeddings.pretrained(name='e5_small', lang='en') \
            .setInputCols(["documents"]) \
            .setOutputCol("e5")

# Build the pipeline with E5Embeddings
pipe_components = [document_assembler, e5_embeddings]
pipeline = Pipeline().setStages(pipe_components)

Let's create a DataFrame with some queries and passages to use as input for the pipeline.

In [6]:
data = spark.createDataFrame([
            [1, "query: how much protein should a female eat"],
            [2, "query: summit define"],
            [3, "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 "
                "is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're "
                "expecting or training for a marathon. Check out the chart below to see how much protein you should "
                "be eating each day."],
            [4, "passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain :"
                " the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the "
                "leaders of two or more governments."]
        ]).toDF("id", "text")
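Each input text above starts with either "query: " or "passage: ", which is the prefix convention the E5 models were trained with. As a minimal sketch, the rows could be built with a small helper that adds the prefix. The name `add_e5_prefix` is hypothetical and not part of Spark NLP.

```python
def add_e5_prefix(text, role):
    """Prepend an E5 role prefix ("query" or "passage") to a text.

    Hypothetical helper for illustration; not a Spark NLP function.
    """
    if role not in ("query", "passage"):
        raise ValueError("role must be 'query' or 'passage'")
    return f"{role}: {text.strip()}"


# Build [id, text] rows in the same shape as the DataFrame input above.
rows = [
    [1, add_e5_prefix("how much protein should a female eat", "query")],
    [2, add_e5_prefix("summit define", "query")],
]
print(rows[0][1])
```

The resulting `rows` list could then be passed to `spark.createDataFrame(rows).toDF("id", "text")` in place of hand-written strings.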

Run the pipeline and get the embeddings.

In [7]:
results = pipeline.fit(data).transform(data)
results.select("e5.embeddings").show(truncate=False)

Collect the results and save them to a NumPy array.

In [16]:
import numpy as np

# Collect the first (and only) embedding of each row into a NumPy array
embeddings = np.array([each[0][0] for each in results.select("e5.embeddings").collect()])

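Investigate the cosine similarity between the queries and the passages. In the resulting matrix, each row is a query and each column is a passage, with scores scaled by 100.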

In [18]:
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())

[[93.25909366721945, 74.10933523842462], [75.4203130378152, 92.58708611118642]]
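A plain dot product equals the cosine similarity only when the vectors have unit length. The sketch below makes the normalization explicit and picks the best-scoring passage for each query. It uses toy 2-dimensional vectors rather than real E5 embeddings, and the function name `cosine_scores` is an assumption introduced here for illustration.

```python
import numpy as np


def cosine_scores(queries, passages):
    """Return an (n_queries, n_passages) matrix of cosine similarities."""
    q = np.asarray(queries, dtype=float)
    p = np.asarray(passages, dtype=float)
    # Scale every row to unit length so the dot product is a cosine.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    return q @ p.T


# Toy vectors standing in for real embeddings (illustrative values only).
toy_queries = [[1.0, 0.0], [0.0, 2.0]]
toy_passages = [[3.0, 1.0], [1.0, 3.0]]

toy_scores = cosine_scores(toy_queries, toy_passages)
best_passage = toy_scores.argmax(axis=1)  # index of the top passage per query
print(best_passage.tolist())
```

With the notebook's array, `cosine_scores(embeddings[:2], embeddings[2:]) * 100` should give values comparable to the scores printed above, assuming the model's output vectors are already close to unit length.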
