![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/component_examples/sentence_embeddings/NLU_E5_sentence_embeddings.ipynb)

# E5 Sentence Embeddings with NLU

E5 is a family of text embedding models trained contrastively on weakly supervised text pairs. It produces general-purpose single-vector embeddings for tasks such as classification, retrieval, and clustering.


## Sources
- https://arxiv.org/pdf/2212.03533.pdf
- https://github.com/microsoft/unilm/tree/master/e5

## Paper abstract

This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40× more parameters.


**All the available models:**

| Language | nlu.load() reference | Spark NLP Model reference |
|----------|----------------------|---------------------------|
| English  | en.embed_sentence.e5_small | [e5_small](https://sparknlp.org/2023/08/25/e5_small_en.html) |
| English  | en.embed_sentence.e5_small_opt | [e5_small_opt](https://sparknlp.org/2023/08/25/e5_small_opt_en.html) |
| English  | en.embed_sentence.e5_small_quantized | [e5_small_quantized](https://sparknlp.org/2023/08/25/e5_small_quantized_en.html) |
| English  | en.embed_sentence.e5_small_v2 | [e5_small_v2](https://sparknlp.org/2023/08/25/e5_small_v2_en.html) |
| English  | en.embed_sentence.e5_small_v2_opt | [e5_small_v2_opt](https://sparknlp.org/2023/08/25/e5_small_v2_opt_en.html) |
| English  | en.embed_sentence.e5_small_v2_quantized | [e5_small_v2_quantized](https://sparknlp.org/2023/08/25/e5_small_v2_quantized_en.html) |
| English  | en.embed_sentence.e5_base | [e5_base](https://sparknlp.org/2023/08/25/e5_base_en.html) |
| English  | en.embed_sentence.e5_base_opt | [e5_base_opt](https://sparknlp.org/2023/08/25/e5_base_opt_en.html) |
| English  | en.embed_sentence.e5_base_quantized | [e5_base_quantized](https://sparknlp.org/2023/08/25/e5_base_quantized_en.html) |
| English  | en.embed_sentence.e5_base_v2 | [e5_base_v2](https://sparknlp.org/2023/08/25/e5_base_v2_en.html) |
| English  | en.embed_sentence.e5_base_v2_opt | [e5_base_v2_opt](https://sparknlp.org/2023/08/25/e5_base_v2_opt_en.html) |
| English  | en.embed_sentence.e5_base_v2_quantized | [e5_base_v2_quantized](https://sparknlp.org/2023/08/25/e5_base_v2_quantized_en.html) |
| English  | en.embed_sentence.e5_large | [e5_large](https://sparknlp.org/2023/06/21/e5_large_en.html) |
| English  | en.embed_sentence.e5_large_v2 | [e5_large_v2](https://sparknlp.org/2023/08/25/e5_large_v2_en.html) |
| English  | en.embed_sentence.e5_large_v2_opt | [e5_large_v2_opt](https://sparknlp.org/2023/08/25/e5_large_v2_opt_en.html) |
| English  | en.embed_sentence.e5_large_v2_quantized | [e5_large_v2_quantized](https://sparknlp.org/2023/08/25/e5_large_v2_quantized_en.html) |




# 1. Install NLU

In [None]:
!pip install nlu pyspark==3.1.2

# 2. Load Model and embed sample sentence

### en.embed_sentence.e5_small

In [2]:
import nlu
res = nlu.load('en.embed_sentence.e5_small').predict('query: how much protein should a female eat',
               output_level='document') # output_level must be set to 'document'

e5_small download started this may take some time.
Approximate size to download 76.2 MB
[OK!]


In [3]:
res

Unnamed: 0,document,word_embedding_e5_small
0,query: how much protein should a female eat,"[0.15774869918823242, -0.08336979150772095, -0..."
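
The input above begins with `query: `, and the next example begins with `passage: `. E5 models are generally used with one of these two prefixes on every input. The helper below is a minimal sketch of one way to add the prefix consistently before calling `predict`. The name `with_e5_prefix` and its `role` parameter are hypothetical and are not part of NLU.

```python
def with_e5_prefix(text: str, role: str = "query") -> str:
    """Return `text` with a 'query: ' or 'passage: ' prefix (hypothetical helper)."""
    if role not in ("query", "passage"):
        raise ValueError("role must be 'query' or 'passage'")
    prefix = f"{role}: "
    # Leave the text untouched if it already carries the requested prefix.
    return text if text.startswith(prefix) else prefix + text


question = with_e5_prefix("how much protein should a female eat")
print(question)
```

The returned string can be passed to `predict` in place of the hand-written literal used above.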


### en.embed_sentence.e5_base

In [4]:
res = nlu.load('en.embed_sentence.e5_base').predict("passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments.",
               output_level='document') # output_level must be set to 'document'


e5_base download started this may take some time.
Approximate size to download 246.6 MB
[OK!]


In [5]:
res

Unnamed: 0,document,word_embedding_e5_base
0,passage: Definition of summit for English Lang...,"[-1.3526510000228882, -0.16715717315673828, 0...."
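
Once a query and a passage have each been embedded, a common next step is to compare the two vectors with cosine similarity. The sketch below uses short made-up vectors rather than real E5 output, so the printed score only illustrates the computation. In practice the vectors would come from an embedding column such as `word_embedding_e5_small` in the result shown earlier, and both should come from the same model so that their dimensions match. The function name `cosine_similarity` is an illustrative choice, not an NLU API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for real embeddings.
query_vec = [0.1, 0.3, -0.2]
passage_vec = [0.2, 0.1, -0.4]

score = cosine_similarity(query_vec, passage_vec)
print(f"similarity: {score:.4f}")
```

Scores closer to 1 indicate vectors pointing in more similar directions, which is how a passage would be ranked against a query in a retrieval setting.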


# 3. NLU has many more sentence embedding models!

Make sure to try them all out!
You can replace the reference passed to nlu.load(), for example 'en.embed_sentence.e5_small', with 'en.embed_sentence.bert', 'en.embed_sentence.albert', or any other of the 20+ sentence embeddings offered by NLU.

In [None]:
nlu.print_all_model_kinds_for_action('embed_sentence')

For language <am> NLU provides the following Models : 
nlu.load('am.embed_sentence.xlm_roberta') returns Spark NLP model_anno_obj sent_xlm_roberta_base_finetuned_amharic
For language <de> NLU provides the following Models : 
nlu.load('de.embed_sentence.bert.base_cased') returns Spark NLP model_anno_obj sent_bert_base_cased
For language <el> NLU provides the following Models : 
nlu.load('el.embed_sentence.bert.base_uncased') returns Spark NLP model_anno_obj sent_bert_base_uncased
For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model_anno_obj tfhub_use
nlu.load('en.embed_sentence.albert') returns Spark NLP model_anno_obj albert_base_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model_anno_obj sent_bert_base_uncased
nlu.load('en.embed_sentence.bert.base_uncased_legal') returns Spark NLP model_anno_obj sent_bert_base_uncased_legal
nlu.load('en.embed_sentence.bert.finetuned') returns Spark NLP model_anno_obj sbert_setfit_