![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/component_examples/sentence_embeddings/NLU_E5_sentence_embeddings.ipynb)

# BGE Sentence Embeddings with NLU

BGE, or BAAI General Embeddings, is a family of models that map any text to a low-dimensional dense
vector, which can be used for tasks such as retrieval, classification, clustering, or semantic search. The vectors can also be stored in a vector database for use with LLMs.

## Sources
- https://arxiv.org/pdf/2309.07597.pdf
- https://github.com/FlagOpen/FlagEmbedding

## Paper abstract

This paper introduces C-Pack, a package of resources that significantly advance the field of general
Chinese embeddings. C-Pack includes three critical resources.
1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets.
2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora
for training embedding models.
3) C-TEM is a family of embedding models covering multiple sizes.
Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the
time of the release. We also integrate and optimize the entire suite of training methods for
C-TEM. Along with our resources on general Chinese embedding, we release our data and models for
English text embeddings. The English models achieve state-of-the-art performance on the MTEB
benchmark; meanwhile, our released English data is 2 times larger than the Chinese data.


**All the available models:**

| Language | nlu.load() reference        | Spark NLP Model reference                                       |
|----------|-----------------------------|-----------------------------------------------------------------|
| English  | en.embed_sentence.bge_small | [bge_small](https://sparknlp.org/2024/01/01/bge_small_en.html)  |
| English  | en.embed_sentence.bge_base  | [bge_base](https://sparknlp.org/2024/01/01/bge_base_en.html)    |
| English  | en.embed_sentence.bge_large | [bge_large](https://sparknlp.org/2024/01/01/bge_large_en.html)  |
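
The three references in the table differ only in the model size, so a notebook that switches between them can keep the strings in one place. The following is a minimal sketch; `BGE_REFERENCES` and `bge_reference` are hypothetical names introduced here for illustration and are not part of NLU.

```python
# References copied from the table above.
BGE_REFERENCES = {
    "small": "en.embed_sentence.bge_small",
    "base": "en.embed_sentence.bge_base",
    "large": "en.embed_sentence.bge_large",
}


def bge_reference(size):
    """Hypothetical helper: return the nlu.load() reference for a BGE model size."""
    try:
        return BGE_REFERENCES[size]
    except KeyError:
        expected = sorted(BGE_REFERENCES)
        raise ValueError(f"unknown BGE size {size!r}; expected one of {expected}") from None


print(bge_reference("base"))

# With NLU installed, the returned string would be passed to nlu.load(), e.g.:
# embeddings = nlu.load(bge_reference("base")).predict("some text")
```

Larger models take longer to download and run, so it can be convenient to develop against `small` and switch the size in a single place later.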





# 1. Install NLU

In [None]:
!pip install nlu pyspark==3.4.1

# 2. Load Model and embed sample sentence

### en.embed_sentence.bge_small

In [2]:
import nlu

res = nlu.load("en.embed_sentence.bge_small").predict('query: how much protein should a female eat')

bge_small download started this may take some time.
Approximate size to download 76.1 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


In [4]:
res

Unnamed: 0,sentence,sentence_embedding_bge_small
0,query: how much protein should a female eat,"[-0.059140872210264206, -0.013027993030846119,..."


### en.embed_sentence.bge_base

In [5]:
res = nlu.load('en.embed_sentence.bge_base').predict("passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
               output_level='document')  # Set output_level to 'document' to get one embedding for the whole document instead of one per sentence.

bge_base download started this may take some time.
Approximate size to download 246.7 MB
[OK!]


In [6]:
res

Unnamed: 0,document,sentence_embedding_bge_base
0,"passage: As a general guideline, the CDC's ave...","[0.006804925389587879, -0.006068557035177946, ..."
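
For retrieval or semantic search, embeddings such as the ones above are usually compared with cosine similarity. The sketch below shows the idea in plain Python. The vectors are toy 3-dimensional stand-ins, not real BGE outputs, and `cosine_similarity` and `rank_passages` are hypothetical helper names introduced for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for embeddings (assumption: real vectors would come from
# the sentence_embedding_* column of the predict() result).
query_vec = [1.0, 0.0, 1.0]
passage_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_passages(query_vec, passage_vecs)
print([index for index, _ in ranking])
```

Only vectors produced by the same model should be compared this way. In the cells above the query was embedded with `bge_small` and the passage with `bge_base`, so a real comparison would need both texts embedded with one model.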


# 3. NLU has many more sentence embedding models!

Make sure to try them all out!
You can change 'embed_sentence.electra' in nlu.load('embed_sentence.electra') to bert, xlnet, albert, or any other of the 20+ sentence embeddings offered by NLU.

In [7]:
nlu.print_all_model_kinds_for_action('embed_sentence')

For language <am> NLU provides the following Models : 
nlu.load('am.embed_sentence.xlm_roberta') returns Spark NLP model_anno_obj sent_xlm_roberta_base_finetuned_amharic
For language <de> NLU provides the following Models : 
nlu.load('de.embed_sentence.bert.base_cased') returns Spark NLP model_anno_obj sent_bert_base_cased
For language <el> NLU provides the following Models : 
nlu.load('el.embed_sentence.bert.base_uncased') returns Spark NLP model_anno_obj sent_bert_base_uncased
For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model_anno_obj tfhub_use
nlu.load('en.embed_sentence.albert') returns Spark NLP model_anno_obj albert_base_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model_anno_obj sent_bert_base_uncased
nlu.load('en.embed_sentence.bert.base_uncased_legal') returns Spark NLP model_anno_obj sent_bert_base_uncased_legal
nlu.load('en.embed_sentence.bert.finetuned') returns Spark NLP model_anno_obj sbert_setfit_