![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/component_examples/sentence_embeddings/NLU_INSTRUCTOR_sentence_embeddings.ipynb)

# INSTRUCTOR Sentence Embeddings with NLU

Instructor👨‍🏫 is an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation) and domain (e.g., science, finance) simply by providing a task instruction, without any finetuning. Instructor👨‍🏫 achieves state-of-the-art (SOTA) results on 70 diverse embedding tasks.


## Sources
- https://arxiv.org/abs/2212.09741
- https://github.com/xlang-ai/instructor-embedding

## Paper abstract

We introduce INSTRUCTOR, a new method for computing text embeddings given task instructions: every text input is embedded together with instructions explaining the use case (e.g., task and domain descriptions). Unlike encoders from prior work that are more specialized, INSTRUCTOR is a single embedder that can generate text embeddings tailored to different downstream tasks and domains, without any further training. We first annotate instructions for 330 diverse tasks and train INSTRUCTOR on this multitask mixture with a contrastive loss. We evaluate INSTRUCTOR on 70 embedding evaluation tasks (64 of which are unseen during training), ranging from classification and information retrieval to semantic textual similarity and text generation evaluation. INSTRUCTOR, while having an order of magnitude fewer parameters than the previous best model, achieves state-of-the-art performance, with an average improvement of 3.4% compared to the previous best results on the 70 diverse datasets. Our analysis suggests that INSTRUCTOR is robust to changes in instructions, and that instruction finetuning mitigates the challenge of training a single model on diverse datasets.
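
The abstract describes instructions as short task and domain descriptions. As a minimal sketch, the helper below composes such a string in the same shape as the instructions passed to `setInstruction` later in this notebook ("Represent the Amazon review for retrieval: "). The function name `build_instruction` and its parameters are hypothetical and not part of NLU or Spark NLP; the trailing space simply mirrors the strings used in this notebook.

```python
def build_instruction(domain: str, text_type: str, task: str) -> str:
    """Compose an instruction string (hypothetical helper, not an NLU API).

    The pattern "Represent the <domain> <text type> for <task>: " is copied
    from the instruction strings used in this notebook.
    """
    # Skip empty pieces so that an empty domain does not leave a double space.
    parts = ["Represent the", domain.strip(), text_type.strip()]
    head = " ".join(p for p in parts if p)
    return f"{head} for {task.strip()}: "


amazon_instruction = build_instruction("Amazon", "review", "retrieval")
wiki_instruction = build_instruction("Wikipedia", "document", "retrieval")
print(repr(amazon_instruction))
print(repr(wiki_instruction))
```

The resulting string could then be passed to `setInstruction` on the embedder component, as done in the cells below.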


**All the available models:**

| Language | nlu.load() reference            | Spark NLP Model reference                                                                     |
|----------|---------------------------------|-----------------------------------------------------------------------------------------------|
| English  | en.embed_sentence.instructor_base | [instructor_base](https://sparknlp.org/2023/06/08/instructor_base_en.html) |
| English  | en.embed_sentence.instructor_large       | [instructor_large](https://sparknlp.org/2023/06/21/instructor_large_en.html)             |


# 1. Install NLU

In [None]:
!pip install nlu pyspark==3.1.2

# 2. Load Model and embed sample sentence with INSTRUCTOR_BASE Sentence Embedder

## 2.1 Sentence Level Output

In [2]:
import nlu
model = nlu.load('en.embed_sentence.instructor_base')

instructor_base download started this may take some time.
Approximate size to download 387.7 MB
[OK!]


In [3]:
model

{'instructor_sentence_embeddings@INSTRUCTOR_EMBEDDINGS_1c5e51202650': INSTRUCTOR_EMBEDDINGS_1c5e51202650,
 'document_assembler': DocumentAssembler_452bb2c19dca}

In [4]:
model['instructor_sentence_embeddings@INSTRUCTOR_EMBEDDINGS_1c5e51202650'].setInstruction("Represent the Amazon review for retrieval: ")

INSTRUCTOR_EMBEDDINGS_1c5e51202650
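
The component key above embeds a model-specific ID (`INSTRUCTOR_EMBEDDINGS_1c5e51202650`), which is awkward to type by hand. The sketch below looks the key up by its prefix instead. It assumes the loaded pipeline can be iterated like a dictionary of component names, as its printed form suggests; this has not been verified against NLU itself, so the demo runs on a plain `dict` stand-in. `find_component_key` is a hypothetical helper.

```python
from typing import Mapping


def find_component_key(pipeline: Mapping[str, object], prefix: str) -> str:
    """Return the single key that starts with `prefix` (hypothetical helper)."""
    matches = [key for key in pipeline if key.startswith(prefix)]
    if len(matches) != 1:
        raise KeyError(
            f"expected exactly one key starting with {prefix!r}, found {matches}"
        )
    return matches[0]


# Plain-dict stand-in for the pipeline printed above (assumption: the real
# pipeline object exposes the same keys when iterated).
demo_pipeline = {
    "instructor_sentence_embeddings@INSTRUCTOR_EMBEDDINGS_1c5e51202650": "embedder",
    "document_assembler": "assembler",
}

embedder_key = find_component_key(demo_pipeline, "instructor_sentence_embeddings")
print(embedder_key)
```

If the assumption holds, `model[find_component_key(model, 'instructor_sentence_embeddings')]` would return the same component as the hard-coded key.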

In [5]:
model.predict("Having observed how techology has spawned new enterprises, I find that Anderson puts it all together in a meaningful and understandable tome.  He has found the common thread that will define success and failure in the future.",
              output_level='sentence')

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


|   | sentence | sentence_embedding_instructor_base |
|---|----------|------------------------------------|
| 0 | Having observed how techology has spawned new ... | [-0.031238630414009094, -0.019899437204003334,... |
| 0 | He has found the common thread that will defin... | [-0.015970703214406967, -0.017222288995981216,... |
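
Each row of the output holds one embedding vector per sentence. A common way to compare two such vectors is cosine similarity; the sketch below implements it with the standard library only. The short vectors are toy values, not real INSTRUCTOR output, and the assumption that each cell of `sentence_embedding_instructor_base` can be treated as a list of floats is inferred from the printed output rather than confirmed.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for two sentence embeddings.
first_sentence = [1.0, 0.0, 1.0]
second_sentence = [1.0, 1.0, 0.0]

score = cosine_similarity(first_sentence, second_sentence)
print(round(score, 3))
```

With the real output, the two arguments would be the embedding cells of the two sentence rows.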


## 2.2 Document Level Output

In [6]:
import nlu
model = nlu.load('en.embed_sentence.instructor_base')

instructor_base download started this may take some time.
Approximate size to download 387.7 MB
[OK!]


In [7]:
model['instructor_sentence_embeddings@INSTRUCTOR_EMBEDDINGS_1c5e51202650'].setInstruction("Represent the Amazon review for retrieval: ")

INSTRUCTOR_EMBEDDINGS_1c5e51202650

In [8]:
model.predict("Having observed how techology has spawned new enterprises, I find that Anderson puts it all together in a meaningful and understandable tome.  He has found the common thread that will define success and failure in the future.",
              output_level='document')



|   | document | sentence_embedding_instructor_base |
|---|----------|------------------------------------|
| 0 | Having observed how techology has spawned new ... | [-0.026530979201197624, -0.019795654341578484,... |


# 3. Load Model and embed sample sentence with INSTRUCTOR_LARGE Sentence Embedder

## 3.1 Sentence Level Output

In [9]:
import nlu
model = nlu.load('en.embed_sentence.instructor_large')

instructor_large download started this may take some time.
Approximate size to download 1.2 GB
[OK!]


In [10]:
model

{'instructor_sentence_embeddings@INSTRUCTOR_EMBEDDINGS_46e0451abc97': INSTRUCTOR_EMBEDDINGS_46e0451abc97,
 'document_assembler': DocumentAssembler_0ae696d5c70c}

In [11]:
model['instructor_sentence_embeddings@INSTRUCTOR_EMBEDDINGS_46e0451abc97'].setInstruction("Represent the Wikipedia document for retrieval: ")

INSTRUCTOR_EMBEDDINGS_46e0451abc97

In [12]:
model.predict("""Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that the term "mixed economies" more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.""",
              output_level='sentence')

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


|   | sentence | sentence_embedding_instructor_large |
|---|----------|-------------------------------------|
| 0 | Capitalism has been dominant in the Western wo... | [-0.017836451530456543, 0.017106134444475174, ... |
| 0 | In capitalism, prices determine the demand-sup... | [-0.014247411862015724, 0.041706837713718414, ... |
| 0 | For example, higher demand for certain goods a... | [-0.007085954304784536, 0.03219347819685936, -... |


## 3.2 Document Level Output

In [13]:
import nlu
model = nlu.load('en.embed_sentence.instructor_large')

instructor_large download started this may take some time.
Approximate size to download 1.2 GB
[OK!]


In [14]:
model['instructor_sentence_embeddings@INSTRUCTOR_EMBEDDINGS_46e0451abc97'].setInstruction("Represent the Wikipedia document for retrieval: ")

INSTRUCTOR_EMBEDDINGS_46e0451abc97

In [15]:
model.predict("""Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that the term "mixed economies" more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.""",
              output_level='document')



|   | document | sentence_embedding_instructor_large |
|---|----------|-------------------------------------|
| 0 | Capitalism has been dominant in the Western wo... | [-0.011226557195186615, 0.03578060120344162, -... |
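
The instruction used here targets retrieval, where document embeddings are typically ranked against a query embedding. The following is a minimal sketch of that ranking step using cosine similarity and the standard library only. All vectors and names (`rank_by_similarity`, `doc_a`, and so on) are hypothetical toy values; in practice the vectors would come from the `sentence_embedding_instructor_large` column, and this is not presented as how NLU or Spark NLP performs retrieval.

```python
import math
from typing import Dict, List, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank_by_similarity(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, _cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional embeddings (not real model output).
query_vector = [1.0, 0.0]
document_vectors = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.0, 1.0],
    "doc_c": [-1.0, 0.0],
}

ranking = rank_by_similarity(query_vector, document_vectors)
for name, score in ranking:
    print(name, round(score, 3))
```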


# 4. NLU has many more sentence embedding models!

Make sure to try them all out!
You can replace electra in nlu.load('embed_sentence.electra') with bert, xlnet, albert, or any of the other 20+ sentence embeddings offered by NLU.

In [None]:
nlu.print_all_model_kinds_for_action('embed_sentence')

For language <am> NLU provides the following Models : 
nlu.load('am.embed_sentence.xlm_roberta') returns Spark NLP model_anno_obj sent_xlm_roberta_base_finetuned_amharic
For language <de> NLU provides the following Models : 
nlu.load('de.embed_sentence.bert.base_cased') returns Spark NLP model_anno_obj sent_bert_base_cased
For language <el> NLU provides the following Models : 
nlu.load('el.embed_sentence.bert.base_uncased') returns Spark NLP model_anno_obj sent_bert_base_uncased
For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model_anno_obj tfhub_use
nlu.load('en.embed_sentence.albert') returns Spark NLP model_anno_obj albert_base_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model_anno_obj sent_bert_base_uncased
nlu.load('en.embed_sentence.bert.base_uncased_legal') returns Spark NLP model_anno_obj sent_bert_base_uncased_legal
nlu.load('en.embed_sentence.bert.finetuned') returns Spark NLP model_anno_obj sbert_setfit_