# Embeddings LlamaIndex-OpenAI: LlamaIndex Intro. Tutorial
Alejandro Ricciardi (Omegapy)  
created date: 12/30/2023 
GitHub: https://github.com/Omegapy

Project Description:
Embeddings are numerical representations of text. To generate embeddings for text, a specific model is required.

In LlamaIndex, the default embedding model is text-embedding-ada-002 from OpenAI. You can also use any embedding model offered by LangChain and Hugging Face through LlamaIndex's LangchainEmbedding wrapper.

In this notebook, we cover the low-level usage for both OpenAI embeddings and HuggingFace embeddings.

- Initialization 
    - API Keys
    - LLM Init.
    - Load File

- Templates
    - Context
    - Refined Context - More Context

- Chat
    - Simulate a ChatBot that can answer questions about llama-index.

credit: LlamaIndex https://www.youtube.com/watch?v=mIyZ_9gqakE

### API Keys
This project requires an API key from OpenAI: https://openai.com/

In [168]:
# Load environment variables API Keys

from dotenv import load_dotenv,find_dotenv
load_dotenv(find_dotenv()) 

True
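
The `True` above only indicates that `load_dotenv` loaded something, not that the specific key is present. A small check like the one below can fail early with a clear message. This is a sketch: the helper name `require_env` is hypothetical, and `OPENAI_API_KEY` is the variable name the OpenAI client reads by default.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


# Usage, after load_dotenv() has run:
# openai_key = require_env("OPENAI_API_KEY")
```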

## OpenAI Embeddings

#### Example of embeddings using an OpenAI embedding model
By default, LlamaIndex uses the OpenAI model "text-embedding-ada-002".

If using the OpenAI package directly:
```
import openai
# Example
response = openai.Embedding.create(
  input="porcine pals say",
  model="text-embedding-ada-002"
)
```
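
Note that `openai.Embedding.create` is the interface of pre-1.0 releases of the `openai` package. From version 1.0 onward the same request goes through a client object. The sketch below assumes that newer interface. The helper name `embed_text` is hypothetical, and the client is passed in as an argument so the function makes no assumption about how credentials are configured.

```python
from typing import Any, List


def embed_text(
    client: Any, text: str, model: str = "text-embedding-ada-002"
) -> List[float]:
    """Request one embedding through an openai>=1.0 style client object."""
    response = client.embeddings.create(input=text, model=model)
    # The response holds one entry per input; a single string was sent
    return response.data[0].embedding
```

With the real package this would be called as `embed_text(OpenAI(), "porcine pals say")` after `from openai import OpenAI`. The `OpenAI()` client reads `OPENAI_API_KEY` from the environment.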

Different embedding models can be used.
The Massive Text Embedding Benchmark (MTEB) Leaderboard at https://huggingface.co/spaces/mteb/leaderboard compares them.

As of 12/30/2023, the OpenAI text-embedding-ada-002 model ranks 23rd.
Note: an important benchmark column is 'Sequence Length' (the number of tokens that the transformer processes together).

In [169]:
from llama_index.embeddings import OpenAIEmbedding
openai_embedding = OpenAIEmbedding()
embed = openai_embedding.get_text_embedding("hello world!")
print(len(embed))
print(embed[:10])

1536
[-0.007677523884922266, -0.005429570563137531, -0.015862544998526573, -0.033494822680950165, -0.016825487837195396, -0.0031930040568113327, -0.015498187392950058, -0.0021015594247728586, -0.002940881997346878, -0.026936396956443787]
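
A single vector is mostly useful in comparison with others. As a minimal sketch, cosine similarity scores how closely two vectors point in the same direction. The toy three-dimensional vectors below are made up for illustration; real ada-002 embeddings have 1536 dimensions, as printed above.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only (not model output)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1.0: similar direction
print(cosine_similarity(cat, invoice))  # close to 0.0: nearly unrelated
```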


## Custom Embeddings

Hugging Face

While we can integrate with any embeddings offered by Langchain, you can also implement the BaseEmbedding class and run your own custom embedding model!

For this, we will use the InstructorEmbedding pip package to run the hkunlp/instructor-large model found here: https://huggingface.co/hkunlp/instructor-large

In [170]:
# Install dependencies
!pip install InstructorEmbedding torch transformers sentence_transformers



Test the embeddings! Instructor embeddings work by instructing the model to represent text in a particular domain.

This makes sense for our llama-docs-bot, since we are searching very specific documentation!

Let's quickly test to make sure everything works.

In [171]:
from InstructorEmbedding import INSTRUCTOR

In [172]:
model = INSTRUCTOR('hkunlp/instructor-large')
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
embeddings = model.encode([[instruction,sentence]])
print(embeddings)

load INSTRUCTOR_Transformer
max_seq_length  512
[[-6.15552403e-02  1.04199732e-02  5.88440476e-03  1.93768777e-02
   5.71417809e-02  2.57655568e-02 -4.01991711e-05 -2.80044340e-02
  -2.92965490e-02  4.91884835e-02  6.78200275e-02  2.18692217e-02
   4.54528593e-02  1.50187053e-02 -4.84451912e-02 -3.25259790e-02
  -3.56492773e-02  1.19935293e-02 -6.83915569e-03  3.03126276e-02
   5.17491661e-02  3.48140486e-02  4.91032610e-03  6.68928549e-02
   1.52824381e-02  3.54217030e-02  1.07743731e-02  6.89828917e-02
   4.44019474e-02 -3.23419459e-02  1.24267889e-02 -2.15528104e-02
  -1.62690822e-02 -4.15058397e-02 -2.42290599e-03 -3.07158055e-03
   4.27047275e-02  1.56428497e-02  2.57813111e-02  5.92843145e-02
  -1.99174043e-02  1.32361855e-02  1.08408108e-02 -4.00610529e-02
  -1.36212644e-03 -1.57032702e-02 -2.53812168e-02 -1.31972907e-02
  -7.83779379e-03 -1.14009120e-02 -4.82025445e-02 -2.58416235e-02
  -4.98770736e-03  4.98239510e-02  1.19490176e-02 -5.55060469e-02
  -2.82120239e-02 -3.3220872
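
The `[instruction, sentence]` pairing used in the test above generalizes to several texts. As a small sketch, each text is paired with the same instruction before being handed to `model.encode`. The helper name `build_instructor_inputs` is hypothetical.

```python
from typing import List


def build_instructor_inputs(instruction: str, texts: List[str]) -> List[List[str]]:
    """Pair every text with the same instruction, one [instruction, text] per item."""
    return [[instruction, text] for text in texts]


pairs = build_instructor_inputs(
    "Represent the Science title:",
    [
        "3D ActionSLAM: wearable person tracking in multi-floor environments",
        "How do I create a vector index?",
    ],
)
print(pairs)
# embeddings = model.encode(pairs)  # with the INSTRUCTOR model loaded as above
```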

#### Undo Batching 

Looks good! But we can see the output is batched (i.e. a list of lists), so we need to undo the batching in our implementation!
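
As a sketch of what "undoing the batching" means, the helper below takes a batch holding exactly one embedding and returns that embedding as a flat list. The helper name `unbatch_single` is hypothetical and it is written against plain Python sequences. With the NumPy array printed in the test above, `embeddings[0].tolist()` is the equivalent.

```python
from typing import List, Sequence


def unbatch_single(batched: Sequence[Sequence[float]]) -> List[float]:
    """Return the only row of a one-item batch as a flat list of floats."""
    if len(batched) != 1:
        raise ValueError(f"expected a batch of 1, got {len(batched)}")
    return [float(x) for x in batched[0]]


batched = [[0.25, -0.5, 0.75]]  # one embedding wrapped in a batch
flat = unbatch_single(batched)
print(flat)  # [0.25, -0.5, 0.75]
```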

There are only 4 methods we need to implement below.

In [173]:
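from typing import Any, List
from InstructorEmbedding import INSTRUCTOR
from llama_index.bridge.pydantic import PrivateAttr
from llama_index.embeddings.base import BaseEmbedding

In [174]:

class InstructorEmbeddings(BaseEmbedding):
    """Custom embedding class that wraps an INSTRUCTOR model."""

    # Instruction paired with every text before encoding
    _instruction: str = "Represent the Computer Science text for retrieval:"
    # Private attribute holding the loaded model (not a pydantic field)
    _model: INSTRUCTOR = PrivateAttr()

    def __init__(
        self,
        instructor_model_name: str = "hkunlp/instructor-large",
        **kwargs: Any,
    ) -> None:
        super().__init__(**kwargs)
        # Store the model on the instance instead of relying on a global `model`
        self._model = INSTRUCTOR(instructor_model_name)

    def _get_query_embedding(self, query: str) -> List[float]:
        # encode() returns a batch; take the first (and only) row
        embeddings = self._model.encode([[self._instruction, query]])
        return embeddings[0].tolist()

    async def _aget_query_embedding(self, query: str) -> List[float]:
        # No native async support, so fall back to the synchronous method
        return self._get_query_embedding(query)

    def _get_text_embedding(self, text: str) -> List[float]:
        embeddings = self._model.encode([[self._instruction, text]])
        return embeddings[0].tolist()

    def _get_text_embeddings(self, texts: List[str]) -> List[List[float]]:
        # Encode all texts in one batch and return one list per text
        embeddings = self._model.encode([[self._instruction, text] for text in texts])
        return embeddings.tolist()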

In [175]:
# keep the batch size small (3 here) to avoid memory issues
# if you have a large GPU, you can increase this
instructor_embeddings = InstructorEmbeddings(embed_batch_size=3)

load INSTRUCTOR_Transformer
max_seq_length  512
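
To illustrate what a batch size of 3 implies, the sketch below splits a list of texts into groups of at most three. It only illustrates the idea and is not LlamaIndex's internal implementation. The helper name `batched` and the sample texts are hypothetical.

```python
from typing import Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"chunk {i}" for i in range(7)]
for batch in batched(texts, batch_size=3):
    print(len(batch), batch)
```

Seven texts produce batches of sizes 3, 3, and 1, so a larger batch size means fewer but more memory-hungry calls to the model.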


In [176]:
embed = instructor_embeddings.get_text_embedding("How do I create a vector index?")
print(len(embed))
print(embed[:10])

768
[0.003987060859799385, 0.012122981250286102, 0.002690523862838745, 0.01581709273159504, -0.005555964540690184, 0.03673827275633812, 0.010727009736001492, 0.00666137645021081, -0.0392913892865181, 0.013146855868399143]
