In [None]:
!pip install transformers
!pip install faiss-gpu
!pip install python-dotenv
!pip install tiktoken
!pip install langchain
!pip install sentence-transformers
!pip install openai

In [None]:
%load_ext autoreload
%autoreload 2

import dotenv

dotenv.load_dotenv()

from scripts import generate_context, retrieve_relevant_excerpts
from embeddings import retrieve_relevant_excerpts_quickly
from langchain.embeddings import HuggingFaceEmbeddings

In [None]:
needle_question_couples = [
    ("\nThe best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.\n", "What is the most fun thing to do in San Francisco?"),
    ("\nThe most inspiring thing to do near the Hugging Face office in Paris is to visit the Louvre museum.\n", "What is the most inspiring thing to do near the Hugging Face office in Paris?"),
]

needle, question = needle_question_couples[0]

# 0. Test retrieval

We test an information retrieval pipeline by first inserting a small piece of information (the _needle_) inside a very long text.
`generate_context` builds this text: `context_length` sets its length in tokens, and `depth_percent` sets how far into the text the needle is placed.
- The longer the context, the harder it is to find the needle.
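
To make the idea concrete, here is a minimal sketch of needle insertion. It is not the code of `generate_context`: the helpers `insert_needle` and `build_context` are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def insert_needle(haystack_tokens, needle_tokens, depth_percent):
    """Return a new token list with the needle placed depth_percent of the way in."""
    if not 0 <= depth_percent <= 100:
        raise ValueError("depth_percent must be between 0 and 100")
    position = int(len(haystack_tokens) * depth_percent / 100)
    return haystack_tokens[:position] + needle_tokens + haystack_tokens[position:]


def build_context(haystack, needle_text, context_length, depth_percent):
    # Assumption: whitespace-separated words play the role of tokens.
    needle_tokens = needle_text.split()
    # Truncate the haystack so that haystack + needle fits the token budget.
    budget = max(context_length - len(needle_tokens), 0)
    haystack_tokens = haystack.split()[:budget]
    return " ".join(insert_needle(haystack_tokens, needle_tokens, depth_percent))


filler = "lorem ipsum dolor sit amet " * 50  # 250 filler words
toy_context = build_context(filler, "THE NEEDLE", context_length=100, depth_percent=40)
toy_tokens = toy_context.split()
print(len(toy_tokens), toy_tokens.index("THE"))
```

With `depth_percent=0` the needle lands at the very start of the text, and with `depth_percent=100` at the very end.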

In [None]:
context = generate_context(needle, context_length=100000, depth_percent=40)

In [None]:
print(f"Context has {len(context)} characters")

In [None]:
embedding = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    encode_kwargs={'normalize_embeddings': False},
    model_kwargs={'device': 'cuda'},
)

Now, based on the chosen `embedding`, we retrieve the documents from the context that are most relevant to the given `question` (which relates to the `needle`).
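
As a rough illustration of what embedding-based retrieval does, the sketch below splits a text into chunks, embeds each chunk and the query, and ranks chunks by cosine similarity. It is not the implementation of `retrieve_relevant_excerpts`: all function names are hypothetical, and a bag-of-words counter stands in for the real embedding model.

```python
import math
from collections import Counter


def toy_embed(text):
    # Assumption: bag-of-words counts replace a real embedding model.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_into_chunks(text, chunk_size):
    # Fixed-size chunks of chunk_size words, with no overlap.
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def retrieve_top_k(text, query, chunk_size=10, k=1):
    chunks = split_into_chunks(text, chunk_size)
    query_vector = toy_embed(query)
    # Rank chunks from most to least similar to the query.
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(toy_embed(chunk), query_vector),
        reverse=True,
    )
    return ranked[:k]


toy_text = (
    "the weather was grey and the streets were quiet today "
    "visit the louvre museum near the paris office for inspiration "
    "nothing else happened for the rest of the long week"
)
top_chunks = retrieve_top_k(toy_text, "what to visit near the paris office")
print(top_chunks[0])
```

A real pipeline would swap `toy_embed` for dense vectors from a model such as the one configured above, and would usually store them in a vector index rather than sorting every chunk.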

### Retrieve documents on local machine (vanilla method)

In [None]:
retrieved_documents = retrieve_relevant_excerpts(context, question, embedding)
print(len(retrieved_documents))
print(retrieved_documents[:300])
print("(...)")
print(retrieved_documents[-300:])

### Retrieve documents with Text Embeddings Inference

In [None]:
retrieved_documents = await retrieve_relevant_excerpts_quickly(context, question, embedding)
print(len(retrieved_documents))
print(retrieved_documents[:300])
print("(...)")
print(retrieved_documents[-300:])

#### 👉 The computation runs much faster with Text Embeddings Inference (TEI).

Why is that?
As per the [repo's README](https://github.com/huggingface/text-embeddings-inference):

> TEI implements many features such as:
> - No model graph compilation step
> - Small docker images and fast boot times. Get ready for true serverless!
> - Token based dynamic batching
> - Optimized transformers code for inference using Flash Attention, Candle and cuBLASLt
> - Safetensors weight loading
> - Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
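
One item in this list, token based dynamic batching, can be illustrated with a small sketch. This is only the general idea, not TEI's actual code: the function `batch_by_token_budget` is hypothetical, and word counts stand in for real token counts. Texts are grouped so that each batch stays within a token budget, rather than holding a fixed number of texts.

```python
def batch_by_token_budget(texts, max_batch_tokens, count_tokens=lambda t: len(t.split())):
    """Greedily group texts so that each batch's total token count fits the budget."""
    batches, current, current_tokens = [], [], 0
    for text in texts:
        n = count_tokens(text)
        # Start a new batch when adding this text would exceed the budget.
        if current and current_tokens + n > max_batch_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        # A text longer than the budget still gets a batch of its own.
        current.append(text)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


pending_texts = ["a b c", "d e", "f g h i", "j", "k l m n o p"]
toy_batches = batch_by_token_budget(pending_texts, max_batch_tokens=5)
print(toy_batches)
```

Budgeting by tokens keeps the work per batch roughly even when input lengths vary widely, which is the case here since the context is split into many excerpts.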