<a href="https://github.com/different-ai/embedbase/tree/main/notebooks/Embedbase_Getting_started.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Welcome to Embedbase!

As a reminder, Embedbase is an end-to-end platform for managing ML embeddings.

Embeddings allow you to:
- connect your data to ChatGPT or any other LLM
- create recommendation engines
- classify data
- detect anomalies
- and more

Today we will run a local-first Embedbase using a `sentence-transformers` model as `Embedder` and a `MemoryDatabase` to store embeddings.
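
To make the idea of "searching with embeddings" concrete before running anything, here is a minimal sketch of ranking texts by cosine similarity. The three-dimensional vectors are made up for illustration (a real `sentence-transformers` model returns much longer vectors), and nothing here calls Embedbase itself.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings", invented for this example only.
toy_embeddings: Dict[str, List[float]] = {
    "lentil soup recipe": [0.9, 0.1, 0.0],
    "vegetarian dinner ideas": [0.8, 0.3, 0.1],
    "linux terminal commands": [0.0, 0.2, 0.9],
}
# Pretend this vector encodes a query about lentil recipes.
query_vector = [1.0, 0.2, 0.0]

# Sort texts from most to least similar to the query.
ranked = sorted(
    toy_embeddings,
    key=lambda text: cosine_similarity(query_vector, toy_embeddings[text]),
    reverse=True,
)
print(ranked)
```

The two food-related texts rank above the unrelated one because their vectors point in nearly the same direction as the query vector.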

In [30]:
!pip install -q embedbase sentence-transformers datasets git+https://github.com/different-ai/embedbase.git@main#subdirectory=sdk/embedbase-py httpx

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [31]:
from typing import List, Union
from embedbase import get_app
from embedbase.database.memory_db import MemoryDatabase
from embedbase.embedding.base import Embedder
from sentence_transformers import SentenceTransformer

class LocalEmbedder(Embedder):
    EMBEDDING_MODEL = "all-MiniLM-L6-v2"
 
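    def __init__(
        self, model: str = EMBEDDING_MODEL, **kwargs
    ):
        super().__init__(**kwargs)
        self.model = SentenceTransformer(model)
        # Store the embedding size so the `dimensions` property has a value to return
        self._dimensions = self.model.get_sentence_embedding_dimension()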
 
    @property
    def dimensions(self) -> int:
        """
        Return the dimensions of the embeddings
        :return: dimensions of the embeddings
        """
        return self._dimensions
 
    def is_too_big(self, text: str) -> bool:
        """
        Check if text is too big to be embedded,
        delegating the splitting UX to the caller
        :param text: text to check
        :return: True if text is too big, False otherwise
        """
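        # Note: this compares a character count against the model's token limit,
        # so it is only a rough, conservative check
        return len(text) > self.model.get_max_seq_length()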
 
    async def embed(self, data: Union[List[str], str]) -> List[List[float]]:
        """
        Embed a single text or a list of texts
        :param data: a text or a list of texts
        :return: list of embeddings, one per input text
        """
        embeddings = self.model.encode(data)
        return embeddings.tolist() if isinstance(data, list) else [embeddings.tolist()]


def run_app():
    print(
        """
                 _           _   _         _               _          _         
        /\ \        /\_\/\_\ _    / /\            /\ \       /\ \       
       /  \ \      / / / / //\_\ / /  \          /  \ \     /  \ \____  
      / /\ \ \    /\ \/ \ \/ / // / /\ \        / /\ \ \   / /\ \_____\ 
     / / /\ \_\  /  \____\__/ // / /\ \ \      / / /\ \_\ / / /\/___  / 
    / /_/_ \/_/ / /\/________// / /\ \_\ \    / /_/_ \/_// / /   / / /  
   / /____/\   / / /\/_// / // / /\ \ \___\  / /____/\  / / /   / / /   
  / /\____\/  / / /    / / // / /  \ \ \__/ / /\____\/ / / /   / / /    
 / / /______ / / /    / / // / /____\_\ \  / / /______ \ \ \__/ / /     
/ / /_______\\/_/    / / // / /__________\/ / /_______\ \ \___\/ /      
\/__________/ _      \/_/ \/_____________/\/__________/  \/_____/       
             / /\            / /\               / /\         /\ \       
            / /  \          / /  \             / /  \       /  \ \      
           / / /\ \        / / /\ \           / / /\ \__   / /\ \ \     
          / / /\ \ \      / / /\ \ \         / / /\ \___\ / / /\ \_\    
         / / /\ \_\ \    / / /  \ \ \        \ \ \ \/___// /_/_ \/_/    
        / / /\ \ \___\  / / /___/ /\ \        \ \ \     / /____/\       
       / / /  \ \ \__/ / / /_____/ /\ \   _    \ \ \   / /\____\/       
      / / /____\_\ \  / /_________/\ \ \ /_/\__/ / /  / / /______       
     / / /__________\/ / /_       __\ \_\\ \/___/ /  / / /_______\      
     \/_____________/\_\___\     /____/_/ \_____\/   \/__________/      
                                                                                                
                 [-0.005, 0.012, -0.008, ..., -0.010]
        """
    )
    return get_app().use_db(MemoryDatabase()).use_embedder(LocalEmbedder()).run()

app = run_app()

2023-04-21 15:55:24,894 - embedbase - INFO - Enabling Database <embedbase.database.memory_db.MemoryDatabase object at 0x7f44882c5d00>



                 _           _   _         _               _          _         
        /\ \        /\_\/\_\ _    / /\            /\ \       /\ \       
       /  \ \      / / / / //\_\ / /  \          /  \ \     /  \ \____  
      / /\ \ \    /\ \/ \ \/ / // / /\ \        / /\ \ \   / /\ \_____\ 
     / / /\ \_\  /  \____\__/ // / /\ \ \      / / /\ \_\ / / /\/___  / 
    / /_/_ \/_/ / /\/________// / /\ \_\ \    / /_/_ \/_// / /   / / /  
   / /____/\   / / /\/_// / // / /\ \ \___\  / /____/\  / / /   / / /   
  / /\____\/  / / /    / / // / /  \ \ \__/ / /\____\/ / / /   / / /    
 / / /______ / / /    / / // / /____\_\ \  / / /______ \ \ \__/ / /     
/ / /_______\/_/    / / // / /__________\/ / /_______\ \ \___\/ /      
\/__________/ _      \/_/ \/_____________/\/__________/  \/_____/       
             / /\            / /\               / /\         /\ \       
            / /  \          / /  \             / /  \       /  \ \      
           / / /\ \        / / /\ \        

2023-04-21 15:55:25,428 - embedbase - INFO - Enabling Embedder <__main__.LocalEmbedder object at 0x7f45741c5220>
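
The `LocalEmbedder` above exposes three members: a `dimensions` property, an `is_too_big` check, and an async `embed` that always returns a list of vectors. The sketch below mirrors that shape with a hypothetical `FakeEmbedder` that needs no model download, which can be handy for checking calling code. It is an assumption-laden stand-in: it does not subclass `Embedder`, its hash-based vectors carry no meaning, and it is not meant to be passed to `use_embedder` as-is.

```python
import asyncio
import hashlib
from typing import List, Union


class FakeEmbedder:
    """Hypothetical stand-in with the same three members as LocalEmbedder."""

    def __init__(self, dimensions: int = 8, max_chars: int = 256):
        # sha256 yields 32 bytes, so this sketch supports at most 32 dimensions
        if not 1 <= dimensions <= 32:
            raise ValueError("dimensions must be between 1 and 32")
        self._dimensions = dimensions
        self._max_chars = max_chars

    @property
    def dimensions(self) -> int:
        return self._dimensions

    def is_too_big(self, text: str) -> bool:
        # Character-based limit, chosen arbitrarily for the example
        return len(text) > self._max_chars

    async def embed(self, data: Union[List[str], str]) -> List[List[float]]:
        # Accept one string or a list, but always return a list of vectors
        texts = [data] if isinstance(data, str) else data
        return [self._vector(text) for text in texts]

    def _vector(self, text: str) -> List[float]:
        # Deterministic pseudo-embedding: hash bytes scaled to [0, 1]
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [byte / 255.0 for byte in digest[: self._dimensions]]


fake = FakeEmbedder()
single = asyncio.run(fake.embed("hello"))
batch = asyncio.run(fake.embed(["hello", "world"]))
print(len(single), len(batch), len(batch[0]))
```

Inside a notebook, where an event loop is already running, `await fake.embed("hello")` would replace the `asyncio.run(...)` calls.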


In [32]:
from datasets import load_dataset
dataset_id = "fka/awesome-chatgpt-prompts"
dataset = load_dataset(dataset_id, 'en', split='train', streaming=True)
print(next(iter(dataset)))

{'act': 'Linux Terminal', 'prompt': 'I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd'}


In [55]:
from embedbase_client.split import split_text
from httpx import AsyncClient

embedbase_dataset_id = dataset_id.split("/")[-1]
documents = []
for i, row in enumerate(dataset):
  # Split each prompt into overlapping chunks of at most 30 tokens
  for c in split_text(row["prompt"], max_tokens=30, chunk_overlap=20):
    documents.append({
        "data": c.chunk,
    })

  # Safety cap on the number of rows read from the streaming dataset
  if i >= 100_000:
    break
async with AsyncClient(app=app, base_url="http://localhost:8000") as client:
  res = await client.post(f"/v1/{embedbase_dataset_id}", json={"documents": documents})
  print(res.json())

2023-04-21 16:06:27,493 - embedbase - INFO - Refreshing 1476 embeddings
2023-04-21 16:06:27,519 - embedbase - INFO - Checking embeddings computing necessity for 1476 documents
2023-04-21 16:06:27,595 - embedbase - INFO - We will compute embeddings for 1476/1476 documents
2023-04-21 16:06:53,615 - embedbase - INFO - Uploaded 1443 documents
2023-04-21 16:06:53,624 - embedbase - INFO - Uploaded in 26.130375623703003 seconds
2023-04-21 16
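
The cell above splits each prompt with `max_tokens=30` and `chunk_overlap=20`, so neighbouring chunks share most of their content. The sketch below illustrates the general sliding-window idea using whitespace-separated words instead of tokens. The `chunk_words` helper is hypothetical and is not how `split_text` is implemented.

```python
from typing import List


def chunk_words(text: str, max_words: int, overlap: int) -> List[str]:
    """Split text into windows of max_words words that share `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + max_words]))
        # Stop once the current window already reaches the end of the text
        if start + max_words >= len(words):
            break
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in chunk_words(sample, max_words=4, overlap=2):
    print(chunk)
```

A large overlap relative to the chunk size keeps sentences from being cut at an unlucky boundary, at the cost of storing and returning near-duplicate chunks.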

In [60]:
from httpx import AsyncClient

async with AsyncClient(app=app, base_url="http://localhost:8000") as client:
  res = await client.post(f"/v1/{embedbase_dataset_id}/search", json={"query": "any idea of recipes with lentils with basil and ginger?", "top_k": 15})
  res = res.json()
  for r in res["similarities"]:
    print(r["data"])

2023-04-21 16:08:23,395 - embedbase - INFO - Query any idea of recipes with lentils with basil and ginger? created embedding, querying index


, and you will suggest recipes for me to try. You should only reply with the recipes you recommend, and nothing else. Do not write explanations.
. You should only reply with the recipes you recommend, and nothing else. Do not write explanations. My first request is "I am a vegetarian and
 I will tell you about my dietary preferences and allergies, and you will suggest recipes for me to try. You should only reply with the recipes you recommend
 I am looking for healthy dinner ideas."
 a vegetarian recipe for 2 people that has approximate 500 calories per serving and has a low glycemic index. Can you please provide a suggestion?
I want you to act as my personal chef. I will tell you about my dietary preferences and allergies, and you will suggest recipes for me to try
, and nothing else. Do not write explanations. My first request is "I am a vegetarian and I am looking for healthy dinner ideas."
I require someone who can suggest delicious recipes that includes foods which are nutritional
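
One way to "connect your data to ChatGPT or any other LLM" is to paste the retrieved chunks into a prompt. The sketch below assumes only the response shape used above (a list of items with a `data` field). The `build_prompt` helper, its prompt wording, and the `max_chars` budget are hypothetical choices for illustration, and the sample items are short made-up stand-ins for real search results.

```python
from typing import Any, Dict, List


def build_prompt(
    question: str, similarities: List[Dict[str, Any]], max_chars: int = 1000
) -> str:
    """Assemble retrieved chunks and a question into one prompt string."""
    context_parts: List[str] = []
    used = 0
    for item in similarities:
        chunk = item["data"].strip()
        # Skip exact repeats, which overlapping chunks can produce
        if chunk in context_parts:
            continue
        # Stop adding context once the character budget is exhausted
        if used + len(chunk) > max_chars:
            break
        context_parts.append(chunk)
        used += len(chunk)
    context = "\n".join(f"- {part}" for part in context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up search results in the same shape as res["similarities"]
sample_results = [
    {"data": "I want you to act as my personal chef."},
    {"data": "I want you to act as my personal chef."},
    {"data": "Suggest vegetarian recipes."},
]
prompt = build_prompt("any idea of recipes with lentils?", sample_results)
print(prompt)
```

The resulting string can then be sent to whichever LLM client you use.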