# LLM Implementation & Vector Storage Tutorial

## Original Source
See the medium article that we started from [here](https://towardsdatascience.com/all-you-need-to-know-to-build-your-first-llm-app-eb982c78ffac)

In [1]:
import requests
from bs4 import BeautifulSoup


url_wiki_uk_primeminister = "https://en.wikipedia.org/wiki/GPT-4"
response = requests.get(url_wiki_uk_primeminister)


soup = BeautifulSoup(response.content, 'html.parser')


# find the content div
content_div = soup.find('div', {'class': 'mw-parser-output'})
assert content_div is not None, f"No content returned from scraping wikiarticle at {url_wiki_uk_primeminister}"


# remove unwanted elements from div
unwanted_tags = ['sup', 'span', 'table', 'ul', 'ol']
for tag in unwanted_tags:
    for match in content_div.findAll(tag):
        match.extract()


print(content_div.get_text())

"ChatGPT-4" redirects here. For other uses, see GPT.2023 text-generating language model



Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large language model created by OpenAI, and the fourth in its numbered "GPT-n" series of GPT foundation models. It was released on March 14, 2023, and has been made publicly available in a limited form via the chatbot product ChatGPT Plus (a premium version of ChatGPT), and with access to the GPT-4 based version of OpenAI's API being provided via a waitlist. As a transformer based model, GPT-4 was pretrained to predict the next token (using both public data and "data licensed from third-party providers"), and was then fine-tuned with reinforcement learning from human and AI feedback for human alignment and policy compliance.
Observers reported the GPT-4 based version of ChatGPT to be an improvement on the previous (GPT-3.5 based) ChatGPT, with the caveat that GPT-4 retains some of the same problems. Unlike the predecessors, GPT-4 can ta

In [13]:
import nltk
nltk.download('wordnet')  # approx 30 seconds

[nltk_data] Downloading package wordnet to /Users/joey/nltk_data...


True

We use Langchain in the first step to load documents, analyse them and make them efficiently searchable. After we have indexed the text, it should become much more efficient to recognize text snippets that are relevant for answering the user's questions

LangChain is able to load a number of documents from a wide variety of sources. You can find a list of possible document loaders in the LangChain documentation. Among them are loaders for HTML pages, S3 buckets, PDFs, Notion, Google Drive and many more.

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

assert content_div is not None, f"No content returned from scraping wikiarticle at {url_wiki_uk_primeminister}"
article_text = content_div.get_text()


text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 100,
    chunk_overlap  = 20,
    length_function = len,
)


texts = text_splitter.create_documents([article_text])
print(texts[0])
print(texts[1])

page_content='"ChatGPT-4" redirects here. For other uses, see GPT.2023 text-generating language model' metadata={}
page_content='Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large language model created by' metadata={}


To convert text into embeddings, there are several ways, e.g. Word2Vec, GloVe, fastText or ELMo.

In the following we want to use the OpenAI API not only to use OpenAI's LLMs, but also to leverage their Embedding Models.

*Note: The difference between Embedding Models and LLMs is that Embedding Models focus on creating vector representations of words or phrases to capture their meanings and relationships, while LLMs are versatile models trained to generate coherent and contextually relevant text based on provided prompts or queries.*

## OpenAI Embedding models

Similar to the various LLMs from OpenAI, you can also choose between a variety of embedding models, such as Ada, Davinci, Curie, and Babbage. Among them, Ada-002 is currently the fastest and most cost-effective model, while Davinci generally provides the highest accuracy and performance.

---

We use OpenAI's API to translate our text snippets into embeddings as follows:

In [26]:
import openai
import os
import dotenv


dotenv.load_dotenv('../.env')

os.environ['OPENAI_API_KEY'] = os.environ['OPEN_AI_SECRET']  # * openai libs check this env variable
openai.api_key = os.environ['OPEN_AI_SECRET']


assert openai.api_key, "open ai api secret cannot be empty"
 
print(texts[0])

page_content='"ChatGPT-4" redirects here. For other uses, see GPT.2023 text-generating language model' metadata={}


In [8]:

# $ PAID USAGE METHOD
embedding = openai.Embedding.create(
    input=texts[0].page_content, model="text-embedding-ada-002"
)["data"][0]["embedding"]


len(embedding)

1536

When defining your models, you can set some preferences. The [OpenAI Playground](https://platform.openai.com/playground) gives you the possibility to play around a bit with the different parameters before you decide what settings you want to use:

On the right side in the Playground WebUI, you will find several parameters provided by OpenAI that allow us to influence the output of the LLM. Two parameters worth exploring are the model selection and the temperature.

Use ADA model as cheapest: [playground](https://platform.openai.com/playground?model=text-ada-001)

We are using langchain to connect to GPT.

In [10]:
from langchain.llms import OpenAI

llm = OpenAI(
    model='text-ada-001',  # 'text-davinci-003' | 'text-babbage-001' | 'text-curie-001' | 'text-ada-001' | ...rest in options on playground
    temperature=0.7,
    client=None,
    openai_api_key=os.environ['OPEN_AI_SECRET']
)

display(llm.__dict__)

{'cache': None,
 'verbose': False,
 'callbacks': None,
 'callback_manager': None,
 'client': openai.api_resources.completion.Completion,
 'model_name': 'text-ada-001',
 'temperature': 0.7,
 'max_tokens': 256,
 'top_p': 1,
 'frequency_penalty': 0,
 'presence_penalty': 0,
 'n': 1,
 'best_of': 1,
 'model_kwargs': {},
 'openai_api_key': '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',
 'openai_api_base': None,
 'openai_organization': None,
 'openai_proxy': None,
 'batch_size': 20,
 'request_timeout': None,
 'logit_bias': {},
 'max_retries': 6,
 'streaming': False,
 'allowed_special': set(),
 'disallowed_special': 'all'}

In [3]:
from langchain import VectorDBQAWithSourcesChain
import requests
from bs4 import BeautifulSoup
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
import numpy as np
import pandas as pd
from numpy.linalg import norm
####################################################################
# load documents
####################################################################
# URL of the Wikipedia page to scrape
url_wiki_fmcg = 'https://en.wikipedia.org/wiki/Fast-moving_consumer_goods'
url_wiki_uk_primeminister = 'https://en.wikipedia.org/wiki/Prime_Minister_of_the_United_Kingdom'

def get_wiki_info_to_text_chunks(url: str):

    # Send a GET request to the URL
    response = requests.get(url)

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all the text on the page
    text = soup.get_text()

    ####################################################################
    # split text
    ####################################################################
    text_splitter = RecursiveCharacterTextSplitter(
        # Set a really small chunk size, just to show.
        chunk_size = 100,
        chunk_overlap  = 20,
        length_function = len,
    )

    texts = text_splitter.create_documents([text])

    # create new list with all text chunks
    text_chunks: list[str] = []

    for text in texts:
        text_chunks.append(text.page_content)
        
    ####################################################################
    # calculate embeddings
    ####################################################################
    df = pd.DataFrame({'text_chunks': text_chunks})
    
    return df




def save_wiki_info_to_file(url: str, outFileNameNoExt: str):
    # Send a GET request to the URL
    response = requests.get(url)

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all the text on the page
    text = soup.get_text()
    text = text.replace('\n', '')

    # Open a new file called 'output.txt' in write mode and store the file object in a variable
    with open(f'{outFileNameNoExt}.txt', 'w', encoding='utf-8') as file:
        # Write the string to the file
        file.write(text)
        
    # load the document
    # with open(f'{outFileNameNoExt}.txt', encoding='utf-8') as f:
    #     text = f.read()

    # define the text splitter
    text_splitter = RecursiveCharacterTextSplitter(    
        chunk_size = 500,
        chunk_overlap  = 100,
        length_function = len,
    )

    texts = text_splitter.create_documents([text])
    return texts


persist_directory = './vector_stores'

def fill_vector_store_from_doc_chunks(texts: list[Document]):

    # define the embeddings model
    embeddings = OpenAIEmbeddings(client=None, openai_api_key = os.environ["OPENAI_API_KEY"])

    # use the text chunks and the embeddings model to fill our vector store
    
    # ~ https://colab.research.google.com/drive/1Q_I60MFpItT0-9jSW8wP1-7H6i2P4mUY#scrollTo=tYIqglQTk6eD&line=4&uniqifier=1
    #vstore with metadata. Here we will store page numbers.
    vectordb = Chroma.from_documents(
        texts,
        embeddings,
        # store the page number of the text_chunk using metadatas:
        # metadatas=[{"source": s} for s in sources],
        persist_directory=persist_directory,
    )
    #deciding model
    # model_name = "text-embedding-ada-002"
    # model_name = "gpt-3.5-turbo"
    # model_name = "gpt-4"
    vectordb.persist()
    return embeddings, vectordb

def fill_vector_store_from_text_chunks(texts: list[str]):

    # define the embeddings model
    embeddings = OpenAIEmbeddings(client=None, openai_api_key = os.environ["OPENAI_API_KEY"])

    # use the text chunks and the embeddings model to fill our vector store
    
    # ~ https://colab.research.google.com/drive/1Q_I60MFpItT0-9jSW8wP1-7H6i2P4mUY#scrollTo=tYIqglQTk6eD&line=4&uniqifier=1
    #vstore with metadata. Here we will store page numbers.
    vectordb = Chroma.from_texts(
        texts,
        embeddings,
        # store the page number of the text_chunk using metadatas:
        # metadatas=[{"source": s} for s in sources],
        persist_directory=persist_directory,
    )
    #deciding model
    # model_name = "text-embedding-ada-002"
    # model_name = "gpt-3.5-turbo"
    # model_name = "gpt-4"
    vectordb.persist()
    return embeddings, vectordb

def fill_vector_store_from_openai_ef(texts: list[str], openai_ef):

    # define the embeddings model
    # embeddings = OpenAIEmbeddings(client=None, openai_api_key = os.environ["OPENAI_API_KEY"])
    embeddings = openai_ef

    # use the text chunks and the embeddings model to fill our vector store
    
    # ~ https://colab.research.google.com/drive/1Q_I60MFpItT0-9jSW8wP1-7H6i2P4mUY#scrollTo=tYIqglQTk6eD&line=4&uniqifier=1
    #vstore with metadata. Here we will store page numbers.
    vectordb = Chroma.from_texts(
        texts,
        embeddings,
        # store the page number of the text_chunk using metadatas:
        # metadatas=[{"source": s} for s in sources],
        persist_directory=persist_directory,
    )
    #deciding model
    # model_name = "text-embedding-ada-002"
    # model_name = "gpt-3.5-turbo"
    # model_name = "gpt-4"
    vectordb.persist()
    return embeddings, vectordb
    
def load_vector_store():
    # Now we can load the persisted database from disk, and use it as normal.
    embeddings = OpenAIEmbeddings(client=None, openai_api_key = os.environ["OPENAI_API_KEY"])
    vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
    vectordb.get()
    return embeddings, vectordb
    
def load_QnA_service():
    embeddings, vectordb = load_vector_store()
    return VectorDBQAWithSourcesChain.from_chain_type(
        llm=OpenAI(
            model='text-embedding-ada-002',  # 'text-davinci-003' | 'text-babbage-001' | 'text-curie-001' | 'text-ada-001' | ...rest in options on playground
            temperature=0.7,
            client=None,
            openai_api_key=os.environ['OPEN_AI_SECRET'],
        ),
        k=1,
        chain_type="stuff",
        vectorstore=vectordb,
    )
    


In [47]:
df_uk_primeminister_wiki = get_wiki_info_to_text_chunks(url=url_wiki_uk_primeminister)
df_fmcg_wiki = get_wiki_info_to_text_chunks(url=url_wiki_fmcg)

In [49]:
text_splitter_docs = save_wiki_info_to_file(url=url_wiki_fmcg, outFileNameNoExt='wiki_FMCG')
embeddings, vectordb = fill_vector_store_from_doc_chunks(texts=text_splitter_docs)

ImportError: Could not import tiktoken python package. This is needed in order to for OpenAIEmbeddings. Please install it with `pip install tiktoken`.

In [21]:
display(df_fmcg_wiki)

Unnamed: 0,text_chunks
0,Fast-moving consumer goods - Wikipedia\n\n\n\n...
1,Main menu\n\n\n\n\n\nMain menu\nmove to sideba...
2,Navigation\n\t\n\nMain pageContentsCurrent eve...
3,Contribute\n\t\n\nHelpLearn to editCommunity p...
4,Languages\n\nLanguage links are at the top of ...
...,...
100,and Privacy Policy. Wikipedia® is a registered...
101,"Foundation, Inc., a non-profit organization."
102,Privacy policy\nAbout Wikipedia\nDisclaimers\n...
103,Mobile view\nDevelopers\nStatistics\nCookie st...


In [28]:
text_embedding_openai_model = "text-embedding-ada-002"

In [52]:

# get embeddings from text-embedding-ada model
def get_embedding(text: str, model: str=text_embedding_openai_model):
   text = text.replace("\n", " ")
   return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding']

# df_fmcg_wiki['ada_embedding'] = df_fmcg_wiki.text_chunks.apply(lambda x: get_embedding(x, model=text_embedding_openai_model))
# df_uk_primeminister_wiki['ada_embedding'] = df_uk_primeminister_wiki.text_chunks.apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))

KeyboardInterrupt: 

In [None]:
####################################################################
# calculate similarities to the user's question
####################################################################
# calcuate the embeddings for the user's question
users_question = "Who is the current Prime Minister of the UK?"

# question_embedding = get_embedding(text=users_question, model=text_embedding_openai_model)

In [None]:
users_question = "Who was the first Prime Minister of the UK?"

In [None]:
from langchain.llms import OpenAI
from langchain import PromptTemplate

users_question = "Who is the current Prime Minister of the UK?"

embeddings, vectordb = load_vector_store()

# use our vector store to find similar text chunks
results = vectordb.similarity_search(
    query=users_question,
    n_results=5
)

# define the prompt template
template = """
You are a chat bot who loves to help people! Given the following context sections, answer the
question using only the given context. If you are unsure and the answer is not
explicitly writting in the documentation, say "Sorry, I don't know how to help with that."

Context sections:
{context}

Question:
{users_question}

Answer:
"""

prompt = PromptTemplate(template=template, input_variables=["context", "users_question"])

# fill the prompt template
prompt_text = prompt.format(context = results, users_question = users_question)

# ask the defined LLM
llm(prompt_text)

In [18]:

import requests
from bs4 import BeautifulSoup
from collections import Counter
from nltk.stem import WordNetLemmatizer

# List of strings to search for
search_strings = ["grocery", "consumer product", "recipe"]

# Lemmatize the search strings
lemmatizer = WordNetLemmatizer()
lemmatized_search_strings = [lemmatizer.lemmatize(string.lower()) for string in search_strings]

# Wikipedia search URL
base_url = "https://en.wikipedia.org/w/index.php?search="

grocery_urls: dict[str, BeautifulSoup] = {}

# Perform search for each string
for string in lemmatized_search_strings:
    search_url = base_url + string.replace(" ", "+")
    print(f"Searching '{search_url}'")
    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, "html.parser")

    # Find all search results
    results = soup.find_all("li", {"class": "mw-search-result"})
    article_tags = soup.find_all("a[title='View the content page [ctrl-option-c]']")
    if article_tags:
        print("Result was an article")
        result = soup
        title = result.find("span", {"class": "mw-page-title-main"}).text
        link = result.find("a[title='View the content page [ctrl-option-c]']")["href"]

        # Fetch page content
        page_url = f"https://en.wikipedia.org/{link}"
        page_response = requests.get(page_url)
        page_soup = BeautifulSoup(page_response.content, "html.parser")

        # Count occurrences of the search string in page content
        content = page_soup.get_text().lower()
        count = content.count(string)

        # Check if the count is 5 or more
        if count >= 5:
            grocery_urls[page_url] = page_soup
            print("Title:", title)
            print("URL:", page_url)
            print("Occurrences:", count)
            print("---")
    elif results:
        print(f"Found {len(results)} results in search")
        # Check occurrence count in each search result
        for result in results:
            title = result.find("div", {"class": "mw-search-result-heading"}).find("a").text
            link = result.find("div", {"class": "mw-search-result-heading"}).find("a")["href"]

            # Fetch page content
            page_url = "https://en.wikipedia.org" + link
            page_response = requests.get(page_url)
            page_soup = BeautifulSoup(page_response.content, "html.parser")

            # Count occurrences of the search string in page content
            content = page_soup.get_text().lower()
            count = content.count(string)

            # Check if the count is 5 or more
            if count >= 5:
                grocery_urls[page_url] = page_soup
                print("Title:", title)
                print("URL:", page_url)
                print("Occurrences:", count)
                print("---")
    else:
        print(soup.get_text())


Searching 'https://en.wikipedia.org/w/index.php?search=grocery'




Grocery store - Wikipedia




































Jump to content








Main menu





Main menu
move to sidebar
hide



		Navigation
	

Main pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate




		Contribute
	

HelpLearn to editCommunity portalRecent changesUpload file




Languages

Language links are at the top of the page across from the title.



















Search











Search







Create accountLog in






Personal tools




 Create account Log in




		Pages for logged out editors learn more


ContributionsTalk


























Contents
move to sidebar
hide




(Top)





1Definition



Toggle Definition subsection





1.1United States







1.2United Kingdom







1.3India









2History



Toggle History subsection





2.1Early history







2.2Modernization









3Types



Toggle Types subsection





3.1Small format





3.1.1Neighborhood g

### Embed all sustained products, categories, vegi products and categories to visualise in one space

[('Food Collection Vietnamese Black Tiger Prawns in Garlic & Herb Butter',
  'Fish & Seafood'),
 ('Sliced Smoked Salmon', 'Fish & Seafood'),
 ('Golden Shell Mussels with a White Wine, Cream & Garlic Sauce',
  'Fish & Seafood'),
 ('Sea Bass Fillets', 'Fish & Seafood'),
 ('Food 2 Scottish Honey Roast Hot Smoked Salmon Fillets', 'Fish & Seafood'),
 ('Market St Whole Sea Bass', 'Fish & Seafood'),
 ('Flavoursome 2 Salmon Fillets', 'Fish & Seafood'),
 ('Taste the Difference British Smoked Mackerel', 'Fish & Seafood'),
 ('Food 2 Breaded Haddock Fillets', 'Fish & Seafood'),
 ('2 Breaded Chunky Cod Fillets', 'Fish & Seafood'),
 ('2 Cod Fillets Boneless', 'Fish & Seafood'),
 ('6 Boneless Salmon Fillets', 'Fish & Seafood'),
 ("The Fishmonger's On Market Street Smoked Mackerel Fillets",
  'Fish & Seafood'),
 ('Finest Raw Jumbo King Prawns', 'Fish & Seafood'),
 ('Select Farm Salmon Fillet Rosemary Garlic', 'Fish & Seafood'),
 ('Tesco Whole Sea Bream 220-400g', 'Fish & Seafood'),
 ('Lightly Dusted 2

In [18]:

from dataclasses import dataclass
import json
from typing import Tuple, Final
from nltk.corpus import stopwords
from unidecode import unidecode
from ftfy import fix_text
import re


cachedStopWords = stopwords.words("english")

def clean_words(words:str):
    words = unidecode(words)
    words = fix_text(words)
    words = ' '.join([word for word in words.split() if word not in cachedStopWords])
    words = re.sub(pattern=r'[^0-9A-Za-z\s]',string=words,repl="")
    return words


@dataclass
class VegiProductsFromJson:
    product_names: list[Tuple[str,str]]
    product_categories: list[str]
    source: Final[str] = "vegi"
    
    def limit(self, n: int):
        return type(self)(
            source=self.source,
            product_categories=self.product_categories[:n],
            product_names=self.product_names[:n],
        )

@dataclass
class ExternalProductsFromJson:
    source: str
    product_names: list[Tuple[str,str]]
    product_categories: list[str]
    
    def limit(self, n: int):
        return type(self)(
            source=self.source,
            product_categories=self.product_categories[:n],
            product_names=self.product_names[:n],
        )

    
def get_sustained_products():
    with open('../backup_localstorage/sustained-categories.json', 'r') as f:
        sustained_categories_json = json.load(f)
    sustained_categories = [clean_words(c["name"]) for c in sustained_categories_json["categories"]]

    sustained_product_names: list[Tuple[str,str]] = []
    for fp in os.listdir('../backup_localstorage/'):
        if fp != 'sustained-categories.json':
            with open(f'../backup_localstorage/{fp}', 'r') as f:
                json_obj = json.load(f)
            sustained_product_names += [
                (clean_words(p["name"]), clean_words(p["category"])) for p in json_obj
            ]
    return ExternalProductsFromJson(
        source="sustained",
        product_names=sustained_product_names,
        product_categories=sustained_categories,
    )

def get_vegi_products(vendorId: int=1):
    purple_carrot_json = requests.get(f'http://qa-vegi.vegiapp.co.uk/api/v1/vendors/{vendorId}')
    # display(purple_carrot_json.json())
    # all_keys = [k for k in purple_carrot_json.json()["vendor"]]
    return VegiProductsFromJson(
        product_names=[(clean_words(k["name"]), clean_words(k["category"]["name"])) for k in purple_carrot_json.json()["vendor"]["products"]],
        product_categories=[clean_words(k["name"]) for k in purple_carrot_json.json()["vendor"]["productCategories"]],
    )
    
vegi_products = get_vegi_products(vendorId=1)
vegi_product_names = vegi_products.product_names
vegi_product_categories = vegi_products.product_categories
sustained_products = get_sustained_products()
sustained_categories = sustained_products.product_categories
sustained_product_names = sustained_products.product_names
sustained_source_name = sustained_products.source

In [236]:
vegi_product_names

[('Falafel balls', 'Antipasti'),
 ('Classic houmous', 'Antipasti'),
 ('Creamy original spread', 'Spread'),
 ('Oumph burgers', 'Burgers'),
 ('Tofu weiner hotdogs', 'Sausages  hotdogs'),
 ('Plain tofu', 'Tofu'),
 ('Spreadable vegan butter', 'Butter'),
 ('Chickn strips original', 'Faux meats'),
 ('Violife original grated', 'Cheese'),
 ('Haggis', 'Faux meats'),
 ('Silken tofu firm', 'Tofu'),
 ('Kofu Sausages', 'Sausages  hotdogs'),
 ('Artichoke hearts', 'Antipasti'),
 ('Roasted red peppers', 'Antipasti'),
 ('2 x Strawberry yoghurt', 'Yoghurt'),
 ('Sausages', 'Sausages  hotdogs'),
 ('Semidried tomatoes', 'Antipasti'),
 ('Saurkraut', 'Ferments'),
 ('Black pudding vg', 'Faux meats'),
 ('Chunky guacamole', 'Antipasti'),
 ('Barista seed milk hemp drink', 'Milk'),
 ('Smoked tofu', 'Tofu'),
 ('Mellow cheddar block', 'Cheese'),
 ('Kalamata black olives pitted', 'Antipasti'),
 ('Chocolate spread', 'Spread'),
 ('Natural yoghurt', 'Yoghurt'),
 ('Capers', 'Antipasti'),
 ('Plain dairy free yoghurt', 'Y

In [237]:
all_product_names = [
    *vegi_product_names,
    *sustained_product_names,
]
all_products_and_cats = [
    *[x[0] for x in all_product_names],
    *vegi_product_categories,
    *sustained_categories,
]
display(all_products_and_cats)
len(all_products_and_cats)

['Falafel balls',
 'Classic houmous',
 'Creamy original spread',
 'Oumph burgers',
 'Tofu weiner hotdogs',
 'Plain tofu',
 'Spreadable vegan butter',
 'Chickn strips original',
 'Violife original grated',
 'Haggis',
 'Silken tofu firm',
 'Kofu Sausages',
 'Artichoke hearts',
 'Roasted red peppers',
 '2 x Strawberry yoghurt',
 'Sausages',
 'Semidried tomatoes',
 'Saurkraut',
 'Black pudding vg',
 'Chunky guacamole',
 'Barista seed milk hemp drink',
 'Smoked tofu',
 'Mellow cheddar block',
 'Kalamata black olives pitted',
 'Chocolate spread',
 'Natural yoghurt',
 'Capers',
 'Plain dairy free yoghurt',
 'Organic minestrone soup',
 'Chickpea  bean tagine',
 'Napoletana pasta sauce',
 'Chipotle cholula hot sauce',
 'Soya milk',
 'Mixed pitted greek olives',
 'Vegetable hotpot',
 'Sweet potato coconut  kale curry',
 'Spicy bean burger',
 'Spicy Peppernoni pizza',
 'Whole oat drink',
 'Hollandaise',
 'Egg Alternative',
 'Coconut milk',
 'Kofu Steak',
 'Raw kimchi hot sauce',
 'Pesto Siciliano

5443

Creating a vector store (vector database)
A vector store is a type of data store that is optimized for storing and retrieving large quantities of data that can be represented as vectors. These types of databases allow for efficient querying and retrieval of subsets of the data based on various criteria, such as similarity measures or other mathematical operations.

Converting our text data into vectors is the first step, but it is not enough for our needs. If we were to store the vectors in a data frame and search step-by-step the similarities between words every time we get a query, the whole process would be incredibly slow.

In order to efficiently search our embeddings, we need to index them. Indexing is the second important component of a vector database. The index provides a way to map queries to the most relevant documents or items in the vector store without having to compute similarities between every query and every document.

In recent years, a number of vector stores have been released. Especially in the field of LLMs, the attention around vector stores has exploded:

![llm vector stores](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*LKO2xA_5ZvhftwI4YhqL7A.png)

Release Vector Stores in the past years — Image by the author
Now let's just pick one and try it out for our use case. Similar to what we did in the previous sections, we are again calculating the embeddings and storing them in a vector store. To do this, we are using suitable modules from LangChain and chroma as a vector store.
![llm arch diagram](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*ApbpqcZUMF-YaA6DbnVGww.png)

### Calculate embeddings for first 100 and save to file, using function that first checks for existence of an embedding of a string, and if not exists, then asks for calculation?

In [62]:
import re
def id_from_name(name: str):
    return re.sub(
        pattern=r'[^0-9A-Za-z]',
        repl='',
        string=name.replace(" ","_")
    )


### Chroma Description
Chroma is the open-source embedding database. Chroma makes it easy to build LLM apps by making knowledge, facts, and skills pluggable for LLMs.

![chroma conda docs architecture diagram](https://docs.trychroma.com/img/hrm4.svg)

PyPI: [https://pypi.org/project/chromadb](https://pypi.org/project/chromadb)

:fire: The conda-forge recipe was generated with [Conda-Forger App](https://sugatoray-conda-forger.streamlit.app/).

---

### Chroma - Next steps
Chroma is designed to be simple enough to get started with quickly and flexible enough to meet many use-cases. You can use your own embedding models, query Chroma with your own embeddings, and filter on metadata. To learn more about Chroma, check out the [Usage Guide](https://docs.trychroma.com/usage-guide) and [API Reference](https://docs.trychroma.com/api-reference).

Chroma is integrated in LangChain (python and js), making it easy to build AI applications with Chroma. Check out the [integrations](https://docs.trychroma.com/integrations) page to learn more.

You can [deploy a persistent instance of Chroma](https://docs.trychroma.com/deployment) to an external server, to make it easier to work on larger projects or with a team.

In [29]:
import os
import chromadb
from chromadb.config import Settings
from chromadb.utils import embedding_functions
from chromadb.api import API
from chromadb.api.fastapi import FastAPI
from chromadb.api.local import LocalAPI
from chromadb.api.types import EmbeddingFunction, Metadata, ID, Where, WhereDocument, OneOrMany, Embedding, Include, GetResult
# from chromadb.api.types import Documents, EmbeddingFunction, Embeddings, CollectionMetadata
# from chromadb.errors import ChromaError, error_types
from chromadb.types import Collection, EmbeddingRecord

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import time
from typing import Any
from pprint import pprint, pformat

%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D


# ~ https://docs.trychroma.com/getting-started
# setup Chroma in-memory, for easy prototyping. Can add persistence easily!
# You can configure Chroma to save and load from your local machine. 
# Data will be persisted on exit and loaded on start (if it exists). 
# This is useful for many experimental / prototyping workloads, limited by your machine's memory.
# ~ https://docs.trychroma.com/usage-guide#initiating-a-persistent-chroma-client
# ! Having many in-memory clients that are loading and saving to the same path can cause strange behavior including data deletion. 
# ! As a general practice, create an in-memory Chroma client once in your application, 
# ! and pass it around instead of creating many clients.
chroma_client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory=os.path.join(os.path.dirname(os.getcwd()), ".chromadb")  # Optional, defaults to .chromadb/ in the current directory
))

display(chroma_client.list_collections())

chroma_esc_product_collection_name = "esc_product_vectors"

def chroma_get_esc_collection(
    name: str,
    # embedding_used_on_create_collection: EmbeddingFunction | None = embedding_functions.DefaultEmbeddingFunction(),
):
    '''
    WARNING: 
    
    If you later wish to get_collection, you MUST do so with the embedding function you supplied while creating the collection
    '''
    try:
        client = chroma_client
        return client.get_collection(
            name=name,
            embedding_function=embedding_functions.OpenAIEmbeddingFunction(
                api_key=os.environ["OPEN_AI_SECRET"],
                model_name=text_embedding_openai_model,
            ),
        )
    except Exception as e:
        print(f'Failed to get collection named "{name}", perhaps it had incorrect name or this collection was creating using a different embedding function? Error: {e}')
        print(e)
        return None

display(pformat(list(os.environ.keys())))
chroma_esc_collection = chroma_get_esc_collection(chroma_esc_product_collection_name)
display(chroma_esc_collection)

Using embedded DuckDB with persistence: data will be stored in: /Users/joey/Github_Keep/vegi-esc/vegi-esc-api/.chromadb
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction


[Collection(name=esc_product_vectors)]

"['COMMAND_MODE',\n 'CONDA_DEFAULT_ENV',\n 'CONDA_EXE',\n 'CONDA_PREFIX',\n 'CONDA_PROMPT_MODIFIER',\n 'CONDA_PYTHON_EXE',\n 'CONDA_SHLVL',\n 'HOME',\n 'HOMEBREW_CELLAR',\n 'HOMEBREW_PREFIX',\n 'HOMEBREW_REPOSITORY',\n 'INFOPATH',\n 'LESS',\n 'LOGNAME',\n 'LSCOLORS',\n 'LS_COLORS',\n 'MAMBA_EXE',\n 'MAMBA_ROOT_PREFIX',\n 'MANPATH',\n 'MallocNanoZone',\n 'NVM_BIN',\n 'NVM_CD_FLAGS',\n 'NVM_DIR',\n 'NVM_INC',\n 'OLDPWD',\n 'ORIGINAL_XDG_CURRENT_DESKTOP',\n 'PAGER',\n 'PATH',\n 'PWD',\n 'SHELL',\n 'SHLVL',\n 'SSH_AUTH_SOCK',\n 'TMPDIR',\n 'USER',\n 'VSCODE_AMD_ENTRYPOINT',\n 'VSCODE_CODE_CACHE_PATH',\n 'VSCODE_CRASH_REPORTER_PROCESS_TYPE',\n 'VSCODE_CWD',\n 'VSCODE_HANDLES_UNCAUGHT_ERRORS',\n 'VSCODE_IPC_HOOK',\n 'VSCODE_NLS_CONFIG',\n 'VSCODE_PID',\n 'XML_CATALOG_FILES',\n 'XPC_FLAGS',\n 'XPC_SERVICE_NAME',\n 'ZSH',\n '_',\n '__CFBundleIdentifier',\n '__CF_USER_TEXT_ENCODING',\n 'ELECTRON_RUN_AS_NODE',\n 'APPLICATION_INSIGHTS_NO_DIAGNOSTIC_CHANNEL',\n 'VSCODE_L10N_BUNDLE_LOCATION',\n 'PY

Collection(name=esc_product_vectors)

In [None]:
def split_array(array:list[str], batch_size:int):
    split_list = [array[i:i+batch_size] for i in range(0, len(array), batch_size)]
    return split_list

def chroma_init_esc_products_collection(
    chroma_client: FastAPI | LocalAPI,
    name:str = chroma_esc_product_collection_name,
    # embedding_function: EmbeddingFunction | None = None,
    # NOTE l2 is the default, https://docs.trychroma.com/usage-guide#changing-the-distance-function 
    # NOTE and https://github.com/nmslib/hnswlib/tree/master#supported-distances
    metadata: dict | None = {"hnsw:space": "cosine"}, 
    get_or_create: bool = False,
    vegi_products: VegiProductsFromJson | None = None,
    external_products: ExternalProductsFromJson | None = None,
    batch_sizes:int = -1,
):
    '''
    - `embedding_function` by default is the sentence_transformer, but can be set to the openAi transformer 'text-embedding-ada-002' for example
    - `metadata`: defines what distance function to use. see https://docs.trychroma.com/usage-guide#changing-the-distance-function
    - `get_or_create`:  If True, will return the collection if it already exists,
    '''
    # Create collection. get_collection, get_or_create_collection, delete_collection also available!
    collection = chroma_client.create_collection(
        name=chroma_esc_product_collection_name, 
        embedding_function=embedding_functions.OpenAIEmbeddingFunction(
            api_key=os.environ["OPEN_AI_SECRET"],
            model_name=text_embedding_openai_model,
        ),
        metadata=metadata,
        get_or_create=get_or_create,
    )

    # Add docs to the collection. Can also update and delete. Row-based API coming soon!
    # collection.add(
    #     documents=["This is document1", "This is document2"], # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
    #     metadatas=[{"source": "notion"}, {"source": "google-docs"}], # filter on these!
    #     embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...
    #     ids=["doc1", "doc2"], # unique for each doc
    # )
    if vegi_products:
        _product_ids = {id_from_name(p):p for (p, c) in vegi_products.product_names}
        _product_ids_to_cat = {id_from_name(p):c for (p, c) in vegi_products.product_names}
        result = collection.get(ids=list(_product_ids.keys()))
        still_to_create_ids = [id for (id,name) in _product_ids.items() if id not in result['ids']]
        still_to_create_names = [name for (id,name) in _product_ids.items() if id not in result['ids']]
        if still_to_create_names:
            print(f'Adding {len(still_to_create_names)}/{len(_product_ids)} of the {vegi_products.source} products')
            collection.add(
                documents=[p for p in still_to_create_names], # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
                metadatas=[{"category": _product_ids_to_cat[id], "source": vegi_products.source, "isProduct": True} for id in still_to_create_ids], # filter on these!
                ids=still_to_create_ids, # unique for each doc
            )
        _product_cat_ids = {id_from_name(c):c for c in vegi_products.product_categories}
        result = collection.get(ids=list(_product_cat_ids.keys()))
        still_to_create_ids = [id for (id,name) in _product_cat_ids.items() if id not in result['ids']]
        still_to_create_names = [name for (id,name) in _product_cat_ids.items() if id not in result['ids']]
        if still_to_create_names:
            print(f'Adding {len(still_to_create_names)}/{len(_product_cat_ids)} of the {vegi_products.source} categories')
            collection.add(
                documents=[c for c in still_to_create_names], # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
                metadatas=[{"category": f"{vegi_products.source}_category", "source": vegi_products.source, "isProduct": False} for c in still_to_create_names], # filter on these!
                ids=still_to_create_ids, # unique for each doc
            )
    if external_products:
        _product_ids = {id_from_name(p):p for (p, c) in external_products.product_names}
        _product_ids_to_cat = {id_from_name(p):c for (p, c) in external_products.product_names}
        result = collection.get(ids=list(_product_ids.keys()))
        still_to_create_ids = [id for (id,name) in _product_ids.items() if id not in result['ids']]
        still_to_create_names = [name for (id,name) in _product_ids.items() if id not in result['ids']]
        if still_to_create_names:
            print(f'Adding {len(still_to_create_names)}/{len(_product_ids)} of the {external_products.source} products')
            try:
                batches_names = [still_to_create_names]
                batches_ids = [still_to_create_ids]
                if batch_sizes > -1 and batch_sizes < len(still_to_create_names):
                    batches_names = split_array(still_to_create_names, batch_sizes)
                    batches_ids = split_array(still_to_create_ids, batch_sizes)
                rows_calculated = 0
                for i,b in enumerate(batches_names):
                    display(f'Computing batch {i} ({i*batch_sizes}/{len(still_to_create_names)})')
                    display(batches_names[i])
                    collection.add(
                        documents=[p for p in batches_names[i]], # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
                        metadatas=[{"category": _product_ids_to_cat[id], "source": external_products.source, "isProduct": True} for id in batches_ids[i]], # filter on these!
                        ids=batches_ids[i], # unique for each doc
                    )
                    rows_calculated += len(batches_names[i])
                    if rows_calculated >= 250:
                        chroma_client.persist()
            except Exception as e:
                print(e)
                raise e
        _product_cat_ids = {id_from_name(c):c for c in external_products.product_categories}
        result = collection.get(ids=list(_product_cat_ids.keys()))
        still_to_create_ids = [id for (id,name) in _product_cat_ids.items() if id not in result['ids']]
        still_to_create_names = [name for (id,name) in _product_cat_ids.items() if id not in result['ids']]
        if still_to_create_names:
            print(f'Adding {len(still_to_create_names)}/{len(_product_cat_ids)} of the {external_products.source} categories')
            collection.add(
                documents=[c for c in still_to_create_names], # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
                metadatas=[{"category": f"{external_products.source}_category", "source": external_products.source, "isProduct": False} for c in still_to_create_names], # filter on these!
                ids=still_to_create_ids, # unique for each doc
            )

    # ! In a normal python program, .persist() will happening automatically if you set it. But in a Jupyter Notebook you will need to manually call client.persist().
    chroma_client.persist()
    return collection
    
def chroma_remove_collection(name: str, client: API):
    return client.delete_collection(name=name)

# collection.peek() # returns a list of the first 10 items in the collection
# collection.count() # returns the number of items in the collection
# collection.modify(name="new_name") # Rename the collection


# # Query/search 2 most similar results. You can also .get by id
# results = collection.query(
#     query_texts=["This is a query document"],
#     n_results=2,
#     # where={"metadata_field": "is_equal_to_this"}, # optional filter
#     # where_document={"$contains":"search_string"}  # optional filter
# )
# # ~ https://docs.trychroma.com/usage-guide#querying-a-collection
# # The query will return the n_results closest matches to each query_embedding, in order. An optional where filter dictionary can be supplied to filter the results by the metadata associated with each document. Additionally, an optional where_document filter dictionary can be supplied to filter the results by contents of the document.
# collection.query(
#     query_embeddings=[[11.1, 12.1, 13.1],[1.1, 2.3, 3.2] ...]
#     n_results=10,
#     where={"metadata_field": "is_equal_to_this"},
#     where_document={"$contains":"search_string"}
# )
# # query by a set of query_texts. Chroma will first embed each query_text with the collection's embedding function, and then perform the query with the generated embedding.
# collection.query(
#     query_texts=["doc10", "thus spake zarathustra", ...]
#     n_results=10,
#     where={"metadata_field": "is_equal_to_this"},
#     where_document={"$contains":"search_string"}
# )
# # You can also retrieve items from a collection by id using .get.
# collection.get(
#     ids=["id1", "id2", "id3", ...],
#     where={"style": "style1"}
# )
# # .get also supports the where and where_document filters. If no ids are supplied, it will return all items in the collection that match the where and where_document filters.

def chroma_get_esc_product_vectors(
    client: FastAPI | LocalAPI,
    vegi_only: bool = False,
    collection: Collection | None = None,
    get_distances: bool = False,
    neighbourhood_of_product: Tuple[str,str] = ("", ""),
    n_results:int=10,
    ):
    '''
    Returns a Tuple[ GetResult.Embeddings, GetResult.Documents, GetResult ]
    
    A GetResult which is a dict with keys: ['ids', 'embeddings', 'documents', 'metadatas']
    '''
    if collection is None:
        collection = chroma_get_esc_collection(
            name=chroma_esc_product_collection_name,
            client=client,
        )
    # ~ https://docs.trychroma.com/usage-guide#using-where-filters
    where_products: dict[str,Any] = {
        'isProduct': True, # ~ https://docs.trychroma.com/usage-guide#querying-a-collection:~:text=Using%20the%20%24eq%20operator%20is%20equivalent%20to%20using%20the%20where%20filter.
    }
    where_categories: dict[str,Any] = {
        'isProduct': False,
    }
    if vegi_only:
        where_products["source"] = {"$eq": vegi_products.source}
        where_categories["source"] = {"$eq": vegi_products.source}
    
    query_texts=None
    where_products_documents: dict[str,Any] = {}
    if neighbourhood_of_product[0]:
        if get_distances:
            query_texts=[neighbourhood_of_product[0]]
        else:
            # what we want is all products in same category, + other categories
            products_in_same_category = [p for p,c in vegi_products.product_names if c == neighbourhood_of_product[1]]
            other_category_names = [
                *vegi_products.product_categories,
                *sustained_products.product_categories,
            ]
            # where_products_documents['$contains'] = neighbourhood_of_product
            query_texts = [
                *products_in_same_category,
                *other_category_names
            ]
    else:
        get_distances = False
        
    if collection is not None:
        result: GetResult
        ind: int
        if get_distances:
            if query_texts:
                print(f'chroma db `collection.query` will return len(query_texts)={len(query_texts)} results with each result containing the nearest {n_results} results. Use get to return exact matches on an id')
            _include:Include = ["embeddings", "documents", "metadatas", "distances"]
            result = collection.query(
                query_texts=query_texts,
                n_results=n_results,
                where=None if query_texts else where_products,
                include=_include,
            )
            if query_texts is not None and len(query_texts) == 1:
                def _flatten(a:list):
                    arr = np.array(a)
                    return arr.reshape(-1, arr.shape[-1]).reshape(-1, arr.shape[-1])
                    # return arr.ravel()
                for k in ["ids", *_include]:
                    result[k] = np.array(result[k])
                    
                # for k in ["embeddings", "metadatas"]:
                #     result[k] = _flatten(result[k])
            
                # # result["distances"] = np.squeeze(result["distances"])#.reshape(-1, np.array(result["distances"]).shape[-1])
                # result["documents"] = np.squeeze(result["documents"])
                # result["ids"] = np.squeeze(result["ids"])
                # result["metadatas"] = np.squeeze(result["metadatas"])
                # np.squeeze(x, axis=(2,)).shape
                display({k:result[k].shape for k in ["ids",*_include]})
            ind = -1
        else:
            ids = list(set([id_from_name(q) for q in query_texts])) if query_texts else None
            _include:Include = ["embeddings", "documents", "metadatas"]
            result = collection.get(
                ids=ids,
                where=None if query_texts else where_products,
                include=_include,
            )
            ind = result['ids'].index(id_from_name(neighbourhood_of_product[0])) if ids else -1
        return result, ind
    else:
        return None, -1
    
def chroma_nearest_category_vectors_to_products(
    client: FastAPI | LocalAPI,
    neighbourhood_of_product: Tuple[str,str] = ("", ""),
    n_results=10,
):
    result, id = chroma_get_esc_product_vectors(
        client=client,
        vegi_only=False,
        get_distances=True,
        neighbourhood_of_product=neighbourhood_of_product,
        n_results=n_results,
    )
    if not result:
        return None, None, None
    embeddings = np.array(result['embeddings'])
    metadatas = result['metadatas']
    source = [md['source'] for md in metadatas]
    category = [md['category'] for md in metadatas]
    df = pd.DataFrame(data={
        'documents': result['documents'],
        'source': source,
        'y': category if neighbourhood_of_product[0] else source,
        'category': category,
        'cos_diff': result['distances'],
    })
    df.sort_values(by=["cos_diff"], ascending=True, inplace=True)
    return df
    
def chroma_visualise_esc_product_vectors(
    client: FastAPI | LocalAPI,
    vegi_only: bool = False,
    neighbourhood_of_product: Tuple[str,str] = ("", ""),
    n_results=5,
):
    result, id = chroma_get_esc_product_vectors(
        client=client,
        vegi_only=vegi_only,
        get_distances=False,
        neighbourhood_of_product=neighbourhood_of_product,
        n_results=n_results,
    )
    if not result:
        return None, None, None
    embeddings = np.array(result['embeddings'])
    metadatas = result['metadatas']
    source = [md['source'] for md in metadatas]
    category = [md['category'] for md in metadatas]
    # display(result['embeddings'])
    # display(np.array(result['embeddings']).shape)
    # display(result['documents'])
    # display(np.array(result['documents']).shape)
    # display(np.array(result['metadatas']).shape)
    # display(metadatas)
    
    pca = PCA(n_components=3)
    pca_result = pca.fit_transform(embeddings)
    # if neighbourhood_of_product[0]:
    #     # what we want is all products in same category, + other categories
    #     products_in_same_category = [p for p,c in vegi_products.product_names if c == neighbourhood_of_product[1]]
    #     other_category_names = [
    #         *vegi_products.product_categories,
    #         *sustained_products.product_categories,
    #     ]
    #     # where_products_documents['$contains'] = neighbourhood_of_product
    #     query_texts = [
    #         *products_in_same_category,
    #         *other_category_names
    #     ]
    df = pd.DataFrame(data={
        'documents': result['documents'],
        'source': source,
        'y': category if neighbourhood_of_product[0] else source,
        'category': category,
    })
    df['pca-one'] = pca_result[:, 0]
    df['pca-two'] = pca_result[:, 1]
    df['pca-three'] = pca_result[:, 2]

    print('Explained variation per principal component: {}'.format(
        pca.explained_variance_ratio_))
    
    np.random.seed(42)
    rndperm = np.random.permutation(df.shape[0])
    
    plt.figure(figsize=(16, 10))
    scatter_text(
        x="pca-one", 
        y="pca-two",
        hue="y",
        # palette=sns.color_palette("hls", 10),
        palette=sns.color_palette("bright", df["y"].shape[0]),
        data=df.loc[rndperm, :],
        legend="full",
        alpha=0.3,
        labels_column="documents",
        title="PCA",
        origin_index=id,
    )
    time_start = time.time()
    tsne = TSNE(n_components=2, verbose=1, perplexity=max(1,min(40,embeddings.shape[0]/5)), n_iter=300)
    tsne_results = tsne.fit_transform(embeddings)
    df['tsne-2d-one'] = tsne_results[:,0]
    df['tsne-2d-two'] = tsne_results[:,1]

    plt.figure(figsize=(16,10))
    scatter_text(
        x="tsne-2d-one", 
        y="tsne-2d-two",
        hue="y",
        # palette=sns.color_palette("hls", 10),
        palette=sns.color_palette("bright", df["y"].shape[0]),
        data=df.loc[rndperm, :],
        legend="full",
        alpha=0.3,
        labels_column="documents",
        title="TSNE",
        origin_index=id,
    )
    
    print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
    return df, pca_result, tsne_results

def scatter_text(
    x: str, 
    y: str,
    hue: str, 
    palette,
    data:pd.DataFrame, 
    legend: str, 
    alpha: float,
    labels_column: str, 
    title: str = "", 
    xlabel: str = "", 
    ylabel: str = "",
    origin_index: int = -1,
):
    if not xlabel:
        xlabel = x
    if not ylabel:
        ylabel = y
    if not title:
        title = f'{xlabel} vs {ylabel}'
    
        
    """Scatter plot with country codes on the x y coordinates
       Based on this answer: https://stackoverflow.com/a/54789170/2641825"""
    # Create the scatter plot
    p1 = sns.scatterplot(
        x=x, y=y,
        hue=hue,
        data=data,
        palette=palette,
        alpha=alpha,
        legend=legend,
        size = 12,
    )
    
    # if origin_index > -1:
    #     origin = {'x':data[x][origin_index], 'y': data[y][origin_index]}
    #     # Plot the origin point with a larger marker and different color
    #     p1 = sns.scatterplot(x=origin['x'], y=origin['y'], s=100, color='red')
        
    #     # Label the origin point
    #     p1.text(origin['x'], origin['y'], 'Origin', ha='center', va='bottom', fontsize=12, color='red')
    
    # Add text besides each point
    for line in range(0,data.shape[0]):
        if line == origin_index:
            p1.text(data[x][line]+0.01, data[y][line], 
                    data[labels_column][line], horizontalalignment='left', 
                    size='large', color='green', weight='bold')
        else: 
            p1.text(data[x][line]+0.01, data[y][line], 
                 data[labels_column][line], horizontalalignment='left', 
                 size='medium', color='black', weight='semibold')
    # def label_point(x, y, val, ax):
    #     a = pd.concat({'x': x, 'y': y, 'val': val}, axis=1)
    #     for i, point in a.iterrows():
    #         ax.text(point['x']+.02, point['y'], str(point['val']))

    # label_point(data[x], data[y], data[labels_column], plt.gca())
    # Set title and axis labels
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    return p1
    
def chroma_get_vectors(
    client: FastAPI | LocalAPI,
    query_embeddings: OneOrMany[Embedding] | None = None,
    query_texts: OneOrMany[Document] | None = None,
    n_results: int = 10,
    where: Where | None = None,
    where_document: WhereDocument | None = None,
    ):
    '''
    For filtering by metadata, see https://docs.trychroma.com/usage-guide#filtering-by-metadata
    '''
    collection = chroma_get_esc_collection(
        name=chroma_esc_product_collection_name,
        client=client,
    )
    if collection is None:
        return None
    if query_embeddings:
        include: Include = ["embeddings", "metadatas", "documents", "distances"]
        return collection.query(
            query_embeddings=query_embeddings,
            n_results=n_results,
            where=where,
            where_document=where_document,
            include=include, # https://docs.trychroma.com/troubleshooting#using-get-or-query-embeddings-say-none
        )
    elif query_texts:
        include: Include = ["embeddings", "metadatas", "documents", "distances"]
        return collection.query(
            query_texts=query_texts,
            n_results=n_results,
            where=where,
            where_document=where_document,
            include=include, # https://docs.trychroma.com/troubleshooting#using-get-or-query-embeddings-say-none
        )
    else:
        include: Include = ["embeddings", "metadatas", "documents"]
        return collection.get(
            where=where,
            where_document=where_document,
            include=include,
            limit=n_results,
        )
    
def chroma_update_vectors(
    client: FastAPI | LocalAPI,
    ids: OneOrMany[ID],
    embeddings: OneOrMany[Embedding] | None = None,
    metadatas: OneOrMany[Metadata] | None = None,
    documents: OneOrMany[Document] | None = None
):
    '''
    See https://docs.trychroma.com/usage-guide#updating-data-in-a-collection
    '''
    collection = chroma_get_esc_collection(
        name=chroma_esc_product_collection_name,
        client=client,
    )
    if collection is None:
        return None
    return collection.update(
        ids=ids,
        embeddings=embeddings,  # [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
        metadatas=metadatas,  # [{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
        documents=documents,  #["doc1", "doc2", "doc3", ...],
    )
    
    
def chroma_get_vectors_for_texts(client: API, *texts: str):
    collection = chroma_get_esc_collection(
        name=chroma_esc_product_collection_name,
        client=client,
    )
    if collection is not None: 
        return collection.get(
            ids=[id_from_name(name=name) for name in texts]
        )
    else:
        return None
            

def chromadb_check_connection(client: FastAPI | LocalAPI):
    '''returns a nanosecond heartbeat. Useful for making sure the client remains connected.'''
    return client.heartbeat()

def chromadb_reset_db(client: FastAPI | LocalAPI):
    '''Empties and completely resets the database. ⚠️ This is destructive and not reversible.'''
    client.reset()
    
def chroma_most_similar_sustained_products(*vegi_product_names:str):
    _include: Include = ["embeddings", "documents", "distances", "metadatas"]
    collection = chroma_get_esc_collection(
        name=chroma_esc_product_collection_name,
        client=chroma_client,
    )
    query_texts = [*vegi_product_names]
    result = collection.query(
        query_texts=query_texts,
        where={'source': {"$eq": sustained_products.source}},
        n_results=3,
        include=_include,
    )
    for k in ["ids", *_include]:
        result[k] = np.array(result[k])
    display({k:result[k].shape for k in ["ids", *_include]})
    display(result["ids"])
    for i,text in enumerate(query_texts):
        most_similiar_doc = result["documents"][i,0]
        display(f'{text}[vegi] -> {most_similiar_doc}[sustained]')
    output = {
        text:result["documents"][i,0]
        for i,text in enumerate(query_texts)
    }
    return output

def chroma_most_similar_sustained_product(vegi_product_name:str):
    return chroma_most_similar_sustained_products(vegi_product_name)
    

In [13]:
def chroma_query_vector_store(include = ["documents"], top_n: int = 10, *query_texts: str):
    '''
    You can also query by a set of `query_texts`.
    Chroma will first embed each `query_text` with the collection's `embedding` function,
    and then perform the query with the `generated embedding`.
    
    The query will return the `n_results=10` closest matches to each `query_embedding`, in order.
    this will look like mapping a list of `query_texts` with len 3 -> a list of `embeddings` of len 3 -> a list: `[(top_n similar matches to emb in order) for emb in embeddings]`
    ========================================================================
    collection.query notes:
    top_n = 10 # by default
    return [
        {
            'documents':
                [ nth_most_similar_document_in_collection_to(text, n) for n in range(topn) ]
        }
        for text in query_texts
    ]
    ========================================================================
    
    Returns
    -------
    dict[Include, list[list[...]]]
        where this looks like:
    {
        'ids': [N,t] as list[list[ID]],
        'documents': [N,t] as list[list[Document]],
        'distances': [N,t] as list[list[float]],
    }
    where N := len(query_texts)
    where t := top_n nearest matches for each text to search for similarities for
    ========================================================================
    '''
    collection = chroma_get_esc_collection(
        name=chroma_esc_product_collection_name,
    )
    if collection is None:
        _collections = chroma_client.list_collections()
        display(
            f"Unable to get chroma esc collection: '{chroma_esc_product_collection_name}', possible collection names are: {(_collections)}"
        )
        return None
    result = collection.query(
        query_texts=[*query_texts],
        include=include,
        n_results=top_n,
    ) if query_texts or "distances" in include else collection.get(
        include=include,
        limit=top_n,
    )
    return result

In [32]:
from chromadb.api.models.Collection import Collection
from chromadb.api.types import (
    # EmbeddingFunction,
    Metadata,
    ID,
    Where,
    WhereDocument,
    OneOrMany,
    Embedding,
    Include,
    GetResult,
    QueryResult,
    Document,
)

def _f():
    measure_similarity_to_word = "Cupcake"
    words = []
    result = chroma_query_vector_store(["documents", "distances"], 3, measure_similarity_to_word, *words)
    assert result is not None, "chromadb QueryResult in LLM().chroma_query_vector_store didnt obtain a query result"
    documents = result["documents"]
    assert documents is not None, "chromadb QueryResult in LLM().chroma_query_vector_store didnt contain documents"
    if not documents:
        return {}
    assert "distances" in result, "distances key should be returned from LLM().chroma_query_vector_store but wasnt"
    distances = result["distances"]  # type: ignore
    assert distances is not None, "chromadb QueryResult in LLM().chroma_query_vector_store didnt contain distances"
    if not distances:
        return {}
    if isinstance(documents[0], list) and isinstance(documents[0][0], Document):
        return {
            d: 1.0 / w
            for (dl, wl) in list(zip(documents, distances))
            for (d, w) in list(zip(dl, wl))
        }
    elif isinstance(documents[0], Document):
        return {
            str(d): np.mean([1.0 / w for w in wl]) for (d, wl) in list(zip(documents, distances))
        }
    return {
        d: 1.0 / w
        for (dl, wl) in list(zip(documents, distances))
        for (d, w) in list(zip(dl, wl))
    }

_f()

{'Cakes': 0.10784614086151123,
 'Cake': 0.11612486839294434,
 'Candy': 0.1327604055404663}

In [None]:
        return {}
    assert isinstance(result, QueryResult) and "distances" in result.keys(), "distances key should be returned from LLM().chroma_query_vector_store but wasnt"
    distances = result["distances"]
    assert distances is not None, "chromadb QueryResult in LLM().chroma_query_vector_store didnt contain distances"
    if not distances:
        return {}
    if isinstance(documents[0], list) and isinstance(documents[0][0], Document):
        return {
            d: w
            for (dl, wl) in list(zip(documents, distances))
            for (d, w) in list(zip(dl, wl))
        }
        return [
            zip(dl, [1.0 / w for w in wl]) for dl in documents for wl in distances
        ]
    elif isinstance(documents[0], Document):
        return [(str(d), [1.0 / w for w in wl]) for d in documents for wl in distances]
    return [
        zip(dl, [1.0 / w for w in wl]) for dl in documents for wl in distances
    ]

_f()

In [284]:
# chroma_most_similar_sustained_product(vegi_product_name=vegi_products.product_names[0][0])
# chroma_most_similar_sustained_product(vegi_product_name=vegi_products.product_names[3][0])
# chroma_most_similar_sustained_product(vegi_product_name=vegi_products.product_names[30][0])
# chroma_most_similar_sustained_product(vegi_product_name=vegi_products.product_names[0][1]) # category
# chroma_most_similar_sustained_product(vegi_product_name=vegi_products.product_names[3][1]) # category
# chroma_most_similar_sustained_product(vegi_product_name=vegi_products.product_names[30][1]) # category

chroma_most_similar_sustained_products(
    vegi_products.product_names[0][0],
    vegi_products.product_names[30][0],
    vegi_products.product_names[10][0],
    vegi_products.product_names[10][1],
)

{'ids': (4, 3),
 'embeddings': (4, 3, 1536),
 'documents': (4, 3),
 'distances': (4, 3),
 'metadatas': (4, 3)}

array([['FalafelButternutCauliflowerTabboulehSideSalad',
        'TabboulehSalad', 'CurriedChickpeaSalad'],
       ['NapoletanaPastaSauce', 'BolognesePastaSauce',
        'AuthenticItalianRoastedGarlicPastaSauce'],
       ['ExtraFirmSilkenTofu', 'SweetSoySauce', 'SoftGoatsCheese'],
       ['ExtraFirmSilkenTofu', 'TofuChunksChilliGarlicSauce',
        'Asparagus']], dtype='<U45')

'Falafel balls[vegi] -> Falafel Butternut  Cauliflower Tabbouleh Side Salad[sustained]'

'Napoletana pasta sauce[vegi] -> Napoletana Pasta Sauce[sustained]'

'Silken tofu firm[vegi] -> Extra Firm Silken Tofu[sustained]'

'Tofu[vegi] -> Extra Firm Silken Tofu[sustained]'

{'Falafel balls': 'Falafel Butternut  Cauliflower Tabbouleh Side Salad',
 'Napoletana pasta sauce': 'Napoletana Pasta Sauce',
 'Silken tofu firm': 'Extra Firm Silken Tofu',
 'Tofu': 'Extra Firm Silken Tofu'}

In [272]:
_include: Include = ["embeddings", "documents", "distances", "metadatas"]
result = collection.query(
    query_texts=[vegi_product_names[0][0]],
    n_results=10,
    include=_include,
)
for k in ["ids", *_include]:
    result[k] = np.array(result[k])
display({k:result[k].shape for k in ["ids", *_include]})
display(result["ids"])

{'ids': (1, 3),
 'embeddings': (1, 3, 1536),
 'documents': (1, 3),
 'distances': (1, 10),
 'metadatas': (1, 3)}

array([['Falafelballs', 'Sweetpotatofalafelballs',
        'FalafelButternutCauliflowerTabboulehSideSalad']], dtype='<U45')

In [269]:
result, id = chroma_get_esc_product_vectors(
    client=chroma_client,
    vegi_only=False,
    get_distances=True,
    neighbourhood_of_product=vegi_products.product_names[1],
    n_results=80,  # Gets the n nearest neighbours to each query_text: The query will return the n_results closest matches to each query_embedding, in order.
)
display(result['documents'])
display(result['distances'])

chroma db `collection.query` will return len(query_texts)=1 results with each result containing the nearest 80 results. Use get to return exact matches on an id


{'ids': (1, 26),
 'embeddings': (1, 26, 1536),
 'documents': (1, 26),
 'metadatas': (1, 26),
 'distances': (1, 80)}

array([['Classic houmous', 'Reduced Fat Houmous',
        'Food Crunchy Carrot Houmous', 'Chargrill red pepper houmous',
        'Carrot  Reduced Fat Houmous', 'Tahini',
        'Taste Difference Summer Edition Sriracha Style Rippled Houmous',
        'Be Good Yourself Mini Houmous Snack Pots', 'Tabbouleh Salad',
        'Super Olive Tapenade', 'Moroccan Dry Black Olives',
        'Middle eastern chickpea curry', 'Falafel balls ',
        'Stoneless Black Hojiblanca Olives',
        'Creationz Tagine Chickpeas', 'Harissa Paste',
        'Sweet potato falafel balls', 'Oatgurt Greek Style',
        'Mixed Greek Olives  Feta', 'Classic Chilli', 'Vintage Gouda',
        'Mixed pitted greek olives', 'Black Olives Greek Style',
        'Original vegenaise', 'Greek Kalamata Olives',
        'Halkidiki Olives Stuffed Garlic Cloves']], dtype='<U62')

array([[0.        , 0.09757543, 0.09757543, 0.10235643, 0.10277927,
        0.10294116, 0.10294116, 0.10294116, 0.10294116, 0.10294116,
        0.10294116, 0.10294116, 0.11146665, 0.12483245, 0.12483245,
        0.12483245, 0.12483245, 0.12483245, 0.12483245, 0.12484723,
        0.12489945, 0.13405561, 0.13405561, 0.13594478, 0.13594478,
        0.13961971, 0.13964325, 0.13996041, 0.14000767, 0.14000767,
        0.14028168, 0.1417433 , 0.14210564, 0.14210564, 0.14210564,
        0.14210564, 0.14210564, 0.14581019, 0.146869  , 0.14796013,
        0.14851791, 0.14851791, 0.14905983, 0.14908248, 0.14916629,
        0.14916629, 0.14916629, 0.14916629, 0.14916629, 0.14916629,
        0.14921254, 0.15122324, 0.15208668, 0.15208668, 0.15246832,
        0.15246832, 0.15270025, 0.15270907, 0.15279073, 0.152843  ,
        0.15289205, 0.15289205, 0.15289205, 0.15289205, 0.15289205,
        0.15296751, 0.15296751, 0.15296751, 0.1530472 , 0.15368295,
        0.15368295, 0.15368295, 0.15368295, 0.15

In [246]:
collection = chroma_get_esc_collection(
    name=chroma_esc_product_collection_name,
    client=chroma_client,
)
display(collection)
if collection is not None:
    display(f'ChromaDB Collection contains {collection.count()} vectors')
    display(collection.peek())

Collection(name=esc_product_vectors)

'ChromaDB Collection contains 5025 vectors'

{'ids': ['EssentialWaitroseTunaSteakSpringWater',
  'SmokedGarlic',
  'TescoPerthshireScottishCarbonatedNaturalWater',
  'RadnorSplashForestFruitsSparklingFlavouredSpringWater',
  'Tofuweinerhotdogs',
  'Puree',
  'Sausageshotdogs',
  'Cheese',
  'Gravy',
  'Syrup'],
 'embeddings': [[0.00042959023267030716,
   -0.0032904783729463816,
   0.013767293654382229,
   -0.01663774810731411,
   -0.0037343103904277086,
   0.01841987669467926,
   -0.017807696014642715,
   -0.009801714681088924,
   -0.007910752668976784,
   -0.0425262451171875,
   0.0008689584210515022,
   0.00527156749740243,
   -0.014474703930318356,
   -0.012359275482594967,
   -0.020895814523100853,
   0.008842629380524158,
   0.015168510377407074,
   -0.020691752433776855,
   0.023290125653147697,
   -0.012080391868948936,
   -0.00859095435589552,
   0.009726892225444317,
   0.007359788753092289,
   0.0003530675021465868,
   0.006478926632553339,
   -0.010590748861432076,
   0.016161605715751648,
   -0.02979286015033722,
   0

In [131]:
vegi_products.product_names[:10]

[('Falafel balls ', 'Antipasti'),
 ('Classic houmous', 'Antipasti'),
 ('Creamy original spread', 'Spread'),
 ('Oumph! burgers', 'Burgers'),
 ('Tofu weiner hotdogs ', 'Sausages & hotdogs'),
 ('Plain tofu', 'Tofu'),
 ('Spreadable vegan butter', 'Butter'),
 ('Chickn strips (original)', 'Faux meats'),
 ('Violife original grated', 'Cheese'),
 ('Haggis', 'Faux meats')]

In [239]:


wierd_names = []
for x in sustained_products.product_names:
    if re.match(pattern=r'[^0-9A-Za-z\s]', string=x[0]):
        wierd_names.append((x[0], clean_words(x[0])))
display(wierd_names)

[]

In [None]:
collection = chroma_init_esc_products_collection(
    chroma_client=chroma_client,
    # name=chroma_esc_product_collection_name + '___',
    name=chroma_esc_product_collection_name,
    embedding_function=embedding_functions.OpenAIEmbeddingFunction(
        api_key=os.environ["OPEN_AI_SECRET"],
        model_name=text_embedding_openai_model,
    ),
    vegi_products=vegi_products,
    external_products=sustained_products,
    # vegi_products=vegi_products.limit(n=10),
    # external_products=sustained_products.limit(n=10),
    get_or_create=True, # gets collection if already exists.
    batch_sizes=25,
)
display(collection)

In [None]:

df = chroma_nearest_category_vectors_to_products(
    client=chroma_client,
    neighbourhood_of_product=vegi_products.product_names[1],
    n_results=30,
)
df

# embeddings = np.array(result['embeddings'])
# metadatas = result['metadatas']
# source = [md['source'] for md in metadatas]
# category = [md['category'] for md in metadatas]
# df = pd.DataFrame(data={
#     'documents': result['documents'],
#     'source': source,
#     'y': category if neighbourhood_of_product[0] else source,
#     'category': category,
#     'cos_sim': result['distances'],
# })
# df.sort_values(by=["cos_sim"], ascending=False, inplace=True)

In [None]:
# result = chroma_get_esc_product_vectors(
#     client=chroma_client,
#     collection=collection,
# )
# if result:
    # display(result['embeddings'])
    # ~ upsert embeddings using https://docs.trychroma.com/reference/Collection#upsert
    # display(np.array(result['embeddings']).shape)
    # embeddings, documents, getResultCollection = result
    # # display(list(vectors.keys()))
    # display(embeddings)
    # display(documents)
_df, pca_results, tsne_results = chroma_visualise_esc_product_vectors(
    client=chroma_client,
    neighbourhood_of_product=vegi_products.product_names[1],
    n_results=4,
)
if _df is not None:
    _df.head()

## Running Chroma in client/server mode
Chroma can also be configured to use an on-disk database, useful for larger data which doesn't fit in memory. To run Chroma in client server mode, run the docker container:

```shell
docker-compose up -d --build
```

Then update your chroma client to point at the docker container. `Default: localhost:8000`

```python
import chromadb
from chromadb.config import Settings
chroma_client = chromadb.Client(Settings(chroma_api_impl="rest",
                                        chroma_server_host="localhost",
                                        chroma_server_http_port="8000"
                                    ))
```

That's it! Chroma's API will run in client-server mode with just this change.

In [58]:

from dataclasses import dataclass
from chromadb.utils import embedding_functions

@dataclass
class TextVector:
    text: str
    vector: list


@dataclass
class TextSimilarity:
    text_one: TextVector    
    text_two: TextVector    
    similarity: float


def get_similarity(text_one: str, text_two: str):
    '''
    Uses a vector store to check if embeddings already exist for each textinput, 
    if an embedding does not exist, the openAi text-embedding-ada-002 model is used to get a vector for the text
    the cosine difference is then calculated between the 2 vectors and returned with embeddings 
    '''
    

def calculate_similarities_to_all(user_input_text: str, text_chunks: list[str]):
    # calcuate the embeddings for the user's question
    # df = df.copy(deep=True)
    # get embeddings from text-embedding-ada model
    # def _get_embedding(text: str, model: str=text_embedding_openai_model):
    #     text = text.replace("\n", " ")
    #     # todo: Add tqdm progress emitter
    #     return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding']
    def _get_embedding(text: str, model: str=text_embedding_openai_model):
        text = text.replace("\n", " ")
        # todo: Add tqdm progress emitter
        # ~ https://docs.trychroma.com/embeddings#openai
        openai_ef = embedding_functions.OpenAIEmbeddingFunction(
            api_key=os.environ['OPEN_AI_SECRET'],
            model_name="text-embedding-ada-002"
        )
        return openai_ef
    
    df = pd.DataFrame({'text_chunks': text_chunks})
    embeddings, vectordb = fetch_vector_store(texts=df['ada_embedding'].values)
    text_chunks_done, embeds = vectordb.fetch(text_chunks)
    df['ada_embedding'] = df.text_chunks.apply(lambda x: _get_embedding(x, model=text_embedding_openai_model))
    try:
        openai_ef = np.array(text_chunks).apply(_get_embedding)
        embeddings, vectordb = fill_vector_store_from_openai_ef(texts=text_chunks, openai_ef=openai_ef)
    except:
        pass
    # users_question = "Who is the current Prime Minister of the UK?"
    question_embedding = get_embedding(text=user_input_text, model=text_embedding_openai_model)

    # create a list to store the calculated cosine similarity
    cos_sim = []

    for index, row in df.iterrows():
        A = row.ada_embedding
        B = question_embedding

        # calculate the cosine similiarity
        cosine = np.dot(A,B)/(norm(A)*norm(B))

        cos_sim.append(cosine)

    df["cos_sim"] = cos_sim
    df["input"] = user_input_text
    df.sort_values(by=["cos_sim"], ascending=False)
    fn = user_input_text.replace(" ", "_")
    try:
        df.to_parquet(f'./vector_stores/vegi_products_from_{fn}.parquet')
    except Exception as e:
        print(f"Unable to save df to parquet because: {e}")
        print(e)
    return df

In [59]:
# todo: run the embedding function in chunks and concat the dfs at the end, then run in parallel with dask, see other esc project from before in esc_pdfs scripts location for this
# ! function took 31.8s for 10 comparisons (c 3s per comparison...)
# TODO: Add way to check for existing dfs with comparison that contains user_input vs the text_chunks, and only recompare text_chunks that have not already been embedded -> ths should be done by the vectorstore persisted...
df_vegi_prods = calculate_similarities_to_all(user_input_text="Classic houmous", text_chunks=all_products_and_cats[:10])

In [60]:
display(df_vegi_prods.head(n=5))

# def retrieve_embedding_cached(text: str, model: str="text-embedding-ada-002"):
#     text = text.replace("\n", " ").strip()
#     # ~ https://colab.research.google.com/drive/1Q_I60MFpItT0-9jSW8wP1-7H6i2P4mUY
#     stored = 
#     if not stored:
#         return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding']
#     else:
#         return stored
    

Unnamed: 0,text_chunks,ada_embedding,cos_sim
0,Falafel balls,"[-0.010855322703719139, -0.010623958893120289,...",0.854157
1,Classic houmous,"[-0.003279944648966193, 0.006397642195224762, ...",0.999998
2,Creamy original spread,"[-0.01092411857098341, -0.0071936058811843395,...",0.832197
3,Oumph! burgers,"[-0.011739605106413364, -0.019587628543376923,...",0.822517
4,Tofu weiner hotdogs,"[0.004491493571549654, -0.010557667352259159, ...",0.795934
