### A peek into the database

This notebook acts as your guide into the database.
The examples are based on the publication "Data management for production quality deep learning models: Challenges and solutions".

The notebook is interactive, but it will work only if you have already added a document to the database. You can do that from the command line with "python add_document document.pdf". You have to supply your own document. This is by design: I want you to find something relevant to your interests, so that you can see the benefits of vector databases for yourself. If you do not want to do that, you can still read through the generated outputs.

Once you add a document, a ./db folder is generated in the current directory. It holds your chunked documents.
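Before connecting, it can help to confirm that the folder actually exists and is not empty. The following is a minimal sketch using only the standard library; the helper name `db_folder_ready` is hypothetical and not part of this project or of Chroma.

```python
from pathlib import Path


def db_folder_ready(path="./db"):
    """Return True if `path` is an existing directory with at least one entry."""
    folder = Path(path)
    return folder.is_dir() and any(folder.iterdir())


if not db_folder_ready():
    print("No populated ./db folder found; add a document first.")
```

If the check fails, the cells in this notebook will most likely show an empty list of collections rather than the outputs recorded here.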

Below is the code used to set up a connection to the database in the db folder:

In [19]:
import os

import chromadb
from chromadb.config import Settings
from chromadb.utils import embedding_functions

# Embedding function backed by OpenAI; the API key is read from the environment.
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-ada-002",
)

# Open the persisted database stored in the ./db folder.
chroma_client = chromadb.Client(
    Settings(chroma_db_impl="duckdb+parquet", persist_directory="./db")
)

chroma_client.list_collections()

Using embedded DuckDB with persistence: data will be stored in: ./db
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction


[Collection(name=langchain)]

LangChain creates a collection named "langchain". In this case, it uses an OpenAI model to embed the text and calculate the vectors.
When retrieving the collection, we have to supply the same embedding function, or we will get an error later on.

In [20]:
collection = chroma_client.get_or_create_collection(
    name="langchain",
    embedding_function=openai_ef,
)
collection.peek()

{'ids': ['fa74c205-fa78-11ed-a7c9-845cf32dc1a6',
  'fa74c22b-fa78-11ed-a638-845cf32dc1a6',
  'fa74c1e1-fa78-11ed-9900-845cf32dc1a6',
  'fa74c1e2-fa78-11ed-a5a0-845cf32dc1a6',
  'fa74adf3-fa78-11ed-b503-845cf32dc1a6',
  'fa74adf4-fa78-11ed-8220-845cf32dc1a6',
  'fa74adf5-fa78-11ed-8ed2-845cf32dc1a6',
  'fa74adf6-fa78-11ed-b7ba-845cf32dc1a6',
  'fa74adf7-fa78-11ed-b6d0-845cf32dc1a6',
  'fa74adf8-fa78-11ed-b73b-845cf32dc1a6'],
 'embeddings': [[-0.009861902333796024,
   -0.016499318182468414,
   0.0026312365662306547,
   -0.0167226605117321,
   -0.0165132787078619,
   0.016150347888469696,
   -0.0009326232830062509,
   -6.815827509853989e-05,
   -0.003132008947432041,
   -0.04695745185017586,
   -0.009575746953487396,
   0.02095218002796173,
   0.0008942365529946983,
   -0.0019158473005518317,
   -0.014244970865547657,
   0.007321398239582777,
   0.0017387449042871594,
   -0.0044668205082416534,
   -0.0044214543886482716,
   -0.03230069577693939,
   -0.012514077126979828,
   -0.00199611042

We can count the rows to see how many document parts we have in the database. The count does not tell us how many documents we have, only how many chunks all of the documents were sliced into.

In [21]:
collection.count()

148
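To estimate the number of documents rather than chunks, one option is to count the distinct sources recorded in the chunk metadata. The sketch below assumes that each chunk's metadata carries a "source" key, which depends on how the documents were loaded and is not confirmed by this notebook. The helper `count_sources` is hypothetical, and the example data is a stand-in shaped like the dictionary that `collection.get()` returns.

```python
def count_sources(get_result, key="source"):
    """Count distinct values of `key` across the chunk metadata."""
    metadatas = get_result.get("metadatas") or []
    # Skip chunks with no metadata or without the assumed key.
    sources = {m[key] for m in metadatas if m and key in m}
    return len(sources)


# With a live collection this could be called as (assumed usage):
# count_sources(collection.get(include=["metadatas"]))

# Stand-in data, not taken from the real database.
example = {
    "ids": ["a", "b", "c"],
    "metadatas": [
        {"source": "paper.pdf"},
        {"source": "paper.pdf"},
        {"source": "notes.pdf"},
    ],
}
print(count_sources(example))  # prints 2
```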

This is the database call that sits underneath LangChain's ChainVectorDBChain: a query that retrieves the most relevant document chunks. We can specify how many results we want with n_results; they come back ordered from most to least relevant.
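To make "most relevant" concrete, here is a toy sketch of nearest-neighbour ranking over embedding vectors. It assumes squared Euclidean distance as the measure and uses made-up two-dimensional vectors; the real database works with much longer vectors and an index, so treat this as an illustration of the idea and not as Chroma's implementation. The names `squared_l2` and `nearest` are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vec, stored, n_results=2):
    """Return the ids of the n_results stored vectors closest to query_vec."""
    ranked = sorted(stored, key=lambda item: squared_l2(query_vec, item[1]))
    return [doc_id for doc_id, _ in ranked[:n_results]]


# Made-up (id, vector) pairs standing in for embedded chunks.
stored = [
    ("chunk-a", [1.0, 0.0]),
    ("chunk-b", [0.0, 1.0]),
    ("chunk-c", [0.9, 0.1]),
]
print(nearest([1.0, 0.0], stored))  # prints ['chunk-a', 'chunk-c']
```

In the real query, the question text is first turned into a vector by the embedding function, and the database then performs this kind of ranking for us.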

In [22]:
collection.query(
    query_texts=["What are the lifecycle phases of test data management?"],
    n_results=5
)

{'ids': [['fa74c20c-fa78-11ed-bff2-845cf32dc1a6',
   'fa74adf7-fa78-11ed-b6d0-845cf32dc1a6',
   'fa74c203-fa78-11ed-8b10-845cf32dc1a6',
   'fa74c1ec-fa78-11ed-9397-845cf32dc1a6',
   'fa74c1e9-fa78-11ed-ba50-845cf32dc1a6']],
 'embeddings': None,
 'documents': [['To summarize, the data management challenges that can be\n\nfound at the post-deployment stage include the change in data\n\nsources and distribution, feedback loops, and data drifts. A de-\n\ngenerative feedback loop is a problem specific to recommender\n\nsystems.\n\n12\n\nA.R. Munappy, J. Bosch, H.H. Olsson et al.\n\nThe Journal of Systems & Software 191 (2022) 111359\n\nTable 4\n\nMapping between Data Lifecycle phase, Challenges and Potential Solutions.\n\nData lifecycle Phase\n\nChallenge\n\n1. Lack of labeled data\n\nData Collection\n\nData Exploration\n\nData Preprocessing\n\n2. Data Granularity\n\n3. Shortage of diverse samples\n\n4. Data sharing and\n\ntracking methods\n\n5. Data Storage complying\n\nto GDPR\n\n6. Stati