## Indexing: Inspecting and Managing Documents in a Vectorstore

In [1]:
%load_ext dotenv
%dotenv

In [2]:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

In [3]:
embedding = OpenAIEmbeddings(
    model='text-embedding-ada-002'
)

In [4]:
vectorstore_from_dir = Chroma(
    persist_directory='./vector-store-local',
    embedding_function=embedding
)


In [6]:
vectorstore_from_dir.get(ids= 'bfa4ee37-05bc-4886-afb9-9f345dd8c132', include=["embeddings"])

{'ids': ['bfa4ee37-05bc-4886-afb9-9f345dd8c132'],
 'embeddings': array([[ 0.00478017, -0.01535145,  0.02508651, ...,  0.02121745,
         -0.01364157, -0.00687695]]),
 'documents': None,
 'uris': None,
 'included': ['embeddings'],
 'data': None,
 'metadatas': None}
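
The `get` call above returns a dictionary whose `'embeddings'` entry holds one vector per requested ID. A minimal sketch of checking how many vectors came back and how long each one is follows. The `result` dictionary is a hand-built stand-in filled with placeholder zeros (not real embeddings), and `embedding_shape` is a hypothetical helper name. The width of 1536 is used because that is the vector length `text-embedding-ada-002` produces.

```python
# Stand-in for the dictionary returned by get(..., include=["embeddings"]).
# The numbers are placeholder zeros, not real embedding values.
result = {
    "ids": ["bfa4ee37-05bc-4886-afb9-9f345dd8c132"],
    "embeddings": [[0.0] * 1536],
    "documents": None,
    "metadatas": None,
}

def embedding_shape(get_result):
    """Return (number_of_vectors, dimension), or None if embeddings were not included."""
    embeddings = get_result.get("embeddings")
    if embeddings is None:
        return None
    count = len(embeddings)
    dimension = len(embeddings[0]) if count else 0
    return (count, dimension)

shape = embedding_shape(result)
print(shape)
```

Because the helper only relies on `len`, it should behave the same way on the NumPy array shown in the output above as on the plain list used here.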

In [7]:
added_document = Document(page_content="Alright! So.. Let's discuss the not-so-obvious differences between the terms analysis and",
                          metadata={'Course Title': 'Introduction to Data and Data Science',
                                    'Lecture Title':'Analysis vs analytics'})

In [8]:
added_document

Document(metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Analysis vs analytics'}, page_content="Alright! So.. Let's discuss the not-so-obvious differences between the terms analysis and")

In [9]:
vectorstore_from_dir.add_documents([added_document])

['8e1b19a4-5bbe-49f1-b735-ff36da2c7ca2']
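
The ID returned above was generated randomly, so adding the same document again would create a second entry under a new ID. One way to avoid that is to derive a deterministic ID from the document itself. The sketch below only builds such IDs; `content_id` is a hypothetical helper, and supplying the result to `add_documents` through an `ids=[...]` keyword is an assumption that should be checked against the installed LangChain version.

```python
import uuid

def content_id(page_content: str, lecture_title: str) -> str:
    """Derive a stable UUID (version 5) from the lecture title and the text."""
    key = f"{lecture_title}\n{page_content}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))

# Illustrative inputs, not taken from the vector store.
id_a = content_id("Just a test", "Analysis vs analytics")
id_b = content_id("Just a test", "Analysis vs analytics")
id_c = content_id("Another text", "Analysis vs analytics")

print(id_a == id_b)  # identical input yields the identical ID
print(id_a == id_c)  # different text yields a different ID
```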

In [10]:
vectorstore_from_dir.get(ids= '8e1b19a4-5bbe-49f1-b735-ff36da2c7ca2', 
                         include=["embeddings"])

{'ids': ['8e1b19a4-5bbe-49f1-b735-ff36da2c7ca2'],
 'embeddings': array([[ 0.00114877, -0.0014047 ,  0.01826029, ...,  0.02054116,
         -0.01492569, -0.04257623]]),
 'documents': None,
 'uris': None,
 'included': ['embeddings'],
 'data': None,
 'metadatas': None}

In [11]:
updated_document = Document(page_content="Just a test to update the document store",
                            metadata={'Course Title':"Introduction to Data and Data Science",
                                      'Lecture Title':'Programming Languages & Software Employed in Data Science.'})

In [12]:
vectorstore_from_dir.update_document(document_id='8e1b19a4-5bbe-49f1-b735-ff36da2c7ca2', document=updated_document)

In [13]:
vectorstore_from_dir.get(ids= '8e1b19a4-5bbe-49f1-b735-ff36da2c7ca2')

{'ids': ['8e1b19a4-5bbe-49f1-b735-ff36da2c7ca2'],
 'embeddings': None,
 'documents': ['Just a test to update the document store'],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': [{'Lecture Title': 'Programming Languages & Software Employed in Data Science.',
   'Course Title': 'Introduction to Data and Data Science'}]}

In [14]:
vectorstore_from_dir.delete(ids= '8e1b19a4-5bbe-49f1-b735-ff36da2c7ca2')

In [15]:
vectorstore_from_dir.get(ids= '8e1b19a4-5bbe-49f1-b735-ff36da2c7ca2')

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': []}
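
The empty lists in the output above are how the store reports that the ID no longer exists. A small sketch of turning that into a boolean check follows, using hand-copied stand-ins for the `get` results after the update and after the delete; `id_present` is a hypothetical helper name.

```python
def id_present(get_result: dict, doc_id: str) -> bool:
    """Return True if doc_id appears in the 'ids' list of a get() result."""
    return doc_id in get_result.get("ids", [])

doc_id = "8e1b19a4-5bbe-49f1-b735-ff36da2c7ca2"

# Abbreviated copies of the two outputs: after the update and after the delete.
after_update = {"ids": [doc_id], "documents": ["Just a test to update the document store"]}
after_delete = {"ids": [], "documents": [], "metadatas": []}

print(id_present(after_update, doc_id))
print(id_present(after_delete, doc_id))
```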

## Retrieval and Similarity Evaluation

In [16]:
question = "What programming languages do data scientists use?"

In [17]:
retrieved_docs = vectorstore_from_dir.similarity_search(query=question,
                                                        k = 5)
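
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors lie closest to it. The following is a minimal pure-Python sketch of that ranking using cosine similarity on tiny made-up 3-dimensional vectors. The vectors, the labels, and the `top_k` helper are illustrative assumptions, and the metric Chroma actually applies depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k):
    """Return the k (label, score) pairs with the highest cosine similarity."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up vectors standing in for stored chunk embeddings.
stored_vectors = {
    "programming languages": [0.9, 0.1, 0.0],
    "analysis vs analytics": [0.1, 0.9, 0.1],
    "big data tools": [0.7, 0.2, 0.3],
}
query_vector = [1.0, 0.0, 0.0]

results = top_k(query_vector, stored_vectors, k=2)
for label, score in results:
    print(f"{label}: {score:.3f}")
```

With real embeddings the vectors have many more dimensions, but the ranking step works the same way: score every stored vector against the query and keep the best `k`.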

In [18]:
retrieved_docs

[Document(metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'}, page_content='What about big data? Apart from R and Python, people working in this area are often proficient in other languages like Java or Scala. These two have not been developed specifically for doing statistical analyses, however they turn out to be very useful when combining data from multiple sources. All right! Let’s finish off with machine learning. When it comes to machine learning, we often deal with big data'),
 Document(metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'Course Title': 'Introduction to Data and Data Science'}, page_content='Thus, we need a lot of computational power, and we can expect people to use the languages similar to those in the big data column. Apart from R, Python, and MATLAB, other, faster languages are used

In [19]:
for doc in retrieved_docs:
    print(f"Page Content: {doc.page_content}\n--------------\nLecture Title: {doc.metadata['Lecture Title']}\n")

Page Content: What about big data? Apart from R and Python, people working in this area are often proficient in other languages like Java or Scala. These two have not been developed specifically for doing statistical analyses, however they turn out to be very useful when combining data from multiple sources. All right! Let’s finish off with machine learning. When it comes to machine learning, we often deal with big data
--------------
Lecture Title: Programming Languages & Software Employed in Data Science - All the Tools You Need

Page Content: Thus, we need a lot of computational power, and we can expect people to use the languages similar to those in the big data column. Apart from R, Python, and MATLAB, other, faster languages are used like Java, JavaScript, C, C++, and Scala. Cool. What we said may be wonderful, but that’s not all! By using one or more programming languages, people create application software or, as they are sometimes called, software solutions, that are adjusted 