# Indexing
## Embedding
After splitting, We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a VectorStore and Embeddings model.

In [None]:
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
import torch

### Sentence Transformer 
Model: "all-MiniLM-L6-v2"
```markdown
pooling_configuration:
{
  "word_embedding_dimension": 384,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": true,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false
}
sentence_bert_config:
{
  "max_seq_length": 256,
  "do_lower_case": false
}

In [58]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# equivalent to SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# this is the default model of langchain-huggingface-embeddings
# embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# import getpass
# import os
# os.environ["OPENAI_API_KEY"] = getpass.getpass()
# from langchain_openai import OpenAIEmbeddings
# embeddings = OpenAIEmbeddings(model="text-embedding-3-large") # 1536 dimensions by default
# text = "This is a test document."
# query_result = embeddings.embed_query(text)
# print(query_result[:5])
# doc_result = embeddings.embed_documents([text])
# print(doc_result[0][:5])
# # but by passing in dimensions=1024 we can reduce the size of our embeddings to 1024:
# embeddings_1024 = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=1024)

In [59]:
sentences = [
    'Why is RAG popular nowadays?',
    'It is complex to integrate retrieval component effectively.',
    'RAG provides more contextually relevant information.',
    'RAG can enhance the quality of generated text.',
    'A challenge in RAG is ensuring the retrieved documents are always relevant to the query.',
    'Models are particularly useful in open-domain question answering where external knowledge is crucial.',
    'Long Context Windows enable the analysis and summarization of lengthy documents efficiently.',
    'It is difficult to maintain coherence and context across lengthy spans of text.',
    'Long Context Windows offers the advantage of extracting key insights from large volumes of text quickly.',
    'Its ability to automate the summarization of reports, saving time and effort.',
    'It is the computational intensity required to process extensive texts.',
    'RAG is kind of difficult for a sophomore.'
]
query_result = embeddings.embed_query(sentences[0])
# embed single query
doc_results = embeddings.embed_documents(sentences) # encode
# embed list of texts

print(len(sentences))
print(len(doc_results[0]) == len(doc_results[1]) == len(query_result) == 384)
print(query_result)
print(doc_results[0])
# 二者结果有极细微差别 是因为api不同 预处理格式有区别
# 使用使用embed_query()来embed query 使用embed_documents()来embed documents

12
True
[-0.01946728117763996, 0.04551958292722702, 0.050001200288534164, -0.008813056163489819, -0.061793047934770584, 0.08493052423000336, 0.044280748814344406, 0.0900396853685379, -0.021105149760842323, 0.01685332879424095, -0.01165743451565504, 0.043103333562612534, -0.004071816336363554, -0.043039899319410324, 0.02665494568645954, 0.021819395944476128, 0.11488445103168488, -0.005748739931732416, -0.011197486892342567, -0.043945666402578354, -0.12934647500514984, 0.014117486774921417, 0.0438896045088768, -0.03750012442469597, 0.012102385051548481, -0.002147136954590678, -0.050425123423337936, -0.026829950511455536, 0.04428607597947121, -0.035589549690485, -0.040583543479442596, 0.12647095322608948, 0.01170614268630743, -0.0845954418182373, -0.11681095510721207, -0.01938350312411785, 0.02140682190656662, 0.07671185582876205, -0.020045312121510506, 0.037622977048158646, -0.00033299901406280696, 0.005541974678635597, -0.02060716226696968, -0.10365992784500122, -0.04180404916405678, -0

**embed_query & embed_documents**:

通过提供两个不同的函数 区分不同的使用场景 使API的设计更具可读性和明确性

用户可以明确地知道哪个函数用于处理文档 哪个函数用于处理查询 这有助于避免混淆 使代码更加直观

In [None]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
# print(embeddings)
# print(embeddings[0])
len(embeddings)
# len(embeddings[0])

**Pooling?**    
HuggingFaceEmbeddings 返回的嵌入向量通常是 PyTorch 张量  
Pooling (池化): 用于从高维数据中**提取特征**，降低数据维度  

Average Pooling  
Max Pooling  
Attention Pooling (注意力池化): 根据输入数据的重要性分配权重

## Store

Chroma in two modes:

- in-memory - in a python script or jupyter notebook

- in-memory with persistance - in a script or notebook and save/load to disk

Chroma是结构化的嵌入向量数据库，专门用于存储和检索高维嵌入向量

Chroma的存储方式是结构化的，因为它对数据进行了索引和组织以支持高效检索，但其存储内容（向量和元数据）可以视为非结构化数据的表示

因此，可以说ChromaDB是结构化存储非结构化数据的数据库

Chroma以collection的形式组织数据(mongoDB)，每个collection包含多个文档 嵌入向量和元数据

In [46]:
# use ChromaDB to store embeddings
from langchain_chroma import Chroma
from asg_loader import DocumentLoading
from asg_splitter import TextSplitting
# from langchain_community.embeddings.sentence_transformer import (
#     SentenceTransformerEmbeddings,
# )
from langchain.embeddings import HuggingFaceEmbeddings


**In-memory** - in a python script or jupyter notebook

In [57]:
asg_splitter = TextSplitting()
splitters = asg_splitter.pypdf_recursive_splitter("./Test2.pdf")
embedding_function = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# load it into Chroma
db = Chroma.from_documents(splitters, embedding_function)

# query it
query = "What methodology is used in the paper?"
docs = db.similarity_search(query) # use vector store as a retriever
for doc in docs:
    print(doc)
print("\n")
print(docs[0])

page_content='InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics , pp.\n15836–15848. Association for Computational Linguistics, 2023.\nYung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljacic,'
page_content='InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics , pp.\n15836–15848. Association for Computational Linguistics, 2023.\nYung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljacic,'
page_content='InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics , pp.\n15836–15848. Association for Computational Linguistics, 2023.\nYung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljacic,'
page_content='InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics , pp.\n15836–15848. Association for Computational Linguistics, 2023.\nYung-Sung Chuang, Rumen Dangovski,

In [60]:
# 查看数据库中的所有向量
vectors = db.get(include=["embeddings"])
print(len(vectors))
# print("查看数据库中的所有向量:")
# for i, vector in enumerate(vectors["embeddings"]):
#     print(f"Vector {i+1}:\n{vector}\n{'-'*50}\n")

# 查询数据库中的文档数量
documents = db.get()
num_documents = len(documents["documents"])
print(f"Total number of documents: {num_documents}")

# 查询数据库中的向量数量
vectors = db.get(include=["embeddings"])
num_vectors = len(vectors["embeddings"])
print(f"Total number of vectors: {num_vectors}")

# 查询数据库中的元数据数量
metadata = db.get(include=["metadatas"])
num_metadata = len(metadata["metadatas"])
print(f"Total number of metadatas: {num_metadata}")

# 查询数据库中的单个文档
document_id = documents["ids"][0]
document = db.get(ids=[document_id])
# document_id = documents["documents"][0]["id"]
# document = db.get_by_id(document_id)
print(f"查询数据库中的单个文档：\n{document}")
# print(f"查询数据库中的单个文档:\n{document}\n")

# 查询数据库中的所有属性
print("All attributes in the database:")
print(documents.keys())


# # create simple ids
# ids = [str(i) for i in range(1, len(splitters) + 1)]
# asg_splitter = TextSplitting()
# splitters = asg_splitter.pypdf_recursive_splitter("./Test2.pdf")
# embedding_function = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# # add data
# example_db = Chroma.from_documents(splitters, embedding_function, ids=ids)
# splitters = example_db.similarity_search(query)
# print(splitters[0].metadata)

# # update the metadata for a document
# splitters[0].metadata = {
#     "source": "Text2.pdf",
#     "author": "Li Xianming, Li Jing",
# }
# example_db.update_document(ids[0], splitters[0])
# print(example_db._collection.get(ids=[ids[0]]))

# # delete the last document
# print("count before", example_db._collection.count())
# example_db._collection.delete(ids=[ids[-1]])
# print("count after", example_db._collection.count())

'''
{'source': '../../../state_of_the_union.txt'}
{'ids': ['1'], 'embeddings': None, 'metadatas': [{'new_value': 'hello world', 'source': '../../../state_of_the_union.txt'}], 'documents': ['Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.']}
count before 46
count after 45
'''

6
Total number of documents: 1736
Total number of vectors: 1736
Total number of metadatas: 1736
查询数据库中的单个文档：
{'ids': ['0021ac77-f03e-4c2e-9ffa-72d1c50f1d54'], 'embeddings': None, 'metadatas': [None], 'documents': ['Preprint\n-1.00 -0.75 -0.50 -0.25 0.00 0.25 0.50 0.75 1.004-53-42-31-20-10-5\n(a)\n (b)\nFigure 5: (a) Density plots of cosine similarities between sentence pairs in the STS-B test set. The\npairs have been categorized into 6 groups, reflecting the ground truth ratings (where higher ratings'], 'uris': None, 'data': None}
All attributes in the database:
dict_keys(['ids', 'embeddings', 'metadatas', 'documents', 'uris', 'data'])


"\n{'source': '../../../state_of_the_union.txt'}\n{'ids': ['1'], 'embeddings': None, 'metadatas': [{'new_value': 'hello world', 'source': '../../../state_of_the_union.txt'}], 'documents': ['Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.']}\ncount before

In [None]:
# load the document and split it into chunks
asg_splitter = TextSplitting()
splitters = asg_splitter.pypdf_recursive_splitter("./Test1.pdf")
# for splitter in splitters: # for visualization
#     print(splitter)
#     print("=====================================")

Update and Delete

While building toward a real application, you want to go beyond adding data, and also update and delete data.

Chroma has users provide ids to simplify the bookkeeping here. ids can be the name of the file, or a combined has like filename_paragraphNumber, etc.

Chroma supports all these operations - though some of them are still being integrated all the way through the LangChain interface. Additional workflow improvements will be added soon.

Here is a basic example showing how to do various operations:

In [61]:
# create simple ids
ids = [str(i) for i in range(1, len(splitters) + 1)]
asg_splitter = TextSplitting()
splitters = asg_splitter.pypdf_recursive_splitter("./Test2.pdf")
embedding_function = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# add data
example_db = Chroma.from_documents(splitters, embedding_function, ids=ids)

splitters = example_db.similarity_search(query)
print(splitters[0].metadata)

# update the metadata for a document
splitters[0].metadata = {
    "source": "Text2.pdf",
    "author": "Li Xianming, Li Jing",
}
example_db.update_document(ids[0], splitters[0])
print(example_db._collection.get(ids=[ids[0]]))

# delete the last document
print("count before", example_db._collection.count())
example_db._collection.delete(ids=[ids[-1]])
print("count after", example_db._collection.count())

'''
{'source': '../../../state_of_the_union.txt'}
{'ids': ['1'], 'embeddings': None, 'metadatas': [{'new_value': 'hello world', 'source': '../../../state_of_the_union.txt'}], 'documents': ['Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.']}
count before 46
count after 45
'''

{}
{'ids': ['1'], 'embeddings': None, 'metadatas': [{'author': 'Li Xianming, Li Jing', 'source': 'Text2.pdf'}], 'documents': ['InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics , pp.\n15836–15848. Association for Computational Linguistics, 2023.\nYung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljacic,'], 'uris': None, 'data': None}
count before 1737
count after 1736


"\n{'source': '../../../state_of_the_union.txt'}\n{'ids': ['1'], 'embeddings': None, 'metadatas': [{'new_value': 'hello world', 'source': '../../../state_of_the_union.txt'}], 'documents': ['Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.']}\ncount before

Use OpenAI Embeddings

Not open-source, but much more powerful

In [None]:
# get a token: https://platform.openai.com/account/api-keys
import os
from getpass import getpass
from langchain_openai import OpenAIEmbeddings
OPENAI_API_KEY = getpass()
# os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
# embeddings = OpenAIEmbeddings()
# new_client = chromadb.EphemeralClient()
# openai_lc_client = Chroma.from_documents(
#     docs, embeddings, client=new_client, collection_name="openai_collection"
# )
# query = "What did the president say about Ketanji Brown Jackson"
# docs = openai_lc_client.similarity_search(query)
# print(docs[0].page_content)

Similarity search with score.

The returned distance score is cosine distance. Therefore, a lower score is better.

In [None]:
docs = db.similarity_search_with_score(query)

docs[0]
'''
(Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at 
        it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his 
        life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States 
        Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is 
        nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals 
        Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', 
        metadata={'source': '../../../state_of_the_union.txt'}),
 1.1972057819366455)
'''

Retriever options: to use Chroma as a retriever.

In addition to using similarity search in the retriever object, you can also use mmr.

In [None]:
retriever = db.as_retriever(search_type="mmr")

retriever.invoke(query)[0]
'''
Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at 
        it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I'd like to honor someone who has dedicated his 
        life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States 
        Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is 
        nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals 
        Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', 
        metadata={'source': '../../../state_of_the_union.txt'})
'''

Filtering on metadata
It can be helpful to narrow down the collection before working with it.

For example, collections can be filtered on metadata using the get method.

In [None]:
# filter collection for updated source
example_db.get(where={"source": "some_other_source"})

'''
{'ids': [], 'embeddings': None, 'metadatas': [], 'documents': []}
'''

**In-memory with persistance** - in a script or notebook and save/load to disk

Save to disk: simply initialize the Chroma client and pass the directory where the data to be saved to.

Chroma makes a best-effort to automatically save data to disk, however multiple in-memory clients can stop each other's work. As a best practice, only have one client per path running at any given time.

In [None]:
# save to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
docs = db2.similarity_search(query)

# load from disk
db3 = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
docs = db3.similarity_search(query)
print(docs[0].page_content)

Passing a Chroma Client into Langchain

You can also create a Chroma Client and pass it to LangChain. This is particularly useful if you want easier access to the underlying database.

You can also specify the collection name that you want LangChain to use.

In [None]:
import chromadb

persistent_client = chromadb.PersistentClient()
collection = persistent_client.get_or_create_collection("collection_name")
collection.add(ids=["1", "2", "3"], documents=["a", "b", "c"])

langchain_chroma = Chroma(
    client=persistent_client,
    collection_name="collection_name",
    embedding_function=embedding_function,
)

print("There are", langchain_chroma._collection.count(), "in the collection")


'''
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 3
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 3
Add of existing embedding ID: 1
Insert of existing embedding ID: 1
Add of existing embedding ID: 2
Insert of existing embedding ID: 2
Add of existing embedding ID: 3
Insert of existing embedding ID: 3
``````output
There are 3 in the collection
'''