# Do similarity scores differ when going through Chroma?

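Problem:

- The cosine similarity returned through Chroma differs from the value computed directly with the embedding model
  - In `./simple.ipynb` the two similarities are 0.6 and 0.3
  - Through Chroma they are 0.65 and 0.5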

## Setup

In [1]:
%%time
%%capture

persist_dir = "/tmp/chroma_my_books"
!rm -rf $persist_dir

CPU times: user 10.5 ms, sys: 0 ns, total: 10.5 ms
Wall time: 109 ms


In [2]:
%%time
%%capture

!pip install chromadb
!pip install llama-index-vector-stores-chroma
!pip install llama-index-embeddings-huggingface
!pip install llama-index

CPU times: user 46.2 ms, sys: 13.6 ms, total: 59.7 ms
Wall time: 14.8 s


In [3]:
%%time

books = [
    {
        "name": "围城",
        "description": "主人公方鸿渐留学回国后，面临找工作和个人感情的种种问题。他经历了几段感情波折，包括与鲍小姐的失败婚姻和与孙柔嘉的恋情，最终与孙柔嘉结婚。但婚后生活并不如意，他在事业上也遭遇挫折，未能实现自己的理想。",
        "author": "钱钟书",
    },
    {
        "name": "故乡",
        "description": "小说讲述了主人公“我”（即鲁迅的化身）在阔别多年后回到故乡接母亲到城里生活的故事。在故乡，他遇到了童年的玩伴闰土和老仆人杨二嫂。通过与他们的交谈和观察，主人公感受到故乡的变化和人们生活的困苦。",
        "author": "鲁迅",
    },
    {
        "name": "阿Q正传",
        "description": "讲述了阿Q这个贫苦农民在中国封建社会中的悲惨生活。他虽然穷困潦倒，但心态自负，总是以精神胜利法来安慰自己，逃避现实的困境。然而，随着社会动荡和革命的到来，阿Q的命运变得更加悲惨，最终被误认为是革命党人而被处死。",
        "author": "鲁迅",
    },
]

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 3.58 µs


In [4]:
%%time

from llama_index.core import Document

documents=[]

for book in books:
    document=Document(
        text=book['description'],
        metadata={"name": book['name'], "author": book['author']},
    )
    documents.append(document)

CPU times: user 2.75 s, sys: 314 ms, total: 3.07 s
Wall time: 2.72 s


In [5]:
%%time

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

Settings.embed_model = HuggingFaceEmbedding(model_name="/models/bge-small-zh-v1.5")

CPU times: user 1.63 s, sys: 168 ms, total: 1.8 s
Wall time: 1.73 s


In [6]:
Settings.embed_model.max_length

512

## Using Chroma

### Set up storage

In [7]:
%%time

import chromadb

collection_name="my_books"

chroma_client = chromadb.EphemeralClient() # ephemeral client, in-memory storage
chroma_collection = chroma_client.create_collection(
    name=collection_name,
    metadata={"hnsw:space": "cosine"},
    # metadata={"hnsw:space": "ip"},
)

CPU times: user 456 ms, sys: 15.7 ms, total: 471 ms
Wall time: 471 ms
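
The `hnsw:space` setting decides which distance Chroma reports, and that distance is not the same thing as a similarity. The following is a minimal pure-Python sketch of the three documented spaces (`cosine`, `ip`, `l2`). The helper names `cosine_similarity` and `distance_for_space` are hypothetical and are not part of the Chroma API.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Plain cosine similarity, as computed directly from two embeddings."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def distance_for_space(a, b, space="cosine"):
    """Hypothetical helper mirroring the distance definitions in the Chroma docs."""
    if space == "cosine":
        return 1.0 - cosine_similarity(a, b)  # cosine distance
    if space == "ip":
        return 1.0 - dot(a, b)  # inner-product distance
    if space == "l2":
        return sum((x - y) ** 2 for x, y in zip(a, b))  # squared L2
    raise ValueError(f"unknown space: {space}")


# Two unit vectors whose cosine similarity is 0.6
a = [1.0, 0.0]
b = [0.6, 0.8]
print(cosine_similarity(a, b))
print(distance_for_space(a, b, "cosine"))
print(distance_for_space(a, b, "ip"))
print(distance_for_space(a, b, "l2"))
```

For normalized embeddings the `cosine` and `ip` distances coincide, so switching between the two metadata lines above should only matter if the vectors are not unit length.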


### Build the VectorStoreIndex

In [8]:
%%time

from llama_index.core import StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex(documents, storage_context=storage_context)

CPU times: user 567 ms, sys: 64.3 ms, total: 631 ms
Wall time: 629 ms
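
One thing to keep in mind when comparing against a direct embedding call: LlamaIndex embeds each node's text together with its metadata, using the `text_template`, `metadata_template` and `metadata_seperator` values that appear in the node repr in the query output. The sketch below reproduces that formatting in plain Python. The function `text_sent_to_embedder` is a hypothetical stand-in, not a LlamaIndex call, and whether this accounts for any part of the score difference is an assumption that has not been verified here.

```python
def text_sent_to_embedder(
    text,
    metadata,
    text_template="{metadata_str}\n\n{content}",
    metadata_template="{key}: {value}",
    metadata_separator="\n",
):
    """Hypothetical stand-in: format a node the way the default templates do."""
    metadata_str = metadata_separator.join(
        metadata_template.format(key=key, value=value)
        for key, value in metadata.items()
    )
    return text_template.format(metadata_str=metadata_str, content=text)


# Placeholder description; the real one comes from the `books` list
example = text_sent_to_embedder(
    "description goes here",
    {"name": "围城", "author": "钱钟书"},
)
print(example)
```

If the goal is to embed only the description, passing `excluded_embed_metadata_keys=["name", "author"]` when constructing each `Document` should keep the metadata out of the embedded text while still storing it for filtering.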


## Query

In [9]:
%%time

retriever = index.as_retriever()
nodes = retriever.retrieve("方鸿渐")

nodes

CPU times: user 9.06 ms, sys: 15.3 ms, total: 24.3 ms
Wall time: 24.9 ms


[NodeWithScore(node=TextNode(id_='ae7d0e87-1e75-4679-baca-1b75f17ca57e', embedding=None, metadata={'name': '围城', 'author': '钱钟书'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='主人公方鸿渐留学回国后，面临找工作和个人感情的种种问题。他经历了几段感情波折，包括与鲍小姐的失败婚姻和与孙柔嘉的恋情，最终与孙柔嘉结婚。但婚后生活并不如意，他在事业上也遭遇挫折，未能实现自己的理想。', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.6553767513462198),
 NodeWithScore(node=TextNode(id_='5c85c31c-d068-401e-ad51-d5eeb6518d63', embedding=None, metadata={'name': '阿Q正传', 'author': '鲁迅'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='讲述了阿Q这个贫苦农民在中国封建社会中的悲惨生活。他虽然穷困潦倒，但心态自负，总是以精神胜利法来安慰自己，逃避现实的困境。然而，随着社会动荡和革命的到来，阿Q的命运变得更加悲惨，最终被误认为是革命党人而被处死。', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.5053860977050783)]
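
A likely explanation for the gap is that the `score` shown here is not a raw cosine similarity. To my understanding, Chroma returns a cosine distance (`1 - cosine similarity`) and the LlamaIndex `ChromaVectorStore` then maps it to a score with `exp(-distance)`. That mapping is an assumption about the integration's internals and should be checked against the installed version. Under that assumption the scores can be inverted as sketched below, where `score_to_cosine` is a hypothetical helper.

```python
import math


def score_to_cosine(score):
    """Invert score = exp(-(1 - cos)); assumes that mapping is what was applied."""
    return 1.0 + math.log(score)


# Scores copied from the retriever output above
scores = [0.6553767513462198, 0.5053860977050783]
recovered = [score_to_cosine(s) for s in scores]
print([round(c, 2) for c in recovered])
```

The recovered values come out at roughly 0.58 and 0.32, which lines up with the 0.6 and 0.3 from `./simple.ipynb`. That suggests the underlying similarities agree and only the reported scale differs.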

### Filtering

In [10]:
%%time

from llama_index.core.vector_stores import (
    MetadataFilter,
    MetadataFilters,
    FilterOperator,
)

filters = MetadataFilters(
    filters=[
        MetadataFilter(
            key="author", operator=FilterOperator.EQ, value="钱钟书"
        ),
    ]
)

CPU times: user 90 µs, sys: 10 µs, total: 100 µs
Wall time: 103 µs


In [11]:
%%time

retriever = index.as_retriever(filters=filters)
nodes = retriever.retrieve("方鸿渐")

nodes

CPU times: user 10.2 ms, sys: 0 ns, total: 10.2 ms
Wall time: 9.41 ms


[NodeWithScore(node=TextNode(id_='ae7d0e87-1e75-4679-baca-1b75f17ca57e', embedding=None, metadata={'name': '围城', 'author': '钱钟书'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='主人公方鸿渐留学回国后，面临找工作和个人感情的种种问题。他经历了几段感情波折，包括与鲍小姐的失败婚姻和与孙柔嘉的恋情，最终与孙柔嘉结婚。但婚后生活并不如意，他在事业上也遭遇挫折，未能实现自己的理想。', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.6553767513462198)]