# 基于嵌入和 BM25 混合检索

## 准备

In [1]:
%%time
%%capture

# 安装所需的库

!pip install llama-index-vector-stores-qdrant
!pip install qdrant_client
!pip install trafilatura

!pip install rank_bm25
!pip install nltk jieba

!pip install llama-index-retrievers-bm25==0.1.3
!pip install rank_bm25
!pip install jieba

CPU times: user 72.9 ms, sys: 17.5 ms, total: 90.4 ms
Wall time: 26.6 s


In [2]:
%time

# 加载llm和embeddings

%run ../utils2.py

from llama_index.core import Settings

Settings.llm=get_llm()
Settings.embed_model = get_embedding()

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 3.34 µs


## 加载文档

In [3]:
%%time

from llama_index.readers.web import TrafilaturaWebReader

documents = TrafilaturaWebReader().load_data(
    ["https://baike.baidu.com/item/固态电池"]
)

len(documents)

CPU times: user 313 ms, sys: 7.61 ms, total: 320 ms
Wall time: 371 ms


1

In [4]:
documents[0].text[:500]

'收藏\n查看我的收藏\n0有用+1\n- 中文名\n- 固态电池\n- 外文名\n- Solid-state batteries\n- 领 域\n- 硬件\n- 能量密度\n- 锂离子电池的2倍\n- 性 质\n- 一种使用固体电极和固体电解质的电池\n- 特 点\n- 功率密度较低，能量密度较高\n2030年，锂离子电池将不再是电动汽车电池主流，但其在某些电子原件领域仍有一席之地。 [1]据SNE Researchd的测算，2025年我国固态电池市场空间有望达30亿元，2030年有望达到200亿元。 [3]\n在2010年，丰田就曾推出过续航里程可超过1000KM的固态电池。而包括QuantumScape以及Sakti3所做的努力也都是在试图用固态电池来取代传统的液态锂电池。\n加拿大Avestor公司也曾尝试过研发固态锂电池，最终2006年正式申请破产。Avestor公司使用一种高分子聚合物分离器，代替电池中的液体电解质，但一直没有解决安全问题，在北美地区发生过几起电池燃烧或者爆炸事件。\n2015年3月中旬，真空吸尘器的发明者、英国戴森公司（Dyson）创始人詹姆斯·戴森将其首笔1500万美元的投资投向了固态电池公'

In [5]:
%%time

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=128, chunk_overlap=10)
nodes = splitter.get_nodes_from_documents(documents)

len(nodes)

CPU times: user 256 ms, sys: 23.4 ms, total: 280 ms
Wall time: 279 ms


31

## 基于嵌入的检索

In [6]:
%%time

# 启用 Qdrant 作为向量存储

import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client.models import Distance, VectorParams
from llama_index.core import StorageContext

client = qdrant_client.QdrantClient(
    location=":memory:",
    vectors_config=VectorParams(
        size=1024, 
        distance=Distance.COSINE
    ),
)

vector_store = QdrantVectorStore(client=client, collection_name="simple")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

CPU times: user 747 ms, sys: 15.4 ms, total: 762 ms
Wall time: 762 ms


In [7]:
%%time

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex(
    nodes=nodes, 
    storage_context=storage_context
)

CPU times: user 135 ms, sys: 5.41 ms, total: 140 ms
Wall time: 4.24 s


In [8]:
%%time

retriever = index.as_retriever()
nodes = retriever.retrieve("固态电池是啥?")

print("\n\n".join(node.text for node in nodes))

而锂离子就像优秀的运动员，在摇椅的两端来回奔跑，在锂离子从正极到负极再到正极的运动过程中，电池的充放电过程便完成了。

[4]
2022年，我国动力电池技术创新能力不断提高，三元电池系统能量密度最大值提升至212Wh/kg，磷酸铁锂电池系统能量密度最大值提升至176.1Wh/kg；纯电动乘用车单车平均带电量提升至50.9kWh，续航400公里以上车型占比提升至70.7%。
CPU times: user 319 µs, sys: 7.32 ms, total: 7.64 ms
Wall time: 105 ms


In [9]:
len(nodes)

2

## 基于检索的查询

In [11]:
%%time

from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
streaming_response = query_engine.query("固态电池是啥?")
streaming_response.print_response_stream()

固态电池是一种使用固体电解质代替传统锂离子电池中使用的液体或凝胶状电解质的电池。这种技术旨在提高能量密度、安全性以及循环寿命，同时减少自放电和热失控的风险。尽管固态电池具有许多潜在优势，但其商业化应用仍面临挑战，包括成本高、制造复杂性和性能优化等问题。CPU times: user 244 ms, sys: 153 ms, total: 397 ms
Wall time: 4.44 s


## 混合检索，加入 BM25

In [12]:
%%time

# 下载停用词

# 设置 HTTP 代理环境变量
# https://github.com/nltk/nltk_data/issues/154#issuecomment-2144880495
http_proxy="http://192.168.0.134:7890"

import nltk
nltk.set_proxy(f'{http_proxy}')
nltk.download('stopwords')

CPU times: user 8.05 ms, sys: 2.94 ms, total: 11 ms
Wall time: 1.53 s


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
%%time
import jieba
from typing import List
from nltk.corpus import stopwords

def chinese_tokenizer(text: str) -> List[str]:
    # Use jieba to segment Chinese text
    return list(jieba.cut(text))
    # return list(jieba.lcut(text))

CPU times: user 56.1 ms, sys: 23.4 ms, total: 79.5 ms
Wall time: 79.6 ms


In [16]:
%%time

from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=128, chunk_overlap=10)

docstore = SimpleDocumentStore()
docstore.add_documents(splitter.get_nodes_from_documents(documents))

CPU times: user 5.89 ms, sys: 3.74 ms, total: 9.64 ms
Wall time: 8.73 ms


In [17]:
%%time

from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

retriever = QueryFusionRetriever(
    [
        index.as_retriever(similarity_top_k=2),
        BM25Retriever.from_defaults(
            # docstore=index.docstore, 
            docstore=docstore,
            similarity_top_k=2,
            tokenizer=chinese_tokenizer,
        ),
    ],
    num_queries=1,
    use_async=True,
)

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.423 seconds.
Prefix dict has been built successfully.


CPU times: user 409 ms, sys: 31.9 ms, total: 441 ms
Wall time: 436 ms


In [19]:
%%time

import nest_asyncio

nest_asyncio.apply()

CPU times: user 589 µs, sys: 0 ns, total: 589 µs
Wall time: 623 µs


In [20]:
%%time

nodes = retriever.retrieve("固态电池是啥?")
print("\n\n".join(node.text for node in nodes))

AttributeError: 'NoneType' object has no attribute 'search'