# Chinese support in LlamaIndex BM25Retriever

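Preliminary conclusions:

- It still does not work: the retriever cannot handle Chinese text.
- [[Bug]: BM25Retriever cannot work on chinese #13866](https://github.com/run-llama/llama_index/issues/13866)
- [Does BM25Retriever support Chinese?](https://www.51cto.com/aigc/1003.html): this article reports that it works, but my code, which is essentially the same as theirs, does not.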

## Setup

In [7]:
%%time
%%capture

%pip install rank_bm25
%pip install nltk jieba
%pip install llama-index-retrievers-bm25

CPU times: user 16 ms, sys: 27.9 ms, total: 43.8 ms
Wall time: 10.9 s


In [1]:
%%time

# Download the NLTK stopword lists

# Route the download through an HTTP proxy
# https://github.com/nltk/nltk_data/issues/154#issuecomment-2144880495
http_proxy="http://192.168.0.134:7890"

import nltk
nltk.set_proxy(f'{http_proxy}')
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...


CPU times: user 669 ms, sys: 186 ms, total: 855 ms
Wall time: 1.2 s


[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [13]:
%%time
import jieba
from typing import List
from nltk.corpus import stopwords

def chinese_tokenizer(text: str) -> List[str]:
    # Use jieba to segment Chinese text
    return list(jieba.cut(text))

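# Alternative: also drop stopwords (build the set once, not once per token)
# CHINESE_STOPWORDS = set(stopwords.words('chinese'))
# def chinese_tokenizer(text: str) -> List[str]:
#     return [token for token in jieba.lcut(text) if token not in CHINESE_STOPWORDS]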

CPU times: user 11 µs, sys: 1 µs, total: 12 µs
Wall time: 13.8 µs


## A quick test of BM25 with rank_bm25

In [14]:
%%time

from rank_bm25 import BM25Okapi

corpus = [
    "床前明月光",
    "疑是地上霜",
    "举头望明月",
    "低头思故乡",
]
tokenized_corpus = [chinese_tokenizer(doc) for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
query = "床前明月光"
tokenized_query = chinese_tokenizer(query)

doc_scores = bm25.get_scores(tokenized_query)
doc_scores

CPU times: user 339 µs, sys: 43 µs, total: 382 µs
Wall time: 406 µs


array([1.8621931, 0.       , 0.       , 0.       ])
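To see where 1.8621931 comes from, the score can be recomputed by hand. The sketch below is a plain-Python version of the Okapi BM25 formula with `k1=1.5` and `b=0.75`, which I understand to be the `rank_bm25` defaults; it leaves out the floor that `rank_bm25` applies to negative IDF values, which does not come into play here. The segmentation in `tokenized_corpus` is an assumption, not captured jieba output: it is one split that reproduces the score above, which requires the query to segment into two tokens unique to the first line and the corpus to average 2.5 tokens per line.

```python
import math
from typing import Dict, List


def bm25_okapi_scores(query: List[str], corpus: List[List[str]],
                      k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against a tokenized query (Okapi BM25)."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs

    # Document frequency: the number of documents each token occurs in.
    df: Dict[str, int] = {}
    for doc in corpus:
        for token in set(doc):
            df[token] = df.get(token, 0) + 1

    scores = []
    for doc in corpus:
        score = 0.0
        for token in query:
            if token not in df:
                continue  # token never seen in the corpus contributes nothing
            tf = doc.count(token)
            idf = math.log(n_docs - df[token] + 0.5) - math.log(df[token] + 0.5)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores


# ASSUMED segmentation, chosen to be consistent with the score above.
tokenized_corpus = [
    ["床前", "明月光"],
    ["疑是", "地上", "霜"],
    ["举头", "望", "明月"],
    ["低头", "思故乡"],
]
scores = bm25_okapi_scores(["床前", "明月光"], tokenized_corpus)
print(scores)  # first entry is about 1.8622, the rest are 0.0
```

Each query token has IDF ln(3.5 / 1.5) and a length-normalised term weight of 2.5 / 2.275, so the two tokens together give roughly 1.8622. Printing `chinese_tokenizer(doc)` for each line would confirm or rule out the assumed split.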

## LlamaIndex BM25Retriever

In [18]:
%%time

from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.response.notebook_utils import display_source_node

documents = [Document(text="床前明月光"),
             Document(text="疑是地上霜"),
             Document(text="举头望明月"),
             Document(text="低头思故乡")]

splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)

retriever = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=2,
    tokenizer=chinese_tokenizer
)

nodes = retriever.retrieve("故乡")
for node in nodes:
    display_source_node(node)

**Node ID:** cd615693-be30-43e1-a69d-41bf74295fe1<br>**Similarity:** 0.0<br>**Text:** 低头思故乡<br>

**Node ID:** 1cc4a96e-ba6c-4f28-bbd1-a04f438c21f0<br>**Similarity:** 0.0<br>**Text:** 举头望明月<br>

CPU times: user 5.01 ms, sys: 0 ns, total: 5.01 ms
Wall time: 4.23 ms
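A similarity of 0.0 on every node is consistent with two different situations: the query tokens do not overlap with any indexed token, or the retriever is not applying the custom tokenizer at all. One way to tell them apart is to check token overlap outside the retriever. The sketch below is a diagnostic under stated assumptions, not a fix: `matching_docs` and `char_bigrams` are hypothetical helpers, and `char_bigrams` (overlapping character bigrams) is a swapped-in stand-in tokenizer so the example runs without jieba.

```python
from typing import Callable, List, Sequence


def char_bigrams(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: overlapping character bigrams."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return chars
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def matching_docs(tokenizer: Callable[[str], List[str]],
                  query: str, docs: Sequence[str]) -> List[int]:
    """Return indices of documents sharing at least one token with the query."""
    query_tokens = set(tokenizer(query))
    return [i for i, doc in enumerate(docs)
            if query_tokens & set(tokenizer(doc))]


corpus = ["床前明月光", "疑是地上霜", "举头望明月", "低头思故乡"]
hits = matching_docs(char_bigrams, "故乡", corpus)
print(hits)  # indices of lines that share a token with the query
```

Passing `chinese_tokenizer` instead of `char_bigrams` runs the same check against the jieba segmentation. An empty result would point at segmentation (the query term is not a token of any line); a non-empty result alongside 0.0 similarities would point at the retriever.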
