In [1]:
text = """蔚来萤火虫即将交付，李斌所预测的两年内新能源车渗透率超80%能否实现？

9 月 6 日消息，近日，蔚来汽车创始人、董事长李斌在蔚来九周年内部讲话及财报电话会上透露了多项重要信息，其中最引人注目的莫过于第三品牌萤火虫将于 2025 年正式交付，并预测新能源汽车的市场渗透率将在未来两年内超过 80%。

据李斌介绍，蔚来汽车将形成三个品牌矩阵，覆盖从 14 万元到 80 万元的广阔市场区间。其中，第三品牌萤火虫作为蔚来汽车布局中低端市场的重要棋子，将于 2025 年正式交付。这一举措不仅丰富了蔚来的产品线，也进一步提升了其在新能源汽车市场的竞争力。

This chunker works by determining when to "break" apart sentences. This is done by looking for differences in embeddings between any two sentences. When that difference is past some threshold, then they are split.

There are a few ways to determine what that threshold is, which are controlled by the breakpoint_threshold_type kwarg."""

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Split into small, overlapping chunks; sizes are measured in characters via len()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=64, chunk_overlap=12, length_function=len
)

In [8]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

model_name = "../../bge-small-zh-v1.5"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}
hf_embedding = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [11]:
from langchain_core.documents import Document

In [17]:
# Split the raw text into chunks, wrap each chunk in a Document, and index them in FAISS
texts = text_splitter.split_text(text)
doc_texts = [Document(page_content=chunk, metadata={}) for chunk in texts]
vectorstore = FAISS.from_documents(doc_texts, hf_embedding)

In [16]:
from rank_bm25 import BM25Okapi

In [20]:
import jieba
# Tokenize each chunk with jieba; the BM25 corpus keeps the same order as `texts`
bm25 = BM25Okapi([jieba.lcut(chunk) for chunk in texts])

Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\13494\AppData\Local\Temp\jieba.cache
Loading model cost 1.473 seconds.
Prefix dict has been built successfully.


In [24]:
import numpy as np

In [49]:
# The original code fetched all_docs from the vectorstore, but the order it returns is
# problematic: it need not match the order of the BM25 corpus.
def fusion_retrieval(vectorstore, bm25, query: str, k: int = 5, alpha: float = 0.5):
    """Rank chunks by a weighted blend of vector-search and BM25 scores."""
    epsilon = 1e-8  # avoids division by zero when all scores are identical

    # Step 1: Use the chunk list that built the BM25 index, so positions line up
    # all_docs = vectorstore.similarity_search("", k=vectorstore.index.ntotal)
    all_docs = texts

    # Step 2: Perform BM25 search (higher score = better match)
    bm25_scores = bm25.get_scores(jieba.lcut(query))

    # Step 3: Perform vector search over every chunk (lower score = closer match)
    vector_results = vectorstore.similarity_search_with_score(query, k=len(all_docs))
    pagecontent2score = {doc.page_content: score for doc, score in vector_results}

    # Step 4: Min-max normalize both score sets; invert the vector scores so higher is better
    vector_scores = np.array([pagecontent2score[doc] for doc in all_docs])
    vector_scores = 1 - (vector_scores - np.min(vector_scores)) / (np.max(vector_scores) - np.min(vector_scores) + epsilon)
    bm25_scores = (bm25_scores - np.min(bm25_scores)) / (np.max(bm25_scores) - np.min(bm25_scores) + epsilon)

    # Step 5: Combine scores; alpha weights the vector side, (1 - alpha) the BM25 side
    combined_scores = alpha * vector_scores + (1 - alpha) * bm25_scores

    # Step 6: Rank documents from highest to lowest combined score
    sorted_indices = np.argsort(combined_scores)[::-1]

    # Step 7: Return top k documents
    return [all_docs[i] for i in sorted_indices[:k]]
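
The normalization and weighting steps inside fusion_retrieval can be checked in isolation, without an embedding model or a BM25 index. The following is a minimal sketch in plain Python, not the notebook's own code: the helper names `min_max` and `fuse` and the three sample scores are assumptions made up for illustration. It assumes, as the function above does, that lower vector scores mean closer matches and higher BM25 scores mean better matches.

```python
def min_max(scores):
    """Scale scores to [0, 1]; return all zeros when every score is equal."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(distances, bm25_scores, alpha=0.5):
    """Blend vector distances (lower is better) with BM25 scores (higher is better)."""
    # Invert the normalized distances so that higher always means more relevant.
    vector_part = [1.0 - d for d in min_max(distances)]
    keyword_part = min_max(bm25_scores)
    return [alpha * v + (1 - alpha) * b for v, b in zip(vector_part, keyword_part)]


# Hypothetical scores for three chunks (not taken from the run in this notebook).
distances = [1.0, 2.0, 3.0]
bm25_scores = [0.0, 3.0, 1.5]

combined = fuse(distances, bm25_scores, alpha=0.5)
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
print(combined)  # [0.5, 0.75, 0.25]
print(ranking)   # [1, 0, 2]
```

With alpha=0.5 the second chunk ranks first even though the first chunk is nearest in vector space, because its BM25 score is the highest. Raising alpha toward 1 shifts the ranking back toward pure vector similarity.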

In [51]:
query = "蔚来"

top_docs = fusion_retrieval(vectorstore, bm25, query, k=2, alpha=0.5)
# fusion_retrieval already returns the chunk strings, so they can be printed directly
print(top_docs)

['日消息，近日，蔚来汽车创始人、董事长李斌在蔚来九周年内部讲话及财报电话会上透露了多项重要信息，其中最引人注目的莫过于第三品牌萤', '据李斌介绍，蔚来汽车将形成三个品牌矩阵，覆盖从 14 万元到 80']
