# Testing the success rate of embedding-based retrieval

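Preliminary conclusions (hit rates from the run recorded below, 205 questions):

- At k=1 the hit rate is only 44%, clearly below the 72% obtained with Chinese prompts.
- At k=2 it is unchanged at 44%.
- At k=5 it reaches 62%.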

The starting point is worse than with Chinese prompts, but the document summaries do include questions and answers.

Overall, `DocumentSummaryIndex` is simple and easy to use, but its retrieval quality is mediocre: recall only reaches about 70% (at k=10 in this run).

## Preparation

In [1]:
%%time

INDEX_PATH="retrieve-index2"
DATA_PATH="retrieve-data2"

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 3.1 µs


In [2]:
%%time
!rm -rf $INDEX_PATH

CPU times: user 6.43 ms, sys: 428 µs, total: 6.86 ms
Wall time: 109 ms


In [20]:
%%time

# test_data=[
#     {
#         "url": "https://www.guancha.cn/ZiZheng/2024_06_27_739401_s.shtml",
#         "question": "和平共处五项原则为啥到现在还有重要意义？"
#     },
#     {
#         "url": "https://www.guancha.cn/kegongliliang/2024_06_27_739408_s.shtml",
#         "question": "丰鸟科技在内蒙古主场地的资产清点和撤出的具体原因是什么？"
#     },
#     {
#         "url": "https://user.guancha.cn/main/content?id=1257033",
#         "question": "电池工厂火灾的具体起因是什么？"
#     },
# ]

import json

file_path = './news-data-from-shen.json'
with open(file_path, 'r', encoding='utf-8') as file:
    test_data = json.load(file)

len(test_data)

CPU times: user 716 µs, sys: 10 µs, total: 726 µs
Wall time: 689 µs


205

In [4]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, force=True)

In [21]:
%%time

import requests
from gne import GeneralNewsExtractor

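def get_news_data(url):
    """Download a news page and extract its title, content, author, etc. with GNE."""
    # Send a mobile browser User-Agent with the request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 10; SM-G975F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Mobile Safari/537.36'
    }
    # A timeout keeps one unresponsive site from stalling the whole download loop.
    response = requests.get(url, headers=headers, timeout=30)

    # Force UTF-8 decoding for pages from this site.
    if 'finance.china.com.cn' in url:
        response.encoding = 'utf-8'

    html = response.text

    # Treat the comment list as noise so it is not extracted as article content.
    extractor = GeneralNewsExtractor()
    data = extractor.extract(html, noise_node_list=[
                               '//div[@class="comment-list"]'])
    data['url'] = url
    return data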

CPU times: user 9 µs, sys: 0 ns, total: 9 µs
Wall time: 10.5 µs


In [22]:
%%time

!rm -rf $DATA_PATH
!mkdir -p $DATA_PATH

import json

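for news in test_data:
    data = get_news_data(news['url'])  # data['url'] is already set by get_news_data
    # The title is used as the file name, so articles with the same title overwrite each other.
    file_path = f'./{DATA_PATH}/{data["title"]}.json'
    with open(file_path, 'w', encoding='utf-8') as json_file:
        # ensure_ascii=False keeps the Chinese text readable in the saved files.
        json.dump(data, json_file, indent=4, ensure_ascii=False)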

CPU times: user 13.4 s, sys: 135 ms, total: 13.6 s
Wall time: 51.7 s


In [23]:
%%time

import glob

# Directory to scan
directory = f'./{DATA_PATH}'

# Match every .json file with glob
json_files = glob.glob(f"{directory}/*.json")

# Print the number of files
print(f"Found {len(json_files)} .json files in {directory}.")


Found 200 .json files in ./retrieve-data2.
CPU times: user 0 ns, sys: 1.92 ms, total: 1.92 ms
Wall time: 1.56 ms
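
The directory holds 200 files although `test_data` has 205 entries. One plausible cause, not verified here, is that some articles share a title (or yield an empty one) and overwrite each other, because the title is the file name. A minimal sketch of a collision-free naming scheme follows. `safe_file_name` is a hypothetical helper, not part of this notebook; it prefixes a short hash of the URL and sanitizes the title.

```python
import hashlib
import re

def safe_file_name(url: str, title: str, max_title_len: int = 60) -> str:
    """Build a file name that is unique per URL and safe on common file systems."""
    # Short, stable digest of the URL: distinct URLs get distinct prefixes.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    # Replace path separators and other characters that are awkward in file names.
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title or "untitled").strip("_")
    return f"{digest}_{cleaned[:max_title_len]}.json"

# Two different URLs with the same title no longer collide.
a = safe_file_name("https://example.com/a", "same/title?")
b = safe_file_name("https://example.com/b", "same/title?")
print(a != b, a.endswith("_same_title.json"))
```

In the download loop this would replace `data["title"]` when building `file_path`; the title itself stays available inside the JSON payload.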


In [24]:
%%time

from llama_index.core import(
    Document
)

def data2doc(news_data):
    document=Document(text=news_data['content'], 
                  metadata={"title": news_data['title'],
                            'publish_time': news_data['publish_time'],
                            'author': news_data['author'],
                            'url': news_data['url'],
                            'images': news_data['images'],
                           })
    document.doc_id = document.metadata["title"]
    return document

CPU times: user 2.45 s, sys: 257 ms, total: 2.71 s
Wall time: 2.52 s


In [25]:
%%time

from llama_index.core import SimpleDirectoryReader

documents=SimpleDirectoryReader(input_dir=f"./{DATA_PATH}").load_data(num_workers=4)
for document in documents:
    document.doc_id=document.metadata['file_name']

CPU times: user 39.7 ms, sys: 57 µs, total: 39.7 ms
Wall time: 3.08 s


In [26]:
%%time

import json

docs=[]
for document in documents:
    news_data=json.loads(document.text)  # each document's text is the raw JSON file content
    docs.append(data2doc(news_data))

documents=docs

len(docs)

CPU times: user 12.3 ms, sys: 3.95 ms, total: 16.3 ms
Wall time: 15.4 ms


200

In [27]:
%%time

import nest_asyncio
nest_asyncio.apply()

CPU times: user 1.49 ms, sys: 24 µs, total: 1.52 ms
Wall time: 1.25 ms


In [28]:
%%time

# Load the LLM and the embedding model
%run ../utils2.py

from llama_index.core import Settings

Settings.llm=get_llm()
Settings.embed_model=get_embedding()

CPU times: user 664 ms, sys: 28 ms, total: 692 ms
Wall time: 692 ms


In [29]:
%%time

from llama_index.core import get_response_synthesizer
from llama_index.core import DocumentSummaryIndex
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize", use_async=True
)

doc_summary_index = DocumentSummaryIndex.from_documents(
    documents,
    transformations=[splitter],
    response_synthesizer=response_synthesizer,
    show_progress=True,
)

Parsing nodes:   0%|          | 0/200 [00:00<?, ?it/s]

Summarizing documents:   0%|          | 0/200 [00:00<?, ?it/s]

current doc id: 1至5月份规模以上工业企业利润保持增长
current doc id: 2023年度审计工作报告发布：中央财政赤字4.16万亿元 与调整后预算持平
current doc id: 2024世界人工智能大会将于7月4日开幕 这些亮点值得期待→
current doc id: 2024年“礼赞共和国创造新生活”现代科技馆体系联合行动 “智慧未来”主题科普活动在湖南长沙举办
current doc id: 7月1日起正式启动申报！2024年国家医保药品目录将迎来调整
current doc id: OpenAI究竟在砸谁的饭碗？
current doc id: OpenAI：推迟发布语音助手以保更佳处理用户要求
current doc id: “1+N”政策体系推进建设高质量资本市场
current doc id: “产能过剩”这道伪命题给经济学家整不会了
current doc id: “人工智能+”带来新变化
current doc id: “塞铁”跳反了？塞尔维亚“背刺”俄罗斯：承认已暗中向乌克兰提供8亿美元弹药！
current doc id: “威胁马科斯执政”，杜特尔特父子三人明年将竞选菲律宾参议员
current doc id: “季末回表” VS “存款搬家” 银行理财规模高增6月会否再现
current doc id: “屏蔽生”又上热搜！打击高考分数炒作，光是令行禁止还远远不够【弓道是非】
current doc id: “我真想从上海带两个机器人回去”——秘鲁总统点赞中国科技进步
current doc id: “旅游补贴”？以谣言为低价游引流的吃相太难看
current doc id: “类型片赛道拥挤难破局，《潜伏》深入人心后，再写假夫妻就很难”
current doc id: “至少23人死亡”，肯尼亚总统让步：拒绝签署财政法案，退回议会重审
current doc id: “防晒焦虑”背后隐藏什么？专家：过度防晒有害健康
current doc id: 《光伏产业专利发展年度报告（2024）》发布 知识产权生态建设稳步向前
current doc id: 【C财经】国家能源局：中国新能源产业不存在所谓的“产能过剩”问题
current doc id: 【世界说】荒谬！美国枪支销售商成为“种族战争”煽动者
cur

Generating embeddings:   0%|          | 0/200 [00:00<?, ?it/s]

CPU times: user 4.8 s, sys: 263 ms, total: 5.06 s
Wall time: 44min 40s


In [30]:
%%time

doc_summary_index.storage_context.persist(INDEX_PATH)

CPU times: user 646 ms, sys: 11.9 ms, total: 658 ms
Wall time: 658 ms


In [26]:
%%time

from llama_index.core import load_index_from_storage
from llama_index.core import StorageContext

# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir=INDEX_PATH)
doc_summary_index = load_index_from_storage(storage_context)

CPU times: user 103 ms, sys: 0 ns, total: 103 ms
Wall time: 102 ms
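
Building the index took about 44 minutes of wall time against roughly 5 seconds of CPU time, so it can be worth reusing a persisted index whenever one exists. A minimal sketch of such a guard follows. `load_or_build` and its two callables are hypothetical names; in this notebook `build` would wrap the `DocumentSummaryIndex.from_documents(...)` call plus `persist`, and `load` would wrap `load_index_from_storage`. The `rm -rf $INDEX_PATH` cell at the top would have to be skipped for the reuse to take effect.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")

def load_or_build(persist_dir: str, build: Callable[[str], T], load: Callable[[str], T]) -> T:
    """Reuse what is persisted in persist_dir if it is non-empty, else build it."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load(persist_dir)
    return build(persist_dir)

# Stand-in callables so the sketch runs without an LLM; they only record what happened.
calls = []

def fake_build(path: str) -> str:
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "index.txt"), "w", encoding="utf-8") as f:
        f.write("built")
    calls.append("build")
    return "built"

def fake_load(path: str) -> str:
    calls.append("load")
    with open(os.path.join(path, "index.txt"), encoding="utf-8") as f:
        return f.read()

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "retrieve-index-demo")
    first = load_or_build(target, fake_build, fake_load)   # directory missing: builds
    second = load_or_build(target, fake_build, fake_load)  # directory present: loads
print(calls)
```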


## Retrieval

### Quick test

In [31]:
%%time

from llama_index.core.indices.document_summary import (
    DocumentSummaryIndexEmbeddingRetriever,
)

retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    # similarity_top_k=1,
)

retrieved_nodes = retriever.retrieve(test_data[0]["question"])

CPU times: user 11.7 ms, sys: 0 ns, total: 11.7 ms
Wall time: 1.42 s


In [28]:
retrieved_nodes[0].metadata

{'title': '冯超：普京访越，透视出的国家相处之道应该是什么？',
 'publish_time': '2024-06-23 08:48:43',
 'author': '小婷',
 'url': 'https://www.guancha.cn/fengchao/2024_06_23_738933_s.shtml',
 'images': ['https://i.guancha.cn/news/dfic/2024/06/21/20240621150313664.jpg',
  'https://i.guancha.cn/news/dfic/2024/06/21/20240621150436659.jpg',
  'https://i.guancha.cn/shiping-banner.jpg']}

In [32]:
retrieved_nodes[0].metadata['url']==test_data[0]["url"]

False
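
The top hit here is a different article, which says nothing about how far down the expected one sits. A small sketch for checking the rank of the expected URL is shown below. `first_rank` is a hypothetical helper; it assumes the retriever returns nodes in descending relevance order and that each node carries the `url` metadata set in `data2doc`. Several nodes can come from the same document, so URLs are deduplicated before counting.

```python
from typing import Iterable, Optional

def first_rank(expected_url: str, retrieved_urls: Iterable[str]) -> Optional[int]:
    """Return the 1-based document rank of expected_url, or None if it is absent."""
    seen = []
    for url in retrieved_urls:
        if url not in seen:          # count each document once, keeping order
            seen.append(url)
        if url == expected_url:
            return seen.index(url) + 1
    return None

# Toy data: document "b" contributes two chunks before "c" appears.
urls = ["a", "b", "b", "c"]
print(first_rank("c", urls), first_rank("z", urls))  # 3 None
```

With the notebook's objects, the call would look like `first_rank(test_data[0]["url"], [n.metadata["url"] for n in retrieved_nodes])`.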

### k=1

In [33]:
%%time

retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    similarity_top_k=1,
)

results=[]

for news in test_data:
    retrieved_nodes = retriever.retrieve(news["question"])
    result=False
    for node in retrieved_nodes:
        # print(news["question"])
        if node.metadata['url']==news["url"]:
            result=True
            break
    results.append(result)

results

CPU times: user 1.97 s, sys: 22.8 ms, total: 1.99 s
Wall time: 20.2 s


[False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 True,
 True,
 False,
 True,
 False,
 True,
 False,
 False,
 False,
 True,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 True,
 True,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 True,
 True,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 True,
 False,
 True,
 True,
 True,
 False,
 False,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 False,
 False,
 True,
 True,
 False,
 True,
 False,
 False,
 False,
 True,
 False,
 False,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,

In [35]:
results.count(True) / len(results)

0.43902439024390244
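
The evaluation loop above is repeated verbatim for every k in the sections that follow. It could be folded into one function, sketched here with a stand-in retriever so that it runs on its own. `hit_rate` and `FakeRetriever` are hypothetical names; the only assumptions about the real retriever are the `retrieve(question)` method and the `metadata['url']` access already used in this notebook.

```python
from types import SimpleNamespace

def hit_rate(retriever, test_data):
    """Fraction of questions whose source URL appears among the retrieved nodes."""
    hits = 0
    for news in test_data:
        nodes = retriever.retrieve(news["question"])
        if any(node.metadata["url"] == news["url"] for node in nodes):
            hits += 1
    return hits / len(test_data)

class FakeRetriever:
    """Stand-in for the real retriever: looks answers up in a fixed table."""
    def __init__(self, table):
        self.table = table

    def retrieve(self, question):
        return [SimpleNamespace(metadata={"url": u}) for u in self.table.get(question, [])]

demo_data = [
    {"question": "q1", "url": "u1"},
    {"question": "q2", "url": "u2"},
]
demo_retriever = FakeRetriever({"q1": ["u1"], "q2": ["u9", "u8"]})
print(hit_rate(demo_retriever, demo_data))  # 0.5
```

With the real index, each per-k cell would reduce to `hit_rate(DocumentSummaryIndexEmbeddingRetriever(doc_summary_index, similarity_top_k=k), test_data)`.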

### k=2

In [34]:
%%time

retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    similarity_top_k=2,
)

results=[]

for news in test_data:
    retrieved_nodes = retriever.retrieve(news["question"])
    result=False
    for node in retrieved_nodes:
        if node.metadata['url']==news["url"]:
            result=True
            break
    results.append(result)

results.count(True) / len(results)

CPU times: user 2.01 s, sys: 38.8 ms, total: 2.05 s
Wall time: 20.3 s


0.43902439024390244

### k=3

In [36]:
%%time

retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    similarity_top_k=3,
)

results=[]

for news in test_data:
    retrieved_nodes = retriever.retrieve(news["question"])
    result=False
    for node in retrieved_nodes:
        if node.metadata['url']==news["url"]:
            result=True
            break
    results.append(result)

results.count(True) / len(results)

CPU times: user 2.13 s, sys: 31 ms, total: 2.16 s
Wall time: 20.3 s


0.5414634146341464

### k=5

In [37]:
%%time

retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    similarity_top_k=5,
)

results=[]

for news in test_data:
    retrieved_nodes = retriever.retrieve(news["question"])
    result=False
    for node in retrieved_nodes:
        if node.metadata['url']==news["url"]:
            result=True
            break
    results.append(result)

results.count(True) / len(results)

CPU times: user 2.23 s, sys: 26.7 ms, total: 2.25 s
Wall time: 20.5 s


0.624390243902439

### k=6

In [38]:
%%time

retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    similarity_top_k=6,
)

results=[]

for news in test_data:
    retrieved_nodes = retriever.retrieve(news["question"])
    result=False
    for node in retrieved_nodes:
        if node.metadata['url']==news["url"]:
            result=True
            break
    results.append(result)

results.count(True) / len(results)

CPU times: user 2.3 s, sys: 42.8 ms, total: 2.34 s
Wall time: 20.6 s


0.6390243902439025

### k=10

In [39]:
%%time

retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    similarity_top_k=10,
)

results=[]

for news in test_data:
    retrieved_nodes = retriever.retrieve(news["question"])
    result=False
    for node in retrieved_nodes:
        if node.metadata['url']==news["url"]:
            result=True
            break
    results.append(result)

results.count(True) / len(results)

CPU times: user 2.59 s, sys: 16.7 ms, total: 2.61 s
Wall time: 20.8 s


0.697560975609756

### k≈100 (half the test set)

In [41]:
%%time

retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    similarity_top_k=len(test_data) // 2,  # 205 // 2 = 102; plain "/" would pass the float 102.5
)

results=[]

for news in test_data:
    retrieved_nodes = retriever.retrieve(news["question"])
    result=False
    for node in retrieved_nodes:
        if node.metadata['url']==news["url"]:
            result=True
            break
    results.append(result)

results.count(True) / len(results)

CPU times: user 9.83 s, sys: 32.8 ms, total: 9.86 s
Wall time: 22.9 s


0.8975609756097561
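
The recorded hit rates can be put side by side and converted back to hit counts. The sketch below uses only the values printed above and the test-set size of 205. The standard error uses the usual binomial approximation sqrt(p(1-p)/n); it is added here as a rough guide to which differences may be within noise and is not part of the original analysis. The last run used half the test set as k, labeled 102 here.

```python
import math

N_QUESTIONS = 205  # len(test_data) in this notebook

# k -> hit rate, copied from the cell outputs above (the last k is about half the test set)
recorded = {
    1: 0.43902439024390244,
    2: 0.43902439024390244,
    3: 0.5414634146341464,
    5: 0.624390243902439,
    6: 0.6390243902439025,
    10: 0.697560975609756,
    102: 0.8975609756097561,
}

rows = []
for k, rate in recorded.items():
    hits = round(rate * N_QUESTIONS)                      # back to an integer count
    stderr = math.sqrt(rate * (1 - rate) / N_QUESTIONS)   # binomial standard error
    rows.append((k, hits, rate, stderr))

print(f"{'k':>4} {'hits':>5} {'rate':>7} {'s.e.':>6}")
for k, hits, rate, stderr in rows:
    print(f"{k:>4} {hits:>5} {rate:>7.1%} {stderr:>6.1%}")
```

Each figure carries roughly 2 to 3.5 percentage points of standard error, so the gap between k=5 and k=6 (128 vs 131 hits) is small. Since all runs share the same questions, a paired comparison would be the more precise check.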