# Loading Data

## Documents and Nodes

### Documents - Document

#### Creating an example document - Document.example()

In [3]:
%%time

from llama_index.core import Document

document = Document.example()
document

CPU times: user 103 µs, sys: 0 ns, total: 103 µs
Wall time: 95.6 µs


Document(id_='a2e90ccb-ef46-4697-ab0a-a5c552739253', embedding=None, metadata={'filename': 'README.md', 'category': 'codebase'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='\nContext\nLLMs are a phenomenal piece of technology for knowledge generation and reasoning.\nThey are pre-trained on large amounts of publicly available data.\nHow do we best augment LLMs with our own private data?\nWe need a comprehensive toolkit to help perform this data augmentation for LLMs.\n\nProposed Solution\nThat\'s where LlamaIndex comes in. LlamaIndex is a "data framework" to help\nyou build LLM  apps. It provides the following tools:\n\nOffers data connectors to ingest your existing data sources and data formats\n(APIs, PDFs, docs, SQL, etc.)\nProvides ways to structure your data (indices, graphs) so that this data can be\neasily used with LLMs.\nProvides an advanced retrieval/query interface over your data:\nFeed in any LLM input prompt, get back retrieved co

In [4]:
document.text

'\nContext\nLLMs are a phenomenal piece of technology for knowledge generation and reasoning.\nThey are pre-trained on large amounts of publicly available data.\nHow do we best augment LLMs with our own private data?\nWe need a comprehensive toolkit to help perform this data augmentation for LLMs.\n\nProposed Solution\nThat\'s where LlamaIndex comes in. LlamaIndex is a "data framework" to help\nyou build LLM  apps. It provides the following tools:\n\nOffers data connectors to ingest your existing data sources and data formats\n(APIs, PDFs, docs, SQL, etc.)\nProvides ways to structure your data (indices, graphs) so that this data can be\neasily used with LLMs.\nProvides an advanced retrieval/query interface over your data:\nFeed in any LLM input prompt, get back retrieved context and knowledge-augmented output.\nAllows easy integrations with your outer application framework\n(e.g. with LangChain, Flask, Docker, ChatGPT, anything else).\nLlamaIndex provides tools for both beginner users 

In [5]:
document.metadata

{'filename': 'README.md', 'category': 'codebase'}

#### Loading files from a directory - SimpleDirectoryReader

In [8]:
%%time

!ls ./data -hl

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
len(documents)

total 1.3M
-rw-r--r-- 1 root root  11K Jun 24 09:45 孔乙己.txt
-rw-r--r-- 1 root root 1.3M Jun 24 09:45 量化CTA风格因子跟踪-本周市场高位震荡为主，风格倾向低贝.pdf
CPU times: user 150 ms, sys: 7.85 ms, total: 158 ms
Wall time: 264 ms


15

In [9]:
documents[0].metadata

{'file_path': '/root/notebook/my-jupyter-notebook/llm/llama_index/basic/data/孔乙己.txt',
 'file_name': '孔乙己.txt',
 'file_type': 'text/plain',
 'file_size': 10245,
 'creation_date': '2024-06-24',
 'last_modified_date': '2024-06-24'}

In [10]:
documents[1].metadata

{'page_label': '1',
 'file_name': '量化CTA风格因子跟踪-本周市场高位震荡为主，风格倾向低贝.pdf',
 'file_path': '/root/notebook/my-jupyter-notebook/llm/llama_index/basic/data/量化CTA风格因子跟踪-本周市场高位震荡为主，风格倾向低贝.pdf',
 'file_type': 'application/pdf',
 'file_size': 1302105,
 'creation_date': '2024-06-24',
 'last_modified_date': '2024-06-24'}

In [11]:
documents[2].metadata

{'page_label': '2',
 'file_name': '量化CTA风格因子跟踪-本周市场高位震荡为主，风格倾向低贝.pdf',
 'file_path': '/root/notebook/my-jupyter-notebook/llm/llama_index/basic/data/量化CTA风格因子跟踪-本周市场高位震荡为主，风格倾向低贝.pdf',
 'file_type': 'application/pdf',
 'file_size': 1302105,
 'creation_date': '2024-06-24',
 'last_modified_date': '2024-06-24'}

#### Loading specific files - SimpleDirectoryReader

In [32]:
%%time

documents = SimpleDirectoryReader(input_files=['./data/孔乙己.txt']).load_data()
len(documents)

CPU times: user 1.26 ms, sys: 149 µs, total: 1.4 ms
Wall time: 1.3 ms


1

In [35]:
documents[0].text[:128]

'孔乙己⑴\n\n\n\n\u3000\u3000鲁镇的酒店的格局，是和别处不同的：都是当街一个曲尺形的大柜台，柜里面预备着热水，可以随时温酒。做工的人，傍午傍晚散了工，每每花四文铜钱，买一碗酒，——这是二十多年前的事，现在每碗要涨到十文，——靠柜外站着，热热的喝了休息；倘肯多花一文，'

#### Creating a Document object manually

In [28]:
%%time

content="""
天然石墨烯指的是从天然石墨中提取或加工出来的石墨烯。石墨烯是一种由单层碳原子组成的二维材料，具有独特的物理和化学性质，如极高的导电性、热导率、机械强度和透光性。
"""

document = Document(
    text=content,
    metadata={"filename": "./content.txt", "category": "材料学"},
)

document.metadata

CPU times: user 101 µs, sys: 0 ns, total: 101 µs
Wall time: 93.9 µs


{'filename': './content.txt', 'category': '材料学'}

#### Setting doc_id

In [13]:
%%time

document.doc_id = "天然石墨烯"

CPU times: user 51 µs, sys: 0 ns, total: 51 µs
Wall time: 52.7 µs


#### Customizing LLM metadata - excluded_llm_metadata_keys

In [30]:
from llama_index.core.schema import MetadataMode

print(document.get_content(metadata_mode=MetadataMode.LLM))

filename: ./content.txt
category: 材料学


天然石墨烯指的是从天然石墨中提取或加工出来的石墨烯。石墨烯是一种由单层碳原子组成的二维材料，具有独特的物理和化学性质，如极高的导电性、热导率、机械强度和透光性。


In [31]:
document.excluded_llm_metadata_keys = ["filename"]
print(document.get_content(metadata_mode=MetadataMode.LLM))

category: 材料学


天然石墨烯指的是从天然石墨中提取或加工出来的石墨烯。石墨烯是一种由单层碳原子组成的二维材料，具有独特的物理和化学性质，如极高的导电性、热导率、机械强度和透光性。


#### Customizing Embedding metadata - excluded_embed_metadata_keys

In [29]:
print(document.get_content(metadata_mode=MetadataMode.EMBED))

filename: ./content.txt
category: 材料学


天然石墨烯指的是从天然石墨中提取或加工出来的石墨烯。石墨烯是一种由单层碳原子组成的二维材料，具有独特的物理和化学性质，如极高的导电性、热导率、机械强度和透光性。


In [27]:
document.excluded_embed_metadata_keys = ["filename"]
print(document.get_content(metadata_mode=MetadataMode.EMBED))

category: 材料学


天然石墨烯指的是从天然石墨中提取或加工出来的石墨烯。石墨烯是一种由单层碳原子组成的二维材料，具有独特的物理和化学性质，如极高的导电性、热导率、机械强度和透光性。
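
The effect of the two exclusion lists can be pictured with a small stand-alone sketch. This is not LlamaIndex's implementation: `render_content` is a hypothetical helper that assumes the default templates visible in the node repr elsewhere in this notebook (one `key: value` line per metadata entry, a blank line, then the text).

```python
def render_content(text, metadata, excluded_keys=()):
    """Hypothetical helper: prepend the non-excluded metadata to the text.

    Assumes one 'key: value' line per entry, followed by a blank line and
    the text. An illustration only, not the library code.
    """
    # Keep only the metadata entries that this consumer is allowed to see.
    visible = {k: v for k, v in metadata.items() if k not in excluded_keys}
    metadata_str = "\n".join(f"{k}: {v}" for k, v in visible.items())
    return f"{metadata_str}\n\n{text}"


metadata = {"filename": "./content.txt", "category": "materials"}
text = "Graphene is a 2D material."

# View with "filename" hidden, as after setting an exclusion list.
llm_view = render_content(text, metadata, excluded_keys=["filename"])
# View with nothing excluded.
embed_view = render_content(text, metadata)

print(llm_view)
print(embed_view)
```

Keeping separate lists for the LLM and the embedding model lets a key such as the file name influence retrieval without being shown to the LLM, or the other way around.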


### Nodes - Node

#### Parsing nodes from documents

In [69]:
%%time
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader(input_files=['./data/孔乙己.txt']).load_data()
parser = SentenceSplitter(
    chunk_size=256,
    chunk_overlap=20
)
nodes = parser.get_nodes_from_documents(documents)
len(nodes)

CPU times: user 8.87 ms, sys: 92 µs, total: 8.97 ms
Wall time: 8.57 ms


21

In [70]:
nodes[0].text

'孔乙己⑴\n\n\n\n\u3000\u3000鲁镇的酒店的格局，是和别处不同的：都是当街一个曲尺形的大柜台，柜里面预备着热水，可以随时温酒。'

In [71]:
nodes[1].text

'做工的人，傍午傍晚散了工，每每花四文铜钱，买一碗酒，——这是二十多年前的事，现在每碗要涨到十文，——靠柜外站着，热热的喝了休息；倘肯多花一文，便可以买一碟盐煮笋，或者茴香豆，做下酒物了，如果出到十几文，那就能买一样荤菜，但这些顾客，多是短衣帮，大抵没有这样阔绰。只有穿长衫的，才踱进店面隔壁的房子里，要酒要菜，慢慢地坐喝。'
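
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch. It slides a fixed character window over the text, whereas `SentenceSplitter` measures size in tokens and prefers to break at sentence boundaries, so the real chunks above look different. `chunk_with_overlap` is a hypothetical helper, not part of LlamaIndex.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into character windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, which is the context the overlap is meant to preserve across chunk boundaries.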

#### Creating nodes manually

In [72]:
%%time

from llama_index.core.schema import TextNode, NodeRelationship, RelatedNodeInfo

node1 = TextNode(text="鲁镇的酒店的格局，是和别处不同的：都是当街一个曲尺形的大柜台，柜里面预备着热水，可以随时温酒。", id_="node_1")
node2 = TextNode(text="做工的人，傍午傍晚散了工，每每花四文铜钱，买一碗酒，——这是二十多年前的事，现在每碗要涨到十文，", id_="node_2")
# set relationships
node1.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(
    node_id=node2.node_id
)
node2.relationships[NodeRelationship.PREVIOUS] = RelatedNodeInfo(
    node_id=node1.node_id
)
nodes = [node1, node2]

node1

CPU times: user 112 µs, sys: 0 ns, total: 112 µs
Wall time: 114 µs


TextNode(id_='node_1', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='node_2', node_type=None, metadata={}, hash=None)}, text='鲁镇的酒店的格局，是和别处不同的：都是当街一个曲尺形的大柜台，柜里面预备着热水，可以随时温酒。', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')

#### Metadata extraction

##### Per-node summary extraction - SummaryExtractor

In [75]:
%%time

import nest_asyncio
nest_asyncio.apply()

# Load the LLM and embeddings
%run ../utils2.py

from llama_index.core import Settings

# Settings.llm=get_llm("gpt-3.5-turbo")
Settings.llm=get_llm()

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import (
    SummaryExtractor,
)

documents = SimpleDirectoryReader(input_files=['./data/孔乙己.txt']).load_data()

sentence_splitter = SentenceSplitter(
    chunk_size=256,
    chunk_overlap=20
)
summary_extractor=SummaryExtractor()


pipeline = IngestionPipeline(
    transformations=[sentence_splitter,summary_extractor]
)

nodes = pipeline.run(
    documents=documents,
    in_place=True,
    show_progress=True,
)



Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

100% 21/21 [00:46<00:00,  2.23s/it]

CPU times: user 221 ms, sys: 12.4 ms, total: 233 ms
Wall time: 46.8 s





In [83]:
nodes[10].metadata['section_summary']

'这段内容主要描述了以下几个关键点和实体：\n\n1. **文件路径**：数据目录下的"孔乙己.txt"文件。\n2. **文件名**："孔乙己.txt"。\n3. **文件类型**：纯文本（text/plain）。\n4. **文件大小**：10,245字节。\n5. **创建日期**：2024年6月24日。\n6. **最后修改日期**：与创建日期相同，为2024年6月24日。\n\n内容主要围绕着对"孔乙己"这一角色的描述展开。其中提到了：\n\n- 孔乙己表现出高兴的样子，并提及回字有四种不同的写法（可能是指汉字“回”字的多种书写形式）。\n- 描述了孔乙己与旁观者或参与者的互动，包括用酒和豆子作为奖励来吸引孩子的注意。\n- 孩子们在看到孔乙己时会围过来，并从他那里获得小礼物（可能是糖或其他糖果），这表明孔乙己可能以某种方式向孩子们分发甜食。\n\n整体上，这段内容通过描述孔乙己的行为和与周围环境的互动，展现了一个特定场景下的角色特征和社会关系。'

##### TitleExtractor across multiple nodes

With a single document the title stays the same for every node, even when `nodes=5` is set.

In [105]:
%%time

from llama_index.core.extractors import (
    TitleExtractor
)

documents = SimpleDirectoryReader(input_files=['./data/news.txt', './data/故乡.txt']).load_data()

title_extractor = TitleExtractor(nodes=5)

pipeline = IngestionPipeline(
    transformations=[sentence_splitter, title_extractor]
)

nodes = pipeline.run(
    documents=documents,
    in_place=True,
    show_progress=True,
)

Parsing nodes:   0%|          | 0/2 [00:00<?, ?it/s]

100% 5/5 [00:01<00:00,  3.47it/s]
100% 5/5 [01:01<00:00, 12.34s/it]


CPU times: user 172 ms, sys: 7.45 ms, total: 180 ms
Wall time: 1min 12s


In [108]:
nodes[0].metadata['document_title']

'"全球战略格局下的地缘政治博弈：南海问题、俄罗斯策略、乌克兰军事合作与美国外交应对"\n\nThis comprehensive title encapsulates the various themes discussed in your list:\n\n1. **南海问题与俄罗斯策略的共鸣** - This part focuses on how Russia might align its interests with those of Southeast Asian nations concerning territorial disputes in the South China Sea, possibly through diplomatic or strategic cooperation.\n\n2. **俄罗斯舰队访问古巴：军事合作与战略考量解析** - Here, we delve into the implications of Russian naval presence in Cuba, examining both military alliances and geopolitical strategies involved.\n\n3. **西方国家解除对乌克兰使用其提供的武器系统的地域限制以应对俄罗斯的领土扩张** - This segment discusses Western countries\' decisions to allow Ukraine to use provided weaponry beyond their borders as a response to Russia\'s aggressive actions.\n\n4. **美乌军事合作升级：武器自由使用与跨境打击行动** - It explores the enhanced military cooperation between the US and Ukraine, particularly in terms of unrestricted weapon usage across international boundaries for defensive purposes.\n\n5. **拜登关于乌克兰武器使用指导的外交论述解析** - T

In [109]:
nodes[44].metadata['document_title']

'"时光的印记：离别、变迁与重逢——从老屋到月下少年的记忆之旅"\n\n这个综合标题涵盖了所有提供的内容点：\n\n1. "时光的印记"：强调了故事中时间流逝的概念，以及随着时间推移而发生的变化。\n2. "离别、变迁与重逢"：概括了文本中的三个主要主题，即离开故乡、环境变化和再次相遇或团聚的情感体验。\n3. "从老屋到月下少年的记忆之旅"：具体描述了故事的起点（老屋）和终点（月下少年），以及贯穿其中的是对记忆的探索和情感的连结。\n\n这个标题既包含了对个人经历的深入探讨，也体现了对家庭、故乡和时间流逝等普遍主题的关注。'
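
One way to picture this behaviour is the sketch below. It assumes that a title is derived once per source document from at most `nodes` leading chunks and then shared by every node of that document. That assumption is consistent with the two different titles above, which presumably belong to the two input files, but it is not taken from the library source. `titles_per_document` and `fake_llm_title` are hypothetical names, and the LLM call is replaced by a stub.

```python
from collections import defaultdict


def fake_llm_title(texts):
    """Stub standing in for the LLM call that would summarise chunks into a title."""
    return "Title for: " + texts[0]


def titles_per_document(nodes, max_nodes=5, make_title=fake_llm_title):
    """nodes is a list of (ref_doc_id, text) pairs; returns one title per node."""
    by_doc = defaultdict(list)
    for ref_doc_id, text in nodes:
        # Only the first max_nodes chunks of each document feed its title.
        if len(by_doc[ref_doc_id]) < max_nodes:
            by_doc[ref_doc_id].append(text)
    doc_titles = {doc_id: make_title(texts) for doc_id, texts in by_doc.items()}
    # Every node inherits the title of the document it came from.
    return [doc_titles[ref_doc_id] for ref_doc_id, _ in nodes]


sample_nodes = [("news", "chunk A"), ("news", "chunk B"), ("story", "chunk C")]
titles = titles_per_document(sample_nodes, max_nodes=5)
print(titles)
```

Under this assumption, raising `nodes` changes how much text informs a title but never produces more than one title per document.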

##### Document Q&A extraction - QuestionsAnsweredExtractor

In [129]:
%%time

from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
)

sentence_splitter = SentenceSplitter(
    chunk_size=256,
    chunk_overlap=20
)

qa_extractor=QuestionsAnsweredExtractor(questions=3)

pipeline = IngestionPipeline(
    transformations=[sentence_splitter, qa_extractor]
)

documents = SimpleDirectoryReader(input_files=['./data/孔乙己.txt']).load_data()
excluded=["file_name",
          "file_path",
           "file_type","file_size",
           "creation_date",
           "last_modified_date"]

for document in documents:
    document.excluded_llm_metadata_keys =excluded
    document.excluded_embed_metadata_keys = excluded

nodes = pipeline.run(
    documents=documents,
    in_place=True,
    show_progress=True,
)

len(nodes)

Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

100% 19/19 [00:25<00:00,  1.32s/it]

CPU times: user 198 ms, sys: 8 ms, total: 206 ms
Wall time: 25 s





19

In [128]:
nodes[0].metadata['questions_this_excerpt_can_answer']

'1. **酒店的格局有何独特之处？** - 这段文本描述了鲁镇酒店的布局，特别指出都是当街设置一个曲尺形的大柜台，并且柜内备有热水供顾客温酒。这表明了当地酒店在空间设计上的特点和功能。\n\n2. **谁是酒店的主要顾客群体？** - 文本中提到的“短衣帮”指的是那些穿着非正式或廉价衣物的人，暗示他们可能是酒店的主要顾客群。这提供了关于鲁镇社会阶层划分的信息，以及人们根据着装选择消费场所的习惯。\n\n3. **酒的价格和购买力如何变化？** - 通过比较20多年前每碗酒只需四文铜钱与现在的十文，文本揭示了随着时间的推移，货币价值的变化和通货膨胀的情况。这不仅反映了经济变迁，还可能暗示了社会整体生活水平或消费习惯的演变。\n\n这些问题是基于对文本内容的深入理解而提出的，它们提供了关于鲁镇酒店布局、社会阶层以及经济状况的具体信息，这些信息在其他地方可能难以找到。'

##### Extracting node keywords - KeywordExtractor

In [150]:
%%time


from llama_index.core.extractors import (
    KeywordExtractor,
)

documents = SimpleDirectoryReader(input_files=['./data/孔乙己.txt']).load_data()

sentence_splitter = SentenceSplitter(
    chunk_size=256,
    chunk_overlap=20
)

keyword_extractor=KeywordExtractor(keywords=3)

pipeline = IngestionPipeline(
    transformations=[sentence_splitter, keyword_extractor]
)

nodes = pipeline.run(
    documents=documents,
    in_place=True,
    show_progress=True,
)

len(nodes)

Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

100% 21/21 [00:05<00:00,  3.84it/s]

CPU times: user 213 ms, sys: 4.76 ms, total: 218 ms
Wall time: 5.49 s





21

In [151]:
nodes[12].metadata['excerpt_keywords']

'喝酒的人, 打折了腿, 丁举人'

##### Extracting entities - EntityExtractor

No entities were extracted in this run.

In [158]:
%%time

from llama_index.extractors.entity import EntityExtractor
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader(input_files=['./data/news.txt']).load_data()

sentence_splitter = SentenceSplitter(
    chunk_size=256,
    chunk_overlap=20
)

entity_extractor = EntityExtractor(
    prediction_threshold=0.5,
    label_entities=True,  # include the entity label in the metadata (can be erroneous)
    device="cuda",  # use "cpu" if no GPU is available
)

pipeline = IngestionPipeline(
    transformations=[sentence_splitter, entity_extractor]
)

nodes = pipeline.run(
    documents=documents,
    in_place=True,
    show_progress=True,
)

len(nodes)

Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting entities:   0%|          | 0/22 [00:00<?, ?it/s]

CPU times: user 2.24 s, sys: 269 ms, total: 2.51 s
Wall time: 32.1 s


22

In [159]:
nodes[0]

TextNode(id_='2f477e8c-423c-4d14-a85e-58d724b33f1d', embedding=None, metadata={'file_path': 'data/news.txt', 'file_name': 'news.txt', 'file_type': 'text/plain', 'file_size': 11203, 'creation_date': '2024-06-24', 'last_modified_date': '2024-06-24'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='a55e6190-f1ba-460a-b15b-e0d029e52e1e', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': 'data/news.txt', 'file_name': 'news.txt', 'file_type': 'text/plain', 'file_size': 11203, 'creation_date': '2024-06-24', 'last_modified_date': '2024-06-24'}, hash='eb8d1c10826e39c832b5b063f74b9bd78eb2cfa313cbddcd0b2be6e1eb62391e'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='3185b0b9-89bb-480d-a461-3d7703132

##### Extracting augmented metadata - MarvinMetadataExtractor

See [Metadata Extraction and Augmentation w/ Marvin](https://docs.llamaindex.ai/en/stable/examples/metadata_extraction/MarvinMetadataExtractorDemo/)

This extractor depends on OpenAI and could not be tested successfully, so it is not used for now:

```marvin.settings.openai.api_key = os.environ["OPENAI_API_KEY"]```

##### Extracting Pydantic objects - PydanticProgramExtractor

Uses an LLM to extract a Pydantic object from a document.

References:

- [PydanticProgramExtractor](https://docs.llamaindex.ai/en/stable/api_reference/extractors/pydantic/)
- [Examples - Pydantic Extractor](https://docs.llamaindex.ai/en/stable/examples/metadata_extraction/PydanticExtractor/)

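## Loading local files - SimpleDirectoryReader

### Supported file types

See [Supported file types](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/#supported-file-types)

### Usage

### Loading from a directory

Loads only the files directly inside the given directory; subdirectories are not included.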

In [None]:
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_dir="path/to/directory")
documents = reader.load_data()
# documents = reader.load_data(num_workers=4)

### Loading subdirectories recursively

In [None]:
SimpleDirectoryReader(input_dir="path/to/directory", recursive=True)

### Iterating over files as they load

In [None]:
reader = SimpleDirectoryReader(input_dir="path/to/directory", recursive=True)
all_docs = []
for docs in reader.iter_data():
    # <do something with the documents per file>
    all_docs.extend(docs)

### Selecting or restricting the files to load

In [None]:
# Load specific files
SimpleDirectoryReader(input_files=["path/to/file1", "path/to/file2"])

# Load a directory, excluding specific files
SimpleDirectoryReader(
    input_dir="path/to/directory", exclude=["path/to/file1", "path/to/file2"]
)

# Load only files with the required extensions
SimpleDirectoryReader(
    input_dir="path/to/directory", required_exts=[".pdf", ".docx"]
)

# Limit the maximum number of files loaded
SimpleDirectoryReader(input_dir="path/to/directory", num_files_limit=100)

### Specifying the file encoding

In [None]:
SimpleDirectoryReader(input_dir="path/to/directory", encoding="latin-1")

### Metadata extraction function

In [None]:
def get_meta(file_path):
    return {"foo": "bar", "file_path": file_path}


SimpleDirectoryReader(input_dir="path/to/directory", file_metadata=get_meta)
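
A metadata function is just a callable that receives a file path and returns a dictionary. The sketch below shows a slightly richer version built only on the standard library. The key names are illustrative choices that mirror the default metadata shown earlier in this notebook, and `get_file_meta` is a hypothetical name.

```python
import os
import tempfile
from datetime import datetime, timezone


def get_file_meta(file_path):
    """Return a metadata dict for one file; the chosen keys are illustrative."""
    stat = os.stat(file_path)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "file_path": file_path,
        "file_name": os.path.basename(file_path),
        "file_size": stat.st_size,  # size in bytes
        "last_modified_date": modified.strftime("%Y-%m-%d"),
    }


# Try the function on a throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello")
meta = get_file_meta(tmp.name)
os.remove(tmp.name)
print(meta)
```

Such a function would be passed in the same way as `get_meta` above, that is as `file_metadata=get_file_meta`.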

### Extending to other file types

In [None]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core.readers.base import BaseReader
from llama_index.core import Document


class MyFileReader(BaseReader):
    def load_data(self, file, extra_info=None):
        with open(file, "r") as f:
            text = f.read()
        # load_data returns a list of Document objects
        return [Document(text=text + "Foobar", extra_info=extra_info or {})]


reader = SimpleDirectoryReader(
    input_dir="./data", file_extractor={".myfile": MyFileReader()}
)

documents = reader.load_data()
print(documents)

### External file systems

`SimpleDirectoryReader` accepts an optional `fs` parameter that can be used to traverse remote file systems.

In [None]:
from s3fs import S3FileSystem

s3_fs = S3FileSystem(key="...", secret="...")
bucket_name = "my-document-bucket"

reader = SimpleDirectoryReader(
    input_dir=bucket_name,
    fs=s3_fs,
    recursive=True,  # recursively searches all subdirectories
)

documents = reader.load_data()
print(documents)

## Data connectors

- [LlamaHub](https://llamahub.ai/?tab=readers)
- Most readers are integrated into SimpleDirectoryReader
- Many third-party readers are also available

### LlamaParse

In [None]:
%%time

# llama-parse is async-first, running the sync code in a notebook requires the use of nest_asyncio
import nest_asyncio

nest_asyncio.apply()

from llama_parse import LlamaParse

import os

os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."  # placeholder: never commit a real API key

documents = LlamaParse(result_type="markdown").load_data("./伟大的中国工业革命.pdf")

## Node parsers

### Standalone usage

In [165]:
%%time

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

node_parser = SentenceSplitter(chunk_size=100, chunk_overlap=20)
content="""
天然石墨烯指的是从天然石墨中提取或加工出来的石墨烯。石墨烯是一种由单层碳原子组成的二维材料，具有独特的物理和化学性质，如极高的导电性、热导率、机械强度和透光性。
"""

nodes = node_parser.get_nodes_from_documents(
    [Document(text=content)], show_progress=False
)

len(nodes)

CPU times: user 0 ns, sys: 1.78 ms, total: 1.78 ms
Wall time: 1.41 ms


2

In [167]:
nodes[1]

TextNode(id_='50d26deb-5264-45b4-8550-9f7d0faa4956', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='0d0e49d5-7f88-4f61-86a6-332e743307ea', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='02f2425037e1c58fdd9d9bf86bffa7b22a55f394a5868fd240df597f8ac5edd9'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='2c50f607-b7fe-43ca-bb7a-217913b6da2d', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='9edfce7b6c39880cb7b603111e683c027a33d953c3a06f5e8aed56ec821bb92b')}, text='石墨烯是一种由单层碳原子组成的二维材料，具有独特的物理和化学性质，如极高的导电性、热导率、机械强度和透光性。', start_char_idx=27, end_char_idx=81, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')

### Usage in a pipeline

In [None]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter

documents = SimpleDirectoryReader("./data").load_data()

pipeline = IngestionPipeline(transformations=[TokenTextSplitter(), ...])

nodes = pipeline.run(documents=documents)

### Usage with an index

In [None]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("./data").load_data()

# global
from llama_index.core import Settings

Settings.text_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)

# per-index
index = VectorStoreIndex.from_documents(
    documents,
    transformations=[SentenceSplitter(chunk_size=1024, chunk_overlap=20)],
)

### Node parser modules

#### File-based parsers

##### SimpleFileNodeParser

In [174]:
%%time


from llama_index.core.node_parser import SimpleFileNodeParser
from llama_index.readers.file import FlatReader
from pathlib import Path

md_docs = FlatReader().load_data(Path("./data/test.md"))

parser = SimpleFileNodeParser()
md_nodes = parser.get_nodes_from_documents(md_docs)

md_nodes[0].metadata

CPU times: user 527 µs, sys: 58 µs, total: 585 µs
Wall time: 577 µs


{'Header_1': 'test', 'filename': 'test.md', 'extension': '.md'}

##### MarkdownNodeParser

In [177]:
from llama_index.core.node_parser import MarkdownNodeParser

parser = MarkdownNodeParser()

nodes = parser.get_nodes_from_documents(md_docs)
nodes[0]

TextNode(id_='7afb2b21-e9c1-4886-adef-6aae88c493ab', embedding=None, metadata={'Header_1': 'test', 'filename': 'test.md', 'extension': '.md'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='2c5015c7-5a8e-465c-9e64-50912382a953', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'filename': 'test.md', 'extension': '.md'}, hash='9b13d8a4db6eb677f49e273c211d2b0392c6fa64161ec2e5e495b43bd6bce358')}, text='test\n\n这是一个测试：\n\n- 测试1\n- 测试2', start_char_idx=2, end_char_idx=28, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')

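##### JSONNodeParser

TODO: parsing did not work correctly here. The `data.json` text shown in the output below has no comma after the `name` entry, so the file is not valid JSON, which may explain the failure.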

In [181]:
from llama_index.core.node_parser import JSONNodeParser

md_docs = FlatReader().load_data(Path("./data/data.json"))

parser = JSONNodeParser()
nodes = parser.get_nodes_from_documents(md_docs)

nodes
# md_docs

[Document(id_='c73b9565-a9ae-48a1-bf53-f72b8e6195b7', embedding=None, metadata={'filename': 'data.json', 'extension': '.json'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='{\n  "name": "张三"\n  "data": ["data1", "data2"]\n}\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')]

##### HTMLNodeParser

In [None]:
from llama_index.core.node_parser import HTMLNodeParser

parser = HTMLNodeParser(tags=["p", "h1"])  # optional list of tags
nodes = parser.get_nodes_from_documents(html_docs)

#### Text splitters

##### CodeSplitter

In [None]:
from llama_index.core.node_parser import CodeSplitter

splitter = CodeSplitter(
    language="python",
    chunk_lines=40,  # lines per chunk
    chunk_lines_overlap=15,  # lines overlap between chunks
    max_chars=1500,  # max chars per chunk
)
nodes = splitter.get_nodes_from_documents(documents)


##### LangchainNodeParser

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from llama_index.core.node_parser import LangchainNodeParser

parser = LangchainNodeParser(RecursiveCharacterTextSplitter())
nodes = parser.get_nodes_from_documents(documents)

##### SentenceSplitter

In [None]:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)

##### SentenceWindowNodeParser

In [None]:
import nltk
from llama_index.core.node_parser import SentenceWindowNodeParser

node_parser = SentenceWindowNodeParser.from_defaults(
    # how many sentences on either side to capture
    window_size=3,
    # the metadata key that holds the window of surrounding sentences
    window_metadata_key="window",
    # the metadata key that holds the original sentence
    original_text_metadata_key="original_sentence",
)
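
The windowing idea described by the comments above can be sketched in plain Python. The sketch assumes the text has already been split into sentences and stores results in dictionaries whose keys mirror the `window_metadata_key` and `original_text_metadata_key` values from the cell above. `sentence_windows` is a hypothetical helper, not the parser's implementation.

```python
def sentence_windows(sentences, window_size=3):
    """Build one record per sentence holding the sentence and its neighbours."""
    records = []
    for i, sentence in enumerate(sentences):
        # Take up to window_size sentences on either side, clipped at the edges.
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        records.append(
            {
                "original_sentence": sentence,
                "window": " ".join(sentences[start:end]),
            }
        )
    return records


records = sentence_windows(["A.", "B.", "C.", "D.", "E."], window_size=1)
for record in records:
    print(record)
```

Retrieval can then match on the single sentence while the wider window is what gets handed to the LLM as context.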

##### SemanticSplitterNodeParser

In [None]:
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)

##### TokenTextSplitter

In [None]:
from llama_index.core.node_parser import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=1024,
    chunk_overlap=20,
    separator=" ",
)
nodes = splitter.get_nodes_from_documents(documents)

#### Relation-based parsers

In [None]:
from llama_index.core.node_parser import HierarchicalNodeParser

node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)
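
A rough mental model for `chunk_sizes=[2048, 512, 128]` is that each level re-splits the chunks of the level above and remembers which parent every smaller chunk came from. The sketch below illustrates that idea with character counts and tiny sizes. It is an assumption-laden illustration rather than the parser's actual algorithm, and `hierarchical_chunks` is a hypothetical helper.

```python
def hierarchical_chunks(text, chunk_sizes):
    """Split text level by level; each record stores the index of its parent chunk."""
    nodes = []

    def split(segment, level, parent_index):
        if level == len(chunk_sizes):
            return  # no finer level left
        size = chunk_sizes[level]
        for start in range(0, len(segment), size):
            piece = segment[start:start + size]
            nodes.append({"level": level, "text": piece, "parent": parent_index})
            # Re-split this piece at the next, smaller chunk size.
            split(piece, level + 1, len(nodes) - 1)

    split(text, 0, None)
    return nodes


tree = hierarchical_chunks("abcdefgh", chunk_sizes=[8, 4, 2])
leaves = [n["text"] for n in tree if n["level"] == 2]
print(leaves)
```

The parent links are what allow a retriever to start from small, precise chunks and swap in a larger parent when several of its children match.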

### Ingestion pipeline

#### Usage pattern

The pipeline produces nodes directly.

In [None]:
from llama_index.core import Document
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline, IngestionCache

# create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TitleExtractor(),
        OpenAIEmbedding(),
    ]
)

# run the pipeline
nodes = pipeline.run(documents=[Document.example()])

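#### Connecting to a vector database

When the pipeline is configured with a vector store, running it stores the resulting nodes in that store automatically. An index can then be built from the vector store.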

In [None]:
from llama_index.core import Document
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.vector_stores.qdrant import QdrantVectorStore

import qdrant_client

client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="test_store")

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TitleExtractor(),
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
)

# Ingest directly into a vector db
pipeline.run(documents=[Document.example()])

# Create your index
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(vector_store)

#### Computing embeddings in the pipeline

In [None]:
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TitleExtractor(),
        OpenAIEmbedding(),     # embedding is one stage of the pipeline
    ],
    vector_store=vector_store, # a vector store requires an embedding stage in the pipeline
)

# Ingest directly into a vector db
pipeline.run(documents=[Document.example()])

#### Caching

In [None]:
# Local cache

# save
pipeline.persist("./pipeline_storage")

# load and restore state
new_pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TitleExtractor(),
    ],
)
new_pipeline.load("./pipeline_storage")

# will run instantly due to the restored cache
nodes = new_pipeline.run(documents=[Document.example()])

# delete all contents of the cache
new_pipeline.cache.clear()

Several remote storage backends are supported for the cache:

- RedisCache
- MongoDBCache
- FirestoreCache
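
The caching idea, hashing each combination of node and transformation and reusing the stored result when the same combination appears again, can be sketched with a dictionary. `TransformCache` is a hypothetical toy class written for this illustration; the real cache and its backends have their own interfaces.

```python
import hashlib


class TransformCache:
    """Toy cache keyed by a hash of (transformation name, node text)."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def _key(self, name, text):
        return hashlib.sha256(f"{name}:{text}".encode("utf-8")).hexdigest()

    def run(self, name, func, text):
        """Apply func to text, or return the cached result for this pair."""
        key = self._key(name, text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = func(text)
        self._store[key] = result
        return result

    def clear(self):
        """Forget every cached result."""
        self._store.clear()


cache = TransformCache()
first = cache.run("upper", str.upper, "hello")   # computed
second = cache.run("upper", str.upper, "hello")  # served from the cache
print(first, second, cache.hits)
```

Because the key covers both the input and the transformation, changing either one leads to a recomputation instead of a stale result.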

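#### Document management

Attaching a `docstore` to the ingestion pipeline enables document management.

Using `document.doc_id` or `node.ref_doc_id` as the reference point, the ingestion pipeline actively looks for duplicate documents.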

In [None]:
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=[...], docstore=SimpleDocumentStore()
)
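
Duplicate detection of this kind can be pictured as a map from `doc_id` to a content hash: a document whose id and hash are both already known is skipped, while a new or changed one is processed again. The sketch below uses plain dictionaries in place of a docstore, and `select_documents_to_process` is a hypothetical helper rather than a LlamaIndex function.

```python
import hashlib


def select_documents_to_process(documents, seen_hashes):
    """Return the doc_ids that are new or changed; seen_hashes is updated in place.

    documents maps doc_id -> text, seen_hashes maps doc_id -> content hash.
    """
    to_process = []
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if seen_hashes.get(doc_id) == digest:
            continue  # same id and same content: a duplicate, so skip it
        seen_hashes[doc_id] = digest
        to_process.append(doc_id)
    return to_process


seen = {}
first_run = select_documents_to_process({"doc1": "v1", "doc2": "v1"}, seen)
second_run = select_documents_to_process({"doc1": "v1", "doc2": "v2"}, seen)
print(first_run, second_run)
```

This is also why setting a stable `doc_id`, as done earlier in this notebook, matters: with random ids every run would look like a set of new documents.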

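#### Parallel processing

The `run` method of `IngestionPipeline` can be executed with parallel processes. It does so by using `multiprocessing.Pool` to distribute batches of nodes across processors.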
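
The batching step behind this can be sketched as follows. The sketch covers only how nodes might be divided among workers; handing each batch to a `multiprocessing.Pool` worker is left out so the example stays short. `make_batches` is a hypothetical helper and the exact batch sizes used by the library may differ.

```python
import math


def make_batches(nodes, num_workers):
    """Split nodes into at most num_workers contiguous batches of similar size."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    batch_size = max(1, math.ceil(len(nodes) / num_workers))
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


batches = make_batches(list(range(10)), num_workers=4)
print(batches)
```

In the pipeline itself, parallelism is requested by passing `num_workers` to `run`, for example `pipeline.run(documents=documents, num_workers=4)`.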

#### Transformations

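A transformation is an operation that takes a list of nodes as input and returns a list of nodes.

The following components are Transformation objects:

- TextSplitter
- NodeParser
- MetadataExtractor
- Embeddings models

Transformations are usually used with a pipeline, but they can also be called directly: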

In [None]:
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor

node_parser = SentenceSplitter(chunk_size=512)
extractor = TitleExtractor()

# use transforms directly
nodes = node_parser(documents)

# or use a transformation in async
nodes = await extractor.acall(nodes)
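
Because every transformation maps a list of nodes to a list of nodes, transformations compose by simple chaining. The sketch below shows that contract with plain strings standing in for nodes. The function names are hypothetical, and real transformations operate on node objects rather than strings.

```python
def strip_whitespace(nodes):
    """Transformation: trim surrounding whitespace from every node."""
    return [node.strip() for node in nodes]


def drop_empty(nodes):
    """Transformation: remove nodes that ended up empty."""
    return [node for node in nodes if node]


def run_transformations(nodes, transformations):
    """Feed the output of each transformation into the next one."""
    for transform in transformations:
        nodes = transform(nodes)
    return nodes


result = run_transformations(["  a ", "", " b"], [strip_whitespace, drop_empty])
print(result)
```

A transformation may return more or fewer nodes than it received, which is how a splitter can sit in the same chain as an extractor.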

Transformations can also be combined with an index: pass them to `from_documents()` or set them globally in `Settings`.

In [None]:
from llama_index.core import VectorStoreIndex
from llama_index.core.extractors import (
    TitleExtractor,
    QuestionsAnsweredExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter

transformations = [
    TokenTextSplitter(chunk_size=512, chunk_overlap=128),
    TitleExtractor(nodes=5),
    QuestionsAnsweredExtractor(questions=3),
]

# global
from llama_index.core import Settings

Settings.transformations = transformations

# per-index
index = VectorStoreIndex.from_documents(
    documents, transformations=transformations
)