# Document/Nodes

- Document: 
   
   A Document is a generic container around any data source, for instance a PDF, an API output, or data retrieved from a database. Documents can be constructed manually or created automatically via LlamaIndex data loaders.

- Node:
   
   A Node represents a "chunk" of a source Document, whether that is a text chunk, an image, or something else.

## Document

### Using the built-in example

In [1]:
from llama_index.core import Document, VectorStoreIndex
from IPython.display import display, Markdown

In [2]:
example = Document.example()
example

Document(id_='9163241c-dea3-4312-80c7-7f73e1f3e2ac', embedding=None, metadata={'filename': 'README.md', 'category': 'codebase'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='\nContext\nLLMs are a phenomenal piece of technology for knowledge generation and reasoning.\nThey are pre-trained on large amounts of publicly available data.\nHow do we best augment LLMs with our own private data?\nWe need a comprehensive toolkit to help perform this data augmentation for LLMs.\n\nProposed Solution\nThat\'s where LlamaIndex comes in. LlamaIndex is a "data framework" to help\nyou build LLM  apps. It provides the following tools:\n\nOffers data connectors to ingest your existing data sources and data formats\n(APIs, PDFs, docs, SQL, etc.)\nProvides ways to structure your data (indices, graphs) so that this data can be\neasily used with LLMs.

In [3]:
content = example.get_content()
content

'Context\nLLMs are a phenomenal piece of technology for knowledge generation and reasoning.\nThey are pre-trained on large amounts of publicly available data.\nHow do we best augment LLMs with our own private data?\nWe need a comprehensive toolkit to help perform this data augmentation for LLMs.\n\nProposed Solution\nThat\'s where LlamaIndex comes in. LlamaIndex is a "data framework" to help\nyou build LLM  apps. It provides the following tools:\n\nOffers data connectors to ingest your existing data sources and data formats\n(APIs, PDFs, docs, SQL, etc.)\nProvides ways to structure your data (indices, graphs) so that this data can be\neasily used with LLMs.\nProvides an advanced retrieval/query interface over your data:\nFeed in any LLM input prompt, get back retrieved context and knowledge-augmented output.\nAllows easy integrations with your outer application framework\n(e.g. with LangChain, Flask, Docker, ChatGPT, anything else).\nLlamaIndex provides tools for both beginner users an

In [4]:
display(Markdown(content))

Context
LLMs are a phenomenal piece of technology for knowledge generation and reasoning.
They are pre-trained on large amounts of publicly available data.
How do we best augment LLMs with our own private data?
We need a comprehensive toolkit to help perform this data augmentation for LLMs.

Proposed Solution
That's where LlamaIndex comes in. LlamaIndex is a "data framework" to help
you build LLM  apps. It provides the following tools:

Offers data connectors to ingest your existing data sources and data formats
(APIs, PDFs, docs, SQL, etc.)
Provides ways to structure your data (indices, graphs) so that this data can be
easily used with LLMs.
Provides an advanced retrieval/query interface over your data:
Feed in any LLM input prompt, get back retrieved context and knowledge-augmented output.
Allows easy integrations with your outer application framework
(e.g. with LangChain, Flask, Docker, ChatGPT, anything else).
LlamaIndex provides tools for both beginner users and advanced users.
Our high-level API allows beginner users to use LlamaIndex to ingest and
query their data in 5 lines of code. Our lower-level APIs allow advanced users to
customize and extend any module (data connectors, indices, retrievers, query engines,
reranking modules), to fit their needs.

### Manually create documents

In [5]:
documents = [
    Document(text="庭院深深深幾許", metadata={"file_name": "mathbook.txt"}),
    Document(text="哈囉你好嗎", metadata={"file_name": "song.md"})
]

In [6]:
documents[0]

Document(id_='98e0f3e6-30ac-412e-bb28-68d535b0a7c6', embedding=None, metadata={'file_name': 'mathbook.txt'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='庭院深深深幾許', mimetype=None, path=None, url=None), image_resource=None, audio_resource=None, video_resource=None, text_template='{metadata_str}\n\n{content}')

In [7]:
documents[0].get_content()  # By default, `get_content` does not include metadata

'庭院深深深幾許'

### How to use metadata?

In [8]:
# Get the content with metadata
from llama_index.core.schema import MetadataMode
documents[0].get_content(metadata_mode=MetadataMode.ALL)

'file_name: mathbook.txt\n\n庭院深深深幾許'

In [9]:
# Exclude metadata keys from what both the embedding model and the LLM see
documents[0].excluded_embed_metadata_keys = ["file_name"]
documents[0].excluded_llm_metadata_keys = ["file_name"]

# Check the content seen by the embedding model and by the LLM
print(documents[0].get_content(metadata_mode=MetadataMode.EMBED))
print(documents[0].get_content(metadata_mode=MetadataMode.LLM))

庭院深深深幾許
庭院深深深幾許
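
The outputs above can be reproduced approximately with plain string formatting. The following is a minimal sketch, not the library's implementation: the three templates are copied from the `Document` repr printed earlier, while `render_content` and its `excluded_keys` parameter are hypothetical names introduced only for illustration.

```python
# Templates copied from the Document repr shown earlier in this notebook.
METADATA_TEMPLATE = "{key}: {value}"
METADATA_SEPARATOR = "\n"
TEXT_TEMPLATE = "{metadata_str}\n\n{content}"


def render_content(text, metadata, excluded_keys=()):
    """Approximate how text and metadata could be combined (hypothetical helper)."""
    visible = {k: v for k, v in metadata.items() if k not in excluded_keys}
    if not visible:
        # With no visible metadata, only the raw text is returned.
        return text
    metadata_str = METADATA_SEPARATOR.join(
        METADATA_TEMPLATE.format(key=k, value=v) for k, v in visible.items()
    )
    return TEXT_TEMPLATE.format(metadata_str=metadata_str, content=text)


metadata = {"file_name": "mathbook.txt"}
all_view = render_content("庭院深深深幾許", metadata)
embed_view = render_content("庭院深深深幾許", metadata, excluded_keys=["file_name"])
print(repr(all_view))
print(repr(embed_view))
```

Under these assumptions, `all_view` matches the `MetadataMode.ALL` output and `embed_view` matches the output printed once `file_name` is excluded.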


## Node

### Node Parser (JSON data)

In [10]:
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(input_dir="../data/json/").load_data()

In [11]:
len(documents)

1

In [12]:
from llama_index.core.node_parser import JSONNodeParser
json_parser = JSONNodeParser()
nodes = json_parser.get_nodes_from_documents(documents)

In [13]:
len(nodes)

10

In [14]:
nodes[0].get_content()

'title 【台灣陷入尷尬處境】半導體耗電量狂飆又得淨零轉型，核電會是最佳解？\ncontent 台灣企業滿足了全球高達 68% 的晶片製造供應，隨之而來的是大量能源需求——根據綠色和平組織的預測，2030 年台灣半導體製造產業之耗電量，將會相當於 2021 年紐西蘭全島的耗電量的兩倍，其中將有 82% 的需求來自台積電。\n台灣為何陷入能源危機、與淨零目標遙遙無期？\nAI 時代用電激增，台灣卻在此時面臨牽涉到國安、氣候、政治挑戰的能源困境：台灣全島有高達 9 成的用電量靠的是進口石化燃料；兩岸情勢也持續緊繃，台灣必須面對來自中國的經濟封鎖、國際孤立，甚或武力入侵威脅；而出於政治考量，執政黨主張在 2025 年前告別核電，打造「非核家園」，並在同一年達成「燃煤發電 30%、天然氣發電 50%、再生能源 20%」的能源結構目標。\n再者，政府還訂立野心勃勃的潔淨能源目標，包含遵循巴黎協議，2030 年台灣將減碳 23%，並在 2050 年前達成淨零碳排；私部門方面，包含台積電在內的多家大廠，都簽署了 RE100 全球再生能源倡議，承諾於 2050 年前採用 100％ 綠電。\n目前看來，沒有任何一絲達標跡象，現實和理想之間尚存在著難以跨越的鴻溝。\n台灣現行能源結構十分脆弱，極度仰賴進口\n根據經濟部統計，去年台灣有 83% 的用電需求仰賴石化燃料，其中煤炭發電佔 42%、天然氣佔 40%、石油佔 1%，另外核能佔 6%，而太陽能、風力、水力、生質能發電加總起來也才佔 10%。\n這樣的能源供應系統極不穩定，因為燃料進口隨時面臨到國際價格波動或中國封鎖的風險；即便台灣政府能出手調整電價，卻也造成台電債台高築，而一旦中國海軍封鎖台灣海峽，台灣島約只有 6 週的煤炭儲備量，以及約 1 週的液化天然氣儲備量。\n縱使上述風險都尚未發生，今年台灣發電的備轉容量已多次掉到 5%（理想的系統備轉容量為 25%），且在過去 8 年內，就發生過 4 次大規模跳電，限電情況也不少見，在在顯示出台灣供電困境亟待解決。\n再生能源為何進展緩慢？專家警告外資出走可能性\n由於台灣國土面積小且山地多，太陽能發電面臨土地取得不易的限制。離岸風電雖大有潛力，但近年受到政府欲扶植在地產業發展的政策限制，只能運用 MIT 產品和聘用台灣勞工，礙於技術落差，造成不必要的成本支出與工程延宕；另一方面，
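
The node content above starts with `title ...` followed by `content ...`, and one document produced ten nodes, which suggests that each JSON object becomes one node with a "key value" line per field. The sketch below approximates that behaviour with the standard library only. `flatten_json` is a hypothetical helper and the sample data is made up, since the notebook's actual JSON file is not shown.

```python
import json


def flatten_json(value, path=()):
    """Yield one 'key path + value' line per leaf (hypothetical helper)."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten_json(child, path + (str(key),))
    elif isinstance(value, list):
        for child in value:
            yield from flatten_json(child, path)
    else:
        yield " ".join(path + (str(value),))


# Made-up sample data standing in for the real file.
raw = '[{"title": "Post A", "content": "Body A"}, {"title": "Post B", "content": "Body B"}]'
records = json.loads(raw)

# One chunk per top-level object, one line per field.
chunks = ["\n".join(flatten_json(record)) for record in records]
print(len(chunks))
print(chunks[0])
```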

### Node Parser (HTML data)

In [15]:
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(input_dir="../data/html/").load_data()

In [16]:
%pip install -q bs4


In [17]:
from llama_index.core.node_parser import HTMLNodeParser
html_parser = HTMLNodeParser()
# DEFAULT_TAGS = ["p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "b", "i", "u", "section"]
nodes = html_parser.get_nodes_from_documents(documents)
len(nodes)

3

In [18]:
for each in nodes:
    print(each.get_content())

Welcome to My Webpage
This is a simple HTML example with some basic styling.
Visit
Example.com
to
      learn more.
HTML
CSS
JavaScript
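
The three nodes printed above are consistent with text being collected per tag and consecutive elements with the same tag being merged into one node, although that is an inference from the output rather than something the notebook states. A rough standard-library sketch of that idea follows. `TagTextCollector`, `html_to_chunks`, and the sample page are hypothetical, and nested tracked tags are not handled.

```python
from html.parser import HTMLParser
from itertools import groupby


class TagTextCollector(HTMLParser):
    """Collect text found inside a chosen set of tags (hypothetical helper)."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.open_tags = []  # tracked tags that are currently open
        self.pieces = []     # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.open_tags.append(tag)
            self.pieces.append((tag, ""))

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Text outside tracked tags is ignored.
        if self.open_tags:
            tag, text = self.pieces[-1]
            self.pieces[-1] = (tag, text + data)


def html_to_chunks(html, tags=("p", "h1", "li")):
    collector = TagTextCollector(tags)
    collector.feed(html)
    collector.close()
    chunks = []
    # Merge runs of consecutive pieces that share a tag into one chunk.
    for tag, run in groupby(collector.pieces, key=lambda piece: piece[0]):
        texts = [text.strip() for _, text in run if text.strip()]
        if texts:
            chunks.append((tag, "\n".join(texts)))
    return chunks


# Made-up page standing in for the real file.
page = "<h1>Welcome</h1><p>Intro</p><p>More</p><ul><li>HTML</li><li>CSS</li></ul>"
chunks = html_to_chunks(page)
print(chunks)
```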


### Node Parser (Markdown data)

In [19]:
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(input_dir="../data/markdown/").load_data()

In [20]:
from llama_index.core.node_parser import MarkdownNodeParser
md_parser = MarkdownNodeParser()
nodes = md_parser.get_nodes_from_documents(documents)
print(len(nodes))

1


In [21]:
nodes[0].get_content()

'🐍 Python 簡單語法筆記\n\n1. 變數與資料型別\n\n   Python 是一個動態型別語言，無需明確聲明變數型別。\n\n```python\n# 整數\nx = 10\n\n# 浮點數\ny = 3.14\n\n# 字串\nname = "Alice"\n\n# 布林值\nis_active = True\n```\n\n2. 基本運算\n\n   Python 支援常見的數學運算，並使用簡單的符號來表示。\n\n```python\n# 加法\nresult = 5 + 3 # 8\n\n# 減法\nresult = 10 - 7 # 3\n\n# 乘法\nresult = 4 \\* 2 # 8\n\n# 除法\nresult = 10 / 2 # 5.0 (浮點數結果)\n\n# 次方\nresult = 2 \\*\\* 3 # 8\n```\n\n3. 條件判斷\n\n   使用 if、elif 和 else 來進行條件判斷。\n\n```python\nage = 18\n\nif age >= 18:\nprint("成人")\nelif age > 12:\nprint("青少年")\nelse:\nprint("兒童")\n```\n\n4. 迴圈\n\n   Python 有兩種主要的迴圈：for 和 while。\n\n- for 迴圈\n\n  ```python\n  # 列印 1 到 5\n  for i in range(1, 6):\n  print(i)\n  ```\n\n- while 迴圈\n\n  ```python\n  # 列印直到條件為假\n  count = 0\n  while count < 5:\n  print(count)\n  count += 1\n  ```\n\n5. 函式\n\n使用 def 關鍵字來定義函式\n\n```python\ndef greet(name):\nreturn f"Hello, {name}!"\n\n# 呼叫函式\nmessage = greet("Alice")\nprint(message) # Hello, Alice!\n```\n\n6. 清單與迭代\n\n   清單是 Python 中的基本資料結構，可以存放多個元素。\n\n```pyth
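
The content above contains many `#` lines inside fenced code blocks (Python comments). Any splitter that cuts Markdown at headings has to skip those, otherwise every comment would start a new section. The sketch below shows one way to do that with the standard library. `split_by_headings` is a hypothetical helper and is not taken from `MarkdownNodeParser`.

```python
FENCE = "`" * 3  # a Markdown code fence marker


def split_by_headings(markdown_text):
    """Split Markdown into sections at heading lines (hypothetical helper)."""
    sections, current, in_fence = [], [], False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            # Entering or leaving a fenced code block.
            in_fence = not in_fence
        is_heading = line.startswith("#") and not in_fence
        if is_heading and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [section for section in sections if section]


# Made-up sample: the '#' inside the fence is a Python comment, not a heading.
sample = "\n".join([
    "# Variables",
    FENCE + "python",
    "# an integer",
    "x = 10",
    FENCE,
    "# Loops",
    "Use for and while.",
])
sections = split_by_headings(sample)
print(len(sections))
```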

### Node Parser (SimpleFileNodeParser, regardless of file type)

In [22]:
documents = [
    *SimpleDirectoryReader(input_dir="../data/json/").load_data(),
    *SimpleDirectoryReader(input_dir="../data/html/").load_data(),
    *SimpleDirectoryReader(input_dir="../data/markdown/").load_data()
]

In [23]:
len(documents)

3

In [24]:
from llama_index.core.node_parser import SimpleFileNodeParser
parser = SimpleFileNodeParser()
nodes = parser.get_nodes_from_documents(documents)

In [25]:
len(nodes)

3
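
A parser that works regardless of file type presumably picks a format-specific parser per file. The sketch below illustrates that dispatch pattern under the assumption that the choice is made from the file extension. Every name in it (`parse_markdown`, `parse_plain`, `PARSERS_BY_EXTENSION`, `parse_file`) is hypothetical and only stands in for real parsers.

```python
from pathlib import Path


# Hypothetical stand-ins for format-specific parsers; each returns a list of chunks.
def parse_markdown(text):
    return [part for part in text.split("\n\n") if part]


def parse_plain(text):
    return [text]


PARSERS_BY_EXTENSION = {".md": parse_markdown}


def parse_file(file_name, text):
    """Pick a parser from the file extension, falling back to one chunk per file."""
    parser = PARSERS_BY_EXTENSION.get(Path(file_name).suffix.lower(), parse_plain)
    return parser(text)


md_chunks = parse_file("notes.md", "Intro\n\nDetails")
txt_chunks = parse_file("notes.txt", "Intro\n\nDetails")
print(md_chunks)
print(txt_chunks)
```

Files with an unknown extension fall back to a single chunk here; whether the real parser does the same is not shown in the notebook.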

### Splitter (SentenceSplitter)

In [26]:
documents = [
    Document(text="在今年備受矚目的 NBA 總冠軍賽中，洛杉磯湖人隊以驚險的表現擊敗了波士頓塞爾提克隊，\
        成功捧起隊史第 18 座總冠軍獎杯，刷新聯盟歷史記錄。系列賽中，LeBron James 和 Anthony Davis \
        持續展現頂尖表現，尤其在關鍵的第七場比賽中，LeBron 在最後三秒投中致勝球，成為球隊奪冠的最大功臣。\
        儘管塞爾提克隊的年輕核心 Jayson Tatum 和 Jaylen Brown 表現出色，但最終無法逆轉湖人隊的\
        全面優勢。這場史詩般的對決不僅展現了雙方球員的技術與韌性，也為球迷留下了無數經典瞬間，\
        成為 NBA 歷史上的又一里程碑")
]

In [27]:
from llama_index.core.node_parser import SentenceSplitter
# The default chunk_size is 1024 and the default chunk_overlap is 200
# Note that chunk_overlap cannot be larger than chunk_size
splitter = SentenceSplitter(chunk_size=100, chunk_overlap=20)
nodes = splitter.get_nodes_from_documents(documents)

In [28]:
len(nodes)

5

In [29]:
nodes[0].get_content()

'在今年備受矚目的 NBA 總冠軍賽中，洛杉磯湖人隊以驚險的表現擊敗了波士頓塞爾提克隊，        成功捧起隊史第 18'
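
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-window sketch. It is not how `SentenceSplitter` works internally: this version slices a plain token list and ignores sentence boundaries. `chunk_with_overlap` is a hypothetical helper, and it also rejects an overlap equal to the chunk size, because the window could then never advance.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slice tokens into windows that share `chunk_overlap` items (hypothetical helper)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            # The last window already reaches the end of the input.
            break
    return chunks


tokens = list(range(10))
chunks = chunk_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each window repeats the last token of the previous one, which is the purpose of the overlap: context at a chunk boundary appears in both neighbouring chunks.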

### Manually create node

In [30]:
from llama_index.core.schema import TextNode
nodes = [
    TextNode(text="1st chunk"),
    TextNode(text="2nd chunk", metadata={"kind": "special chunk"}),
    # Note: `id=3` does not set the node ID (the field is `id_`); the output below shows a generated UUID instead
    TextNode(text="3rd chunk", id=3),
]
nodes

[TextNode(id_='560ba749-c35f-4736-92f2-b74f46d75b31', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text='1st chunk', mimetype='text/plain', start_char_idx=None, end_char_idx=None, metadata_seperator='\n', text_template='{metadata_str}\n\n{content}'),
 TextNode(id_='f1209f8c-9d51-4200-a9c6-4ee359bfc1ed', embedding=None, metadata={'kind': 'special chunk'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text='2nd chunk', mimetype='text/plain', start_char_idx=None, end_char_idx=None, metadata_seperator='\n', text_template='{metadata_str}\n\n{content}'),
 TextNode(id_='13873cec-1735-4973-bc27-31ab1b28708b', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='