# Document/Nodes

- Document: 
   
   A Document is a generic container around any data source, for instance a PDF, an API output, or data retrieved from a database. Documents can be constructed manually or created automatically via data loaders.

- Node:
   
   A Node represents a "chunk" of a source Document, whether that is a text chunk, an image, or another kind of content.
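
The relationship between the two definitions above can be pictured with a minimal plain-Python sketch. It does not use the LlamaIndex classes: `SourceDocument`, `Chunk`, and `split_into_chunks` are hypothetical names, and fixed-size character slicing is only an assumption chosen to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List
from uuid import uuid4


@dataclass
class SourceDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)
    doc_id: str = field(default_factory=lambda: str(uuid4()))


@dataclass
class Chunk:
    """Hypothetical stand-in for a Node: one piece of a source document."""
    text: str
    source_doc_id: str
    metadata: dict = field(default_factory=dict)


def split_into_chunks(doc: SourceDocument, size: int) -> List[Chunk]:
    """Cut the document text into fixed-size character chunks."""
    return [
        Chunk(
            text=doc.text[i:i + size],
            source_doc_id=doc.doc_id,      # each chunk remembers where it came from
            metadata=dict(doc.metadata),   # and inherits a copy of the metadata
        )
        for i in range(0, len(doc.text), size)
    ]


doc = SourceDocument(text="abcdefghij", metadata={"file_name": "demo.txt"})
chunks = split_into_chunks(doc, size=4)
print([c.text for c in chunks])  # ['abcd', 'efgh', 'ij']
```

Each chunk keeps a reference to its source and a copy of the source metadata, which is the property the rest of this notebook relies on when it inspects nodes.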

## Document

### Using the built-in example

In [1]:
from llama_index.core import Document, VectorStoreIndex
from IPython.display import display, Markdown

In [2]:
example = Document.example()
example

Document(id_='9163241c-dea3-4312-80c7-7f73e1f3e2ac', embedding=None, metadata={'filename': 'README.md', 'category': 'codebase'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='\nContext\nLLMs are a phenomenal piece of technology for knowledge generation and reasoning.\nThey are pre-trained on large amounts of publicly available data.\nHow do we best augment LLMs with our own private data?\nWe need a comprehensive toolkit to help perform this data augmentation for LLMs.\n\nProposed Solution\nThat\'s where LlamaIndex comes in. LlamaIndex is a "data framework" to help\nyou build LLM  apps. It provides the following tools:\n\nOffers data connectors to ingest your existing data sources and data formats\n(APIs, PDFs, docs, SQL, etc.)\nProvides ways to structure your data (indices, graphs) so that this data can be\neasily used with LLMs.

In [3]:
content = example.get_content()
content

'Context\nLLMs are a phenomenal piece of technology for knowledge generation and reasoning.\nThey are pre-trained on large amounts of publicly available data.\nHow do we best augment LLMs with our own private data?\nWe need a comprehensive toolkit to help perform this data augmentation for LLMs.\n\nProposed Solution\nThat\'s where LlamaIndex comes in. LlamaIndex is a "data framework" to help\nyou build LLM  apps. It provides the following tools:\n\nOffers data connectors to ingest your existing data sources and data formats\n(APIs, PDFs, docs, SQL, etc.)\nProvides ways to structure your data (indices, graphs) so that this data can be\neasily used with LLMs.\nProvides an advanced retrieval/query interface over your data:\nFeed in any LLM input prompt, get back retrieved context and knowledge-augmented output.\nAllows easy integrations with your outer application framework\n(e.g. with LangChain, Flask, Docker, ChatGPT, anything else).\nLlamaIndex provides tools for both beginner users an

In [4]:
display(Markdown(content))

Context
LLMs are a phenomenal piece of technology for knowledge generation and reasoning.
They are pre-trained on large amounts of publicly available data.
How do we best augment LLMs with our own private data?
We need a comprehensive toolkit to help perform this data augmentation for LLMs.

Proposed Solution
That's where LlamaIndex comes in. LlamaIndex is a "data framework" to help
you build LLM  apps. It provides the following tools:

Offers data connectors to ingest your existing data sources and data formats
(APIs, PDFs, docs, SQL, etc.)
Provides ways to structure your data (indices, graphs) so that this data can be
easily used with LLMs.
Provides an advanced retrieval/query interface over your data:
Feed in any LLM input prompt, get back retrieved context and knowledge-augmented output.
Allows easy integrations with your outer application framework
(e.g. with LangChain, Flask, Docker, ChatGPT, anything else).
LlamaIndex provides tools for both beginner users and advanced users.
Our high-level API allows beginner users to use LlamaIndex to ingest and
query their data in 5 lines of code. Our lower-level APIs allow advanced users to
customize and extend any module (data connectors, indices, retrievers, query engines,
reranking modules), to fit their needs.

### Manually create documents

In [5]:
documents = [
    Document(text="庭院深深深幾許", metadata={"file_name": "mathbook.txt"}),
    Document(text="哈囉你好嗎", metadata={"file_name": "song.md"})
]

In [6]:
documents[0]

Document(id_='98e0f3e6-30ac-412e-bb28-68d535b0a7c6', embedding=None, metadata={'file_name': 'mathbook.txt'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='庭院深深深幾許', mimetype=None, path=None, url=None), image_resource=None, audio_resource=None, video_resource=None, text_template='{metadata_str}\n\n{content}')

In [7]:
documents[0].get_content()  # By default, `get_content` does not include metadata

'庭院深深深幾許'

### How to use metadata?

In [8]:
# Get the content with metadata
from llama_index.core.schema import MetadataMode
documents[0].get_content(metadata_mode=MetadataMode.ALL)

'file_name: mathbook.txt\n\n庭院深深深幾許'
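
The shape of that string follows from the three template fields visible in the Document repr earlier (`metadata_template`, `metadata_separator`, `text_template`). Here is a minimal plain-Python sketch of how such templates could combine. `render_with_metadata` is a hypothetical helper, not a LlamaIndex function, and the library's real logic may differ in its details.

```python
# Template values copied from the Document repr shown earlier.
METADATA_TEMPLATE = "{key}: {value}"
METADATA_SEPARATOR = "\n"
TEXT_TEMPLATE = "{metadata_str}\n\n{content}"


def render_with_metadata(text: str, metadata: dict) -> str:
    """Hypothetical helper: format each pair, join them, then fill the text template."""
    metadata_str = METADATA_SEPARATOR.join(
        METADATA_TEMPLATE.format(key=key, value=value)
        for key, value in metadata.items()
    )
    return TEXT_TEMPLATE.format(metadata_str=metadata_str, content=text)


rendered = render_with_metadata("hello", {"file_name": "mathbook.txt"})
print(repr(rendered))  # 'file_name: mathbook.txt\n\nhello'
```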

In [9]:
# Exclude metadata keys from both the embedding input and the LLM input
documents[0].excluded_embed_metadata_keys = ["file_name"]
documents[0].excluded_llm_metadata_keys = ["file_name"]

# Check the content seen by the embedding model and by the LLM
print(documents[0].get_content(metadata_mode=MetadataMode.EMBED))
print(documents[0].get_content(metadata_mode=MetadataMode.LLM))

庭院深深深幾許
庭院深深深幾許
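
The per-mode filtering seen above can be sketched in plain Python. This is an illustration only: `Mode` and `visible_metadata` are hypothetical stand-ins for `MetadataMode` and the library's internal logic, and the `category` key is made up for the example.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical stand-in for MetadataMode."""
    ALL = "all"
    EMBED = "embed"
    LLM = "llm"
    NONE = "none"


def visible_metadata(metadata, mode, excluded_embed_keys=(), excluded_llm_keys=()):
    """Return the metadata pairs that survive the exclusion list for the given mode."""
    if mode is Mode.NONE:
        return {}
    excluded = set()
    if mode is Mode.EMBED:
        excluded = set(excluded_embed_keys)
    elif mode is Mode.LLM:
        excluded = set(excluded_llm_keys)
    # Mode.ALL applies no exclusions at all.
    return {key: value for key, value in metadata.items() if key not in excluded}


meta = {"file_name": "mathbook.txt", "category": "poem"}
print(visible_metadata(meta, Mode.ALL))
print(visible_metadata(meta, Mode.EMBED, excluded_embed_keys=["file_name"]))
print(visible_metadata(meta, Mode.LLM, excluded_llm_keys=["file_name", "category"]))
```

Keeping separate exclusion lists lets a key such as a file name help retrieval through the embedding while staying out of the prompt sent to the LLM, or the other way round.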


## Node

### Node Parser (JSON data)

In [10]:
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(input_dir="../data/json/").load_data()

In [11]:
len(documents)

1

In [12]:
from llama_index.core.node_parser import JSONNodeParser
json_parser = JSONNodeParser()
nodes = json_parser.get_nodes_from_documents(documents)

In [13]:
len(nodes)

10

In [14]:
nodes[0].get_content()

'title 【台灣陷入尷尬處境】半導體耗電量狂飆又得淨零轉型，核電會是最佳解？\ncontent 台灣企業滿足了全球高達 68% 的晶片製造供應，隨之而來的是大量能源需求——根據綠色和平組織的預測，2030 年台灣半導體製造產業之耗電量，將會相當於 2021 年紐西蘭全島的耗電量的兩倍，其中將有 82% 的需求來自台積電。\n台灣為何陷入能源危機、與淨零目標遙遙無期？\nAI 時代用電激增，台灣卻在此時面臨牽涉到國安、氣候、政治挑戰的能源困境：台灣全島有高達 9 成的用電量靠的是進口石化燃料；兩岸情勢也持續緊繃，台灣必須面對來自中國的經濟封鎖、國際孤立，甚或武力入侵威脅；而出於政治考量，執政黨主張在 2025 年前告別核電，打造「非核家園」，並在同一年達成「燃煤發電 30%、天然氣發電 50%、再生能源 20%」的能源結構目標。\n再者，政府還
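
The node content above holds one `key value` line per field, and a single document produced 10 nodes. One way to picture that behavior is the stdlib-only sketch below. It is not the `JSONNodeParser` implementation: `flatten` and `json_to_chunks` are hypothetical helpers, and treating each top-level array item as one chunk is an assumption, since the input file is not shown.

```python
import json
from typing import Iterator, List


def flatten(value, path: List[str]) -> Iterator[str]:
    """Depth-first walk that yields one 'key path value' line per leaf."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, path + [str(key)])
    elif isinstance(value, list):
        for child in value:
            yield from flatten(child, path)
    else:
        yield " ".join(path + [str(value)])


def json_to_chunks(raw: str) -> List[str]:
    """Treat each top-level object in a JSON array as one chunk of text."""
    data = json.loads(raw)
    items = data if isinstance(data, list) else [data]
    return ["\n".join(flatten(item, [])) for item in items]


raw = '[{"title": "A", "content": "first"}, {"title": "B", "meta": {"tags": ["x", "y"]}}]'
chunks = json_to_chunks(raw)
print(len(chunks))       # 2
print(repr(chunks[0]))   # 'title A\ncontent first'
```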

### Node Parser (HTML data)

In [15]:
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(input_dir="../data/html/").load_data()

In [16]:
%pip install -q bs4

Note: you may need to restart the kernel to use updated packages.


In [17]:
from llama_index.core.node_parser import HTMLNodeParser
html_parser = HTMLNodeParser()
# DEFAULT_TAGS = ["p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "b", "i", "u", "section"]
nodes = html_parser.get_nodes_from_documents(documents)
len(nodes)

3

In [18]:
for each in nodes:
    print(each.get_content())

Welcome to My Webpage
This is a simple HTML example with some basic styling.
Visit
Example.com
to
      learn more.
HTML
CSS
JavaScript
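
To see why text inside an untracked tag (such as a link inside a paragraph) still shows up in the output, here is a small sketch built on the standard library's `html.parser`. It is not how `HTMLNodeParser` is implemented; `TagTextExtractor` is a hypothetical class, and the sample HTML is made up.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# Tag list copied from the DEFAULT_TAGS comment above.
TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "b", "i", "u", "section"}


class TagTextExtractor(HTMLParser):
    """Hypothetical helper: collect (tag, text) pairs for the tags of interest."""

    def __init__(self) -> None:
        super().__init__()
        self.open_tags: List[str] = []           # stack of open tags of interest
        self.pieces: List[Tuple[str, str]] = []  # extracted (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tags:
            # Attribute the text to the innermost open tag of interest.
            self.pieces.append((self.open_tags[-1], text))


html = (
    "<h1>Title</h1><div>skipped</div>"
    "<p>Visit <a href='#'>a link</a> now.</p>"
    "<ul><li>One</li><li>Two</li></ul>"
)
extractor = TagTextExtractor()
extractor.feed(html)
print(extractor.pieces)
```

Text directly inside `<div>` is dropped because `div` is not in the tag list, while the link text is kept because it sits inside a tracked `<p>`.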


### Node Parser (Markdown data)

In [19]:
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(input_dir="../data/markdown/").load_data()

In [20]:
from llama_index.core.node_parser import MarkdownNodeParser
md_parser = MarkdownNodeParser()
nodes = md_parser.get_nodes_from_documents(documents)
print(len(nodes))

1


In [21]:
nodes[0].get_content()

'🐍 Python 簡單語法筆記\n\n1. 變數與資料型別\n\n   Python 是一個動態型別語言，無需明確聲明變數型別。\n\n```python\n# 整數\nx = 10\n\n# 浮點數\ny = 3.14\n\n# 字串\nname = "Alice"\n\n# 布林值\nis_active = True\n```\n\n2. 基本運算\n\n   Python 支援常見的數學運算，並使用簡單的符號來表示。\n\n```python\n# 加法\nresult = 5 + 3 # 8\n\n# 減法\nresult = 10 - 7 # 3\n\n# 乘法\nresult = 4 \\* 2 # 8\n\n# 除法\nresult = 10 / 2 # 5.0 (浮點數結果)\n\n# 次方\nresult = 2 \\*\\* 3 # 8\n```\n\n3. 條件判斷\n\n   使用 if、elif 和 else 來進行條件判斷。\n\n```python\nage = 18\n\nif age >= 18:\nprint("成人")\nelif age > 12:\nprint("青少年")\nelse:\nprint("兒童")\n```\n\n4. 迴圈\n\n   Python 有兩種主要的迴圈：for 和 while。\n\n- for 迴圈\n\n  ```python\n  # 列印 1 到 5\n  for i in range(1, 6):\n  print(i)\n  ```\n\n- while 迴圈\n\n  ```python\n  # 列印直到條件為假\n  count = 
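
The visible part of this content has no `#` heading lines outside code fences, which would be consistent with the parser returning a single node. The sketch below illustrates heading-based splitting in plain Python under that assumption. `split_by_headings` is a hypothetical helper, not the `MarkdownNodeParser` implementation.

```python
from typing import List

FENCE = "`" * 3  # three backticks, built this way so the example embeds cleanly


def split_by_headings(markdown: str) -> List[str]:
    """Start a new section at every heading line that is outside a code fence."""
    sections: List[str] = []
    current: List[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle on every fence line
        is_heading = line.startswith("#") and not in_fence
        if is_heading and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [section for section in sections if section]


text = "\n".join([
    "# Intro", "hello", "",
    FENCE + "python", "# a comment, not a heading", "x = 1", FENCE, "",
    "## Details", "more",
])
sections = split_by_headings(text)
print(len(sections))  # 2
print(len(split_by_headings("1. no headings here\n\nplain text")))  # 1
```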

### Node Parser (SimpleFileNodeParser, regardless of file type)

In [22]:
documents = [
    *SimpleDirectoryReader(input_dir="../data/json/").load_data(),
    *SimpleDirectoryReader(input_dir="../data/html/").load_data(),
    *SimpleDirectoryReader(input_dir="../data/markdown/").load_data()
]

In [23]:
len(documents)

3

In [24]:
from llama_index.core.node_parser import SimpleFileNodeParser
parser = SimpleFileNodeParser()
nodes = parser.get_nodes_from_documents(documents)

In [25]:
len(nodes)

3
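
The idea of one entry point that handles several file types can be pictured as a dispatcher keyed by file extension. The sketch below is only an illustration of that idea, not a description of how `SimpleFileNodeParser` works internally. `PARSERS`, `parse_file`, and the two toy parsers are hypothetical names.

```python
from pathlib import Path
from typing import Callable, Dict, List

Parser = Callable[[str], List[str]]


def split_paragraphs(text: str) -> List[str]:
    """Toy parser: one chunk per blank-line-separated paragraph."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def split_lines(text: str) -> List[str]:
    """Toy parser: one chunk per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical registry: file extension -> parser function.
PARSERS: Dict[str, Parser] = {".md": split_paragraphs, ".txt": split_lines}


def parse_file(file_name: str, text: str) -> List[str]:
    """Pick a parser from the file extension; fall back to a single chunk."""
    parser = PARSERS.get(Path(file_name).suffix.lower())
    return parser(text) if parser else [text]


print(parse_file("notes.md", "a\n\nb"))   # ['a', 'b']
print(parse_file("log.txt", "x\ny\n"))    # ['x', 'y']
print(parse_file("data.bin", "raw"))      # ['raw']
```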

### Splitter (SentenceSplitter)

In [26]:
documents = [
    Document(text="在今年備受矚目的 NBA 總冠軍賽中，洛杉磯湖人隊以驚險的表現擊敗了波士頓塞爾提克隊，\
        成功捧起隊史第 18 座總冠軍獎杯，刷新聯盟歷史記錄。系列賽中，LeBron James 和 Anthony Davis \
        持續展現頂尖表現，尤其在關鍵的第七場比賽中，LeBron 在最後三秒投中致勝球，成為球隊奪冠的最大功臣。\
        儘管塞爾提克隊的年輕核心 Jayson Tatum 和 Jaylen Brown 表現出色，但最終無法逆轉湖人隊的\
        全面優勢。這場史詩般的對決不僅展現了雙方球員的技術與韌性，也為球迷留下了無數經典瞬間，\
        成為 NBA 歷史上的又一里程碑")
]

In [27]:
from llama_index.core.node_parser import SentenceSplitter
# Default chunk_size is 1024, chunk_overlap is 200
# Note that chunk_overlap cannot be larger than chunk_size
splitter = SentenceSplitter(chunk_size=100, chunk_overlap=20)
nodes = splitter.get_nodes_from_documents(documents)

In [28]:
len(nodes)

5

In [29]:
nodes[0].get_content()

'在今年備受矚目的 NBA 總冠軍賽中，洛杉磯湖人隊以驚險的表現擊敗了波士頓塞爾提克隊，        成功捧起隊史第 18'
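
How `chunk_size` and `chunk_overlap` interact can be shown with a plain sliding window. This is a simplified sketch, not the `SentenceSplitter` algorithm: the real splitter measures size in tokens and prefers to break at sentence boundaries, while the hypothetical `sliding_chunks` below cuts at fixed positions and pretends each character is one token.

```python
from typing import List


def sliding_chunks(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Cut a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        # This sketch also rejects equality, because the window would never advance.
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = list("abcdefghij")  # pretend each character is one token
result = ["".join(chunk) for chunk in sliding_chunks(tokens, chunk_size=4, chunk_overlap=1)]
print(result)  # ['abcd', 'defg', 'ghij']
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so neighbouring chunks share their boundary tokens.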

### Manually create node

In [30]:
from llama_index.core.schema import TextNode
nodes = [
    TextNode(text="1st chunk"),
    TextNode(text="2nd chunk", metadata={"kind": "special chunk"}),
    # NOTE: `id` is not a TextNode field, so it is ignored (the output shows a generated UUID); pass id_="3" to set the ID
    TextNode(text="3rd chunk", id=3),
]
nodes

[TextNode(id_='560ba749-c35f-4736-92f2-b74f46d75b31', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text='1st chunk', mimetype='text/plain', start_char_idx=None, end_char_idx=None, metadata_seperator='\n', text_template='{metadata_str}\n\n{content}'),
 TextNode(id_='f1209f8c-9d51-4200-a9c6-4ee359bfc1ed', embedding=None, metadata={'kind': 'special chunk'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text='2nd chunk', mimetype='text/plain', start_char_idx=None, end_char_idx=None, metadata_seperator='\n', text_template='{metadata_str}\n\n{content}'),
 TextNode(id_='13873cec-1735-4973-bc27-31ab1b28708b', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='