# Documents
A Document serves as a container. It holds not just the raw text or data from its original source but also any extra metadata you decide to attach. Here is how one can be created manually:


In [6]:
from llama_index.core import Document

text = "The quick brown fox jumps over the lazy dog"

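doc = Document(
    text=text,
    metadata={"author": "John Doe", "category": "others"},
    # The ID field is named `id_`; this `id` keyword was not applied,
    # which is why the output below shows an auto-generated UUID.
    id="1",
)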

print(doc)

Doc ID: f3bf2752-ba8f-4e6b-8aae-2862cda096b9
Text: The quick brown fox jumps over the lazy dog


In practical applications, documents are generated in bulk from various data sources. This bulk ingestion uses predefined data loaders (sometimes called connectors or simply readers) from an extensive library called LlamaHub.

In [7]:
from llama_index.readers.wikipedia import WikipediaReader

loader = WikipediaReader()

documents = loader.load_data(
    # A list keeps the page order stable; a set has no guaranteed order.
    pages=["Pythagorean theorem", "General relativity"]
)

print(f"Loaded {len(documents)} documents")

Loaded 2 documents


Creating documents is a straightforward process. But how do the raw Document objects get converted into a format that LLMs can efficiently process and reason over? This is where Nodes come in.

# Nodes
While Documents represent the raw data and can be used as such, Nodes are smaller chunks of content extracted from the Documents. The goal is to break documents down into smaller, more manageable pieces of text. This serves the following purposes:
* Allows our proprietary knowledge to fit within the model's prompt limits
* Allows the creation of relationships between Nodes

In LlamaIndex, nodes can also store images.
Here is a list of attributes of the TextNode class:
* **text**: The chunk of text derived from an original document
* **start_char_idx** and **end_char_idx**: Optional integer values that store the starting and ending character positions of the text within the Document. These are helpful when the text is part of a larger document and you need to pinpoint its exact location
* **text_template** and **metadata_template**: Template fields that define how the text and metadata are formatted. They help produce a more structured and readable representation of the TextNode
* **metadata_separator**: A string field that defines the separator between metadata fields
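
To make the template fields more concrete, here is a minimal plain-Python sketch of how a text template, a metadata template, and a separator could combine into one rendered string. The `SketchNode` class and the template values are assumptions chosen for illustration; this is not the TextNode implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SketchNode:
    """Toy stand-in for a text node; not the LlamaIndex TextNode class."""

    text: str
    metadata: dict = field(default_factory=dict)
    # Assumed template values, chosen only for this illustration.
    text_template: str = "{metadata_str}\n\n{content}"
    metadata_template: str = "{key}: {value}"
    metadata_separator: str = "\n"

    def render(self) -> str:
        # Format each metadata entry, then join the entries with the separator.
        metadata_str = self.metadata_separator.join(
            self.metadata_template.format(key=key, value=value)
            for key, value in self.metadata.items()
        )
        # Place the metadata block and the chunk text into the text template.
        return self.text_template.format(
            metadata_str=metadata_str, content=self.text
        )


node = SketchNode(
    text="This is a sample",
    metadata={"author": "John Doe", "category": "others"},
)
rendered = node.render()
print(rendered)
```

Changing `metadata_separator` to `", "` in this sketch would put both metadata entries on a single line, which shows why the separator is kept as its own field.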

## Manually creating the Node Objects

In [8]:
from llama_index.core import Document
from llama_index.core.schema import TextNode

doc = Document(text="This is a sample Document text")
# Slice the document text into two chunks by character position.
n1 = TextNode(text=doc.text[0:16], doc_id=doc.id_)
n2 = TextNode(text=doc.text[17:30], doc_id=doc.id_)


print(n1)
print(n2)

Node ID: ea929b60-7c90-4c21-8d4a-db3414a8ca92
Text: This is a sample
Node ID: c540d961-ea1c-413e-911a-7a84f3f8fcac
Text: Document text


## Automatically extracting nodes from documents using splitters
Because document chunking is very important in a RAG workflow, LlamaIndex comes with built-in tools for this purpose:

In [11]:
from llama_index.core.node_parser import TokenTextSplitter

doc = Document(
    text=("This is sentence 1. This is sentence 2. " 
          "Sentence 3 here"),
    metadata={"author": "John Smith"}
)

splitter = TokenTextSplitter(
    chunk_size=12,
    chunk_overlap=0,
    separator=" "
)

nodes = splitter.get_nodes_from_documents([doc])
for node in nodes:
    print(node.text)
    print(node.metadata)

Metadata length (6) is close to chunk size (12). Resulting chunks are less than 50 tokens. Consider increasing the chunk size or decreasing the size of your metadata to avoid this.
This is sentence 1.
{'author': 'John Smith'}
This is sentence 2.
{'author': 'John Smith'}
Sentence 3 here
{'author': 'John Smith'}
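
To illustrate the idea behind size-based splitting, here is a minimal sketch that chunks text by counting whitespace-separated words. It is a simplification under stated assumptions: the `split_by_tokens` helper is hypothetical, and it treats each word as one token, whereas the splitter above counts tokenizer tokens and, as the warning in the output suggests, also accounts for metadata length.

```python
def split_by_tokens(text, chunk_size, chunk_overlap=0, separator=" "):
    """Greedy fixed-size chunking over separator-delimited words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split(separator)
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(separator.join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


text = "This is sentence 1. This is sentence 2. Sentence 3 here"

no_overlap = split_by_tokens(text, chunk_size=4)
with_overlap = split_by_tokens(text, chunk_size=4, chunk_overlap=1)

for chunk in no_overlap:
    print(chunk)
```

With an overlap of 1, each chunk in `with_overlap` starts with the last word of the previous chunk, which is the usual way to keep some context across chunk boundaries.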


## Nodes don't like to be alone: they crave relationships
We can now add relationships to nodes:

In [13]:
from llama_index.core.schema import TextNode, NodeRelationship, RelatedNodeInfo

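doc = Document(text="First sentence. Second sentence")
# Each node gets its own auto-generated ID.
n1 = TextNode(text="First sentence")
n2 = TextNode(text="Second sentence")

# Bare node IDs are stored here for brevity; relationships are normally
# RelatedNodeInfo objects, e.g. RelatedNodeInfo(node_id=n2.node_id).
n1.relationships[NodeRelationship.NEXT] = n2.node_id
n2.relationships[NodeRelationship.PREVIOUS] = n1.node_id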

print(n1.relationships)
print(n2.relationships)

{<NodeRelationship.NEXT: '3'>: '3d4aa0a8-1b23-42ef-a4cb-c9a2e6d906e7'}
{<NodeRelationship.PREVIOUS: '2'>: 'd97af7bc-f8c0-41e4-b168-5ebb8b59e6cb'}


LlamaIndex contains the necessary tools to automatically create relationships between nodes. For example, when the automated node parsers discussed previously are used in their default configuration, LlamaIndex automatically creates previous and next relationships between the nodes it produces.

There are other types of relationships that we could define. In addition to simple relationships such as previous or next, Nodes can be connected using the following:
* **SOURCE-** The source relationship represents the original document that a node was extracted or parsed from. When you parse a document into multiple nodes, you can track which document each node originated from using the source relationship
* **PARENT-** The parent relationship indicates a hierarchical structure: it points to the node one level above the current node. In a tree structure, a parent has one or more children
* **CHILD-** This is the inverse of parent: it points to the nodes one level below the current node
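
The relationship types above can be sketched in plain Python as follows. This is an illustration only: `LinkedNode`, `link_chunks`, and `attach_children` are hypothetical names, and the string keys are illustrative labels standing in for the relationship types, not the library's enum.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LinkedNode:
    """Toy node for illustration; not the LlamaIndex TextNode class."""

    text: str
    node_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    relationships: dict = field(default_factory=dict)


def link_chunks(doc_id, chunks):
    """Create one node per chunk and wire SOURCE, PREVIOUS, and NEXT."""
    nodes = [LinkedNode(text=chunk) for chunk in chunks]
    for i, node in enumerate(nodes):
        node.relationships["SOURCE"] = doc_id
        if i > 0:
            node.relationships["PREVIOUS"] = nodes[i - 1].node_id
        if i < len(nodes) - 1:
            node.relationships["NEXT"] = nodes[i + 1].node_id
    return nodes


def attach_children(parent, children):
    """Wire PARENT on each child and the inverse CHILD list on the parent."""
    parent.relationships["CHILD"] = [child.node_id for child in children]
    for child in children:
        child.relationships["PARENT"] = parent.node_id


nodes = link_chunks("doc-1", ["First sentence", "Second sentence"])
section = LinkedNode(text="First sentence. Second sentence")
attach_children(section, nodes)

for node in nodes:
    print(node.text, sorted(node.relationships))
```

The first node has no PREVIOUS entry and the last has no NEXT entry, so a traversal can detect the ends of the chain by the absence of a key.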

## Why are relationships important?
* **Enables more contextual querying-** By linking nodes together, you can leverage their relationships during querying to retrieve additional relevant documents
* **Allows tracking provenance-** Relationships encode provenance: where source nodes originated and how they are connected
* **Enables navigation through nodes-** Traversing nodes by their relationships enables new types of queries
* **Supports the construction of knowledge graphs-** Nodes and relationships are the building blocks of knowledge graphs
* **Improves the index structure-** Some LlamaIndex indexes, such as trees and graphs, use node relationships to build their internal structure
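
As one example of contextual querying, the sketch below widens a retrieved node with its neighbors by following previous and next links. The `store` layout and the `walk` and `with_neighbors` helpers are assumptions made for this illustration, not a LlamaIndex API.

```python
# A tiny node store: each entry keeps its text and the IDs of its neighbors.
store = {
    "n1": {"text": "This is sentence 1.", "prev": None, "next": "n2"},
    "n2": {"text": "This is sentence 2.", "prev": "n1", "next": "n3"},
    "n3": {"text": "Sentence 3 here", "prev": "n2", "next": None},
}


def walk(store, start_id, direction, limit):
    """Follow `direction` links from start_id, collecting up to `limit` texts."""
    texts = []
    cursor = store[start_id][direction]
    while cursor is not None and len(texts) < limit:
        texts.append(store[cursor]["text"])
        cursor = store[cursor][direction]
    return texts


def with_neighbors(store, node_id, window=1):
    """Return the node's text surrounded by up to `window` neighbors per side."""
    before = walk(store, node_id, "prev", window)[::-1]  # restore reading order
    after = walk(store, node_id, "next", window)
    return before + [store[node_id]["text"]] + after


context = with_neighbors(store, "n2")
print(context)
```

A query that matches only the middle node would then hand the LLM all three sentences, which is the kind of extra context that relationships make possible.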


# Indexes
An index is a data structure used to organize a collection of nodes for optimized storage and retrieval.
Without indexing, your data is a messy pile of disorganized facts and files. Proper indexing neatly sorts information into categories that make sense.
LlamaIndex supports different types of indexes, each with its own strengths and trade-offs. Here is a list of available index types:
* **SummaryIndex-** This is very similar to a box of recipes: it keeps your nodes in order so that you can access them one by one
* **DocumentSummaryIndex-** This constructs a concise summary for each document, mapping these summaries back to their respective nodes. 
* **VectorStoreIndex-** It converts text into vector embeddings and uses math to group similar nodes, helping you to locate nodes that are alike
* **TreeIndex-** This index behaves like putting smaller boxes inside bigger ones, organizing nodes in a tree-like structure
* **KeywordTableIndex-** The keyword index connects important words to the nodes they are in
* **KnowledgeGraphIndex-** This is useful when you need to link facts in a big network of data stored as a knowledge graph
* **Composable Graph-** This allows you to create index structures in which Document-level indexes are indexed in higher-level collections
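
To give a feel for the "math" behind a vector-style index, here is a toy sketch that ranks texts by cosine similarity. The big assumption is the `embed` function: it uses simple word counts, whereas a real vector index relies on embeddings produced by a model. Only the ranking step is meant to carry over.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts. Real embeddings come from a model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


texts = [
    "Lionel Messi is a football player from Argentina",
    "He has won the Ballon d'Or trophy 7 times",
    "Lionel Messi's hometown is Rosario",
    "He was born on June 24, 1987",
]

query = "What is the hometown of Lionel Messi"
query_vector = embed(query)

# Rank every text by its similarity to the query, most similar first.
ranked = sorted(texts, key=lambda t: cosine(query_vector, embed(t)), reverse=True)
print(ranked[0])
```

Word counts only match exact words, so this sketch misses paraphrases; model-based embeddings are what let a vector index find texts that are similar in meaning rather than in spelling.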


All the index types in LlamaIndex share some common features:
* **Building the index-** Each index can be constructed by passing in a set of nodes during initialization. This builds the underlying index structure
* **Inserting new nodes-** After an index is built, new nodes can be manually inserted. This adds to the existing index structure
* **Querying the index-** Once built, indexes provide a query interface to retrieve relevant nodes based on a specific query
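
The three shared operations (build, insert, query) can be sketched with a toy keyword index. `ToyKeywordIndex` is a hypothetical class written for this illustration; it is not how LlamaIndex implements KeywordTableIndex, and it treats every word as a keyword.

```python
from collections import defaultdict


class ToyKeywordIndex:
    """Maps each word to the positions of the nodes that contain it."""

    def __init__(self, nodes):
        # Building: insert every initial node into the underlying structure.
        self.nodes = []
        self.table = defaultdict(set)
        for text in nodes:
            self.insert(text)

    def insert(self, text):
        # Inserting: append the node and register each of its words.
        position = len(self.nodes)
        self.nodes.append(text)
        for word in text.lower().split():
            self.table[word.strip(".,?!")].add(position)

    def query(self, question):
        # Querying: collect every node that shares a word with the question.
        hits = set()
        for word in question.lower().split():
            hits |= self.table.get(word.strip(".,?!"), set())
        return [self.nodes[i] for i in sorted(hits)]


index = ToyKeywordIndex(["Lionel Messi is a football player from Argentina"])
index.insert("Lionel Messi's hometown is Rosario")

results = index.query("hometown?")
print(results)
```

The same interface shape applies whatever the internal structure is: a list, a tree, or a vector store would change `insert` and `query` but not how callers use them.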

Here is a simple example to illustrate the creation of a SummaryIndex:

In [14]:
from llama_index.core import SummaryIndex, Document
from llama_index.core.schema import TextNode

nodes = [
    TextNode(
        text="Lionel Messi is a football player from Argentina"
    ),
    TextNode(
        text="He has won the Ballon d'Or trophy 7 times"
    ),
    TextNode(
        text="Lionel Messi's hometown is Rosario"
    ),
    TextNode(
        text="He was born on June 24, 1987"
    )
]

index = SummaryIndex(nodes)

In [None]:
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Get the OpenAI API key
openai_api_key = os.getenv('OPENAI_API_KEY')


In [21]:
query_engine = index.as_query_engine()
response = query_engine.query("What is Messi's hometown?")
print(response)

Rosario


The QueryEngine contains a retriever, which is responsible for retrieving Nodes from the index for the query. The retriever does a lookup to fetch and rank relevant nodes from the index. In this case, it grabs nodes that are likely to contain information about Messi's hometown.
But just getting back a list of nodes isn't very useful. Another part of the QueryEngine, called the node postprocessor, comes into play at this point. It enables the transformation, re-ranking, or filtering of nodes after they've been retrieved and before the final response is crafted.
The QueryEngine also contains a response synthesizer, which takes the retrieved nodes and crafts the final response using the following steps:
1. The response synthesizer takes the nodes selected by the retriever and processed by the node postprocessor and formats them into an LLM prompt.
2. The prompt contains the query along with context from the nodes
3. This prompt is given to the LLM to generate a response
4. Any necessary postprocessing is done on the raw response from the LLM to return the final natural language answer

In short, index.as_query_engine() creates a full query engine for us, containing a default version of each of the three elements: retriever, node postprocessor, and response synthesizer.
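
The flow through the three elements can be sketched in plain Python. Everything here is an assumption made for illustration: the word-overlap scoring, the score cutoff, the prompt wording, and the `fake_llm` stub that stands in for a real model call. It shows the shape of the pipeline, not LlamaIndex's defaults.

```python
def retrieve(query, nodes):
    """Retriever: score each node by the words it shares with the query."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(text.lower().split())), text) for text in nodes
    ]
    # Highest score first; ties keep their original order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def postprocess(scored_nodes, min_score=1):
    """Node postprocessor: drop nodes that fall below a score cutoff."""
    return [text for score, text in scored_nodes if score >= min_score]


def build_prompt(query, context_nodes):
    """Format the query and the node context into a single LLM prompt."""
    context = "\n".join(context_nodes)
    return (
        "Context information is below.\n"
        f"{context}\n"
        "Answer the query using only the context.\n"
        f"Query: {query}\nAnswer:"
    )


def synthesize(query, context_nodes, llm):
    """Response synthesizer: prompt the LLM, then tidy the raw response."""
    return llm(build_prompt(query, context_nodes)).strip()


def fake_llm(prompt):
    # Placeholder for a real LLM call; it only looks for one known answer.
    return " Rosario " if "Rosario" in prompt else " I don't know "


nodes = [
    "Lionel Messi is a football player from Argentina",
    "He has won the Ballon d'Or trophy 7 times",
    "Lionel Messi's hometown is Rosario",
    "He was born on June 24, 1987",
]

query = "What is Messi's hometown?"
kept = postprocess(retrieve(query, nodes))
answer = synthesize(query, kept, fake_llm)
print(answer)
```

Keeping the three stages as separate functions mirrors why a query engine exposes them separately: each one can be swapped without touching the others.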

## Building our first interactive, augmented LLM Application


In [23]:
from llama_index.core import Document, SummaryIndex
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.readers.wikipedia import WikipediaReader

loader = WikipediaReader()

documents = loader.load_data(pages=["Messi Lionel"])
parser = SimpleNodeParser.from_defaults()
nodes = parser.get_nodes_from_documents(documents)
index = SummaryIndex(nodes)
query_engine = index.as_query_engine()

print("Ask me anything about Lionel Messi")
while True:
    question = input("Your Question: ")
    if question.lower() == "exit":
        break

    response = query_engine.query(question)
    print(response)

Ask me anything about Lionel Messi
Lionel Messi was born on 24 June 1987.
Inter Miami
Inter Miami.
Antonela Roccuzzo is Messi's wife, and they have three sons named Thiago, Mateo, and Ciro.
Antonela Roccuzzo is Messi's wife, and they have three sons named Thiago, Mateo, and Ciro.


## Using the logging features of LlamaIndex to understand the logic and debug our applications


In [None]:
import logging
logging.basicConfig(level=logging.DEBUG)


## Temperature Parameter
For OpenAI models, temperature values range from 0 to 2. Higher values produce more random, creative output; lower values produce more focused, deterministic output. A temperature of 0 produces almost the same output every time for the same input prompt. For code generation and data analysis tasks, a temperature of around 0.2 is appropriate, while more creative tasks such as writing or chatbot responses benefit from a setting of 0.5 or higher.
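
To see why lower temperatures give more deterministic output, here is a sketch of the usual mechanism: the model's scores (logits) for candidate tokens are divided by the temperature before being turned into probabilities. The logit values are made up for illustration, and how a given provider treats a temperature of exactly 0 is not covered here; this sketch simply rejects it.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after scaling them by 1/temperature."""
    if temperature <= 0:
        raise ValueError("this sketch requires a temperature above 0")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 2.0)

for temperature in (0.2, 1.0, 2.0):
    probabilities = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probabilities])
```

At a low temperature almost all of the probability lands on the top-scoring token, so sampling picks it nearly every time; at a high temperature the distribution flattens and less likely tokens are sampled more often.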

## Understanding how Settings can be used for customization
Settings is a key component in LlamaIndex that allows you to customize and configure the elements used during indexing and querying. It contains common objects needed across LlamaIndex such as the following:
* **LLM-** This allows the default LLM to be overridden with a custom one
* **Embedding model-** This is used for generating vectors for text to enable semantic search
* **NodeParser-** This is used for setting the default node parser
* **CallbackManager-** This handles callbacks for events within LlamaIndex.