# Working with documents

`Document`: Represents unstructured textual data to be processed or saved into the `DocumentMemory`. It can represent a text, a text chunk, a table row, or a claim (an unstructured fact).

`DocumentList`: The DSPy type used as input and output by the data processing modules and the memory.
  
```python
import dspy
from pydantic import BaseModel, Field
from typing import Any, Dict, List, Optional
from uuid import uuid4

class Document(BaseModel):
    id: str = Field(description="Unique identifier for the document", default_factory=uuid4)
    text: str = Field(description="The actual text content of the document")
    parent_id: str = Field(description="Identifier for the parent document", default="")
    vector: Optional[List[float]] = Field(description="Vector representation of the document", default=None)
    metadata: Optional[Dict[str, Any]] = Field(description="Additional information about the document", default=None)

class DocumentList(BaseModel, dspy.Prediction):
    docs: List[Document] = Field(description="List of documents", default=[])

``` 
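The `parent_id` field is what lets a text chunk point back to the document it came from. The following is a minimal sketch of that idea, not HybridAGI code: `Doc` is a hypothetical stdlib-only stand-in that mirrors a few fields of `Document`, and `split_into_chunks` is an illustrative helper that does not exist in the library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional
from uuid import uuid4


@dataclass
class Doc:
    """Hypothetical stand-in mirroring some fields of `Document`."""
    text: str
    id: str = field(default_factory=lambda: str(uuid4()))
    parent_id: str = ""
    metadata: Optional[Dict[str, Any]] = None


def split_into_chunks(parent: Doc, max_words: int = 50) -> List[Doc]:
    """Split a document into word-bounded chunks that reference `parent`."""
    words = parent.text.split()
    chunks = []
    for start in range(0, len(words), max_words):
        chunks.append(
            Doc(
                text=" ".join(words[start:start + max_words]),
                parent_id=parent.id,  # link the chunk to its source document
                metadata=dict(parent.metadata) if parent.metadata else None,
            )
        )
    return chunks


book = Doc(text="one two three four five", metadata={"title": "Example"})
chunks = split_into_chunks(book, max_words=2)
for chunk in chunks:
    print(chunk.parent_id == book.id, repr(chunk.text))
```

Each chunk gets its own `id` and a copy of the parent's metadata, so a chunk can be traced back to its source through `parent_id` alone.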


```python
import hybridagi.core.datatypes as dt

input_data = [
    {
        "title": "The Catcher in the Rye",
        "content": "The Catcher in the Rye is a novel by J. D. Salinger, partially published in serial form in 1945–1946 and as a novel in 1951. It is widely considered one of the greatest American novels of the 20th century. The novel's protagonist, Holden Caulfield, has become an icon for teenage rebellion and angst. The novel also deals with complex issues of innocence, identity, belonging, loss, and connection."
    },
    {
        "title": "To Kill a Mockingbird",
        "content": "To Kill a Mockingbird is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature. The plot and characters are loosely based on the author's observations of her family and neighbors, as well as on an event that occurred near her hometown in 1936, when she was 10 years old. The novel is renowned for its sensitivity and depth in addressing racial injustice, class, gender roles, and destruction of innocence."
    }
]

books = dt.DocumentList()
books.docs = [dt.Document(text=d["content"], metadata={"title": d["title"]}) for d in input_data]

# The DocumentList type is used as input/output for the modules/memory
print(books)
```



```
docs=[Document(id=UUID('d16c8404-62d2-4d4f-8fe9-a32d69ecd229'), text="The Catcher in the Rye is a novel by J. D. Salinger, partially published in serial form in 1945–1946 and as a novel in 1951. It is widely considered one of the greatest American novels of the 20th century. The novel's protagonist, Holden Caulfield, has become an icon for teenage rebellion and angst. The novel also deals with complex issues of innocence, identity, belonging, loss, and connection.", parent_id=None, vector=None, metadata={'title': 'The Catcher in the Rye'}, created_at=datetime.datetime(2024, 8, 2, 17, 10, 34, 26183)), Document(id=UUID('cc9d0c01-3985-4826-bfa4-24d52d140fa5'), text="To Kill a Mockingbird is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature. The plot and characters are loosely based on the author's observations of her family and neighbors, as well as on an event that occurred near her h
```

### Loading tabular data

Loading tabular data is a common need for businesses. The easiest way to do it is with pandas.

Let's imagine that we have the following data in a file called `salaries_and_bonuses.csv`:

| EmployeeID | FirstName | LastName | Position | HireDate | Salary | Bonus | TotalCompensation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | John | Doe | Software Engineer | 2021-01-01 | 80000.00 | 5000.00 | 85000.00 |
| 2 | Jane | Smith | Product Manager | 2020-06-01 | 90000.00 | 10000.00 | 100000.00 |
| 3 | Michael | Johnson | Data Analyst | 2021-03-15 | 70000.00 | 2000.00 | 72000.00 |
| 4 | Emily | Davis | Marketing Manager | 2019-09-01 | 100000.00 | 15000.00 | 115000.00 |
| 5 | Robert | Brown | Sales Representative | 2021-11-15 | 60000.00 | 0.00 | 60000.00 |

The file is located at `data/salaries_and_bonuses.csv`.

```python
from hybridagi.readers import CSVReader

reader = CSVReader()
# Load the CSV file
table_rows = reader("data/salaries_and_bonuses.csv")

# Check the first document, which represents the first row
print(table_rows.docs[0].text)
```

```
EmployeeID                           1
FirstName                         John
LastName                           Doe
Position             Software Engineer
HireDate                    2021-01-01
Salary                         80000.0
Bonus                           5000.0
TotalCompensation              85000.0
```
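To make the "one document per row" idea concrete, here is a rough stdlib-only sketch of turning CSV rows into one text block each. It is not how `CSVReader` is implemented: `row_to_text` and `csv_to_texts` are hypothetical helpers, the sample is inlined rather than read from disk, and values stay as raw strings (so `80000.00` is not reformatted as in the output above).

```python
import csv
import io

# Inline sample with the same columns as salaries_and_bonuses.csv
SAMPLE_CSV = """EmployeeID,FirstName,LastName,Position,HireDate,Salary,Bonus,TotalCompensation
1,John,Doe,Software Engineer,2021-01-01,80000.00,5000.00,85000.00
2,Jane,Smith,Product Manager,2020-06-01,90000.00,10000.00,100000.00
"""


def row_to_text(row):
    """Render one row as aligned 'column  value' lines."""
    width = max(len(name) for name in row)
    return "\n".join(f"{name.ljust(width)}  {value}" for name, value in row.items())


def csv_to_texts(stream):
    """Return one text block per CSV row, using the header line for column names."""
    return [row_to_text(row) for row in csv.DictReader(stream)]


texts = csv_to_texts(io.StringIO(SAMPLE_CSV))
print(texts[0])
```

Keeping the column names inside each row's text means every resulting document is self-describing, which matters once rows are stored and retrieved individually.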


### Loading documents into memory

To make the documents available to the agent system, we need to load them into memory.

```python
from hybridagi.memory.integration.local import LocalDocumentMemory

doc_memory = LocalDocumentMemory(index_name="tests")

# Load both document lists into memory so they are ready to use
doc_memory.update(books)
doc_memory.update(table_rows)

# Display the memory; hover over a node to show its content
doc_memory.show()
```
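As a mental model for what loading into memory means, the toy store below keeps documents in a dictionary keyed by `id`. This is only a sketch under a stated assumption: it treats `update` as an upsert (insert or overwrite by `id`), which the text above does not confirm for `LocalDocumentMemory`. `Doc` and `ToyDocumentStore` are hypothetical names, not library classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from uuid import uuid4


@dataclass
class Doc:
    """Hypothetical stand-in for `Document` with only the fields needed here."""
    text: str
    id: str = field(default_factory=lambda: str(uuid4()))


class ToyDocumentStore:
    """Dict-backed store; an illustration, not the LocalDocumentMemory implementation."""

    def __init__(self) -> None:
        self._docs: Dict[str, Doc] = {}

    def update(self, docs: List[Doc]) -> None:
        # Assumption: update acts as an upsert keyed by document id
        for doc in docs:
            self._docs[doc.id] = doc

    def get(self, doc_id: str) -> Optional[Doc]:
        return self._docs.get(doc_id)

    def __len__(self) -> int:
        return len(self._docs)


store = ToyDocumentStore()
a = Doc(text="first")
b = Doc(text="second")
store.update([a, b])
# Re-submitting a document with an existing id overwrites it instead of duplicating it
store.update([Doc(text="first, revised", id=a.id)])
print(len(store), store.get(a.id).text)
```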

