# Working with the Documents

`Document`: Represents unstructured textual data to be processed or saved into the `DocumentMemory`. It can represent a text, a text chunk, a table row, or a claim (an unstructured fact).

`DocumentList`: The DSPy type that the data processing modules and the memory use as input and output.
  
```python
import dspy
from pydantic import BaseModel, Field
from typing import Any, Dict, List, Optional, Union
from uuid import UUID, uuid4

class Document(BaseModel):
    # uuid4 returns a UUID object, so the id field accepts both UUID and str
    id: Union[UUID, str] = Field(description="Unique identifier for the document", default_factory=uuid4)
    text: str = Field(description="The actual text content of the document")
    parent_id: str = Field(description="Identifier for the parent document", default="")
    vector: Optional[List[float]] = Field(description="Vector representation of the document", default=None)
    metadata: Optional[Dict[str, Any]] = Field(description="Additional information about the document", default=None)

class DocumentList(BaseModel, dspy.Prediction):
    docs: List[Document] = Field(description="List of documents", default=[])

``` 
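The `parent_id` field is what lets a text chunk point back to the document it was cut from. The sketch below illustrates that idea under stated assumptions: `Doc` is a hypothetical dataclass stand-in that mirrors a few of the `Document` fields so the example runs without HybridAGI installed, and `split_into_chunks` is an illustrative helper, not a function of the library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional
from uuid import uuid4


@dataclass
class Doc:
    """Hypothetical stand-in mirroring some of the Document fields shown above."""
    text: str
    id: str = field(default_factory=lambda: str(uuid4()))
    parent_id: str = ""
    metadata: Optional[Dict[str, Any]] = None


def split_into_chunks(parent: Doc, chunk_size: int) -> List[Doc]:
    """Cut parent.text into fixed-size pieces that point back to the parent."""
    return [
        Doc(
            text=parent.text[start:start + chunk_size],
            parent_id=parent.id,
            # Copy the metadata so that chunks do not share one dict
            metadata=dict(parent.metadata or {}),
        )
        for start in range(0, len(parent.text), chunk_size)
    ]


book = Doc(
    text="To Kill a Mockingbird is a novel by Harper Lee.",
    metadata={"title": "To Kill a Mockingbird"},
)
chunks = split_into_chunks(book, chunk_size=20)
for chunk in chunks:
    print(repr(chunk.text), "->", chunk.parent_id == book.id)
```

Every chunk gets its own `id` while sharing the parent's identifier, so the original text can be recovered by joining the chunks that have the same `parent_id`.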


```python
import hybridagi.core.datatypes as dt

input_data = [
    {
        "title": "The Catcher in the Rye",
        "content": "The Catcher in the Rye is a novel by J. D. Salinger, partially published in serial form in 1945–1946 and as a novel in 1951. It is widely considered one of the greatest American novels of the 20th century. The novel's protagonist, Holden Caulfield, has become an icon for teenage rebellion and angst. The novel also deals with complex issues of innocence, identity, belonging, loss, and connection."
    },
    {
        "title": "To Kill a Mockingbird",
        "content": "To Kill a Mockingbird is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature. The plot and characters are loosely based on the author's observations of her family and neighbors, as well as on an event that occurred near her hometown in 1936, when she was 10 years old. The novel is renowned for its sensitivity and depth in addressing racial injustice, class, gender roles, and destruction of innocence."
    }
]

books = dt.DocumentList()
books.docs = [dt.Document(text=d["content"], metadata={"title": d["title"]}) for d in input_data]

# The DocumentList type is used as input/output for the modules/memory

print(books)
```




```text
docs=[Document(id=UUID('3942f12b-7de3-43fb-a4c4-a3214cd27b7a'), text="The Catcher in the Rye is a novel by J. D. Salinger, partially published in serial form in 1945–1946 and as a novel in 1951. It is widely considered one of the greatest American novels of the 20th century. The novel's protagonist, Holden Caulfield, has become an icon for teenage rebellion and angst. The novel also deals with complex issues of innocence, identity, belonging, loss, and connection.", parent_id=None, vector=None, metadata={'title': 'The Catcher in the Rye'}, created_at=datetime.datetime(2024, 8, 3, 18, 21, 30, 859950)), Document(id=UUID('caaabddb-dc48-4b73-b6cd-3b4851e0123c'), text="To Kill a Mockingbird is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature. The plot and characters are loosely based on the author's observations of her family and neighbors, as well as on an event that occurred near her
```

# Working with the Facts

`Entity`: Represents an entity such as a person, object, place, or document to be processed or saved into the `FactMemory`

`Relationship`: Represents a relation to be processed or saved into the `FactMemory`

`Fact`: Represents a first-order predicate to be processed or saved into the `FactMemory`

`EntityList`: A list of entities to be processed or saved into memory

`FactList`: A list of facts to be processed or saved into memory

```python
import dspy
from datetime import datetime
from pydantic import BaseModel, Field
from typing import Any, Dict, List, Optional, Union
from uuid import UUID, uuid4

class Entity(BaseModel):
    id: Union[UUID, str] = Field(description="Unique identifier for the entity", default_factory=uuid4)
    label: str = Field(description="Label or category of the entity")
    name: str = Field(description="Name or title of the entity")
    description: Optional[str] = Field(description="Description of the entity", default=None)
    vector: Optional[List[float]] = Field(description="Vector representation of the entity", default=None)
    metadata: Optional[Dict[str, Any]] = Field(description="Additional information about the entity", default={})
    created_at: datetime = Field(description="Time when the entity was created", default_factory=datetime.now)
    
    def to_dict(self):
        if self.description is not None:
            return {"name": self.name, "label": self.label, "description": self.description, "metadata": self.metadata}
        else:
            return {"name": self.name, "label": self.label, "metadata": self.metadata}

class EntityList(BaseModel, dspy.Prediction):
    entities: List[Entity] = Field(description="List of entities", default=[])
    
    def __init__(self, **kwargs):
        BaseModel.__init__(self, **kwargs)
        dspy.Prediction.__init__(self, **kwargs)
        
    def to_dict(self):
        return {"entities": [e.to_dict() for e in self.entities]}

class Relationship(BaseModel):
    id: Union[UUID, str] = Field(description="Unique identifier for the relation", default_factory=uuid4)
    name: str = Field(description="Relationship name")
    vector: Optional[List[float]] = Field(description="Vector representation of the relationship", default=None)
    metadata: Optional[Dict[str, Any]] = Field(description="Additional information about the relationship", default={})
    created_at: datetime = Field(description="Time when the relationship was created", default_factory=datetime.now)
    
    def to_dict(self):
        return {"name": self.name, "metadata": self.metadata}

class Fact(BaseModel):
    id: Union[UUID, str] = Field(description="Unique identifier for the fact", default_factory=uuid4)
    subj: Entity = Field(description="Entity that is the subject of the fact")
    rel: Relationship = Field(description="Relation between the subject and object entities")
    obj: Entity = Field(description="Entity that is the object of the fact")
    weight: float = Field(description="The fact weight (between 0.0 and 1.0, default 1.0)", default=1.0)
    vector: Optional[List[float]] = Field(description="Vector representation of the fact", default=None)
    metadata: Optional[Dict[str, Any]] = Field(description="Additional information about the fact", default={})
    created_at: datetime = Field(description="Time when the fact was created", default_factory=datetime.now)
    
    def to_dict(self):
        return {"fact": "(:"+self.subj.label+" {name:\""+self.subj.name+"\"})-[:"+self.rel.name+"]->(:"+self.obj.label+" {name:\""+self.obj.name+"\"})"}

class FactList(BaseModel, dspy.Prediction):
    facts: List[Fact] = Field(description="List of facts", default=[])
    
    def __init__(self, **kwargs):
        BaseModel.__init__(self, **kwargs)
        dspy.Prediction.__init__(self, **kwargs)
        
    def to_dict(self):
        return {"facts": [f.to_dict() for f in self.facts]}
```
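`Fact.to_dict()` serializes a fact as a single string whose shape resembles a Cypher graph pattern: subject node, relation, object node. The sketch below rebuilds that string with f-strings to make the format easier to read. `Ent`, `Rel`, and `fact_pattern` are hypothetical stand-ins written for this illustration so that it runs without HybridAGI installed; they are not part of the library.

```python
from dataclasses import dataclass


@dataclass
class Ent:
    """Hypothetical stand-in for Entity (label and name only)."""
    label: str
    name: str


@dataclass
class Rel:
    """Hypothetical stand-in for Relationship (name only)."""
    name: str


def fact_pattern(subj: Ent, rel: Rel, obj: Ent) -> str:
    """Build the same string that Fact.to_dict() stores under the "fact" key."""
    return (
        f'(:{subj.label} {{name:"{subj.name}"}})'
        f"-[:{rel.name}]->"
        f'(:{obj.label} {{name:"{obj.name}"}})'
    )


author = Ent(label="Person", name="Harper Lee")
book = Ent(label="Book", name="To Kill a Mockingbird")
print(fact_pattern(author, Rel(name="WROTE"), book))
# (:Person {name:"Harper Lee"})-[:WROTE]->(:Book {name:"To Kill a Mockingbird"})
```

Only the entity labels and names and the relation name appear in the pattern; the `weight`, `vector`, and `metadata` fields are left out of this representation.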

## Loading CSV data

Businesses often keep their data in tables, so loading tabular data is a common need. The `CSVReader` turns each row of a CSV file into a `Document`.

Let's imagine that we have the following data in a file called `salaries_and_bonuses.csv`:

| EmployeeID | FirstName | LastName | Position | HireDate | Salary | Bonus | TotalCompensation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | John | Doe | Software Engineer | 2021-01-01 | 80000.00 | 5000.00 | 85000.00 |
| 2 | Jane | Smith | Product Manager | 2020-06-01 | 90000.00 | 10000.00 | 100000.00 |
| 3 | Michael | Johnson | Data Analyst | 2021-03-15 | 70000.00 | 2000.00 | 72000.00 |
| 4 | Emily | Davis | Marketing Manager | 2019-09-01 | 100000.00 | 15000.00 | 115000.00 |
| 5 | Robert | Brown | Sales Representative | 2021-11-15 | 60000.00 | 0.00 | 60000.00 |

```python
from hybridagi.readers import CSVReader

reader = CSVReader()
# Load the CSV file
table_rows = reader("data/salaries_and_bonuses.csv")

# Check the last document representing the last row
print(table_rows.docs[-1].text)
```

```text
EmployeeID                              5
FirstName                          Robert
LastName                            Brown
Position             Sales Representative
HireDate                       2021-11-15
Salary                            60000.0
Bonus                                 0.0
TotalCompensation                 60000.0
```
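To make the row-to-document idea concrete, here is a minimal sketch that renders each CSV row as one `column  value` line per field, using only the standard library. It is an illustration under stated assumptions, not the `CSVReader` implementation: `rows_to_texts` and `CSV_TEXT` are hypothetical names, and the data is a subset of the table above, inlined so that no file is needed.

```python
import csv
import io
from typing import List

# Two rows and five columns from the table above, inlined for the sketch.
CSV_TEXT = """EmployeeID,FirstName,LastName,Position,Salary
1,John,Doe,Software Engineer,80000.00
5,Robert,Brown,Sales Representative,60000.00
"""


def rows_to_texts(csv_text: str) -> List[str]:
    """Render each CSV row as one 'column  value' line per field."""
    texts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Pad column names to the longest one so the values line up
        width = max(len(name) for name in row)
        lines = [f"{name.ljust(width)}  {value}" for name, value in row.items()]
        texts.append("\n".join(lines))
    return texts


texts = rows_to_texts(CSV_TEXT)
print(texts[-1])
```

Unlike the reader output shown above, this sketch keeps every value as the raw string found in the file (`60000.00` rather than `60000.0`) and left-aligns the values.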


## Loading PDF data

```python
from hybridagi.readers import PDFReader

reader = PDFReader()

pdf_pages = reader("data/SpelkeKinzlerCoreKnowledge.pdf")

print(pdf_pages.docs[0].text)
```

```text
Developmental Science 10:1 (2007), pp 89–96
DOI: 10.1111/j.1467-7687.2007.00569.x
© 2007 The Authors. Journal compilation © 2007 Blackwell Publishing Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK and 
350 Main Street, Malden, MA 02148, USA.
Blackwell Publishing Ltd
Core knowledge
Elizabeth S. Spelke and Katherine D. Kinzler
Department of Psychology, Harvard University, USA 
Abstract
Human cognition is founded, in part, on four systems for representing objects, actions, number, and space. It may be based, as
well, on a ﬁfth system for representing social partners. Each system has deep roots in human phylogeny and ontogeny, and it
guides and shapes the mental lives of adults. Converging research on human infants, non-human primates, children and adults
in diverse cultures can aid both understanding of these systems and attempts to overcome their limits.
Introduction
Cognitive science has been dominated by two views of
human nature. On one view, the human mind is a ﬂexible
and adaptable m
```

## Loading textual data

```python
from hybridagi.readers import TextReader

reader = TextReader()

documents = reader("data/SynaLinks_presentation.md")

print(documents.docs[0].text)
```

```text
# SynaLinks SAS presentation

SynaLinks is a young French start-up founded in Toulouse in 2023. Our mission is to promote a responsible and pragmatic approach to general artificial intelligence. To achieve this, we integrate deep learning models with symbolic artificial intelligence models, the traditional domain of AI before the era of deep learning.

At SynaLinks, our approach aims to combine the efficiency of deep learning models with the transparency and explicability of symbolic models, thus creating more robust and ethical artificial intelligence systems. We work on cutting-edge technologies that enable businesses to fully harness the potential of AI while retaining significant control over their systems, reducing the risks associated with opacity and dependence on deep learning algorithms.

We work closely with our clients to customize our solutions to meet their specific needs. Our neuro-symbolic approach offers the flexibility necessary to address the diverse requirements of b
```