<img align="left" src="imgs/fonduer-logo.png" width="100px" style="margin-right:20px">

# Tutorial: Navigating the Fonduer Data Model

## Running locally?

If you're running this tutorial interactively on your own machine, you'll need to create a new PostgreSQL database named `intro_data_model`.

If the database `intro_data_model` already exists in your PostgreSQL installation, uncomment the first line below to drop it. If you have not done so already, download our database snapshots by executing `./download_data.sh` in the intro tutorial directory.

In [1]:
#! dropdb --if-exists intro_data_model
! createdb intro_data_model
! psql intro_data_model < data/intro_data_model.sql > /dev/null

# The Fonduer Data Model API
_Complete Fonduer API documentation is available on [Read the Docs](https://fonduer.readthedocs.io)_

## The Fonduer Data Model
The Fonduer Data Model serves two high-level purposes. First, it allows Fonduer to capture the wide variety of richly formatted data in a unified representation. Second, it allows users to provide multimodal supervision that leverages document-level context. Nearly everything in the Fonduer pipeline uses information stored in the Fonduer Data Model. The Fonduer Data Model is hierarchical, as shown below.

<img src="imgs/dag.png" width="250px">

Each box represents a `Context` object, a unit that a document can be broken down into. For example, a `Sentence` can be part of a `Paragraph` in a `Cell` in a `Table`. The default `Context` objects provided by Fonduer are shown above.
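
The parent/child relationship can be pictured with a small stand-in written in plain Python. This is only a sketch: the `Node` class and its fields are hypothetical and are not Fonduer's actual `Context` implementation, which is backed by database tables. It models just the one chain named in the example and omits the other levels in the figure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a Context: a kind label plus parent/children links."""
    kind: str
    # Exclude the back-reference from repr/eq so the cycle is never followed.
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)

    def add(self, kind: str) -> "Node":
        """Create a child of the given kind and attach it to this node."""
        child = Node(kind, parent=self)
        self.children.append(child)
        return child

    def ancestors(self) -> List[str]:
        """Return the kinds of all enclosing nodes, nearest first."""
        kinds, node = [], self.parent
        while node is not None:
            kinds.append(node.kind)
            node = node.parent
        return kinds


# Build the chain from the example: a Sentence in a Paragraph in a Cell in a Table.
document = Node("Document")
sentence = document.add("Table").add("Cell").add("Paragraph").add("Sentence")

print(sentence.ancestors())  # ['Paragraph', 'Cell', 'Table', 'Document']
```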

To explore the data model, we first load the Fonduer `Meta` class, which creates a connection with our PostgreSQL database.

In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import sys
import logging
from pprint import pprint

from fonduer.parser.models import Document, Sentence, Table, Figure
from fonduer import Meta

# Configure logging for Fonduer
logging.basicConfig(stream=sys.stdout, format='[%(levelname)s] %(name)s - %(message)s')
log = logging.getLogger('fonduer')
log.setLevel(logging.INFO)

ATTRIBUTE = "intro_data_model"
conn_string = 'postgresql://localhost:5432/' + ATTRIBUTE

session = Meta.init(conn_string).Session()

# Print metadata about the tutorial corpus
print("Num Docs: {}".format(session.query(Document).count()))
print("Num Sentences: {}".format(session.query(Sentence).count()))
print("Num Tables: {}".format(session.query(Table).count()))
print("Num Figures: {}".format(session.query(Figure).count()))

[INFO] fonduer.meta - Connecting user:None to localhost:5432/intro_data_model
[INFO] fonduer.meta - Initializing the storage schema
Num Docs: 10
Num Sentences: 5548
Num Tables: 93
Num Figures: 229
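
The first log line above reports `user:None` because the connection string names no user. The parts of such a URL can be inspected with Python's standard library. The sketch below only illustrates the URL layout; it is not how Fonduer itself parses the string, and the user name `alice` is a made-up example.

```python
from urllib.parse import urlsplit

# Same connection string as in the cell above.
conn_string = "postgresql://localhost:5432/intro_data_model"

parts = urlsplit(conn_string)
print("user:", parts.username)  # None, because no user is given
print("host:", parts.hostname)
print("port:", parts.port)
print("database:", parts.path.lstrip("/"))

# A URL that names a user places it before the host, separated by "@".
# "alice" is a hypothetical user name.
with_user = urlsplit("postgresql://alice@localhost:5432/intro_data_model")
print("user:", with_user.username)
```

Whether your own PostgreSQL setup needs a user or password in the string depends on how it is configured locally.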


### Parsing Documents into the Data Model

Parsing input documents that contain richly formatted data into Fonduer's data model typically consists of two steps. First, the documents are run through a [Preprocessor](http://fonduer.readthedocs.io/en/latest/user/preprocessors.html). Second, they are fed through the Fonduer Parser, which turns them into the directed acyclic graph shown above. For more information about parsing documents, please check out our [end-to-end tutorials](../hardware/max_storage_temp_tutorial.ipynb). The rest of this tutorial assumes that your documents have already been parsed, and instead focuses on the data model itself.


### Navigating the Data Model

With the `session` object, you can query for any of the `Context` objects in the data model using the [SQLAlchemy Query API](http://docs.sqlalchemy.org/en/latest/orm/query.html). For example, to get a list of all of the Documents in the database, we can issue the following query, which returns them ordered by name.

In [3]:
docs = session.query(Document).order_by(Document.name).all()

pprint(docs)

[Document 112823,
 Document 2N3906,
 Document 2N3906-D,
 Document 2N4123-D,
 Document 2N4124,
 Document 2N6426-D,
 Document 2N6427,
 Document AUKCS04635-1,
 Document BC182,
 Document BC182-D]
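
An `order_by` clause in the ORM corresponds to an SQL `ORDER BY`. The sketch below shows the idea with Python's built-in `sqlite3` module and an in-memory table. The table name `document` and its single `name` column are assumptions made for illustration, not Fonduer's actual schema, and the tutorial itself runs against PostgreSQL.

```python
import sqlite3

# In-memory database with a hypothetical one-column table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (name TEXT)")

# A few of the document names from the output above, inserted out of order.
names = ["BC182", "2N3906-D", "112823", "AUKCS04635-1", "2N3906"]
conn.executemany("INSERT INTO document (name) VALUES (?)", [(n,) for n in names])

# ORDER BY name sorts the rows as text.
rows = conn.execute("SELECT name FROM document ORDER BY name").fetchall()
ordered = [row[0] for row in rows]
print(ordered)
conn.close()
```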


Similarly, to inspect all of the parsed Sentences, we can issue the following query, which orders them by `position`.

In [4]:
sentences = session.query(Sentence).order_by(Sentence.position).all()
pprint(sentences[:5])

[Sentence (Doc: '112823', Sec: 0, Par: 0, Idx: 0, Text: 'BC546'),
 Sentence (Doc: 'BC182', Sec: 0, Par: 0, Idx: 0, Text: 'BC182 NPN General Purpose Amplifier'),
 Sentence (Doc: '2N4123-D', Sec: 0, Par: 0, Idx: 0, Text: '2n4123.rev4'),
 Sentence (Doc: '2N3906', Sec: 0, Par: 0, Idx: 0, Text: '2N3906 MMBT3906 PZT3906 - PNP General-Purpose Amplifier'),
 Sentence (Doc: '2N3906-D', Sec: 0, Par: 0, Idx: 0, Text: '2N3906 - General Purpose Transistors, PNP Silicon')]
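
The output above interleaves documents, which suggests that `position` restarts at 0 within each document (all five results show `Idx: 0`). The effect of the sort key can be reproduced with plain tuples. The `(document, position)` pairs below are made-up stand-ins, not rows read from the database.

```python
# Hypothetical (document name, sentence position) pairs.
rows = [
    ("BC182", 1),
    ("112823", 0),
    ("BC182", 0),
    ("112823", 1),
]

# Sorting by position alone puts every position-0 sentence first.
by_position = sorted(rows, key=lambda r: r[1])

# Sorting by document, then position, keeps each document's sentences together.
by_doc_then_position = sorted(rows, key=lambda r: (r[0], r[1]))

print(by_position)
print(by_doc_then_position)
```

SQLAlchemy's `order_by` accepts more than one column, so a query ordered by `Sentence.document_id` and then `Sentence.position` should group the sentences per document in the same way.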


In addition to simply querying for specific `Context` objects in the database, we can also navigate the data model hierarchy from each `Context` object. The core `Context` objects are described on [Read the Docs](https://fonduer.readthedocs.io/en/stable/user/data_model.html). Let's see some examples.

#### Ex. 1: Getting a Sentence's Document

In [5]:
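# Note: sentences[0] belongs to document 112823, so despite its name,
# the variable bc182 holds that document.
bc182 = sentences[0].document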
print(bc182)

Document 112823


#### Ex. 2: Iterate over all of a Document's Sentences

In [6]:
pprint([sentence for sentence in bc182.sentences][:10])

[Sentence (Doc: '112823', Sec: 0, Par: 0, Idx: 0, Text: 'BC546'),
 Sentence (Doc: '112823', Sec: 0, Par: 1, Idx: 1, Text: 'BC546B, BC547A, B, C, BC548B, C'),
 Sentence (Doc: '112823', Sec: 0, Par: 2, Idx: 2, Text: 'Amplifier Transistors'),
 Sentence (Doc: '112823', Sec: 0, Par: 3, Idx: 3, Text: 'NPN Silicon'),
 Sentence (Doc: '112823', Sec: 0, Par: 4, Idx: 4, Text: 'Features'),
 Sentence (Doc: '112823', Sec: 0, Par: 5, Idx: 5, Text: 'Pb-Free Packages are Available*'),
 Sentence (Doc: '112823', Sec: 0, Par: 6, Idx: 6, Text: 'http://onsemi.com'),
 Sentence (Doc: '112823', Sec: 0, Par: 7, Idx: 7, Text: 'MAXIMUM RATINGS'),
 Sentence (Doc: '112823', Sec: 0, Par: 8, Idx: 8, Text: '2'),
 Sentence (Doc: '112823', Sec: 0, Par: 9, Idx: 9, Text: 'BASE')]


#### Ex. 3: Find the first Sentence in the first Table in the Document

In [7]:
print(bc182.tables[0].sentences[0])

Sentence (Doc: '112823', Table: 0, Row: 0, Col: 0, Index: 11, Text: 'Rating')


#### Ex. 4: Inspect the HTML attributes of a Sentence

In [8]:
pprint(bc182.tables[0].sentences[0].html_attrs)

['class=s4',
 'style=padding-top: 2pt;text-indent: 0pt;text-align: center; color: black; '
 'font-family:Arial, sans-serif; font-style: normal; font-weight: bold; '
 'text-decoration: none; font-size: 8pt; ']
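
Each entry in the list appears to be a `name=value` string. Assuming that format holds, the attributes can be turned into dictionaries with a few lines of plain Python. This is a minimal sketch and not part of the Fonduer API; the list is copied from the output above.

```python
# The attribute strings printed above, copied verbatim.
html_attrs = [
    "class=s4",
    "style=padding-top: 2pt;text-indent: 0pt;text-align: center; color: black; "
    "font-family:Arial, sans-serif; font-style: normal; font-weight: bold; "
    "text-decoration: none; font-size: 8pt; ",
]

# Split each entry on the first "=" to get a name -> value mapping.
attrs = dict(entry.split("=", 1) for entry in html_attrs)

# Break the inline style into individual CSS declarations.
style = {}
for declaration in attrs.get("style", "").split(";"):
    if ":" in declaration:
        prop, value = declaration.split(":", 1)
        style[prop.strip()] = value.strip()

print(attrs["class"])         # s4
print(style["font-weight"])   # bold
```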


#### Ex. 5: Listing the attributes of a Context Object

If you forget the API and want to inspect the attributes of a particular `Context` object while working interactively, you can call `dir()` on it for a full list. Otherwise, we recommend referring to [Read the Docs](https://fonduer.readthedocs.io/en/stable/user/data_model.html).


In [9]:
dir(bc182.tables[0].sentences[0])

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__mapper__',
 '__mapper_args__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__table__',
 '__table_args__',
 '__tablename__',
 '__weakref__',
 '_asdict',
 '_decl_class_registry',
 '_sa_class_manager',
 '_sa_instance_state',
 'abs_char_offsets',
 'bottom',
 'cell',
 'cell_id',
 'char_offsets',
 'col_end',
 'col_start',
 'dep_labels',
 'dep_parents',
 'document',
 'document_id',
 'html_attrs',
 'html_tag',
 'id',
 'implicit_spans',
 'is_cellular',
 'is_lingual',
 'is_structural',
 'is_tabular',
 'is_visual',
 'left',
 'lemmas',
 'metadata',
 'ner_tags',
 'page',
 'paragraph',
 'paragraph_id',
 'pos_tags',
 'position',
 'right',
 'row_end',
 'row_start',
 'section',
 'section_id',
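
Most of the names at the top of this listing are Python or SQLAlchemy internals. A small helper can hide names that begin with an underscore. The `Example` class and `public_attrs` function below are hypothetical and only demonstrate the filter.

```python
class Example:
    """Hypothetical stand-in object; any Python object works the same way."""

    def __init__(self):
        self.page = 1
        self._cache = {}


def public_attrs(obj):
    """Return attribute names that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


print(public_attrs(Example()))  # ['page']
```

Applied to the Sentence inspected above, the filtered list should begin at `abs_char_offsets`.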

#### Ex. 6: Inspect the Cell containing a particular Sentence

In [10]:
print(bc182.tables[0].sentences[0].cell)

Cell(Doc: 112823, Table: 0, Row: (0,), Col: (0,), Pos: 0)


#### Ex. 7: Iterate over all the Cells in a particular row

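Note that Cells in a Table can span multiple rows or columns. Thus, each Cell's row and column are given as `(row_start, row_end)` and `(col_start, col_end)`; in the output below, a Cell that does not span is printed with a single-element tuple such as `(1,)`.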

In [11]:
table = bc182.tables[0].sentences[0].cell.table
pprint([cell for cell in table.cells if cell.row_start == 1])

[Cell(Doc: 112823, Table: 0, Row: (1,), Col: (0,), Pos: 4),
 Cell(Doc: 112823, Table: 0, Row: (1, 3), Col: (1,), Pos: 5),
 Cell(Doc: 112823, Table: 0, Row: (1,), Col: (2,), Pos: 6),
 Cell(Doc: 112823, Table: 0, Row: (1, 3), Col: (3,), Pos: 7)]
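
Filtering on `row_start` alone finds only the Cells that begin in a row; a Cell that begins higher up and spans down into the row is missed. The sketch below shows a span-aware check. It uses a hypothetical `Span` namedtuple in place of a real `Cell`, and it assumes that `row_end` is inclusive and equals `row_start` for a Cell that does not span.

```python
from collections import namedtuple

# Hypothetical stand-in exposing the same span attributes as a Cell.
Span = namedtuple("Span", ["row_start", "row_end", "col_start", "col_end"])


def covers_row(cell, row):
    """True if the cell's row span includes the given row (inclusive bounds assumed)."""
    return cell.row_start <= row <= cell.row_end


cells = [
    Span(row_start=1, row_end=1, col_start=0, col_end=0),  # single cell in row 1
    Span(row_start=1, row_end=3, col_start=1, col_end=1),  # spans rows 1 to 3
    Span(row_start=4, row_end=4, col_start=0, col_end=0),  # outside rows 1 to 3
]

# No cell starts in row 2, but the spanning cell still covers it.
starting_in_row_2 = [c for c in cells if c.row_start == 2]
covering_row_2 = [c for c in cells if covers_row(c, 2)]

print(len(starting_in_row_2), len(covering_row_2))  # 0 1
```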


## Summary
The Fonduer Data Model is a hierarchical representation of the input documents. Using the data model APIs, you can traverse anywhere in the data model and inspect the attributes of your data. Visit [Fonduer's Documentation](https://fonduer.readthedocs.io/en/stable/user/data_model.html) for a reference of the API.