# API Usage (Automation)
The GUI is designed to demonstrate what is possible. To see how something in the GUI was accomplished, find the corresponding tab in `./ipypdf/widgets/node_tools.py`, where all of the tools are defined. For example, the `AutoTools` tab visible in the GUI is a class called `AutoTools` defined in `node_tools.py`.

This notebook walks through some common interactions:
* Layout extraction
* Cropping
* Table parsing
* Raw text extraction

In [None]:
from pathlib import Path
dir_name = Path("../tests/fixture_data/sample_pdfs")
fname = dir_name / "doc.pdf"

You can call the `parse_layout` or `get_text_blocks` utility functions directly on a pdf without needing to load the ipypdf widget.

### Parse Layout
This function iterates through each page of the document (you can limit this with the start/stop args) and passes the rendered images through the PaddlePaddle model to determine bbox types and boundaries. It returns an iterator of lists, one list per page. The elements are the default `TextBlock` objects returned by layoutparser with some extra convenience attributes (`relative_coordinates` and `text`).


The cell below complains about needing to initialize the model. Either of the following options will stop the complaints. If you don't mind reloading the model every time, you can safely ignore them.
1. Initialize the model beforehand
```python
import layoutparser as lp
from ipypdf.utils.lp_util import parse_layout

# Load the model once and reuse it for every call
model = lp.models.PaddleDetectionLayoutModel("lp://PubLayNet/ppyolov2_r50vd_dcn_365e/config")
blocks = list(parse_layout(fname, model))
```
2. Suppress the warning
```python
blocks = list(parse_layout(fname, ignore_warning=True))
```

In [None]:
from ipypdf.utils.lp_util import parse_layout
blocks = list(parse_layout(fname))

In [None]:
# blocks[page][index].attribute
b = blocks[0][0]
print(f"{b.type}: {b.text}")
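
Before cropping anything, it can help to see which block types the model actually found. The following is a minimal sketch that tallies the `type` attribute over the nested `blocks[page][index]` structure. It uses stand-in objects built with `SimpleNamespace` instead of real `TextBlock` instances, and `count_block_types` is a hypothetical helper, not part of ipypdf.

```python
from collections import Counter
from types import SimpleNamespace


def count_block_types(pages):
    """Count how many blocks of each type appear across all pages."""
    return Counter(block.type for page in pages for block in page)


# Stand-in data shaped like the parse_layout output: one list of blocks per page.
# Real blocks are layoutparser TextBlock objects; only `.type` is used here.
demo_pages = [
    [SimpleNamespace(type="Title", text="Report"), SimpleNamespace(type="Text", text="Intro")],
    [SimpleNamespace(type="Table", text=""), SimpleNamespace(type="Text", text="Notes")],
]

type_counts = count_block_types(demo_pages)
print(type_counts.most_common())
```

With real data, `count_block_types(blocks)` should work the same way as long as each element exposes a `type` attribute.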

### Crop out the original rendered section
From the `coordinates` attribute, you can crop out the portion of the document corresponding to the text block.

> Note: By default, the `ImageContainer` object renders the PDF at 300 dpi. If this scaling changes, the PIL coordinates will be wrong.<br>
This is the reasoning behind `relative_coordinates`.
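
To illustrate why resolution-independent coordinates are useful, here is a rough sketch that scales a fractional box to pixel coordinates for a page rendered at different resolutions. It assumes `relative_coordinates` is an `(x1, y1, x2, y2)` tuple of fractions of the page width and height; that layout is an assumption, not something confirmed here, and `to_pixel_box` is a hypothetical helper.

```python
def to_pixel_box(relative_box, page_width, page_height):
    """Scale a fractional (x1, y1, x2, y2) box to integer pixel coordinates."""
    x1, y1, x2, y2 = relative_box
    return (
        round(x1 * page_width),
        round(y1 * page_height),
        round(x2 * page_width),
        round(y2 * page_height),
    )


# Assumed example: a US Letter page (8.5 x 11 in) rendered at 300 dpi and at 100 dpi.
box = (0.1, 0.2, 0.5, 0.4)
box_300dpi = to_pixel_box(box, 2550, 3300)
box_100dpi = to_pixel_box(box, 850, 1100)
print(box_300dpi)
print(box_100dpi)
```

The same fractional box maps to a valid crop region at either resolution, whereas absolute pixel coordinates are only valid for the resolution they were computed at.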

In [None]:
from ipypdf.utils.image_utils import ImageContainer

imgs = ImageContainer(fname)  # Render the pages
im = imgs[0].crop(b.coordinates)  # Crop the section out of the page
im.resize((im.width // 3, im.height // 3))  # Downscale to a third for display

### Table Parsing Example

In [None]:
tables = []

# Collect a cropped image for every block classified as a "Table"
for page, page_blocks in enumerate(blocks):
    for block in page_blocks:
        if block.type == "Table":
            # Crop out the table
            tables.append(
                imgs[page].crop(block.coordinates)
            )
im = tables[0]
im.resize((im.width//3,im.height//3))

In [None]:
from ipypdf.utils.table_extraction import img_2_table
# Parse the table using the img_2_table utility function
img_2_table(tables[0])

In [None]:
import pandas as pd

rows = img_2_table(tables[0], no_coords=True)
pd.DataFrame(rows)
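
If the first parsed row holds the column names, it can be promoted to a header before building the DataFrame. The sketch below assumes `rows` is a list of rows, each a list of cell strings; the actual return shape of `img_2_table(..., no_coords=True)` is not confirmed here, and `rows_to_records` is a hypothetical helper.

```python
def rows_to_records(rows):
    """Treat the first row as column names and return one dict per remaining row."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Stand-in rows, assumed to resemble a parsed table.
demo_rows = [
    ["Name", "Qty"],
    ["Bolt", "12"],
    ["Nut", "40"],
]
records = rows_to_records(demo_rows)
print(records)
```

Passing the result to `pd.DataFrame(records)` would then give a frame with named columns instead of integer column labels.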

### Get Text Blocks
This passes each page through Tesseract to get text boxes. The text blocks are indexed the same way as the `LP` results, but in this case each block is just a dictionary.


In [None]:
from ipypdf.utils.tess_utils import get_text_blocks
[[b["value"] for b in page] for page in get_text_blocks(fname)]
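
To get one string per page instead of a nested list, the block values can be joined. This is a minimal sketch using stand-in dictionaries that carry only the `"value"` key used above; any other keys in the real blocks are ignored, and `page_texts` is a hypothetical helper.

```python
def page_texts(pages, sep=" "):
    """Join the non-empty "value" strings of each page into a single string."""
    return [
        sep.join(block["value"] for block in page if block["value"].strip())
        for page in pages
    ]


# Stand-in data shaped like the get_text_blocks output: one list of dicts per page.
demo_text_pages = [
    [{"value": "Hello"}, {"value": "  "}, {"value": "world"}],
    [{"value": "Second page"}],
]
texts = page_texts(demo_text_pages)
print(texts)
```

With real data, `page_texts(get_text_blocks(fname))` should behave the same way, provided every block has a string under `"value"`.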