# Grobid demo

## Content
- Process via command line (batch mode)
- Process programmatically (single PDF documents)
- Extract coordinates programmatically




In [13]:
!pip install -U git+https://github.com/kermitt2/grobid_client_python

Collecting git+https://github.com/kermitt2/grobid_client_python
  Cloning https://github.com/kermitt2/grobid_client_python to /private/var/folders/mk/scd8428n18jfgh3jdthbvpz00000gn/T/pip-req-build-f47yesj3
  Running command git clone --filter=blob:none --quiet https://github.com/kermitt2/grobid_client_python /private/var/folders/mk/scd8428n18jfgh3jdthbvpz00000gn/T/pip-req-build-f47yesj3
  Resolved https://github.com/kermitt2/grobid_client_python to commit 7232dcc4d9aa967aa3d7dda975df9e559210d814
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


## Process a PDF file



A PDF file can be processed either from the command line (batch mode) or programmatically.



In [14]:
!grobid_client --help

usage: grobid_client [-h] [--input INPUT] [--output OUTPUT] [--config CONFIG]
                     [--n N] [--generateIDs] [--consolidate_header]
                     [--consolidate_citations] [--include_raw_citations]
                     [--include_raw_affiliations] [--force] [--teiCoordinates]
                     [--segmentSentences] [--verbose] [--flavor FLAVOR]
                     service

Client for GROBID services

positional arguments:
  service               one of ['processFulltextDocumentBlank',
                        'processFulltextDocument', 'processHeaderDocument',
                        'processReferences', 'processCitationList',
                        'processCitationPatentST36',
                        'processCitationPatentPDF']

options:
  -h, --help            show this help message and exit
  --input INPUT         path to the directory containing files to process: PDF
                        or .txt (for processCitationList only, one refere

In [15]:
!grobid_client --input samples --output output/standard processFulltextDocument --verbose

GROBID server is up and running
error503_snapshot.pdf
journal.pcbi.1011775.pdf
PIIS0720048X22000304.pdf
3 files to process in current batch
Adding samples/letters/error503_snapshot.pdf to the queue.
Adding samples/articles/journal.pcbi.1011775.pdf to the queue.
Adding samples/erratum/PIIS0720048X22000304.pdf to the queue.
runtime: 36.362 seconds 


In [16]:
!grobid_client --input samples --output output/standard+coords processFulltextDocument --segmentSentences --generateIDs --teiCoordinates

GROBID server is up and running
runtime: 9.217 seconds 


In [19]:
!grobid_client --input samples --output output/light+coords processFulltextDocument --segmentSentences --generateIDs --teiCoordinates --flavor article/light-ref

GROBID server is up and running
output/light+coords/letters/error503_snapshot.grobid.tei.xml already exist, skipping... (use --force to reprocess pdf input files)
output/light+coords/articles/journal.pcbi.1011775.grobid.tei.xml already exist, skipping... (use --force to reprocess pdf input files)
output/light+coords/erratum/PIIS0720048X22000304.grobid.tei.xml already exist, skipping... (use --force to reprocess pdf input files)
runtime: 0.006 seconds 
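
The skip messages above suggest that each input PDF is mirrored into the output directory as `<name>.grobid.tei.xml`, keeping its sub-directory. The following is a minimal sketch of that mapping, useful for checking which files would still be processed on the next run. The naming rule is inferred from the log lines only, and `expected_tei_path` and `pending_pdfs` are hypothetical helpers, not part of `grobid_client`.

```python
from pathlib import Path


def expected_tei_path(pdf_path, input_dir, output_dir):
    """Return where the TEI for pdf_path would be written (naming inferred from the log)."""
    relative = Path(pdf_path).relative_to(input_dir)
    # "<name>.pdf" becomes "<name>.grobid.tei.xml"; the sub-directory is preserved.
    return Path(output_dir) / relative.parent / (relative.stem + ".grobid.tei.xml")


def pending_pdfs(input_dir, output_dir):
    """List the PDFs under input_dir that have no TEI file in output_dir yet."""
    return sorted(
        pdf
        for pdf in Path(input_dir).rglob("*.pdf")
        if not expected_tei_path(pdf, input_dir, output_dir).exists()
    )


print(expected_tei_path("samples/letters/error503_snapshot.pdf", "samples", "output/light+coords"))
```

If `pending_pdfs("samples", "output/light+coords")` returns an empty list, the batch command has nothing left to do unless `--force` is passed.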


## Process programmatically

In [20]:
from grobid_client.grobid_client import GrobidClient

grobid_client = GrobidClient(
    grobid_server="https://lfoppiano-grobid-dev.hf.space",
    batch_size=1000,
    coordinates=["p", "s", "persName", "biblStruct", "figure", "formula", "head", "note", "title", "ref", "affiliation"],
    sleep_time=5,
    timeout=240,
    check_server=True
)

pdf_file, status, text = grobid_client.process_pdf(
    "processFulltextDocument",
    "samples/articles/journal.pcbi.1011775.pdf",
    consolidate_header=True,
    consolidate_citations=False,
    segment_sentences=True,
    tei_coordinates=True,
    include_raw_citations=False,
    include_raw_affiliations=False,
    generateIDs=True
)

status, text[:1000]

GROBID server is up and running


(200,
 '<?xml version="1.0" encoding="UTF-8"?>\n<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" \nxmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" \nxsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"\n xmlns:xlink="http://www.w3.org/1999/xlink">\n\t<teiHeader xml:lang="en">\n\t\t<fileDesc>\n\t\t\t<titleStmt>\n\t\t\t\t<title level="a" type="main" xml:id="_Vv6KduP" coords="1,200.01,111.81,317.08,15.30;1,200.01,134.32,343.79,15.30;1,200.01,156.83,67.87,15.30">Inferring country-specific import risk of diseases from the world air transportation network</title>\n\t\t\t\t<funder ref="#_xqahcry">\n\t\t\t\t\t<orgName type="full">Carlsberg Foundation</orgName>\n\t\t\t\t</funder>\n\t\t\t\t<funder ref="#_g2zgeJm">\n\t\t\t\t\t<orgName type="full">Joachim Herz Stiftung</orgName>\n\t\t\t\t</funder>\n\t\t\t\t<funder>\n\t\t\t\t\t<orgName type="full">Germany&apos;s Federal Ministry of Health</org
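
The response body is a TEI XML document, so individual fields can be read with any XML parser. The sketch below uses the standard-library `xml.etree.ElementTree` on a hand-written fragment that mimics the header shown above; it is not real GROBID output, and `read_header` is a hypothetical helper introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by the TEI elements in the response above.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written fragment shaped like the response above (assumption: simplified).
sample_tei = """<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main" coords="1,200.01,111.81,317.08,15.30">Inferring country-specific import risk of diseases from the world air transportation network</title>
        <funder><orgName type="full">Carlsberg Foundation</orgName></funder>
        <funder><orgName type="full">Joachim Herz Stiftung</orgName></funder>
      </titleStmt>
    </fileDesc>
  </teiHeader>
</TEI>"""


def read_header(tei_text):
    """Return the main title and the list of funder names from a TEI document."""
    # Parse from bytes so the parser applies the declared encoding itself.
    root = ET.fromstring(tei_text.encode("utf-8"))
    title = root.find(".//tei:titleStmt/tei:title[@type='main']", TEI_NS)
    funders = [org.text for org in root.findall(".//tei:funder/tei:orgName", TEI_NS)]
    return (title.text if title is not None else None), funders


title, funders = read_header(sample_tei)
print(title)
print(funders)
```

With the client output, the same function could be called as `read_header(text)`.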

In [None]:
from bs4 import BeautifulSoup

COLORS = {
    "persName": "rgba(0, 0, 255, 1)",  # Blue
    "s": "rgba(0, 128, 0, 1)",  # Green
    "p": "rgba(0, 100, 0, 1)",  # Dark Green
    "ref": "rgba(255, 255, 0, 1)",  # Yellow
    "biblStruct": "rgba(139, 0, 0, 1)",  # Dark Red
    "head": "rgba(139, 139, 0, 1)",  # Dark Yellow
    "formula": "rgba(255, 165, 0, 1)",  # Orange
    "figure": "rgba(165, 42, 42, 1)",  # Brown
    "title": "rgba(255, 0, 0, 1)",  # Red
    "affiliation": "rgba(255, 165, 0, 1)"  # Orange (same as formula)
}


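def get_color(name, transparent):
    """Return the RGBA color for a TEI element name, optionally at 40% opacity."""
    # Fall back to gray for element types without a dedicated color.
    color = COLORS.get(name, "rgba(128, 128, 128, 1)")
    if transparent:
        # Every color ends in ", 1)", so this swaps only the alpha channel.
        color = color.replace(", 1)", ", 0.4)")

    return color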


class GrobidProcessor:
    def __init__(self, grobid_client):
        self.grobid_client = grobid_client

    def process_structure(self, input_path) -> tuple[list, int]:
        pdf_file, status, text = self.grobid_client.process_pdf(
            "processFulltextDocument",
            input_path,
            consolidate_header=True,
            consolidate_citations=False,
            segment_sentences=True,
            tei_coordinates=True,
            include_raw_citations=False,
            include_raw_affiliations=False,
            generateIDs=True
        )

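        if status != 200:
            # Fail loudly: a bare return would break callers that unpack the result.
            raise RuntimeError(f"GROBID returned status {status} for {input_path}")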

        coordinates = self.get_coordinates(text)
        pages = self.get_pages(text)

        return coordinates, len(pages)

    @staticmethod
    def box_to_dict(box, color=None, type=None):
        """Convert one box, given as [page, x, y, width, height], into a dictionary."""
        item = {"page": box[0], "x": box[1], "y": box[2], "width": box[3], "height": box[4]}
        if color is not None:
            item['color'] = color

        if type:
            item['type'] = type

        return item

    def get_coordinates(self, text):
        soup = BeautifulSoup(text, 'xml')
        all_blocks_with_coordinates = soup.find_all(coords=True)

        # if use_sentences:
        #     all_blocks_with_coordinates = filter(lambda b: b.name != "p", all_blocks_with_coordinates)

        coordinates = []
        # Alternate the opacity between consecutive blocks so that neighbours can be told apart.
        for block_id, block in enumerate(all_blocks_with_coordinates):
            # An element may span several boxes separated by ";"; skip empty fragments.
            for box in filter(None, block['coords'].split(";")):
                coordinates.append(
                    self.box_to_dict(
                        box.split(","),
                        get_color(block.name, block_id % 2 == 0),
                        type=block.name
                    ),
                )
        return coordinates

    def get_pages(self, text):
        soup = BeautifulSoup(text, 'xml')
        pages_infos = soup.find_all("surface")

        pages = [{'width': float(page['lrx']) - float(page['ulx']), 'height': float(page['lry']) - float(page['uly'])}
                 for page in pages_infos]

        return pages
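
Each `coords` attribute holds one or more boxes separated by `;`, and `box_to_dict` above reads every box as `page,x,y,width,height`. The class keeps these values as strings; the sketch below shows one way to turn them into numbers, using the title coordinates from the response shown earlier. `parse_coords` is a hypothetical helper, not part of the client.

```python
def parse_coords(coords):
    """Split a TEI coords attribute into a list of numeric bounding boxes."""
    boxes = []
    for fragment in coords.split(";"):
        if not fragment:
            continue  # tolerate an empty fragment, e.g. after a trailing ";"
        # Field order follows box_to_dict: page, x, y, width, height.
        page, x, y, width, height = fragment.split(",")
        boxes.append({
            "page": int(page),
            "x": float(x),
            "y": float(y),
            "width": float(width),
            "height": float(height),
        })
    return boxes


# Coordinates of the <title> element, which wraps over three lines on page 1.
title_coords = "1,200.01,111.81,317.08,15.30;1,200.01,134.32,343.79,15.30;1,200.01,156.83,67.87,15.30"
boxes = parse_coords(title_coords)
print(len(boxes), boxes[0])
```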



In [None]:
processor = GrobidProcessor(grobid_client)

coordinates, page_count = processor.process_structure("samples/articles/journal.pcbi.1011775.pdf")

In [None]:
processor.get_coordinates(text)
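
The returned list is flat, with one dictionary per box, so a quick summary helps to see what was extracted. The sketch below counts boxes per element type and per page. The `sample` list is made up for illustration and only mirrors the shape produced by `box_to_dict`; `summarise` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical sample shaped like the dictionaries built by box_to_dict
# (values are strings because they come straight from str.split).
sample = [
    {"page": "1", "x": "200.01", "y": "111.81", "width": "317.08", "height": "15.30",
     "color": "rgba(255, 0, 0, 0.4)", "type": "title"},
    {"page": "1", "x": "200.01", "y": "134.32", "width": "343.79", "height": "15.30",
     "color": "rgba(255, 0, 0, 0.4)", "type": "title"},
    {"page": "2", "x": "36.00", "y": "90.00", "width": "500.00", "height": "12.00",
     "color": "rgba(0, 128, 0, 1)", "type": "s"},
]


def summarise(coordinates):
    """Count boxes per element type and per page number."""
    by_type = Counter(item["type"] for item in coordinates)
    by_page = Counter(int(item["page"]) for item in coordinates)
    return by_type, by_page


by_type, by_page = summarise(sample)
print(dict(by_type))
print(dict(by_page))
```

On the real data, `summarise(coordinates)` would take the list returned by `process_structure` or `get_coordinates`.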