# Document Metadata Extraction

## Getting started


### Set notebook parameters

In [1]:
from dsnotebooks.settings import ProjectNotebookSettings

# notebook settings auto-loaded from .env / env vars
notebook_settings = ProjectNotebookSettings()

PROFILE_NAME = notebook_settings.profile  # the profile to use
PROJ_KEY = notebook_settings.proj_key     # the project to use

# default project_key = 1234567890abcdefghijklmnopqrstvwyz123456

Project key:  1234567890abcdefghijklmnopqrstvwyz123456


### Import example dependencies

In [2]:
import os
import json

import textwrap

import pandas as pd

import deepsearch as ds

from pathlib import Path
from zipfile import ZipFile

from deepsearch.documents.core.export import export_to_markdown
from IPython.display import display, Markdown, HTML, display_html

from deepsearch_glm.utils.load_pretrained_models import load_pretrained_nlp_models

from deepsearch_glm.nlp_utils import (
    extract_references_from_doc,
    init_nlp_model,
    list_nlp_model_configs,
)

from tabulate import tabulate

models = load_pretrained_nlp_models(force=False, verbose=True)

 -> already downloaded part-of-speech
 -> already downloaded reference
 -> already downloaded material
 -> already downloaded language
 -> already downloaded name
 -> already downloaded semantic
 -> already downloaded metadata
 -> already downloaded geoloc


### Connect to Deep Search

In [4]:
api = ds.CpsApi.from_env(profile_name=PROFILE_NAME)

## Convert Document

In [4]:
output_dir = Path("./converted_docs")

fname = "2206.00785.pdf"

documents = ds.convert_documents(
    api=api,
    proj_key=PROJ_KEY,
    source_path=f"../../data/samples/{fname}",
    progress_bar=True
)           
documents.download_all(result_dir=output_dir)
info = documents.generate_report(result_dir=output_dir)
print(info) 

Processing input:     : 100%|██████████████████████████████| 1/1 [00:00<00:00, 124.39it/s]
Submitting input:     : 100%|██████████████████████████████| 1/1 [00:06<00:00,  6.66s/it]
Converting input:     : 100%|██████████████████████████████| 1/1 [00:26<00:00, 26.56s/it]


{'Total documents': 1, 'Successfully converted documents': 1}


In [5]:
# Iterate over the downloaded archives and write each document as JSON and Markdown
for output_file in output_dir.rglob("json*.zip"):
    with ZipFile(output_file) as archive:
        all_files = archive.namelist()
        for name in all_files:
            if not name.endswith(".json"):
                continue
            
            #basename = name.rstrip('.json')
            doc_json = json.loads(archive.read(name))
            
            ofile = output_dir / name
            print(f"writing {ofile}")
            with ofile.open("w") as fw:
                fw.write(json.dumps(doc_json, indent=2))
                
            doc_md = export_to_markdown(doc_json)

            ofile = output_dir / name.replace(".json", ".md")
            print(f"writing {ofile}")
            with ofile.open("w") as fw:
                fw.write(doc_md)

            

writing converted_docs/2206.00785.json
writing converted_docs/2206.00785.md
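
The cell above reads every `.json` member of the downloaded archive. As a minimal sketch of that pattern, the snippet below builds a small archive in memory and parses its JSON members. The file names, the archive contents, and the `iter_json_members` helper are made up for illustration and are not part of the toolkit.

```python
import io
import json
from zipfile import ZipFile


def iter_json_members(archive_bytes):
    """Yield (name, parsed_json) for every .json member of a zip archive."""
    with ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                yield name, json.loads(archive.read(name))


# Build a tiny archive in memory so the sketch runs without any download.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("example.json", json.dumps({"main-text": []}))
    archive.writestr("notes.txt", "ignored")

members = dict(iter_json_members(buffer.getvalue()))
print(sorted(members))
```

Only the `.json` member is returned; the text file is skipped, just as in the loop above.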


In [6]:
# display last document
# display(Markdown(doc_md))

## Extract references from converted Document

In [7]:
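def resolve(path, doc):
    """Resolve a reference path (a list of keys and indices, optionally led by "#") in a nested document."""
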
    if len(path)>1 and path[0]=="#":
        return resolve(path[1:], doc)
        
    if len(path)==1 and isinstance(doc, dict):
        return doc[path[0]]

    elif len(path)==1 and isinstance(doc, list):
        ind = int(path[0])
        return doc[ind]
    
    elif len(path)>1 and isinstance(doc, dict):
        return resolve(path[1:], doc[path[0]])

    elif len(path)>1 and isinstance(doc, list):
        ind = int(path[0])
        return resolve(path[1:], doc[ind])

    else:
        return None
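
The notebook defines `resolve` but does not show a call. The sketch below is an iterative variant that takes the reference as a single string. It assumes references look like `#/main-text/1` (a `#` root followed by `/`-separated keys and list indices), which the function suggests but the notebook does not confirm. The name `resolve_path` and the toy document are illustrative only.

```python
def resolve_path(ref, doc):
    """Follow a '#/key/index/...' reference through nested dicts and lists."""
    node = doc
    for part in ref.split("/"):
        if part == "#":  # '#' denotes the document root
            continue
        if isinstance(node, dict):
            node = node[part]
        elif isinstance(node, list):
            node = node[int(part)]
        else:
            return None  # cannot descend into a scalar value
    return node


# Toy document, not real converter output.
toy_doc = {"main-text": [{"text": "Title"}, {"text": "First paragraph"}]}
second_item = resolve_path("#/main-text/1", toy_doc)
print(second_item)
```

Like the recursive version, this raises `KeyError` or `IndexError` for a missing key or index and returns `None` only when the path tries to descend into a scalar.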
    

In [13]:
ifile = "converted_docs/2206.00785.json"

with open(ifile) as fr:
    doc = json.load(fr)

model = init_nlp_model("language;reference;metadata")
res = model.apply_on_doc(doc)

props = pd.DataFrame(res["properties"]["data"], columns=res["properties"]["headers"])
insts = pd.DataFrame(res["instances"]["data"], columns=res["instances"]["headers"])

In [15]:
if "title" in res["description"]:
    print("TITLE")
    print(res["description"]["title"])

if "abstract" in res["description"]:
    print("ABSTRACT")
    print(res["description"]["abstract"])

doc_props = props[props["type"]=="semantic"]
print(doc_props[0:20])

doc_props = props[props["type"]=="metadata"]
print(doc_props)

doc_insts = insts[insts["subj_name"]=="DOCUMENT"][["subtype", "subj_path", "name"]]
print(doc_insts)

TITLE
Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness
ABSTRACT
['Abstract-Document understanding is a key business process in the data-driven economy since documents are central to knowledge discovery and business insights. Converting documents into a machine-processable format is a particular challenge here due to their huge variability in formats and complex structure. Accordingly, many algorithms and machine-learning methods emerged to solve particular tasks such as Optical Character Recognition (OCR), layout analysis, table-structure recovery, figure understanding, etc. We observe the adoption of such methods in document understanding solutions offered by all major cloud providers. Yet, publications outlining how such services are designed and optimized to scale in the cloud are scarce. In this paper, we focus on the case of document conversion to illustrate the particular challenges of scaling a complex data processing pipeline with a stro
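
The `properties` and `instances` results are tables stored as a `headers` list plus a `data` list of rows, which is why the cells above pass them to `pd.DataFrame`. The same filtering can be sketched without pandas. The rows below are invented toy values that only mimic the column layout; `rows_as_dicts` is a hypothetical helper.

```python
# Toy result in the headers/data layout; the values are made up.
result = {
    "properties": {
        "headers": ["type", "subj_name", "subj_path", "label", "confidence"],
        "data": [
            ["language", "DOCUMENT", "#", "en", 1.0],
            ["semantic", "TEXT", "#/main-text/0", "header", 0.9],
        ],
    }
}


def rows_as_dicts(table):
    """Turn a headers/data table into a list of row dictionaries."""
    return [dict(zip(table["headers"], row)) for row in table["data"]]


rows = rows_as_dicts(result["properties"])

# Keep only document-level rows, analogous to filtering on subj_name above.
document_rows = [row for row in rows if row["subj_name"] == "DOCUMENT"]
print(document_rows)
```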

## Extract Metadata from public documents

In [4]:
# Import standard dependencies
from copy import deepcopy
import pandas as pd
from numerize.numerize import numerize
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
%matplotlib inline

# IPython utilities
from IPython.display import display, HTML

# Import the deepsearch-toolkit
import deepsearch as ds
from deepsearch.cps.client.components.elastic import ElasticDataCollectionSource
from deepsearch.cps.queries import DataQuery

In [5]:
# Fetch list of all data collections
collections = api.elastic.list()
collections.sort(key=lambda c: c.name.lower())

In [7]:
# Visualize summary table
results = [
    {
        "Name": c.name,
        "Type": c.metadata.type,
        "Num entries": numerize(c.documents),
        "Date": c.metadata.created.strftime("%Y-%m-%d"),
        "Coords": f"{c.source.elastic_id}/{c.source.index_key}",
    }
    for c in collections
]
display(pd.DataFrame(results))

| | Name | Type | Num entries | Date | Coords |
|---|---|---|---|---|---|
| 0 | AAAI | Document | 16.02K | 2023-08-29 | default/aaai |
| 1 | ACL Anthology | Document | 55.28K | 2023-08-22 | default/acl |
| 2 | Annual Reports | Document | 107.38K | 2024-04-15 | default/annual-report |
| 3 | arXiv abstracts | Document | 2.37M | 2023-12-07 | default/arxiv-abstract |
| 4 | arXiv category taxonomy | Record | 155 | 2023-12-05 | default/arxiv-category |
| ... | ... | ... | ... | ... | ... |
| 56 | UMLS | Record | 2.69M | 2023-01-03 | default/umls |
| 57 | UniProt | Record | 567.48K | 2023-01-03 | default/uniprot |
| 58 | USPTO patents for NER | Document | 2.64K | 2023-03-20 | default/uspto-for-ner |
| 59 | VHDL articles | Document | 215 | 2024-04-23 | default/vhdl |


In [8]:
# Input query: search for papers which mention `DocLayNet` or `PubLayNet` in the main-text
search_query = "main-text.text:(\"DocLayNet\" OR \"PubLayNet\")"

# Iterate through the data collections
results = []
for c in (pbar := tqdm(collections)):
    pbar.set_description(f"Querying {c.name}")

    # Search only on document collections
    if c.metadata.type != "Document":
        continue

    # Execute the query
    query = DataQuery(search_query, source=[], limit=0, coordinates=c.source)
    query_results = api.queries.run(query)
    results.append({
        "name": c.name,
        "matches": query_results.outputs["data_count"]
    })

# Sort and display results
results.sort(reverse=True, key=lambda r: r["matches"])
display(pd.DataFrame(results[0:5]))

  0%|          | 0/61 [00:00<?, ?it/s]

| | name | matches |
|---|---|---|
| 0 | arXiv full documents | 165 |
| 1 | Semantic Scholar Academic Graph | 40 |
| 2 | OpenCVF | 31 |
| 3 | arXiv abstracts | 24 |
| 4 | ACL Anthology | 16 |
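
The search string in the cell above combines quoted terms with `OR` inside a field-scoped group. A small helper can assemble such a string from a list of terms. `build_or_query` is a hypothetical name, and this sketch does not escape quotes or other special characters inside the terms.

```python
def build_or_query(field, terms):
    """Return a query string that matches any of the quoted terms in `field`."""
    quoted = " OR ".join(f'"{term}"' for term in terms)
    return f"{field}:({quoted})"


search_query = build_or_query("main-text.text", ["DocLayNet", "PubLayNet"])
print(search_query)
```

For these two terms the helper reproduces the hand-written `search_query` string used above.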


In [11]:
data_collection = ElasticDataCollectionSource(elastic_id="default", index_key="arxiv")
page_size = 5

# Prepare the data query
query = DataQuery(
    search_query, # The search query to be executed
    #source=["description.title", "description.authors", "identifiers"], # Which fields of documents we want to fetch
    limit=page_size, # The size of each request page
    coordinates=data_collection # The data collection to be queried
)


# [Optional] Compute the number of total results matched. This can be used to monitor the pagination progress.
count_query = deepcopy(query)
count_query.paginated_task.parameters["limit"] = 0
count_results = api.queries.run(count_query)
expected_total = count_results.outputs["data_count"]
expected_pages = (expected_total + page_size - 1) // page_size # this is simply a ceiling formula

model = init_nlp_model("language;reference;metadata")

# Iterate through all results by fetching `page_size` results at a time
all_results = []
cursor = api.queries.run_paginated_query(query)
for result_page in tqdm(cursor, total=expected_pages):
    # Iterate through the results of a single page, and add to the total list
    for row in result_page.outputs["data_outputs"]:
        doc = row["_source"]
        print(doc["file-info"]["filename"])

        res = model.apply_on_doc(doc)

        if "title" in res["description"]:
            print("title: ", res["description"]["title"])

        if "abstract" in res["description"]:
            print("abstract: ", res["description"]["abstract"])

        props = pd.DataFrame(res["properties"]["data"], columns=res["properties"]["headers"])
        #print(props[0:12])

        #insts = pd.DataFrame(res["instances"]["data"], columns=res["instances"]["headers"])

        #doc_insts = insts[insts["subj_name"]=="DOCUMENT"][["subtype", "subj_path", "name"]]
        #print(doc_insts)



  0%|          | 0/33 [00:00<?, ?it/s]

2007.12238.pdf
title:  MiniConf-A Virtual Conference Framework
abstract:  Abstract MiniConf is a framework for hosting virtual academic conferences motivated by the sudden inability for these events to be hosted globally. The framework is designed to be global and asynchronous, interactive, and to promote browsing and discovery. We developed the system to be sustainable and maintainable, in particular ensuring that it is open-source, easy to setup, and scalable on minimal hardware. In this technical report, we discuss design decisions, provide technical detail, and show examples of a case study deployment. Keywords Conference Management-Academic Communication-Software Development $^{1}$CS+Cornell Tech, Cornell University, New York NY, USA $^{2}$MIT-IBM Watson AI Lab, IBM Research, Cambridge MA, USA Correspondence : info@mini-conf.org
        type             subj_hash subj_name  subj_path      label  confidence
0   language   2265028778467379955  DOCUMENT          #         en        1
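
The page count in the cell above comes from integer ceiling division. The sketch below checks that formula against `math.ceil`; `pages_needed` is an illustrative name. If the total is the 165 matches reported earlier for the arXiv full documents, a page size of 5 gives 33 pages, which agrees with the progress bar total shown above.

```python
import math


def pages_needed(total, page_size):
    """Number of pages required to hold `total` items (ceiling division)."""
    return (total + page_size - 1) // page_size


print(pages_needed(165, 5))  # exact multiple of the page size
print(pages_needed(166, 5))  # one extra item needs one extra page

# The integer formula agrees with math.ceil for small non-negative totals.
agrees = all(pages_needed(n, 5) == math.ceil(n / 5) for n in range(1000))
print(agrees)
```

The integer form avoids floating-point division, which matters only for very large totals but costs nothing here.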

## Extract Metadata from private documents

In [5]:
import os
import json
import argparse

# Import standard dependencies
from copy import deepcopy
import pandas as pd
from numerize.numerize import numerize
from tqdm import tqdm
import matplotlib.pyplot as plt

# Import the deepsearch-toolkit
import deepsearch as ds
from deepsearch.cps.client.components.elastic import ElasticDataCollectionSource
from deepsearch.cps.queries import DataQuery

In [6]:
def get_indices_in_project(api, proj_key, coll_name):

    data_indices = api.data_indices.list(proj_key=proj_key)

    for index in data_indices:
        if coll_name==index.name:
            return index

    print("Could not find collection in project. Please select one of the following collections")
    for index in data_indices:
        print(" -> collection: ", index)
    
    return None

def search_documents(api, proj_key, coll_name, query, max_docs=100, page_size=1):

    index = get_indices_in_project(api, coll_name=coll_name,
                                   proj_key=proj_key)

    if index is None:
        return

    try:
        data_query = DataQuery(query, coordinates=index.source, limit=page_size)  # limit: the size of each request page
        cursor = api.queries.run_paginated_query(data_query)

        # [Optional] Compute the number of total results matched. This can be used to monitor the pagination progress.
        count_query = deepcopy(data_query)
        count_query.paginated_task.parameters["limit"] = 0
        count_results = api.queries.run(count_query)
        expected_total = count_results.outputs["data_count"]
        expected_pages = (expected_total + page_size - 1) // page_size # this is simply a ceiling formula

        print("#-documents: ", expected_total)

        cur_docs = 0
        for result_page in tqdm(cursor):

            # Note: with `>`, iteration stops only after max_docs + 1 documents have been yielded
            if cur_docs>max_docs:
                break

            for row in result_page.outputs["data_outputs"]:

                #print(cur_docs, max_docs)
                if cur_docs>max_docs:
                    break


                yield row["_source"]
                cur_docs += 1

    except Exception as e:
        print(" => ", e)
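
`search_documents` caps a paginated result stream with a manual counter. An alternative, shown here as a sketch rather than a replacement, is to flatten the pages into one generator and cap it with `itertools.islice`, which yields at most `max_docs` items. The page lists below stand in for `result_page.outputs["data_outputs"]`; `iter_rows` and `take` are hypothetical names.

```python
from itertools import islice


def iter_rows(pages):
    """Flatten an iterable of pages into a single stream of rows."""
    for page in pages:
        yield from page


def take(pages, max_docs):
    """Return at most `max_docs` rows from the paginated stream."""
    return list(islice(iter_rows(pages), max_docs))


# Stand-in pages; a real cursor would fetch each page lazily.
fake_pages = [["doc-0", "doc-1"], ["doc-2", "doc-3"], ["doc-4"]]
first_three = take(fake_pages, 3)
print(first_three)
```

Because both generators are lazy, pages beyond the cap are never requested when the input is a real cursor.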

In [12]:
api = ds.CpsApi.from_env(profile_name=PROFILE_NAME)

model = init_nlp_model("language;reference;metadata")

proj_key = "c4ae6545156c5f99770fdfd161102a01567d8ecd"
#coll_name = "GeoArabia"
#coll_name = "BasinResearch1"
coll_name = "African_ES"

query = "*"

for doc in search_documents(api, proj_key, coll_name, query, max_docs=5, page_size=1):
    
    print("document-hash: ", doc["file-info"]["document-hash"])
    
    res = model.apply_on_doc(doc)
    #print(res["description"].keys())
  
    if "title" in res["description"]:
        text = res["description"]["title"]
        text = "\n".join(textwrap.wrap(text, width=70))

        print("title: \n", text, "\n")

    if "authors" in res["description"]:
        print("authors: ", json.dumps(res["description"]["authors"], indent=2))

    if "affiliations" in res["description"]:
        print("affiliations: ", json.dumps(res["description"]["affiliations"], indent=2))
    
    if "abstract" in res["description"]:

        print("abstract: \n")
        for paragraph in res["description"]["abstract"]:
            text = "\n".join(textwrap.wrap(paragraph, width=70))
            print(text, "\n")


#-documents:  9


1it [00:01,  1.12s/it]

document-hash:  0f43aba61158df5f5a00d91434bee8dd47e9dad2a6252ab7607408e2e6057b7d
title: 
 Source area and tectonic provenance of Paleocene-Eocene red bed
clastics from the Kurdistan area NE Iraq: Bulk-rock geochemistry
constraints 

authors:  [
  {
    "name": "Brian G Jones"
  },
  {
    "name": "Muatasam Mahmood Hassan"
  },
  {
    "name": "Solomon Buckman"
  },
  {
    "name": "Ali Ismail Al Jubory"
  },
  {
    "name": "Sabah Ahmed Ismail"
  }
]
affiliations:  [
  {
    "name": "School of Earth and Environmental Sciences"
  },
  {
    "name": "University of Wollongong"
  },
  {
    "name": "School of Earth Science"
  },
  {
    "name": "School of Earth Science"
  },
  {
    "name": "University of Kirkuk"
  }
]
abstract: 

abstract 

Paleocene-Eocene Red Beds exist along a narrow belt in the NW-SE
oriented imbricate zone in northeastern Iraq and are composed of
clastic rocks including conglomerate, sandstone and mudstone. 

Trace elements show that the lower part of the Red Beds (u

2it [00:01,  1.18it/s]

document-hash:  44cd3953cb824628f2d7fe8976afc9beb2ed07c26ae83f0c79ca357af85af9d4
title: 
 Facies analysis and diagenetic features of the Aptian Dariyan
Formation in Zagros Fold-Thrust Belt, SW Iran 

authors:  [
  {
    "name": "Arash Shaabanpour Haghighi"
  },
  {
    "name": "Mohammad Sahraeyan"
  }
]
affiliations:  []
abstract: 

abstract 

The Aptian Dariyan Formation (upper part of the Khami Group), is one
of the important reservoir rocks in the Zagros Fold-Thrust Belt. The
Zagros Fold-Thrust Belt is located on the boundary between the Arabian
and Eurasian lithospheric plates and formed from collision between
Eurasia and advancing Arabia during the Cenozoic. In these studied
area, the Dariyan Formation with a thickness of 136 meters (Fahliyan
section) and 100 meters (Kuh-e-Rahmat section), consists of carbonate
rocks. Based on the facies analysis and sedimentological data, 16
microfacies were identified. The microfacies are attributed to five
facies belts: tidal flat (lime mudston

3it [00:02,  1.26it/s]

document-hash:  45319f285bb4544209fb74269a72a17c3a3525246945441aec927928a105bf04
title: 
 Integrated provenance analysis of Zakeen (Devonian) and Faraghan
(early Permian) sandstones in the Zagros belt, SW Iran 

authors:  [
  {
    "name": "S Mohammad Zamanzadeh"
  },
  {
    "name": "Yousef Zoleikhaei"
  },
  {
    "name": "Abdolhossein Amini"
  }
]
affiliations:  [
  {
    "name": "College of Science"
  },
  {
    "name": "University of Tehran"
  },
  {
    "name": "Faculty of Geography"
  },
  {
    "name": "University of Tehran"
  }
]
abstract: 

abstract 

Successions of a controversial period of time in the Zagros and
Arabian Plate stratigraphic column, including Zakeen (Devonian) and
Faraghan (early Permian) formations are investigated for their
provenance characteristics. Nearly similar depositional environments
of the formations, regardless of 70-80 My hiatus between them, is the
main motivation for this study. Evidence from various methods are put
together to reconstruct a co

4it [00:03,  1.42it/s]

document-hash:  71b9d4a7505055da7d78886e41abc80602eecdaed0863a0f51add493f38968ba
title: 
 Multi-phase inversion tectonics related to the Hendijan e Nowrooz e
Khafji Fault activity, Zagros Mountains, SW Iran 

authors:  [
  {
    "name": "Sadjad Kazem Shiroodi"
  },
  {
    "name": "Mohammad Ghafoori"
  },
  {
    "name": "Ali Faghih"
  },
  {
    "name": "Mostafa Ghanadian"
  },
  {
    "name": "Gholamreza Lashkaripour"
  },
  {
    "name": "Naser Hafezi Moghadas"
  }
]
affiliations:  [
  {
    "name": "Department of Geology"
  },
  {
    "name": "Faculty of Sciences"
  },
  {
    "name": "Ferdowsi University of Mashhad"
  }
]
abstract: 

abstract 

Distinctive characteristics of inverted structures make them important
criteria for the identification of certain structural styles of folded
belts. The interpretation of 3D seismic reflection and well data sheds
new light on the structural evolution and age of inverted structures
associated to the Hendijan$_{e}$Nowrooz $_{e}$Khafji Fault w

5it [00:03,  1.57it/s]

document-hash:  7594495bb7872d4aa3bfa7bacbc4f598fa8c84fddc6c553effaf4f1b101935c0
title: 
 Lithofacies, architectural elements and tectonic provenance of the
siliciclastic rocks of the Lower Permian Dorud Formation in the Alborz
Mountain Range, Northern Iran 

authors:  [
  {
    "name": "Mojtaba Javidan"
  },
  {
    "name": "Hosseinali Mokhtarpour"
  },
  {
    "name": "Mohammad Sahraeyan"
  },
  {
    "name": "Hojatollah Kheyrandish"
  }
]
affiliations:  [
  {
    "name": "Department of Geology"
  },
  {
    "name": "College of Basic Sciences"
  },
  {
    "name": "Department of Geology"
  },
  {
    "name": "Department of Geology"
  }
]
abstract: 

abstract 

The siliciclastic deposits of the Lower Permian Dorud Formation widely
crop out in the eastern part of the Alborz Mountain Range (northern
Iran). In order to interpret the sedimentary environments and tectonic
provenance of these deposits, two sections in the Kiyasar and
Talmadareh with 112 and 122 m thickness, respectively; ha

6it [00:04,  1.64it/s]

document-hash:  7c5d4947280cec27fbb01892eea145933df0813be615ccd5fb5bb5503254d0f1
title: 
 Stratigraphy, mineralogy and depositional environment of the evaporite
unit in the As ¸ kale (Erzurum) sub-basin, Eastern Anatolia (Turkey) 

authors:  [
  {
    "name": "Emel Abdio"
  },
  {
    "name": "Mehmet Arslan"
  },
  {
    "name": "Cahit Helvac"
  }
]
affiliations:  []
abstract: 

abstract 

The study area is situated in the As¸ kale sub-basin where the Early-
Middle Miocene aged As¸ kale Formation was deposited in a shallow
marine to lagoonal environment, and consists of interstratifications
of clastic sediments, carbonates and evaporites. The successions of
the As¸ kale Formation can be divided into four main members
interfingering with one another both vertically and laterally, and
composed of the sandstone-mudstone-limestone member, the evaporite
member, the gravelstone-sandstone-mudstone intercalations and the
limestone member. The evaporite unit comprises of secondary gypsum
lithof

6it [00:04,  1.32it/s]
