# Text and Tables Extraction

This notebook shows how to use our pipeline to extract text and tables from arXiv papers whose LaTeX source code is available.

In [1]:
from pathlib import Path
from axcell.helpers.paper_extractor import PaperExtractor

### Structure of Directories

We cache the artifacts produced by successful execution of the intermediate steps of the extraction pipeline. The `root` argument of `PaperExtractor` is the path under which the following directory structure is created:

```
root
├── sources                       # e-print archives
├── unpacked_sources              # extracted latex sources (generated automatically)
├── htmls                         # converted html files (generated automatically)
└── papers                        # extracted text and tables (generated automatically)
```

In [2]:
ROOT_PATH = Path('data')
str(ROOT_PATH.absolute())

'/home/lazoark/OneDrive/Curriculum/NLP_2024/CLEF2024/axcell/notebooks/data'

In our case there's a single e-print archive:

In [3]:
# !tree {ROOT_PATH}
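
The `tree` call above is commented out. Where the `tree` binary is not installed, a pure-Python listing can serve the same purpose. The following is a minimal sketch; `format_tree` and its `max_depth` parameter are hypothetical helpers, not part of axcell.

```python
from pathlib import Path


def format_tree(root, max_depth=4):
    """Return an indented listing of `root`, similar in spirit to `tree -L`."""
    root = Path(root)
    lines = [root.name + "/"]

    def walk(directory, depth):
        # Stop descending once the requested depth is exceeded.
        if depth > max_depth:
            return
        # Directories first, then files, each group sorted by name.
        entries = sorted(directory.iterdir(), key=lambda p: (p.is_file(), p.name))
        for entry in entries:
            suffix = "/" if entry.is_dir() else ""
            lines.append("    " * depth + entry.name + suffix)
            if entry.is_dir():
                walk(entry, depth + 1)

    walk(root, 1)
    return lines
```

In this notebook it could be called as `print("\n".join(format_tree(ROOT_PATH)))`.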

In [4]:
extract = PaperExtractor(ROOT_PATH)

To extract text and tables from a single paper just pass the path to the archive:

In [5]:
SOURCES_PATH = ROOT_PATH / 'sources'
# extract(SOURCES_PATH / '1903' / '1903.11816v1')

# from glob import glob
# glob(ROOT_PATH)

all_files = [path.as_posix() for path in SOURCES_PATH.glob("*/*.tex")]
print(all_files)
# extract(all_files[0])

[]
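
The commented-out call above addresses an archive as `sources/<YYMM>/<identifier>`. A small helper can build that path from a new-style arXiv identifier (`YYMM.number`, optionally with a version suffix). This is a sketch under that layout assumption; `archive_path` is a hypothetical name and old-style identifiers such as `hep-th/9901001` are deliberately rejected.

```python
from pathlib import Path


def archive_path(sources_dir, arxiv_id):
    """Build `<sources_dir>/<YYMM>/<arxiv_id>` for a new-style arXiv identifier."""
    # The part before the first dot is the YYMM prefix used as the subdirectory.
    prefix, sep, _ = arxiv_id.partition(".")
    if not sep or not prefix.isdigit():
        raise ValueError(f"not a new-style arXiv identifier: {arxiv_id!r}")
    return Path(sources_dir) / prefix / arxiv_id
```

For example, `archive_path(SOURCES_PATH, '1903.11816v1')` yields the same path as `SOURCES_PATH / '1903' / '1903.11816v1'`.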


In [6]:
# _ID = "0811.3247"
_ID = "1703.03400"
# _ID = "2105.05348"
# _ID = "1805.09843"
# _ID = "2005.06723v1"

extract(SOURCES_PATH / _ID.split(".")[0] / _ID)

[DEBUG] Entering `latex.to_html`
Entering `TemporaryDirectory() as output_dir:`...
Entering `output_dir = Path(output_dir)`...
Entering `self.latex2html(source_dir, output_dir)`...
####################################################################################################
{PosixPath('/home/lazoark/miniconda3/envs/axcell/lib/python3.7/site-packages/axcell/scripts/latex2html.sh'): {'bind': '/files/latex2html.sh', 'mode': 'ro'}, PosixPath('/home/lazoark/miniconda3/envs/axcell/lib/python3.7/site-packages/axcell/scripts/guess_main.py'): {'bind': '/files/guess_main.py', 'mode': 'ro'}, PosixPath('/home/lazoark/miniconda3/envs/axcell/lib/python3.7/site-packages/axcell/scripts/patches'): {'bind': '/files/patches', 'mode': 'ro'}, PosixPath('/home/lazoark/OneDrive/Curriculum/NLP_2024/CLEF2024/axcell/notebooks/data/unpacked_sources/1703/1703.03400'): {'bind': '/files/ro-source', 'mode': 'ro'}, PosixPath('/tmp/tmpusa3fik7'): {'bind': '/files/htmls', 'mode': 'rw'}}
['/files/latex2html.sh', 

Command '['/files/latex2html.sh', 'index.html']' in image 'arxivvanity/engrafo:latest' returned non-zero exit status 1: b''
processing-error


axcell.errors.LatexConversionError()
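
The output above ends with a `processing-error` status and an `axcell.errors.LatexConversionError`. When a failure like this should be recorded rather than interrupt a longer run, the call can be wrapped. The sketch below is generic: `run_safely` and the `exception:<Name>` status format are hypothetical and not part of axcell, and whether `extract` raises or returns a status for a given failure is not assumed either way.

```python
def run_safely(extract_fn, source):
    """Call `extract_fn(source)` and turn any exception into a status string."""
    try:
        return extract_fn(source)
    except Exception as error:  # broad on purpose: one bad paper should not stop a batch
        # Hypothetical status format, chosen only for this sketch.
        return f"exception:{type(error).__name__}"
```

In this notebook it would be used as `run_safely(extract, SOURCES_PATH / '1703' / '1703.03400')`.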

In [7]:
# {
#   Path('/home/lazoark/miniconda3/envs/axcell/lib/python3.7/site-packages/axcell/scripts/latex2html.sh'): {'bind': '/files/latex2html.sh', 'mode': 'ro'}, 
#   Path('/home/lazoark/miniconda3/envs/axcell/lib/python3.7/site-packages/axcell/scripts/guess_main.py'): {'bind': '/files/guess_main.py', 'mode': 'ro'}, 
#   Path('/home/lazoark/miniconda3/envs/axcell/lib/python3.7/site-packages/axcell/scripts/patches'): {'bind': '/files/patches', 'mode': 'ro'}, 
#   Path('/home/lazoark/OneDrive/Curriculum/NLP_2024/CLEF2024/axcell/notebooks/data/unpacked_sources/1703/1703.03400'): {'bind': '/files/ro-source', 'mode': 'ro'}, 
#   Path('/tmp/tmp7kel2899'): {'bind': '/files/htmls', 'mode': 'rw'}
# }


The subdirectory structure under the `sources` directory is replicated in the other top-level directories.

In [9]:
# !tree -L 4 {ROOT_PATH}
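
To make the replication concrete, the sketch below maps an archive under `sources` to the path with the same relative location in each generated directory. `mirrored_paths` is a hypothetical helper; the exact leaf names the pipeline writes (for example, whether a version suffix is kept) are an assumption here.

```python
from pathlib import Path

# Top-level directories from the layout described earlier, besides `sources`.
GENERATED_DIRS = ("unpacked_sources", "htmls", "papers")


def mirrored_paths(root, source):
    """Map an archive under `<root>/sources` to the matching path in each generated directory."""
    root = Path(root)
    # Raises ValueError if `source` is not located under `<root>/sources`.
    relative = Path(source).relative_to(root / "sources")
    return {name: root / name / relative for name in GENERATED_DIRS}
```

For `data/sources/1703/1703.03400` this gives `data/unpacked_sources/1703/1703.03400`, which matches the read-only source mount shown in the debug output of the extraction cell.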

The extracted data is stored in the `papers` directory and can be read with the `PaperCollection` class, a wrapper around a `list` of papers with additional convenience functions. Because the number of papers can be large, we recommend loading the dataset in parallel (by default the number of processes equals the number of CPU cores) and storing it in a pickle file. Set `jobs=1` to disable multiprocessing.
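
The load-once-then-cache pattern described here can be sketched generically. `load_or_build` is a hypothetical helper, not part of axcell; it falls back to the standard `pickle` module unless custom `load` and `save` callables are supplied.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build, load=None, save=None):
    """Return the cached object if `cache_path` exists, otherwise build and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        if load is not None:
            return load(cache_path)
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    # Cache miss: run the expensive step once, then store the result.
    result = build()
    if save is not None:
        save(result, cache_path)
    else:
        with cache_path.open("wb") as handle:
            pickle.dump(result, handle)
    return result
```

With a `PaperCollection`, one would presumably pass `build=lambda: PaperCollection.from_files(PAPERS_PATH)` together with its own `from_pickle` and `to_pickle` methods as `load` and `save`, assuming those accept the cache path.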

In [10]:
from axcell.data.paper_collection import PaperCollection

PAPERS_PATH = ROOT_PATH / 'papers'
pc = PaperCollection.from_files(PAPERS_PATH)
# pc.to_pickle('mypapers.pkl')
# pc = PaperCollection.from_pickle('mypapers.pkl')
print(PAPERS_PATH)
paper = pc.get_by_id(_ID)
paper.text

data/papers


AttributeError: 'NoneType' object has no attribute 'text'
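
The `AttributeError` above indicates that `pc.get_by_id(_ID)` returned `None`, which is what one would expect if extraction failed and nothing was written under `papers` for this identifier. A small guard makes the cause explicit. `require_paper` is a hypothetical helper; it only assumes the collection exposes `get_by_id` as used in the cell above.

```python
def require_paper(collection, paper_id):
    """Return the paper with `paper_id`, or raise a descriptive error if it is missing."""
    paper = collection.get_by_id(paper_id)
    if paper is None:
        raise LookupError(
            f"paper {paper_id!r} not found in the collection; "
            "check that extraction succeeded for it"
        )
    return paper
```

Calling `require_paper(pc, _ID)` then fails at the lookup with a readable message instead of at the first attribute access.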

In [None]:
paper.text.title

AttributeError: 'NoneType' object has no attribute 'text'

In [None]:
# paper.tables[7]
paper.tables[0]

IndexError: list index out of range

As *FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation* (Wu et al., 2019) is present in our **SegmentedTables** dataset, we can use `PaperCollection` to import annotations (table segmentation and results):

In [None]:
from axcell.helpers.datasets import read_tables_annotations

V1_URL = 'https://github.com/paperswithcode/axcell/releases/download/v1.0/'
SEGMENTED_TABLES_URL = V1_URL + 'segmented-tables.json.xz'

segmented_tables = read_tables_annotations(SEGMENTED_TABLES_URL)

pc = PaperCollection.from_files(PAPERS_PATH, annotations=segmented_tables.to_dict('records'))

In [None]:
# paper = pc.get_by_id('1903.11816')
paper = pc.get_by_id(_ID)
paper.tables[7]

IndexError: list index out of range

In [None]:
pc.cells_gold_tags_legend()

| Tag | description |
|---|---|
| model-best | the best performing model introduced in the paper |
| model-paper | model introduced in the paper |
| model-ensemble | ensemble of models introduced in the paper |
| model-competing | model from another paper used for comparison |
| dataset-task | Task |
| dataset | Dataset |
| dataset-sub | Subdataset |
| dataset-metric | Metric |
| model-params | Params, f.e., number of layers or inference time |


In [None]:
paper.tables[1].sota_records
# paper.tables[2]

IndexError: list index out of range

## Parallel Extraction

Extracting a single paper can take from several seconds to a few minutes (the longest phase, converting the LaTeX source into HTML, times out after 5 minutes), so we run the extraction in parallel to process multiple files.

In [None]:
%%time

from joblib import delayed, Parallel

# access extract from the global context to avoid serialization
def extract_single(file):
    return extract(file)

files = sorted([path for path in SOURCES_PATH.glob('**/*') if path.is_file()])

statuses = Parallel(backend='multiprocessing', n_jobs=-1)(delayed(extract_single)(file) for file in files)

CPU times: user 100 ms, sys: 40.5 ms, total: 141 ms
Wall time: 30.1 s
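
After a parallel run, the returned `statuses` list lines up with `files`, so the outcome can be tallied. The sketch below assumes each status is a string and that `'success'` marks a successful extraction; that value is an assumption, while `'processing-error'` appears in the output of the single-paper run. `summarize_statuses` is a hypothetical helper.

```python
from collections import Counter


def summarize_statuses(files, statuses):
    """Count each status and collect the files whose status is not 'success'."""
    counts = Counter(statuses)
    # 'success' is an assumed status value; adjust it to whatever extract returns.
    failed = [(file, status) for file, status in zip(files, statuses) if status != "success"]
    return counts, failed
```

It would be called as `counts, failed = summarize_statuses(files, statuses)`, after which `failed` lists the archives worth re-running or inspecting.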
