Download more `PMC` files: `oa_other` and `historical_ocr` #584

FrancescoCasalegno · 2022-02-18T15:32:52Z

Context

As explained on the PMC website here, the following types of files are available for bulk download.

1. PMC Open Access Subset, divided into three groupings:
   1.1 oa_comm – Commercial Use Allowed - CC0, CC BY, CC BY-SA, CC BY-ND licenses
   1.2 oa_noncomm – Non-Commercial Use Only - CC BY-NC, CC BY-NC-SA, CC BY-NC-ND licenses
   1.3 oa_other – No machine-readable Creative Commons license, no license, or a custom license
2. Author Manuscript Dataset
3. Historical OCR Dataset

However, when we run `bbs_database download pmc`, we currently download only subsets 1.1, 1.2, and 2. (the ones named in the error message below).

Search/src/bluesearch/database/download.py

Line 119 in 08fc2b0

"Only {'author_manuscript', 'oa_comm', 'oa_noncomm'} are supported."

Actions

Support download for subset oa_other.
Support download for subset historical_ocr.
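
A minimal sketch of how the supported-subset check could be extended to cover oa_other. This is not the actual code in download.py: the function name `pmc_subset_url`, the mapping `SUBSET_DIRS`, and the directory layout under the PMC FTP root are all assumptions made for illustration and would need to be checked against the real server listing.

```python
# Hypothetical sketch: names and directory layout are assumptions,
# not taken from bluesearch's download.py.
PMC_FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/pub/pmc"

# Assumed mapping from subset name to its directory under the FTP root.
SUBSET_DIRS = {
    "author_manuscript": "manuscript/xml",
    "oa_comm": "oa_bulk/oa_comm/xml",
    "oa_noncomm": "oa_bulk/oa_noncomm/xml",
    "oa_other": "oa_bulk/oa_other/xml",  # subset 1.3, newly added
}


def pmc_subset_url(subset: str) -> str:
    """Return the (assumed) bulk-download URL for a PMC subset."""
    try:
        return f"{PMC_FTP_ROOT}/{SUBSET_DIRS[subset]}/"
    except KeyError:
        supported = sorted(SUBSET_DIRS)
        raise ValueError(
            f"Unknown subset {subset!r}. Only {supported} are supported."
        ) from None
```

With a mapping like this, supporting a further subset would amount to adding one more entry, and the error message lists the supported names automatically.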


FrancescoCasalegno · 2022-03-25T10:33:15Z

`historical_ocr` subset

Useful Links

[1] PMC description of scanning (OCR) process: https://www.ncbi.nlm.nih.gov/pmc/about/scanning/
[2] Files for bulk download (FTP) on PMC servers: https://ftp.ncbi.nlm.nih.gov/pub/pmc/historical_ocr/

Description

As explained in [1], the scanning process produced OCR text of the full articles. Some of those files were then made available as unedited .txt files for bulk download at [2]. For instance, the scan of article PMC5421081 was processed with OCR and gave the following output (downloaded from [2]):

    Harrison's CD-ROM. Version 1.0.
    (Harrison's principles of internal
    medicine, 13th edition, 1994 with
    selected Medline? abstracts).
    McGraw-Hill, New York, 1995.
    $165.00.
    A major medical textbook, a phar-
    macopeia* and selected Medline?
    abstracts are available on one CD,
    ...

As you can see, this is just the raw, unedited .txt output of an OCR tool run on an old, scanned article.

Issues

Each file is a raw .txt without any recognizable or general structure (the layout differs from article to article). We currently have no parser that can read this kind of file.
There seems to be no general way to get the article metadata (title, authors, date, journal, ...) or the article topics from the .txt itself. We would need to look this information up in external databases.
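
To give an idea of what such an external lookup could involve, here is a minimal sketch that only builds a request URL for the NCBI E-utilities `esummary` endpoint from a PMCID. The helper names are hypothetical, and both the choice of E-utilities as the external database and the assumption that the PMCID can be recovered from the file name are illustrative; neither has been verified against the files in [2].

```python
import re
from typing import Optional
from urllib.parse import urlencode

# NCBI E-utilities summary endpoint.
ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def extract_pmcid(filename: str) -> Optional[str]:
    """Return the numeric part of a PMCID found in a file name, if any.

    Assumption: the file name contains the identifier, e.g. "PMC5421081.txt".
    """
    match = re.search(r"PMC(\d+)", filename)
    return match.group(1) if match else None


def esummary_url(pmcid: str) -> str:
    """Build the esummary request URL for a numeric PMC identifier."""
    params = {"db": "pmc", "id": pmcid, "retmode": "json"}
    return f"{ESUMMARY_URL}?{urlencode(params)}"
```

The sketch deliberately stops before sending the request: whether the returned summary contains all the metadata we need (and the article topics in particular) would still have to be checked.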

Proposed Solution

Because of the issues above, I suggest we don't address historical_ocr for the moment; we can come back to it later.

FrancescoCasalegno changed the title ~~Download more PMC files: oa_other and~~ Download more PMC files: oa_other and historical_ocr Feb 18, 2022

FrancescoCasalegno linked a pull request Mar 25, 2022 that will close this issue

Support download of PMC oa_other #596

Open

4 tasks
