This repository has been archived by the owner on Jan 29, 2024. It is now read-only.

Download more PMC files: oa_other and historical_ocr #584

Open
2 tasks
FrancescoCasalegno opened this issue Feb 18, 2022 · 1 comment · May be fixed by #596

Comments

@FrancescoCasalegno
Contributor

Context

As explained on the PMC website here, the following types of files are available for bulk download.

  1. PMC Open Access Subset, divided into 3 further groupings:
    1.1 oa_comm – Commercial Use Allowed - CC0, CC BY, CC BY-SA, CC BY-ND licenses
    1.2 oa_noncomm – Non-Commercial Use Only - CC BY-NC, CC BY-NC-SA, CC BY-NC-ND licenses
    1.3 oa_other – No machine-readable Creative Commons license, no license, or a custom license
  2. Author Manuscript Dataset
  3. Historical OCR Dataset

But currently, when we run bbs_database download pmc, we are only downloading the subsets 1.1, 1.2, and 2. Requesting any other subset fails with the following error:

"Only {'author_manuscript', 'oa_comm', 'oa_noncomm'} are supported."
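To make the gap concrete, here is a minimal sketch of the kind of check that would produce a message like the one above. This is not the actual bbs_database code: the names `SUPPORTED_SUBSETS` and `check_subset` are hypothetical and only illustrate which subsets are accepted today and which ones this issue asks for.

```python
# Hypothetical sketch, not the real bbs_database implementation.
# The three names below are taken from the error message quoted above.
SUPPORTED_SUBSETS = {"author_manuscript", "oa_comm", "oa_noncomm"}

# Subsets this issue proposes to add (names as used in the issue title).
REQUESTED_SUBSETS = {"oa_other", "historical_ocr"}


def check_subset(name: str) -> None:
    """Raise ValueError if `name` is not one of the supported subsets."""
    if name not in SUPPORTED_SUBSETS:
        raise ValueError(f"Only {SUPPORTED_SUBSETS} are supported.")
```

Under this sketch, supporting a new subset would mean adding its name to `SUPPORTED_SUBSETS` and providing a matching download and parsing path.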

Actions

@FrancescoCasalegno FrancescoCasalegno changed the title from "Download more PMC files: oa_other and" to "Download more PMC files: oa_other and historical_ocr" Feb 18, 2022
@FrancescoCasalegno
Contributor Author

historical_ocr subset

Useful Links

[1] PMC description of scanning (OCR) process: https://www.ncbi.nlm.nih.gov/pmc/about/scanning/
[2] Files for bulk download (FTP) on PMC servers: https://ftp.ncbi.nlm.nih.gov/pub/pmc/historical_ocr/

Description

As explained in [1], the scanning process produced OCR text of the full texts. Some of those files were then made available as unedited .txt files for bulk download at [2]. For instance, article PMC5421081 was scanned and processed with OCR, which gave the following output (downloaded from [2]):

    Harrison's CD-ROM. Version 1.0.
    (Harrison's principles of internal
    medicine, 13th edition, 1994 with
    selected Medline? abstracts).
    McGraw-Hill, New York, 1995.
    $165.00.
    A major medical textbook, a phar-
    macopeia* and selected Medline?
    abstracts are available on one CD,
    ...

As you can see, this is just the raw, unedited .txt output of an OCR tool run on an old, scanned article.

Issues

  1. The file is a raw .txt without any recognizable or general structure (the layout differs from article to article). We currently have no parser to read this kind of file.
  2. There seems to be no general way to get the article metadata (title, authors, date, journal, ...) or the article topics from the .txt itself. We would need to look this information up in external databases.
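To illustrate both issues, here is a minimal sketch of what a parser for these files could realistically extract. It is only an illustration under stated assumptions: `parse_ocr_file` and the returned field names are hypothetical, and it assumes the file name contains the PMC ID (e.g. `PMC5421081.txt`), which should be checked against the actual archives at [2].

```python
import re
from pathlib import Path

# Assumption: the file name contains the PMC ID, e.g. "PMC5421081.txt".
PMCID_PATTERN = re.compile(r"PMC\d+")


def dehyphenate(raw: str) -> str:
    """Join words split across line breaks, then collapse all whitespace."""
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    return re.sub(r"\s+", " ", joined).strip()


def parse_ocr_file(path: Path) -> dict:
    """Return a minimal record for one raw OCR .txt file (hypothetical schema)."""
    match = PMCID_PATTERN.search(path.name)
    raw = path.read_text(encoding="utf-8", errors="replace")
    return {
        "pmcid": match.group(0) if match else None,
        "text": dehyphenate(raw),
        # Not recoverable from the .txt itself (issue 2):
        # these would have to come from an external lookup.
        "title": None,
        "authors": None,
        "date": None,
        "journal": None,
    }
```

Note that the de-hyphenation is a heuristic: it also merges words that are legitimately hyphenated at a line break, and it does nothing about OCR misreads such as the "Medline?" in the sample above.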

Proposed Solution

Because of the issues above, I suggest we don't address the historical_ocr subset for the moment; we can come back to it later on.

@FrancescoCasalegno FrancescoCasalegno linked a pull request Mar 25, 2022 that will close this issue
4 tasks