This repository has been archived by the owner on Jan 29, 2024. It is now read-only.
As explained on the PMC website here, the following types of files are available for bulk download.
PMC Open Access Subset, divided into 3 further groupings:
1.1 oa_comm – Commercial Use Allowed - CC0, CC BY, CC BY-SA, CC BY-ND licenses
1.2 oa_noncomm – Non-Commercial Use Only - CC BY-NC, CC BY-NC-SA, CC BY-NC-ND licenses
1.3 oa_other – No machine-readable Creative Commons license, no license, or a custom license
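The grouping rule above can be sketched as a small lookup. This is only an illustration: the names OA_GROUPINGS and grouping_for_license are hypothetical and are not part of bluesearch or of any PMC tooling. The license strings are copied from the list above, and anything not listed is assumed to fall into oa_other.

```python
# Hypothetical lookup table: grouping name -> licenses, copied from the list above.
OA_GROUPINGS = {
    "oa_comm": {"CC0", "CC BY", "CC BY-SA", "CC BY-ND"},
    "oa_noncomm": {"CC BY-NC", "CC BY-NC-SA", "CC BY-NC-ND"},
}


def grouping_for_license(license_name):
    """Return the Open Access grouping for a license string.

    Assumption: a missing, custom, or unrecognized license maps to "oa_other".
    """
    if license_name is not None:
        normalized = license_name.strip().upper()
        for grouping, licenses in OA_GROUPINGS.items():
            if normalized in licenses:
                return grouping
    return "oa_other"
```

For example, grouping_for_license("CC BY-NC") gives "oa_noncomm", while grouping_for_license(None) gives "oa_other".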
As explained in [1], the scanning process produced OCR text of the full texts. Some of those files were then made available as unedited .txt files for bulk download at [2]. For instance, article PMC5421081 was scanned and processed with OCR, which gave the following output (downloaded from [2]):
Harrison's CD-ROM. Version 1.0.
(Harrison's principles of internal
medicine, 13th edition, 1994 with
selected Medline? abstracts).
McGraw-Hill, New York, 1995.
$165.00.
A major medical textbook, a phar-
macopeia* and selected Medline?
abstracts are available on one CD,
...
As you can see, this is just the raw, unedited .txt output of an OCR tool run on an old, scanned article.
Issues
The file is a raw .txt without any recognizable or general structure (the layout differs from article to article). We currently have no parsers to read this kind of file.
There seems to be no general way to get the article metadata (title, authors, date, journal, ...) or the article topics from the .txt itself. We would need to look this information up in external databases.
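To make the first issue concrete, here is a minimal sketch of the most a generic reader could do with such a file: rejoin words hyphenated across line breaks and collapse wrapped lines into paragraphs. The function name normalize_ocr_text is hypothetical and nothing like it exists in the code base. It recovers neither structure nor metadata, and the de-hyphenation heuristic can wrongly merge words that are genuinely hyphenated.

```python
import re


def normalize_ocr_text(raw):
    """Return a list of paragraphs from raw OCR text (heuristic, lossy)."""
    # Rejoin words split by a hyphen at a line end: "phar-\nmacopeia" -> "pharmacopeia".
    # Assumption: a hyphen directly before a newline is a line-wrap hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Treat blank lines as paragraph separators.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse remaining line breaks and repeated spaces inside each paragraph.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]
```

Applied to the excerpt above, the lines ending in "phar-" and starting with "macopeia*" would be joined into "pharmacopeia*", but OCR noise such as "Medline?" stays untouched.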
Proposed Solution
Because of the issues above, I suggest we don't address historical_ocr for the moment; we can do that later on.
Context
As explained on the PMC website here, the following types of files are available for bulk download.
1.1 oa_comm – Commercial Use Allowed - CC0, CC BY, CC BY-SA, CC BY-ND licenses
1.2 oa_noncomm – Non-Commercial Use Only - CC BY-NC, CC BY-NC-SA, CC BY-NC-ND licenses
1.3 oa_other – No machine-readable Creative Commons license, no license, or a custom license
But currently, when we run bbs_database download pmc, we are only downloading the subsets 1.1, 1.2, 2., and 3.
Search/src/bluesearch/database/download.py
Line 119 in 08fc2b0
Actions
oa_other.
historical_ocr.