pubmed_standardization

This library takes a collection of PubMed abstracts in XML format stored in a working directory and standardizes the content, generating an individual plain-text file for each abstract.

Description

The input directory contains PubMed's *.gz files.

The first task the library performs is decompressing the files. It then reads the PubMed XML files containing the abstracts and generates a new text file for each article. The files are named after the PMID, e.g. PMIDXXX.txt.
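The decompress-and-read step described above can be sketched as follows. This is a minimal illustration, not the code in pubmed_standardization.py: the function name `iter_articles` and the dictionary keys are hypothetical, and the element paths assume the usual PubMed XML layout (`PubmedArticle` > `MedlineCitation` > `PMID`, `Article`).

```python
import gzip
import xml.etree.ElementTree as ET


def iter_articles(gz_path):
    """Yield one dict per PubmedArticle in a gzipped PubMed XML file.

    Hypothetical helper for illustration; not part of the library's API.
    """
    with gzip.open(gz_path, "rb") as handle:
        # iterparse streams the document, so the whole file is never held in memory
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag != "PubmedArticle":
                continue
            # An abstract may be split into several AbstractText sections
            abstract_parts = [
                "".join(node.itertext()).strip()
                for node in elem.findall(".//Abstract/AbstractText")
            ]
            title = elem.find(".//ArticleTitle")
            yield {
                "pmid": elem.findtext("MedlineCitation/PMID", default=""),
                "year": elem.findtext(".//JournalIssue/PubDate/Year", default=""),
                "month": elem.findtext(".//JournalIssue/PubDate/Month", default=""),
                # itertext() keeps text nested inside inline markup such as <i>
                "title": "".join(title.itertext()).strip() if title is not None else "",
                "abstract": " ".join(part for part in abstract_parts if part),
            }
            elem.clear()  # release the processed article
```

Reading through `gzip.open` avoids writing an uncompressed copy to disk; the library itself may unzip the files first, as described above.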

This library can be used as an intermediate step in any pipeline that needs plain text from PubMed abstracts in XML format. It is useful for NLP tasks such as classification and topic mining.

After standardization, each text file contains the following (in this order):

year month pmid title abstract
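As a rough sketch of how one such file could be produced, the snippet below writes a single record in the field order listed above. The helper `write_record`, the `record` dictionary, and the tab separator are assumptions made for illustration; the README does not state which separator the library uses.

```python
from pathlib import Path

# Field order taken from the README; everything else here is hypothetical
FIELD_ORDER = ("year", "month", "pmid", "title", "abstract")


def write_record(record, out_dir, sep="\t"):
    """Write one abstract to <out_dir>/PMID<pmid>.txt and return the path.

    The separator is an assumption; adjust it to match the real output.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Collapse internal whitespace so each field stays on a single line
    fields = [" ".join(str(record.get(name, "")).split()) for name in FIELD_ORDER]
    target = out_dir / f"PMID{record['pmid']}.txt"
    target.write_text(sep.join(fields) + "\n", encoding="utf-8")
    return target
```

For example, `write_record({"year": "2019", "month": "Jan", "pmid": "123", "title": "T", "abstract": "A"}, "out")` would create `out/PMID123.txt`.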

Current Version: 1.0, 2020-05-12

Changelog

Docker

debbieproject/pubmed_standardization

Run the Docker container

# To run the container, set the input folder and the output folder
docker run -v ${PWD}/pubmed:/in -v ${PWD}/standardization_output:/out pubmed_standardization:version python3 /app/pubmed_standardization.py -i /in -o /out

Parameters:

-i input folder. Subfolders are processed as well.

-o output folder.
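To illustrate how these two parameters and the subfolder behaviour might fit together, here is a small sketch using `argparse` and `pathlib`. The names `parse_args`, `find_archives`, `input_dir`, and `output_dir` are hypothetical and are not taken from pubmed_standardization.py; only the `-i` and `-o` flags come from this README.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the -i / -o options described in the README (illustrative only)."""
    parser = argparse.ArgumentParser(description="Standardize PubMed abstracts")
    parser.add_argument("-i", dest="input_dir", required=True, help="input folder")
    parser.add_argument("-o", dest="output_dir", required=True, help="output folder")
    return parser.parse_args(argv)


def find_archives(input_dir):
    """Return every *.gz file under input_dir, including subfolders."""
    # rglob walks the directory tree recursively
    return sorted(Path(input_dir).rglob("*.gz"))
```

With this sketch, `parse_args(["-i", "/in", "-o", "/out"])` mirrors the arguments passed in the Docker command above.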

Built With

Docker - Docker Containers

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

**Javier Corvi - Austin Mckitrick - Osnat Hakimi**

License

This project is licensed under the GNU General Public License, Version 3; see the LICENSE file for details.

Funding

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 751277.
