A repo for PDF processing of certain journals

TODO

Solve fi problem -> ligatures
Parsers for journals: 30
Test out diff for test3_jgra(_fi)
Add missing parsers: ARX
Test CLIMD, ECOAPP, NPJCLIAC, NPJCLISCI, GCB
Solve Acknowledgement problem in PNAS -> Check Notes
Solve Missing text problems in IJOC -> Check Notes
Add new PNAS sample to all devices
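
For the ligature item above, one possible approach is sketched here. It assumes the "fi problem" comes from extracted text containing Unicode ligature code points (for example U+FB01) instead of plain letters; if the characters are dropped entirely during extraction, this will not help. The helper name `replace_ligatures` and the mapping table are hypothetical and not part of the repository.

```python
# Hypothetical helper, not part of the repo: map common Latin ligature
# code points back to their plain-letter sequences.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text: str) -> str:
    """Return text with the ligatures listed above expanded to plain letters."""
    return text.translate(_LIGATURE_TABLE)


if __name__ == "__main__":
    sample = "\ufb01eld e\ufb03cient \ufb02ow"
    print(replace_ligatures(sample))
```

An explicit table is used here rather than a blanket Unicode normalization so that other characters common in scientific text (superscripts, special symbols) are left untouched.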

Setup

With Docker

Create Docker image from the folder containing Dockerfile

docker build -t pdf_text:1.1.2 .

Run Docker Compose from the folder that contains docker-compose.yml, then open it in VS Code:

docker compose up

Note: Adjust the volumes in docker-compose.yml to match your own setup.

Without Docker

Install the dependencies with pip:

pip install -r requirements.txt

Data structure

Columns	Descriptions
1. Title	Paper title in a list: `["Effects of pretraining corpora"]`
2. Authors_and_Affiliations	List of (author, affiliation numbers) tuples: `[("Andrija Poleksic", "1, 2"), (...)]`
3. Affiliations	List of (affiliation number, affiliation text) tuples: `[(1, "FIDIT"), (...)]`
4. DOI	Paper DOI in a list: `["10.23919/mipro57284.2023.10159770"]`
5. Authors	String containing all authors or detailed list: `"Poleksic, Andrija and ..."` or `[{'ORCID': '123', 'creator': 'Poleksic, Andrija'}, {...}]`
6. Journal	Name of the journal: `"Nature Geoscience"`
7. Date	Date of publishing: `5-30-2034`
8. Subjects	List of topics in the paper: `["Earth Sciences", "..."]`
9. Abstract	Abstract text of the paper: `"The amount of data ..."`
10. References	List of references: `["Matching the Blanks: Distributio ..."]`
11. Content	Full text from paper: `"Reading text to identify and ..."`
12. Keywords	Keywords or keypoints from a paper, list, or string: `["Internal variability", "..."]` or `"A significant interdecadal variation ..."`
13. Style	Debug data: `"1"`

Data template:

# Original
paper_data = {
    "Title": title,
    "Authors_and_Affiliations": authors_and_affiliations,
    "Affiliations": affiliations,
    "DOI": doi,
    "Authors": authors,
    "Journal": journal,
    "Date": date,
    "Subjects": subjects,
    "Abstract": abstract,
    "References": references,
    "Content": content,
    "Keywords": keywords,
    "Style": style,
}

# Default
paper_data = {
    "Title": "no_title",
    "Authors_and_Affiliations": "no_auth_and_affil",
    "Affiliations": "no_affil",
    "DOI": "no_doi",
    "Authors": "no_author",
    "Journal": "no_journal",
    "Date": "no_date",
    "Subjects": "no_subjects",
    "Abstract": "no_abstract",
    "References": "no_references",
    "Content": "no_content",
    "Keywords": "no_keywords",
    "Style": s,
}
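
The two templates can be combined so that a parser only supplies the fields it actually found and every other column falls back to its placeholder. The following is a minimal sketch under that assumption; the function name `make_paper_data` and the rule that `None` means "not found" are hypothetical and not taken from the parsers in this repo.

```python
# Placeholder values copied from the "Default" template above.
# "Style" is left out because it is always supplied by the caller.
DEFAULTS = {
    "Title": "no_title",
    "Authors_and_Affiliations": "no_auth_and_affil",
    "Affiliations": "no_affil",
    "DOI": "no_doi",
    "Authors": "no_author",
    "Journal": "no_journal",
    "Date": "no_date",
    "Subjects": "no_subjects",
    "Abstract": "no_abstract",
    "References": "no_references",
    "Content": "no_content",
    "Keywords": "no_keywords",
}


def make_paper_data(style, **fields):
    """Build a 13-column record, filling missing fields with placeholders.

    Hypothetical helper: a field passed as None is treated as "not found".
    """
    unknown = set(fields) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown columns: {sorted(unknown)}")
    record = dict(DEFAULTS)  # keeps the column order of the template
    record.update({k: v for k, v in fields.items() if v is not None})
    record["Style"] = style
    return record


if __name__ == "__main__":
    row = make_paper_data("1", Title=["Effects of pretraining corpora"], DOI=None)
    print(row["Title"], row["DOI"], row["Style"])
```

Rejecting unknown column names early keeps a misspelled key from silently producing a record with a placeholder where real data was expected.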

Journals

climd_htmlpars.py
1. Climate Dynamics
ecoapp_htmlpars.py
2. Ecological Applications
ehs_htmlpars.py
3. Ecosystem Health and Sustainability
enerpol_htmlpars.py
4. Energy Policy
gcb_htmlpars.py
5. Global Change Biology
ijoc_htmlpars.py
6. International Journal of Climatology
jclimate_htmlpars.py
7. Journal of Climate
jgra_htmlpars.py
8. Journal of Geophysical Research: Atmospheres
mdpi_htmlpars.py
9. MDPI Air
10. MDPI Atmosphere
11. MDPI Climate
12. MDPI Earth
13. MDPI Ecologies
14. MDPI Energies
15. MDPI Environments
16. MDPI Forests
17. MDPI Fuels
18. MDPI Hydrology
19. MDPI Meteorology
20. MDPI Oceans
21. MDPI Recycling
22. MDPI Sustainable Chemistry
23. MDPI Water
nature_htmlpars.py
24. Nature Climate Change
ngeo_htmlpars.py
25. Nature Geoscience
npjcliac_htmlpars.py
26. NPJ Climate Action
npjclisci_htmlpars.py
27. NPJ Climate and Atmospheric Science
28. NPJ Ocean Sustainability
pnas_htmlpars.py
29. PNAS

Miscellaneous ArXiv

Test reports

Based on try16.py and the outputs in test25.txt.

PNAS report

IJOC report

JCLIMATE report

CLIMD report

JGRA report

Dataset preparation

1. Processing with run_htmlpars_parallel.py and the desired *_htmlpars.py -> OUTPUT: Fragmented pickled pandas files with the defined data structure
2. Concatenation of the fragmented dataframes with concat_dataframes.py -> OUTPUT: Dataframes concatenated per journal
3. Deduplication of the papers per journal with check_for_duplicates.py -> OUTPUT: Deduplicated dataframes
4. Concatenation into a single file with check_for_duplicates.py -> OUTPUT: Dataframe containing all journals
5. Saving in multiple formats for different tasks (p2csv_tc.py) -> OUTPUT: CSV dataset, CSV with Title-Content pairs, and pickle with Title-Content pairs
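
The concatenate, deduplicate, and export steps above can be illustrated with a small plain-Python sketch. It does not reproduce concat_dataframes.py, check_for_duplicates.py, or p2csv_tc.py, and it uses lists of dicts as a stand-in for the pandas dataframes. The deduplication key (DOI when present, otherwise a normalized title) and all function names are assumptions made for illustration.

```python
import csv
import io


def dedup_key(record):
    """Assumed key: lower-cased DOI if available, else the normalized title."""
    doi = record.get("DOI")
    if isinstance(doi, list) and doi:
        doi = doi[0]
    if isinstance(doi, str) and doi != "no_doi":
        return ("doi", doi.strip().lower())
    title = record.get("Title")
    if isinstance(title, list) and title:
        title = title[0]
    return ("title", " ".join(str(title).lower().split()))


def deduplicate(records):
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def title_content_csv(records):
    """Return CSV text with one Title-Content pair per record."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Title", "Content"])
    for record in records:
        title = record["Title"]
        if isinstance(title, list):
            title = " ".join(title)
        writer.writerow([title, record["Content"]])
    return buffer.getvalue()


if __name__ == "__main__":
    fragments = [
        [{"Title": ["A paper"], "DOI": ["10.1/X"], "Content": "text a"}],
        [{"Title": ["A Paper"], "DOI": ["10.1/x"], "Content": "text a"},
         {"Title": ["B"], "DOI": "no_doi", "Content": "text b"}],
    ]
    merged = [record for fragment in fragments for record in fragment]
    print(title_content_csv(deduplicate(merged)))
```

Falling back to the title only when the DOI is the `"no_doi"` placeholder avoids collapsing every paper without a DOI into a single record.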

Vocabulary creation

Dataset statistics

POS and NER reports

BERT_PRETRAINING

Continuation of this work

Vocabulary reports

Made with vocab_compare.py

Entity and term data

Dictionary sources

Cite

@inproceedings{poleksic2024towards,
  title        = {Towards Dataset for Extracting Relations in the Climate-Change Domain},
  author       = {Andrija Poleksi{\'c} and Sanda Martin{\v{c}}i{\'c}-Ip{\v{s}}i{\'c}},
  booktitle    = {Proceedings of the Third International Workshop on Knowledge Graph Generation from Text, co-located with Extended Semantic Web Conference (ESWC)},
  year         = 2024,
  address      = {Hersonissos, Greece},
  pages        = {xx--yy},
  date         = {May 26--30},
}
