- Solve fi problem -> ligatures
- Parsers for journals: 30
- Test out diff for test3_jgra(_fi)
- Add missing parsers: ARX
- Test CLIMD, ECOAPP, NPJCLIAC, NPJCLISCI, GCB
- Solve Acknowledgement problem in PNAS -> Check Notes
- Solve Missing text problems in IJOC -> Check Notes
- Add new PNAS sample to all devices
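
For the ligature item in the list above, one possible approach is to map the Latin ligature code points back to plain letters before comparing outputs. This is only a sketch under the assumption that the "fi problem" comes from Unicode ligature characters (such as U+FB01) in the extracted text; `LIGATURES` and `expand_ligatures` are hypothetical names, not part of the repository.

```python
# Hypothetical helper: replaces the common Latin ligature code points
# with their plain-letter equivalents.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.maketrans accepts a dict that maps single characters to strings.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Return text with ligature characters expanded to plain letters."""
    return text.translate(_LIGATURE_TABLE)


print(expand_ligatures("de\ufb01ned coe\ufb03cients"))  # prints "defined coefficients"
```

`unicodedata.normalize("NFKC", text)` would also expand these characters, but it rewrites other compatibility characters as well (superscripts, for example), so the explicit table is the more conservative choice.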
- Create the Docker image from the folder containing the Dockerfile: `docker build -t pdf_text:1.1.2 .`
- Run Docker Compose from the folder containing docker-compose.yml, then open it in VS Code: `docker compose up`
Note: Adjust the volumes in docker-compose.yml to your needs.
- Install the dependencies: `pip install -r requirements.txt`
Columns | Descriptions |
---|---|
1. Title | Paper title in a list: ["Effects of pretraining corpora"] |
2. Authors_and_Affiliations | List of (author, affiliation numbers) tuples: [("Andrija Poleksic", "1, 2"), (...)] |
3. Affiliations | List of (affiliation number, affiliation text) tuples: [(1, "FIDIT"), (...)] |
4. DOI | Paper DOI in a list: ["10.23919/mipro57284.2023.10159770"] |
5. Authors | String containing all authors or detailed list: "Poleksic, Andrija and ..." or [{'ORCID': '123', 'creator': 'Poleksic, Andrija'}, {...}] |
6. Journal | Name of the journal: "Nature Geoscience" |
7. Date | Date of publishing: 5-30-2034 |
8. Subjects | List of topics in the paper: ["Earth Sciences", "..."] |
9. Abstract | Abstract text of the paper: "The amount of data ..." |
10. References | List of references: ["Matching the Blanks: Distributio ..."] |
11. Content | Full text from paper: "Reading text to identify and ..." |
12. Keywords | Keywords or key points of the paper, as a list or a string: ["Internal variability", "..."] or "A significant interdecadal variation ..." |
13. Style | Debug data: "1" |
Data template:

    # Original
    paper_data = {
        "Title": title,
        "Authors_and_Affiliations": authors_and_affiliations,
        "Affiliations": affiliations,
        "DOI": doi,
        "Authors": authors,
        "Journal": journal,
        "Date": date,
        "Subjects": subjects,
        "Abstract": abstract,
        "References": references,
        "Content": content,
        "Keywords": keywords,
        "Style": style,
    }

    # Default
    paper_data = {
        "Title": "no_title",
        "Authors_and_Affiliations": "no_auth_and_affil",
        "Affiliations": "no_affil",
        "DOI": "no_doi",
        "Authors": "no_author",
        "Journal": "no_journal",
        "Date": "no_date",
        "Subjects": "no_subjects",
        "Abstract": "no_abstract",
        "References": "no_references",
        "Content": "no_content",
        "Keywords": "no_keywords",
        "Style": s,
    }

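
The two templates above can be combined so that a parser only supplies the fields it actually found and every other column falls back to its placeholder. The following is a minimal sketch, not code from the repository: `DEFAULTS` and `make_paper_data` are hypothetical names, and treating empty values as missing is an assumption.

```python
from typing import Any

# Placeholder values copied from the "Default" template.
DEFAULTS: dict[str, str] = {
    "Title": "no_title",
    "Authors_and_Affiliations": "no_auth_and_affil",
    "Affiliations": "no_affil",
    "DOI": "no_doi",
    "Authors": "no_author",
    "Journal": "no_journal",
    "Date": "no_date",
    "Subjects": "no_subjects",
    "Abstract": "no_abstract",
    "References": "no_references",
    "Content": "no_content",
    "Keywords": "no_keywords",
}


def make_paper_data(style: str, **fields: Any) -> dict[str, Any]:
    """Build one record; extracted fields override the placeholders."""
    unknown = set(fields) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unexpected columns: {sorted(unknown)}")
    # Assumption: None, "" and [] all mean "nothing was extracted".
    found = {key: value for key, value in fields.items() if value}
    # "Style" is added last so the key order matches the template.
    return {**DEFAULTS, **found, "Style": style}


record = make_paper_data(
    "1",
    Title=["Effects of pretraining corpora"],
    Journal="Nature Geoscience",
    Abstract="",  # empty, so the placeholder is kept
)
print(record["Abstract"])  # prints "no_abstract"
```

Rejecting unknown column names early keeps a typo in a parser from silently adding a fourteenth column to the dataframe.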
- climd_htmlpars.py
- ecoapp_htmlpars.py
    2. Ecological Applications
- ehs_htmlpars.py
    3. Ecosystem Health and Sustainability
- enerpol_htmlpars.py
    4. Energy Policy
- gcb_htmlpars.py
    5. Global Change Biology
- ijoc_htmlpars.py
    6. International Journal of Climatology
- jclimate_htmlpars.py
    7. Journal of Climate
- jgra_htmlpars.py
    8. Journal of Geophysical Research: Atmospheres
- mdpi_htmlpars.py
    9. MDPI Air
    10. MDPI Atmosphere
    11. MDPI Climate
    12. MDPI Earth
    13. MDPI Ecologies
    14. MDPI Energies
    15. MDPI Environments
    16. MDPI Forests
    17. MDPI Fuels
    18. MDPI Hydrology
    19. MDPI Meteorology
    20. MDPI Oceans
    21. MDPI Recycling
    22. MDPI Sustainable Chemistry
    23. MDPI Water
- nature_htmlpars.py
    24. Nature Climate Change
- ngeo_htmlpars.py
    25. Nature Geoscience
- npjcliac_htmlpars.py
    26. NPJ Climate Action
- npjclisci_htmlpars.py
    27. NPJ Climate and Atmospheric Science
    28. NPJ Ocean Sustainability
- pnas_htmlpars.py
    29. PNAS
- based on try16.py and outputs in test25.txt
- Processing with run_htmlpars_parallel.py and desired *_htmlpars.py -> OUTPUT: Fragmented pickled pandas files with defined data structure
- Concatenation of the fragmented dataframes with concat_dataframes.py -> OUTPUT: Dataframes concatenated per journal
- Deduplication of the papers per journal with check_for_duplicates.py -> OUTPUT: Deduplicated dataframes
- Concatenation into a single file with check_for_duplicates.py -> OUTPUT: Dataframe containing all journals
- Saving in multiple formats for different tasks (p2csv_tc.py) -> OUTPUT: CSV dataset, CSV with Title-Content pairs, and pickle with Title-Content pairs
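
The concatenation, deduplication, and Title-Content steps above can be sketched with pandas as follows. This is an illustration under stated assumptions, not the code in concat_dataframes.py, check_for_duplicates.py, or p2csv_tc.py: the function names are hypothetical, and using the DOI as the duplicate key is an assumption, since the criterion used by the repository scripts is not stated here.

```python
import pandas as pd


def first_or_self(value):
    """Unwrap single-item lists such as ["10.1/a"] into a hashable value."""
    if isinstance(value, list):
        return value[0] if value else ""
    return value


def concat_and_deduplicate(fragments: list[pd.DataFrame]) -> pd.DataFrame:
    """Concatenate fragment dataframes and keep the first row per DOI."""
    merged = pd.concat(fragments, ignore_index=True)
    # List-valued cells are unhashable, so build a plain-string key first.
    keys = merged["DOI"].map(first_or_self)
    # Rows without a usable DOI cannot be compared, so all of them are kept.
    is_duplicate = keys.duplicated() & ~keys.isin(["", "no_doi"])
    return merged.loc[~is_duplicate].reset_index(drop=True)


def title_content_pairs(papers: pd.DataFrame) -> pd.DataFrame:
    """Reduce the dataset to Title-Content pairs with plain-string titles."""
    pairs = papers[["Title", "Content"]].copy()
    pairs["Title"] = pairs["Title"].map(first_or_self)
    return pairs


# Two toy fragments that share one paper.
fragments = [
    pd.DataFrame({"Title": [["A"]], "DOI": [["10.1/a"]], "Content": ["text a"]}),
    pd.DataFrame(
        {
            "Title": [["A"], ["B"]],
            "DOI": [["10.1/a"], "no_doi"],
            "Content": ["text a", "text b"],
        }
    ),
]
papers = concat_and_deduplicate(fragments)
pairs = title_content_pairs(papers)
print(len(papers))  # prints 2
```

In practice the fragments would be loaded with `pd.read_pickle`, and the results written with `DataFrame.to_csv` and `DataFrame.to_pickle` to produce the three output files.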
- Continuation of this work
- Made with vocab_compare.py
@inproceedings{poleksic2024towards,
title = {Towards Dataset for Extracting Relations in the Climate-Change Domain},
author = {Andrija Poleksi{\'c} and Sanda Martin{\v{c}}i{\'c}-Ip{\v{s}}i{\'c}},
booktitle = {Proceedings of the Third International Workshop on Knowledge Graph Generation from Text, co-located with Extended Semantic Web Conference (ESWC)},
year = 2024,
address = {Hersonissos, Greece},
pages = {xx--yy},
date = {May 26--30},
}