
Repository for content extraction from PDF and HTML files.


P0L3/PDF2TXT


A repo for PDF processing of certain journals

TODO

  • Solve the "fi" problem (ligatures)
  • Parsers for journals: 30
  • Test the diff for test3_jgra(_fi)
  • Add missing parsers: ARX
  • Test CLIMD, ECOAPP, NPJCLIAC, NPJCLISCI, GCB
  • Solve the Acknowledgement problem in PNAS -> check Notes
  • Solve the missing-text problems in IJOC -> check Notes
  • Add the new PNAS sample to all devices
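
One possible approach to the ligature item above, as a minimal sketch: it assumes the extracted text contains the Unicode ligature code points (for example U+FB01 for "fi") and maps them back to plain letters. The name `expand_ligatures` is hypothetical and not part of this repository. If the extractor drops the glyph entirely instead of emitting a ligature character, this will not help.

```python
# Latin ligatures from the Unicode "Alphabetic Presentation Forms" block,
# mapped to their plain-letter spelling.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.maketrans accepts single-character keys and string values of any length.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace ligature code points with ordinary letters (hypothetical helper)."""
    return text.translate(_LIGATURE_TABLE)


sample = "The coe\ufb03cient re\ufb02ects a signi\ufb01cant e\ufb00ect."
print(expand_ligatures(sample))
# The coefficient reflects a significant effect.
```

`unicodedata.normalize("NFKC", text)` from the standard library also expands these ligatures, but it rewrites other compatibility characters as well (superscripts, full-width forms), so an explicit table is the more conservative choice for scientific text.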

Setup

With Docker

  1. Build the Docker image from the folder containing the Dockerfile:
docker build -t pdf_text:1.1.2 .
  2. Run Docker Compose from the folder containing docker-compose.yml, then open it in VS Code:
docker compose up

Note: Adjust the volumes in docker-compose.yml to match your setup.

Without Docker

  1. Install the dependencies listed in requirements.txt:
pip install -r requirements.txt

Data structure

| # | Column | Description |
|---|--------|-------------|
| 1 | Title | Paper title in a list: ["Effects of pretraining corpora"] |
| 2 | Authors_and_Affiliations | List of author and affiliation-number tuples: [("Andrija Poleksic", "1, 2"), (...)] |
| 3 | Affiliations | Affiliation number and text tuples: [(1, "FIDIT"), (...)] |
| 4 | DOI | Paper DOI in a list: ["10.23919/mipro57284.2023.10159770"] |
| 5 | Authors | String containing all authors, or a detailed list: "Poleksic, Andrija and ..." or [{'ORCID': '123', 'creator': 'Poleksic, Andrija'}, {...}] |
| 6 | Journal | Name of the journal: "Nature Geoscience" |
| 7 | Date | Date of publishing: 5-30-2034 |
| 8 | Subjects | List of topics in the paper: ["Earth Sciences", "..."] |
| 9 | Abstract | Abstract text of the paper: "The amount of data ..." |
| 10 | References | List of references: ["Matching the Blanks: Distributio ..."] |
| 11 | Content | Full text of the paper: "Reading text to identify and ..." |
| 12 | Keywords | Keywords or key points from the paper, as a list or a string: ["Internal variability", "..."] or "A significant interdecadal variation ..." |
| 13 | Style | Debug data: "1" |
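
As a rough illustration of the structure described above, the following sketch builds one record from the example values given for each column and checks that all thirteen columns are present. `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here, not part of the repository.

```python
# Column names in the order listed above.
EXPECTED_COLUMNS = [
    "Title", "Authors_and_Affiliations", "Affiliations", "DOI", "Authors",
    "Journal", "Date", "Subjects", "Abstract", "References", "Content",
    "Keywords", "Style",
]


def missing_columns(record: dict) -> list:
    """Return the expected columns that are absent from a record (hypothetical helper)."""
    return [column for column in EXPECTED_COLUMNS if column not in record]


# One record assembled from the example values in the column descriptions.
record = {
    "Title": ["Effects of pretraining corpora"],
    "Authors_and_Affiliations": [("Andrija Poleksic", "1, 2")],
    "Affiliations": [(1, "FIDIT")],
    "DOI": ["10.23919/mipro57284.2023.10159770"],
    "Authors": "Poleksic, Andrija and ...",
    "Journal": "Nature Geoscience",
    "Date": "5-30-2034",
    "Subjects": ["Earth Sciences"],
    "Abstract": "The amount of data ...",
    "References": ["Matching the Blanks: Distributio ..."],
    "Content": "Reading text to identify and ...",
    "Keywords": ["Internal variability"],
    "Style": "1",
}

print(missing_columns(record))                  # nothing missing
print(missing_columns({"Title": ["Untitled"]})) # every column except Title
```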

Data template:

# Original
paper_data = {
    "Title": title,
    "Authors_and_Affiliations": authors_and_affiliations,
    "Affiliations": affiliations,
    "DOI": doi,
    "Authors": authors,
    "Journal": journal,
    "Date": date,
    "Subjects": subjects,
    "Abstract": abstract,
    "References": references,
    "Content": content,
    "Keywords": keywords,
    "Style": style,
}

# Default
paper_data = {
    "Title": "no_title",
    "Authors_and_Affiliations": "no_auth_and_affil",
    "Affiliations": "no_affil",
    "DOI": "no_doi",
    "Authors": "no_author",
    "Journal": "no_journal",
    "Date": "no_date",
    "Subjects": "no_subjects",
    "Abstract": "no_abstract",
    "References": "no_references",
    "Content": "no_content",
    "Keywords": "no_keywords",
    "Style": s,
}
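
The two templates can be combined so that a parser only supplies the fields it actually found. The sketch below is one way to do that and is not taken from the repository: `build_paper_data` is a hypothetical helper, and treating `None` as "field not found" is an assumption.

```python
# Placeholder values from the default template ("Style" is supplied separately).
DEFAULTS = {
    "Title": "no_title",
    "Authors_and_Affiliations": "no_auth_and_affil",
    "Affiliations": "no_affil",
    "DOI": "no_doi",
    "Authors": "no_author",
    "Journal": "no_journal",
    "Date": "no_date",
    "Subjects": "no_subjects",
    "Abstract": "no_abstract",
    "References": "no_references",
    "Content": "no_content",
    "Keywords": "no_keywords",
}


def build_paper_data(style, **extracted):
    """Start from the defaults and overwrite them with extracted fields (hypothetical helper)."""
    unknown = set(extracted) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown columns: {sorted(unknown)}")
    paper_data = dict(DEFAULTS)
    # Assumption: a value of None means the parser could not find the field.
    paper_data.update({k: v for k, v in extracted.items() if v is not None})
    paper_data["Style"] = style
    return paper_data


paper = build_paper_data("1", Title=["Effects of pretraining corpora"], DOI=None)
print(paper["Title"], paper["DOI"], paper["Style"])
# ['Effects of pretraining corpora'] no_doi 1
```

Rejecting unknown column names early keeps typos such as "Abstrct" from silently producing a record that still carries the "no_abstract" placeholder.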

Journals

  1. Miscellaneous ArXiv

Test reports

Dataset preparation

  1. Processing with run_htmlpars_parallel.py and desired *_htmlpars.py -> OUTPUT: Fragmented pickled pandas files with defined data structure
  2. Concatenation of the fragmented dataframes with concat_dataframes.py -> OUTPUT: Dataframes concatenated per journal
  3. Deduplication of the papers per journal with check_for_duplicates.py -> OUTPUT: Deduplicated dataframes
  4. Concatenation into a single file with check_for_duplicates.py -> OUTPUT: Dataframe containing all journals
  5. Saving in multiple formats for different tasks (p2csv_tc.py) -> OUTPUT: CSV dataset, CSV with Title-Content pairs, and pickle with Title-Content pairs
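
The steps above can be sketched in plain Python. This is only an illustration of the data flow: the repository's scripts work on pickled pandas DataFrames and their internal logic is not shown here, so the function names, the use of lists of dicts, and the rule of deduplicating on a lowercased DOI are all assumptions. "Paper A" and "Paper B" are made-up sample records.

```python
def concatenate(fragments):
    """Steps 2 and 4: merge several fragments into one list of records."""
    return [record for fragment in fragments for record in fragment]


def doi_key(record):
    """Return a normalized DOI, or None when the record has no usable DOI."""
    doi = record.get("DOI")
    if isinstance(doi, list):
        doi = doi[0] if doi else None
    if not doi or doi == "no_doi":
        return None
    return doi.strip().lower()  # DOIs are case-insensitive


def deduplicate(records):
    """Step 3: keep the first record per DOI; records without a DOI are all kept."""
    seen, unique = set(), []
    for record in records:
        key = doi_key(record)
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        unique.append(record)
    return unique


def title_content_pairs(records):
    """Step 5: reduce each record to a (title, content) pair."""
    pairs = []
    for record in records:
        title = record["Title"]
        if isinstance(title, list):
            title = " ".join(title)
        pairs.append((title, record["Content"]))
    return pairs


fragments = [
    [{"Title": ["Paper A"], "DOI": ["10.23919/mipro57284.2023.10159770"], "Content": "Text A"}],
    [
        {"Title": ["Paper A"], "DOI": ["10.23919/MIPRO57284.2023.10159770"], "Content": "Text A"},
        {"Title": ["Paper B"], "DOI": "no_doi", "Content": "Text B"},
    ],
]

records = deduplicate(concatenate(fragments))
pairs = title_content_pairs(records)
print(pairs)
# [('Paper A', 'Text A'), ('Paper B', 'Text B')]
```

Records that still carry the "no_doi" placeholder are never treated as duplicates of each other, since the placeholder says nothing about which paper they belong to.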

Vocabulary creation

  • Continuation of this work

Entity and term data

Cite

@inproceedings{poleksic2024towards,
  title        = {Towards Dataset for Extracting Relations in the Climate-Change Domain},
  author       = {Andrija Poleksi{\'c} and Sanda Martin{\v{c}}i{\'c}-Ip{\v{s}}i{\'c}},
  booktitle    = {Proceedings of the Third International Workshop on Knowledge Graph Generation from Text, co-located with Extended Semantic Web Conference (ESWC)},
  year         = 2024,
  address      = {Hersonissos, Greece},
  pages        = {xx--yy},
  date         = {May 26--30},
}
