PDF to TEXT
the conversion of PDFs to text files and the parsing of these text files happen in this module: https://github.com/mrzl/ECO/tree/master/src/python/pdf2text
there are two steps:
1. convert pdf to txt
happens via the library textract (https://github.com/deanmalmgren/textract)
2. extract valid sentences from the txt
happens in this module: https://github.com/mrzl/ECO/blob/master/src/python/pdf2text/textparser.py
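as a rough illustration of step 1, here is a minimal sketch of converting a single pdf with textract. `textract.process` is the library's documented entry point and returns the extracted text as bytes; the helper names `txt_path_for` and `pdf_to_txt` are hypothetical and are not taken from the ECO code.

```python
from pathlib import Path


def txt_path_for(pdf_path, output_dir):
    """Hypothetical helper: map some/dir/paper.pdf to output_dir/paper.txt."""
    return Path(output_dir) / (Path(pdf_path).stem + ".txt")


def pdf_to_txt(pdf_path, output_dir):
    """Extract the text of one pdf and write it next to the other txt files."""
    # third-party dependency, imported here so the path helper works without it
    import textract

    raw = textract.process(str(pdf_path))  # bytes with the extracted text
    target = txt_path_for(pdf_path, output_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(raw)
    return target
```

calling `pdf_to_txt("paper.pdf", "out")` would then write `out/paper.txt`, assuming textract and its pdf backend are installed.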
a few tests/rules are applied to each sentence of a corpus. a sentence is saved for further use
only if it passes all of them:
- the sentence needs to consist of more than 10 words
- more than half of the words need to consist of more than one character
- the first word may not be a number in any form (see the CD tag here: http://www.clips.ua.ac.be/pages/mbsp-tags)
- the sentence may not contain ( ) or \ characters
- the sentence needs to contain fewer than 3 commas (,)
- the sentence may not be in any language other than english
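the rules above can be sketched roughly as follows. this is not the code from textparser.py, only an illustration under a few assumptions: the number check is a simple regex and word-list stand-in for a real part-of-speech tagger (CD tag), the language check is an optional callback (`is_english`) that you would have to supply yourself, and all function and rule names are hypothetical.

```python
import re

# rough stand-in for the CD (cardinal number) tag: digits or a few number words
_NUMBER_WORDS = {"zero", "one", "two", "three", "four", "five",
                 "six", "seven", "eight", "nine", "ten"}
_NUMERIC = re.compile(r"^[\d.,/-]*\d[\d.,/-]*$")


def looks_like_number(word):
    token = word.strip(".,;:!?").lower()
    return bool(_NUMERIC.match(token)) or token in _NUMBER_WORDS


def first_failed_rule(sentence, is_english=None):
    """Return the name of the first rule the sentence fails, or None if it passes."""
    words = sentence.split()
    if len(words) <= 10:
        return "too_few_words"
    if sum(len(w) > 1 for w in words) <= len(words) / 2:
        return "too_many_single_chars"
    if looks_like_number(words[0]):
        return "starts_with_number"
    if any(c in sentence for c in "()\\"):
        return "forbidden_character"
    if sentence.count(",") >= 3:
        return "too_many_commas"
    # the language test only runs if a detector callback was passed in
    if is_english is not None and not is_english(sentence):
        return "not_english"
    return None
```

returning the name of the failed rule instead of a plain True/False makes it easy to count how many sentences were dropped by each rule.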
the conversion can be started with this module, https://github.com/mrzl/ECO/blob/master/src/python/pdf2text/batch_postprocess_text.py:
# example of how to run it
workon eco_pdf2text
pip install -r requirements.txt
cd /home/marcel/projects/eco/src/python/pdf2text
python batch_postprocess_text.py --input_path /home/marcel/drive/data/eco/NAIL_DATAFIELD/arts_arthistory_aesthetics/ --output_path /home/marcel/drive/data/eco/NAIL_DATAFIELD_txt/arts_arthistory_aesthetics/
this will print some rough info to the terminal. more detailed statistics will be saved to
/home/marcel/projects/eco/src/python/pdf2text/arts_arthistory_aesthetics_statistics.txt
these statistics can be visualized with the module https://github.com/mrzl/ECO/blob/master/src/python/pdf2text/visualize_statistics.py:
python visualize_statistics.py --statistics_path arts_arthistory_aesthetics_statistics.txt
this gives some insight into how many sentences were successfully parsed and how many were dropped because of each of the above rules. here's an example from our pdf collection (135 PDFs):