PDF to TEXT

Marcel Schwittlick edited this page Nov 30, 2016 · 4 revisions

the conversion of pdfs to text files and the parsing of these text files happen in this module: https://github.com/mrzl/ECO/tree/master/src/python/pdf2text

there are two steps:

1. convert pdf to txt

this happens via the textract library (https://github.com/deanmalmgren/textract)

2. extract valid sentences from the txt

happens in this module: https://github.com/mrzl/ECO/blob/master/src/python/pdf2text/textparser.py
there are a few tests/rules that are applied to each sentence of a corpus. a sentence is saved
for further use only if it passes all of these tests:
- the sentence needs to consist of more than 10 words
- more than half of the words need to consist of more than one character
- the first word may not be a number in any form (see the CD tag at http://www.clips.ua.ac.be/pages/mbsp-tags)
- the sentence may not contain ( ) or \ characters
- the sentence needs to contain fewer than 3 commas (,)
- the sentence needs to be in english (sentences in any other language are dropped)
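
the rules above can be sketched roughly as follows. this is a minimal illustration, not the code from textparser.py: the function names, the rule labels and the `NUMBER_LIKE` regex are made up for this example. the regex is only a crude stand-in for the CD tag check, and the language test is left to an optional, hypothetical `is_english` callable because the wiki does not say which detector is used.

```python
import re
from collections import Counter

# crude stand-in for the CD (cardinal number) tag check: matches tokens such
# as "12", "3.5" or "1,000". an assumption, not the tagger the module uses.
NUMBER_LIKE = re.compile(r"^\d+([.,:/-]\d+)*\W*$")


def first_failed_rule(sentence, is_english=None):
    """Return a label for the first rule the sentence breaks, or None if it passes."""
    words = sentence.split()
    # rule: more than 10 words
    if len(words) <= 10:
        return "too_few_words"
    # rule: more than half of the words have more than one character
    long_words = sum(1 for w in words if len(w) > 1)
    if long_words * 2 <= len(words):
        return "too_many_single_char_words"
    # rule: the first word may not be a number
    if NUMBER_LIKE.match(words[0]):
        return "starts_with_number"
    # rule: no ( ) or \ characters
    if any(ch in sentence for ch in "()\\"):
        return "forbidden_character"
    # rule: fewer than 3 commas
    if sentence.count(",") >= 3:
        return "too_many_commas"
    # rule: english only (checked only if a detector callable is supplied)
    if is_english is not None and not is_english(sentence):
        return "not_english"
    return None


def filter_sentences(sentences, is_english=None):
    """Split sentences into kept ones and a per-rule count of dropped ones."""
    kept, dropped = [], Counter()
    for sentence in sentences:
        reason = first_failed_rule(sentence, is_english)
        if reason is None:
            kept.append(sentence)
        else:
            dropped[reason] += 1
    return kept, dropped
```

returning the label of the first failed rule, instead of a plain true/false, makes it easy to count how many sentences each rule drops.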

the conversion can be started with this module: https://github.com/mrzl/ECO/blob/master/src/python/pdf2text/batch_postprocess_text.py

# example of how to run it
workon eco_pdf2text
pip install -r requirements.txt
cd /home/marcel/projects/eco/src/python/pdf2text
python batch_postprocess_text.py --input_path /home/marcel/drive/data/eco/NAIL_DATAFIELD/arts_arthistory_aesthetics/ --output_path /home/marcel/drive/data/eco/NAIL_DATAFIELD_txt/arts_arthistory_aesthetics/

this prints some rough info to the terminal. more detailed statistics are saved to:

/home/marcel/projects/eco/src/python/pdf2text/arts_arthistory_aesthetics_statistics.txt

these statistics can be visualized with the module https://github.com/mrzl/ECO/blob/master/src/python/pdf2text/visualize_statistics.py:

python visualize_statistics.py --statistics_path arts_arthistory_aesthetics_statistics.txt

this gives some insight into how many sentences were successfully parsed and how many were dropped because of each of the rules above. here's an example from our pdf collection (135 pdfs):

(image: statistics)
