# Interesting sources

- https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7
- See also: https://superuser.com/questions/92615/cannot-copy-non-latin-characters-from-pdf-document


# Key takeaways

Reading [the PDF 1.3 guidelines](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.3.pdf), it looks like PDF only supports 8-bit encodings (i.e. 256 characters), which can, however, be modified by custom tables mapping codes to glyphs. Also note that each font can have its own encoding, which allows for great flexibility (at the price of high complexity). In our case, the encodings are custom (see Acrobat > File > Properties > Fonts).
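
To make the consequence of such custom tables concrete, here is a minimal sketch of the mechanism. All tables below are invented for illustration and are not taken from our PDF; the override table only mimics the idea of the `/Differences` array of a PDF font encoding. It shows how the same bytes can render as Greek while a naive extractor, which ignores the font-specific table, returns Latin letters.

```python
# Hypothetical tables for illustration only; none of them come from the actual PDF.
BASE_ENCODING = {65: "A", 66: "B", 67: "C"}  # tiny excerpt of a standard 8-bit encoding

# Font-specific overrides: the same codes now point to other glyph names.
DIFFERENCES = {65: "alpha", 66: "beta"}

# Glyph name -> Unicode. An extractor can only use this if the glyph names are meaningful.
GLYPH_TO_UNICODE = {"A": "A", "B": "B", "C": "C", "alpha": "α", "beta": "β"}


def decode(codes: bytes, use_differences: bool) -> str:
    """Map 8-bit codes to text, optionally applying the font-specific overrides."""
    table = dict(BASE_ENCODING)
    if use_differences:
        table.update(DIFFERENCES)
    # Unknown codes or glyph names become U+FFFD (replacement character).
    return "".join(GLYPH_TO_UNICODE.get(table.get(c, ""), "\ufffd") for c in codes)


raw = bytes([65, 66, 67])
rendered = decode(raw, use_differences=True)   # what the viewer displays
naive = decode(raw, use_differences=False)     # what a naive extractor returns
print(rendered, naive)
```

If the custom glyph names are themselves arbitrary, even an extractor that reads the override table has nothing reliable to map them to, which may be what happens with our file.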

I think that Acrobat should be the best tool to use here, as the guidelines are defined by Adobe themselves. However, I can't explain why Acrobat succeeds in rendering the file but fails at exporting plain text.

Note: a PDF should be _tagged_ (i.e. have an explicit conversion map) in order to be easily exported. This is not the case with ours.
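
A rough way to check this from Python is to scan the raw bytes for two markers: `/ToUnicode` (the key under which a font references its code-to-Unicode map) and `/Marked true` (the flag a tagged PDF sets in its `/MarkInfo` dictionary). This is only a sketch: `scan_pdf_markers` is a hypothetical helper, it does not parse the PDF, and in files that store their dictionaries inside compressed object streams a negative result is not conclusive.

```python
import re


def scan_pdf_markers(data: bytes) -> dict:
    """Crude byte-level heuristic; this is not a real PDF parse."""
    return {
        # Number of references to a code -> Unicode map.
        "to_unicode_refs": len(re.findall(rb"/ToUnicode\b", data)),
        # True if the file declares itself as tagged.
        "declares_tagged": re.search(rb"/Marked\s+true", data) is not None,
    }


# Synthetic fragments standing in for real files (assumption: illustrative only).
tagged_sample = (
    b"1 0 obj << /Type /Catalog /MarkInfo << /Marked true >> >> endobj "
    b"2 0 obj << /Type /Font /ToUnicode 3 0 R >> endobj"
)
bare_sample = b"1 0 obj << /Type /Font /Encoding 5 0 R >> endobj"

tagged_result = scan_pdf_markers(tagged_sample)
bare_result = scan_pdf_markers(bare_sample)
print(tagged_result, bare_result)
```

On a real file, the bytes would come from `open(path, 'rb').read()`.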


# Try with Adobe Acrobat

## I first tried to export **plain text**:
- export UTF-8 --> fails
- export Latin-1 --> fails
- export UTF-16 --> fails
- export UCS-4 --> fails
- In all the cases above, Acrobat crashes before export.

## I then tried to export to various other formats (.rtf, .xml ...) with the same encodings:

- All failed to render Greek characters.

# Try with Python libraries

## Using pdfplumber: ❌ fails on Greek chars

In [None]:
path = "/Users/sven/drive/_AJAX/AjaxMultiCommentary/data/commentaries/commentaries_data/Finglass2011/images/pdf/Finglass2011.pdf"
path_test = '/Users/sven/Desktop/Finglass2011_p24.pdf'

In [None]:
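import pdfplumber

pages = []  # extracted text, one string per page
words = []  # word-level dicts (text and bounding box), one list per page
with pdfplumber.open(path) as pdf:
    # Only page index 24 (0-based) is processed, as a quick test.
    for i in range(24, 25):
        print(i)
        pages.append(pdf.pages[i].extract_text())
        words.append(pdf.pages[i].extract_words())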
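
To judge "fails on Greek chars" without reading every page by eye, one option is to measure the share of Greek letters in the extracted text. The helper below is a hypothetical sketch (`greek_ratio` is not part of any of the libraries tried here); it relies only on the Unicode character names from the standard library, which also cover polytonic Greek.

```python
import unicodedata


def greek_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters whose Unicode name is Greek."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    greek = [ch for ch in letters if "GREEK" in unicodedata.name(ch, "")]
    return len(greek) / len(letters)


print(greek_ratio("Αἴας"))   # Greek word
print(greek_ratio("Ajax"))   # Latin word
print(greek_ratio("ab αβ"))  # mixed
```

A ratio close to 0 for `pages[0]`, on a page known to contain Greek, would confirm that the extractor returned no usable Greek text.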

## Using pdfminer.six: ❌ fails on Greek chars

In [2]:
from pdfminer.high_level import extract_text
text = extract_text(path)



In [6]:
text.encode('utf-8')



## Using Apache Tika: ❌ fails on Greek chars

In [1]:
from tika import parser
path = '/Users/sven/drive/_AJAX/AjaxMultiCommentary/data/commentaries/commentaries_data/Finglass2011/images/pdf/Finglass2011.pdf'
parsed_pdf = parser.from_file(path)

parsed_pdf['content']


2022-10-07 18:24:44,516 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /var/folders/tg/_6zh_tz94ddb53tskzx6vdhw0000gn/T/tika-server.jar.
2022-10-07 18:24:50,106 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /var/folders/tg/_6zh_tz94ddb53tskzx6vdhw0000gn/T/tika-server.jar.md5.
2022-10-07 18:24:50,582 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2022-10-07 18:24:55,587 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...




# Try opening the binary file directly and decoding it manually

In [None]:
with open(path_test, 'rb') as f:
    text = f.read()


# text_utf = text.decode('utf-8')  # fails: the raw bytes are not valid UTF-8
text_utf = text.decode('latin-1')  # always succeeds (Latin-1 maps every byte to a character), but shows no Greek chars (unsurprisingly).
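
One likely reason the raw decode shows so little is that PDF content streams are usually compressed (commonly with the `FlateDecode` filter, i.e. zlib), so the page text is not visible in the file bytes at all. The sketch below inflates whatever streams it can find. It is a rough heuristic, not a PDF parser: `inflate_streams` is a hypothetical helper, the regular expression ignores the declared stream lengths and filters, and the sample bytes are synthetic.

```python
import re
import zlib

# Everything between "stream" and "endstream"; trailing end-of-line bytes are
# tolerated because decompressobj() ignores data after the compressed payload.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(data: bytes) -> list:
    """Return the zlib-inflated content of every stream that can be inflated."""
    inflated = []
    for match in STREAM_RE.finditer(data):
        try:
            inflated.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # stream is uncompressed or uses another filter
    return inflated


# Synthetic example: a single compressed content stream showing "(Hello)".
payload = b"BT /F1 12 Tf (Hello) Tj ET"
fake_pdf = (
    b"4 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(payload)
    + b"\nendstream\nendobj\n"
)
streams = inflate_streams(fake_pdf)
print(streams)
```

Even after inflation, the strings passed to the text-showing operators are font-specific codes rather than Unicode, so for our file this step alone would presumably still not recover the Greek text.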



# Conclusion:

I see two possibilities:

1. Asking the publisher / the author directly for the source text as a courtesy to the project.
2. OCRing the text, possibly with automatically generated ground truth.
