Selected searchable text in PDF displays as just boxes in Evince PDF reader #8

Closed
ElectricRCAircraftGuy opened this issue Aug 18, 2020 · 14 comments


@ElectricRCAircraftGuy
Owner

ElectricRCAircraftGuy commented Aug 18, 2020

Continued from here: #7 (comment).

@ElectricRCAircraftGuy
Owner Author

@michaelsjackson, I don't see the problem at all on any converted PDF, and I've been using them for a long time (months? years?) regularly. I've never seen the problem you show once. I think it's an Evince problem. Try some other PDF readers on Linux and see what happens. My preferred PDF reader, hands-down, by far, is Foxit Reader. It works on Windows, Mac and Linux, but is NOT free software. It is no cost, however, for the basic version, which is the best PDF reader for Linux I've ever seen. Here's a screenshot:

image

In the screenshot above, I've converted this PDF (https://homepages.cwi.nl/~storm/teaching/reader/Dijkstra68.pdf) to be searchable, using my tool. I then used Foxit Reader to underline in red and highlight in yellow on the left, and I saved the PDF. The highlighted text on the right is just text I have selected. As you can see, it works and looks fine.

Note: the screenshot was taken with Shutter, which is the best screenshot tool by far in Linux, in my opinion. Install in Ubuntu 18.04 with sudo apt install shutter, then do this. Or, for Ubuntu 20.04.

I guess using the hOCR option in tesseract might help.

I think this only allows you to save a metafile with text outside the PDF, so I don't think it applies in this situation. I think the problem is just with your PDF reader. Try something else, like Foxit Reader.

@ElectricRCAircraftGuy
Owner Author

ElectricRCAircraftGuy commented Aug 18, 2020

Update: in the "Document Viewer" (Evince), which comes with Ubuntu 18.04, I see what you're seeing now:

image

I still think this is a bug in their software. Please file a bug report with them and post a link to it here and I can go upvote it or whatever too. Linux PDF viewers seem to be obsolete and don't work well and I don't like them much, with the exception of Foxit Reader.

@ElectricRCAircraftGuy
Owner Author

ElectricRCAircraftGuy commented Aug 18, 2020

@michaelsjackson, I just filed a bug report on Evince's gitlab page, here: https://gitlab.gnome.org/GNOME/evince/-/issues/1478.
Please go there and upvote it to get it some attention. 👍

@michaelsjackson

Thanks for posting. I hit the up button there; I guess that is what you meant by upvoting. I do not normally care much about Evince, as I do not work with it; it only got my attention in this situation. My working tool for marking up PDF documents is Xournal, and sometimes also Xournal++, which has a few more features in case I want those. The file formats are compatible anyway, so nothing is lost.

What do you like mostly in foxit reader?

@ElectricRCAircraftGuy
Owner Author

ElectricRCAircraftGuy commented Aug 18, 2020

What do you like mostly in foxit reader?

It allows marking up the PDF: underlining, crossing out, highlighting, making notes, etc. This was one of my holdups in moving to Linux for years (I only made the permanent switch from Windows 2 yrs ago), and it is what allowed me to fiiiiinally quit printing hundreds of pages of paper just so I could take notes, as now I can do it digitally, which is soooo much better!

If Xournal can do that too I'll take a look, but Foxit Reader is the only PDF tool I've found thus far that can do PDF markup in Linux, and it also happens to be no-cost.

@michaelsjackson

I am using shutter as well, cool example with Dijkstra.

@michaelsjackson

Yeah, you should definitely check out Xournal and Xournal++, but start with Xournal, as it is the original, more stable, and faster one. Then you will throw Foxit out of the window. 👍

I believe Xournal is developed by a maths professor, so he uses it himself as well. It uses a subset of SVG for its format. There are many interesting tools for it as well; from the command line, for example, you can generate new PDFs that include your marks. Of course, you can later edit the source file again and re-export. It simply loads the PDF as a background, and you start painting on it, as in Photoshop or similar, only optimized for handwriting and saved in an SVG-like vector format. For lecture scenarios it is just perfect.

original xournal format
.xoj files are gzip compressed xml files

rename to .gz
gunzip name.gz

.xopp (from xournal++), same technique
rename to .gz
gunzip name.gz
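
For anyone who would rather inspect these files from a script than rename and gunzip them by hand, here is a minimal Python sketch of the same idea. It assumes only what is stated above, namely that .xoj and .xopp files are gzip-compressed XML. The function names and the file name in the usage note are made up for illustration and are not part of Xournal or Xournal++.

```python
import gzip
import xml.etree.ElementTree as ET


def read_journal_xml(path):
    """Return the decompressed XML bytes of a .xoj or .xopp file.

    Assumes the file is gzip-compressed XML, as described above.
    No renaming to .gz is needed: gzip.open does not care about the extension.
    """
    with gzip.open(path, "rb") as f:
        return f.read()


def root_tag(path):
    """Parse the decompressed XML and return the name of its root element."""
    return ET.fromstring(read_journal_xml(path)).tag
```

Usage would look something like `print(root_tag("notes.xoj"))`, where `notes.xoj` is a placeholder for one of your own files. To get a readable copy on disk, the bytes returned by `read_journal_xml` can be written to a `.xml` file instead.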

@michaelsjackson

michaelsjackson commented Aug 18, 2020

Which hardware are you using for digital writing? In that case, my earlier example with the graphics tablet driver was a pretty good fit. I am currently using an XP-Pen Deco 03, but its Linux driver does not yet allow freely configuring the rotary wheel at the top left, which is the reason I bought this device in the first place. I have already contacted the driver's developer and am still waiting for the Linux driver update, though I am not sure when such a feature will appear. I also bought it for its wireless operation, which is interesting for lecture scenarios too.

@ElectricRCAircraftGuy
Owner Author

Which hardware are you using for digital writing?

A keyboard and mouse. :) I type into the PDF to take notes. Thanks for the info on Xournal and Xournal++. I'll check them out.

@ElectricRCAircraftGuy
Owner Author

This is off-topic, but side note: I don't want to misrepresent the value of goto here by citing Dijkstra's paper out of context. I use goto all the time, under certain, well-defined error-handling cases. See my answer here, including my links at the end: https://stackoverflow.com/a/54488289/4561887.

@michaelsjackson

michaelsjackson commented Aug 18, 2020

In case you want to switch over to handwriting, which is more powerful because you can draw whatever you want, add drawings, and so on, you at least know my choice now. At first I thought a tablet with a screen would be better, but later I realized the opposite is true. Devices without a screen, like this one, are cheaper, but forget the price: I see it simply as a replacement for a regular mouse. They just work and need only a USB plug, like a mouse. You will never have problems with projectors and multiple different resolutions. It is a cheap, problem-free solution that just does its job.

@michaelsjackson

michaelsjackson commented Aug 18, 2020

Here is the Xournal developer's website, if you want to check: http://people.math.harvard.edu/~auroux/ and here you can see how he is using it. This is just one example; there are more under lecture notes: http://people.math.harvard.edu/~auroux/papers/slides-curvemirrors-zoominar-may2020.pdf

@ElectricRCAircraftGuy
Owner Author

@michaelsjackson, update: the problem lies even further upstream than Evince, in a package called Poppler. Please upvote the upstream Poppler issue to get it some attention from the Poppler team: https://gitlab.freedesktop.org/poppler/poppler/-/issues/157. Thanks.

@ElectricRCAircraftGuy
Owner Author

Closing this issue since the problem lies with upstream dependencies, not with pdf2searchablepdf.
