OCRmyPDF with docker image #49

Closed
l00v3 opened this issue Feb 2, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

l00v3 commented Feb 2, 2021

Hello,
as I was having a lot of issues installing all the dependencies on our system, I tried providing OCRmyPDF through a docker image. Can you please check whether this is done the right way and whether it's compatible with workflow_ocr? Things are working, but it's not fully tested. Maybe this can help some people not to worry about so many dependencies. Thank you very much for the amazing work!

Providing OCRmyPDF with docker

/opt/ocrmypdf/dockerfile

FROM jbarlow83/ocrmypdf
RUN apt-get update && apt-get install -y tesseract-ocr-yourlang

Build docker image

docker build .

/usr/bin/ocrmypdf

#!/bin/bash
image_id=447214babbb4_insert_your_image_id
docker run --rm --user "$(id -u):$(id -g)" --workdir /tmp -v "/tmp:/tmp" -i "$image_id" -l eng+yourlang "$@"

Chmod

chmod +x /usr/bin/ocrmypdf

On our system I had to add the apache user to the docker group

usermod -aG docker apache
Contributor

R0Wi commented Feb 2, 2021

Hi @l00v3, sounds like an interesting idea. In general it is of course possible to replace the ocrmypdf command line binary with a docker call, as long as the container can read the input from stdin and write the result to stdout (the app relies on this).
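One way to check the stdin/stdout requirement before wiring the wrapper into the app might look like the sketch below. It only tests whether whatever arrives on stdin starts with the PDF magic bytes. The helper name looks_like_pdf and the file input.pdf are hypothetical, and this is a rough smoke test rather than a real validation of the OCR result.

```shell
#!/bin/bash
# Rough smoke test (hypothetical helper, not part of workflow_ocr):
# succeeds if the data arriving on stdin starts with the PDF magic "%PDF-".
looks_like_pdf() {
    # Read the first five bytes, then drain the rest so the producer
    # is not cut off by a closed pipe.
    magic=$(head -c 5)
    cat > /dev/null
    [ "$magic" = "%PDF-" ]
}

# Intended use with the docker-backed wrapper (input.pdf is a placeholder):
#   ocrmypdf --redo-ocr -q - - < input.pdf | looks_like_pdf && echo "streaming works"
```

If the check fails, the wrapper is most likely missing -i on docker run, or something is printing extra text to stdout.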
Regarding your recipe, I'd give you the following feedback:

  1. I'd suggest tagging your custom image with docker build -t jbarlow83/ocrmypdf-custom . You could then run the command as docker run ... jbarlow83/ocrmypdf-custom "$@" and you wouldn't have to keep track of your image id.
  2. The command issued by the app mainly runs ocrmypdf --redo-ocr -q - -. As you can see, no language information (-l eng+yourlang) is added at the moment. This could become a feature in the future, though, and might then conflict with your docker command if the parameter is added twice (once by your script and once by the app itself).
  3. Depending on the system you're running, the user executing the webserver process might not be apache. On Debian systems it is most likely www-data. Either way, this user has to be added to the docker group to run docker containers.
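Points 1 and 2 could be combined in the wrapper roughly as follows. This is a minimal sketch under a few assumptions, not a tested setup: the image is assumed to have been built with the tag from point 1, and the names OCR_LANGS, has_language_option and run_ocrmypdf are made up for illustration. The language option is only added when the caller has not already passed one, so a future app version that sets the language itself would not end up with the parameter twice.

```shell
#!/bin/bash
# Sketch of /usr/bin/ocrmypdf using a tagged image instead of an image id.
# Assumption: the image was built with
#   docker build -t jbarlow83/ocrmypdf-custom .
# OCR_LANGS, has_language_option and run_ocrmypdf are hypothetical names.

IMAGE="jbarlow83/ocrmypdf-custom"
OCR_LANGS="${OCR_LANGS:-eng+yourlang}"

# Succeeds if any argument is a language option (-l, -lXXX, --language[=XXX]).
has_language_option() {
    for _arg in "$@"; do
        case "$_arg" in
            -l|-l?*|--language|--language=*) return 0 ;;
        esac
    done
    return 1
}

# Run ocrmypdf inside the container. -i keeps stdin open so the app can
# stream the PDF in; the result is written to stdout.
run_ocrmypdf() {
    if ! has_language_option "$@"; then
        # Caller did not choose a language, so fall back to our default.
        set -- -l "$OCR_LANGS" "$@"
    fi
    docker run --rm --user "$(id -u):$(id -g)" --workdir /tmp -v "/tmp:/tmp" \
        -i "$IMAGE" "$@"
}
```

In the actual wrapper script the last line would then simply be run_ocrmypdf "$@".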

From a security point of view, point 3 could be quite dangerous: an attacker who takes over your webserver would then be able to control your complete docker ecosystem via the docker daemon. This is something I'd really like to avoid, even though I find your idea very interesting. Maybe we have some ideas to get this under control, @bahnwaerter? Then we could of course offer an alternative installation method.
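To see whether a given webserver user already has this level of access, a quick check like the following might help. It is only a sketch: the helper name user_in_group is hypothetical, and the user to check (apache, www-data, ...) depends on the system.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if user $1 is a member of group $2.
user_in_group() {
    # id -nG prints the user's group names separated by spaces;
    # put them on separate lines and look for an exact match.
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -Fqx -- "$2"
}

# Example (user name depends on the distribution):
#   user_in_group www-data docker && echo "www-data can talk to the docker daemon"
```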

R0Wi added the enhancement label Feb 2, 2021
Contributor

R0Wi commented Feb 3, 2021

Moved this to #51

R0Wi closed this as completed Feb 3, 2021