[Bug]: real text replaced by � � (visually unchanged, only by copying) #1297

Open
JoKalliauer opened this issue Apr 24, 2024 · 0 comments

JoKalliauer commented Apr 24, 2024

Describe the bug

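If I run ocrmypdf with --skip-text on some PDF files that already contain "real text", then the existing text gets replaced by ��. The pages look unchanged; the problem only shows up when copying the text out of the output file.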

I run ocrmypdf on files with real text because my actual document, which contains sensitive information and is similar to this one, also contains screenshots.

Steps to reproduce

  1. wget https://www.fcp.at/sites/default/files/2019-08/abstract_diplomarbeit_moschen.pdf
  2. ocrmypdf -j 1 --optimize 01 -l deu+eng abstract_diplomarbeit_moschen.pdf output.pdf --skip-text -v1
  3. Open output.pdf
  4. Copy text into any text-application (notepad++/editor/writer/libre office/...)
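
Step 4 can also be checked without a viewer. The following is only a rough sketch, not part of OCRmyPDF: it assumes Poppler's `pdftotext` is installed, and the helper names `replacement_ratio` and `extract_text` are made up for illustration. It reports how much of the extracted text consists of U+FFFD (the "�" character). Whether `pdftotext` yields exactly the same characters as copy and paste from a PDF viewer is an assumption I have not verified.

```python
import subprocess
import sys

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER, displayed as "�"


def replacement_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT_CHAR for ch in visible) / len(visible)


def extract_text(pdf_path: str) -> str:
    """Extract the text layer with Poppler's pdftotext (assumed to be on PATH)."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],  # "-" writes to stdout
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def main(argv: list[str]) -> None:
    # Print one line per PDF given on the command line.
    for path in argv:
        ratio = replacement_ratio(extract_text(path))
        print(f"{path}: {ratio:.1%} of extracted characters are U+FFFD")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

Saved under a hypothetical name such as `check_fffd.py`, it could be run as `python3 check_fffd.py abstract_diplomarbeit_moschen.pdf output.pdf` to compare the input with the output. A ratio near 0% for the input and a high ratio for the output would match the behaviour described above.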

Files

abstract_diplomarbeit_moschen.pdf

output.pdf

How did you download and install the software?

Linux package manager (apt, dnf, etc.)

OCRmyPDF version

16.1.1

Relevant log output

log output
ocrmypdf 16.1.1                                                                                                                                                               __main__.py:59
Running: ['tesseract', '--version']                                                                                                                                          __init__.py:133
Found tesseract 5.3.4.post44                                                                                                                                                 __init__.py:342
Running: ['tesseract', '--version']                                                                                                                                          __init__.py:133
Running: ['gs', '--version']                                                                                                                                                 __init__.py:133
Found gs 10.2.1                                                                                                                                                              __init__.py:342
Running: ['gs', '--version']                                                                                                                                                 __init__.py:133
Running: ['tesseract', '--list-langs']                                                                                                                                       __init__.py:133
stdout/stderr = List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (3):                                                                                    __init__.py:73
deu
eng
osd

pikepdf mmap enabled                                                                                                                                                          helpers.py:326
os.symlink(abstract_diplomarbeit_moschen_ink.pdf, /tmp/ocrmypdf.io.esbdwxy5/origin)                                                                                           helpers.py:179
os.symlink(/tmp/ocrmypdf.io.esbdwxy5/origin, /tmp/ocrmypdf.io.esbdwxy5/origin.pdf)                                                                                            helpers.py:179
Gathering info with 1 thread workers                                                                                                                                             info.py:772
pikepdf mmap enabled                                                                                                                                                          helpers.py:326
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Using Tesseract OpenMP thread limit 1                                                                                                                                   tesseract_ocr.py:183
pikepdf mmap enabled                                                                                                                                                          helpers.py:326
    1 skipping all processing on this page                                                                                                                                  _pipeline.py:319
    2 skipping all processing on this page                                                                                                                                  _pipeline.py:319
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                                                                         _graft.py:140
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                                                                                     _graft.py:165
    2 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                                                                         _graft.py:140
    2 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                                                                                     _graft.py:165
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Postprocessing...                                                                                                                                                                 ocr.py:146
os.symlink(/tmp/ocrmypdf.io.esbdwxy5/graft_layers.pdf, /tmp/ocrmypdf.io.esbdwxy5/fix_docinfo.pdf)                                                                             helpers.py:179
Running: ['gs', '--version']                                                                                                                                                 __init__.py:133
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None',                                               __init__.py:133
'-sColorConversionStrategy=LeaveColorUnchanged', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2',
'-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.esbdwxy5/fix_docinfo.pdf', '/tmp/ocrmypdf.io.esbdwxy5/pdfa.ps']
GPL Ghostscript 10.02.1 (2023-11-01)                                                                                                                                         __init__.py:108
Copyright (C) 2023 Artifex Software, Inc.  All rights reserved.                                                                                                              __init__.py:108
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:                                                                                                   __init__.py:108
see the file COPYING for details.                                                                                                                                            __init__.py:108
Processing pages 1 through 2.                                                                                                                                                __init__.py:108
Page 1                                                                                                                                                                       __init__.py:108
Page 2                                                                                                                                                                       __init__.py:108
Running: ['tesseract', '--version']                                                                                                                                          __init__.py:133
Optimizable images: JPEGs: 0 PNGs: 0                                                                                                                                         optimize.py:349
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Optimizable images: JBIG2 groups: 0                                                                                                                                          optimize.py:360
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
os.symlink(/tmp/ocrmypdf.io.esbdwxy5/optimize.opt.pdf, /tmp/ocrmypdf.io.esbdwxy5/optimize.pdf)                                                                                helpers.py:179
Running: ['jbig2', '--version']                                                                                                                                              __init__.py:133
Running: ['pngquant', '--version']                                                                                                                                           __init__.py:133
Image optimization ratio: 1.00 savings: 0.0%                                                                                                                                _pipeline.py:976
Total file size ratio: 0.73 savings: -37.0%

JoKalliauer changed the title from "[Bug]: real text replaced by � �" to "[Bug]: real text replaced by � � (visually unchanged, only by copying)" on Apr 29, 2024