genPathname corrupts filenames with backslashes in them via convertStepCharsInPath #558

Closed
MerlijnWajer opened this issue Dec 10, 2020 · 9 comments

Comments

@MerlijnWajer

Noticed as a problem in Tesseract initially: tesseract-ocr/tesseract#3178

It looks like genPathname does not like backslashes in filenames on UNIX, even though this is valid:

$ wc -c /tmp/test\\.jp2
455359 /tmp/test\.jp2
merlijn@gentoo-x230 /tmp $ tesseract /tmp/test\\.jp2 -
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
Error during processing.
Breakpoint 1, 0x00007ffff7aa53b0 in fopenReadStream () from /usr/lib64/liblept.so.5
(gdb) print (char*)$rdi
$9 = 0x7fffffffd094 "/tmp/example/427527-\\nagripracharni Patrika Year 60 Vol 2 Ac 2610_0000.jp2"
(gdb) c
Continuing.
Breakpoint 4, 0x00007ffff7aa50b0 in genPathname () from /usr/lib64/liblept.so.5
(gdb) print (char*)$rdi
$10 = 0x7fffffffd094 "/tmp/example/427527-\\nagripracharni Patrika Year 60 Vol 2 Ac 2610_0000.jp2"
(gdb) step
Single stepping until exit from function genPathname,
which has no line number information.
(gdb) print (char*)$rax
$19 = 0x5555555b0570 "/tmp/example/427527-/nagripracharni Patrika Year 60 Vol 2 Ac 2610_0000.jp2"

It looks like convertStepCharsInPath is causing this problem: it converts '\\' to / even on UNIX, which, as far as I can tell, is not what it should be doing.

@stweil
Collaborator

stweil commented Dec 10, 2020

Yes, I can confirm this problem. The replacement should not be done on Linux, macOS, and other systems that use only / as the path separator.
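The behaviour described so far can be sketched in a few lines of Python. This is only an illustration of the idea, not Leptonica's code: the function names and the host_is_windows parameter are hypothetical, and the real library is written in C.

```python
def normalize_unconditionally(path: str) -> str:
    # Models the reported behaviour: every backslash becomes a forward
    # slash, regardless of the host platform.
    return path.replace('\\', '/')


def normalize_separators(path: str, host_is_windows: bool) -> str:
    # Hypothetical platform-aware variant: only treat backslash as a
    # separator on Windows. Elsewhere it is an ordinary filename character.
    if host_is_windows:
        return path.replace('\\', '/')
    return path


unix_name = '/tmp/test\\.jp2'  # one real backslash in the file name
print(normalize_unconditionally(unix_name))    # the name is altered
print(normalize_separators(unix_name, False))  # the name is preserved
print(normalize_separators('C:\\tmp\\a.jp2', True))
```

With the unconditional variant the file name test\.jp2 turns into a directory test containing .jp2, which matches the "file not found" error in the report.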

@DanBloomberg
Owner

Question: are there valid path names in unix that have these double back-slashes? How would such a path, as you show above, be interpreted by the file system?

Also, are there valid path names in unix with a single back-slash?

@DanBloomberg
Owner

And finally, is that a valid path name in Windows?

@MerlijnWajer
Author

MerlijnWajer commented Dec 10, 2020

Question: are there valid path names in unix that have these double back-slashes? How would such a path, as you show above, be interpreted by the file system?

Yes, those exist. Every backslash just needs to be escaped with another backslash in the shell, but the string itself (in bytes) contains only one backslash for each escaped pair.

merlijn@gentoo-x230 ~ $ touch /tmp/test\\\\test2\\\\\\\\
merlijn@gentoo-x230 ~ $ ls -lsh /tmp/test\\\\test2\\\\\\\\
0 -rw-r--r-- 1 merlijn merlijn 0 Dec 10 23:18 '/tmp/test\\test2\\\\'

Quick test in Python (every backslash is also escaped with a backslash):

>>> s = '/tmp/test\\\\test2\\\\\\\\'
>>> s.count('\\')
6
>>> open(s, 'rb').read()
b''

The s.count('\\') call shows that the path contains six backslashes in total (the '\\' literal is a single backslash): two after test and four after test2. The open(s, 'rb') call shows that the open succeeds.
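To separate the escaping layer from the actual string contents, here is a small sketch that spells the same path three ways. The variable names are only for illustration.

```python
# The same path written three ways. Only the spelling in source code
# differs; the resulting strings are identical.
escaped = '/tmp/test\\\\test2\\\\\\\\'   # normal literal, each backslash doubled
raw = r'/tmp/test\\test2\\\\'            # raw literal, backslashes written as-is
built = '/tmp/test' + chr(92) * 2 + 'test2' + chr(92) * 4  # chr(92) is a backslash

print(escaped == raw == built)
print(escaped.count('\\'))  # 6
print(escaped)              # the path as the file system would see it
```

The doubled backslashes exist only in the source text of the literal. The string handed to open() holds two backslashes after test and four after test2.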

Also, are there valid path names in unix with a single back-slash?

Yes. The example I showed actually had just a single backslash in the actual file name; it's just that I (and gdb) had to escape the backslashes for the string literals and in the shell itself.

EDIT: I think that \ has no special meaning in a filename on UNIX, and it's just treated as any other byte. You can even use newlines in UNIX filenames, as far as I know.
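A quick way to check this is to create such files in a scratch directory and list them back. The sketch below assumes a POSIX host whose file system accepts arbitrary bytes other than / and NUL in names (typical on Linux); the roundtrip helper is a made-up name for this example.

```python
import os
import tempfile


def roundtrip(name: str) -> bool:
    """Create an empty file called `name` in a fresh temporary directory
    and report whether the directory listing returns the name unchanged."""
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, name), 'wb'):
            pass
        return os.listdir(d) == [name]


# Only meaningful where / is the sole separator; skipped elsewhere.
if os.sep == '/':
    for name in ['test\\.jp2', 'a\\\\b', 'line\nbreak']:
        print(repr(name), roundtrip(name))
```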

@MerlijnWajer
Author

And finally, is that a valid path name in Windows?

My Windows-fu isn't that strong, but I know that neither / nor \ is allowed in file names, while both can be part of a path in Windows. Whether a forward slash actually works outside of Windows APIs is something I do not know.
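One way to see the difference between the two conventions without a Windows machine is Python's pathlib, whose pure path classes parse strings by each platform's rules on any host. This only models path parsing; it does not show what any particular Windows API accepts.

```python
from pathlib import PurePosixPath, PureWindowsPath

p = '/tmp/test\\.jp2'  # contains exactly one backslash

# POSIX rules: the backslash is part of the final file name.
posix_parts = PurePosixPath(p).parts
print(posix_parts)

# Windows rules: the backslash is a separator, so the name is split.
windows_parts = PureWindowsPath(p).parts
print(windows_parts)

# Under Windows rules both separator spellings denote the same path.
print(PureWindowsPath('C:/tmp/a.jp2') == PureWindowsPath('C:\\tmp\\a.jp2'))
```

Under POSIX rules the last component is test\.jp2, while under Windows rules the same string ends in the two components test and .jp2.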

@DanBloomberg
Owner

Thank you! From what you have said, on unix systems all backslashes must be preserved (which is what you said in your initial posting). I will fix this.

@MerlijnWajer
Author

Thank you!

@MerlijnWajer
Author

I can confirm that this solves the problem for me, by the way. I applied the patch on top of Ubuntu 20.04's liblept5 and the problem in Tesseract is gone. So I guess the issue can be closed?

@DanBloomberg
Owner

Thank you for confirming we're ok.
