Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter files that have non utf-8 characters in their filenames #2626

Merged
merged 2 commits into from Mar 3, 2023

Conversation

johrstrom
Copy link
Contributor

@johrstrom johrstrom commented Mar 1, 2023

Fixes #2624 by hiding non ascii filenames.

We may be able to show them at some point, but will never be able to edit, download or update them. We cannot build routes with non utf-8 characters. We may be able to map them to say question marks instead, but if you it'd be nearly impossible to know which file to choose if they're all ????.

In any case, this'll work in the interem and I can create a follow up ticket for CRUDing them.

┆Issue is synchronized with this Asana task by Unito

Copy link
Contributor

@CSC-swesters CSC-swesters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ascii-only is way too exclusive, it will cause trouble for many languages. I'll suggest an alternative solution soon.

@@ -85,6 +85,10 @@ def directory?
def ls
path.each_child.map do |child_path|
PosixFile.stat(child_path)
end.select do |stats|
ascii_only = stats[:name].to_s.ascii_only?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ascii_only? puts a lot of UTF-8 characters in the wrong bin, I'm afraid. I tried this code on our sandbox instance, and for example these strings get ignored in error (i.e., they should be allowed):

App 22009 output: [2023-03-02 12:06:19 +0200 ]  WARN "Not showing file 'これは動作しません' because it is not a UTF-8 filename."
App 22009 output: [2023-03-02 12:06:19 +0200 ]  WARN "Not showing file 'тест' because it is not a UTF-8 filename."

@CSC-swesters
Copy link
Contributor

I'm not very knowledgeable in Ruby, but I managed to find a way to filter out only the file names that are invalid UTF-8 bytes. I tested it as a little example program that lists the files in my test directory:

puts "Listing files in directory"

entries = Dir.entries("/path/to/dir/")
entries.each do |e| 
  begin
    puts "Entry: #{e}" if e.unicode_normalized?
  rescue ArgumentError
    puts "Found invalid UTF-8 bytes in: #{e}"
  end 
end

puts "Done"

This will allow the Japanese and Russian text which I used in my testing, but filter out the example names I presented in #2624

...
Found invalid UTF-8 bytes in: ����-ood-test-file
...
Entry: тест
...
Entry: これは動作しません

Unless it is rescued, the ArgumentError` looks like this:

Traceback (most recent call last):
	5: from ./ood-utf8-test.rb:17:in `<main>'
	4: from ./ood-utf8-test.rb:17:in `each'
	3: from ./ood-utf8-test.rb:18:in `block in <main>'
	2: from ./ood-utf8-test.rb:18:in `unicode_normalized?'
	1: from /usr/share/ruby/unicode_normalize/normalize.rb:151:in `normalized?'
/usr/share/ruby/unicode_normalize/normalize.rb:151:in `scan': invalid byte sequence in UTF-8 (ArgumentError)

My ruby version in this testing was: ruby 2.5.9p229 (2021-04-05 revision 67939) [x86_64-linux]

@CSC-swesters
Copy link
Contributor

Additionally, in the current solution, there is no indication to the user that there are ignored files in the directory they listed. But I suppose that could be fixed later as well. At least the skipped files are logged on the server.

@CSC-swesters
Copy link
Contributor

I got a working implementation for the Files app with this:

      # ...
    end.select do |stats|
      name_is_valid_unicode = false
      begin
        name_is_valid_unicode = stats[:name].to_s.unicode_normalized?
      rescue ArgumentError
        Rails.logger.warn("Not showing file '#{stats[:name]}' because it is not a UTF-8 filename.")
      end
      name_is_valid_unicode
    end.sort_by { |p| p[:directory] ? 0 : 1 }
    #...

@johrstrom
Copy link
Contributor Author

Thanks for the info. You're right that we should include other languages. I'll work on something else. I'd like to avoid a rescue block just for speed but if that's the only way, then so be it.

@CSC-swesters
Copy link
Contributor

Thank you.

Perhaps there are better ways, I didn't look into this any further. I can imagine that rescue isn't the most efficient, but I would personally value correctness over speed at this point.

@johrstrom
Copy link
Contributor Author

Good looking out @CSC-swesters - I think valid_encoding? will suite us here. They're all UTF-8 encoded strings for whatever reason, but they're able to recognize that they're not actually valid.

When I look at this directory
image

I get these 3 files to show.
image

Copy link
Contributor

@CSC-swesters CSC-swesters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great that you found a more performant way to test this! Tested the code on our 2.0.29 OOD version as well by patching /var/www/ood/apps/sys/dashboard/app/models/files.rb accordingly, and I see the same successful result as you.

Copy link
Contributor

@gerald-byrket gerald-byrket left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@johrstrom johrstrom merged commit 08c1928 into master Mar 3, 2023
@johrstrom johrstrom deleted the non-utf8-files branch March 3, 2023 18:00
johrstrom added a commit that referenced this pull request Mar 20, 2023
@johrstrom johrstrom mentioned this pull request Mar 20, 2023
johrstrom added a commit that referenced this pull request Mar 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Files app shows nothing if at least one file name has non-UTF-8 bytes
4 participants