Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Unicode path for projects work again #3229

Merged
merged 2 commits into from
Feb 26, 2020
Merged

Conversation

SuslikV
Copy link
Contributor

@SuslikV SuslikV commented Feb 15, 2020

Fixes: #3227

export and updates also uses json.dumps(), but it is slightly different thing. It needs more time to investigate if changes are required. In my tests it doesn't affect this issue itself.

Let's see if this fix is woks for you too.

@SuslikV
Copy link
Contributor Author

SuslikV commented Feb 15, 2020

non-English, what a mess... I should update commit messages. If this fix works, of course.

Copy link
Contributor

@ferdnyc ferdnyc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing I really don't like here is the change to the thumbnail requests.Response handling, I left a code comment explaining.

The rest, well, all those ensure_ascii=False arguments to json.dumps() make me concerned about the possible consequences, but maybe it is the right way to go. I'd have to see the effects of the change — and get an idea of the issues that the current code is causing — first. (Which I realize I can get at #3227, I just haven't gotten around to reading it yet.)

thumb_path = r.text
thumb_path = r.content.decode("utf-8", "ignore")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't say I like this change, I think it'll bite us.

requests.Response.text is documented as "Content of the response, in unicode". Which is exactly what we want.

If there's an issue with the encoding of the Unicode, we have two choices:

  1. We can have the server end explicitly specify UTF-8 in a content-type header (which the client will receive and respect)
  2. Or, on the client end, we can (as the Response documentation suggests) "set r.encoding appropriately before accessing this property."

Either option is much safer than accessing the raw bytes of the request and blindly force-decoding it.

Copy link
Contributor Author

@SuslikV SuslikV Feb 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want to modify API again...

I think that the p1. is here, (something like charset=utf-8 can be specified):

self.send_header('Content-type', 'text/html')


As for the p2. - I see no much difference in this two:

response.encoding = 'utf-8'
response.text

and the decode() code - it surely receives utf-8:

self.wfile.write(bytes(thumb_path, "utf-8"))

no need to see the header's charset.

And it can't be safer. It cannot use other technique. I just take example from the https://docs.python.org/3/howto/unicode.html and know the result, but I don't know how the response.text works when ...the input string can’t be converted according to the encoding’s rules... . Edit: Of course, I cannot imaging how it can happen if it already was send in utf-8... Broken message?

SuslikV added 2 commits February 16, 2020 12:43
Fixes an issue when application fails to load files from the saved
project with non-English characters in the path to the clips.

Also removing json.loads strict=False key as obsolete in 3.x Python.
Fixes an issue that list view miniatures of the imported clips not
rendered if path contains non-English letters.

By defult the ".text" field in response uses ISO-8859-1 charset (RFC
2616 specs) while server sends utf-8 encoded path string just few
strings below in the code.
@SuslikV
Copy link
Contributor Author

SuslikV commented Feb 16, 2020

Updated. API fix was implemented (instead of programming solution).

@SuslikV
Copy link
Contributor Author

SuslikV commented Feb 25, 2020

It seems that fix from the main developer is available now: #3247

@jonoomph
Copy link
Member

@SuslikV I just retested this PR, and it works for my unicode examples now. I suggest we merge this PR, and close my related PR (#3247).

@ferdnyc
Copy link
Contributor

ferdnyc commented Feb 26, 2020

Any fix that works! 👍

Now the only question is, what about all of the mis-encoded project files our 2.5.0 users are sitting with? Will they be readable with OpenShot after this change? If (assuming) they aren't, is it worth putting together a sort of "project file fix-up" tool to correct their formatting? (...Hopefully it's feasible to do that programmatically?)

I'd be willing to slap together a standalone Python command-line tool to correct the contents of mis-encoded files, something we could distribute to users (as well as include in the 2.5.1 release that's presumably imminent) — it's not as convenient as having OpenShot directly support reading the corrupted files, but it's simpler and probably safer to make it a separate, explicit thing.

This week has me under uncharacteristically high real-world time pressures, unfortunately, so my availability is limited, but assuming the logic is relatively straightforward and well-understood, the few hours I have free tonight should be enough time to get that together... or at least get it into a good enough state where I can submit a branch and hand it off for completion. (Because if it's not done in the next 5-6 hours, then it'll be another 12-18 before I can return to it.)

@jonoomph
Copy link
Member

@ferdnyc Based on what I know, this only affects project files with certain unicode characters, and only version 2.5.0. I'm not sure how many people are really affected at this point, with broken projects. It could be dozens, or hundreds, but I don't imagine it's a huge amount.

Now, with that being said, I would be happy to create a page on openshot.org, which could take an *.osp upload file, and pops out a "fixed" encoding version. If anyone wants to create that "Python", I can adopt it into the official website, and make a tool for people. 👍

@jonoomph
Copy link
Member

Or... if want to integrate some "charset checking" code into openshot-qt, which could detetct a JSON parse failure, and then try and fix the JSON encoding at runtime. That would also be a good solution... especially if a large # of people are affected.

@ferdnyc
Copy link
Contributor

ferdnyc commented Feb 27, 2020

Oh, no, it's gonna affect anyone with any non-ASCII characters in their pathnames — I just created this file hierarchy:

/var/tmp/Jöerç
├── Charsetᛀᛊᛇᛅ_assets
│   ├── blender
│   ├── thumbnail
│   │   ├── 0U3BZFTQMS.png
│   │   └── K6QJISME3F.png
│   └── title
├── Charsetᛀᛊᛇᛅ.osp
├── ForagerNåkəd.mp4
└── ForagerNøtNækd.mp4

Both of the MP4 files were imported into the project. Here's what came out:

$ egrep '\/u[[:xdigit:]]{4}' /var/tmp/Jöerç/Charsetᛀᛊᛇᛅ.osp 
    "path": "../J/u00f6er/u00e7/ForagerN/u00e5k/u0259d.mp4",
   "image": "../J/u00f6er/u00e7/Charset/u16c0/u16ca/u16c7/u16c5_assets/thumbnail/0U3BZFTQMS.png"
    "path": "../J/u00f6er/u00e7/ForagerN/u00f8tN/u00e6kd.mp4",
   "image": "../J/u00f6er/u00e7/Charset/u16c0/u16ca/u16c7/u16c5_assets/thumbnail/K6QJISME3F.png"
   "path": "../J/u00f6er/u00e7/ForagerN/u00e5k/u0259d.mp4",
   "image": "../J/u00f6er/u00e7/Charset/u16c0/u16ca/u16c7/u16c5_assets/thumbnail/0U3BZFTQMS.png"
   "path": "../J/u00f6er/u00e7/ForagerN/u00f8tN/u00e6kd.mp4",
   "image": "../J/u00f6er/u00e7/Charset/u16c0/u16ca/u16c7/u16c5_assets/thumbnail/K6QJISME3F.png"
     "path": "../J/u00f6er/u00e7/ForagerN/u00e5k/u0259d.mp4",
     "path": "../J/u00f6er/u00e7/ForagerN/u00f8tN/u00e6kd.mp4",
      "path": "../J/u00f6er/u00e7/ForagerN/u00e5k/u0259d.mp4",
      "path": "../J/u00f6er/u00e7/ForagerN/u00e5k/u0259d.mp4",
      "path": "../J/u00f6er/u00e7/ForagerN/u00e5k/u0259d.mp4",
      "path": "../J/u00f6er/u00e7/ForagerN/u00f8tN/u00e6kd.mp4",
       "path": "../J/u00f6er/u00e7/ForagerN/u00e5k/u0259d.mp4",
      "path": "../J/u00f6er/u00e7/ForagerN/u00f8tN/u00e6kd.mp4",

...However, as my egrep demonstrates, the damage is very easy to detect, and correct. Basically, any forward-slash followed by a u and four hex digits is a corrupted Unicode escape that should be encoded as a backslash followed by a u and four hex digits. Any other forward slashes are path separators and should be left alone. Simple enough. And then the result needs to be read in with json.loads(), to unsubstitute the Unicode escapes.

@ferdnyc
Copy link
Contributor

ferdnyc commented Feb 27, 2020

(Hrm. And I just noticed, from the above, that we appear to be storing paths relative to the parent directory of the project file location, for some reason.)

  • The "../J\u00f6er\u00e7/filename" paths should really just be "filename".
  • The "../J\u00f6er\u00e7/subdir/filename" paths should just be "subdir/filename".

Edit: Correction: Relative to the project file location, but rooted in its parent dir.

@SuslikV
Copy link
Contributor Author

SuslikV commented Feb 27, 2020

I can only add that some users has usernames that can cause problems too. Because the settings file of OpenShot uses same routines to write/read the openshot.settings where the recent files list is stored (some of these paths can lead to the user's folder, so even if user don't uses other characters but English in the current project, the OpenShot will find them...)

The other way the issue may appear is the default names templates for localization (mentioned here: #3252 )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot open properly just saved project
3 participants