Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3 redirect for file downloads: UTF8 characters in filenames cause problems #4524

Closed
landreev opened this issue Mar 19, 2018 · 5 comments
Closed
Assignees

Comments

@landreev
Copy link
Contributor

A small fix is needed for the new redirect-to-S3 mechanism. Filenames with certain binary UTF8 characters in them are causing problems. (We insert the filename into the pre-signed S3 urls, so that the users can download files under their real names, and not the generated names under which they are stored...) It's just a matter of encoding these characters somehow, or adding the encoding the name UTF8 to the header... Should be a 1 line fix.

Details:

Reported by a user, RT 259323:

Problem downloading a specific file, error message:

     <?xml version="1.0" encoding="UTF-8"?>
    <Error><Code>InvalidArgument</Code><Message>Header value cannot be represented using ISO-8859-1.</Message><ArgumentName>response-content-disposition</ArgumentName><ArgumentValue>attachment; filename=codebook-us-president-1976–2016.md</ArgumentValue><RequestId>88A532942A488F6C</RequestId><HostId>RyUWYJYxzJ87kUtKN8CWVIEml3I6IWyjFvw0QINwqiEspE6lYxX6W2PFB5+7BTApSDtp/1LPjZY=</HostId></Error> 

the filename above looks "normal" enough, but the dash between 1976 and 2016 is actually the "long dash", which is a UTF8 character, represented by 3 binary bytes: xE2 x80 x93.
Note that this filename works just fine when we put it into the "content-disposition" header in our normal download code. It is something about passing it to S3 - as a URL argument - that causes a problem. So it's just a matter of encoding it somehow.

@landreev
Copy link
Contributor Author

P.S. I solved it quickly for the user who reported the problem above, by simply replacing the "funky dash" with the regular "-" in the filemetadata.
But we still need a real patch of course.

@pameyer
Copy link
Contributor

pameyer commented Mar 19, 2018

This might be a naive question, but why is a UTF-8 encoded document trying to represent something in ISO-88-59-1?

@landreev
Copy link
Contributor Author

I don't know the answer tbh just yet.
But that may not be the most accurate way to phrase the question; we (the Dataverse app) are not really trying to represent anything as ISO-8855... we have the data in the db in UTF8, and we assume that it's ok to encode it as UTF8 - so we generate a string where this one character is represented as 3 UTF8 bytes (as above); once again, this is an entirely legal UTF8. The problem must be happening outside Dataverse - and on S3 side; where, I'm guessing, it assumes that ISO-8855 is the default encoding? In other words, S3 does not have an unambiguous way to tell if these 3 binary bytes should be treated as one UTF8 character, or 3 characters, in a single-byte-per-character encoding... something like that?

@pdurbin pdurbin self-assigned this Mar 22, 2018
@landreev landreev self-assigned this Mar 26, 2018
landreev added a commit that referenced this issue Mar 26, 2018
…content-disposition header in the pre-signed S3 download url. #4524
@landreev
Copy link
Contributor Author

I checked in a quick fix that encodes the filename, specifically as a UTF8 string:

responseHeaders.setContentDisposition("attachment; filename="+URLEncoder.encode(this.getDataFile().getDisplayName(), "UTF-8"));

It appears to be working, but I only tested with chrome. Stuff like this can behave differently in different browsers, so I'll test a little more.

@pdurbin
Copy link
Member

pdurbin commented Mar 26, 2018

I took a look at pull request #4542 and ran the code locally. It seems to work fine and makes sense. Moving to QA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants