Case file character encoding #221

Open
techgique opened this issue Mar 12, 2018 · 7 comments
@techgique
Member

Rails started throwing the error "incompatible character encodings: ASCII-8BIT and UTF-8" on some case file pages.

E.g.

  • /cases/oscys.caseid.0105
  • /cases/oscys.caseid.0337

Karin removed some text from the generated HTML file and the page worked again, but we haven't identified the offending character yet.

For now I applied a quick workaround from Stack Overflow (https://stackoverflow.com/a/9278713) in 292e018.
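For anyone trying to reproduce this outside the app, here is a minimal sketch of how this class of error arises in Ruby and one common way to work around it. The strings and variable names are hypothetical, and using force_encoding here is an assumption about the general shape of such fixes, not a description of what 292e018 actually does.

```ruby
# Bytes read in binary mode are tagged ASCII-8BIT, even when they are valid UTF-8.
case_html = "summons\u2014".b   # hypothetical fragment of a generated case file
page_text = "Caf\u00E9: "       # hypothetical UTF-8 template text with a non-ASCII character

# Joining the two raises, because both sides contain non-ASCII bytes
# and their encodings differ.
error_class = nil
begin
  case_html + page_text
rescue Encoding::CompatibilityError => e
  error_class = e.class
end

# Relabel the bytes as UTF-8 (no bytes are changed), then join again.
repaired = case_html.dup.force_encoding(Encoding::UTF_8)
combined = repaired + page_text
```

Note that force_encoding only changes the label on the string. If the bytes are not actually valid UTF-8, the error just moves somewhere else, so combined.valid_encoding? is worth checking.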

@karindalziel
Member

karindalziel commented Mar 12, 2018

I have identified the character causing the issue in oscys.caseid.0105, an em dash:

https://cdrhdev1.unl.edu/earlywashingtondc/cases/oscys.caseid.0105

That doesn't really help, since we can't remove all em dashes.

So, idea #1: is there a way to change the HTML creation script to explicitly set the encoding to UTF-8? (Maybe this would be useful, though I'm not quite sure how to implement it: https://gist.github.com/arpith20/4fcf7682a9154bc777dfcd2199edecf4)

If that does not work, idea #2 is to re-encode special characters as HTML entities, but I am hoping not to have to do that.
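A rough sketch of what idea #2 could look like, assuming the text is already a valid UTF-8 Ruby string. The helper name to_numeric_references is hypothetical and is not part of the existing scripts.

```ruby
# Replace every non-ASCII character with a decimal numeric character reference.
# Hypothetical helper; assumes `text` is a valid UTF-8 string.
def to_numeric_references(text)
  text.gsub(/[^\x00-\x7F]/) { |char| format("&#%d;", char.ord) }
end

encoded = to_numeric_references("summons\u2014 served")
```

With this input, encoded is "summons&#8212; served", which contains only ASCII bytes and so cannot trigger the encoding mismatch by itself.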

Update: I checked the XSLT file creating the HTML for oscys (scripts/tei.p5) and I think it is setting encoding correctly:

<xsl:output method="xml" indent="no" encoding="UTF-8" omit-xml-declaration="yes"/>

The XML files also correctly declare their encoding as UTF-8, though it is possible that the original file contains a non-UTF-8 encoding of the em dash.
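One way to test that last possibility is to scan the raw bytes for lines that are not valid UTF-8. This is only a sketch: the helper name invalid_utf8_lines and the sample data are made up, and the 0x97 byte stands in for a Windows-1252 em dash as one example of what a non-UTF-8 encoding could look like.

```ruby
# Return the 1-based numbers of lines whose bytes are not valid UTF-8.
# Hypothetical helper; `bytes` would normally come from File.binread(path).
def invalid_utf8_lines(bytes)
  bad = []
  bytes.b.each_line.with_index(1) do |line, number|
    bad << number unless line.force_encoding(Encoding::UTF_8).valid_encoding?
  end
  bad
end

# Line 1 holds a UTF-8 em dash (E2 80 94); line 2 holds a lone 0x97 byte.
sample = "summons\u2014 served\n".b + "summons\x97 served\n".b
suspect_lines = invalid_utf8_lines(sample)
```

Here suspect_lines is [2]. If a real case file came back with an empty list, the bytes are valid UTF-8 and the problem is more likely in how the file is read or labelled than in the file itself.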

@karindalziel
Member

A little more investigation:

I opened the file the HTML is transformed from, and the encoding of the em dash looks like this:

summons&#8212;

I believe 8212 is the HTML encoding of the em dash, but it doesn't work if I change it to &#2014; or either. So, we're back to having to try one of the ideas above.
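As a quick check on the numbers: 8212 is the decimal form of the em dash code point, and 2014 is its hexadecimal form, so a hex reference needs the x prefix (&#x2014;). A plain &#2014; is read as decimal and points at a different character. A small sketch to confirm this in Ruby:

```ruby
em_dash_decimal = 8212
em_dash_hex = 0x2014

# Both spellings name the same code point, U+2014.
same_code_point = (em_dash_decimal == em_dash_hex)
em_dash = em_dash_decimal.chr(Encoding::UTF_8)

# Decimal 2014 is a different code point (U+07DE), not the em dash.
other_char = 2014.chr(Encoding::UTF_8)
other_hex = format("U+%04X", other_char.ord)
```

So swapping &#8212; for &#2014; would change the character rather than its encoding, which may be why it did not help.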

@kacinash
Contributor

Should we just change all of them to two hyphens (--)? Any idea why it only seems to be a problem on the case files and not the documents?

@karindalziel
Member

I'm not sure if this is still an issue, but it would be good to find out.

@jduss4
Contributor

jduss4 commented Mar 27, 2019

@kacinash do you know if this is resolved, or if you have a workaround?

@kacinash
Contributor

I don't know. I'm not sure how to replicate the steps Greg followed that produced the error.

@techgique
Member Author

I think we'd have to revert the change I added in 292e018 and then review the pages that contain the suspect characters.
