Case file character encoding #221
Comments
I have identified the character causing the issue in oscys.caseid.0105: https://cdrhdev1.unl.edu/earlywashingtondc/cases/oscys.caseid.0105 It is the em dash, which doesn't really help, since we can't remove all em dashes. So, idea #1: is there a way to change the HTML creation script to explicitly set the encoding to UTF-8? (Maybe this would be useful, though I'm not quite sure how to implement it: https://gist.github.com/arpith20/4fcf7682a9154bc777dfcd2199edecf4) If that does not work, idea #2 is to re-encode special characters as HTML character references, but I am hoping not to have to do that. Update: I checked the XSLT file that creates the HTML for oscys (scripts/tei.p5) and I think it is setting the encoding correctly:
The XML files also correctly declare the encoding as UTF-8, though it is possible that the original file is using a non-UTF-8 encoding of the em dash.
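One way to test the suspicion that the em dash is stored in a non-UTF-8 encoding is to check whether the raw bytes are valid UTF-8. The sketch below is only an illustration: the helper names `valid_utf8?` and `cp1252_to_utf8` are hypothetical, and Windows-1252 is an assumed candidate for the legacy encoding, not something confirmed for these files.

```ruby
# Hypothetical helper: true when the raw bytes form a valid UTF-8 sequence.
def valid_utf8?(bytes)
  bytes.dup.force_encoding("UTF-8").valid_encoding?
end

# Hypothetical helper: reinterpret bytes as Windows-1252 (an assumption about
# the source encoding) and transcode them to UTF-8.
def cp1252_to_utf8(bytes)
  bytes.dup.force_encoding("Windows-1252").encode("UTF-8")
end

utf8_dash   = "\xE2\x80\x94".b  # em dash (U+2014) as three UTF-8 bytes
cp1252_dash = "\x97".b          # em dash as a single Windows-1252 byte

puts valid_utf8?(utf8_dash)     # prints "true"
puts valid_utf8?(cp1252_dash)   # prints "false"
puts cp1252_to_utf8(cp1252_dash) == "\u2014"  # prints "true"
```

If the source XML passes the validity check, the problem is more likely in how the generated HTML is read or combined than in the file itself.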
A little more investigation: I opened the file the HTML is transformed from, and the encoding of the em dash looks like this:
I believe 8212 is the decimal code point HTML uses for the em dash (U+2014), but it doesn't work if I change it to
Should we just change all of them to two minuses (--)? Any idea why it only seems to be a problem on the case files and not the documents?
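For comparison with the two-minuses option, idea #2 (re-encoding special characters) could look roughly like the Ruby sketch below. It is not part of the existing XSLT pipeline; `to_numeric_refs` is a hypothetical helper, and it assumes the input string is valid UTF-8.

```ruby
# Hypothetical helper: replace every non-ASCII character with its decimal
# HTML numeric character reference, leaving ASCII text untouched.
def to_numeric_refs(text)
  text.gsub(/[^\x00-\x7F]/) { |char| format("&#%d;", char.ord) }
end

puts to_numeric_refs("Plaintiff \u2014 Defendant")
# prints "Plaintiff &#8212; Defendant"
```

The output is pure ASCII, so it cannot trigger an encoding mismatch, at the cost of making the generated HTML harder to read and edit by hand.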
I'm not sure if this is still an issue, but it would be good to find out.
@kacinash do you know whether this is resolved, or whether you have a workaround?
I don't know. I'm not sure how to replicate the steps Greg took that produced the error.
I think we'd have to revert the change I added in 292e018 and review pages with the suspect characters.
Rails started throwing the error:
incompatible character encodings: ASCII-8BIT and UTF-8
for some case file pages. E.g.
Karin removed some text from the generated HTML file and it worked again, but we haven't identified the offending character or anything yet.
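Rather than deleting text by trial and error, the offending characters could be located by listing every non-ASCII character in the generated HTML. This is a minimal sketch; `non_ascii_report` is a hypothetical helper, and it assumes the HTML has been read as valid UTF-8 (otherwise `scan` raises an `ArgumentError`).

```ruby
# Hypothetical helper: list each non-ASCII character with its line number
# and Unicode code point.
def non_ascii_report(html)
  report = []
  html.each_line.with_index(1) do |line, number|
    line.scan(/[^\x00-\x7F]/) do |char|
      report << format("line %d: U+%04X %s", number, char.ord, char)
    end
  end
  report
end

sample = "<p>No issue here</p>\n<p>Plaintiff \u2014 Defendant</p>\n"
puts non_ascii_report(sample)
```

For the sample above, the report contains a single entry for line 2 with code point U+2014. On a real page, the string would come from something like `File.read(path, encoding: "UTF-8")`, where `path` points at the generated case file.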
Used a dirty fix from Stack Overflow (https://stackoverflow.com/a/9278713) in 292e018
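The error itself can be reproduced in plain Ruby, which may help whoever revisits this. The sketch below shows the general mechanism and one common workaround (`force_encoding`); it is not necessarily what the Stack Overflow answer or 292e018 does, and `try_concat` is a hypothetical helper.

```ruby
# Hypothetical helper: return the concatenated string, or the exception
# raised when the two encodings cannot be combined.
def try_concat(left, right)
  left + right
rescue Encoding::CompatibilityError => e
  e
end

binary = "\u2014 fragment".b         # em dash bytes tagged as ASCII-8BIT (binary)
page   = "Case file \u2014 summary"  # UTF-8 string that also contains an em dash

result = try_concat(binary, page)
puts result.class  # prints "Encoding::CompatibilityError"

# Workaround: relabel the binary string as UTF-8. This only changes the
# encoding tag, so it is safe only if the bytes really are UTF-8.
fixed = binary.dup.force_encoding("UTF-8")
puts try_concat(fixed, page).encoding  # prints "UTF-8"
```

Ruby raises only when both strings contain non-ASCII bytes, so pages without special characters are unaffected; that would be consistent with the error appearing on only some case file pages.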