Crap detector #50
Comments
How about this (culled from https://unix.stackexchange.com/questions/6516/filtering-invalid-utf8)?
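For reference, the approach in that thread boils down to something like the sketch below (`suspect.txt` is a hypothetical input file; the grep variant needs GNU grep and a UTF-8 locale, while the iconv variant is more portable):

```shell
# Lines containing invalid UTF-8 (GNU grep, UTF-8 locale required).
# -a: treat binary as text, -x: match whole line, -v: invert, -c: count.
grep -axvc '.*' suspect.txt

# Portable alternative: iconv exits non-zero at the first invalid sequence.
iconv -f UTF-8 -t UTF-8 suspect.txt > /dev/null && echo valid || echo invalid
```

The iconv round-trip is handy in pipelines because the exit status alone tells you whether the stream is clean.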
Regarding control fields, Wikipedia has some material. I am not sure what you mean by “valid but illegal”. Does it depend on context?
Yes, along these lines. By “valid but illegal” I meant characters that are valid UTF-8 but unexpected in context: e.g. 0x1E (Record Separator) is a valid character, but you wouldn't expect it as a MARC field value (this gave us issues with indexing the data in Solr). I was thinking of some kind of Catmandu Cmd that could analyse a byte stream and give you statistical information on the characters used: this many alphanumerics, this many control codes, this many illegal UTF-8 sequences.
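Such a Cmd could start from a plain byte census; a minimal shell sketch (`data.mrc` is a hypothetical input file — a real Catmandu Cmd would of course do this in Perl):

```shell
# Rough byte census: visible characters, whitespace, and everything else
# (control codes and non-ASCII bytes) in a hypothetical file data.mrc.
total=$(wc -c < data.mrc)
graph=$(LC_ALL=C tr -cd '[:graph:]' < data.mrc | wc -c)
space=$(LC_ALL=C tr -cd '[:space:]' < data.mrc | wc -c)
other=$((total - graph - space))
echo "total=$total visible=$graph whitespace=$space control/high=$other"
```

The `control/high` bucket is where a stray 0x1E would show up.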
Will Encode::Guess perhaps do the job? On an openSUSE server there is /usr/bin/guess_encoding, a GPLv2 Perl script. Oddly, I cannot find the code online, so I put it in a gist.
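I don't know guess_encoding's exact command-line interface, but for a quick first look the standard file(1) utility makes a similar guess (`records.txt` is a hypothetical input file):

```shell
# Guess the character encoding of a file with file(1).
# Prints e.g. "records.txt: us-ascii" or "records.txt: utf-8".
file --mime-encoding records.txt
```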
It would be nice to have some crap-detector software to find bad characters or encoding problems. E.g.:
- find patterns of double-encoded UTF-8 in input data
- find valid but illegal control characters (e.g. these darlings kept me busy for some hours today: http://unicode-search.net/unicode-namesearch.pl?term=separator)
Any ideas which modules could be of help?
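As a stopgap before a proper module, both checks can be roughed out with GNU grep in byte mode (a heuristic sketch; `data.bin` is a hypothetical input file, and the double-encoding pattern only catches the common Latin-1 round-trip case):

```shell
# Heuristic scan for double-encoded UTF-8: a two-byte sequence such as
# C3 A9 ("é"), decoded as Latin-1 and re-encoded, becomes C3 83 C2 xx.
# LC_ALL=C forces byte-wise matching; -P needs GNU grep.
LC_ALL=C grep -cP '\xC3\x83\xC2[\x80-\xBF]' data.bin

# Count lines with "valid but illegal" separators (0x1D-0x1F: group,
# record, and unit separator) that tend to break indexers like Solr.
LC_ALL=C grep -cP '[\x1D-\x1F]' data.bin
```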