Crap detector #50

Open
phochste opened this issue Mar 4, 2014 · 3 comments

phochste commented Mar 4, 2014

Would be nice to have some crap detector software to find bad characters or encoding problems.

E.g.
--find patterns of double-encoded UTF-8 in data input
--find valid but illegal control fields (e.g. these darlings kept me busy for some hours today: http://unicode-search.net/unicode-namesearch.pl?term=separator)

Any ideas which modules could be of help?
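
The first idea above (double-encoded UTF-8) can be sketched as follows. This is only an illustration in Python, not an existing Catmandu module: the function name `find_double_encoded` is hypothetical, and the sketch assumes the second encoding pass misread the bytes as Latin-1 (a Windows-1252 misreading would need a different mapping). It looks for character runs that resemble UTF-8 lead and continuation bytes and reports those that decode cleanly when turned back into bytes.

```python
import re

# Characters that look like a UTF-8 lead byte followed by the right number of
# continuation bytes, i.e. UTF-8 that was misread as Latin-1 and encoded again.
SUSPECT = re.compile(
    "[\u00c2-\u00df][\u0080-\u00bf]"        # 2-byte sequence
    "|[\u00e0-\u00ef][\u0080-\u00bf]{2}"    # 3-byte sequence
    "|[\u00f0-\u00f4][\u0080-\u00bf]{3}"    # 4-byte sequence
)

def find_double_encoded(text):
    """Return (offset, suspect, repaired) tuples for likely double-encoded runs.

    Assumption of this sketch: the misreading was Latin-1, not Windows-1252.
    """
    hits = []
    for m in SUSPECT.finditer(text):
        try:
            repaired = m.group().encode("latin-1").decode("utf-8")
        except UnicodeDecodeError:
            continue  # not valid UTF-8 as bytes, so probably genuine text
        hits.append((m.start(), m.group(), repaired))
    return hits

if __name__ == "__main__":
    bad = "M\u00fcller".encode("utf-8").decode("latin-1")
    for offset, suspect, repaired in find_double_encoded(bad):
        print(offset, repr(suspect), "->", repr(repaired))
```

A hit is only a hint: a legitimate text can contain such a character pair by coincidence, so the output is better suited to a report than to automatic repair.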

phochste added the idea label Mar 4, 2014

pietsch commented Mar 5, 2014

How about this (culled from https://unix.stackexchange.com/questions/6516/filtering-invalid-utf8 )?

perl -l -ne '/
 ^( ([\x00-\x7F])              # 1-byte pattern
   |([\xC2-\xDF][\x80-\xBF])   # 2-byte pattern
   |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
   |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))       # 4-byte pattern
  )*$ /x or print'

Regarding control fields, Wikipedia has some material:
https://en.wikipedia.org/wiki/C0_and_C1_control_codes
https://en.wikipedia.org/wiki/Unicode_control_characters

I am not sure what you mean by “valid but illegal”. Does it depend on context?


phochste commented Mar 5, 2014

Yes, along these lines. By “valid but illegal” I meant characters such as 0x1E (Record Separator): valid in the encoding, but not something you would expect in a MARC field value (this gave us issues with indexing the data in Solr).
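
A check for such characters could look roughly like the sketch below. It is written in Python for illustration only; the name `find_suspicious` and the choice of Unicode categories are assumptions of this sketch, not an agreed definition of what is illegal in a MARC field. It flags control characters (such as 0x1E), format characters, and the Unicode line and paragraph separators, while letting tab, newline and carriage return through.

```python
import unicodedata

# Unicode general categories treated as suspicious in a field value.
# This selection is an assumption and should be adjusted to the data:
# Cc = control, Cf = format, Zl = line separator, Zp = paragraph separator.
SUSPICIOUS = {"Cc", "Cf", "Zl", "Zp"}

# Control characters that are tolerated anyway (also an assumption).
ALLOWED = {"\t", "\n", "\r"}

def find_suspicious(value):
    """Return (offset, code point, name) for each unexpected character."""
    found = []
    for i, ch in enumerate(value):
        if ch in ALLOWED:
            continue
        if unicodedata.category(ch) in SUSPICIOUS:
            # C0/C1 controls have no formal Unicode name, hence the default.
            name = unicodedata.name(ch, "<unnamed>")
            found.append((i, "U+%04X" % ord(ch), name))
    return found

if __name__ == "__main__":
    print(find_suspicious("Title\x1e 2014"))
    print(find_suspicious("first\u2028second"))
```

Format characters (Cf) include the zero-width joiners that some scripts need, so that category may have to be dropped for multilingual data.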

I was thinking of some kind of Catmandu Cmd that could analyse a byte stream and report statistics on the characters used: this many alphanumerics, this many control codes, this many illegal UTF-8 sequences.
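
To make the idea concrete, here is a minimal sketch of such a report in Python. It is not a Catmandu command; the function name `byte_stream_stats` and the bucket names are hypothetical. It relies on the `surrogateescape` error handler, which maps every byte that cannot be decoded as UTF-8 to a lone surrogate in the range U+DC80..U+DCFF, so invalid bytes can be counted individually while the rest of the stream is still classified.

```python
import unicodedata
from collections import Counter

def byte_stream_stats(data):
    """Classify every character of a byte string decoded as UTF-8.

    Bucket names are illustrative. Control characters are counted before
    whitespace, so newline and tab land in the "control" bucket.
    """
    stats = Counter()
    for ch in data.decode("utf-8", errors="surrogateescape"):
        if "\udc80" <= ch <= "\udcff":
            stats["invalid_utf8_bytes"] += 1   # one escaped byte each
        elif ch.isalnum():
            stats["alphanumeric"] += 1
        elif unicodedata.category(ch) == "Cc":
            stats["control"] += 1
        elif ch.isspace():
            stats["whitespace"] += 1
        else:
            stats["other"] += 1
    return stats

if __name__ == "__main__":
    sample = b"MARC21\x1e caf\xc3\xa9 \xff\n"
    for bucket, count in sorted(byte_stream_stats(sample).items()):
        print(bucket, count)
```

For large files the same loop could be fed chunk by chunk through an incremental decoder instead of decoding the whole input at once.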


pietsch commented Mar 5, 2014

Will Encode::Guess perhaps do the job?

On an openSUSE server, you have /usr/bin/guess_encoding. It's a GPLv2 Perl script but weirdly I cannot find the code online, so I put it in a gist.
