Crap detector #50

phochste · 2014-03-04T15:23:13Z

Would be nice to have some crap detector software to find bad characters or encoding problems.

E.g.
--find patterns of double encoded UTF-8 in data input
--find valid but illegal control fields (e.g. these darlings kept me busy for some hours today: http://unicode-search.net/unicode-namesearch.pl?term=separator)

Any ideas which modules could be of help?
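
For the first item, a rough heuristic can be sketched in a few lines. This is a Python illustration only, not an existing Catmandu module; the function name `looks_double_encoded` is made up for this example, and it assumes the wrong intermediate encoding was Latin-1 (data mangled via Windows-1252 would need a variation). Genuine text such as "Ã©" would be flagged too, so treat a hit as a hint, not proof.

```python
def looks_double_encoded(data: bytes) -> bool:
    """Heuristic check for UTF-8 that was encoded twice via Latin-1.

    Assumption: the bytes were UTF-8, got misread as Latin-1, and were
    then encoded to UTF-8 a second time.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8 at all, which is a different problem
    if text.isascii():
        return False  # pure ASCII cannot show the pattern
    try:
        # Undo the assumed Latin-1 misreading; fails for code points > U+00FF.
        inner = text.encode("latin-1")
        # If the recovered bytes are valid UTF-8 again, the input was
        # probably encoded twice.
        inner.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    return True


once = "é".encode("utf-8")                      # b"\xc3\xa9"
twice = once.decode("latin-1").encode("utf-8")  # b"\xc3\x83\xc2\xa9"
print(looks_double_encoded(once), looks_double_encoded(twice))
```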

pietsch · 2014-03-05T09:25:50Z

How about this (culled from https://unix.stackexchange.com/questions/6516/filtering-invalid-utf8 )?

perl -l -ne '/
 ^( ([\x00-\x7F])              # 1-byte pattern
   |([\xC2-\xDF][\x80-\xBF])   # 2-byte pattern
   |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
   |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))       # 4-byte pattern
  )*$ /x or print'
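
The one-liner prints every input line that is not well-formed UTF-8. As a point of comparison, here is a minimal Python sketch of the same check that relies on the strict built-in UTF-8 decoder instead of a hand-written pattern. The name `invalid_utf8_lines` is hypothetical, and reporting the line number and byte offset is an addition of this sketch, not something the Perl version does.

```python
def invalid_utf8_lines(lines):
    """Yield (line_number, byte_offset, raw_line) for each line that is not
    well-formed UTF-8. `lines` is any iterable of bytes objects, e.g. a
    file opened in binary mode."""
    for lineno, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")  # strict by default
        except UnicodeDecodeError as err:
            # err.start is the offset of the first offending byte
            yield lineno, err.start, raw


sample = [b"ok\n", b"\xc3\x28\n", b"caf\xc3\xa9\n"]
for lineno, offset, raw in invalid_utf8_lines(sample):
    print(lineno, offset, raw)
```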

Regarding control fields, Wikipedia has some material:
https://en.wikipedia.org/wiki/C0_and_C1_control_codes
https://en.wikipedia.org/wiki/Unicode_control_characters

I am not sure what you mean by “valid but illegal”. Does it depend on context?

phochste · 2014-03-05T14:49:46Z

Yes, along these lines. By “valid but illegal” I meant characters that are valid in the encoding but unwanted in context: e.g. 0x1E (Record Separator) is valid, but you wouldn't expect it in a MARC field value (this gave us issues when indexing the data in Solr).
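
A check for such characters in an already decoded field value could look roughly like this. It is a Python sketch, not Catmandu code: the function name, the choice of Unicode categories (Cc for control codes, Zl and Zp for the line and paragraph separators) and the allow-list of tab, LF and CR are all assumptions for illustration. Depending on the data, category Cf (format characters) might be worth adding.

```python
import unicodedata

# Assumption: these general categories count as "suspect" in a field value.
SUSPECT_CATEGORIES = {"Cc", "Zl", "Zp"}
# Assumption: ordinary whitespace controls are tolerated.
ALLOWED = {"\t", "\n", "\r"}


def find_suspect_chars(value: str):
    """Return a list of (offset, 'U+XXXX', category) tuples, one for each
    suspect character found in `value`."""
    hits = []
    for offset, ch in enumerate(value):
        category = unicodedata.category(ch)
        if category in SUSPECT_CATEGORIES and ch not in ALLOWED:
            hits.append((offset, f"U+{ord(ch):04X}", category))
    return hits


print(find_suspect_chars("Title\x1e with a record separator"))
# [(5, 'U+001E', 'Cc')]
```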

I was thinking of some kind of Catmandu Cmd that could analyse a bytestream and just give you some statistical information on the characters used: this many alphanumerics, this many control codes, this many illegal UTF-8 sequences.
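
To make the idea concrete, here is a small Python sketch of such a census. It is not a Catmandu command; the function name `char_census` and the four buckets are assumptions for this example. It uses the `surrogateescape` error handler, which maps every byte that cannot be decoded to a lone surrogate in the range U+DC80..U+DCFF, so invalid input is counted per byte rather than per sequence.

```python
import unicodedata
from collections import Counter


def char_census(data: bytes) -> Counter:
    """Count character classes in a byte stream that is expected to be UTF-8."""
    counts = Counter()
    # Undecodable bytes become lone surrogates U+DC80..U+DCFF instead of
    # raising an exception.
    text = data.decode("utf-8", errors="surrogateescape")
    for ch in text:
        if 0xDC80 <= ord(ch) <= 0xDCFF:
            counts["invalid_utf8_bytes"] += 1
        elif unicodedata.category(ch) == "Cc":
            counts["control"] += 1
        elif ch.isalnum():
            counts["alphanumeric"] += 1
        else:
            counts["other"] += 1
    return counts


print(char_census(b"ab1 \x1e\xff"))
```

For this input the census reports three alphanumerics, one control code (0x1E), one invalid byte (0xFF) and one other character (the space).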

pietsch · 2014-03-05T16:17:31Z

Will Encode::Guess perhaps do the job?

On an openSUSE server, you have /usr/bin/guess_encoding. It's a GPLv2 Perl script but weirdly I cannot find the code online, so I put it in a gist.

phochste added the idea label Mar 4, 2014
