Use more sophisticated encoding detection when utf8 decoding fails. #172

mikkeldenker · 2024-03-05T09:47:53Z

Closes #137.

Some websites, especially older ones, use an encoding other than UTF-8 or Latin-1. Previously, we simply tried different encodings until one of them decoded the bytes successfully, but this approach can fail unexpectedly: bytes written in one encoding can be decoded by a different encoding without any error being reported. We now use the encoding detection crate 'chardetng', which is also [used in Firefox](https://github.com/hsivonen/chardetng?tab=readme-ov-file#purpose).
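To illustrate the failure mode described above, here is a minimal std-only sketch. It is not the code from this PR: `decode_latin1`, `decode_naive`, and `PRIVET_WINDOWS_1251` are hypothetical names, and the UTF-8-then-Latin-1 order is an assumption about what "try encodings until one succeeds" might look like. The point is that Latin-1 maps every byte to a character, so the fallback always "succeeds", even on bytes that were really written in windows-1251.

```rust
/// Decode bytes as ISO-8859-1 (Latin-1). Every byte maps to the Unicode
/// code point with the same value, so this decoder can never report an error.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Hypothetical "try encodings until one succeeds" decoder (an assumption,
/// not the project's actual code): UTF-8 first, then Latin-1 as a fallback.
/// Returns the decoded text and a label for the encoding that "worked".
fn decode_naive(bytes: &[u8]) -> (String, &'static str) {
    match std::str::from_utf8(bytes) {
        Ok(s) => (s.to_owned(), "utf-8"),
        // Latin-1 accepts any byte sequence, so this branch never fails,
        // whatever the real encoding of the page was.
        Err(_) => (decode_latin1(bytes), "latin1"),
    }
}

/// The Russian word "Привет" encoded as windows-1251.
/// These bytes are not valid UTF-8.
const PRIVET_WINDOWS_1251: [u8; 6] = [0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2];
```

Calling `decode_naive(&PRIVET_WINDOWS_1251)` reports `"latin1"` with no error, but the text it returns is `Ïðèâåò` rather than `Привет`. A statistical detector such as chardetng guesses the encoding from the byte patterns instead of relying on decode errors, which this fallback chain cannot do.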

mikkeldenker merged commit f8c58b3 into main Mar 5, 2024
3 checks passed

mikkeldenker deleted the detect-character-encodings branch March 5, 2024 09:55
