A less thorough (and faster) version of python-chardet
License
mattbasta/fastchardet
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
fastchardet v0.1.2 This is a super-fast and friendly version of chardet. During the development of the AMO validator for Mozilla (github.com/mattbasta/amo-validator), I used the chardet library to determine the encoding of L10n files. The problem with this was that chardet tested each file against a HUGE number of different encodings and ultimately performed very inefficiently. An add-on that would take one second to scan suddenly took three. In the case of the validator, I didn't need to know if the add-on was UTF-54 or had characters from a half-baked Esperanto implementation developed in the 1800s; I just needed to know whether it was UTF-8, ASCII, or something else. Spending so many cycles to get something so simple to work just seems absurd. This library should change that. I'm implementing it as a fork of chardet that will implement the same interface, but provide a much faster set of tests for the popular encodings (and that's it). Output -------------------------- There are a number of possible outputs: - ascii - utf_8 - utf_n - escaped (HZ-GB-2312, ISO-2022-CN, ISO-2022-JP, or ISO-2022-KR) - latin1 (windows-1252) - unknown And that's it. It doesn't test for anything else. Oh, and unlike chardet, we'll actually return "unicode" as the encoding if you pass in a unicode string rather than throwing exceptions. How great is that?
About
A less thorough (and faster) version of python-chardet
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published