GitHub - mattbasta/fastchardet: A less thorough (and faster) version of python-chardet

fastchardet
v0.1.2

This is a super-fast and friendly version of chardet. During the development
of the AMO validator for Mozilla (github.com/mattbasta/amo-validator), I used
the chardet library to determine the encoding of L10n files. The problem was
that chardet tested each file against a HUGE number of different encodings,
which made it very slow. An add-on that used to take one second to scan
suddenly took three.

In the case of the validator, I didn't need to know if the add-on was UTF-54
or had characters from a half-baked Esperanto implementation developed in the
1800s; I just needed to know whether it was UTF-8, ASCII, or something else.
Spending so many cycles to get something so simple to work just seems absurd.

This library should change that. I'm building it as a fork of chardet that
keeps the same interface but provides a much faster set of tests for the
popular encodings (and that's it).


Output
--------------------------
The detected encoding will be one of the following:

- ascii
- utf_8
- utf_n
- escaped (HZ-GB-2312, ISO-2022-CN, ISO-2022-JP, or ISO-2022-KR)
- latin1 (windows-1252)
- unknown

And that's it. It doesn't test for anything else.
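
To give a feel for why such a narrow set of tests is cheap, here is a minimal
sketch that sorts a byte string into three of the labels above. It is not
fastchardet's actual code: the function name quick_detect and the order of
the checks are assumptions made for this example, and the utf_n, escaped, and
latin1 cases are left out entirely.

```python
def quick_detect(data: bytes) -> str:
    """Classify bytes as 'ascii', 'utf_8', or 'unknown'.

    Hypothetical helper for illustration only, not part of fastchardet.
    """
    # Pure ASCII: strict decoding fails on any byte >= 0x80.
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass

    # Valid UTF-8: strict decoding rejects malformed byte sequences.
    try:
        data.decode("utf-8")
        return "utf_8"
    except UnicodeDecodeError:
        # Everything else is lumped together instead of being probed further.
        return "unknown"


print(quick_detect(b"hello"))
print(quick_detect("h\u00e9llo".encode("utf-8")))
print(quick_detect(b"\xff\xfe\xfd"))
```

Each check is a single strict decode, so the cost stays proportional to the
size of the input rather than to the number of candidate encodings.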

Oh, and unlike chardet, if you pass in a unicode string we'll actually return
"unicode" as the encoding instead of throwing an exception. How great is that?
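
The behaviour for already-decoded text can be sketched as follows. This is an
illustration under assumptions, not the library's implementation: the name
detect_text_or_bytes is hypothetical, the result dict mirrors the
encoding/confidence shape that chardet's detect() returns, the confidence
values are placeholders, and Python 3's str stands in for the unicode type.

```python
def detect_text_or_bytes(data):
    """Return a chardet-style result dict.

    Hypothetical helper for illustration only, not part of fastchardet.
    """
    if isinstance(data, str):
        # Already-decoded text: report "unicode" instead of raising.
        return {"encoding": "unicode", "confidence": 1.0}

    # Byte input: only an ASCII check is shown here to keep the sketch short.
    try:
        data.decode("ascii")
        return {"encoding": "ascii", "confidence": 1.0}
    except UnicodeDecodeError:
        return {"encoding": "unknown", "confidence": 0.0}


print(detect_text_or_bytes("already decoded"))
print(detect_text_or_bytes(b"plain bytes"))
```

Checking the type up front means callers don't have to guard every call with
their own isinstance test or try/except.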