Skip to content

mattbasta/fastchardet

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

fastchardet
v0.1.2

This is a super-fast and friendly version of chardet. During the development
of the AMO validator for Mozilla (github.com/mattbasta/amo-validator), I used
the chardet library to determine the encoding of L10n files. The problem with
this was that chardet tested each file against a HUGE number of different
encodings and ultimately performed very inefficiently. An add-on that would
take one second to scan suddenly took three.

In the case of the validator, I didn't need to know if the add-on was UTF-54
or had characters from a half-baked Esperanto implementation developed in the
1800s; I just needed to know whether it was UTF-8, ASCII, or something else.
Spending so many cycles to get something so simple to work just seems absurd.

This library should change that. I'm implementing it as a fork of chardet
that will implement the same interface, but provide a much faster set of tests
for the popular encodings (and that's it).


Output
--------------------------
There are a number of possible outputs:

- ascii
- utf_8
- utf_n
- escaped (HZ-GB-2312, ISO-2022-CN, ISO-2022-JP, or ISO-2022-KR)
- latin1 (windows-1252)
- unknown

And that's it. It doesn't test for anything else.

Oh, and unlike chardet, we'll actually return "unicode" as the encoding if you
pass in a unicode string rather than throwing exceptions. How great is that?

About

A less thorough (and faster) version of python-chardet

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages