Detecting encoding takes too much time #70

eunovm · 2019-02-24T08:28:04Z

Clade uses chardet to detect encodings in two places. When parsing includes, detection does not take very long in absolute terms, but its overhead still seems to be about 50% of the total wall time. When parsing preprocessed source files, it can take an extremely long time.

We are not alone: https://github.com/kennethreitz/requests/issues/2359. Maybe we should switch to one of the alternative libraries. There are other options as well. For instance, Python most likely uses locale.getpreferredencoding() implicitly, at least when opening files; for explicit decoding you need to call this function yourself. In addition, users could specify one or several encodings if necessary. Only if users do not specify encodings, or the default encodings fail, should Clade fall back to a library to detect encodings. I do not think other implementations will be very fast either, so this fallback should be avoided as much as possible.
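The order of attempts described above could be sketched roughly as follows. This is only an illustration, not Clade's actual code: the function name `decode_bytes`, its parameters, and the final "replace" fallback are all assumptions made for the example.

```python
import locale


def decode_bytes(data, encodings=(), detector=None):
    """Decode bytes, leaving expensive detection as the last resort.

    Hypothetical helper: 'encodings' are user-specified candidates,
    'detector' is an optional callable returning an encoding name or None.
    Returns a (text, encoding_used) tuple.
    """
    # 1. User-specified encodings first, then the locale default.
    candidates = list(encodings) + [locale.getpreferredencoding(False)]
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next one

    # 2. Only now pay for detection, if a detector was supplied.
    if detector is not None:
        guess = detector(data)
        if guess:
            try:
                return data.decode(guess), guess
            except (UnicodeDecodeError, LookupError):
                pass

    # 3. Last resort (an assumption): never fail, replace bad bytes.
    return data.decode("utf-8", errors="replace"), "utf-8"
```

With this ordering, a file that matches one of the user-specified encodings never reaches the detector, so the detection cost is paid only for the files that actually need it.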

17451k · 2019-02-26T15:20:09Z

Starting from 354a62c you can either specify the needed encodings yourself or use the alternative cchardet module, which should work much faster (it is written in C, just like ujson).
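For anyone testing this, a minimal sketch of preferring cchardet and falling back to chardet might look like the one below. It is not the code from 354a62c; the helper name `detect_encoding` and the `default` parameter are made up for illustration. Both libraries expose a `detect()` function that takes bytes and returns a dict with an `encoding` key.

```python
try:
    import cchardet as detector_module  # fast C-based detector, if installed
except ImportError:
    try:
        import chardet as detector_module  # pure-Python fallback
    except ImportError:
        detector_module = None  # no detection library available


def detect_encoding(data, default="utf-8"):
    """Return a guessed encoding name for 'data' (bytes).

    Hypothetical helper: falls back to 'default' when no detector is
    installed or when the detector cannot make a guess.
    """
    if detector_module is None:
        return default
    result = detector_module.detect(data)
    return result.get("encoding") or default
```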

Please test and report which way to detect encodings is better.

eunovm · 2019-03-01T05:56:35Z

cchardet is fast, but I prefer to specify encodings manually, since detection does not seem to be reliable enough. I suggest switching to cchardet by default.

eunovm added the enhancement (New feature or request) and High (High priority) labels Feb 24, 2019

17451k self-assigned this Feb 26, 2019

17451k added the question (Further information is requested) label Feb 26, 2019

eunovm removed the question (Further information is requested) label Mar 1, 2019

17451k added this to the 3.0 milestone Mar 1, 2019

17451k closed this as completed in 78aed60 Mar 1, 2019
