Detecting encoding takes too much time #70

Closed
eunovm opened this issue Feb 24, 2019 · 2 comments

@eunovm
Collaborator

eunovm commented Feb 24, 2019

Clade uses chardet to detect encodings in two places. When parsing includes, detection does not take very long in absolute terms, but its overhead seems to be about 50% of the total wall time. When parsing preprocessed source files, it can take an extremely long time.

We are not alone: https://github.com/kennethreitz/requests/issues/2359. Maybe we should switch to one of the alternative libraries. There are other options as well. For instance, Python likely uses locale.getpreferredencoding() implicitly, at least when opening files; for explicit decoding you need to call this function yourself. Besides, users can specify one or several encodings if necessary. Only if users do not specify encodings, or the default encodings fail, should Clade fall back to a library to detect encodings. I think other implementations will not be very fast either, so detection should be avoided as much as possible.
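
The fallback order proposed here (user-specified encodings first, then the locale default, and a detection library only as a last resort) might look roughly like the sketch below. This is an illustration, not Clade's code: the name `decode_with_fallback` and its parameters are hypothetical, and `detect` stands for any chardet-style callable that returns a dict with an `"encoding"` key.

```python
import locale


def decode_with_fallback(data, encodings=(), detect=None):
    """Decode bytes, trying cheap options before any detection library.

    `encodings` holds user-specified candidates (hypothetical parameter).
    `detect` is an optional callable such as chardet.detect (assumption).
    Returns a (text, encoding_name) tuple.
    """
    # 1. User-specified encodings, then the locale's preferred encoding.
    candidates = list(encodings) + [locale.getpreferredencoding(False)]
    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue

    # 2. Expensive step: ask the detection library only if all else failed.
    if detect is not None:
        guess = detect(data).get("encoding")
        if guess:
            try:
                return data.decode(guess), guess
            except (UnicodeDecodeError, LookupError):
                pass

    # 3. Last resort: never raise, replace undecodable bytes instead.
    return data.decode("utf-8", errors="replace"), "utf-8"
```

With this ordering, the detector runs only on files that none of the listed encodings can decode, so its cost is paid for the exceptional cases rather than for every file.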

eunovm added the enhancement (New feature or request) and High (High priority) labels Feb 24, 2019
17451k self-assigned this Feb 26, 2019
@17451k
Owner

17451k commented Feb 26, 2019

Starting from 354a62c, you can either specify the needed encodings yourself or use the alternative cchardet module, which should work much faster (it is written in C, just like ujson).
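
For testing, a minimal sketch of how an optional cchardet dependency could be selected at import time is shown below. It is not the code from 354a62c; the helper name `detect_encoding` and its `default` parameter are hypothetical. It relies only on the fact that both cchardet and chardet expose a `detect()` function that takes bytes and returns a dict with an `"encoding"` key.

```python
# Prefer the faster cchardet if it is installed, otherwise fall back to
# chardet, otherwise run without any detection library at all.
try:
    import cchardet as detector
except ImportError:
    try:
        import chardet as detector
    except ImportError:
        detector = None


def detect_encoding(data, default="utf-8"):
    """Return a guessed encoding name for `data` (bytes), or `default`."""
    if detector is None:
        return default
    result = detector.detect(data)
    # Both libraries may report None when they cannot make a guess.
    return result.get("encoding") or default
```

Because the two libraries share the same `detect()` interface, the rest of the code does not need to know which one was imported.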

Please test and report which way to detect encodings is better.

17451k added the question (Further information is requested) label Feb 26, 2019
eunovm removed the question (Further information is requested) label Mar 1, 2019
@eunovm
Collaborator Author

eunovm commented Mar 1, 2019

cchardet is fast, but I prefer to specify encodings manually, since detection does not seem to be reliable enough. I suggest switching to cchardet by default.

17451k added this to the 3.0 milestone Mar 1, 2019
17451k closed this as completed in 78aed60 Mar 1, 2019