There are two places where Clade uses chardet to detect an encoding. When parsing includes, detection does not take much time in absolute terms, but its overhead seems to be about 50% of the total wall time. When parsing preprocessed source files, it can take an extremely long time.
We are not alone: https://github.com/kennethreitz/requests/issues/2359. Maybe we should switch to one of the alternative libraries. There are other options as well. For instance, Python likely uses locale.getpreferredencoding() implicitly, at least when opening files; for explicit decoding you need to call this function yourself. Besides, users can specify one or several encodings if necessary. Finally, if users do not specify encodings, or the default encodings fail, Clade can fall back to a library to detect the encoding. I think other implementations will not be very fast either, so this last option should be avoided as much as possible.
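The order of fallbacks proposed above could look roughly like the sketch below. It is not Clade's actual code: the names `decode_bytes` and `_detect_encoding` are hypothetical, and the final lossy UTF-8 fallback is an assumption added so the function always returns a string. The detector is imported lazily, so the expensive path runs only when the cheaper candidates fail.

```python
import locale


def _detect_encoding(data):
    """Hypothetical helper: return a detected encoding name, or None."""
    try:
        import cchardet as detector  # optional third-party C implementation
    except ImportError:
        try:
            import chardet as detector  # optional pure-Python implementation
        except ImportError:
            return None  # no detection library is installed
    return detector.detect(data).get("encoding")


def decode_bytes(data, encodings=()):
    """Decode bytes, calling a detection library only as a last resort."""
    # 1. Encodings given by the user, then the locale's preferred encoding.
    candidates = list(encodings) + [locale.getpreferredencoding(False)]
    for encoding in candidates:
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next candidate

    # 2. Slow path: ask a detection library, if one is available.
    detected = _detect_encoding(data)
    if detected:
        try:
            return data.decode(detected)
        except (UnicodeDecodeError, LookupError):
            pass

    # 3. Assumption: prefer a lossy result over raising an exception.
    return data.decode("utf-8", errors="replace")
```

For example, `decode_bytes(raw, ["utf-8", "cp1251"])` tries UTF-8 first, then CP1251, then the locale default, and reaches the detector only if all of them fail.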
Starting from 354a62c you can either specify the needed encodings yourself or use the alternative cchardet module, which should work much faster (it is written in C, just like ujson).
Please test and report which way to detect encodings is better.
cchardet is fast, but I prefer to specify encodings manually, since detection does not seem to work well enough. I suggest switching to cchardet by default.