Allow to configure a list of encodings to use when guessing #36951
Comments
Yes, I totally agree, because the automatic guessing is so weak. |
I agree. In my environment we have files in two encodings - UTF-8 and Windows1251 (most popular text file encoding in Russia), so I need to use encoding detection. However, it sometimes detects windows1251-encoded files as "maccyrillic" or "Windows1252" or some other encoding that I've never seen in my life :D
So instead of just "true", you could specify which encodings you want it to choose from. As far as I know, encoding detection works on probabilities (you can't say with 100% certainty which file is in which encoding, so the software has to pick the most probable answer), so I think this is possible to implement: just filter the list of candidate encodings down to those the user selected. |
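The filtering idea in the comment above could be sketched roughly as follows. This is only an illustration: the `Guess` shape, the `pickAllowedGuess` name, and the confidence values are all made up for the example and are not taken from VS Code or jschardet.

```typescript
// Hypothetical shape of a single detector guess (an assumption for this sketch).
interface Guess {
  encoding: string;
  confidence: number; // 0..1, higher means more probable
}

// Keep only guesses whose encoding is in the user's list,
// then return the most probable of the remaining ones.
function pickAllowedGuess(guesses: Guess[], allowed: string[]): Guess | undefined {
  const allowedLower = allowed.map((e) => e.toLowerCase());
  let best: Guess | undefined;
  for (const g of guesses) {
    if (allowedLower.indexOf(g.encoding.toLowerCase()) < 0) {
      continue; // the user did not select this encoding
    }
    if (best === undefined || g.confidence > best.confidence) {
      best = g;
    }
  }
  return best;
}

// Invented example values: the top raw guess is one the user never uses.
const guesses: Guess[] = [
  { encoding: "maccyrillic", confidence: 0.61 },
  { encoding: "windows-1251", confidence: 0.58 },
  { encoding: "windows-1252", confidence: 0.4 },
];
const picked = pickAllowedGuess(guesses, ["utf-8", "windows-1251"]);
```

With the invented values above, `maccyrillic` is discarded even though it has the highest raw score, and `windows-1251` is selected instead.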
Verification: There is now a Update: I decided to rename the setting to |
@octref you have to use a file that jschardet can detect properly. In your case it tells me: So it makes sense that UTF-8 if used |
To verify you can use |
@bpasero I see, the logic is
But I would argue this doesn't solve the users' problems. Let's say the user has a bunch of files that he knows is If the user wants all files to be opened as
A setting like this would be more useful: {
"files.encodingAssociations": {
"gbk": ["gb2312", "gb18030"],
"cp950": ["big5hkscs"]
// Everything else falls back to "utf-8"
}
} |
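As a rough sketch of how a lookup for the proposed setting might behave: `resolveAssociation` is a hypothetical helper, not an existing VS Code function, and the rule that a target name also matches itself is an assumption the proposal does not spell out.

```typescript
// Hypothetical: target encoding -> detected encodings that should open as that target.
type EncodingAssociations = { [target: string]: string[] };

function resolveAssociation(
  detected: string,
  associations: EncodingAssociations,
  fallback: string = "utf-8"
): string {
  const needle = detected.toLowerCase();
  for (const target of Object.keys(associations)) {
    // Assumption: a detected encoding equal to the target maps to itself.
    if (target.toLowerCase() === needle) {
      return target;
    }
    for (const alias of associations[target]) {
      if (alias.toLowerCase() === needle) {
        return target;
      }
    }
  }
  // Everything else falls back, as in the comment inside the proposed setting.
  return fallback;
}

const associations: EncodingAssociations = {
  gbk: ["gb2312", "gb18030"],
  cp950: ["big5hkscs"],
};
const fromGb2312 = resolveAssociation("gb2312", associations);
const fromBig5 = resolveAssociation("big5hkscs", associations);
const fromOther = resolveAssociation("shiftjis", associations);
```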
Maybe someone from this issue could comment if that was the desired solution or not (@JasonJunMa). |
@aadsm I am trying to add functionality for this issue to VS Code using the
I just submitted a pull request to fix this bug. I use Shift_JIS and EUC-JP on a daily basis, so fingers crossed that this fix will be incorporated. |
I've just published a new patch version to npm. Those hours learning github actions and workflows were really worth it in the end ahah. |
When using version 3.1.1 of jschardet and testing with my attached charset_test_file.php file it still does not return the correct encoding which is windows-1252. It returns { encoding: 'ISO-8859-2', confidence: 0.8496565744888162 } by default and it returns { encoding: null, confidence: 0 } when using { detectEncodings: ["UTF-8", "windows-1252"] }
In my project the files are either UTF-8 or windows-1252, and I suspect most projects are either exclusively UTF-8 or UTF-8 plus one other local encoding. Ideally we need an option so that, if UTF-8 is not detected, detection falls back to the local encoding provided in the array. For now I'll have to continue modifying \src\vs\workbench\services\textfile\common\encoding.ts as described here #36951 (comment) |
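The fallback behaviour requested above could look something like the sketch below. It does not use jschardet and is not how VS Code implements detection. It simply checks whether the bytes are valid UTF-8 with the standard `TextDecoder` API in `fatal` mode and otherwise returns a user-supplied local encoding. The `chooseEncoding` name is hypothetical.

```typescript
// Returns true if the bytes decode as UTF-8 without errors.
// With { fatal: true }, TextDecoder.decode throws on malformed input.
function isValidUtf8(bytes: Uint8Array): boolean {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

// Hypothetical helper: UTF-8 when the bytes are valid UTF-8,
// otherwise the single local encoding the user configured.
function chooseEncoding(bytes: Uint8Array, localFallback: string): string {
  return isValidUtf8(bytes) ? "utf-8" : localFallback;
}

// "café" encoded as UTF-8 (é is 0xC3 0xA9) and as windows-1252 (é is 0xE9).
const utf8Bytes = new Uint8Array([0x63, 0x61, 0x66, 0xc3, 0xa9]);
const cp1252Bytes = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);

const utf8Choice = chooseEncoding(utf8Bytes, "windows-1252");
const cp1252Choice = chooseEncoding(cp1252Bytes, "windows-1252");
```

One limitation of this approach: a file containing only ASCII bytes is valid UTF-8 and will always be reported as such, which is harmless for windows-1252 but may not be for every local encoding.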
@aadsm @nfrance709 |
@nfrance709 thanks for your detailed info. I was able to find and fix a couple of bugs related to detecting windows-1252. I also added a new test case with the exact code that was failing for you. I'm using the file you provided in my test. I hope that's ok with you! |
@yutotnh sorry about this, but could you update your pr with 3.1.2? 😅 |
@yutotnh thank you for the fixes in 3.1.2, and yes, use the file as a test case. Once @yutotnh's pull request is accepted and a new version of VSCode is released, I can go back to using the release version, as there are a number of features missing from the OSS version that I would like to use. |
Thank you, I just compiled and tested your latest version using 3.1.2 and it works as expected. I hope your pull request is accepted soon. |
…g encoding (#36951) (#208550) * Allow to configure a list of encodings to use when guessing #36951 * Bump up the jschardet version into 3.1.2 #36951 * missing merge * some polish * renames * some polish * some polish * cleanup --------- Co-authored-by: Benjamin Pasero <benjamin.pasero@microsoft.com>
Thanks to #208550, a new setting This will be available in our insiders channel on Friday after we have released stable. |
I would love to try it. Please post a reminder in this thread once it is published, and again once it is available to the public. |
This is now released in our insiders channel. You can give our preview releases a try from: https://code.visualstudio.com/insiders/ |
@bpasero do you want additional verification on this? |
I think it's fine given #36951 (comment) |
I don't know whether the right place to point this out is here or in the pull request, but I have found a particular situation in which the encoding detection does not work properly. |
The
files.autoGuessEncoding=true
setting doesn't work well in some circumstances. I think it would be good to add a feature like
files.forceEncoding="encode1:encode2,encode3:encode4"
so that it can force 'encode1' to 'encode2'. I think that would be a solution for wrong encoding detection.
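For illustration only, the `"encode1:encode2,encode3:encode4"` format suggested above could be parsed along these lines. `files.forceEncoding` is the setting proposed in this issue, not an existing one, and `parseForceEncoding` is a hypothetical helper.

```typescript
// Parses "from1:to1,from2:to2" into a lookup object { from1: "to1", from2: "to2" }.
// Malformed pairs (missing ":" or an empty side) are skipped.
function parseForceEncoding(value: string): { [from: string]: string } {
  const result: { [from: string]: string } = {};
  for (const pair of value.split(",")) {
    const parts = pair.split(":");
    if (parts.length !== 2) {
      continue;
    }
    const from = parts[0].trim().toLowerCase();
    const to = parts[1].trim().toLowerCase();
    if (from.length === 0 || to.length === 0) {
      continue;
    }
    result[from] = to;
  }
  return result;
}

const forced = parseForceEncoding("encode1:encode2,encode3:encode4");
// A detected 'encode1' would then be opened as forced["encode1"], i.e. 'encode2'.
```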