Allow to configure a list of encodings to use when guessing #36951

Closed
JasonJunMa opened this issue Oct 26, 2017 · 80 comments
Labels: feature-request, file-encoding, verified

Comments

@JasonJunMa

Setting files.autoGuessEncoding=true doesn't work well in some circumstances.

I think it would be good to add a setting like files.forceEncoding="encode1:encode2,encode3:encode4".

It would force 'encode1' to be treated as 'encode2'. I think that would be a solution for wrong encoding detection.

@vscodebot vscodebot bot added the workbench label Oct 26, 2017
@isidorn isidorn assigned bpasero and unassigned isidorn Oct 26, 2017
@isidorn isidorn added the feature-request label Oct 26, 2017
@bpasero bpasero removed their assignment Oct 26, 2017
@bpasero bpasero added the file-explorer label Oct 26, 2017
@fseasy

fseasy commented Oct 27, 2017

Yes, I totally agree, because auto-guessing is so weak.
Adding a list of candidates may be better!
For me, and maybe for many Chinese coders, UTF-8 and GB18030 are the encodings most commonly encountered, but auto-guess gives me Windows 1532??? I think it is easier to detect within the user's encoding candidates.

@phobos2077

I agree. In my environment we have files in two encodings - UTF-8 and Windows1251 (most popular text file encoding in Russia), so I need to use encoding detection. However, it sometimes detects windows1251-encoded files as "maccyrillic" or "Windows1252" or some other encoding that I've never seen in my life :D
Definitely need a setting like

files.detectEncodings=["utf8","windows1251"]

So instead of just "true", you can specify which encodings you want it to detect from. As far as I know, encoding detection works based on probabilities (you can't say with 100% certainty which file is in which encoding, so the software has to pick the most probable answer), so I think it is possible to implement: just filter the list of possible encodings down to those the user selected.
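
A minimal sketch of that filtering idea, assuming the detector can expose its ranked guesses as { encoding, confidence } records. The function name, the record shape, and the sample confidence values are hypothetical and are not jschardet's actual API:

```javascript
// Hypothetical helper: `guesses` is assumed to be a list of
// { encoding, confidence } records produced by some detector.
// Returns the most probable encoding that the user allowed, or null.
function pickAllowedEncoding(guesses, allowed) {
  const allowedSet = new Set(allowed.map((name) => name.toLowerCase()));
  let best = null;
  for (const guess of guesses) {
    // Ignore every guess the user did not list as a candidate.
    if (!allowedSet.has(guess.encoding.toLowerCase())) continue;
    if (best === null || guess.confidence > best.confidence) best = guess;
  }
  return best === null ? null : best.encoding;
}

// Made-up numbers, for illustration only.
const guesses = [
  { encoding: 'maccyrillic', confidence: 0.62 },
  { encoding: 'windows1251', confidence: 0.58 },
  { encoding: 'utf8', confidence: 0.05 },
];

console.log(pickAllowedEncoding(guesses, ['utf8', 'windows1251'])); // windows1251
```

Without the filter the top guess would be maccyrillic; with it, the best guess among the listed candidates wins even though its raw score is lower.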

@bpasero bpasero removed the workbench label Nov 16, 2017
@isidorn isidorn added file-encoding and removed file-explorer labels Nov 17, 2017
@bpasero bpasero removed their assignment Nov 18, 2017
@bpasero bpasero changed the title "Request feature in terms of encoding detection" to "Allow to configure a list of encodings to force" Sep 11, 2018
@bpasero bpasero changed the title "Allow to configure a list of encodings to force" to "Allow to configure a list of encodings to use when guessing" Sep 11, 2018
@bpasero bpasero self-assigned this Sep 13, 2018
@bpasero bpasero added this to the September 2018 milestone Sep 13, 2018
@bpasero bpasero added the verification-needed label Sep 13, 2018
@bpasero
Member

bpasero commented Sep 13, 2018

Verification: There is now a files.guessableEncodings setting where you can fill in encodings to support when guessing. From the explanation: If provided, will restrict the list of encodings that can be used when guessing. If the guessed file encoding is not in the list, the default encoding will be used.

Update: I decided to rename the setting to files.guessableEncodings
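
The rule described in the explanation can be sketched roughly as follows. This is an illustration under stated assumptions, not the VS Code implementation: the function name and parameters are hypothetical, and the behavior for an empty or missing list (use the guess as-is) is an assumption drawn from the words "if provided":

```javascript
// Hypothetical sketch of the documented rule:
// - no list provided: keep whatever was guessed
// - list provided: keep the guess only if it is in the list,
//   otherwise use the default encoding
function resolveEncoding(guessed, guessable, defaultEncoding) {
  if (!guessable || guessable.length === 0) {
    return guessed ?? defaultEncoding;
  }
  return guessable.includes(guessed) ? guessed : defaultEncoding;
}

console.log(resolveEncoding('cp1252', ['cp1252'], 'utf8')); // cp1252
console.log(resolveEncoding('gb2312', ['gbk'], 'utf8'));    // utf8
```

Note that the second call falls back to the default even though gb2312 is closely related to gbk: the check is a plain membership test on the guessed name.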

@octref
Contributor

octref commented Sep 26, 2018

@bpasero With these settings:

    "files.autoGuessEncoding": true,
    "files.guessableEncodings": [
      "gbk"
    ]

I still get this file as UTF-8. It is in gbk encoding with two Chinese characters.

foo.txt

@octref octref added the verification-found label Sep 26, 2018
@bpasero
Member

bpasero commented Sep 27, 2018

@octref you have to use a file that jschardet can detect properly. In your case it tells me:

[image: jschardet detection output]

So it makes sense that UTF-8 is used.

@bpasero bpasero removed the verification-found label Sep 27, 2018
@bpasero
Member

bpasero commented Sep 27, 2018

To verify you can use src/vs/base/test/node/encoding/fixtures/some.cp1252.txt with CP1252 encoding!

@octref
Contributor

octref commented Sep 27, 2018

@bpasero I see, the logic is

  • Guessed encoding is not in files.guessableEncodings
  • Fall back to utf-8

But I would argue this doesn't solve the users' problems. Let's say the user has a bunch of files that they know are gbk-encoded, but jschardet could have guessed any of these:

[image: possible jschardet guesses]

If the user wants all files to be opened as gbk, this setting would not work for them.
The original request is more about being able to set fallbacks. For example,

  • If guessed encoding is gb2312, gb18030, fall back to gbk.
  • Otherwise, fall back to utf-8.

A setting like this would be more useful:

{
  "files.encodingAssociations": {
    "gbk": ["gb2312", "gb18030"],
    "cp950": ["big5hkscs"]
    // Everything else falls back to "utf-8"
  }
}
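
How such a mapping might be applied is sketched below. The setting name comes from the proposal above and was never confirmed as shipped; the function name is hypothetical, and treating a guess that already equals a target (for example gbk itself) as that target is an assumption:

```javascript
// The proposed mapping: target encoding -> guesses that should resolve to it.
const encodingAssociations = {
  gbk: ['gb2312', 'gb18030'],
  cp950: ['big5hkscs'],
};

// Hypothetical resolver: map a guessed encoding to its associated target,
// or to the fallback when no association mentions it.
function applyAssociations(guessed, associations, fallback = 'utf-8') {
  for (const [target, sources] of Object.entries(associations)) {
    // Assumption: a guess that already equals the target is kept.
    if (target === guessed || sources.includes(guessed)) return target;
  }
  return fallback;
}

console.log(applyAssociations('gb18030', encodingAssociations));      // gbk
console.log(applyAssociations('big5hkscs', encodingAssociations));    // cp950
console.log(applyAssociations('windows-1252', encodingAssociations)); // utf-8
```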

@bpasero
Member

bpasero commented Sep 27, 2018

Maybe someone from this issue could comment if that was the desired solution or not (@JasonJunMa).

@irudoy

irudoy commented Sep 28, 2018

@bpasero in the implementation from the original pull request, the encoding falls back to the first one in the list instead of utf-8. That was definitely not a great solution. I think @octref's solution would resolve the issue.

@yutotnh
Contributor

yutotnh commented Mar 23, 2024

@aadsm
Thank you for releasing 3.1.0!

I am trying to add functionality for this issue to VS Code using the detectEncodings option that was added.
However, I found one bug in jschardet:
it throws an error if any of the following six encodings is specified in detectEncodings:

  • Shift-JIS
  • EUC-JP
  • GB2312
  • EUC-KR
  • Big5
  • EUC-TW

I just submitted a pull request to fix this bug.
Is it possible to import this fix into jschardet and release it as 3.1.1?
aadsm/jschardet#91

I use Shift_JIS and EUC-JP on a daily basis, so fingers crossed that this fix will be incorporated.

@aadsm

aadsm commented Mar 23, 2024

I've just published a new patch version to npm. Those hours learning GitHub Actions and workflows were really worth it in the end ahah.

@nfrance709

nfrance709 commented Mar 23, 2024

When using version 3.1.1 of jschardet and testing with my attached charset_test_file.php file, it still does not return the correct encoding, which is windows-1252.

charset_test_file.php.txt

It returns { encoding: 'ISO-8859-2', confidence: 0.8496565744888162 } by default, and it returns { encoding: null, confidence: 0 } when using { detectEncodings: ["UTF-8", "windows-1252"] }.

const fs = require('fs');
const jschardet = require('jschardet');

jschardet.enableDebug();

const content = fs.readFileSync('charset_test_file.php');
// const result = jschardet.detect(content);
const result = jschardet.detect(content, { detectEncodings: ["UTF-8", "windows-1252"] });
console.log(result);

In my project the files are either UTF-8 or windows-1252. I suspect most projects are either exclusively UTF-8 or UTF-8 plus one other local encoding, so ideally we need an option so that, if UTF-8 is not detected, detection falls back to the local encoding provided in the array.
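
A rough sketch of that "UTF-8 first, otherwise the local encoding" idea, using strict UTF-8 validation through Node's built-in TextDecoder. This is only an illustration of the requested behavior, not what jschardet or VS Code does; the function name is hypothetical:

```javascript
// Hypothetical helper: report 'utf-8' when the bytes are valid UTF-8,
// otherwise report the configured local encoding.
function utf8OrLocal(bytes, localEncoding) {
  try {
    // With fatal: true, decode() throws on any malformed UTF-8 sequence.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return 'utf-8';
  } catch {
    return localEncoding;
  }
}

const validUtf8 = Buffer.from('café', 'utf8');              // é is 0xC3 0xA9
const cp1252Bytes = Buffer.from([0x63, 0x61, 0x66, 0xe9]);  // é is 0xE9 in windows-1252

console.log(utf8OrLocal(validUtf8, 'windows-1252'));   // utf-8
console.log(utf8OrLocal(cp1252Bytes, 'windows-1252')); // windows-1252
```

One limitation of this approach: a pure ASCII file is also valid UTF-8, so it is always reported as utf-8, which is harmless for ASCII content.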

For now I’ll have to continue modifying \src\vs\workbench\services\textfile\common\encoding.ts as described here #36951 (comment)

@yutotnh
Contributor

yutotnh commented Mar 25, 2024

@aadsm
Thanks for releasing 3.1.1.
Thanks to you, I was able to create a pull request (#208550).

@nfrance709
This pull request will open with files.encoding if the encoding cannot be guessed, as in charset_test_file.php.txt.

@aadsm

aadsm commented Mar 25, 2024

@nfrance709 thanks for your detailed info. I was able to find and fix a couple of bugs related to detecting windows-1252. I also added a new test case with the exact code that was failing for you. I'm using the file you provided in my test. I hope that's ok with you!
I've released these fixes under version 3.1.2.

@aadsm

aadsm commented Mar 25, 2024

@yutotnh sorry about this, but could you update your pr with 3.1.2? 😅

yutotnh added a commit to yutotnh/vscode that referenced this issue Mar 25, 2024
@yutotnh
Contributor

yutotnh commented Mar 25, 2024

@aadsm Updated jschardet to 3.1.2 at ff546d5.

@nfrance709

> @nfrance709 thanks for your detailed info. I was able to find and fix a couple of bugs related to detecting windows-1252. I also added a new test case with the exact code that was failing for you. I'm using the file you provided in my test. I hope that's ok with you! I've released these fixes under version 3.1.2.

@yutotnh thank you for the fixes in 3.1.2, and yes, please use the file as a test case.

Once @yutotnh's pull request is accepted and a new version of VS Code is released, I can go back to using the release version, as there are a number of features missing from the OSS version that I would like to use.

yutotnh added a commit to yutotnh/vscode that referenced this issue Mar 25, 2024
@nfrance709

> @aadsm Thanks for releasing 3.1.1. Thanks to you, I was able to create a pull request (#208550).
>
> @nfrance709 This pull request will open with files.encoding if the encoding cannot be guessed, as in charset_test_file.php.txt.

Thank you, I just compiled and tested your latest version using 3.1.2 and it works as expected. I hope your pull request is accepted soon.

bpasero added a commit that referenced this issue Jun 5, 2024
…g encoding (#36951) (#208550)

* Allow to configure a list of encodings to use when guessing #36951

* Bump up the jschardet version into 3.1.2 #36951

* missing merge

* some polish

* renames

* some polish

* some polish

* cleanup

---------

Co-authored-by: Benjamin Pasero <benjamin.pasero@microsoft.com>
@bpasero bpasero self-assigned this Jun 5, 2024
@bpasero bpasero modified the milestones: Backlog, June 2024 Jun 5, 2024
@bpasero
Member

bpasero commented Jun 5, 2024

Thanks to #208550, a new setting files.candidateGuessEncodings allows you to specify, as a string[] list, which encodings may be guessed. The option maps directly to the detectEncodings option in jschardet: https://github.com/aadsm/jschardet?tab=readme-ov-file#options

This will be available in our insiders channel on Friday after we have released stable.

@bpasero bpasero closed this as completed Jun 5, 2024
@peminator

Would love to try it. Please remind us in this thread once it is published or available to the public.

@bpasero
Member

bpasero commented Jun 6, 2024

This is now released in our insiders channel. You can give our preview releases a try from: https://code.visualstudio.com/insiders/

@formigoni

I have tried the implementation on the Insiders version!! It works like a charm!! Thanks @bpasero thanks @yutotnh thanks @aadsm !!
Here is how I have configured it:

{
    "files.candidateGuessEncodings": [
        "windows1252",
        "utf8"
    ],
    "files.autoGuessEncoding": true
}

@Yoyokrazy Yoyokrazy added the verification-needed label Jun 25, 2024
@alexr00
Member

alexr00 commented Jun 25, 2024

@bpasero do you want additional verification on this?

@bpasero
Member

bpasero commented Jun 25, 2024

I think it's fine given #36951 (comment)

@bpasero bpasero added verified and removed verification-needed labels Jun 25, 2024
@formigoni

I don't know whether the right place to point this out is here or in the pull request, but I have found a particular situation in which encoding detection does not work properly.
#208550 (comment)

@vs-code-engineering vs-code-engineering bot locked and limited conversation to collaborators Jul 20, 2024