Issues with encodings used in the Baltics #130

Closed
adbar opened this issue Oct 25, 2021 · 4 comments · Fixed by #133
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@adbar
Contributor

adbar commented Oct 25, 2021

Describe the bug

  • ISO-8859-15: detected correctly by chardet and cchardet; charset_normalizer reports cp850, cp857, cp858
  • WINDOWS-1252: detected correctly by chardet and cchardet; charset_normalizer reports cp775, cp850, cp857, cp858

The language detection works in both cases (Estonian).
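For anyone wanting to check a report like this locally, here is a minimal sketch that runs the same bytes through whichever of the three detectors happen to be installed. The sample sentence and the helper name `detect_all` are made up for illustration and are not the files from this report; a real test would read the original file in binary mode instead.

```python
import importlib

# Hypothetical sample: a short Estonian-looking sentence encoded as ISO-8859-15.
# A real reproduction would use open(path, "rb").read() on the reported file.
SAMPLE = "Täna õhtul sööme šokolaadi ja loeme žurnaali.".encode("iso8859_15")


def detect_all(data: bytes) -> dict:
    """Return the encoding guessed by each detector, or None if unavailable."""
    report = {}

    # chardet and cchardet expose the same detect() function.
    for name in ("chardet", "cchardet"):
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
            continue
        report[name] = module.detect(data)["encoding"]

    # charset_normalizer returns ranked matches; best() may be None.
    try:
        from charset_normalizer import from_bytes
    except ImportError:
        report["charset_normalizer"] = None
    else:
        best = from_bytes(data).best()
        report["charset_normalizer"] = best.encoding if best is not None else None

    return report


for detector, encoding in detect_all(SAMPLE).items():
    print(f"{detector}: {encoding}")
```

A sentence this short may well give different answers than a full document, so the output should be treated as a smoke test rather than a confirmation of the bug.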

To Reproduce

Desktop (please complete the following information):

https://charsetnormalizerweb-ousret.vercel.app/

@adbar adbar added bug Something isn't working help wanted Extra attention is needed labels Oct 25, 2021
@Ousret
Owner

Ousret commented Oct 25, 2021

I have taken a quick look at these.
They are still borderline cases: judging by the explain=True output, it's a close call. The engine is already in a sweet spot, so improving it further is difficult.

It's possible to fix those, but I lack the free time to do the fine-tuning for now.
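To illustrate why such files can be a close call, here is a small standard-library sketch. It is not how charset_normalizer ranks candidates; it only shows that the same bytes decode without error under every code page named in this issue, so a detector has to decide on plausibility alone. The sample sentence and the helper `decode_all` are assumptions made for the example.

```python
# Illustration only: this does not reproduce charset_normalizer's scoring.
TEXT = "Jõgeva külas öösel šokolaad ja žurnaal"  # hypothetical sample text
CANDIDATES = ["iso8859_15", "cp1252", "cp775", "cp850", "cp857", "cp858"]


def decode_all(raw: bytes, encodings=CANDIDATES) -> dict:
    """Decode the same bytes under each candidate single-byte code page."""
    # errors="replace" keeps the sketch from raising on unmapped bytes.
    return {name: raw.decode(name, errors="replace") for name in encodings}


raw = TEXT.encode("iso8859_15")
decoded = decode_all(raw)

for name, text in decoded.items():
    marker = "matches original" if text == TEXT else "mojibake"
    print(f"{name:>10}: {text}  ({marker})")
```

Every candidate yields a string of the same length, and only the accented letters differ, which is the kind of small signal a detector has to work with.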

@adbar
Contributor Author

adbar commented Oct 26, 2021

I understand, but it's still something that other packages manage to handle, so if you find some time in the future...

Ousret added a commit that referenced this issue Nov 1, 2021
Ousret added a commit that referenced this issue Nov 1, 2021
* ❇️ Adjust the MD around obvious bad EU word rendered
Partially close #130
* 🔖 Bump version 2.0.8.dev1
@Ousret
Owner

Ousret commented Nov 1, 2021

I have pushed and merged a solid patch for one of the two cases you reported.

For the remaining file, I could not find a viable solution for now. It still produces some very readable Estonian.
https://gitlab.freedesktop.org/uchardet/uchardet/-/blob/master/test/et/iso-8859-15.txt

Feel free to check against dev-master@2.0.8.dev1

@adbar
Contributor Author

adbar commented Nov 1, 2021

Nice, thanks!

2 participants