Issues with encodings used in the Baltics #130

adbar · 2021-10-25T10:28:35Z

Describe the bug

ISO-8859-15 is detected correctly by chardet and cchardet, but charset_normalizer reports cp850, cp857, cp858.
WINDOWS-1252 is detected correctly by chardet and cchardet, but charset_normalizer reports cp775, cp850, cp857, cp858.

The language detection works in both cases (Estonian).
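To illustrate why several code pages can look plausible for the same file, here is a minimal standard-library sketch. The sample sentence and the helper name `decode_with_each` are assumptions made for illustration; they are not taken from the test files or from any of the libraries mentioned. It encodes a short Estonian sentence as ISO-8859-15 and decodes the bytes with each encoding named in this report.

```python
# Made-up Estonian sample, restricted to letters (ä, ö, ü, õ) that have the
# same byte values in ISO-8859-15 and Windows-1252.
TEXT = "Jõe ääres käisid öösel külalised."

# Python codec names for the encodings mentioned in the report.
CANDIDATES = ["iso8859_15", "cp1252", "cp775", "cp850", "cp857", "cp858"]


def decode_with_each(payload: bytes, encodings):
    """Decode payload with every codec; None marks a codec that rejects the bytes."""
    decoded = {}
    for name in encodings:
        try:
            decoded[name] = payload.decode(name)
        except UnicodeDecodeError:
            decoded[name] = None
    return decoded


payload = TEXT.encode("iso8859_15")
results = decode_with_each(payload, CANDIDATES)

for name, text in results.items():
    print(f"{name:>10}: {text!r}")
```

A single-byte code page accepts almost any byte sequence, so a successful decode does not prove that a candidate is correct. A detector has to rank candidates by how plausible the resulting text looks, which is where close calls like the ones above come from.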

To Reproduce

https://charsetnormalizerweb-ousret.vercel.app/

Ousret · 2021-10-25T21:24:04Z

I have taken a quick look over these.
These are still borderline cases: judging by the explain=True output, it's a close call. The engine is already in a sweet spot, and improving it further is difficult.

It's possible to fix those, but I lack the free time to do the fine-tuning for now.
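For readers who want to see how close such a call is, the following is a rough sketch, assuming the third-party `charset_normalizer` package is installed. The helper name `rank_candidates` is hypothetical. It relies only on `from_bytes` with `explain=True` and on the `encoding`, `chaos` and `coherence` attributes of each match.

```python
def rank_candidates(payload: bytes):
    """Return (encoding, chaos, coherence) for every match the detector keeps."""
    # Third-party dependency: pip install charset-normalizer
    from charset_normalizer import from_bytes

    # explain=True makes the library log its reasoning while it evaluates
    # each candidate code page.
    matches = from_bytes(payload, explain=True)
    return [(match.encoding, match.chaos, match.coherence) for match in matches]
```

Calling `rank_candidates(open(path, "rb").read())` on one of the affected files (with `path` pointing at a local copy) lists the surviving candidates. A lower chaos value and a higher coherence value indicate a better fit, so near-identical scores across several code pages are what a borderline case looks like.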

adbar · 2021-10-26T16:34:07Z

I understand, but it's still something that other packages manage to handle, so if you find some time in the future...

Ousret · 2021-11-01T16:45:44Z

I have pushed and merged a solid patch for one of the two cases you presented.

For the remaining file, I could not find a viable solution for now. The detected encoding still produces very readable Estonian.
https://gitlab.freedesktop.org/uchardet/uchardet/-/blob/master/test/et/iso-8859-15.txt

Feel free to check against dev-master@2.0.8.dev1

adbar · 2021-11-01T17:10:36Z

Nice, thanks!

adbar added the bug and help wanted labels Oct 25, 2021

Ousret added a commit that referenced this issue Nov 1, 2021

❇️ Adjust the MD around obvious bad EU word rendered

89b30a0

Partially close #130

Ousret mentioned this issue Nov 1, 2021

🛠️ Minor adjustement on the MD around european words #133

Merged

Ousret closed this as completed in #133 Nov 1, 2021

Ousret added a commit that referenced this issue Nov 1, 2021

🛠️ Minor adjustement on the MD around european words (#133)

00ffea0

* ❇️ Adjust the MD around obvious bad EU word rendered
  Partially close #130
* 🔖 Bump version 2.0.8.dev1
