Feature: detect mixups between two single-byte encodings #18

Open
rspeer opened this issue Jan 29, 2014 · 11 comments

@rspeer rspeer commented Jan 29, 2014

There is apparently a fair amount of Spanish text out there that contains a mix-up between Windows-1252 and MacRoman before being encoded in UTF-8.

Because Latin-1 for Windows-1252 is the only single-byte mixup we detect, we assume that's what happened, and get text that looks like: "PrevŽn diputados inaugurar periodo de sesiones con c—digo penal".

This is not a false positive, because the encoding is in fact incorrect (it's actually got the UTF-8 encoding of the wrong characters in it), and ftfy is trying to fix it. It's in fact using the same fix that any web browser would use. However, the resulting text makes no sense, because it's not the correct fix.

This mixup is apparently common enough that it would be worth fixing as another special case.
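The single-byte layer of this mix-up can be sketched with Python's standard codecs. This is only an illustration of the byte-level confusion, assuming the text was written as MacRoman and read as Windows-1252; it is not how ftfy detects anything, and the variable names are made up for the example.

```python
# Sketch: reproduce the MacRoman / Windows-1252 mix-up and undo it.
original = "Prevén diputados inaugurar periodo de sesiones con código penal"

# Bytes written as MacRoman, then misread as Windows-1252.
garbled = original.encode("mac_roman").decode("cp1252")
print(garbled)

# Reversing the two steps recovers the intended text.
repaired = garbled.encode("cp1252").decode("mac_roman")
print(repaired)
```

The hard part for a library is knowing when to try this reversal, since most Windows-1252 text is not garbled MacRoman.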

@martinblech martinblech commented Sep 30, 2014

Is this the same issue or a new one?

>>> s = u'Radio centralï¿½ï¿½ï¿½Hazme un Instrumento de tu pazï¿½ï¿½ï¿½ï¿½91.9.FM La radio de paz.ï¿½,'
>>> print ftfy.fix_text_segment(s)
Radio central���Hazme un Instrumento de tu paz����91.9.FM La radio de paz.�,

Source: http://184.107.166.66:8114/status.xsl

@rspeer rspeer commented Sep 30, 2014

That one's not an issue. Beneath the mojibake, that's exactly what the text says.

ï¿½ in Windows-1252 is 0xEF 0xBF 0xBD, the UTF-8 encoding of �, aka U+FFFD REPLACEMENT CHARACTER. Whatever actual Unicode the string was supposed to contain has already been lost.
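The byte arithmetic here is easy to check with the standard codecs. A small sketch (variable names are just for illustration):

```python
# U+FFFD REPLACEMENT CHARACTER and its UTF-8 bytes.
replacement = "\ufffd"
utf8_bytes = replacement.encode("utf-8")
print(utf8_bytes)  # the three bytes 0xEF 0xBF 0xBD

# Misreading those bytes as Windows-1252 gives the three-character mojibake.
mojibake = utf8_bytes.decode("cp1252")

# Undoing the misreading only gets back the replacement character itself,
# not whatever text it replaced.
fixed = mojibake.encode("cp1252").decode("utf-8")
```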

@rspeer rspeer changed the title from "Broken Spanish text turns into differently broken Spanish text" to "Feature: detect mixups between two encodings that aren't UTF-8" Oct 2, 2014

@rspeer rspeer commented Oct 2, 2014

There are several open issues that are really the same thing. I'm merging them all into this issue.

@martinblech martinblech commented Oct 3, 2014

@rspeer Cool! Let me know whether you'd like me to keep posting examples as I find them. I want to help but I don't want to spam :)

@rspeer rspeer commented Oct 12, 2014

The examples are helpful! I can use them as test cases.

@rspeer rspeer mentioned this issue Jun 16, 2015

@jpluimers jpluimers commented Jul 29, 2015

Related: the mixup of "v3/43/4r" (ASCII-printed high-byte characters) coming from "v¾¾r" (CP850) coming from "vóór" (Windows-1252). See http://stackoverflow.com/questions/17654898/which-encoding-failure-did-encode-v%C3%B3%C3%B3r-into-v3-43-4r
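The first two steps of that chain can be reproduced with standard codecs. The last step below is a guess at what happened: a plain string replacement standing in for whatever printed the fraction character as ASCII.

```python
word = "vóór"

# Windows-1252 bytes misread as CP850: 0xF3 is ó in one and ¾ in the other.
as_cp850 = word.encode("cp1252").decode("cp850")

# Hypothetical final step: a lossy ASCII rendering of the fraction character.
ascii_only = as_cp850.replace("\u00be", "3/4")
print(ascii_only)
```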

@rspeer rspeer commented Jul 29, 2015

Man. That's an unfortunate mix-up. But it's not one ftfy should fix, because pure ASCII is not something to be messed with.

I should, however, look into "the infamous CP850" and whether ftfy should consider it as a possibility, so that for example it could decode UTF-8 re-interpreted as CP850.
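For reference, a sketch of what "UTF-8 re-interpreted as CP850" looks like and how it would be reversed, using only the standard codecs. Whether and when ftfy should attempt this is the open question; the code just shows the round trip is lossless.

```python
text = "vóór"

# UTF-8 bytes misread as CP850: each ó becomes two box-drawing characters.
garbled = text.encode("utf-8").decode("cp850")
print(garbled)

# CP850 assigns a character to every byte, so the reversal is exact.
repaired = garbled.encode("cp850").decode("utf-8")
print(repaired)
```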

@lrq3000 lrq3000 commented Feb 13, 2017

What about this:

a = '''LiĂ¨ge Avenue de l'HĂ´pital'''  # French sentence
print(ftfy.fix_text(a.decode('utf8')))

# Out: LiĂ¨ge Avenue de l'HĂ´pital (no change from input), where it should be: Liège Avenue de l'Hôpital

Does this fit into this issue? I could not find any way to correct this (using ftfy or any other method).

@rspeer rspeer commented Feb 13, 2017

It's been encoded in UTF-8 and decoded in Windows-1250. Here's the code that specifically fixes it (written in a way that should work in Python 2 or 3):

>>> text = u"LiĂ¨ge Avenue de l'HĂ´pital"
>>> print(text.encode('windows-1250').decode('utf-8'))
Liège Avenue de l'Hôpital

So this is within the scope of ftfy, it's just not a possibility that it currently checks for. I'm aware that Windows-1250 is used somewhat frequently in Eastern Europe, and it's probably a bias in my data collection that I haven't seen many examples of it.
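If you don't know which single-byte encoding was involved, one rough approach is to try a few and keep the ones that produce valid UTF-8. This is a brute-force sketch, not what ftfy does: `candidate_fixes` and the `CANDIDATES` list are made up for the example, and it does nothing to rank results or avoid false positives.

```python
# Hypothetical list of single-byte encodings to try.
CANDIDATES = ["cp1252", "cp1250", "mac_roman", "cp850"]

def candidate_fixes(text, encodings=CANDIDATES):
    """Return {encoding: repaired_text} for each encoding that round-trips."""
    fixes = {}
    for enc in encodings:
        try:
            fixed = text.encode(enc).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this encoding cannot explain the text
        if fixed != text:
            fixes[enc] = fixed
    return fixes

# The garbled string from above: "LiĂ¨ge Avenue de l'HĂ´pital"
garbled = "Li\u0102\u00a8ge Avenue de l'H\u0102\u00b4pital"
print(candidate_fixes(garbled))
```

Here only Windows-1250 survives, because Ă has no byte in the other three encodings. With a longer candidate list several encodings can agree (or disagree), which is where a real heuristic is needed.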

I will open a new issue for this.

@Veki2808 Veki2808 commented Sep 21, 2017

If we have something like this, it's not a problem:
>>> print(ftfy.fix_text('uÌˆnicode'))
ünicode

But if we use mixed encoding types, something like this:
>>> print(ftfy.fix_text('Hi to ℙℽ☂ℌϕℿ uÌˆnicode'))
Hi to ℙℽ☂ℌϕℿ uÌˆnicode

Expected: Hi to ℙℽ☂ℌϕℿ ünicode

Why is this happening? Is this something that this library cannot handle?

@rspeer rspeer commented Sep 21, 2017

ftfy makes kind of arbitrary decisions about how to handle mixed encodings: it allows the encoding to change at line breaks, and it also decodes the most common mojibake sequences like â€¢ even when they're inconsistent with the surrounding line.

Encoding a combining umlaut as Ìˆ isn't common enough to fall into that second case.
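To see why the line as a whole can't simply be re-decoded, here is a sketch with the standard codecs. It only illustrates the byte-level situation and says nothing about ftfy's internals; the variable names are made up.

```python
# Correctly decoded symbols mixed with a mis-decoded combining umlaut.
line = "Hi to ℙℽ☂ℌϕℿ u\u00cc\u02c6nicode"  # ends in "uÌˆnicode"

# A whole-line repair fails: ℙ and friends have no Windows-1252 byte.
try:
    line.encode("cp1252")
    whole_line_ok = True
except UnicodeEncodeError:
    whole_line_ok = False

# The garbled pair on its own decodes to U+0308 COMBINING DIAERESIS.
pair = "\u00cc\u02c6"  # "Ìˆ"
combining = pair.encode("cp1252").decode("utf-8")

# The bullet sequence works the same way: "â€¢" becomes U+2022 BULLET.
bullet = "\u00e2\u20ac\u00a2".encode("cp1252").decode("utf-8")
print(whole_line_ok, repr(combining), bullet)
```

So fixing this input means repairing a substring in isolation, which is only safe for sequences that are very unlikely to be intentional.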

@rspeer rspeer mentioned this issue Jul 10, 2018
@rspeer rspeer changed the title from "Feature: detect mixups between two encodings that aren't UTF-8" to "Feature: detect mixups between two single-byte encodings" Jul 10, 2018