Feature: detect mixups between two single-byte encodings #18

Open
rspeer opened this Issue Jan 29, 2014 · 11 comments

rspeer commented Jan 29, 2014

There is apparently a fair amount of Spanish text out there that contains a mix-up between Windows-1252 and MacRoman before being encoded in UTF-8.

Because Latin-1 for Windows-1252 is the only single-byte mixup we detect, we assume that's what happened, and get text that looks like: "PrevŽn diputados inaugurar periodo de sesiones con c—digo penal".

This is not a false positive, because the encoding is in fact incorrect (it's actually got the UTF-8 encoding of the wrong characters in it), and ftfy is trying to fix it. It's in fact using the same fix that any web browser would use. However, the resulting text makes no sense, because it's not the correct fix.

This mixup is apparently common enough that it would be worth fixing as another special case.
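A minimal sketch of the mix-up described above, using only Python's built-in codecs rather than ftfy. The intended headline is an assumption reconstructed from the garbled one; the codec names `mac_roman` and `cp1252` are Python's names for MacRoman and Windows-1252.

```python
# Illustrative only: reproduces the MacRoman / Windows-1252 mix-up by hand.
intended = "Prevén diputados inaugurar periodo de sesiones con código penal"

# Bytes written as MacRoman but read as Windows-1252 give the garbled headline.
garbled = intended.encode("mac_roman").decode("cp1252")
print(garbled)

# Undoing the two steps in reverse order recovers the intended text.
repaired = garbled.encode("cp1252").decode("mac_roman")
print(repaired)
```

The hard part for a library is not this reversal but deciding, without knowing the intended text, that MacRoman is the right guess.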

martinblech Sep 30, 2014

Is this the same issue or a new one?

>>> s = u'Radio centralï¿½ï¿½ï¿½Hazme un Instrumento de tu pazï¿½ï¿½ï¿½ï¿½91.9.FM La radio de paz.ï¿½,'
>>> print ftfy.fix_text_segment(s)
Radio central���Hazme un Instrumento de tu paz����91.9.FM La radio de paz.�,

Source: http://184.107.166.66:8114/status.xsl

rspeer Sep 30, 2014

Member

That one's not an issue. Beneath the mojibake, that's exactly what the text says.

ï¿½ in Windows-1252 is 0xEF 0xBF 0xBD, the UTF-8 encoding of �, aka U+FFFD REPLACEMENT CHARACTER. Whatever actual Unicode the string was supposed to contain has already been lost.
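The byte values mentioned here can be checked with Python's built-in codecs. This is only a small verification sketch, not part of ftfy:

```python
# U+FFFD REPLACEMENT CHARACTER and its UTF-8 bytes.
replacement = "\ufffd"
utf8_bytes = replacement.encode("utf-8")
print(utf8_bytes)  # b'\xef\xbf\xbd'

# Reading those bytes as Windows-1252 gives the three-character mojibake.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)

# Undoing the mojibake only gets back the replacement character itself;
# the character it replaced is not stored anywhere.
restored = mojibake.encode("cp1252").decode("utf-8")
print(restored == replacement)
```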

@rspeer rspeer changed the title from Broken Spanish text turns into differently broken Spanish text to Feature: detect mixups between two encodings that aren't UTF-8 Oct 2, 2014

rspeer Oct 2, 2014

Member

There are several open issues that are really the same thing. I'm merging them all into this issue.

martinblech Oct 3, 2014

@rspeer Cool! Let me know whether you'd like me to keep posting examples as I find them. I want to help but I don't want to spam :)

rspeer Oct 12, 2014

Member

The examples are helpful! I can use them as test cases.

@rspeer rspeer referenced this issue Jun 16, 2015

Closed

Fix Failure #39

jpluimers Jul 29, 2015

Related: the mixup of "v3/43/4r" (ASCII-printed high-byte characters) coming from "v¾¾r" (CP850) coming from "vóór" (Windows-1252). See http://stackoverflow.com/questions/17654898/which-encoding-failure-did-encode-v%C3%B3%C3%B3r-into-v3-43-4r

rspeer Jul 29, 2015

Member

Man. That's an unfortunate mix-up. But it's not one ftfy should fix, because pure ASCII is not something to be messed with.

I should, however, look into "the infamous CP850" and whether ftfy should consider it as a possibility, so that for example it could decode UTF-8 re-interpreted as CP850.
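For reference, a sketch of both CP850 cases using Python's built-in `cp850` codec. The helper name `undo_utf8_as_cp850` is hypothetical and only illustrates what such a fix would have to do; it is not something ftfy provides.

```python
# Case from the Stack Overflow question: Windows-1252 bytes read as CP850.
original = "vóór"
as_cp850 = original.encode("cp1252").decode("cp850")
print(as_cp850)  # v¾¾r

# Case under consideration: UTF-8 bytes re-interpreted as CP850.
mojibake = original.encode("utf-8").decode("cp850")
print(mojibake)

def undo_utf8_as_cp850(text):
    """Hypothetical helper: reverse a UTF-8-read-as-CP850 mix-up."""
    return text.encode("cp850").decode("utf-8")

print(undo_utf8_as_cp850(mojibake))
```

The third step, where the high bytes were printed as ASCII ("3/4"), loses information and is not reversed here.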

lrq3000 Feb 13, 2017

What about this:

a = '''LiĂ¨ge Avenue de l'HĂ´pital'''  # French sentence
print(ftfy.fix_text(a.decode('utf8')))

# Out: LiĂ¨ge Avenue de l'HĂ´pital, no change from input, where it should be: Liège Avenue de l'Hôpital

Does this fit into this issue? I could not find any way to correct this (using ftfy or any other method).

rspeer Feb 13, 2017

Member

It's been encoded in UTF-8 and decoded in Windows-1250. Here's the code that specifically fixes it (written in a way that should work in Python 2 or 3):

>>> text = u"LiĂ¨ge Avenue de l'HĂ´pital"
>>> print(text.encode('windows-1250').decode('utf-8'))
Liège Avenue de l'Hôpital

So this is within the scope of ftfy, it's just not a possibility that it currently checks for. I'm aware that Windows-1250 is used somewhat frequently in Eastern Europe, and it's probably a bias in my data collection that I haven't seen many examples of it.

I will open a new issue for this.
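To illustrate what "a possibility that it checks for" could mean, here is a rough sketch that tries a short list of single-byte encodings in turn. The candidate list and the function name `guess_fix` are assumptions made for this example; ftfy's real detection uses its own heuristics and is not shown here.

```python
# Hypothetical candidate list; a real tool would need to rank these carefully.
CANDIDATES = ["latin-1", "cp1252", "cp1250"]

def guess_fix(text):
    """Return (fixed_text, encoding) for the first candidate that works,
    or (text, None) if no candidate yields different, valid UTF-8."""
    for encoding in CANDIDATES:
        try:
            fixed = text.encode(encoding).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this candidate cannot explain the text
        if fixed != text:
            return fixed, encoding
    return text, None

# Build the mojibake from this thread: UTF-8 bytes decoded as Windows-1250.
broken = "Liège Avenue de l'Hôpital".encode("utf-8").decode("cp1250")
print(guess_fix(broken))
print(guess_fix("plain ASCII is left alone"))
```

Taking the first candidate that decodes is too naive for real text, since several encodings can produce valid UTF-8 from the same string; it only shows why adding Windows-1250 to the set of checked encodings would cover this case.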

Veki2808 Sep 21, 2017

If we have something like this, it's not a problem:
>>> print(ftfy.fix_text('uÌˆnicode'))
ünicode

But if we mix encodings, as in this example:
>>> print(ftfy.fix_text('Hi to ℙℽ☂ℌϕℿ uÌˆnicode'))
Hi to ℙℽ☂ℌϕℿ uÌˆnicode

Expected: Hi to ℙℽ☂ℌϕℿ ünicode

Why is this happening? Is this something that this library cannot handle?

rspeer Sep 21, 2017

Member

ftfy makes kind of arbitrary decisions about how to handle mixed encodings: it allows the encoding to change at line breaks, and it also decodes the most common mojibake sequences like â€¢ even when they're inconsistent with the surrounding line.

Encoding a combining umlaut as Ìˆ isn't common enough to fall into that second case.
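A simplified sketch of the line-by-line behavior, assuming only the Windows-1252/UTF-8 mix-up and plain Python codecs. The function names are hypothetical, and the sketch leaves out ftfy's special handling of very common sequences.

```python
def fix_line(line):
    """Try to undo UTF-8-read-as-Windows-1252 for one line; keep it if that fails."""
    try:
        return line.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return line

def fix_by_line(text):
    """Apply fix_line to each line independently."""
    return "\n".join(fix_line(line) for line in text.split("\n"))

# Mojibake built for the example: UTF-8 bytes of "ünicode" read as Windows-1252.
bad = "ünicode".encode("utf-8").decode("cp1252")

# On its own line the mojibake is fixed; the other line is left untouched.
print(fix_by_line("Hi to ℙℽ☂ℌϕℿ\n" + bad))

# On the same line as characters outside Windows-1252, the whole line fails
# to encode, so nothing is changed.
print(fix_by_line("Hi to ℙℽ☂ℌϕℿ " + bad))
```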

@rspeer rspeer referenced this issue Jul 10, 2018

Closed

MacRoman-CP437 #105

@rspeer rspeer changed the title from Feature: detect mixups between two encodings that aren't UTF-8 to Feature: detect mixups between two single-byte encodings Jul 10, 2018
