ftfy: fixes text for you
>>> print(fix_encoding("(à¸‡'âŒ£')à¸‡")) (ง'⌣')ง
Full documentation: http://ftfy.readthedocs.org
- “My life is livable again!” — @planarrowspace
- “A handy piece of magic” — @simonw
- “ftfy did the right thing right away, with no faffing about. Excellent work, solving a very tricky real-world (whole-world!) problem.” — Brennan Young
- “Hat mir die Tage geholfen. Im Übrigen bin ich der Meinung, dass wir keine komplexen Maschinen mit Computern bauen sollten solange wir nicht einmal Umlaute sicher verarbeiten können. :D” — Bruno Ranieri
- “I have no idea when I’m gonna need this, but I’m definitely bookmarking it.” — /u/ocrow
- “9.2/10” — pylint
Developed at Luminoso
Luminoso makes groundbreaking software for text analytics that really understands what words mean, in many languages. Our software is used by enterprise customers such as Sony, Intel, Mars, and Scotts, and it's built on Python and open-source technologies.
We use ftfy every day at Luminoso, because the first step in understanding text is making sure it has the correct characters in it!
Luminoso is growing fast and hiring. If you're interested in joining us, take a look at our careers page.
What it does
ftfy fixes Unicode that's broken in various ways. It works in Python 2.7,
Python 3.2, or later.
The goal of
ftfy is to take in bad Unicode and output good Unicode, for use
in your Unicode-aware code. This is different from taking in non-Unicode and
outputting Unicode, which is not a goal of ftfy. It also isn't designed to
protect you from having to write Unicode-aware code. ftfy helps those who help
Of course you're better off if your input is decoded properly and has no glitches. But you often don't have any control over your input; it's someone else's mistake, but it's your problem now.
ftfy will do everything it can to fix the problem.
The most interesting kind of brokenness that ftfy will fix is when someone has encoded Unicode with one standard and decoded it with a different one. This often shows up as characters that turn into nonsense sequences (called "mojibake"):
- The word
schönmight appear as
- An em dash (
—) might appear as
- Text that was meant to be enclosed in quotation marks might end up
instead enclosed in
<9d>represents an unprintable character.
ftfy uses heuristics to detect and undo this kind of mojibake, with a very low rate of false positives.
fix_text is the main function of ftfy. This section is meant to give you a
taste of the things it can do.
fix_encoding is the more specific function
that only fixes mojibake.
Please read the documentation for more information on what ftfy does, and how to configure it for your needs.
>>> print(fix_text('This text should be in â€œquotesâ€\x9d.')) This text should be in "quotes". >>> print(fix_text('uÌˆnicode')) ünicode >>> print(fix_text('Broken text… it’s ﬂubberiﬁc!', ... normalization='NFKC')) Broken text... it's flubberific! >>> print(fix_text('HTML entities <3')) HTML entities <3 >>> print(fix_text('<em>HTML entities in HTML <3</em>')) <em>HTML entities in HTML <3</em> >>> print(fix_text('\001\033[36;44mI’m blue, da ba dee da ba ' ... 'doo…\033[0m', normalization='NFKC')) I'm blue, da ba dee da ba doo... >>> print(fix_text('ＬＯＵＤ ＮＯＩＳＥＳ')) LOUD NOISES >>> print(fix_text('ＬＯＵＤ ＮＯＩＳＥＳ', fix_character_width=False)) ＬＯＵＤ ＮＯＩＳＥＳ
ftfy is a package that can be installed using
pip install ftfy
You can also clone this Git repository and install it with
python setup.py install.
Who maintains ftfy?
Luminoso provides ftfy as free, open source software under the extremely permissive MIT license.
You can report bugs regarding ftfy on GitHub and we'll handle them.