Fixes mojibake and other glitches in Unicode text, after the fact.

ftfy: fixes text for you

>>> from ftfy import fix_encoding
>>> print(fix_encoding("(à¸‡'âŒ£')à¸‡"))
(ง'⌣')ง

Full documentation: https://ftfy.readthedocs.org

Testimonials

  • “My life is livable again!” — @planarrowspace
  • “A handy piece of magic” — @simonw
  • “Saved me a large amount of frustrating dev work” — @iancal
  • “ftfy did the right thing right away, with no faffing about. Excellent work, solving a very tricky real-world (whole-world!) problem.” — Brennan Young
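  • “Hat mir die Tage geholfen. Im Übrigen bin ich der Meinung, dass wir keine komplexen Maschinen mit Computern bauen sollten solange wir nicht einmal Umlaute sicher verarbeiten können. :D” [“It helped me out the other day. Incidentally, I am of the opinion that we should not build complex machines with computers as long as we cannot even process umlauts reliably. :D”] — Bruno Ranieri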
  • “I have no idea when I’m gonna need this, but I’m definitely bookmarking it.” — /u/ocrow
  • “9.2/10” — pylint

Developed at Luminoso

Luminoso makes groundbreaking software for text analytics that really understands what words mean, in many languages. Our software is used by enterprise customers such as Sony, Intel, Mars, and Scotts, and it's built on Python and open-source technologies.

We use ftfy every day at Luminoso, because the first step in understanding text is making sure it has the correct characters in it!

Luminoso is growing fast and hiring. If you're interested in joining us, take a look at our careers page.

What it does

ftfy fixes Unicode that's broken in various ways.

The goal of ftfy is to take in bad Unicode and output good Unicode, for use in your Unicode-aware code. This is different from taking in non-Unicode and outputting Unicode, which is not a goal of ftfy. It also isn't designed to protect you from having to write Unicode-aware code. ftfy helps those who help themselves.
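
In practice this means raw bytes have to be decoded into a string before any text-fixing step. Here is a minimal sketch of that boundary, using only the standard library. The helper `read_text` and its default encoding are assumptions made for illustration; they are not part of ftfy.

```python
def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes into a Unicode string at the input boundary.

    Bytes that cannot be decoded become U+FFFD instead of raising an error.
    """
    return raw.decode(encoding, errors="replace")


text = read_text(b"sch\xc3\xb6n")
print(type(text).__name__, text)  # str schön
```

A string produced this way is the kind of input that functions such as `fix_text` are meant to receive.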

Of course you're better off if your input is decoded properly and has no glitches. But you often don't have any control over your input; it's someone else's mistake, but it's your problem now.

ftfy will do everything it can to fix the problem.

Mojibake

The most interesting kind of brokenness that ftfy will fix is when someone has encoded Unicode with one standard and decoded it with a different one. This often shows up as characters that turn into nonsense sequences (called "mojibake"):

  • The word schön might appear as schÃ¶n.
  • An em dash (—) might appear as â€”.
  • Text that was meant to be enclosed in quotation marks might end up instead enclosed in â€œ and â€<9d>, where <9d> represents an unprintable character.

ftfy uses heuristics to detect and undo this kind of mojibake, with a very low rate of false positives.
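
To make the mechanism concrete, here is a minimal sketch that uses only Python's built-in codecs, not ftfy itself. It assumes the text was encoded as UTF-8 and then wrongly decoded as Windows-1252. The helper names `make_mojibake` and `undo_mojibake` are hypothetical, and ftfy's own heuristics are more careful about deciding when such a reversal is safe.

```python
def make_mojibake(text: str) -> str:
    """Encode as UTF-8, then decode with the wrong codec (Windows-1252)."""
    return text.encode("utf-8").decode("windows-1252")


def undo_mojibake(garbled: str) -> str:
    """Reverse the mistake: re-encode as Windows-1252, then decode as UTF-8."""
    return garbled.encode("windows-1252").decode("utf-8")


garbled = make_mojibake("schön")
print(garbled)                 # schÃ¶n
print(undo_mojibake(garbled))  # schön
```

This strict round trip does not cover every case. Python's Windows-1252 codec leaves byte 0x9D unassigned, so a right curly quotation mark (whose UTF-8 encoding ends in 0x9D) raises an error in `make_mojibake` rather than producing the unprintable character described above.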

This part of ftfy now has an unofficial Web implementation by simonw: https://ftfy.now.sh/

Examples

fix_text is the main function of ftfy. This section is meant to give you a taste of the things it can do. fix_encoding is the more specific function that only fixes mojibake.

Please read the documentation for more information on what ftfy does, and how to configure it for your needs.

>>> from ftfy import fix_text
>>> print(fix_text('This text should be in â€œquotesâ€\x9d.'))
This text should be in "quotes".

>>> print(fix_text('Ã¼nicode'))
ünicode

>>> print(fix_text('Broken text&hellip; it&#x2019;s flubberific!',
...                normalization='NFKC'))
Broken text... it's flubberific!

>>> print(fix_text('HTML entities &lt;3'))
HTML entities <3

>>> print(fix_text('<em>HTML entities in HTML &lt;3</em>'))
<em>HTML entities in HTML &lt;3</em>

>>> print(fix_text('\001\033[36;44mI&#x92;m blue, da ba dee da ba '
...               'doo&#133;\033[0m', normalization='NFKC'))
I'm blue, da ba dee da ba doo...

>>> print(fix_text('ＬＯＵＤ　ＮＯＩＳＥＳ'))
LOUD NOISES

>>> print(fix_text('ＬＯＵＤ　ＮＯＩＳＥＳ', fix_character_width=False))
ＬＯＵＤ　ＮＯＩＳＥＳ
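
Some of the fixes shown above have rough counterparts in Python's standard library, which may help in seeing what they do. The sketch below does not call ftfy and is not how ftfy is implemented: `html.unescape` decodes HTML entities, and `unicodedata.normalize` with `'NFKC'` folds compatibility characters such as fullwidth letters and the horizontal ellipsis into plain equivalents. The helper name `rough_cleanup` is hypothetical.

```python
import html
import unicodedata


def rough_cleanup(text: str) -> str:
    """Decode HTML entities, then apply NFKC normalization."""
    return unicodedata.normalize("NFKC", html.unescape(text))


print(rough_cleanup("HTML entities &lt;3"))   # HTML entities <3
print(rough_cleanup("Broken text&hellip;"))   # Broken text...
print(rough_cleanup("ＬＯＵＤ　ＮＯＩＳＥＳ"))  # LOUD NOISES
```

Unlike the `fix_text` example above that leaves `&lt;` alone inside `<em>` tags, this sketch unescapes entities unconditionally, so it is not suitable for text that is already HTML.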

Installing

ftfy is a Python 3 package that can be installed using pip:

pip install ftfy

(Or use pip3 install ftfy on systems where Python 2 and 3 are both globally installed and pip refers to Python 2.)

If you're on Python 2.7, you can install an older version:

pip install 'ftfy<5'

You can also clone this Git repository and install it with python setup.py install.

Who maintains ftfy?

I'm Robyn Speer (rspeer@luminoso.com). I develop this tool as part of my text-understanding company, Luminoso, where it has proven essential.

Luminoso provides ftfy as free, open source software under the extremely permissive MIT license.

You can report bugs regarding ftfy on GitHub and we'll handle them.