# Exercise 02: concerns about characters

The following examples show how to use [codecs](https://docs.python.org/3/library/codecs.html) and how to [normalize Unicode](https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize) text. They draw heavily on the Wikipedia article [Metal umlaut](https://en.wikipedia.org/wiki/Metal_umlaut).

In [None]:
x = "Rinôçérôse screams ﬂow not unlike an encyclopædia, \
'TECHNICIÄNS ÖF SPÅCE SHIP EÅRTH THIS IS YÖÜR CÄPTÅIN SPEÄKING YÖÜR ØÅPTÅIN IS DEA̋D' to Spın̈al Tap."
type(x)

The variable `x` is a *string* in Python:

In [None]:
repr(x)

Its translation into [ASCII](http://www.asciitable.com/) escapes every non-ASCII character, which makes the result unusable by parsers:

In [None]:
ascii(x)

Encoding as [UTF-8](http://unicode.org/faq/utf_bom.html) doesn't help much, since each accented character becomes a multi-byte sequence:

In [None]:
x.encode('utf8')

Ignoring difficult characters is perhaps an even worse strategy:

In [None]:
x.encode('ascii','ignore')
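
To see what `'ignore'` costs, it can help to compare it with the other standard error handlers that `str.encode` accepts. The following is a minimal sketch; the `sample` string and the `results` dictionary are illustrative names, not part of the exercise.

```python
# Illustrative sample: "Rinôçérôse", written with escapes so the
# code points are unambiguous (precomposed characters).
sample = "Rin\u00f4\u00e7\u00e9r\u00f4se"

# Standard error handlers accepted by str.encode().
handlers = ["ignore", "replace", "backslashreplace", "xmlcharrefreplace"]

# Encode the same text with each handler and show the differences.
results = {h: sample.encode("ascii", h) for h in handlers}
for name, encoded in results.items():
    print(f"{name:18} {encoded!r}")
```

Only `'ignore'` removes the characters without leaving a trace; the others keep a placeholder or an escape that a later step could still detect.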

However, one can *normalize* then encode…

In [None]:
import unicodedata  # needed for normalize()
unicodedata.normalize('NFKD', x).encode('ascii','ignore')
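
Why normalizing first helps can be sketched as follows: NFKD splits a character such as `ô` into a base letter plus a combining accent, so that only the accent is lost when encoding with `'ignore'`. Characters without a decomposition, such as `æ`, are still dropped. The helper name `to_ascii` is hypothetical and used only for illustration.

```python
import unicodedata

def to_ascii(text):
    """Hypothetical helper: NFKD-normalize, then drop anything still non-ASCII."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore")

# U+00F4 (ô) decomposes into a base letter and a combining accent.
decomposed = unicodedata.normalize("NFKD", "\u00f4")
print([unicodedata.name(c) for c in decomposed])

# U+FB02 (the "fl" ligature) has a compatibility decomposition.
print(to_ascii("\ufb02ow"))            # b'flow'

# U+00E6 (æ) has no decomposition, so it is simply dropped.
print(to_ascii("encyclop\u00e6dia"))   # b'encyclopdia'
```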

Before normalizing and encoding, you may need to convert some characters explicitly so that they survive for parsing. For example:

In [None]:
x = "The sky “above” the port … was the color of ‘cable television’ – tuned to the Weather Channel®"
ascii(x)

Then consider the results here:

In [None]:
unicodedata.normalize('NFKD', x).encode('ascii','ignore')

...which drops characters that may be important for parsing a sentence, such as the quotation marks and the dash, so replace them explicitly instead:

In [None]:
x = x.replace('“', '"').replace('”', '"')
x = x.replace("‘", "'").replace("’", "'")
x = x.replace('…', '...').replace('–', '-')
print(x)
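
The chained `replace` calls above could also be expressed as a single translation table, followed by the normalize-then-encode step. This is a sketch under the assumption that the same six mappings are wanted; the names `PUNCTUATION` and `clean` are hypothetical.

```python
import unicodedata

# Same mappings as the replace() calls, written with escapes for clarity.
PUNCTUATION = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2026": "...",                # horizontal ellipsis
    "\u2013": "-",                  # en dash
})

def clean(text):
    """Hypothetical helper: map punctuation explicitly, then normalize and encode."""
    text = text.translate(PUNCTUATION)
    ascii_bytes = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")
    return ascii_bytes.decode("ascii")

print(clean("\u201cabove\u201d \u2026 \u2018cable television\u2019 \u2013 Weather Channel\u00ae"))
```

Note that the registered sign has no mapping here, so it is still dropped by the final encoding step.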