# Normalization

>Some Unicode characters can be represented by more than one sequence of code points.
They look the same, but they don't compare equal because the underlying code point sequences (and therefore the encoded bytes) differ. For example, take the acute-accented 'é' in 'café'. First, let's make a
single-code-point 'é' in several ways:

In [1]:
eacute1 = 'é' # pasted literal (source file is UTF-8)
eacute2 = '\u00e9' # Unicode code point escape
eacute3 = '\N{LATIN SMALL LETTER E WITH ACUTE}' # Unicode character name
eacute4 = chr(233) # code point as a decimal integer
eacute5 = chr(0xe9) # code point as a hex integer
print(eacute1, eacute2, eacute3, eacute4, eacute5)
eacute1 == eacute2 == eacute3 == eacute4 == eacute5

é é é é é


True

>Try a few sanity checks:

In [3]:
import unicodedata
# unicodedata.name(chr[, default]) returns the name assigned to the character chr as a string.
# If no name is defined, default is returned or, if no default was given, ValueError is raised.
unicodedata.name(eacute1) # 'LATIN SMALL LETTER E WITH ACUTE' (not displayed: only a cell's last expression is echoed)

# ord() returns the Unicode code point of a one-character string.
print(ord(eacute1)) # as a decimal integer
0xe9 # the same code point written as a hex literal; echoed as 233

233


233
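
>The comments above mention the `default` argument of `unicodedata.name()`. As a minimal sketch (the variable names and the `"<unnamed>"` placeholder are illustrative, not from the original cells), a control character such as NUL has no name, so it shows both behaviours:

```python
import unicodedata

# A named character: the lookup succeeds.
named = unicodedata.name("\u00e9")

# A control character such as NUL has no name, so the fallback is returned.
fallback = unicodedata.name("\x00", "<unnamed>")

# Without a fallback, the same lookup raises ValueError.
try:
    unicodedata.name("\x00")
    raised = False
except ValueError:
    raised = True

print(named)     # LATIN SMALL LETTER E WITH ACUTE
print(fallback)  # <unnamed>
print(raised)    # True
```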

>Now let’s make an accented e by combining a plain e with an acute accent:


In [4]:
eacute_combined1 = "e\u0301"
eacute_combined2 = "e\N{COMBINING ACUTE ACCENT}"
eacute_combined3 = "e" + "\u0301"
print(eacute_combined1, eacute_combined2, eacute_combined3)

é é é


In [5]:
eacute_combined1 == eacute_combined2 == eacute_combined3

True

In [6]:
len(eacute_combined1)

2
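
>To see where the length of 2 comes from, one option is to list the code points one by one. This is a small sketch; `eacute_combined` and `code_points` are names introduced here for illustration:

```python
import unicodedata

eacute_combined = "e\u0301"  # same value as eacute_combined1 above

# Iterating over a str yields one code point at a time.
for ch in eacute_combined:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0065 LATIN SMALL LETTER E
# U+0301 COMBINING ACUTE ACCENT

code_points = [f"U+{ord(ch):04X}" for ch in eacute_combined]
```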

>We built what displays as a single character from two code points, and it looks the same as the original 'é'. But as they say on Sesame Street, one of these things is not like the other:

In [7]:
eacute1 == eacute_combined1

False
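
>The two spellings differ in their code points and in their UTF-8 bytes, which is why the comparison fails. The standard library's `unicodedata.normalize()` can bring both to a common form before comparing. The following is a sketch rather than part of the original cells; the variable names are assumptions chosen for illustration. `"NFC"` composes to the single code point where one exists, and `"NFD"` decomposes into base letter plus combining mark:

```python
import unicodedata

composed = "\u00e9"     # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

# Different code points mean different encoded bytes.
print(composed.encode("utf-8"))    # b'\xc3\xa9'
print(decomposed.encode("utf-8"))  # b'e\xcc\x81'

# Normalize both directions and compare again.
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)
nfd_of_composed = unicodedata.normalize("NFD", composed)

print(composed == decomposed)         # False
print(nfc_of_decomposed == composed)  # True
print(nfd_of_composed == decomposed)  # True
print(len(nfc_of_decomposed), len(nfd_of_composed))  # 1 2
```

>Which form to pick depends on the use case; the point here is only that both strings must be normalized to the same form before they are compared.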