Text that looks identical can be encoded with different Unicode code point sequences, and these differences affect model performance. We therefore need to normalize the text, as described below, before passing it to the model.

# Canonical Equivalence

* Ç	 C◌̧, Combined character sequences
* 가	ᄀ ᅡ, Conjoined Korean characters

In [6]:
print("\u00C7")

Ç


In [10]:
"Ç" == "\u00C7"

True

In [8]:
# The two literals render identically, but one is the precomposed U+00C7 and the other is C followed by the combining cedilla U+0327
"Ç" == "Ç"

False
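To see why two strings that render identically can compare as unequal, it can help to list the code points in each. The following is a minimal sketch; the helper name `describe` is hypothetical and not part of the original notebook.

```python
import unicodedata

def describe(text):
    """Return a (code point, Unicode name) pair for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

precomposed = "\u00C7"       # one code point
decomposed = "\u0043\u0327"  # base letter followed by a combining mark

print(describe(precomposed))
print(describe(decomposed))
print(len(precomposed), len(decomposed))  # 1 2
```

The two strings have different lengths and different code points, which is why `==` returns False even though both display as Ç.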

# Compatibility equivalence

* ℌ	H	Font variant
* [NBSP]	[SPACE]	Line-breaking difference (NBSP forbids a line break, SPACE allows one)
* ①	1	Circled variant
* x²	x2	Superscript
* xⱼ	xj	Subscript
* ½	1/2	Fractions

In [4]:
"ℌ" == "H"

False

In [5]:
"½" == "1⁄2"  # "1⁄2" is three characters: 1, the fraction slash U+2044, and 2 (renderers may display them as a fraction)

False
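As a rough illustration of the list above, the sketch below applies compatibility decomposition (NFKD) to each example and compares the result with its plain counterpart. The dictionary name `examples` is hypothetical, and the characters are written as escapes so that the exact code points are unambiguous.

```python
import unicodedata

# Compatibility characters mapped to the plain text they are expected to fold to
examples = {
    "\u210C": "H",         # black-letter capital H
    "\u00A0": " ",         # no-break space
    "\u2460": "1",         # circled digit one
    "x\u00B2": "x2",       # x followed by superscript two
    "x\u2C7C": "xj",       # x followed by subscript j
    "\u00BD": "1\u20442",  # one half -> 1, fraction slash (U+2044), 2
}

for original, expected in examples.items():
    folded = unicodedata.normalize("NFKD", original)
    print(repr(original), "->", repr(folded), folded == expected)
```

Note that the fraction folds to a fraction slash (U+2044), not the ASCII slash, so `"1/2"` typed on a keyboard would still not match.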

# Decomposition and Composition 

Normal forms:

| Name | Abbreviation | Description | Example |
| --- | --- | --- | --- |
| Form D | NFD | Canonical decomposition | Ç → C ̧ |
| Form C | NFC | Canonical decomposition followed by canonical composition | Ç → C ̧ → Ç |
| Form KD | NFKD | Compatibility decomposition | ℌ ̧ → H ̧ |
| Form KC | NFKC | Compatibility decomposition followed by canonical composition | ℌ ̧ → H ̧ → Ḩ |
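The four forms can be compared side by side by applying each one to the same input and printing the resulting code points. This is only a sketch: the names `sample`, `code_points`, and `results` are hypothetical, and the input is the black-letter H followed by a combining cedilla, as in the last two rows above.

```python
import unicodedata

sample = "\u210C\u0327"  # black-letter capital H + combining cedilla

def code_points(text):
    """Format a string as a space-separated list of code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

results = {
    form: code_points(unicodedata.normalize(form, sample))
    for form in ("NFD", "NFC", "NFKD", "NFKC")
}

for form, cps in results.items():
    print(f"{form:4} -> {cps}")
# NFD  -> U+210C U+0327
# NFC  -> U+210C U+0327
# NFKD -> U+0048 U+0327
# NFKC -> U+1E28
```

The canonical forms leave this input unchanged because the black-letter H has no canonical decomposition; only the compatibility forms replace it with a plain H.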

# NFD and NFC

In [11]:
import unicodedata

In [12]:
c_with_cedilla = "\u00C7"  # Latin capital C with cedilla (single character)
c_with_cedilla

'Ç'

In [18]:
c_plus_cedilla = "\u0043\u0327"  # \u0043 = Latin capital C, \u0327 = 'combining cedilla' (two characters)
c_plus_cedilla

'Ç'

In [19]:
# These two versions do not match when compared:

c_with_cedilla == c_plus_cedilla  # different code point sequences, same rendered 'Ç'

False

If we apply NFD to the precomposed character \u00C7, it is decomposed into its components: the Latin capital C (\u0043) and the combining cedilla (\u0327). An NFD-normalized \u00C7 therefore equals the two-character sequence \u0043 + \u0327, and the comparison returns True:

In [20]:
unicodedata.normalize('NFD', c_with_cedilla) == c_plus_cedilla

True

However, if we apply NFC to \u00C7, the character is first decomposed into \u0043 + \u0327 and then composed back into \u00C7. The result is still the single precomposed character, so it does not match the two-character sequence:

In [22]:
# NFC decomposes \u00C7 into \u0043\u0327 and then recomposes it into \u00C7, so the rendered text is the same but the result stays precomposed

unicodedata.normalize('NFC', c_with_cedilla) == c_plus_cedilla

False

But if we instead apply NFC to the two-character sequence \u0043 + \u0327, it is first decomposed (a no-op, since it is already decomposed) and then composed into the single character \u00C7, so the comparison returns True:

In [23]:
# NFC composes \u0043\u0327 into \u00C7; the rendered text is the same in the end

c_with_cedilla == unicodedata.normalize('NFC', c_plus_cedilla)

True
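Instead of normalizing and comparing, one can also ask directly whether a string is already in a given form. The sketch below assumes Python 3.8 or later, where `unicodedata.is_normalized` is available; the variable names are reused from the cells above.

```python
import unicodedata

c_with_cedilla = "\u00C7"       # precomposed, one code point
c_plus_cedilla = "\u0043\u0327"  # decomposed, two code points

# The precomposed string is already in NFC but not in NFD
print(unicodedata.is_normalized("NFC", c_with_cedilla))  # True
print(unicodedata.is_normalized("NFD", c_with_cedilla))  # False

# The decomposed string is already in NFD but not in NFC
print(unicodedata.is_normalized("NFC", c_plus_cedilla))  # False
print(unicodedata.is_normalized("NFD", c_plus_cedilla))  # True
```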

# NFKD and NFKC

In [24]:
"ℌ" == "H"

False

In [29]:
# NFD leaves ℌ unchanged: it has a compatibility decomposition but no canonical one

unicodedata.normalize('NFD', "ℌ")

'ℌ'

In [30]:
unicodedata.normalize('NFKD', 'ℌ')  # compatibility decomposition; see "Decomposition and Composition"

'H'

In [31]:
fancy_h_with_caddila = "\u210B\u0327"  # \u210B = script capital H (ℋ), \u0327 = combining cedilla (two characters)
fancy_h_with_caddila

'ℋ̧'

In [35]:
h_with_caddila = "\u1e28"  # Latin capital H with cedilla (single character)
h_with_caddila

'Ḩ'

In [38]:
unicodedata.normalize('NFKC', fancy_h_with_caddila) == h_with_caddila

True

In [33]:
unicodedata.normalize('NFKC', fancy_h_with_caddila)

'Ḩ'

In [34]:
unicodedata.normalize('NFKC', fancy_h_with_caddila).encode('utf-8')

b'\xe1\xb8\xa8'
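Putting the pieces together, a preprocessing step might normalize every string to one form before it reaches the model. The sketch below is one possible shape for such a step, not a prescribed pipeline: the function names `normalize_text` and `equivalent` are hypothetical, and defaulting to NFKC is an assumption, since NFKC also discards distinctions (superscripts, font variants) that some tasks may need to keep.

```python
import unicodedata

def normalize_text(text, form="NFKC"):
    """Normalize text to a single Unicode form (NFKC by default, an assumption)."""
    return unicodedata.normalize(form, text)

def equivalent(a, b, form="NFKC"):
    """Compare two strings after normalizing both to the same form."""
    return normalize_text(a, form) == normalize_text(b, form)

print(equivalent("\u00C7", "\u0043\u0327"))          # True
print(equivalent("\u210B\u0327", "\u1E28"))          # True
print(equivalent("\u210B\u0327", "\u1E28", "NFC"))   # False
```

The last call shows the trade-off: under NFC the script H is preserved, so the two strings stay distinct, while NFKC folds it to a plain H and they become equal.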