# Tutorial on Using PyCantonese

This tutorial introduces PyCantonese and Jyutping, shows how to install PyCantonese, and demonstrates how to segment Cantonese strings into words and convert them into Jyutping. Our focus is on converting Cantonese written in Chinese characters into Jyutping romanization. Jyutping helps Cantonese language learners see tones explicitly and build a vocabulary before learning characters. The conversion is trickier than it sounds: certain characters have more than one pronunciation because their tone changes with context, and a wrong reading can carry over into translations. Picking the correct tone for a character therefore depends on context, which means a string first needs to be segmented into words.

## What is PyCantonese?

PyCantonese is a Python library authored by [Jackson L. Lee](https://jacksonllee.com/) that can be used for Cantonese linguistics and natural language processing (NLP) tasks. You can use PyCantonese for the following tasks:

* Accessing and searching corpus data
* Parsing and conversion tools for Jyutping romanization
* Parsing Cantonese text
* Removing stop words
* Word segmentation
* Part-of-speech tagging

In this tutorial, we will focus on parsing and conversion of Cantonese characters for Jyutping romanization.

## What is Jyutping?

Jyutping is a romanization system for Cantonese developed by the Linguistic Society of Hong Kong to represent the tone and pronunciation of Chinese characters. Each Jyutping syllable typically consists of an initial (generally a consonant), a final (generally a series of vowels and/or consonants), and a tone (a number from 1 to 6). Tones are particularly important in Cantonese because the tone can change the meaning of a word entirely (e.g., sik1 means "to know" and sik6 means "to eat"). Context is often needed to identify the correct tone for a word automatically, which is something that PyCantonese is able to handle.
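To make the initial/final/tone structure concrete, here is a minimal sketch that splits a single Jyutping syllable with a regular expression. It is independent of PyCantonese's own Jyutping parsing tools; the pattern and the helper name `split_syllable` are assumptions made for illustration, and the pattern only checks the general shape of a syllable rather than validating it against the full Jyutping inventory.

```python
import re

# Illustrative pattern (an assumption, not PyCantonese's parser).
SYLLABLE = re.compile(
    r"^(ng|gw|kw|[bpmfdtnlgkhwzcsj])?"   # optional initial
    r"([aeiouy]+(?:ng|[mnptk])?|m|ng)"   # final: vowel-led rhyme, or syllabic m/ng
    r"([1-6])$"                          # tone number
)


def split_syllable(syllable):
    """Return (initial, final, tone) for one Jyutping syllable."""
    match = SYLLABLE.match(syllable)
    if match is None:
        raise ValueError(f"not a Jyutping syllable: {syllable!r}")
    initial, final, tone = match.groups()
    return (initial or "", final, int(tone))


# sik1 and sik6 share an initial and a final and differ only in tone.
for syllable in ["sik1", "sik6", "gwong2", "aa3", "m4"]:
    print(syllable, split_syllable(syllable))
```

Syllables such as `aa3` have no initial, and `m4` consists of a syllabic nasal only, which is why the initial is optional in the pattern.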

Jyutping is not the only romanization system for Cantonese. Once characters have been romanized to Jyutping, PyCantonese can convert the Jyutping into Yale romanization or TIPA (a LaTeX notation for the International Phonetic Alphabet). We will not cover that in this tutorial; see the following pages for more information:

* [Yale Conversion](https://pycantonese.org/jyutping.html#jyutping-to-yale-conversion)
* [TIPA Conversion](https://pycantonese.org/jyutping.html#jyutping-to-tipa-conversion)

## Download and Install

Let's start by installing (or upgrading) the PyCantonese library with pip.

In [6]:
pip install --upgrade pycantonese

Note: you may need to restart the kernel to use updated packages.


Next, let's import it.

In [8]:
import pycantonese
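
If you want to confirm which version was installed, one option is to query the package metadata from the standard library. This is a small sketch; the helper name `installed_version` is hypothetical, and it assumes Python 3.8 or later, where `importlib.metadata` is available.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the install above succeeded, otherwise None.
print(installed_version("pycantonese"))
```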

## Romanizing Chinese Characters to Jyutping

PyCantonese has a built-in function called `characters_to_jyutping()` that takes Cantonese text written in Chinese characters and returns its Jyutping romanization. The input can be a single character or a longer string. The function segments the input into words behind the scenes and returns a list of (word, Jyutping) tuples. Let's run an example from the PyCantonese website.

In [10]:
pycantonese.characters_to_jyutping('香港人講廣東話')  # "Hongkongers speak Cantonese"

[('香港人', 'hoeng1gong2jan4'), ('講', 'gong2'), ('廣東話', 'gwong2dung1waa2')]
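
Each tuple pairs a whole word with the Jyutping of all its syllables run together. If you need one syllable per character, a sketch like the following can split the romanization on tone digits and align it with the characters. The helper `align_characters` is hypothetical, and it assumes one syllable per character and that every syllable ends in a tone digit from 1 to 6.

```python
import re


def align_characters(word, jyutping):
    """Pair each character of a word with its Jyutping syllable."""
    syllables = re.findall(r"[a-z]+[1-6]", jyutping)
    if len(syllables) != len(word):
        raise ValueError("character and syllable counts differ")
    return list(zip(word, syllables))


# Result copied from the output above.
result = [('香港人', 'hoeng1gong2jan4'), ('講', 'gong2'), ('廣東話', 'gwong2dung1waa2')]

for word, jyutping in result:
    print(align_characters(word, jyutping))
```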

PyCantonese is also able to handle tone change (a.k.a. 變音 *pinjam*) which can occur depending on the context in which a Cantonese word or morpheme appears. Let's look at an example from the PyCantonese website.

In [12]:
# The correct pronunciation of 蛋 is with tone 2 (high-rising) as a standalone word.
pycantonese.characters_to_jyutping('蛋')  # egg

[('蛋', 'daan2')]

In [13]:
# The correct pronunciation of 蛋 is with tone 6 (low-level) in 蛋糕.
pycantonese.characters_to_jyutping('蛋糕')  # cake

[('蛋糕', 'daan6gou1')]
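
To see the tone change at a glance, the tone digits can be pulled out of the two results and compared. The sketch below works on the Jyutping strings copied from the outputs above; the `tones` helper and the `TONE_NAMES` table are illustrative assumptions, with the tone descriptions following the usual naming of the six Jyutping tones.

```python
import re

# Conventional descriptions of the six Jyutping tones.
TONE_NAMES = {
    1: "high-level",
    2: "high-rising",
    3: "mid-level",
    4: "low-falling",
    5: "low-rising",
    6: "low-level",
}


def tones(jyutping):
    """Return the tone numbers of a Jyutping string, one per syllable."""
    return [int(digit) for digit in re.findall(r"[1-6]", jyutping)]


readings = {
    "蛋": "daan2",       # standalone word
    "蛋糕": "daan6gou1",  # 蛋 as the first character of a compound
}

for word, jyutping in readings.items():
    first_tone = tones(jyutping)[0]
    print(word, jyutping, "-> 蛋 has tone", first_tone, f"({TONE_NAMES[first_tone]})")
```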

## Romanizing a corpus

PyCantonese uses [PyLangAcq](https://pylangacq.org/) to parse corpus data in the [CHAT](https://pylangacq.org/transcriptions.html#chat-format) format. The [Hong Kong Cantonese Corpus (HKCanCor)](https://github.com/fcbond/hkcancor) is bundled with PyCantonese and can be loaded with the `hkcancor()` function. The corpus is already segmented into words.

In [15]:
# Call the built-in HKCanCor corpus
hkcancor = pycantonese.hkcancor()

In [16]:
hkcancor.n_files()  # number of data files

58

In [17]:
len(hkcancor.words())  # number of words as segmented from all the utterances

153654

In [18]:
hkcancor.words(by_utterances=True)[:10]  # list of the first 10 utterances, segmented by word

[['喂', '遲', '啲', '去', '唔', '去', '旅行', '啊', '?'],
 ['你', '老公', '有冇', '平', '機票', '啊', '?'],
 ['平', '機票', '要', '淡季', '先', '有得', '平', '𡃉', '喎', '.'],
 ['而家', '旺', '-', '.'],
 ['冇得', '去', '嗱', '.'],
 ['吓', '?'],
 ['而家', '旺季', '.'],
 ['通常', '都', '係', '貴', '𡃉', '喎', ',', '啲', '機票', '.'],
 ['係', '咩', '?'],
 ['我',
  '聽',
  '朋友',
  '講',
  '話',
  '去',
  ',',
  '誒',
  ',',
  'Orlando',
  '嗰個',
  '舊',
  '-',
  '嗰個',
  '迪士尼',
  '呢',
  ',',
  '廿五',
  '週年',
  '喎',
  '.']]

We can pass the segmented words returned by `words()` to `characters_to_jyutping()` one at a time.

In [20]:
for word in hkcancor.words()[:20]:  # iterate over the first 20 words
    print(pycantonese.characters_to_jyutping(word))

[('喂', 'wai3')]
[('遲', 'ci4')]
[('啲', 'di1')]
[('去', 'heoi3')]
[('唔', 'm4')]
[('去', 'heoi3')]
[('旅行', 'leoi5hang4')]
[('啊', 'aa3')]
[('?', None)]
[('你', 'nei5')]
[('老公', 'lou5gung1')]
[('有冇', 'jau5mou5')]
[('平', 'peng4')]
[('機票', 'gei1piu3')]
[('啊', 'aa3')]
[('?', None)]
[('平', 'peng4')]
[('機票', 'gei1piu3')]
[('要', 'jiu3')]
[('淡季', 'daam6gwai3')]
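
In the output above, punctuation is paired with `None` rather than a Jyutping string, so code that joins the romanization needs to handle that case. Here is a minimal sketch; the helper `to_text` and its `keep_unromanized` parameter are hypothetical, and the sample pairs are copied from the output above (the words of the second utterance).

```python
def to_text(pairs, keep_unromanized=True):
    """Join (token, jyutping) pairs into one space-separated string.

    Tokens whose Jyutping is None are kept as-is or dropped,
    depending on keep_unromanized.
    """
    parts = []
    for token, jyutping in pairs:
        if jyutping is not None:
            parts.append(jyutping)
        elif keep_unromanized:
            parts.append(token)
    return " ".join(parts)


pairs = [
    ('你', 'nei5'), ('老公', 'lou5gung1'), ('有冇', 'jau5mou5'),
    ('平', 'peng4'), ('機票', 'gei1piu3'), ('啊', 'aa3'), ('?', None),
]

print(to_text(pairs))                          # keeps the question mark
print(to_text(pairs, keep_unromanized=False))  # drops tokens without Jyutping
```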


Or we could pass entire utterances to `characters_to_jyutping()`.

In [22]:
utterances = hkcancor.words(by_utterances=True)
utterance_list = []

for words in utterances:
    sentence = " ".join(words)  # Join the segmented words of an utterance into one string
    utterance_list.append(sentence)  # Add that string to a list

for sentence in utterance_list[:10]:
    print(pycantonese.characters_to_jyutping(sentence))

[('喂', 'wai3'), ('遲啲', 'ci4di1'), ('去', 'heoi3'), ('唔', 'm4'), ('去', 'heoi3'), ('旅行', 'leoi5hang4'), ('啊', 'aa3'), ('?', None)]
[('你', 'nei5'), ('老公', 'lou5gung1'), ('有冇', 'jau5mou5'), ('平', 'peng4'), ('機票', 'gei1piu3'), ('啊', 'aa3'), ('?', None)]
[('平', 'peng4'), ('機票', 'gei1piu3'), ('要', 'jiu3'), ('淡季', 'daam6gwai3'), ('先有', 'sin1jau5'), ('得', 'dak1'), ('平', 'peng4'), ('𡃉', 'gaa3'), ('喎', 'wo3'), ('.', None)]
[('而家', 'ji4gaa1'), ('旺', 'wong6'), ('-', None), ('.', None)]
[('冇得', 'mou5dak1'), ('去', 'heoi3'), ('嗱', 'laa4'), ('.', None)]
[('吓', 'haa2'), ('?', None)]
[('而家', 'ji4gaa1'), ('旺季', 'wong6gwai3'), ('.', None)]
[('通常', 'tung1soeng4'), ('都', 'dou1'), ('係', 'hai6'), ('貴', 'gwai3'), ('𡃉', 'gaa3'), ('喎', 'wo3'), (',', None), ('啲', 'di1'), ('機票', 'gei1piu3'), ('.', None)]
[('係', 'hai6'), ('咩', 'me1'), ('?', None)]
[('我', 'ngo5'), ('聽', 'teng1'), ('朋友', 'pang4jau5'), ('講話', 'gong2waa2'), ('去', 'heoi3'), (',', None), ('誒', 'e6'), (',', None), ('Orlando', None), ('嗰個', 'go2go3'), ('舊', 

Notice that PyCantonese returns `None` in place of Jyutping for tokens it does not romanize, such as punctuation and the English word "Orlando".
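
The utterance-level output also shows that `characters_to_jyutping()` applies its own word segmentation, which can differ from the corpus segmentation: in the first utterance, the corpus words 遲 and 啲 come back as the single word 遲啲. The sketch below finds such differences by comparing character spans. The helpers `spans` and `resegmented` are hypothetical, and the token lists are copied from the outputs above.

```python
def spans(tokens):
    """Return the set of (start, end) character spans covered by a token list."""
    result, start = set(), 0
    for token in tokens:
        end = start + len(token)
        result.add((start, end))
        start = end
    return result


def resegmented(corpus_tokens, new_tokens):
    """Return the tokens in new_tokens whose boundaries differ from corpus_tokens."""
    if "".join(corpus_tokens) != "".join(new_tokens):
        raise ValueError("the two segmentations cover different text")
    known = spans(corpus_tokens)
    changed, start = [], 0
    for token in new_tokens:
        end = start + len(token)
        if (start, end) not in known:
            changed.append(token)
        start = end
    return changed


corpus = ['喂', '遲', '啲', '去', '唔', '去', '旅行', '啊', '?']
output = ['喂', '遲啲', '去', '唔', '去', '旅行', '啊', '?']

print(resegmented(corpus, output))
```

If the corpus segmentation matters for your task, passing words one at a time, as in the earlier loop, keeps the corpus word boundaries.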

## Conclusion

There is much more you can do with PyCantonese; this was just a taste! For further information on how you can use this tool for NLP tasks, see the [PyCantonese website](https://pycantonese.org/quickstart.html).