# Tutorial on Using PyCantonese

This tutorial introduces PyCantonese and Jyutping, then shows how to install PyCantonese, segment words in Cantonese strings, and parse Cantonese strings into Jyutping. The focus is on converting Cantonese characters and strings into their Jyutping romanization. Jyutping helps Cantonese language learners visualize tones and build a vocabulary before learning characters. The conversion is trickier than it sounds, and it can cause problems in translation, because some characters have more than one pronunciation that differs in tone. Context is often needed to determine which tone is correct for a character, so strings must first be segmented into words.

## What is PyCantonese?

PyCantonese is a Python library authored by [Jackson L. Lee](https://jacksonllee.com/) that can be used for Cantonese linguistics and natural language processing (NLP) tasks. You can use PyCantonese for the following tasks:

* Accessing and searching corpus data
* Parsing and conversion tools for Jyutping romanization
* Parsing Cantonese text
* Removing stop words
* Word segmentation
* Part-of-speech tagging

In this tutorial, we will focus on parsing and conversion of Cantonese characters for Jyutping romanization.

## What is Jyutping?

Jyutping is a romanization system for Cantonese developed by the Linguistic Society of Hong Kong to represent the pronunciation and tone of Chinese characters. Each Jyutping syllable typically consists of an initial (generally a consonant), a final (generally a series of vowels and/or consonants), and a tone (a number from 1 to 6). Tones are particularly important in Cantonese because a change in tone can change the meaning of a word entirely (e.g., sik1 means "to know" and sik6 means "to eat"). Automated tools often need context to identify the correct tone for a character, which is something that PyCantonese is able to handle.
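
To make the initial/final/tone structure concrete, here is a minimal sketch that splits a single Jyutping syllable using only Python's standard `re` module. The pattern and the `split_syllable` helper are hypothetical names written for this illustration; they are a simplified approximation and not PyCantonese's own parser.

```python
import re

# Simplified, illustrative pattern for one Jyutping syllable.
# This is an assumption-based sketch, not the parser PyCantonese uses.
SYLLABLE = re.compile(
    r"(ng|gw|kw|[bpmfdtnlghkwzcsj])?"  # initial (may be absent)
    r"(m|ng|[aeiouy][a-z]*)"           # final: syllabic nasal or vowel-led
    r"([1-6])"                         # tone number
)


def split_syllable(syllable):
    """Return (initial, final, tone) for one Jyutping syllable."""
    match = SYLLABLE.fullmatch(syllable)
    if match is None:
        raise ValueError(f"not a recognizable Jyutping syllable: {syllable!r}")
    initial, final, tone = match.groups()
    # A missing initial is reported as an empty string.
    return (initial or "", final, int(tone))


# The two syllables differ only in tone.
print(split_syllable("sik1"))  # ('s', 'ik', 1)
print(split_syllable("sik6"))  # ('s', 'ik', 6)
```

Both syllables share the same initial and final, so the tone number is the only thing that separates the two words.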

Jyutping is not the only romanization system for Cantonese. Once characters have been romanized to Jyutping in PyCantonese, the tool can then convert the Jyutping into Yale or TIPA. We will not cover that in this tutorial; see the following pages for more information:

* [Yale Conversion](https://pycantonese.org/jyutping.html#jyutping-to-yale-conversion)
* [TIPA Conversion](https://pycantonese.org/jyutping.html#jyutping-to-tipa-conversion)

## Download and Install

Let's start by installing the PyCantonese library with pip.

In [14]:
pip install --upgrade pycantonese

Note: you may need to restart the kernel to use updated packages.


Next, let's import it.

In [16]:
import pycantonese
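
If the import fails, it can help to check whether the package is visible to the current kernel. The following is a small sketch using the standard library's `importlib.metadata` (Python 3.8+); `installed_version` is a hypothetical helper name introduced here, not part of PyCantonese.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("pycantonese")
if found is None:
    print("pycantonese is not installed in this environment")
else:
    print("pycantonese version:", found)
```

If the package is reported as missing right after installation, restarting the kernel, as the pip note suggests, is the first thing to try.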

## Romanizing Chinese Characters to Jyutping

PyCantonese has a built-in function called `characters_to_jyutping()` that takes a Cantonese string written in Chinese characters and returns its Jyutping romanization. The function segments the string into words in the background and returns a list of tuples, each pairing a word with its Jyutping. Let's run an example from the PyCantonese website.

In [23]:
pycantonese.characters_to_jyutping('香港人講廣東話')  # "Hongkongers speak Cantonese"

[('香港人', 'hoeng1gong2jan4'), ('講', 'gong2'), ('廣東話', 'gwong2dung1waa2')]
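
Because each tuple pairs a whole word with one fused Jyutping string, a learner may want the romanization lined up character by character. The sketch below works on the output shown above, copied in as a literal so it runs without PyCantonese; `split_syllables` and `per_character` are hypothetical helpers for this illustration. It assumes that each syllable ends in a tone digit and that the romanization may be missing (for example `None`) for text the library cannot romanize.

```python
import re

# Output copied from the example above. In practice this list would come
# from pycantonese.characters_to_jyutping(...).
result = [
    ("香港人", "hoeng1gong2jan4"),
    ("講", "gong2"),
    ("廣東話", "gwong2dung1waa2"),
]


def split_syllables(jyutping):
    """Split a fused Jyutping string into syllables at each tone digit."""
    return re.findall(r"[a-z]+[1-6]", jyutping)


def per_character(pairs):
    """Pair each character with its syllable when the counts line up."""
    aligned = []
    for word, jyutping in pairs:
        syllables = split_syllables(jyutping) if jyutping else []
        if len(syllables) == len(word):
            aligned.extend(zip(word, syllables))
        else:
            # Missing romanization or a count mismatch: keep the word whole.
            aligned.append((word, jyutping))
    return aligned


for character, syllable in per_character(result):
    print(character, syllable)
```

Keeping unmatched words whole, rather than guessing an alignment, avoids attaching a syllable to the wrong character.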