Merged
94 changes: 63 additions & 31 deletions notebooks/pythainlp-get-started.ipynb
@@ -4,14 +4,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# PyThaiNLP Get Started"
"# PyThaiNLP Get Started\n",
"\n",
"Code examples for basic functions in PyThaiNLP https://github.com/PyThaiNLP/pythainlp"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Collation"
"## Collation\n",
"\n",
"Sorting according to Thai dictionary."
]
},
{
@@ -40,7 +44,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Date and Time Format"
"## Date and Time Format\n",
"\n",
"Get Thai day and month names with Buddhist Era."
]
},
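A minimal sketch of the Buddhist Era conversion mentioned in this cell, assuming the modern Thai convention in which the BE year is the Gregorian year plus 543 for the whole calendar year. It uses only the standard library; `to_buddhist_era` and `format_be` are hypothetical helper names for illustration, not PyThaiNLP functions.

```python
from datetime import date

# Assumption: modern Thai convention, BE year = Gregorian (CE) year + 543.
BE_OFFSET = 543


def to_buddhist_era(d: date) -> int:
    """Return the Buddhist Era year for a Gregorian date."""
    return d.year + BE_OFFSET


def format_be(d: date) -> str:
    """Format a date as day/month/year with the year in Buddhist Era."""
    return f"{d.day:02d}/{d.month:02d}/{to_buddhist_era(d)}"


print(format_be(date(2019, 7, 1)))
```

The notebook's own cells rely on PyThaiNLP for Thai day and month names; this sketch covers only the year arithmetic.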
{
@@ -80,7 +86,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Thai Character Cluster (TCC) and Extended TCC"
"### Thai Character Cluster (TCC) and Extended TCC\n",
"\n",
"According to [Character Cluster Based Thai Information Retrieval](https://www.researchgate.net/publication/2853284_Character_Cluster_Based_Thai_Information_Retrieval) (Theeramunkong et al. 2004)."
]
},
{
@@ -167,7 +175,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sentence and Word"
"### Sentence and Word\n",
"\n",
"Default word tokenizer (\"newmm\") use maximum matching algorithm."
]
},
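To illustrate the idea behind dictionary-based maximum matching, here is a simplified greedy forward longest-match sketch in plain Python. It is not PyThaiNLP's "newmm" implementation, which may apply further constraints; `longest_match_tokenize` and `toy_words` are hypothetical names with a toy dictionary chosen for illustration.

```python
def longest_match_tokenize(text, words):
    """Greedy forward longest-match segmentation over a set of known words."""
    max_len = max((len(w) for w in words), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in words:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy dictionary (assumption, for illustration only).
toy_words = {"ไป", "โรงเรียน", "โรง", "เรียน"}
print(longest_match_tokenize("ไปโรงเรียน", toy_words))
```

Because the longest candidate is tried first, "โรงเรียน" is kept whole rather than split into "โรง" and "เรียน", even though both shorter words are in the dictionary.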
{
@@ -195,6 +205,13 @@
"print(\"word_tokenize, without whitespace:\", word_tokenize(text, whitespaces=False))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Other algorithm can be chosen. We can also create a tokenizer with custom dictionary."
]
},
{
"cell_type": "code",
"execution_count": 8,
@@ -223,6 +240,14 @@
"print(\"custom:\", custom_tokenizer.word_tokenize(text))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Default word tokenizer use a word list from pythainlp.corpus.common.thai_words().\n",
"We can get that list, add/remove words, and create new tokenizer from the modified list."
]
},
{
"cell_type": "code",
"execution_count": 9,
@@ -332,7 +357,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Soundex"
"## Soundex\n",
"\n",
"\"Soundex is a phonetic algorithm for indexing names by sound.\" ([Wikipedia](https://en.wikipedia.org/wiki/Soundex)). PyThaiNLP provides three kinds of Thai soundex."
]
},
{
@@ -344,28 +371,19 @@
"name": "stdout",
"output_type": "stream",
"text": [
"บูรณะ - lk82: บE400 - udom83: บ930000 - metasound: บ550\n",
"บูรณการ - lk82: บE419 - udom83: บ931900 - metasound: บ551\n",
"มัก - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n",
"มัค - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n",
"มรรค - lk82: ม1000 - udom83: ม310000 - metasound: ม551\n",
"ลัก - lk82: ร1000 - udom83: ร100000 - metasound: ล100\n",
"รัก - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n",
"รักษ์ - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n",
" - lk82: - udom83: - metasound: \n"
"True\n",
"True\n",
"True\n"
]
}
],
"source": [
"from pythainlp.soundex import lk82, metasound, udom83\n",
"\n",
"texts = [\"บูรณะ\", \"บูรณการ\", \"มัก\", \"มัค\", \"มรรค\", \"ลัก\", \"รัก\", \"รักษ์\", \"\"]\n",
"for text in texts:\n",
" print(\n",
" \"{} - lk82: {} - udom83: {} - metasound: {}\".format(\n",
" text, lk82(text), udom83(text), metasound(text)\n",
" )\n",
" )"
"# check equivalence\n",
"print(lk82(\"รถ\") == lk82(\"รด\"))\n",
"print(udom83(\"วรร\") == udom83(\"วัน\"))\n",
"print(metasound(\"นพ\") == metasound(\"นภ\"))"
]
},
{
@@ -377,17 +395,26 @@
"name": "stdout",
"output_type": "stream",
"text": [
"True\n",
"True\n",
"True\n"
"บูรณะ - lk82: บE400 - udom83: บ930000 - metasound: บ550\n",
"บูรณการ - lk82: บE419 - udom83: บ931900 - metasound: บ551\n",
"มัก - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n",
"มัค - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n",
"มรรค - lk82: ม1000 - udom83: ม310000 - metasound: ม551\n",
"ลัก - lk82: ร1000 - udom83: ร100000 - metasound: ล100\n",
"รัก - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n",
"รักษ์ - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n",
" - lk82: - udom83: - metasound: \n"
]
}
],
"source": [
"# check equivalence\n",
"print(lk82(\"รถ\") == lk82(\"รด\"))\n",
"print(udom83(\"วรร\") == udom83(\"วัน\"))\n",
"print(metasound(\"นพ\") == metasound(\"นภ\"))"
"texts = [\"บูรณะ\", \"บูรณการ\", \"มัก\", \"มัค\", \"มรรค\", \"ลัก\", \"รัก\", \"รักษ์\", \"\"]\n",
"for text in texts:\n",
" print(\n",
" \"{} - lk82: {} - udom83: {} - metasound: {}\".format(\n",
" text, lk82(text), udom83(text), metasound(text)\n",
" )\n",
" )"
]
},
{
@@ -396,7 +423,7 @@
"source": [
"## Spellchecking\n",
"\n",
"Default spellchecker use Peter Norvig's algorithm together with word frequency from Thai National Corpus (TNC)"
"Default spellchecker uses [Peter Norvig's algorithm](http://www.norvig.com/spell-correct.html) together with word frequency from Thai National Corpus (TNC)"
]
},
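For readers unfamiliar with the approach this cell links to, here is a rough sketch of a Norvig-style corrector, simplified to candidates one edit away and backed by a toy frequency table. It is not PyThaiNLP's spellchecker; `edits1`, `correct`, and `toy_freq` are hypothetical names, and the frequencies are made up for illustration.

```python
def edits1(word, alphabet):
    """Return all strings one delete, transpose, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [
        left + c + right[1:] for left, right in splits if right for c in alphabet
    ]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def correct(word, freq):
    """Pick the most frequent known word among the word and its one-edit neighbours."""
    if word in freq:
        return word
    # Build the alphabet from the characters seen in the known words.
    alphabet = set("".join(freq))
    candidates = [w for w in edits1(word, alphabet) if w in freq]
    return max(candidates, key=freq.get) if candidates else word


# Toy frequency table (assumption, for illustration only).
toy_freq = {"spelling": 10, "spell": 25, "spilling": 2}
print(correct("speling", toy_freq))
```

Deriving the alphabet from the word list keeps the sketch script-agnostic, so the same code works with a Thai frequency table.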
{
@@ -603,7 +630,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Named-Entity Tagging"
"## Named-Entity Tagging\n",
"\n",
"The tagger use BIO scheme:\n",
"- B - beginning of entity\n",
"- I - inside entity\n",
"- O - outside entity"
]
},
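The BIO scheme described in this cell can be decoded into entity spans with a small helper. The following is a minimal sketch in plain Python, assuming tags of the form `B-TYPE`, `I-TYPE`, and `O`; `bio_to_spans` and the `PERSON`/`LOCATION` labels are hypothetical and used only for illustration, not taken from the tagger's output.

```python
def bio_to_spans(tokens, tags):
    """Group tokens into (entity_type, text) spans according to BIO tags."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "B" or (prefix == "I" and etype != current_type):
            # Start a new entity; a stray I- tag is treated as a beginning.
            if current_tokens:
                spans.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = etype, [token]
        elif prefix == "I":
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O": close any open entity.
            if current_tokens:
                spans.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, "".join(current_tokens)))
    return spans


# Hypothetical tokens and labels, for illustration only.
tokens = ["John", " ", "Smith", " ", "visited", " ", "Bangkok"]
tags = ["B-PERSON", "I-PERSON", "I-PERSON", "O", "O", "O", "B-LOCATION"]
print(bio_to_spans(tokens, tags))
```

Tokens are joined without a separator, which suits Thai text where words are not space-delimited; whitespace tokens inside an entity are preserved as-is.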
{