Merged
170 commits
8dbd79a
Merge pull request #220 from PyThaiNLP/dev
bact May 9, 2019
f2461c7
specify that package is not in the same directory
Jun 6, 2019
f2778c2
todo_tokenize_2: provide example for word_tokenize
Jun 6, 2019
ec6e550
todo_tokenize_2: formatting example
Jun 6, 2019
ddd06f5
Revert "todo_tokenize_2: formatting example"
Jun 6, 2019
36bad18
todo_tokenize_2.2: provide example for word_tokenize
Jun 6, 2019
b188cc4
todo_tokenize_3: provide example for syllabus_tokenize
Jun 6, 2019
e438b5c
todo_tokenize_2.3: add return type
Jun 6, 2019
0248257
todo_tokenize_1: provide example for sent_tokenize
Jun 6, 2019
9521661
todo_tokenize_3: formatting
Jun 6, 2019
4246db2
todo_tokenize_4: provide example for subword_tokenize
Jun 6, 2019
dcae38d
todo_tokenize_3: formatting
Jun 6, 2019
db7d485
todo_tokenize_11: fix docstring format in tcc.py
Jun 6, 2019
0b35f44
todo_tokenize_5: briefly explain the algorithm of maximum matching (n…
Jun 7, 2019
99c0fcb
todo_tokenize_9: briefly explain the algorithm of multicut and cite …
Jun 7, 2019
3c8d3a9
todo_tokenize_tcc: format module docstring
Jun 7, 2019
1907404
todo_tokenize_tcc: add return type
Jun 7, 2019
2ddf640
todo_tokenize_init: format docstring for refering a library
Jun 7, 2019
f08af28
todo_tokenize_init: format docstring for referring a library
Jun 7, 2019
705a3bb
todo_tokenize_init: remove typos
Jun 7, 2019
d2c8b9a
todo_tokenize_Tokenizer: add docstring and example
Jun 7, 2019
e7073dc
todo_tokenize: shows all module in the documentation page
Jun 7, 2019
099420b
todo_tokenize: briefly explain the algorithm of deepcut, longest, pyi…
Jun 7, 2019
190fb1b
todo_tokenize_7: briefly explain the algorithm of etcc and cite the …
Jun 7, 2019
418ca12
todo_tokenize_pyicu: formatting and explain briefly
Jun 7, 2019
43da122
todo_tokenize_init: format example docstring
Jun 7, 2019
93104a7
todo_tokenize_init: format example docstring
Jun 7, 2019
5516149
todo_tag_pos_tag: move default engine to the first in the list
Jun 7, 2019
707d77d
todo_tag: provide list of NER, POS tag
Jun 7, 2019
9771458
todo_tag: fix typo
Jun 7, 2019
0215f33
todo_tag: fix typo
Jun 7, 2019
5ee9ba6
todo_tag: format rst
Jun 7, 2019
a9ecb04
group tokenize engine into one subsection
Jun 7, 2019
f49b261
todo_tag_2: provide examples for pos_tag
Jun 7, 2019
614c6c7
todo_tokenize: format docstring
Jun 7, 2019
ecf1935
todo_tag_3: provide examples for pos_tag_sents
Jun 7, 2019
c4b284e
todo_tag_2: fix typo for `pos_tag`
Jun 7, 2019
9d87c2e
todo_tag_4: provide examples for pos_tag_provinces
Jun 7, 2019
c187932
odo_tag_5: formatting docstring examples for named_entity.ThaiNameTagger
Jun 7, 2019
28f16ae
fix typo
Jun 7, 2019
20f734d
todo_tag_9: briefly explain orchid and cite the reference
Jun 7, 2019
07804fb
todo_word_vector: format names of package.
Jun 9, 2019
3e8f9d2
todo_word_util_12: provide arabic_digit_to_thai_digit
Jun 9, 2019
1b9e54a
todo_word_util_10: provide example for bahttext
Jun 9, 2019
b94e9bd
todo_word_util_14: provide example for num_to_thaiword
Jun 9, 2019
e29289c
todo_word_util_14: add return type for `num_to_thaiword`
Jun 9, 2019
5b777aa
todo_word_util_15: provide example for `deletetone`
Jun 10, 2019
5fa220a
todo_word_util_16: provide example for eng_to_thai
Jun 10, 2019
8452cfc
todo_word_util_17: provide example for thai_to_eng
Jun 10, 2019
b862aa1
todo_word_util_1: provide example for thaicheck
Jun 10, 2019
81493a3
todo_word_util_1: fix typo
Jun 10, 2019
721ce28
todo_word_util_2: provide example for thaiword_to_num
Jun 10, 2019
3e13370
todo_word_util_17: provide example for `thai_digit_to_arabic_digit`
Jun 10, 2019
a4e9462
todo_word_util_4: provide example for rank
Jun 10, 2019
34643f1
todo_word_util_7: provide example for normalize
Jun 10, 2019
79b7e49
todo_word_util_8: provide example for countthai
Jun 10, 2019
972cf3c
todo_word_util_6: provide example for now_reign_year
Jun 10, 2019
fd6c1c8
todo_word_util_18: provide example for reign_year_to_ad
Jun 10, 2019
a9ba278
todo_word_util_9: provide example for find_keyword
Jun 10, 2019
ca8d29a
todo_word_util_9: format example code
Jun 10, 2019
3ffc903
todo_word_util_8: fix mispelling
Jun 10, 2019
63f0847
todo_word_util_11: provide example for collate and briefly explain
Jun 10, 2019
30edb7e
todo_word_util_12: rewrite docstrting for isthaichar
Jun 10, 2019
eb5cb5c
todo_word_util_19: rewrite docstrting for isthai
Jun 10, 2019
35091e0
todo_word_util_20: provide example for text_to_arabic_digit
Jun 10, 2019
00227cc
todo_word_util_21: provide example for text_to_thai_digit
Jun 10, 2019
15bf366
todo_word_util_23: format docstring for thai_strftime
Jun 10, 2019
44f7fac
todo_word_util_22: provide example for thai_strftime
Jun 10, 2019
d2182a9
todo_ulmfit_1: provide example for document_vector
Jun 10, 2019
d56897c
todo_ulmfit_3: explain and show example for pythainlp.ulmfit,ThaiToke…
Jun 10, 2019
d4a16b2
fix typos
Jun 10, 2019
d49ec93
todo_tag_10: briefly explain orchid_ud and cite the reference
Jun 10, 2019
455006c
format .rst file
Jun 10, 2019
5747b9d
todo_tag_6/7/8 briefly explain unigram, perceptron, and artagger
Jun 10, 2019
785e346
fix typo
Jun 10, 2019
f8fdc32
todo_soundex_7: briefly explain metasound
Jun 13, 2019
f155458
todo_soundex_1: provide more examples for metasound
Jun 13, 2019
5b5cc0b
todo_soundex_6: briefly explain udom82
Jun 13, 2019
e3e42fa
todo_soundex_2: provide examples for udom83
Jun 13, 2019
c3db926
todo_soundex_3: provide examples for lk82
Jun 13, 2019
030ff1a
todo_soundex_5: briefly explain lk82
Jun 13, 2019
c0f4c52
todo_soundex_5: briefly explain lk82
Jun 13, 2019
3cbc06a
todo_soundex_6: briefly explain udom82
Jun 13, 2019
f150acc
todo_soundex_4: provide examples for soundex
Jun 13, 2019
38d1066
todo_spell_3: provide examples for correct
Jun 13, 2019
b17fa95
todo_spell_4: provide examples for spell
Jun 13, 2019
41ed2ce
todo_spell_12: explain spell
Jun 13, 2019
271fe7a
todo_spell_12: format docstring
Jun 13, 2019
eba7db7
todo_spell_4: provide examples for spell
Jun 13, 2019
6a03e1f
todo_spell_4: provide examples for spell
Jun 13, 2019
bfa3894
todo_spell_6: provide examples for NorvigSpellChecker.dictionary
Jun 13, 2019
9db80de
todo_spell_7: provide examples for NorvigSpellChecker.freq
Jun 13, 2019
74569dc
todo_spell_10: provide examples for NorvigSpellChecker.spell
Jun 13, 2019
ca7cbdd
todo_spell_9: provide examples for `NorvigSpellChecker.prob`
Jun 13, 2019
eed855f
todo_spell_8: provide examples for NorvigSpellChecker.known
Jun 13, 2019
ce8af0e
todo_spell_11: briefly explain constant variable `DEFAULT_SPELL_CHEC…
Jun 13, 2019
835f29d
todo_spell_2: cite the reference of Peter Norvig’s algorithm
Jun 13, 2019
3df94b5
todo_spell_1: briefly explain Peter Norvig’s algorithm
Jun 13, 2019
9f09737
todo_spell_2: cite the reference of Peter Norvig’s algorithm
Jun 13, 2019
8203c01
add newline at the end
Jun 13, 2019
a87c892
todo_transliterate_5: add reference
Jun 13, 2019
a24ab39
todo_transliterate_1: format docstring for romanize
Jun 13, 2019
e28ca75
todo_transliterate_3: provide examples for romanize
Jun 13, 2019
35c12ed
todo_transliterate_2: format docstring for transliterate
Jun 13, 2019
5efea33
todo_transliterate_4: provide examples for transliterate
Jun 13, 2019
a7993cf
todo_corpus_5: provide link to countries
Jun 13, 2019
6e22a0e
todo_corpus_6: provide link to provinces
Jun 13, 2019
0fa6d26
todo_corpus_3: provide link to thai_syllables
Jun 13, 2019
c144d2c
todo_corpus_2: provide link to thai_words
Jun 13, 2019
8ac8db0
todo_corpus_1: provide link to thai_stopwords
Jun 13, 2019
a0d13a9
todo_corpus_4: provide link to thai_negations
Jun 13, 2019
ff05106
todo_corpus_9: provide example for corpus.get_corpus
Jun 13, 2019
80208f4
todo_corpus_10: provide example for `corpus.get_corpus_path`
Jun 13, 2019
f30da20
- [ ] todo_corpus_7: provide example for `corpus.download`
Jun 13, 2019
c265ace
todo_corpus_8: provide example for `corpus.remove`
Jun 13, 2019
6f3efe2
todo_corpus_25: briefly explain conceptnet.edges
Jun 13, 2019
80eafdb
todo_corpus_11: provide examples for conceptnet.edges
Jun 13, 2019
81bb1a0
add definition for corpus page
Jun 13, 2019
ba34062
todo_tools_1: provide example for tools.get_full_data_path
Jun 13, 2019
648796a
todo_corpus_13: provide examples for `wordnet.synsets`
Jun 14, 2019
fa1aed3
todo_corpus_12: provide examples for wordnet.synset
Jun 14, 2019
3b59172
todo_corpus_14: provide examples for wordnet.all_lemma_names
Jun 14, 2019
0056667
todo_corpus_15: provide examples for wordnet.all_synsets
Jun 14, 2019
89279ca
todo_corpus_16: provide examples for `wordnet.langs`
Jun 14, 2019
a32158f
todo_corpus_17-to-25:
Jun 14, 2019
3763a43
todo_tools_2: provide example for tools.get_pythainlp_data_path
Jun 14, 2019
1d7254f
todo_tools_1,3:
Jun 14, 2019
9927d5a
todo_summarize_1: provide examples for `summarize.summarize`
Jun 14, 2019
4580bdc
todo_summarize_2: briefly explain functionality of summarize.summarize
Jun 14, 2019
3d9b919
corrects the mapping of ORCHID pos tag and UD tags in .rst document o…
Jun 14, 2019
b724ca6
todo_tag_12: briefly explain each tagger engines at the another section.
Jun 14, 2019
e313116
fix pep8 issue
Jun 16, 2019
e965d3e
fix pep8 issue
Jun 16, 2019
52dce83
fix pep8 issue
Jun 16, 2019
a1db4fe
fix warning for sphinx-build
Jun 16, 2019
94ae062
remove trailing whitespace
Jun 16, 2019
0b8aac5
fix pep8 issues
Jun 16, 2019
e9018dd
fix pep8 issue
Jun 16, 2019
5fea9fd
fix pep8 issue
Jun 16, 2019
b530b47
fix pep8 issue
Jun 16, 2019
ba96c4b
fix pep8 issue
Jun 16, 2019
c15aa64
fix pep8 issue
Jun 16, 2019
2de7c1e
fix pep8 issue
Jun 16, 2019
d5390d7
fix pep8 issue
Jun 16, 2019
884eaed
fix pep8 issue
Jun 16, 2019
e7526c4
fix pep8 issues
Jun 16, 2019
987cc28
fix pep8 issues
Jun 16, 2019
05cb093
fix pep8 issues
Jun 16, 2019
5d8c538
fix pep8 issues
Jun 16, 2019
7ed601c
fix pep8 issues
Jun 16, 2019
5abc8a7
format code
Jun 16, 2019
70c37a2
format docstring
Jun 16, 2019
ad2e8ab
add special directive for warning notes
Jun 16, 2019
af06038
format docstrings
Jun 16, 2019
22b55bc
format docstrings
Jun 16, 2019
f1ece8c
fix pep8 issues
Jun 16, 2019
ae7dc22
fix pep8 issues
Jun 16, 2019
86f1d3e
fix pep8 issues, invalid escape sequence
Jun 16, 2019
f98caee
format docstring
Jun 16, 2019
e2321ec
fix typo
Jun 16, 2019
bdc535d
format docstring
Jun 16, 2019
bdbbf2e
add rerferences sectttion for word_vector package
Jun 16, 2019
5753e17
format docstring
Jun 16, 2019
59618f8
format docstring
Jun 16, 2019
65c7f71
fix typo
Jun 16, 2019
256927b
Edit the term to "Tokenzation Engines"
Jun 21, 2019
82b6bea
format docstring
Jun 21, 2019
9f1c2fd
format .rst files
Jun 21, 2019
e45d556
fix sphinx warning by adding a blank line
Jun 21, 2019
8cecb1a
add function description to `pythainlp.tokenize.word_tokenize`
Jun 21, 2019
6 changes: 6 additions & 0 deletions docs/api/corpus.rst
Original file line number Diff line number Diff line change
Expand Up @@ -46,3 +46,9 @@ Wordnet
.. autofunction:: pythainlp.corpus.wordnet.wup_similarity
.. autofunction:: pythainlp.corpus.wordnet.morphy
.. autofunction:: pythainlp.corpus.wordnet.custom_lemmas

Definition
++++++++++

Synset
a set of synonyms that share a common meaning.
15 changes: 14 additions & 1 deletion docs/api/soundex.rst
Expand Up @@ -2,7 +2,7 @@

pythainlp.soundex
====================================
The :class:`pythainlp.soundex` is soundex for thai.
The :class:`pythainlp.soundex` is soundex for Thai.

Modules
-------
Expand All @@ -11,3 +11,16 @@ Modules
.. autofunction:: lk82
.. autofunction:: udom83
.. autofunction:: metasound

References
----------

.. [metasound] Snae & Brückner. (2009). Novel Phonetic Name Matching Algorithm with a Statistical
Ontology for Analysing Names Given in Accordance with Thai Astrology.
https://pdfs.semanticscholar.org/3983/963e87ddc6dfdbb291099aa3927a0e3e4ea6.pdf

.. [udom83] Wannee Udompanich (1983). Search Thai sound-alike string using homonymic approach.
Master Thesis. Chulalongkorn University, Thailand.

.. [lk82] วิชิต หล่อจีระชุณห์กุล และ เจริญ คุวินทร์พันธุ์. โปรแกรมการสืบค้นคำไทยตามเสียงอ่าน (Thai Soundex).
http://guru.sanook.com/1520/
9 changes: 8 additions & 1 deletion docs/api/spell.rst
Expand Up @@ -10,6 +10,13 @@ Modules
.. autofunction:: correct
.. autofunction:: spell
.. autoclass:: NorvigSpellChecker
:special-members:
:members:
.. autodata:: DEFAULT_SPELL_CHECKER
:annotation: = Default instance of standard NorvigSpellChecker, using word list from Thai National Corpus
:annotation: = Default instance of standard NorvigSpellChecker, using word list from Thai National Corpus: http://www.arts.chula.ac.th/ling/tnc/

References
----------

.. [norvig_spellchecker] Peter Norvig. "How to Write a Spelling Corrector".
Available at: http://norvig.com/spell-correct.html
208 changes: 207 additions & 1 deletion docs/api/tag.rst
Expand Up @@ -2,7 +2,188 @@

pythainlp.tag
=====================================
The :class:`pythainlp.tag` contains functions that are used to tag different parts of a text.
The :class:`pythainlp.tag` contains functions for tagging the words of a text with
Part-of-Speech (POS) tags and Named Entity Recognition (NER) tags.

For the POS tags, there are two sets of tags: `Universal Dependencies (UD) <https://universaldependencies.org/>`_ POS tags and ORCHID [Sornlertlamvanich_2000]_ POS tags.

The following table shows the list of Part-of-Speech (POS) tags according to Universal Dependencies (UD) POS tags:

============ ========================== =============================
Abbreviation Part-of-Speech tag         Examples
============ ========================== =============================
ADJ          Adjective                  ใหม่, พิเศษ, ก่อน, มาก, สูง
ADP          Adposition                 แม้, ว่า, เมื่อ, ของ, สำหรับ
ADV          Adverb                     ก่อน, ก็, เล็กน้อย, เลย, สุด
AUX          Auxiliary                  เป็น, ใช่, คือ, คล้าย
CCONJ        Coordinating conjunction   แต่, และ, หรือ
DET          Determiner                 ที่, นี้, ซึ่ง, ทั้ง, ทุก, หลาย
INTJ         Interjection               อุ้ย, โอ้ย
NOUN         Noun                       กำมือ, พวก, สนาม, กีฬา, บัญชี
NUM          Numeral                    5,000, 103.7, 2004, หนึ่ง, ร้อย
PART         Particle                   มา, ขึ้น, ไม่, ได้, เข้า
PRON         Pronoun                    เรา, เขา, ตัวเอง, ใคร, เธอ
PROPN        Proper noun                โอบามา, แคปิตอลฮิล, จีโอพี, ไมเคิล
PUNCT        Punctuation                (, ), ", ', :
SCONJ        Subordinating conjunction  หาก
VERB         Verb                       เปิด, ให้, ใช้, เผชิญ, อ่าน
============ ========================== =============================

The following table shows the list of Part-of-Speech (POS) tags according to ORCHID POS tags from the paper:

============ ================================================= =================================
Abbreviation Part-of-Speech tag Examples
============ ================================================= =================================
NPRP Proper noun วินโดวส์ 95, โคโรน่า, โค้ก
NCNM Cardinal number หนึ่ง, สอง, สาม, 1, 2, 10
NONM Ordinal number ที่หนึ่ง, ที่สอง, ที่สาม, ที่1, ที่2
NLBL Label noun 1, 2, 3, 4, ก, ข, a, b
NCMN Common noun หนังสือ, อาหาร, อาคาร, คน
NTTL Title noun ครู, พลเอก
PPRS Personal pronoun คุณ, เขา, ฉัน
PDMN Demonstrative pronoun นี่, นั้น, ที่นั่น, ที่นี่
PNTR Interrogative pronoun ใคร, อะไร, อย่างไร
PREL Relative pronoun ที่, ซึ่ง, อัน, ผู้
VACT Active verb ทำงาน, ร้องเพลง, กิน
VSTA Stative verb เห็น, รู้, คือ
VATT Attributive verb อ้วน, ดี, สวย
XVBM Pre-verb auxiliary, before negator "ไม่" เกิด, เกือบ, กำลัง
XVAM Pre-verb auxiliary, after negator "ไม่" ค่อย, น่า, ได้
XVMM Pre-verb, before or after negator "ไม่" ควร, เคย, ต้อง
XVBB Pre-verb auxiliary, in imperative mood กรุณา, จง, เชิญ, อย่า, ห้าม
XVAE Post-verb auxiliary ไป, มา, ขึ้น
DDAN | Definite determiner, after noun without นี่, นั่น, โน่น, ทั้งหมด
| classifier in between
DDAC | Definite determiner, allowing classifier นี้, นั้น, โน้น, นู้น
| in between
DDBQ | Definite determiner, between noun and ทั้ง, อีก, เพียง
| classifier or preceding quantitative expression
DDAQ | Definite determiner, พอดี, ถ้วน
| following quantitative expression
DIAC | Indefinite determiner, following noun; allowing ไหน, อื่น, ต่างๆ
| classifier in between
DIBQ | Indefinite determiner, between noun and บาง, ประมาณ, เกือบ
| classifier or preceding quantitative expression
DIAQ | Indefinite determiner, กว่า, เศษ
| following quantitative expression
DCNM Determiner, cardinal number expression **หนึ่ง**\ คน, เสือ, **2** ตัว
DONM Determiner, ordinal number expression ที่หนึ่ง, ที่สอง, ที่สุดท้าย
ADVN Adverb with normal form เก่ง, เร็ว, ช้า, สม่ำเสมอ
ADVI Adverb with iterative form เร็วๆ, เสมอๆ, ช้าๆ
ADVP Adverb with prefixed form โดยเร็ว
ADVS Sentential adverb โดยปกติ, ธรรมดา
CNIT Unit classifier ตัว, คน, เล่ม
CLTV Collective classifier | คู่, กลุ่ม, ฝูง, เชิง, ทาง,
| ด้าน, แบบ, รุ่น
CMTR Measurement classifier กิโลกรัม, แก้ว, ชั่วโมง
CFQC Frequency classifier ครั้ง, เที่ยว
CVBL Verbal classifier ม้วน, มัด
JCRG Coordinating conjunction และ, หรือ, แต่
JCMP Comparative conjunction กว่า, เหมือนกับ, เท่ากับ
JSBR Subordinating conjunction เพราะว่า, เนื่องจาก ที่, แม้ว่า, ถ้า
RPRE Preposition จาก, ละ, ของ, ใต้, บน
INT Interjection โอ้บ, โอ้, เออ, เอ๋, อ๋อ
FIXN Nominal prefix **การ**\ ทำงาน, **ความ**\ สนุกสนาน
FIXV Adverbial prefix **อย่าง**\ เร็ว
EAFF Ending for affirmative sentence จ๊ะ, จ้ะ, ค่ะ, ครับ, นะ, น่า, เถอะ
EITT Ending for interrogative sentence หรือ, เหรอ, ไหม, มั้ย
NEG Negator ไม่, มิได้, ไม่ได้, มิ
PUNC Punctuation (, ), “, ,, ;
============ ================================================= =================================

The ORCHID corpus uses its own set of POS tags. We therefore also provide a version of the ORCHID corpus annotated with UD POS tags.

The following table shows the mapping of Part-of-Speech (POS) tags from ORCHID POS tags to UD POS tags:

=============== ========================
ORCHID POS tags Corresponding UD POS tag
=============== ========================
NOUN            NOUN
NCMN            NOUN
NTTL            NOUN
CNIT            NOUN
CLTV            NOUN
CMTR            NOUN
CFQC            NOUN
CVBL            NOUN
VACT            VERB
VSTA            VERB
PROPN           PROPN
NPRP            PROPN
ADJ             ADJ
NONM            ADJ
VATT            ADJ
DONM            ADJ
ADV             ADV
ADVN            ADV
ADVI            ADV
ADVP            ADV
ADVS            ADV
INT             INTJ
PRON            PRON
PPRS            PRON
PDMN            PRON
PNTR            PRON
DET             DET
DDAN            DET
DDAC            DET
DDBQ            DET
DDAQ            DET
DIAC            DET
DIBQ            DET
DIAQ            DET
NUM             NUM
NCNM            NUM
NLBL            NUM
DCNM            NUM
AUX             AUX
XVBM            AUX
XVAM            AUX
XVMM            AUX
XVBB            AUX
XVAE            AUX
ADP             ADP
RPRE            ADP
CCONJ           CCONJ
JCRG            CCONJ
SCONJ           SCONJ
PREL            SCONJ
JSBR            SCONJ
JCMP            SCONJ
PART            PART
FIXN            PART
FIXV            PART
EAFF            PART
EITT            PART
AITT            PART
NEG             PART
PUNCT           PUNCT
PUNC            PUNCT
=============== ========================
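
The mapping table above can be applied mechanically to tagger output. The following is a minimal sketch, not PyThaiNLP's own implementation: ``ORCHID_TO_UD`` copies only a handful of rows from the table, and the helper name ``orchid_to_ud`` and the fallback tag ``"X"`` for unmapped tags are assumptions made for illustration.

```python
# Partial ORCHID -> UD mapping, copied from a few rows of the table above.
# The full table has many more entries; this subset is for illustration only.
ORCHID_TO_UD = {
    "NCMN": "NOUN",
    "NPRP": "PROPN",
    "VACT": "VERB",
    "VSTA": "VERB",
    "VATT": "ADJ",
    "ADVN": "ADV",
    "PPRS": "PRON",
    "RPRE": "ADP",
    "JCRG": "CCONJ",
    "JSBR": "SCONJ",
    "NEG": "PART",
    "PUNC": "PUNCT",
}


def orchid_to_ud(tagged_words, unknown="X"):
    """Convert (word, ORCHID tag) pairs into (word, UD tag) pairs.

    Tags missing from the mapping fall back to ``unknown``
    (a hypothetical choice, not taken from the documentation).
    """
    return [(word, ORCHID_TO_UD.get(tag, unknown)) for word, tag in tagged_words]


tagged = [("ฉัน", "PPRS"), ("กิน", "VACT"), ("อาหาร", "NCMN")]
print(orchid_to_ud(tagged))
```

A plain dictionary lookup is enough here because the mapping is many-to-one: every ORCHID tag corresponds to exactly one UD tag.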

For NER, we use the `Inside-outside-beginning (IOB) <https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)>`_ format to tag each word.
For instance, given the sentence "บารัค โอบามาเป็นประธานาธิบดี", the tokens "บารัค", " ", "โอบามา", "เป็น", and "ประธานาธิบดี" would be tagged as "B-PERSON", "I-PERSON", "I-PERSON", "O", and "O" respectively.

The *B-* prefix marks the beginning token of a chunk, here the person name "บารัค โอบามา", and the *I-* prefix marks a token inside that chunk. The tag *O* indicates that a token does not belong to any NER chunk.
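
To make the IOB convention concrete, the sketch below groups IOB-tagged tokens back into named-entity chunks. It is an illustration in plain Python, not part of PyThaiNLP: the function name ``iob_to_chunks`` is hypothetical, and the example assumes that the whitespace between the two name tokens is itself a token tagged "I-PERSON".

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) chunks."""
    chunks = []
    current_type = None
    current_tokens = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new chunk begins; close any chunk that is still open.
            if current_tokens:
                chunks.append((current_type, "".join(current_tokens)))
            current_type = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # The token continues the chunk that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag that does not continue the open chunk)
            # closes the open chunk; the token itself is not kept.
            if current_tokens:
                chunks.append((current_type, "".join(current_tokens)))
            current_type = None
            current_tokens = []
    if current_tokens:
        chunks.append((current_type, "".join(current_tokens)))
    return chunks


tokens = ["บารัค", " ", "โอบามา", "เป็น", "ประธานาธิบดี"]
tags = ["B-PERSON", "I-PERSON", "I-PERSON", "O", "O"]
print(iob_to_chunks(tokens, tags))
```

Tokens are joined without a separator because Thai text is written without spaces between words; any whitespace that belongs to an entity is expected to arrive as its own token.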

The following table shows the list of Named Entity Recognition (NER) tags:

============================ =================================
Named Entity Recognition tag Examples
============================ =================================
DATE                         2/21/2004, 16 ก.พ., จันทร์
TIME                         16.30 น., 5 วัน, 1-3 ปี
EMAIL                        info@nrpsc.ac.th
LEN                          30 กิโลเมตร, 5 กม.
LOCATION                     ไทย, จ.ปราจีนบุรี, กำแพงเพชร
ORGANIZATION                 กรมวิทยาศาสตร์การแพทย์, อย.
PERSON                       น.พ.จรัล, นางประนอม ทองจันทร์
PHONE                        1200, 0 2670 8888
URL                          http://www.bangkokhealth.com/
ZIP                          10400, 11130
MONEY                        2.7 ล้านบาท, 2,000 บาท
LAW                          พ.ร.บ.โรคระบาด พ.ศ.2499, รัฐธรรมนูญ
============================ =================================

Modules
-------
Expand All @@ -12,3 +193,28 @@ Modules
.. autofunction:: tag_provinces
.. autoclass:: pythainlp.tag.named_entity.ThaiNameTagger
:members: get_ner

Tagger Engines
--------------

perceptron
++++++++++

The perceptron tagger performs part-of-speech tagging using the averaged, structured perceptron algorithm.

unigram
+++++++

The unigram tagger tags each word on its own; it does not take the ordering of words in the list into account.
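
The order-independence of a unigram tagger can be shown with a toy version. This is a sketch of the general technique only and makes no claim about how PyThaiNLP's unigram engine is implemented: the names ``train_unigram`` and ``tag_unigram`` and the two-sentence training set are invented for illustration.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}


def tag_unigram(words, model, default=None):
    """Tag each word independently of its neighbours; unseen words get ``default``."""
    return [(word, model.get(word, default)) for word in words]


# Toy training data (hypothetical, for illustration only).
train = [
    [("ฉัน", "PRON"), ("กิน", "VERB"), ("ข้าว", "NOUN")],
    [("เขา", "PRON"), ("กิน", "VERB"), ("ปลา", "NOUN")],
]
model = train_unigram(train)

# Shuffling the input words changes the order of the output, not the tags.
print(tag_unigram(["ฉัน", "กิน", "ปลา"], model))
print(tag_unigram(["ปลา", "กิน", "ฉัน"], model))
```

Because each lookup ignores context, a word that is ambiguous between two tags always receives the same one, which is the main limitation compared with a context-aware tagger such as the perceptron.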

artagger
++++++++

`artagger <https://github.com/franziz/artagger>`_ is an implementation of `RDRPOSTagger <https://github.com/datquocnguyen/RDRPOSTagger>`_ for POS tagging in the Thai language.

References
----------

.. [Sornlertlamvanich_2000] Takahashi, Naoto & Isahara, Hitoshi & Sornlertlamvanich, Virach. (2000).
Building a Thai part-of-speech tagged corpus (ORCHID).
Journal of the Acoustical Society of Japan (E). 20. 10.1250/ast.20.189.
35 changes: 30 additions & 5 deletions docs/api/tokenize.rst
Expand Up @@ -16,14 +16,39 @@ Modules
.. autoclass:: Tokenizer
:members:

NEWMM
-----
Tokenization Engines
--------------------

newmm
+++++
.. automodule:: pythainlp.tokenize.newmm
.. autofunction:: pythainlp.tokenize.newmm.segment

TCC
---
Thai Character Cluster

longest
+++++++
.. automodule:: pythainlp.tokenize.longest

multi_cut
+++++++++
.. automodule:: pythainlp.tokenize.multi_cut

pyicu
+++++
.. automodule:: pythainlp.tokenize.pyicu

deepcut
+++++++
.. automodule:: pythainlp.tokenize.deepcut

tcc
+++
.. automodule:: pythainlp.tokenize.tcc

.. autofunction:: pythainlp.tokenize.tcc.segment
.. autofunction:: pythainlp.tokenize.tcc.tcc
.. autofunction:: pythainlp.tokenize.tcc.tcc_pos

etcc
++++
.. automodule:: pythainlp.tokenize.etcc
9 changes: 8 additions & 1 deletion docs/api/transliterate.rst
Expand Up @@ -8,4 +8,11 @@ Modules
-------

.. autofunction:: romanize
.. autofunction:: transliterate
.. autofunction:: transliterate

References
----------

.. [rtgs_transcription] Nitaya Kanchanawan. (2006). Romanization, Transliteration, and Transcription for the Globalization of the Thai Language.
The Journal of the Royal Institute of Thailand.
Available at: http://www.royin.go.th/wp-content/uploads/royin-ebook/276/FileUpload/758_6484.pdf
8 changes: 7 additions & 1 deletion docs/api/word_vector.rst
Expand Up @@ -6,7 +6,7 @@ The :class:`word_vector` contains functions that makes use of a pre-trained vect

Dependencies
------------
Installation of `numpy` and `gensim` is required.
Installation of :mod:`numpy` and :mod:`gensim` is required.

Modules
-------
Expand All @@ -16,3 +16,9 @@ Modules
.. autofunction:: doesnt_match
.. autofunction:: similarity
.. autofunction:: sentence_vectorizer

References
----------

.. [OmerLevy_YoavGoldberg_2014] Omer Levy and Yoav Goldberg.
Linguistic Regularities in Sparse and Explicit Word Representations, 2014.
6 changes: 3 additions & 3 deletions docs/conf.py
Expand Up @@ -9,9 +9,9 @@
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
import os
import sys
sys.path.insert(0, os.path.abspath('..'))
from datetime import datetime

# -- Project information -----------------------------------------------------