# [Universal Dependencies Around the World and Dependency Parsing Tools](http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/publications/2021-06-22.pdf)
## [Building Your Own Dependency Parser with Thai UD](https://koichiyasuoka.github.io/deplacy/demo/2021-06-22/)
### Using [SuPar](https://github.com/yzhangcs/parser), [bert-base-thai](https://huggingface.co/monsoon-nlp/bert-base-thai), and [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp)


Install the required packages and split Thai-PUD into train, test, and dev conllu files

In [ ]:
!test -d UD_Thai-PUD || git clone --depth=1 https://github.com/universaldependencies/UD_Thai-PUD
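# Read the whole Thai-PUD treebank (it ships as a single test file)
with open("UD_Thai-PUD/th_pud-ud-test.conllu", "r", encoding="utf-8") as f:
  r=f.read()
# Deal sentences out round-robin: 8 of every 10 to train, 1 to test, 1 to dev
with open("train.conllu", "w", encoding="utf-8") as f1:
  with open("test.conllu", "w", encoding="utf-8") as f2:
    with open("dev.conllu", "w", encoding="utf-8") as f3:
      f=[f1]*8+[f2, f3]
      for i,s in enumerate(r.strip().split("\n\n")):
        # Each sentence block is followed by the blank line CoNLL-U requires
        print(s, "", sep="\n", file=f[i%len(f)])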
!pip install supar pythainlp deplacy
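
The cell above deals the sentences out in a repeating cycle of ten, so roughly 80% land in train.conllu and 10% each in test.conllu and dev.conllu. A minimal sketch of that round-robin assignment on an in-memory list follows; `split_round_robin` is a hypothetical helper name that is not part of the notebook, and the placeholder strings stand in for real CoNLL-U sentence blocks.

```python
def split_round_robin(sentences, pattern=("train",) * 8 + ("test", "dev")):
    """Assign each sentence to a split by cycling through `pattern`."""
    # dict.fromkeys keeps the split names unique and in first-seen order
    splits = {name: [] for name in dict.fromkeys(pattern)}
    for i, sentence in enumerate(sentences):
        splits[pattern[i % len(pattern)]].append(sentence)
    return splits

# Placeholder strings, not real treebank data
splits = split_round_robin([f"sent{i}" for i in range(20)])
print({name: len(items) for name, items in splits.items()})
```

With 20 placeholder sentences this yields 16 for train and 2 each for test and dev. Interleaving the splits this way, rather than cutting contiguous blocks, spreads each split across the whole file.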

Train my.supar (about 2 hours on a GPU)

In [ ]:
!biaffine-dep train -b -d 0 -c biaffine-dep-en -p my.supar -f bert --bert monsoon-nlp/bert-base-thai --embed= --train train.conllu --dev dev.conllu --test test.conllu

Dependency parsing with my.supar

In [ ]:
from supar import Parser
prs = Parser.load("my.supar")
nlp = lambda x: prs.predict([x], lang=None).sentences[0]
doc = nlp(["ไม่","เข้า","ถ้า","เสือ","ย่อม","ไม่","ได้","ลูก","เสือ"])
print(doc)
import deplacy
deplacy.serve(doc,port=None)
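
Assuming `print(doc)` emits standard CoNLL-U text (one token per line, ten tab-separated columns), column 7 (HEAD) holds the ID of the governing token and column 8 (DEPREL) the relation label. The sketch below shows one way to pull the arcs out of such text. It uses a hand-written two-token English sample, not real output from my.supar, and `read_arcs` is a hypothetical helper name.

```python
# Hand-written illustration, not parser output
SAMPLE = (
    "1\tI\t_\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsleep\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
)

def read_arcs(conllu):
    """Return (form, head_form, deprel) triples from one CoNLL-U sentence."""
    rows = [line.split("\t") for line in conllu.splitlines()
            if line and not line.startswith("#")]
    forms = {row[0]: row[1] for row in rows}  # token ID -> word form
    forms["0"] = "ROOT"  # HEAD 0 marks the root of the sentence
    return [(row[1], forms[row[6]], row[7]) for row in rows]

for arc in read_arcs(SAMPLE):
    print(arc)
```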

Word segmentation with PyThaiNLP

In [ ]:
from supar import Parser
from pythainlp.tokenize import word_tokenize
prs = Parser.load("my.supar")
nlp = lambda x: prs.predict([word_tokenize(x)], lang=None).sentences[0]
doc = nlp("ไม่เข้าถ้าเสือ ย่อมไม่ได้ลูกเสือ")
print(doc)
import deplacy
deplacy.serve(doc,port=None)

Add UPOS and MISC

In [ ]:
from supar import Parser
from pythainlp.tokenize import word_tokenize
from pythainlp.tag import pos_tag
prs = Parser.load("my.supar")
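def nlp(sentence):
  s = word_tokenize(sentence)  # tokens, including whitespace tokens
  # Parse only the non-whitespace tokens
  d = prs.predict([[t for t in s if not t.isspace()]], lang=None).sentences[0]
  # Column 3 (UPOS): tag the full token list, then drop the whitespace tokens
  d.values[3] = [p for t,p in pos_tag(s, corpus="orchid_ud") if not t.isspace()]
  # m: indices, among the non-whitespace tokens, of tokens followed by whitespace
  m = [i-j-1 for j,i in enumerate([i for i,t in enumerate(s) if t.isspace()])]
  # Column 9 (MISC): every token not followed by whitespace gets SpaceAfter=No
  d.values[9] = ["_" if i in m else "SpaceAfter=No" for i in range(len(s)-len(m))]
  return d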
doc = nlp("ไม่เข้าถ้าเสือ ย่อมไม่ได้ลูกเสือ")
print(doc)
import deplacy
deplacy.serve(doc,port=None)
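
In the cell above, `d.values[3]` and `d.values[9]` correspond to the fourth and tenth CoNLL-U columns, UPOS and MISC. The index arithmetic behind `SpaceAfter=No` is compact, so here is the same computation as a standalone sketch on a plain token list, with no SuPar or PyThaiNLP involved. `space_after_misc` is a hypothetical name and the tokens are placeholders.

```python
def space_after_misc(tokens):
    """Build the MISC column for the non-whitespace tokens in `tokens`."""
    space_positions = [i for i, t in enumerate(tokens) if t.isspace()]
    # Once the j earlier whitespace tokens are removed, the token just before
    # whitespace position i has index i - j - 1 in the filtered list.
    followed_by_space = {i - j - 1 for j, i in enumerate(space_positions)}
    n_words = len(tokens) - len(space_positions)
    return ["_" if k in followed_by_space else "SpaceAfter=No"
            for k in range(n_words)]

print(space_after_misc(["a", "b", " ", "c", "d"]))
```

Here only `b` is followed by a space, so it alone gets `_`. As in the notebook's version, the final token is also marked `SpaceAfter=No`, since no whitespace token follows it.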