# [Classical Chinese Information Processing with Universal Dependencies and BERT/RoBERTa Models](http://www.han-character.education/home/index.php/2022/11/18/20221118/)

## Fill-in-the-blank game with Classical Chinese RoBERTa

In [ ]:
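!pip install transformers
from transformers import pipeline
# Load a fill-mask pipeline backed by the roberta-classical-chinese-base-char model
fmp=pipeline("fill-mask","KoichiYasuoka/roberta-classical-chinese-base-char")
# Ask the model for candidates to fill the masked position
prd=fmp("勞[MASK]者治人")
# Print each candidate token with its score
print("\n".join("{:8} {:.3f}".format(t["token_str"],t["score"]) for t in prd))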
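To play the same game at every position of a sentence, one masked copy per character can be generated first. The following is a minimal sketch; the helper name `masked_variants` and the sample string are assumptions for illustration and are not part of the original notebook. Each returned string could then be passed to `fmp` above.

```python
def masked_variants(text, mask_token="[MASK]"):
    """Return one copy of `text` per character, with that character masked."""
    return [text[:i] + mask_token + text[i + 1:] for i in range(len(text))]

# Illustrative input (an assumption, not taken from the notebook)
for s in masked_variants("勞心者治人"):
    print(s)
```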

## Part-of-speech tagging and word segmentation by sequence labeling

In [ ]:
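!pip install deplacy transformers
class SeqL(object):
  # Wrap a token-classification pipeline and emit its output as CoNLL-U text
  def __init__(self,bert):
    from transformers import pipeline
    self.tagger=pipeline(task="ner",model=bert)
  def __call__(self,text):
    # (start, end, label) for every token returned by the tagger
    w=[(t["start"],t["end"],t["entity"]) for t in self.tagger(text)]
    u="# text = "+text.replace("\n"," ")+"\n"
    for i,(s,e,p) in enumerate(w,1):
      # m: MISC column ("_" only when a gap precedes the next token); q: label split on "|"
      m,q="_" if i<len(w) and e<w[i][0] else "SpaceAfter=No",p.split("|")
      # f: FEATS column, built from the Name=Value parts of the label
      f="_" if p.find("=")<0 else "|".join(t for t in q if t.find("=")>0)
      # Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
      u+="\t".join([str(i),text[s:e],"_",q[0],"_",f,"_","_","_",m])+"\n"
    return u+"\n"
nlp=SeqL("KoichiYasuoka/roberta-classical-chinese-base-ud-goeswith")
doc=nlp("未聞好學者也")
import deplacy
# Visualize the CoNLL-U output with deplacy
deplacy.serve(doc,port=None)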
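The conversion from tagger output to CoNLL-U can be tried without downloading a model. The sketch below mirrors the loop in `SeqL.__call__` on hand-written input; the function name `spans_to_conllu`, the spans, and the labels are all hypothetical and are not real model output.

```python
def spans_to_conllu(text, spans):
    """Build CoNLL-U text from (start, end, label) triples.

    Each label is assumed to look like "UPOS|Name=Value|...|deprel".
    """
    lines = ["# text = " + text.replace("\n", " ")]
    for i, (s, e, label) in enumerate(spans, 1):
        parts = label.split("|")
        # FEATS: keep only the Name=Value parts, or "_" when there are none
        feats = "|".join(p for p in parts if "=" in p) or "_"
        # A gap before the next token means whitespace followed this token
        gap = i < len(spans) and e < spans[i][0]
        misc = "_" if gap else "SpaceAfter=No"
        lines.append("\t".join(
            [str(i), text[s:e], "_", parts[0], "_", feats, "_", "_", "_", misc]))
    return "\n".join(lines) + "\n\n"

# Hypothetical spans and labels, written by hand for illustration only
demo = spans_to_conllu(
    "好學 者",
    [(0, 1, "VERB|_|amod"), (1, 2, "NOUN|Case=Acc|obj"), (3, 4, "PART|_|root")],
)
print(demo)
```

Only the first label part (UPOS) and the `Name=Value` parts are used here, as in the cell above; the HEAD and DEPREL columns are left as `_`.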

## Deriving adjacency-matrix logits by sequence labeling

In [ ]:
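!pip install transformers
import torch,numpy
from transformers import AutoTokenizer,AutoModelForTokenClassification
brt="KoichiYasuoka/roberta-classical-chinese-base-ud-goeswith"
txt="未聞好學者也"
tkz=AutoTokenizer.from_pretrained(brt)
mdl=AutoModelForTokenClassification.from_pretrained(brt)
v,l=tkz(txt,return_offsets_mapping=True),mdl.config.id2label
# w: token ids, u: surface strings; special tokens (empty offsets) are dropped
w=[t for t,(s,e) in zip(v["input_ids"],v["offset_mapping"]) if s<e]
u=[txt[s:e] for s,e in v["offset_mapping"] if s<e]
cls,msk,sep=tkz.cls_token_id,tkz.mask_token_id,tkz.sep_token_id
# One input per token i: mask token i in place and append its id after [SEP]
x=[[cls]+w[:i]+[msk]+w[i+1:]+[sep,j] for i,j in enumerate(w)]
with torch.no_grad():
  # Keep only the logits at the original token positions: shape (tokens, tokens, labels)
  m=mdl(input_ids=torch.tensor(x)).logits.numpy()[:,1:-2,:]
# NaN marks a forbidden label: on the diagonal only "|root" labels survive,
# off the diagonal "|root" labels are forbidden, and label 0 is forbidden everywhere
r=[1 if i==0 else -1 if l[i].endswith("|root") else 0 for i in range(len(l))]
m+=numpy.where(numpy.add.outer(numpy.identity(m.shape[0]),r)==0,0,numpy.nan)
g=mdl.config.label2id["X|_|goeswith"]
# "goeswith" is allowed only right after the diagonal, and further right only
# while the preceding cell's best label is also "goeswith"
r=numpy.tri(m.shape[0])
for i in range(r.shape[0]):
  for j in range(i+2,r.shape[1]):
    r[i,j]=r[i,j-1] if numpy.nanargmax(m[i,j-1])==g else 1
m[:,:,g]+=numpy.where(r==0,0,numpy.nan)
# d: best remaining logit per cell, p: the id of that label
d,p=numpy.nanmax(m,axis=2),numpy.nanargmax(m,axis=2)
# Header row of tokens, then a row of logits and a row of relation names per token
print(" ".join(x.rjust(12-len(x)) for x in u))
for i,j in enumerate(u):
  print("\n"+" ".join("{:12.3f}".format(x) for x in d[i])," ",j)
  print(" ".join(l[x].split("|")[-1][:12].rjust(12) for x in p[i]))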
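The NaN masking built with `numpy.add.outer` is easier to follow on a toy array. The sketch below applies the same expression to two tokens and a four-entry label list; the label inventory and the all-ones logits are hypothetical stand-ins for the model's `id2label` and output.

```python
import numpy

# Hypothetical label inventory: index 0 is a placeholder label,
# index 2 is a root label, the others are ordinary relation labels
labels = ["-", "NOUN|_|nsubj", "VERB|_|root", "X|_|goeswith"]
n = 2  # number of tokens

# Fake logits of shape (masked position, token position, label)
m = numpy.ones((n, n, len(labels)))

# Same construction as in the cell above: 1 for label 0, -1 for root labels, 0 otherwise
r = [1 if i == 0 else -1 if lab.endswith("|root") else 0
     for i, lab in enumerate(labels)]
# identity[i, j] + r[k] is 0 exactly for the permitted (i, j, k) combinations
m = m + numpy.where(numpy.add.outer(numpy.identity(n), r) == 0, 0, numpy.nan)

allowed = ~numpy.isnan(m)
for i in range(n):
    for j in range(n):
        print(i, j, [lab for lab, ok in zip(labels, allowed[i, j]) if ok])
```

On the diagonal only the root label remains, and off the diagonal only the non-root labels other than index 0 remain, which is why `numpy.nanmax` and `numpy.nanargmax` are used afterwards instead of their plain counterparts.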