# Tokenization and Part-of-Speech Tagging

`hanlp` documentation for the pretrained models used below:
- [tok](https://hanlp.hankcs.com/docs/api/hanlp/pretrained/tok.html)
- [pos](https://hanlp.hankcs.com/docs/api/hanlp/pretrained/pos.html)

In [None]:
!pip install hanlp

Collecting hanlp
  Downloading hanlp-2.1.3-py3-none-any.whl.metadata (13 kB)
Collecting hanlp-common>=0.0.23 (from hanlp)
  Downloading hanlp_common-0.0.23.tar.gz (28 kB)
  Preparing metadata (setup.py) ... done
Collecting hanlp-downloader (from hanlp)
  Downloading hanlp_downloader-0.0.25.tar.gz (13 kB)
  Preparing metadata (setup.py) ... done
Collecting hanlp-trie>=0.0.4 (from hanlp)
  Downloading hanlp_trie-0.0.5.tar.gz (6.7 kB)
  Preparing metadata (setup.py) ... done
Collecting pynvml (from hanlp)
  Downloading pynvml-13.0.1-py3-none-any.whl.metadata (5.6 kB)
Collecting toposort==1.5 (from hanlp)
  Downloading toposort-1.5-py2.py3-none-any.whl.metadata (5.7 kB)
Collecting phrasetree>=0.0.9 (from hanlp-common>=0.0.23->hanlp)
  Downloading phrasetree-0.0.9.tar.gz (42 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.2/42.2 kB 3.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done


In [None]:
import hanlp
tok = hanlp.load(hanlp.pretrained.tok.COARSE_ELECTRA_SMALL_ZH)
pos = hanlp.load(hanlp.pretrained.pos.CTB9_POS_ELECTRA_SMALL)

Downloading https://file.hankcs.com/hanlp/tok/coarse_electra_small_20220616_012050.zip to /root/.hanlp/tok/coarse_electra_small_20220616_012050.zip
Decompressing /root/.hanlp/tok/coarse_electra_small_20220616_012050.zip to /root/.hanlp/tok
Downloading https://file.hankcs.com/hanlp/utils/char_table_20210602_202632.json.zip to /root/.hanlp/utils/char_table_20210602_202632.json.zip
Decompressing /root/.hanlp/utils/char_table_20210602_202632.json.zip to /root/.hanlp/utils
Downloading https://file.hankcs.com/hanlp/transformers/electra_zh_small_20210706_125427.zip to /root/.hanlp/transformers/electra_zh_small_20210706_125427.zip
Decompressing /root/.hanlp/transformers/electra_zh_small_20210706_125427.zip to /root/.hanlp/transformers
Downloading https://file.hankcs.com/hanlp/pos/pos_ctb_electra_small_20220215_111944.zip to /root/.hanlp/pos/pos_ctb_electra_small_20220215_111944.zip
Decompressing /root/.hanlp/pos/pos_ctb_electra_small_20220215_111944.zip to /root/.hanlp/pos


In [None]:
test = [
    '師者，所以傳道、受業、解惑也',
    '故人西辭黃鶴樓，煙花三月下揚州。孤帆遠影碧空盡，惟見長江天際流。',
    '商品和服务。',
    '晓美焰来到北京立方庭参观自然语义科技公司'
]

In [None]:
tok_test = tok(test)
for i in tok_test:
  print(i)

['師', '者', '，', '所以', '傳道', '、', '受業', '、', '解惑', '也']
['故人', '西', '辭', '黃鶴樓', '，', '煙花', '三月', '下', '揚州', '。', '孤', '帆', '遠', '影', '碧空', '盡', '，', '惟', '見', '長江', '天際', '流', '。']
['商品', '和', '服务', '。']
['晓美焰', '来到', '北京立方庭', '参观', '自然语义科技公司']


In [None]:
for i in tok_test:
  print(pos(i))

['NN', 'SP', 'PU', 'AD', 'VV', 'PU', 'VV', 'PU', 'VV', 'SP']
['NN', 'NR', 'VV', 'NR', 'PU', 'NN', 'NT', 'VV', 'NR', 'PU', 'JJ', 'NN', 'JJ', 'NN', 'NN', 'VV', 'PU', 'AD', 'VV', 'NR', 'NN', 'VV', 'PU']
['NN', 'CC', 'NN', 'PU']
['NR', 'VV', 'NR', 'VV', 'NN']
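The tagger returns one tag per token, in the same order as the tokens, so the two lists can be zipped into word/tag pairs. The following is a minimal sketch of that pairing, using the tokens and tags printed above for the last two test sentences, so it runs without loading the models. The helper `pair_tokens` and the `CTB_GLOSS` table are illustrative names, not part of `hanlp`; the glosses are short summaries of the Chinese Treebank tags that appear in the outputs above.

```python
# Tokens and tags copied from the cell outputs above (last two test sentences).
tokens = [
    ['商品', '和', '服务', '。'],
    ['晓美焰', '来到', '北京立方庭', '参观', '自然语义科技公司'],
]
tags = [
    ['NN', 'CC', 'NN', 'PU'],
    ['NR', 'VV', 'NR', 'VV', 'NN'],
]

# Short glosses for the Chinese Treebank tags seen in the outputs above
# (illustrative summary, not an exhaustive tag set).
CTB_GLOSS = {
    'AD': 'adverb',
    'CC': 'coordinating conjunction',
    'JJ': 'noun modifier',
    'NN': 'common noun',
    'NR': 'proper noun',
    'NT': 'temporal noun',
    'PU': 'punctuation',
    'SP': 'sentence-final particle',
    'VV': 'verb',
}


def pair_tokens(sentence_tokens, sentence_tags):
    """Zip one sentence's tokens with its tags; the lengths must match."""
    if len(sentence_tokens) != len(sentence_tags):
        raise ValueError("token and tag counts differ")
    return list(zip(sentence_tokens, sentence_tags))


tagged = [pair_tokens(t, p) for t, p in zip(tokens, tags)]
for sentence in tagged:
    # Print each sentence in the common word/TAG notation.
    print(' '.join(f'{word}/{tag}' for word, tag in sentence))
```

With live models, the same helper could be applied to `tok_test` and the per-sentence results of `pos(...)` instead of the hard-coded lists.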


# Text-to-speech

In [None]:
!pip install edge-tts

Collecting edge-tts
  Downloading edge_tts-7.2.7-py3-none-any.whl.metadata (5.5 kB)
Downloading edge_tts-7.2.7-py3-none-any.whl (30 kB)
Installing collected packages: edge-tts
Successfully installed edge-tts-7.2.7


In [None]:
test = [
    '師者，所以傳道、受業、解惑也',
    '故人西辭黃鶴樓，煙花三月下揚州。孤帆遠影碧空盡，惟見長江天際流。',
    '商品和服务。',
    '晓美焰来到北京立方庭参观自然语义科技公司'
]

In [None]:
import edge_tts
import os

# Create the output directory if it doesn't exist
output_dir = "audio_out"
os.makedirs(output_dir, exist_ok=True)

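# Synthesize each sentence to its own MP3 file.
# Top-level `await` works here because Jupyter/IPython runs cells inside an event loop.
for i, text in enumerate(test):
    communicate = edge_tts.Communicate(text, "zh-CN-YunjianNeural")
    await communicate.save(os.path.join(output_dir, f"test{i}.mp3"))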
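The loop above relies on the notebook's support for top-level `await`. In a plain Python script there is no running event loop, so the same calls need to be wrapped in a coroutine and driven with `asyncio.run`. The following is a minimal sketch under that assumption; `output_paths` and `synthesize_all` are hypothetical helper names, and only `edge_tts.Communicate(...)` and `.save(...)` come from the cell above.

```python
import asyncio
import os


def output_paths(texts, output_dir):
    """Return one MP3 path per input text: test0.mp3, test1.mp3, ..."""
    return [os.path.join(output_dir, f"test{i}.mp3") for i in range(len(texts))]


async def synthesize_all(texts, voice, output_dir):
    """Save each text as an MP3, using the same edge_tts calls as the notebook cell."""
    # Imported lazily so the path helper can be used without the package installed.
    import edge_tts

    os.makedirs(output_dir, exist_ok=True)
    paths = output_paths(texts, output_dir)
    for text, path in zip(texts, paths):
        await edge_tts.Communicate(text, voice).save(path)
    return paths


# In a standalone script (requires network access), drive the coroutine like this:
# asyncio.run(synthesize_all(test, "zh-CN-YunjianNeural", output_dir))
```

Keeping the path construction in its own function means the playback cell can reuse the same file names instead of rebuilding them.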

In [None]:
from IPython.display import Audio, display

for i in range(len(test)):
    sound_file = os.path.join(output_dir, f"test{i}.mp3")

    # Display the audio player in the output cell
    display(Audio(sound_file, autoplay=False))