Dictionary generation for UniDic users

WARNING: The steps in this section take several hours, possibly days, to complete.

Prepare the base dictionary

git clone NEologd

Throughout this section, we use the Docker image we have just built (tdmelodic:latest).

WORKDIR=/path/to/your/work/dir
cd $WORKDIR # move to the working directory
git clone --depth 1 https://github.com/neologd/mecab-unidic-neologd/
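The docker run commands in this section mount $(pwd) into the container, while some paths are written as ${WORKDIR}, so the two should point to the same directory. A minimal sketch of a guard for this follows; in_workdir is a hypothetical helper, not part of tdmelodic.

```shell
# Hypothetical helper: succeed only if the current directory is the given one.
# pwd -P resolves symlinks so that equivalent paths compare equal.
in_workdir() {
    [ "$(pwd -P)" = "$(cd "$1" 2>/dev/null && pwd -P)" ]
}

# Usage (WORKDIR as defined above):
# in_workdir "$WORKDIR" || echo "please cd to $WORKDIR first" >&2
```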

Extract the NEologd vocabulary file and apply a patch

First, extract the CSV file of the NEologd dictionary with the unxz command.

# if your system has the unxz command
unxz -k `ls mecab-unidic-neologd/seed/*.xz | tail -n 1`
# otherwise
docker run -v $(pwd):/root/workspace tdmelodic:latest \
    unxz -k `ls mecab-unidic-neologd/seed/*.xz | tail -n 1`
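The `ls ... | tail -n 1` idiom above picks the last file name in alphabetical order. Assuming the archive names differ only in their yyyymmdd date stamp, that is also the newest one. The same selection can be sketched as a small function, which makes the chosen file easy to inspect before extraction; latest_seed is a hypothetical name used only for illustration.

```shell
# Hypothetical helper: print the newest .xz seed archive in a directory.
# Assumes the names differ only in a yyyymmdd stamp, so that
# alphabetical order matches date order.
latest_seed() {
    # ls sorts names alphabetically; tail keeps the last one
    ls "$1"/*.xz 2>/dev/null | tail -n 1
}

# Usage (directory as used in this section):
# SEED_XZ=$(latest_seed mecab-unidic-neologd/seed)
# echo "extracting $SEED_XZ"
# unxz -k "$SEED_XZ"
```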

This generates a CSV file named mecab-unidic-user-dict-seed.yyyymmdd.csv, where yyyymmdd is a date stamp. Next, apply the patch to the NEologd dictionary we have just extracted, as follows. This creates the dictionary file neologd_modified.csv in the /tmp directory of the Docker container.

docker run -v $(pwd):/root/workspace tdmelodic:latest \
    tdmelodic-neologd-patch \
    --input `ls mecab-unidic-neologd/seed/mecab-unidic-user-dict-seed*.csv | tail -n 1` \
    --output /tmp/neologd_modified.csv

Inference

WARNING: This step takes a long time. (For reference, it took about 2.5 hours on a MacBook Pro and about 5 hours on our Linux server.)

Now let's generate the accent dictionary. This step estimates the accents of the words listed in the NEologd dictionary using a machine-learning-based technique.

docker run -v $(pwd):/root/workspace tdmelodic:latest \
    tdmelodic-convert \
    --input /tmp/neologd_modified.csv \
    --output ${WORKDIR}/tdmelodic_original.csv
cp ${WORKDIR}/tdmelodic_original.csv ${WORKDIR}/tdmelodic.csv # keep the original as a backup and work on the copy
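Because inference takes hours, it may be worth confirming that the output file exists and is non-empty before moving on. The following is a minimal sketch; check_nonempty is a hypothetical helper, not a tdmelodic command.

```shell
# Hypothetical helper: succeed only if the file exists and is non-empty.
check_nonempty() {
    if [ -s "$1" ]; then
        echo "OK: $1"
    else
        echo "missing or empty: $1" >&2
        return 1
    fi
}

# Usage (path as used in this section):
# check_nonempty "${WORKDIR}/tdmelodic_original.csv"
```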

Postprocess

Unigram costs can be fixed using the following script.

cp ${WORKDIR}/tdmelodic.csv ${WORKDIR}/tdmelodic.csv.bak
docker run -v $(pwd):/root/workspace tdmelodic:latest \
    tdmelodic-modify-unigram-cost \
    -i tdmelodic.csv.bak \
    -o tdmelodic.csv
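To get a rough idea of how much the postprocessing changed, the backup and the new file can be compared line by line. The sketch below only counts lines of the new file that are absent from or different in the old one; it does not interpret the CSV columns. count_changed_lines is a hypothetical helper introduced for illustration.

```shell
# Hypothetical helper: count lines of the second file that differ from,
# or are missing in, the first file.
count_changed_lines() {
    # diff marks lines coming from the second file with "> ";
    # grep -c exits non-zero when the count is 0, hence "|| true"
    diff "$1" "$2" | grep -c '^> ' || true
}

# Usage (file names as used in this section):
# count_changed_lines tdmelodic.csv.bak tdmelodic.csv
```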