Dictionary generation for UniDic users

WARNING: The steps in this section can take several hours, or even days, to run.

Prepare the base dictionary

git clone NEologd

First, download the NEologd dictionary as follows.

WORKDIR=/path/to/your/work/dir
cd $WORKDIR # move to the working directory
git clone --depth 1 https://github.com/neologd/mecab-unidic-neologd/

Extract the NEologd vocabulary file and apply a patch

Then, extract the CSV file of the NEologd dictionary using the unxz command.

# if your system has the unxz command
unxz -k `ls mecab-unidic-neologd/seed/*.xz | tail -n 1`
# otherwise
docker run --rm -v $(pwd):/root/workspace tdmelodic:latest \
    unxz -k `ls mecab-unidic-neologd/seed/*.xz | tail -n 1`

This generates a CSV file named mecab-unidic-user-dict-seed.yyyymmdd.csv. Next, apply the patch to the NEologd dictionary we have just extracted, as follows. This creates a dictionary file, neologd_modified.csv, in the working directory.

docker run --rm -v $(pwd):/root/workspace tdmelodic:latest \
    tdmelodic-neologd-preprocess \
    --input `ls mecab-unidic-neologd/seed/mecab-unidic-user-dict-seed*.csv | tail -n 1` \
    --output neologd_modified.csv \
    --no-rmdups --no-rm_wrong_yomi

The --no-rmdups and --no-rm_wrong_yomi options control whether certain words are removed. The available options can be listed with the following command.

docker run --rm tdmelodic:latest tdmelodic-neologd-preprocess -h
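
The commands above pick the newest seed file with `ls ... | tail -n 1`, which relies on the yyyymmdd date in the file name sorting chronologically. As a minimal sketch, the same selection can be done with a shell glob, which also copes with spaces in the path. The function name `latest_file` is hypothetical and not part of tdmelodic or NEologd.

```shell
# latest_file DIR SUFFIX (hypothetical helper, not part of tdmelodic)
# Print the last file in DIR whose name ends with SUFFIX.
# Glob results are sorted, so a yyyymmdd date in the name puts the newest file last.
latest_file() {
    latest=""
    for f in "$1"/*"$2"; do
        # If nothing matches, the pattern is left unexpanded, so check that the file exists.
        [ -e "$f" ] && latest=$f
    done
    # Return a non-zero status when no file was found.
    [ -n "$latest" ] && printf '%s\n' "$latest"
}
```

For example, `unxz -k "$(latest_file mecab-unidic-neologd/seed .xz)"` would extract the newest archive, assuming the seed directory layout shown above.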

Inference

WARNING: This step takes a long time. (FYI: it took about 2.5 hours on a MacBook Pro and 5 hours on our Linux server.)

Now let's generate the accent dictionary. This step estimates the accents of the words listed in the NEologd dictionary using a machine-learning-based technique.

docker run --rm -v $(pwd):/root/workspace tdmelodic:latest \
    tdmelodic-convert \
    -m unidic \
    --input neologd_modified.csv \
    --output tdmelodic_original.csv
cp ${WORKDIR}/tdmelodic_original.csv ${WORKDIR}/tdmelodic.csv # backup
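
Because this step runs for hours, it may be worth confirming that the output file was actually written before moving on. The following is a minimal sketch of such a check; `check_output` is a hypothetical helper, not a tdmelodic command, and it only tests that the file exists and is non-empty.

```shell
# check_output FILE (hypothetical helper, not part of tdmelodic)
# Fail if FILE is missing or empty; otherwise report its line count.
check_output() {
    if [ ! -s "$1" ]; then
        echo "missing or empty: $1" >&2
        return 1
    fi
    # tr strips the padding that some wc implementations add around the count.
    lines=$(wc -l < "$1" | tr -d ' ')
    echo "$1: $lines lines"
}
```

It could be called as `check_output tdmelodic_original.csv` from the working directory.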

Postprocess

Unigram costs can be fixed using the following script.

cp ${WORKDIR}/tdmelodic.csv ${WORKDIR}/tdmelodic.csv.bak
docker run --rm -v $(pwd):/root/workspace tdmelodic:latest \
    tdmelodic-modify-unigram-cost \
    -i tdmelodic.csv.bak \
    -o tdmelodic.csv
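
To see how many entries the cost fix touched, the backup and the new file can be compared line by line. This is a rough sketch using the standard `diff` and `grep` tools; `count_changed_lines` is a hypothetical name, and the count covers any difference between the two files, not only the cost column.

```shell
# count_changed_lines OLD NEW (hypothetical helper, not part of tdmelodic)
# Print the number of lines in OLD that were changed or removed in NEW.
count_changed_lines() {
    # In diff's default output, each affected line of OLD is printed with a "<" prefix.
    # grep -c exits non-zero when the count is 0, so "|| true" keeps the status clean.
    diff "$1" "$2" | grep -c '^<' || true
}
```

For example, `count_changed_lines tdmelodic.csv.bak tdmelodic.csv` prints the number of modified entries.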