WARNING: The procedure described in this section takes several hours or days to complete.
First, download the NEologd dictionary as follows.
```shell
WORKDIR=/path/to/your/work/dir
cd $WORKDIR # move to the working directory
git clone --depth 1 https://github.com/neologd/mecab-unidic-neologd/
```
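The steps below select the newest seed file with ls ... | tail -n 1. This works because the yyyymmdd date embedded in the file name makes alphabetical order coincide with chronological order. A minimal sketch with hypothetical file names (the dates are made up for illustration):

```shell
# Create two hypothetical seed files with made-up dates.
mkdir -p demo_seed
touch demo_seed/mecab-unidic-user-dict-seed.20200101.csv.xz
touch demo_seed/mecab-unidic-user-dict-seed.20200910.csv.xz
# ls sorts alphabetically, so tail -n 1 yields the newest (20200910) file.
ls demo_seed/*.xz | tail -n 1
rm -r demo_seed
```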
Then, extract the CSV file of the NEologd dictionary using the unxz command.
```shell
# if your system has the unxz command
unxz -k `ls mecab-unidic-neologd/seed/*.xz | tail -n 1`

# otherwise, use the Docker image
docker run --rm -v $(pwd):/root/workspace tdmelodic:latest \
    unxz -k `ls mecab-unidic-neologd/seed/*.xz | tail -n 1`
```
This will generate a CSV file named mecab-unidic-user-dict-seed.yyyymmdd.csv.
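At this point it may be worth a quick sanity check that the extracted CSV looks plausible, e.g. by counting its entries and peeking at the first record. The sketch below uses a mock file; substitute the real mecab-unidic-user-dict-seed.yyyymmdd.csv produced above (the field contents here are made up):

```shell
# Mock stand-in for the extracted seed CSV (contents are illustrative only).
printf '%s\n' 'entry1,1,1,100,...' 'entry2,2,2,200,...' > mock-seed.csv
wc -l < mock-seed.csv    # number of dictionary entries
head -n 1 mock-seed.csv  # peek at the first record
rm mock-seed.csv
```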
Then, apply the patch to the NEologd dictionary we have just extracted, as follows. This creates a dictionary file neologd_modified.csv in the working directory.
```shell
docker run --rm -v $(pwd):/root/workspace tdmelodic:latest \
    tdmelodic-neologd-preprocess \
        --input `ls mecab-unidic-neologd/seed/mecab-unidic-user-dict-seed*.csv | tail -n 1` \
        --output neologd_modified.csv \
        --no-rmdups --no-rm_wrong_yomi
```
The --no-rmdups and --no-rm_wrong_yomi flags control whether certain words (duplicate entries and entries with wrong readings, respectively) are removed. These and other options are listed by the following command.
```shell
docker run --rm tdmelodic:latest tdmelodic-neologd-preprocess -h
```
WARNING: this step takes a long time. (FYI: it took about 2.5 hours on a MacBook Pro and 5 hours on our Linux server.)
Now let us generate the accent dictionary. This step estimates the accents of the words listed in the NEologd dictionary with a machine-learning-based technique.
```shell
docker run --rm -v $(pwd):/root/workspace tdmelodic:latest \
    tdmelodic-convert \
        -m unidic \
        --input neologd_modified.csv \
        --output tdmelodic_original.csv

cp ${WORKDIR}/tdmelodic_original.csv ${WORKDIR}/tdmelodic.csv # backup
```
Unigram costs can be fixed using the following script.
```shell
cp ${WORKDIR}/tdmelodic.csv ${WORKDIR}/tdmelodic.csv.bak
docker run --rm -v $(pwd):/root/workspace tdmelodic:latest \
    tdmelodic-modify-unigram-cost \
        -i tdmelodic.csv.bak \
        -o tdmelodic.csv
```
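This last step follows a backup-then-overwrite pattern: the dictionary is copied to a .bak file and the adjusted version is written back under the original name, so the before and after states can be compared directly. A sketch with mock data (the file contents, and the second printf that stands in for tdmelodic-modify-unigram-cost, are hypothetical):

```shell
printf 'word,3000\n' > mock.csv
cp mock.csv mock.csv.bak            # backup, like tdmelodic.csv.bak above
printf 'word,1500\n' > mock.csv     # stand-in for the cost adjustment
diff mock.csv.bak mock.csv || true  # inspect what changed
rm mock.csv mock.csv.bak
```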