Library for Korean morpheme and word vector representation.
For training,
- Python 3
- Tensorflow
- numpy, scipy
- Konlpy (Twitter)
For test and visualization,
- gensim
- sklearn
- matplotlib
In order to learn morpheme vectors, do:
$ python3 train.py <input_corpus>
<input_corpus> format : one sentence = one line
$ python3 train.py -h
usage: train.py [-h] [--embedding_size EMBEDDING_SIZE]
[--window_size WINDOW_SIZE] [--min_count MIN_COUNT]
[--num_sampled NUM_SAMPLED] [--learning_rate LEARNING_RATE]
[--sampling_rate SAMPLING_RATE] [--epochs EPOCHS]
[--batch_size BATCH_SIZE]
input
positional arguments:
input input text file for training: one sentence per line
optional arguments:
-h, --help show this help message and exit
--embedding_size EMBEDDING_SIZE
embedding vector size (default=150)
--window_size WINDOW_SIZE
window size (default=5)
--min_count MIN_COUNT
minimal number of word occurences (default=5)
--num_sampled NUM_SAMPLED
number of negatives sampled (default=50)
--learning_rate LEARNING_RATE
learning rate (default=1.0)
--sampling_rate SAMPLING_RATE
rate for subsampling frequent words (default=0.0001)
--epochs EPOCHS number of epochs (default=3)
--batch_size BATCH_SIZE
batch size (default=150)
$ python3
>>>> from gensim.models.keyedvectors import KeyedVectors
>>>> pos_vectors = KeyedVectors.load_word2vec_format('pos.vec', binary=False)
>>>> pos_vectors.most_similar("('대통령','Noun')")
A word vector is defined by sum of its morphemes' vectors.
$ python3
>>>> from konlpy.tag import Twitter
>>>> import numpy as np
>>>> twitter = Twitter()
>>>> word = "대통령이"
>>>> pos_list = twitter.pos(word, norm=True)
>>>> word_vector = np.sum([pos_vectors.word_vec(str(pos).replace(" ", "")) for pos in pos_list], axis=0)
- Word Similarity Test : Translated WordSim 353 Dataset into Korean. Translation ambiguous words were excluded.
- WordSim 353 Dataset : http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.html
- Word Analogy Test : Created Semantic Pair 420 questions + Syntactic Pair 840 questions.
Word similarity test using kor_ws353.csv.
$ python3 test/similarity_test.py pos.vec
Word analogy test using kor_analogy_semantic.txt.
$ python3 test/analogy_test.py pos.vec
Visualize the learned embeddings on two dimensional space using PCA.
$ python3 test/visualization.py pos.vec --words 밥 밥을 물 물을
Morpheme vectors are trained on Naver news corpus (218M tokens) using our model. You can download pre-trained morpheme vectors here : http://mmlab.snu.ac.kr/~djlee/pos.vec
$ python3
>>>> from gensim.models.keyedvectors import KeyedVectors
>>>> pos_vectors = KeyedVectors.load_word2vec_format('pos.vec', binary=False)
>>>> pos_vectors.most_similar("('대통령','Noun')")
>>>> pos_vectors.most_similar(positive=["('도쿄','Noun')", "('프랑스','Noun')"], negative=["('일본','Noun')"])